Data Integration and Metadata Management

Data on the Web is not only unstructured and of diverse types, but also often comes without metadata that data retrieval schemes could use. This complicates querying, in particular for (distributed) scientific data sources that contain huge amounts of image, text, and raw experiment data and exhibit no database-like schema structures. Over the last few years we have developed and extended a data annotation model that allows users to associate semantically rich metadata with (remote) Web data at different levels of granularity: whole documents or fragments of documents (regions of interest). The metadata schemes underlying annotations are based on conceptual structures such as ontologies and standard vocabularies, which ensures that only well-defined metadata can be associated with data. Conceptual structures in combination with data annotations allow users to query heterogeneous data sources in a uniform and integrated fashion at an abstract, conceptual level. Since the initial proposal of conceptualized data annotations, we have made several contributions in the area of metadata management and data integration.
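
As a rough illustration of the idea that only well-defined metadata can be attached to data, the following minimal Python sketch models an annotation that refers either to a whole document or to a region of interest. The names `VOCABULARY`, `Annotation`, and `region`, the example concepts, and the URLs are assumptions made for this illustration; they are not taken from the actual annotation model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical controlled vocabulary; a real system would load an ontology.
VOCABULARY = {"neuron", "cortex", "thalamus"}


@dataclass(frozen=True)
class Annotation:
    """Associates one concept with a Web document or a fragment of it."""
    url: str                      # location of the (remote) document
    concept: str                  # term taken from the vocabulary
    # Bounding box (x, y, width, height) of a region of interest;
    # None means the annotation refers to the whole document.
    region: Optional[Tuple[int, int, int, int]] = None

    def __post_init__(self) -> None:
        # Reject metadata that is not defined in the vocabulary.
        if self.concept not in VOCABULARY:
            raise ValueError(f"unknown concept: {self.concept!r}")


# Annotation of a whole document and of a region within an image.
whole = Annotation("http://example.org/report.txt", "thalamus")
part = Annotation("http://example.org/slice.png", "cortex", (10, 20, 64, 64))
```

Validating at construction time is one simple way to keep free-form tags out of the annotation store; a concept missing from the vocabulary raises a `ValueError` instead of being stored.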


Integrating Scientific Data
With faculty from the Department of Computer Science and the Center for Neuroscience at UC Davis, we are working on the development of architectures and models for integrating, managing, and querying heterogeneous forms of Neuroscience data in collaborative research environments. The integration approach utilizes a so-called annotation graph model, which is based on representing and querying graph structures. The model has proven extremely useful for presenting, managing, and querying metadata schemes, data annotations, and Web-accessible documents in a uniform and transparent manner. While some work on ontologies and metadata focuses simply on associating concepts with data, our approach also checks that metadata (schemes) are used consistently in data annotations.
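
The following Python sketch illustrates, under simplifying assumptions, how concepts, annotations, and documents might be kept in one labeled graph, queried at the conceptual level, and checked for consistent use of the metadata scheme. The class `AnnotationGraph`, the edge labels (`has_subconcept`, `used_by`, `uses`, `annotates`), and the example data are hypothetical and do not reproduce the project's actual model.

```python
from collections import defaultdict


class AnnotationGraph:
    """Concepts, annotations, and documents as nodes of one labeled graph."""

    def __init__(self):
        self.concepts = set()           # concepts declared in the scheme
        self.edges = defaultdict(set)   # (source, label) -> set of targets

    def add_concept(self, name, parent=None):
        self.concepts.add(name)
        if parent is not None:
            self.edges[(parent, "has_subconcept")].add(name)

    def add_annotation(self, ann_id, concept, document):
        self.edges[(concept, "used_by")].add(ann_id)
        self.edges[(ann_id, "uses")].add(concept)
        self.edges[(ann_id, "annotates")].add(document)

    def targets(self, source, label):
        # .get avoids creating empty entries in the defaultdict.
        return self.edges.get((source, label), set())

    def subconcepts(self, concept):
        """Return the concept and everything below it in the hierarchy."""
        seen, stack = set(), [concept]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.targets(current, "has_subconcept"))
        return seen

    def documents_for(self, concept):
        """Documents annotated with the concept or any of its subconcepts."""
        docs = set()
        for c in self.subconcepts(concept):
            for ann in self.targets(c, "used_by"):
                docs |= self.targets(ann, "annotates")
        return docs

    def inconsistent_annotations(self):
        """Annotations that use a concept the scheme does not declare."""
        return {
            source
            for (source, label), used in self.edges.items()
            if label == "uses" and not used <= self.concepts
        }


g = AnnotationGraph()
g.add_concept("brain_region")
g.add_concept("cortex", parent="brain_region")
g.add_annotation("a1", "cortex", "http://example.org/slice1.png")
g.add_annotation("a2", "brain_region", "http://example.org/notes.txt")
g.add_annotation("a3", "cortx", "http://example.org/slice2.png")  # misspelled

print(sorted(g.documents_for("brain_region")))
print(g.inconsistent_annotations())
```

Because scheme, annotations, and documents share one graph, a query for `brain_region` also reaches documents annotated with the more specific `cortex`, and the consistency check flags `a3`, whose concept is not part of the scheme.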

Personnel:
Michael Gertz (Computer Science)
Jan-Marco Bremer (Ph.D. student, Computer Science)
Cheryl Kang (M.S. student, Computer Science)
Mike Hogarth (School of Medicine and Graduate Group of Medical Informatics)
Fredric Gorin (Center for Neuroscience)

Funding:
Human Brain Project: "Informatics of Human and Monkey Brain Atlases", (PI Edward G. Jones, Center for Neuroscience) at a level of about $7,000,000 for 5 years

Publications:

  • Mike Hogarth, Michael Gertz, Fred Gorin: Terminology Query Language: A Server Interface for Concept Oriented Terminology Systems. In American Medical Informatics Association (AMIA) Annual Symposium On Health Care Informatics, 2000. [.ps] [.pdf]

  • Fred Gorin, Mike Hogarth, Michael Gertz: The Challenges and Rewards of Integrating Diverse Neuroscience Information. To appear in The Neuroscientist, Sage Publications.

  • Marco Bremer, Michael Gertz: Web Data Indexing Through External Semantic-carrying Annotations. In 11th IEEE International Workshop on Research Issues in Data Engineering: Document Management for Data Intensive Business and Scientific Applications (RIDE 2001), 69-76, IEEE Computer Society, 2001.

  • Michael Gertz, Kai-Uwe Sattler: A Model and Architecture for Conceptualized Data Annotations. Technical Report CSE-2001-11, Department of Computer Science, University of California, Davis, December 2001. [.pdf] [.ps]

  • Michael Gertz, Kai-Uwe Sattler, Fred Gorin, Michael Hogarth, Jim Stone: Annotating Scientific Images: A Concept-based Approach. In 14th International Conference on Scientific and Statistical Database Management, IEEE Computer Society, 2002. [.pdf]

  • Michael Gertz, Kai-Uwe Sattler: Integrating Scientific Data through External, Concept-based Annotations. In Second International Workshop Data Integration over the Web, 87-102, 2002.