XML Data Management

Although several of our research projects already use the Extensible Markup Language (XML) as the standard representation and exchange format for metadata management and data integration, many open problems remain in building, managing, and querying large-scale repositories of XML data. We are working in several areas of XML data management to address these issues; some of these projects started only in Spring 2002.

XQuery and Information Retrieval
One problem we faced is that, although several sophisticated XML query languages exist, none of them sufficiently supports a document-centric view of XML, in which documents consist mainly of text embedded in XML element structures. Retrieval schemes that support conditions on text therefore become as important as query schemes that focus on path patterns. In a WebDB 2002 paper, Jan-Marco Bremer and Michael Gertz propose an extension of the XML query language XQuery with a powerful information retrieval component, dubbed XQuery/IR, that provides a well-defined and easy-to-use model for integrating XML data and document retrieval through dynamic ranking of document fragments. This is the first work that not only gives well-defined semantics for such an information retrieval operator in XQuery, but also outlines a complete framework for its realization. We are currently completing a journal paper that details the full implementation of the new operator in XQuery, with a particular focus on XQuery optimization schemes and on space- and access-efficient index structures that support full-text indexing of XML documents. A first prototype is currently used in the context of the Human Brain Project, in which text-rich XML documents are integrated into an XML document repository using the above document conversion approach.
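To make the idea of combining a path condition with ranked text retrieval concrete, here is a minimal Python sketch. It is not the XQuery/IR operator or its semantics: the function name rank_fragments, the sample document, and the simple TF-IDF scoring are all assumptions chosen for illustration. It selects fragments by a path expression and then ranks them by how well their text matches the query terms.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample repository, for illustration only.
SAMPLE = """
<repository>
  <article id="a1">
    <section>Neurons in the cortex fire in response to stimuli.</section>
  </article>
  <article id="a2">
    <section>XML indexing for text retrieval.</section>
    <section>Cortex imaging data and cortex atlases.</section>
  </article>
</repository>
"""


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def rank_fragments(xml_text, path, query_terms, top_k=3):
    """Select fragments by path, then rank them by a simple TF-IDF score.

    Assumption: TF-IDF stands in for whatever ranking model a real
    system would use; fragments with a zero score are dropped.
    """
    root = ET.fromstring(xml_text)
    fragments = root.findall(path)  # structural (path) condition
    # Term frequencies of the full text content of each fragment.
    term_counts = [Counter(tokenize("".join(f.itertext()))) for f in fragments]
    n = len(term_counts)

    scored = []
    for frag, tf in zip(fragments, term_counts):
        score = 0.0
        for term in query_terms:
            df = sum(1 for counts in term_counts if term in counts)
            if df == 0 or tf[term] == 0:
                continue
            score += tf[term] * math.log(1 + n / df)
        if score > 0:
            scored.append((score, frag))

    # Sort on the score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for score, frag in rank_fragments(SAMPLE, ".//section", ["cortex"]):
        print(round(score, 3), frag.text)
```

In this sketch the ranking is computed at query time over exactly the fragments the path selects, which is one simple reading of "dynamic ranking of document fragments"; a real system would rely on index structures rather than scanning every fragment.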

Personnel:
Michael Gertz (Computer Science)
Jan-Marco Bremer (Ph.D. student, Computer Science)

Funding:
In the context of the Human Brain Project (see the Data Integration and Metadata Management Research Web page).

Publications:

  • Jan-Marco Bremer, Michael Gertz: Query Processing and Index Structures for Integrated XML Document and Data Retrieval. Technical Report CSE-2002-22, June 2002. [.ps] [.pdf]

  • Jan-Marco Bremer, Michael Gertz: XQuery/IR: Integrating XML Document and Data Retrieval. In Fifth International Workshop on the Web and Databases, Madison, Wisconsin, June 6-7, pp. 1-6, 2002. [.ps] [.pdf]