Tuesday, May 26, 2015

The Semi-structured Data Model

Consider a set of documents on the Web that contain hyperlinks to other documents. These documents, although not completely unstructured, cannot be modelled naturally in the relational data model because the pattern of hyperlinks is not regular across documents.

While some data is completely unstructured -- for example video streams, audio streams, and image data -- a lot of data is neither completely unstructured nor completely structured. We refer to data with partial structure as semi-structured data.


There are many reasons why data might be semi-structured. First, the structure of the data might be implicit, hidden, or unknown, or the user might choose to ignore it. Second, when integrating data from several heterogeneous sources, data exchange and transformation are important problems, and a rigidly structured model is often too inflexible to hold data from all of the sources. Third, we cannot query a structured database without knowing the schema, but sometimes we want to query the data without full knowledge of the schema.


All data models proposed for semi-structured data represent the data as some kind of labelled graph. Nodes in the graph correspond to compound objects or atomic values, and edges correspond to attributes.
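
To make the labelled-graph idea concrete, here is a minimal Python sketch. It is only an illustration, not part of any particular proposed model: the node ids (`doc1`, `a1`, ...), the `Ref` helper, and the `follow` and `labels_of` functions are all hypothetical names chosen for this example. Each compound node maps to a list of (label, target) edges, and a target is either an atomic value or a reference to another compound node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Edge target that points at another compound node (hypothetical helper)."""
    node_id: str


# Hypothetical data: each compound node has a list of (edge label, target) pairs.
# A target is either an atomic value (a plain string here) or a Ref to a node.
graph = {
    "doc1": [("title", "Intro"), ("link", Ref("doc2")), ("link", Ref("doc3"))],
    "doc2": [("title", "OEM"), ("author", Ref("a1"))],
    "doc3": [("title", "Notes")],
    "a1": [("lastName", "Feynman")],
}


def follow(node_id, label):
    """Return the targets of every edge with the given label leaving node_id."""
    return [target for edge_label, target in graph[node_id] if edge_label == label]


def labels_of(node_id):
    """Return the distinct edge labels on a node, i.e. the attributes it happens to have."""
    return sorted({edge_label for edge_label, _ in graph[node_id]})
```

Note that the nodes do not share a fixed set of attributes: `doc1` has two `link` edges, `doc2` has an `author` edge, and `doc3` has neither. This irregularity is exactly what is awkward to capture with a single relational schema.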

We now discuss one of the proposed data models for semi-structured data, called the object exchange model (OEM). Each object is described by a triple consisting of a label, a type, and the value of the object. The label can be thought of as the column name in the relational model, and the type as the column type. Because every object carries this information with it, the object exchange model is essentially self-describing. Labels should be as informative as possible, since they serve two purposes: they identify an object, and they convey its meaning. For example, we can represent the last name of an author as follows: <lastName, string, "Feynman">
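
The triple above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the `OEMObject` class and the `find_by_label` function are hypothetical names, the `firstName` object and the enclosing `author` object are made up for the example, and the use of the type name `"set"` for a compound object whose value is a collection of other objects is an assumption of this sketch rather than something stated above.

```python
from typing import Any, NamedTuple


class OEMObject(NamedTuple):
    """One OEM object: a (label, type, value) triple."""
    label: str
    type: str
    value: Any


# The example from the text: <lastName, string, "Feynman">
last_name = OEMObject("lastName", "string", "Feynman")

# Illustrative compound object; "set" as the type name is an assumption of this sketch.
first_name = OEMObject("firstName", "string", "Richard")
author = OEMObject("author", "set", [first_name, last_name])


def find_by_label(obj, label):
    """Collect every object with the given label, searching nested objects too."""
    matches = [obj] if obj.label == label else []
    if obj.type == "set":
        for child in obj.value:
            matches.extend(find_by_label(child, label))
    return matches
```

Because each object carries its own label and type, `find_by_label(author, "lastName")` can locate the last name without knowing in advance how `author` is laid out, which is the kind of schema-free querying mentioned earlier.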
