MongoDB Data Models

When creating MongoDB data models, besides knowing the internal details of how the MongoDB database engine works, there are a few other factors that should be considered first:

  • How will your data grow and change over time?
  • What is the read/write ratio?
  • What kinds of queries will your application perform?
  • Are there any concurrency-related constraints you should look at?

These factors strongly influence what type of model you should create. There are several types of MongoDB models you can create:

  • Embedding Model
  • Referencing Model
  • Hybrid Model that combines embedding and referencing models.

There are also other factors that can affect your decision regarding the type of model to create. These are mostly operational factors and they are documented in Data Modeling Considerations for MongoDB Applications.

The key question is:

  • should you embed related objects within one another or
  • should you reference them by their identifier (ID)?

You will need to weigh the performance, complexity, and flexibility of your solution in order to come up with the most appropriate model.

Embedding Model (De-normalization)

The embedding model enables de-normalization of data, which means that two or more related pieces of data are stored in a single document. Generally, embedding provides better read performance since data can be retrieved in a single database operation. In other words, embedding supports locality. If your application frequently accesses related data objects, the best performance can be achieved by putting them in a single document, which is what the embedding model does.
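
For example, here is a minimal embedding sketch in MongoDB shell syntax. The post collection, its fields, and the data are hypothetical, purely for illustration:

// Comments are embedded inside the post document, so one read returns everything.
db.post.insert({
    title:  "MongoDB Data Models",
    author: "admin",
    comments: [
        { author: "test", text: "Nice overview.", created: new Date() },
        { author: "jane", text: "Very helpful.",  created: new Date() }
    ]
});

// A single database operation retrieves the post together with its comments.
db.post.findOne({ title: "MongoDB Data Models" });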

MongoDB provides atomic operations on a single document only. If fields of a document have to be modified together, all of them have to be embedded in a single document in order to guarantee atomicity. MongoDB does not support multi-document transactions. Distributed transactions and distributed join operations are two main challenges associated with distributed database design; by not supporting these features, MongoDB has been able to implement a highly scalable and efficient sharding solution.

Embedding also has its disadvantages. If we keep embedding related data in documents, or constantly update this data, the document size may grow after the document's creation, which can lead to data fragmentation. At the same time, the size limit for documents in MongoDB is determined by the maximum BSON document size, which is 16 MB. For larger documents, you have to consider using GridFS.

On the other hand, the larger the documents, the fewer of them fit in RAM, and the more likely the server will have to page fault to retrieve them. Page faults lead to random disk I/O that can significantly slow down the system.

Referencing Model (Normalization)

The referencing model enables normalization of data by storing references between two documents to indicate a relationship between the data stored in each document. Generally, referencing should be used:

  • when embedding would result in extensive data duplication and/or data fragmentation (increased storage usage that can also lead to reaching the maximum document size) with minimal performance advantages or even negative performance implications;
  • to increase flexibility in performing queries, if your application queries data in many different ways or you do not know in advance the patterns in which data may be queried;
  • to enable many-to-many relationships;
  • to model large hierarchical data sets (e.g., tree structures).

Using referencing requires more roundtrips to the server.
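
As a sketch of the same data under the referencing model (again with hypothetical collections and field names), the comments move to their own collection and carry the _id of their post:

// Each comment references its post by _id.
var postId = ObjectId();
db.post.insert({ _id: postId, title: "MongoDB Data Models", author: "admin" });
db.comment.insert({ postId: postId, author: "test", text: "Nice overview." });

// Two roundtrips: one for the post, one for its comments.
var post = db.post.findOne({ _id: postId });
db.comment.find({ postId: post._id });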

Hybrid Model

The hybrid model is a combination of the embedding and referencing models. It is usually used when neither embedding nor referencing alone is the best choice, but their combination makes the most balanced model.
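
A minimal sketch of a hybrid model, continuing the hypothetical blog example: the few most recently added comments are embedded in the post for fast display, while the full comment history lives in a referenced collection:

// Embed a bounded summary; reference the complete data set.
var postId = ObjectId();
db.post.insert({
    _id: postId,
    title: "MongoDB Data Models",
    recentComments: [ { author: "test", text: "Nice overview." } ]
});
db.comment.insert({ postId: postId, author: "test", text: "Nice overview." });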

Polymorphic Schemas

MongoDB does not enforce a common structure for all documents in a collection. It is therefore possible (though generally not recommended) for documents in a MongoDB collection to have different structures.

However, applications evolve over time, so we have to update the document structures used by their MongoDB collections. This means that at some point documents in the same collection can have different structures, and the application has to take care of it, as sketched below. Eventually you can fully migrate the collection to the latest document structure, which again allows the same application code to manage the whole collection.
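
One common way to handle this (a sketch, not something MongoDB requires; the collection and field names are hypothetical) is to tag each document with a schema version that the application can branch on, or use to migrate documents lazily:

// Two structures of the same logical document, distinguished by schemaVersion.
db.user.insert({ schemaVersion: 1, userName: "admin" });
db.user.insert({ schemaVersion: 2, userName: "test",
                 name: { first: "Test", last: "User" } });

// Lazy migration of old documents to the latest structure.
db.user.find({ schemaVersion: 1 }).forEach(function (u) {
    db.user.update({ _id: u._id },
                   { $set: { schemaVersion: 2,
                             name: { first: u.userName, last: "" } } });
});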

You should also keep in mind that MongoDB's lack of schema enforcement means that document structure details are stored on a per-document basis, which increases storage usage. In particular, you should use reasonably short field names, since field names are repeated in every document and can add noticeably to the overall storage used by a collection.

MongoDB Indexes

MongoDB indexes are based on the B-tree data structure. Properly used, indexes are important elements in maintaining MongoDB performance. On the other hand, indexes have associated costs that include memory usage, disk usage, and slower updates. MongoDB provides an explain-plan capability and a database profiler utility to collect data about database operations, which can be used for database tuning.

Memory

Ideally the entire index should be resident in RAM. If the number of distinct values for the indexed field is high, we have to ensure that the index fits in RAM; otherwise performance will suffer.

The parts of the index related to recently inserted data will always be in active RAM. If you query recent data, the index will perform well and MongoDB will use less memory. This is the case, for example, when the index is based on a time/date field.

Compound Index

Besides single-field indexes you can also create compound indexes containing more than one field. The order of fields in a compound index can significantly impact performance: placing the more selective element of your query first in the compound index will improve performance. At the same time, your other queries may be affected by this choice. This is an example of why you have to analyze your entire application in order to make appropriate design decisions about indexes.

Each index is stored in sorted order on all fields in the index, and the following rules should be followed to provide efficient indexing (see the sketch below):

  • Fields that will be queried by equality should be the first fields in the index.
  • Next should be the fields used to sort query results. If sorting is based on multiple fields, they should occur in the same order in the index definition.
  • The last field in the index definition should be the one queried by range.
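
A sketch of these rules on a hypothetical order collection, using createIndex() (ensureIndex() in older MongoDB releases). The query tests status by equality, sorts on orderDate, and applies a range to total, so the index lists the fields in exactly that order:

db.order.createIndex({ status: 1, orderDate: 1, total: 1 });

db.order.find({ status: "shipped", total: { $gt: 100 } })
        .sort({ orderDate: 1 });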

It is also good to know that an additional benefit of a compound index is that a leading prefix of the index can be used on its own. So if we query with a condition on a single field that is the leading field of an index, the index will be used.

On the other hand, an index will be less efficient if we range-filter and sort on different sets of fields.

MongoDB provides the hint() method to force the use of a specific index.
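
For example, assuming the hypothetical order index created above:

// Force the query to use a specific index instead of the optimizer's choice.
db.order.find({ status: "shipped" }).hint({ status: 1, orderDate: 1, total: 1 });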

Unique Index

You can create a unique index that enforces uniqueness of the indexed field's value. A compound index can also be specified as unique, in which case each combination of index field values has to be unique.
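
A short sketch with hypothetical collections and fields:

// Single-field unique index: no two users may share a userName.
db.user.createIndex({ userName: 1 }, { unique: true });

// Compound unique index: each (groupId, userId) pair may occur only once.
db.membership.createIndex({ groupId: 1, userId: 1 }, { unique: true });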

Fields with No Values and Sparse Index

If an indexed field does not have a value, an index entry with the value null is created. For a unique index, only one document can have a null value for the indexed field, unless the sparse option is specified for the index, in which case index entries are not created for documents that do not have the field.

You should be aware that using a sparse index will sometimes produce incomplete results when index-based operations (e.g., sorting, filtering, etc.) are used, since documents without the indexed field have no index entries.
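
A sketch with a hypothetical field: combining sparse with unique also allows many documents to omit the field, while the values that are present must stay unique:

// Documents without 'nickname' get no index entry, so they will be absent
// from results of operations that are resolved entirely through this index.
db.user.createIndex({ nickname: 1 }, { sparse: true, unique: true });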

Geospatial Index

MongoDB provides geospatial indexes that are used to optimize queries involving locations within a two-dimensional space. When dealing with locations, documents must have a field with a two-element array (longitude followed by latitude) to be indexed with a geospatial index.
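
A minimal sketch with a hypothetical place collection and a legacy 2d index (coordinates given as longitude, then latitude):

db.place.insert({ name: "Office", loc: [ -73.97, 40.77 ] });
db.place.createIndex({ loc: "2d" });

// Return the ten places closest to the given point.
db.place.find({ loc: { $near: [ -73.98, 40.76 ] } }).limit(10);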

Array Index

Fields that contain arrays can also be indexed; in that case each array value is stored as a separate index entry (this is often called a multikey index).
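
For example, with a hypothetical tags field:

db.post.insert({ title: "MongoDB Data Models", tags: [ "mongodb", "nosql" ] });

// Each element of 'tags' becomes its own index entry.
db.post.createIndex({ tags: 1 });

// The index is used to match any single array value.
db.post.find({ tags: "nosql" });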

Create Index Operation

Creation of an index can be either a foreground or a background operation. Foreground builds consume resources intensively and can take a long time in some cases; they are blocking operations in MongoDB. When an index is created via a background operation, more time is needed to build the index, but the database is not blocked and can be used while the index is being created.
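
A sketch of both variants on hypothetical fields (note that recent MongoDB releases have replaced blocking foreground builds, so the background option is ignored there):

db.order.createIndex({ customerId: 1 });                      // foreground (blocking)
db.order.createIndex({ shipDate: 1 }, { background: true });  // background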

AND and OR Query Tips

If you know that a certain criterion in a query will match fewer documents, and that criterion is indexed, make sure it goes first in your query. This selects the smallest number of documents needed to retrieve the data.

OR-style queries are the opposite of AND queries: the most inclusive clauses (those returning the largest number of documents) should go first, since MongoDB has to check every match against the documents that are not yet part of the result set.
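
A sketch of this guideline with hypothetical fields, placing the broad clause before the narrow one:

db.user.find({ $or: [
    { status: "active" },    // inclusive clause: matches most documents
    { userName: "admin" }    // selective clause: matches few documents
] });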

Useful Information

  • The MongoDB optimizer generally uses one index at a time. If more than one predicate is used in a query, a compound index should be created based on the rules previously described.
  • The maximum length of an index name is 128 characters, and an index entry size cannot exceed 1,024 bytes.
  • Index maintenance during add, remove, or update operations slows these operations down. If your application performs heavy updates, you should select indexes carefully.
  • Ideally indexes should reduce the set of possible documents to select from, so it is important to create highly selective indexes. For example, an index based on a phone number is more selective than an index based on a ‘yes/no’ flag.
  • Indexes are not efficient for inequality queries.
  • When regular expressions are used, leading wildcards degrade query performance, because indexes are ordered.
  • Indexes are generally useful when we are retrieving only a small subset of the total data. They usually stop being useful when a query returns half of the data in a collection or more.
  • A query that returns only a few fields should be fully covered by an index (see the sketch after this list).
  • Whenever possible, create a compound index that can be used by multiple queries.
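
A covered-query sketch with hypothetical fields: the index contains every field the query filters on and returns, so MongoDB can answer it from the index alone:

db.user.createIndex({ userName: 1, email: 1 });

// _id must be excluded from the projection, otherwise the document itself
// has to be fetched and the query is no longer covered.
db.user.find({ userName: "test" }, { _id: 0, userName: 1, email: 1 });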

Tomcat Security Realm with MongoDB

1. User and Role Document Model

Daprota User Model

2. web.xml

 We have two roles, ContentOwner and ServerAdmin. This is how we set up form-based authentication in web.xml:

  …
 <security-constraint>
     <web-resource-collection>
         <url-pattern>/*</url-pattern>
     </web-resource-collection>
     <auth-constraint>
         <role-name>ServerAdmin</role-name>
         <role-name>ContentOwner</role-name>
     </auth-constraint>
 </security-constraint>

 <login-config>
     <auth-method>FORM</auth-method>
     <realm-name>MongoDBRealm</realm-name>
     <form-login-config>
         <form-login-page>/login.jsp</form-login-page>
         <form-error-page>/login_error.jsp</form-error-page>
     </form-login-config>
 </login-config>

 <!-- Security roles referenced by this web application -->
 <security-role>
     <role-name>ServerAdmin</role-name>
 </security-role>
 <security-role>
     <role-name>ContentOwner</role-name>
 </security-role>

3. Create passwords for admin and test users

($CATALINA_HOME/bin/digest.[bat|sh] -a {algorithm} {cleartext-password})

os-prompt> digest -a SHA-256 manager
manager: 6ee4a469cd4e91053847f5d3fcb61dbcc91e8f0ef10be7748da4c4a1ba382d17

os-prompt> digest -a SHA-256 testpwd
testpwd:a85b6a20813c31a8b1b3f3618da796271c9aa293b3f809873053b21aec501087

Execute this JavaScript code in the MongoDB JS shell:

use mydb
usr = { userName: 'admin',
        // SHA-256 digest of 'manager' generated in step 3
        password: '6ee4a469cd4e91053847f5d3fcb61dbcc91e8f0ef10be7748da4c4a1ba382d17',
        roles: [ { _id: ObjectId(),
                   name: 'ServerAdmin'}
               ]
}
db.user.insert(usr);
usr = { userName: 'test',
        // SHA-256 digest of 'testpwd' generated in step 3
        password: 'a85b6a20813c31a8b1b3f3618da796271c9aa293b3f809873053b21aec501087',
        roles: [ { _id: ObjectId(),
                   name: 'ContentOwner'}
               ]
}
db.user.insert(usr);
db.user.find().pretty();

role = { name: 'ServerAdmin',
         description: 'Server administrator role'
}
db.role.insert(role);
role = { name: 'ContentOwner',
         description: 'End-user (client) role'
}
db.role.insert(role);
db.role.find().pretty();

mydb is a MongoDB database name we use in this example.

4. Realm element setup

Set up the Realm element, as shown below, in your $CATALINA_HOME/conf/server.xml file:

      <Host name="localhost"  appBase="webapps"
            unpackWARs="true" autoDeploy="true">
        ...
        <Realm className="com.daprota.m2.realm.MongoDBRealm"
               connectionURL="mongodb://localhost:27017/mydb"
               digest="SHA-256"/>
      </Host>
 </Engine>

5. How to encrypt user’s password

The following Java code snippet is an example of how to encrypt a user’s password:

import java.security.MessageDigest;

String password = "password";

// Use the same algorithm configured in the Realm element (SHA-256)
MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(password.getBytes());
byte[] byteData = messageDigest.digest();

// Convert the digest bytes to hex format
StringBuilder hexString = new StringBuilder();
for (int i = 0; i < byteData.length; i++) {
    String hex = Integer.toHexString(0xff & byteData[i]);
    if (hex.length() == 1)
        hexString.append('0');
    hexString.append(hex);
}

When you store the password in MongoDB, store the value of hexString.toString().

6. MongoDB realm source code

The source code of the Tomcat security realm implementation with MongoDB, together with the ready-to-use m2-mongodb-realm.jar, is available at

https://github.com/gzugic/mongortom

You just need to copy m2-mongodb-realm.jar to your $CATALINA_HOME/lib.

Semantic Wikis

Semantic wikis combine wikis, which enable simple and quick collaborative text editing over the Web, with the Semantic Web, which enriches the data on the Web with well-defined meaning to make information easier to find, share, and combine. A semantic wiki extends a classical wiki by integrating management capabilities for formal knowledge representations. The long-term goal of semantic wikis should be to provide well-structured knowledge representations and sound reasoning over these representations in a user-friendly way.

We can classify semantic wikis into two groups [5]:

  • Text-centered
  • Logic-centered

The text-centered semantic wikis enrich classical wiki environments with semantic annotations relating the textual content to a formal ontology. These wikis are text-oriented: their goal is not to manage ontologies but to provide a formal backbone to wiki articles. Semantic MediaWiki, KiWi (a continuation of IkeWiki), and KnowWE are examples of this semantic wiki type.

The logic-centered semantic wikis are designed and used as ontology engineering platforms. AceWiki and OntoWiki are examples of this semantic wiki type.

There are other applications that also exhibit semantic wiki characteristics (e.g. Freebase, Twine, Knoodl, and others).

Text-Centered Wikis

We will describe some key features of three text-centered wikis: Semantic MediaWiki, KiWi, and KnowWE.

Semantic MediaWiki (SMW)

SMW (Fig. 1) [7] (http://semantic-mediawiki.org/wiki/Semantic_MediaWiki) is an extension to MediaWiki that enables wiki users to semantically annotate wiki pages, based on which the wiki contents can be browsed, searched, and reused in novel ways. RDF and OWL are used in the background to formally annotate information in wiki pages.

The integration between MediaWiki and SMW is based on MediaWiki's extension mechanism: SMW registers for certain events or requests, and MediaWiki calls SMW functions when needed.

SMW organizes content within wiki pages. These pages are further classified into namespaces, which differentiate pages according to their function. The namespaces are defined through the wiki configuration; they cannot be defined by users. Examples of namespaces are “User:” for user home pages, “Help:” for documentation pages, etc. Every page corresponds to an ontological element (including classes and properties) that can be further described by annotations on that page. The semantic roles that wiki pages can play are distinguished by the namespaces. The wiki pages can be:

  • Individual elements of a domain of interest
  • Categories (used to classify individual elements and to create sub-categories)
  • Properties (relationships between two pages or a page and a data value)
  • Types (used to distinguish various kinds of properties)

Each page in SMW can be assigned to one or more categories where each category is associated with a page in the “Category:” namespace. Category pages can be used to browse the classified pages and to organize categories hierarchically.

SMW collects information about the concept represented by a wiki page, not about the associated text. SMW collects semantic data via semantic annotations (markups) added to the wiki text by users. The markup processing is done by the SMW components for parsing and rendering (Fig. 1).

Figure 1: Semantic MediaWiki Architecture (copied from [7])

The underlying SMW semantic conceptual framework, based on properties and types, is the core component of SMW's semantic processing. Properties are used to specify relationships between one entity (as represented by a wiki page) and other entities and data values. SMW lets wiki users control the set of available properties, since each community is interested in different types of relationships in its domain of interest. Properties are used to augment a wiki page's content in a structured way. SMW characterises hyperlinks between wiki pages as properties (relationships), where the link's target becomes the value of a user-provided property. This does not mean that all properties take link targets as their values: property values can also be geographical coordinates, numeric values, dates, etc.

SMW also provides a special type of wiki page just for properties. For example, a wiki might contain a page “Property:Population”, where “Property:” is a namespace prefix. A property page can contain a textual description of the property, the data type for the property's values, etc. SMW provides a number of data types that can be used with properties (e.g., “String”, “Date”, “Geographic coordinate”, etc.).

Semantic annotations of a subject described by a wiki page are mapped to the OWL DL [8] ontology language. Most annotations are mapped to OWL statements similar to RDF triples: wiki pages to abstract individuals, properties to OWL properties, categories to OWL classes, and property values to either abstract individuals or typed literals. Since OWL further distinguishes object properties, data properties, and annotation properties, SMW properties can map to any of those depending on their type. SMW also provides built-in properties that may have a special semantic meaning. SMW can also be configured to interpret MediaWiki's hierarchical organisation of categories as an OWL class hierarchy.

SMW is not intended as a general purpose ontology platform and because of that the semantic information representable in SMW is of limited scope.

SMW has a proprietary query language whose syntax is closely related to wiki text and whose semantics corresponds to certain class expressions in OWL DL.

Generally speaking, queries carry some of the highest performance costs in any system, and it is the same with SMW. SMW has features and best-practice recommendations to help users manage query performance: a caching mechanism, SMW parameters to restrict query processing and complexity, limits on the size of query constructs, individual reasoning features that can be disabled, etc. SMW uses two separate data stores: one for MediaWiki pages, and another for semantic data related to the subjects (concepts) described by these pages. Both stores are based on the MySQL database; however, the MySQL semantic data store can be replaced by faster data stores if they are available.

KiWi

KiWi (Knowledge in a Wiki) [6] (a continuation of IkeWiki) (http://www.kiwi-project.eu) aims at providing collaborative knowledge management based on semantic wikis. It augments existing informal articles (e.g., from Wikipedia) with formal annotations.

KiWi provides a platform for building different kinds of social semantic software powered by Semantic Web technologies. It enables content versatility, that is, the reuse of the same content in different kinds of social software applications.

In KiWi, every piece of information is a combination of human-readable content and associated metadata. The same piece of information can be presented to the user in many different forms, such as a wiki page, a blog post, or a comment on a blog. The display of the information is determined by the metadata of the content and the context in which the content is used (e.g., user preferences, device used, type of application, etc.). Since metadata in KiWi is represented by RDF, it does not require a priori schema definitions, and the meta-model of a system can be extended in real time.

“Content item” is the smallest piece of information in KiWi. It consists of human-readable content in XHTML and associated metadata in RDF, and it is identified by a URI. KiWi creates, stores, updates, versions, searches, and queries content items. The core properties of a content item (e.g., content, author, and creation date) are represented in XML and persisted in a relational database; all other properties can be defined using RDF properties and relations. It is possible to make a KiWi system part of the Linked Open Data cloud, depending on how the content item's URI is generated.

With regard to search, KiWi supports a combination of full-text search, metadata search (tags, types, persons), and database search (date, title).

The KiWi platform (Fig. 2) is structured into layers: Model Layer, Service Layer, Controller Layer, and View Layer.

Figure 2: KiWi Architecture (copied from [6])

The Model Layer manages content and its related metadata. It is implemented via a relational database, a triple store, and a full-text index. Entities are persisted (in a relational database) using the Hibernate framework. The KiWi triple (RDF) store is an implementation based on the relational database. The full-text index is implemented using Hibernate Search.

The Service Layer provides services to upper layers. For example, the EntityManager service provides a unified access to content items in an entity database while the TripleStoreService provides a unified access to an RDF store.

The Controller Layer includes action components that implement specific KiWi functionality. Action components mostly implement functionalities offered in the user interface (e.g., view, edit, annotate, etc.). They use service components to access the KiWi content and metadata.

The View Layer enables user interaction with KiWi via a browser, and it also offers web services for accessing the triple store and SKOS thesauri. There is also a linked open data service that provides the KiWi content to linked open data clients.

The KiWi core data model includes three concepts: Content Item, Tag, and Triple. Additional functionality can be added by KiWi Facades. The Content Item is the core concept in KiWi; it represents a “unit of information”. A user always interacts with a primary Content Item. Content Items can be any type of content (e.g., a wiki page, a user profile, a rule definition, etc.); KiWi is not restricted to specific content formats. All Content Items are identified by Uniform Resource Identifiers (URIs). The textual or media content that belongs to a resource is for human consumption. Generally speaking, each resource has both a machine-readable and a human-readable form (description). Machine-readable content (semantic data) is represented as RDF and stored in a triple (RDF) store. Human-readable (text) content is internally structured as an XML document that can be queried and transformed to other representations like HTML and XSL-FO (for PDF and other printable formats). Tags are used to annotate Content Items; for example, they can be used to associate a Content Item with a specific topic or to group Content Items in knowledge spaces. Tags are mapped to an RDF structure. Machine-readable metadata is stored in the form of extended Triples that contain additional information related to internal maintenance (e.g., versioning, transactions, associations between triples and other resources in KiWi, etc.).

KnowWE

KnowWE (http://www.is.informatik.uni-wuerzburg.de/en/research/applications/knowwe/ ) [3,4] is a knowledge wiki. In a semantic wiki every wiki page represents a distinct concept from a specific domain of interest. Knowledge wikis further represent a possible solution with every wiki page. On every page the content is described by semantically annotated text and/or by multimedia content (e.g., pictures, diagrams, etc.). The embedded knowledge can be used to derive the concept related to the particular wiki article.

In KnowWE the knowledge base is entered together with the standard text by using appropriate textual knowledge markups. When a wiki page is saved, the included markups are extracted and compiled into an executable knowledge base corresponding to the wiki page and stored in a knowledge base repository.

In KnowWE, a user is able to start an interactive interview by entering a problem description. When the user enters his/her inputs, an appropriate list of solutions is presented. These solutions are linked to the wiki pages representing the presented solution concepts. Every solution represented in the wiki is considered during the problem-solving process.

KnowWE uses a problem-solving ontology as an upper ontology of all concepts defined in an application project. All concepts and properties inherit from the concepts and properties of the upper ontology.

A new solution is added to KnowWE by creating a new wiki page having the solution’s name. The wiki page contains human readable text and explicit knowledge for deriving the new solution.

KnowWE is intended for small closed communities and semi-open medium-sized communities. It is based on the implementation of JSPWiki. Its parsing engines and problem-solvers are based on the d3web project [9].

Logic-Centered Wikis

AceWiki and OntoWiki will be presented below as examples of logic-centered wikis.

AceWiki

AceWiki (http://attempto.ifi.uzh.ch/acewiki/) uses the controlled natural language Attempto Controlled English (ACE) for representing its content. The goal of logic-centered (ontology engineering) semantic wikis is to make acquisition, maintenance, and analysis of formal knowledge simpler and faster. According to the AceWiki team, most of the existing semantic wikis “have a very technical interface and are restricted to a relatively low level of expressivity” [5]. AceWiki provides two advantages: first, it improves usability and achieves a shallow learning curve, since a controlled natural language is used; second, ACE is more expressive than the existing semantic formal languages (e.g., RDF, OWL, etc.).

The use of a controlled natural language allows ordinary users who are not experts in ontologies and formal semantic languages to create, understand, and modify formal wiki content. The main goals of AceWiki are to improve knowledge aggregation and knowledge representation, and to support a higher degree of expressivity. The key design principles of AceWiki are naturalness (formal semantics is expressed in the form of a natural language), uniformity (only one language is used at the user interface layer), and strict user guidance (a predictive editor ensures that users create well-formed statements).

All other semantic wikis used as ontology engineering platforms (e.g., OntoWiki and others) allow the creation of formal statements (annotations, metadata) that are considered not the main content but rather an enrichment of it, which is not the case when a controlled natural language is used.

ACE looks like English, but it is fully controlled, without the ambiguities of natural language. Every ACE text has a well-defined formal meaning, since the ACE parser translates the ACE text into Discourse Representation Structures, which are a syntactic variant of first-order logic. The Semantic Web generally classifies semantic languages into three main high-level categories: ontology languages (e.g., OWL), rule languages (e.g., SWRL, RIF), and query languages (e.g., SPARQL). ACE plays the role of all of these languages. AceWiki translates ACE sentences into OWL, which enables reasoning with existing OWL reasoners.

OntoWiki

OntoWiki [1] (http://ontowiki.net/Projects/OntoWiki) provides support for distributed knowledge engineering scenarios. OntoWiki does not mix text editing with knowledge engineering; it applies the wiki paradigm to knowledge engineering only. OntoWiki facilitates the visual presentation of a knowledge base as an information map, with different views on instance data. It enables authoring of semantic content, with an inline editing mode for editing RDF content, similar to WYSIWYG for text documents. OntoWiki regards RDF-based knowledge bases as “information maps”. Each node in the information map (represented by an OntoWiki Web page) is interlinked with related digital resources. The details of the OntoWiki Web page structure are provided in [1].

OntoWiki supports collaboration aspects through tracking changes, allowing comments and discussions on every part of a knowledge base, enabling to rate and measure the popularity of content and honoring the activity of users.

The main goal of the OntoWiki approach is to rapidly simplify the presentation and acquisition of knowledge.

OntoWiki pages, which represent nodes in the information map, are divided into three sections: a left sidebar, a main content section, and a right sidebar. The left sidebar offers selections that include knowledge bases, a class hierarchy, and a full-text search. When a selection is made, the main content section will show matching content in a list view linking to individual views for matching nodes. The right sidebar provides tools and complementary information specific to the selected content.

OntoWiki provides reusable user interface components for data editing, called widgets. Some of the provided widgets are: Statements, to edit subjects, predicates, and objects; Nodes, to edit literals or resources; File, to upload files; etc.

Social collaboration within OntoWiki is one of its main characteristics. According to OntoWiki, this eases the exchange of meta-information about the knowledge base and promotes collaboration scenarios where face-to-face communication is hard. It also contributes to creating an “architecture of participation” that enables users to add value to the system as they use it. The main social collaboration features of OntoWiki are:

  • Change tracking – All changes applied to a knowledge base are tracked.
  • Commenting – Statements presented to the user can be annotated and commented on.
  • Rating – OntoWiki allows users to rate instances.
  • Popularity – All knowledge base accesses are logged, which makes it possible to arrange views of the content based on popularity.
  • Activity/Provenance – OntoWiki keeps track of what was contributed and by whom.

OntoWiki has been implemented as an alternative user interface for the schema editor in pOWL [2] that is a platform for Semantic Web application development.

OntoWiki is designed to work with knowledge bases of arbitrary size. OntoWiki loads only those parts of the knowledge base into main memory which are required to display the requested information.

References

1. Auer, S., Dietzold, S., Lehmann, J., Riechert, T.: OntoWiki: A Tool for Social, Semantic Collaboration. CKC, 2007. http://www2007.org/workshops/paper_91.pdf
2. Auer, S.: pOWL – A Web Based Platform for Collaborative Semantic Web Development. http://powl.sourceforge.net/overview.php
3. Baumeister, J., Reutelshoefer, J., Puppe, F.: KnowWE: A Semantic Wiki for Knowledge Engineering. In: Applied Intelligence (2010)
4. Baumeister, J.; Puppe, F.: Web-based Knowledge Engineering using Knowledge Wikis. Proc. of the AAAI 2008 Spring Symposium on “Symbiotic Relationships between Semantic Web and Knowledge Engineering”, pp. 1-13, Stanford University, USA, 2008. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.143.955&rep=rep1&type=pdf
5. Kuhn, T.: How Controlled English can Improve Semantic Wikis. Proceedings of the Fourth Workshop on Semantic Wikis. European Semantic Web Conference 2009, CEUR Workshop Proceedings, 2009. http://attempto.ifi.uzh.ch/site/pubs/papers/semwiki2009_kuhn.pdf
6. Schaffert, S., Eder, J., Grünwald, S., Kurz, T., Radulescu, M., Sint, R., Stroka, S.: KiWi – A Platform for Semantic Social Software. In: 4th Workshop on Semantic Wikis (SemWiki2009) at ESWC09, Heraklion, Greece, June 2009.
7. Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., Studer, R.: Semantic Wikipedia. In Journal of Web Semantics 5/2007, pp. 251–261. Elsevier 2007. http://korrekt.org/page/Semantic_Wikipedia_(JWS2007)
8. W3C: OWL, http://www.w3.org/standards/techs/owl#w3c_all
9. d3web, http://sourceforge.net/apps/mediawiki/d3web/index.php?title=Main_Page

Uniform Resource Identifier (URI) – Definition and Types

The Uniform Resource Identifier (URI) is one of the most important concepts of the Semantic Web. The introduction of knowledge representation technologies and inference that operate over Web resources identified, and indirectly defined, via Uniform Resource Identifiers (URIs) is one of the key features of the Semantic Web (as stated in [3]).

URI Definition

A Uniform Resource Identifier (URI) is a string of characters used for naming, identifying, addressing, and defining resources.

What really matters here is what we can do with a URI [2]. If we can dereference the URI it will indirectly give us authoritative information about a resource it identifies. On the other hand, the URI is also useful to others if it uniquely identifies the resource even if the resource is not completely described. This unique resource identification enables others to provide more information about the resource and it creates a network effect.

URIs were originally defined as two types:

  • Uniform Resource Locators (URLs) which are addresses with network locations, and
  • Uniform Resource Names (URNs) which are persistent names that are address independent.

A URL is a URI that identifies a network-homed resource and also specifies the means of acting upon or obtaining a representation of it, either through a description of the primary access mechanism or through a network “location”. For example, the URL http://www.yahoo.com identifies a resource (Yahoo's home page) and implies that a representation of that resource (the home page's current HTML code) is obtainable via HTTP from a network host named www.yahoo.com.

A Uniform Resource Name (URN) is a URI that identifies a resource by name, in a particular namespace. A URN can be used to identify a resource without implying its location or how to access it. For example, the URN urn:isbn:0-112-99333-8 is a URI that specifies the unique reference within the International Standard Book Number (ISBN) identifier system. It references a book, but doesn’t suggest where and how to obtain an actual copy of the book.

The URN defines a resource’s identity, while the URL provides a method for finding it.

Today most informational sources about URIs reference the term URI only. For example, URL now serves only as a reminder that some URIs act as addresses because they have schemes that imply some kind of network accessibility.

Clearing Up the Confusion Between Uniform and Universal

Originally, Tim Berners-Lee used the word “Universal” in naming the Universal Resource Identifier (URI). Later on, the publication of RFC 2396 [4] in August 1998 changed the significance of the “U” in “URI” from “Universal” to “Uniform”. Unfortunately, these two words are still mixed up in URI articles and papers today. To be consistent, we will refer to the URI as Uniform Resource Identifier.

Resources and URIs

To publish data, we first have to identify the items of interest in our domains. The items of interest are the things whose attributes (properties) and relationships we want to describe in the data. We refer to these items of interest as resources.

We distinguish between two types of resources:

  • Information resources
  • Non-information resources

Information resources are files located on the Web (Internet and/or Intranet) and they include documents, images, and other media files.

Non-information resources are real-world resources that exist outside of the Web. They can be classified into two groups: physical objects (e.g., people, books, buildings, etc.) and abstract concepts (e.g., color, height, weight, etc.).

Resource Identification

Each resource can be identified using a URI. We recommend using HTTP URIs only and avoiding other URI schemes such as URNs.

Widely available mechanisms (DNS and web servers, respectively) exist to support the use of HTTP URIs to not only globally identify resources without centralized management but also retrieve representations of information resources.

HTTP also provides substantial benefits, in terms of installed software base, scalability, and security, at low cost.

Resource Representation

Information resources can have representations. A representation is a stream of bytes in a certain format, such as HTML, JPEG, or RDF/XML. A single information resource can have many different representations (e.g., different content formats, natural languages, etc.).

Dereferencing HTTP URIs

URI dereferencing is the process of looking up a URI on the Web in order to get information about the referenced resource. Since we have two types of resources (information and non-information), this is how the URIs identifying these resources are de-referenced:

  • Information Resources – The server that manages the URI generates a representation of the information resource and sends it back to the client with the HTTP response code 200 OK.
  • Non-Information Resources – These cannot be de-referenced directly. Instead of sending a representation of the resource, the server uses the HTTP response code 303 See Other to send the URI of an information resource that describes the non-information resource. In one more step, the client de-references this new URI and gets a representation that describes the original non-information resource (see the sketch below).
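
The sketch below walks through this 303 redirect from the client side. It assumes Node.js 18+ (for the built-in fetch), is run as an ES module (for top-level await), and uses a hypothetical Identifier URI; it is not tied to any particular Linked Data server.

// Hypothetical Identifier URI naming a non-information resource (a guitar).
const identifierUri = 'http://example.org/id/guitar/42';

// Ask for RDF and handle redirects manually so the 303 stays visible.
const res = await fetch(identifierUri, {
  redirect: 'manual',
  headers: { Accept: 'application/rdf+xml' }
});

if (res.status === 303) {
  // The Location header carries the URI of the describing document.
  const documentUri = res.headers.get('location');
  // De-referencing the document URI returns 200 OK with a representation.
  const doc = await fetch(documentUri);
  console.log(await doc.text());
}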

For data publishing we can use two approaches to provide clients with URIs of information resources describing non-information resources:

  • Hash URIs
  • 303 redirects

These two methods are described in [1].

Hash URIs are the better choice for small and stable sets of resources that evolve together. Ideal cases are RDF Schema vocabularies and OWL ontologies, whose terms are used together and whose number of terms usually does not grow much.

Hash URIs without content negotiation can be implemented by simply uploading static RDF files to a Web server, without any special server configuration. This makes them popular for quick-and-dirty RDF publication.

303 URIs are used for large sets of data that may grow, and when it becomes impractical to serve all related resources in a single document.

If in doubt, it’s better to use the more flexible 303 URI approach.

URI Types

HTTP-based URIs are mostly used as unambiguous names of non-information resources. At the same time, the same URI can be used as a document (information resource) locator. For example, if you use a URI to name a guitar, you can also provide a document, accessible via that URI (document locator), that describes (in a formal or in an informal way) the guitar the URI names.

However, using the same URI for the name of a resource and for the location of a document describing the resource creates an ambiguity. To avoid this, the W3C Technical Architecture Group (W3C TAG) recommends that the URI naming the non-information resource should forward (using an HTTP 303 See Other status code) to a related URI for retrieving the descriptive document (information resource) about the resource.

What does this mean? It means that for each non-information resource we need at least two URIs: one URI (Identifier URI) to name the resource, and another URI (Document URI) for the location of its related descriptive document.

Each resource should have a concept in an ontology that models the specific domain the resource belongs to. This means we also need a URI (Concept URI) to name the concept that models the resource, and another URI (Ontology URI) that names the domain ontology this concept belongs to. The Ontology URI should be directly derivable from the Concept URI, since the Concept URI should fully contain the Ontology URI it belongs to. All of this applies when a concept is fully defined in the domain ontology. However, if a concept is referenced from another domain ontology, its Ontology URI should still belong to the current domain ontology, but the other semantic metadata details of the concept should be extracted from its original domain ontology when needed.

There is also one more aspect of a document describing a resource: its representations, since each document can have one or more representations (e.g., Text, HTML, RDF, OWL, etc.). Each document representation needs its own URI. We call this URI type the Document Representation URI.

Finally, this is the list of all the URI types related to a single resource, as described above:

  • Identifier URI
  • Document URI
  • Document Representation URI (one per each document representation)
  • Concept URI
  • Ontology URI

URI Denoting

Two approaches are available for denoting the four core URI uses (name, concept, document location, and document representation(s)):

  • Different URI types
  • Different context

The different URI types are already explained above. The different context approach requires syntactic conventions for indicating the intended context in which the URI is referenced. Both approaches have their advantages and disadvantages. We recommend the use of the different URI types approach since it is emerging as a common approach that is also used in the Linked Data area.

Some pros and cons of both approaches described in [5] are:

Different Names
Pro: Name “shows” what a given URI identifies. Consistent meaning across languages.
Con: It requires people to agree on which of these four things a URI should indicate.

Different Context
Pro: It does not require everyone to agree on which of these four things a URI should indicate.
Con: Each Semantic Web language must have a language construct to clearly specify which of these four things is intended when a URI is written in that language.

Reference Material

1. Cool URIs for the Semantic Web
W3C Working Draft, 17 December, 2007, http://www.w3.org/TR/2007/WD-cooluris-20071217/
2. URIs and the Myth of Resource Identity
David Booth, HP Software, 2006, http://www.dbooth.org/2006/identity/
3. An Ontology of Resources for Linked Data
Harry Halpin, Valentina Presutti, 2009, http://events.linkeddata.org/ldow2009/papers/ldow2009_paper19.pdf
4. RFC 2396, http://www.ietf.org/rfc/rfc2396.txt
5. Four Uses of a URL: Name, Concept, Web Location and Document Instance
David Booth, 2003, http://www.w3.org/2002/11/dbooth-names/dbooth-names_clean.htm

Semantic Web Architecture

Before we proceed with an overview of the Semantic Web Architecture, let us first define the goals of the Semantic Web:

  • Consistent Knowledge – Knowledge about a specific thing is consistent across distributed knowledge sources. Any knowledge change should be reflected across all mutually dependent knowledge sources.
  • Accessing Knowledge – Providing the most accurate knowledge aggregated from all available sources.
  • Reusing Knowledge – Knowledge can be fully reused, internally and/or externally, by having it formally structured and open.

Architectural Principles

An architecture is a model of a system that is defined within a certain context. The model is an abstraction of a real-world representation. The context defines the viewpoint from which the system is appraised. It also determines the components necessary to implement the system, the properties of these components, the relationships between the components, and the relationships with external entities.

The layered architecture is an architectural pattern widely used today to present the conceptual Semantic Web Architecture. Generally, in a layered architecture the components of the system are arranged in a layered structure where each layer represents a group of elements providing related services. There are two types of this architecture: open and closed. In the open architecture, higher layers can use all or a subset of the services of all lower layers. In the closed architecture, a higher layer can use only the services of the immediately lower layer. In both types of layered architecture, lower layers cannot use services from higher layers.

A typical, widely used layered architecture is the ISO/OSI (International Organization for Standardization / Open Systems Interconnection) architecture [1]. This architecture specifies the functional foundation needed to define protocols for network interoperability between applications. The Semantic Web architecture has the same purpose as the ISO/OSI architecture in that it specifies the languages required for data interoperability between applications.

The context of the Semantic Web Architecture is the set of languages required for meta-data specifications and for reasoning about the information (knowledge) specified via those meta-data specifications.

Towards Technology Agnostic Conceptual Semantic Web Architecture

A precise definition of architectural layers’ functionalities and their interfaces has to be provided for the Semantic Web applications to interoperate.

Gerber et al. [2] defined the Comprehensive, Functional, Layered (CFL) architecture for the Semantic Web. This architecture is based on the previous versions of the Semantic Web Architecture defined by Tim Berners-Lee [3,4,5,6,7,8,9]. It defines related Semantic Web functionalities, rather than the W3C technologies, as the architectural layers. The CFL architecture enables the use of different technologies to implement the functionality of a specific layer. However, the specification of the functionality interfaces needed to clearly define the boundaries between the layers has not been produced yet.

The CFL architecture proposes two orthogonal architecture stacks, the language stack and the security stack (Fig. 1).


Figure 1: CFL Architecture

The Layers of the CFL Semantic Web Architecture

The layers of the CFL architecture include:

Unique Identification Mechanism

Used to uniquely identify resources. This layer also provides a mechanism for uniquely identifying all the characters in all written languages. Examples of this layer's technologies are Unicode [10] and the Uniform Resource Identifier (URI) [11,12,13].

Syntax Description Language

Provides a language for specifying syntax of various data formats. The most used technologies include XML [14], XML Schema [15], and Namespaces [16].

Meta-data Data Model

Provides a mechanism to model the meta-data required to implement the Semantic Web. The most used technologies for this layer include RDF and RDF Schema (RDFS) [17].

Ontology

This layer provides language support for creation of ontologies. It is instantiated with either RDFS or RDFS and OWL [18].

Rules

This layer covers rule-based languages and their processing. The Rule Interchange Format (RIF) is the W3C rule-based language, and it is compatible with RDF and OWL [19]. There is also an important group of rule-based languages [20] that cannot be layered on top of OWL; they should also be included in the Semantic Web Architecture together with the other RDF- and OWL-related rule-based languages. I will review these languages in one of the future posts. In addition to allowing querying and filtering, the rules layer also supports inference.

Logic Framework

This layer provides an answer regarding the reasoning behind why particular information is derived or presented to the user. Currently a technology specification for this layer does not exist.

Proof

This layer provides an answer to the question of why agents should believe the provided information. Currently a technology specification for this layer does not exist. The Knowledge Systems Laboratory at Stanford has been developing a proof language called PML [21].

Trust

This layer ensures that the information provided is valid and that there is a degree of confidence in the resource that provides it. Currently a technology specification for this layer does not exist.

Identity Verification and Encryption

This layer covers security (identification and encryption at a minimum). It is not part of the language stack; it should be developed as a separate Security Architecture that interfaces with the language stack. The W3C technology specifications that belong to this layer are XML Signature [22] and XML Encryption [23].

References

1. OSI Reference Model – The ISO Model of Architecture for Open Systems Interconnection
Hubert Zimmermann
http://dret.net/biblio/reference/zim80

2. A Functional Semantic Web Architecture
Aurona Gerber, Alta van der Merwe, and Andries Barnard, 2008
http://ksg.meraka.org.za/~agerber/Paper152.pdf

3. Semantic Web – XML2000. W3C Web site 2000
Tim Berners-Lee
http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html

4. WWW2005 Keynote. W3C Web site 2005
Tim Berners-Lee,
http://www.w3.org/2005/Talks/0511-keynote-tbl/

5. Artificial Intelligence and the Semantic Web: AAAI2006 Keynote. W3C Web site 2006
Tim Berners-Lee
http://www.w3.org/2006/Talks/0718-aaai-tbl/Overview.html

6. The Semantic Web and Challenges. W3C Website Slideshow 2003
Tim Berners-Lee
http://www.w3.org/2003/Talks/01-sweb-tbl/Overview.html

7. Standards, Semantics and Survival. SIIA Upgrade 2003, pp. 6-10.
Tim Berners-Lee

8. WWW Past and Future. W3C Web site 2003
Tim Berners-Lee
http://www.w3.org/2003/Talks/0922-rsoc-tbl/slide30-0.html

9. SIIA. Website
http://www.siia.net/

10. Unicode
http://www.unicode.org/

11. Designing URI Sets for the UK Public Sector, V1.0, October 2009
Paul Davidson, CIO, Sedgemoor District Council
http://www.cabinetoffice.gov.uk/media/301253/puiblic_sector_uri.pdf

12. Cool URIs for the Semantic Web
W3C Working Draft, 17 December, 2007
http://www.w3.org/TR/2007/WD-cooluris-20071217/

13. Cool URIs don't change
Tim Berners-Lee
http://www.w3.org/Provider/Style/URI.html

14. XML, W3C
http://www.w3.org/XML/

15. XML Schema, W3C
http://www.w3.org/XML/Schema

16. Namespaces in XML 1.1, W3C
http://www.w3.org/TR/2006/REC-xml-names11-20060816/

17. RDF Standards, W3C
http://www.w3.org/TR/#tr_RDF

18. OWL, W3C
http://www.w3.org/standards/techs/owl#w3c_all

19. RIF RDF and OWL Compatibility, W3C
http://www.w3.org/TR/2010/REC-rif-rdf-owl-20100622/

20. A Realistic Architecture for the Semantic Web
Michael Kifer, Jos de Bruijn, Harold Boley, and Dieter Fensel
http://www.kr.tuwien.ac.at/staff/bruijn/priv/publications/msa-ruleml05.pdf

21. A Proof Markup Language for Semantic Web Services
Paulo Pinheiro da Silva, Deborah L. McGuinness, Richard Fikes
http://ftp.ksl.stanford.edu/pub/KSL_Reports/KSL-04-01.pdf

22. XML Signature, W3C
http://www.w3.org/TR/#tr_XML_Signature

23. XML Encryption, W3C
http://www.w3.org/TR/#tr_XML_Encryption

Information Modeling

Different categories of information modeling exist. Descriptions below summarize definitions from different sources [1,2,3]. The following categories are described:

  • Controlled vocabulary
  • Taxonomy
  • Thesaurus
  • Ontology
  • Metamodel

Controlled Vocabulary

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval [3]. A controlled vocabulary is a list of explicitly enumerated terms [2]. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been pre-selected by the registration authority (or designer) of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the vocabulary.

Sometimes this condition is eased depending on how strict the controlled vocabulary registration authority is.

The following two rules should be enforced at least:

  • If the same term is commonly used to mean different concepts in different contexts, then its name has to be explicitly qualified to resolve this ambiguity.
  • If multiple terms are used to mean the same concept, one of the terms is identified as the preferred term in the controlled vocabulary and the other terms are identified as synonyms or aliases.

In information science, a controlled vocabulary is a selected list of words and phrases which are used to tag units of information so that they may be more easily retrieved by a search [3]. Controlled vocabularies reduce the ambiguity inherent in normal human languages, where the same concept can be given different names, and ensure consistency.

Controlled vocabularies tagged to documents are metadata.

The use of controlled vocabulary ensures that everyone is using the same word for the same concept throughout an organization.

A controlled vocabulary for describing Web pages can dramatically improve Web searching. This culminates in the Semantic Web, where the content of Web pages is described using a machine-readable metadata scheme (e.g., the Dublin Core Initiative [5]) in RDFa [4] (Resource Description Framework in attributes). The content of the entire Web cannot be described using a single metadata scheme; more metadata schemes like the Dublin Core Initiative are needed in different areas of knowledge management.

Controlled vocabularies are used in taxonomies and thesauri.

Taxonomy

A taxonomy or a taxonomic scheme is a collection of controlled vocabulary terms organized into a hierarchical relationship structure [2]. Each term in a taxonomy is in one or more relationships to other terms in the taxonomy. These relationships are called generalization-specialization relationships, or type-subtype relationships, or less formally, parent-child relationships [6]. The subtype has the same properties, behaviours, and constraints as the supertype plus one or more additional properties, behaviours, or constraints.

Most taxonomies limit each term to a single parent and require all parent-child relationships to be of the same type. Some taxonomies allow poly-hierarchy, which means that a term (concept) can have multiple parents. This means that if a term appears in multiple places in a taxonomy, then it is the same term. Specifically, if a term has children in one place in a taxonomy, then it has the same children in every other place where it appears.

A hierarchical taxonomy is a tree structure of classifications for a given set of terms (concepts). The root of this structure is called a classification scheme and it applies to all terms. Nodes below the classification scheme are more specific classifications that apply to subsets of the total set of classified terms. The progress of taxonomy reasoning proceeds from the general to the more specific.

Sometimes the term taxonomy is also applied to relationship schemes other than type-subtype hierarchies (e.g., network structures with other types of relationships). In these cases, a taxonomy may include a single subtype with multiple supertypes.

Thesaurus

A thesaurus is a networked collection of controlled vocabulary terms [2]. A thesaurus uses associative relationships in addition to type-subtype relationships. The expressiveness of the associative relationships in a thesaurus varies, and can be as simple as “related to” (e.g., concept X is related to concept Y).

Thesauri for information retrieval are typically constructed by information specialists, and have their own unique vocabulary defining different kinds of terms and relationships [7].

Terms are the basic semantic units for conveying concepts. They are single-word or multi-word nouns. Verbs can be converted to nouns (e.g., “reads” to “reading”, “paints” to “painting”, etc.). Adjectives and adverbs are not usually used.

When a term is ambiguous, a Scope Note can be added to ensure consistency and give direction on how to interpret the term. The use of scope notes is not mandatory for each term, but having them promotes correct thesaurus use and a correct understanding of the given field of knowledge. Generally, a Scope Note is a brief statement of the intended usage of a term.

Relationships are links between terms. The relationships can be divided into three types: hierarchical, equivalency, or associative [7].

Hierarchical relationships are used to indicate terms that are narrower and broader in scope. The Broader Term (BT) and Narrower Term (NT) notations are used to indicate a hierarchical relationship between terms [8]. Narrower terms follow the NT notation and are included in the broader class represented by the main term. For example:

Libraries
NT Academic Libraries
   Branch Libraries
   Childrens Libraries
   Depository Libraries
   Electronic Libraries
   Public Libraries
   Research Libraries
   School Libraries

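A minimal Python sketch of these hierarchical relationships: adding an NT link records the inverse BT link automatically, keeping the term records consistent. The term names come from the example above:

from collections import defaultdict

NT = defaultdict(set)  # term -> narrower terms
BT = defaultdict(set)  # term -> broader terms

def add_narrower(broader: str, narrower: str) -> None:
    """Record an NT link and its inverse BT link together."""
    NT[broader].add(narrower)
    BT[narrower].add(broader)

for term in ["Academic Libraries", "Branch Libraries", "School Libraries"]:
    add_narrower("Libraries", term)

print(sorted(NT["Libraries"]))         # narrower terms of Libraries
print(sorted(BT["School Libraries"]))  # ['Libraries']
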
The equivalency relationship is used primarily to connect synonyms and near-synonyms.

The Used For (UF) reference is generally used to resolve synonymy problems in natural languages. Terms following the UF notation are not to be used. They represent either (1) synonymous or variant forms of the main term, or (2) specific terms that, for purposes of storage and retrieval, are indexed under a more general term [8]. The example below [8] illustrates the use of UF:

Lifelong Learning
UF Continuous Learning (1967 1980)
   Education Permanente
   Life Span Education
   Lifelong Education
   Permanent Education
   Recurrent Education

The Broader Term (BT) is the opposite of the NT. Terms that follow the BT notation include as a subtype the concept represented by the main (narrower) term:

School Libraries
BT Libraries

Mathematical Models
BT Models

It is also possible for a term to have more than one broader term:

Remedial Reading
BT Reading
   Reading Instruction
   Remedial Instruction

In the Lifelong Learning example above, the former term "Continuous Learning," which has been downgraded to the status of a UF term, is followed by a life-span notation in parentheses (1967 1980). This indicates the time period during which the term was used in indexing.

Sometimes a UF term needs more than one descriptor to represent it adequately [8]. In that case, a pound sign (#) following the UF term specifies that two or more main terms are to be used in coordination. For example [8]:

Folk Culture
UF Folk Drama (1969 1980) #
   Folklore
   Folklore Books (1968 1980) #
   Traditions (Cultura)

Drama
UF Dramatic Utilities (1970 1980)
   Folk Drama (1969 1980) #
   Outdoor Drama (1968 1980) #
   Plays (Theatrical)

The USE reference (opposite of UF) refers an indexer or searcher from a nonusable (nonindexable) term to the preferred indexable term or terms. For example [8]:

Regular Class Placement (1968 1978)
USE Mainstreaming

Continuous Learning (1967 1980)
USE Lifelong Learning

A coordinate or multiple USE reference supports the use of two or more main terms together to represent a single term [8]:

Folk Drama (1969 1980)
USE Drama
AND Folk Culture

Associative relationships are used to connect two related terms whose relationship is neither hierarchical nor equivalent. This relationship is indicated by the Related Term (RT) notation. Terms following the RT notation have a close conceptual relationship to the main term but not the direct type/subtype relationship specified by BT/NT. Part-whole relationships, near-synonyms, and other conceptually related terms appear as RTs.

Associative relationships should be applied with caution, since excessive use of RTs will reduce specificity in searches. Consider the following: if the typical user is searching with term “X”, would they also want resources tagged with term “Y”? If the answer is no, then an associative relationship should not be established.

This is an example of RT relationships [8]:

High School Seniors
RT College Bound Students
   Grade 12
   High School Freshmen
   High School Graduates
   Noncollege Bound Students

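Because excessive use of RTs reduces search specificity, one cautious design is to surface related terms as suggestions rather than expanding the query automatically. A minimal Python sketch, using the example above:

# RT links for the example above; related terms are offered as
# suggestions instead of silently widening the search.
RT = {
    "High School Seniors": {
        "College Bound Students", "Grade 12", "High School Freshmen",
        "High School Graduates", "Noncollege Bound Students",
    },
}

def expand_query(term: str) -> dict:
    """Return the search term plus RT-based suggestions for the user."""
    return {"search": [term], "suggested": sorted(RT.get(term, set()))}

print(expand_query("High School Seniors"))
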
Ontology

What is an ontology? A very short answer from Tom Gruber [12] is:

“An ontology is a specification of a conceptualization.”

A more detailed definition follows.

An ontology is a formal representation of knowledge as a set of concepts, their properties, relationships, and other distinctions within a domain. It is used to describe the domain and to reason about its properties.

Ontologies are used in the Semantic Web, systems engineering, software engineering, process modeling, biomedical informatics, library science, enterprise bookmarking, artificial intelligence, and information architecture as a form of knowledge representation about the world or some part of it. The creation of domain ontologies is also fundamental to the definition and use of an enterprise architecture framework.

This is a formal ontology definition, provided by Tom Gruber, from the Encyclopedia of Database Systems [9]:

“In the context of computer and information sciences, an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse.  The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members).  The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application.  In the context of database systems, ontology can be viewed as a level of abstraction of data models, analogous to hierarchical and relational models, but intended for modeling knowledge about individuals, their attributes, and their relationships to other individuals.  Ontologies are typically specified in languages that allow abstraction away from data structures and implementation strategies; in practice, the languages of ontologies are closer in expressive power to first-order logic than languages used to model databases.  For this reason, ontologies are said to be at the “semantic” level, whereas database schema are models of data at the “logical” or “physical” level.  Due to their independence from lower level data models, ontologies are used for integrating heterogeneous databases, enabling interoperability among disparate systems, and specifying interfaces to independent, knowledge-based services.  In the technology stack of the Semantic Web standards [1], ontologies are called out as an explicit layer.  There are now standard languages and a variety of commercial and open source tools for creating and working with ontologies. “

The W3C Semantic Web standards specify a formal language for encoding ontologies (OWL), in several variants that vary in expressive power [10]. This reflects the intent that an ontology is a specification of an abstract data model (the domain conceptualization) that is independent of its particular form [9]. The Tara Ontology Language [11] is an example of another ontology language.
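
As a small illustration, the following sketch builds a tiny OWL ontology with the third-party Python rdflib package. The Library classes, the serves property, and the namespace URI are invented for the example:

# Requires the third-party rdflib package (pip install rdflib).
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/libraries#")  # invented namespace

g = Graph()
g.bind("ex", EX)

# Two classes in a type-subtype relationship.
g.add((EX.Library, RDF.type, OWL.Class))
g.add((EX.AcademicLibrary, RDF.type, OWL.Class))
g.add((EX.AcademicLibrary, RDFS.subClassOf, EX.Library))

# An invented object property relating libraries to institutions.
g.add((EX.serves, RDF.type, OWL.ObjectProperty))
g.add((EX.serves, RDFS.domain, EX.AcademicLibrary))

print(g.serialize(format="turtle"))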

Metamodel

Metamodeling is the creation of metamodels: collections of concepts, their relationships, and rules within a certain domain. Metamodels are created in metamodeling languages. Some of these languages are:

OWL 2 Web Ontology Language
http://www.w3.org/TR/#tr_OWL_Web_Ontology_Language

RDF Vocabulary Description Language and RDF Schema
http://www.w3.org/TR/#tr_RDF

Models, which are abstractions of phenomena in the real world, are based on metamodels. Models conform to metamodels in the same way programs conform to the programming languages in which they are written. A similar analogy exists between a logical data model (a metamodel) and a dataset (a model) based on that logical data model.
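
The conformance idea can be made concrete with a small sketch: a metamodel declares which concepts and attributes a model may use, and a model either conforms to it or does not. All names below are hypothetical:

# The metamodel declares which concepts exist and which attributes each
# concept may carry; a model conforms if it uses only what is declared.
METAMODEL = {
    "Book": {"title", "author"},
    "Library": {"name", "city"},
}

def conforms(model: list[dict]) -> bool:
    """Check that every model element uses a declared concept and attributes."""
    return all(
        element["concept"] in METAMODEL
        and set(element["attributes"]) <= METAMODEL[element["concept"]]
        for element in model
    )

print(conforms([{"concept": "Book", "attributes": {"title", "author"}}]))  # True
print(conforms([{"concept": "Car", "attributes": {"vin"}}]))               # False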

Common uses [1] for metamodels are:

  • As a schema for semantic data that needs to be exchanged or stored.
  • As a language that supports a particular method or process.
  • As a language to express additional semantics of existing information.

A valid metamodel is an ontology.

References

  1. http://en.wikipedia.org/wiki/Metamodeling
  2. http://infogrid.org/wiki/Reference/PidcockArticle?story=20030115211223271
  3. http://en.wikipedia.org/wiki/Controlled_vocabulary
  4. http://www.w3.org/TR/#tr_RDFa
  5. http://dublincore.org/
  6. http://en.wikipedia.org/wiki/Taxonomies
  7. http://en.wikipedia.org/wiki/Thesauri
  8. Thesaurus of ERIC (Educational Resources Information Center) Descriptors, 14th Edition: http://books.google.ca/books?id=_I8Q-DjLNToC&printsec=frontcover#v=onepage&q&f=false
  9. Ontology Definition from the Encyclopedia of Database Systems http://tomgruber.org/writing/ontology-definition-2007.htm
  10. http://www.w3.org/TR/owl-features/
  11. http://www.semantion.com/documentation/SBP/metamodeling/TaraOntologyLanguage_V1.2.pdf
  12. http://www-ksl.stanford.edu/kst/what-is-an-ontology.html

Enterprise Information Architecture

Information Architecture (IA) models information and knowledge, and specifies architectural components and processes for information and knowledge management.

Some of the IA activities include:

  • Analyze processes that use and create information
  • Analyze information
  • Create data models, metamodels, taxonomies, ontologies, and other information modeling concepts to model information about physical and abstract things
  • Analyze data quality
  • Design user interfaces for managing information
  • etc.

There are two main types of IA:

  • Enterprise Information Architecture (EIA)
  • Large Scale Information Architecture (LSIA)

Enterprise Information Architecture (EIA) provides support for applying information architecture in enterprises. Information and knowledge, for which EIA is responsible, are the foundation of any enterprise. EIA has to operate in the context of enterprise processes.

Understanding the enterprise's strategy and goals is an important factor in the successful application of EIA.

While it is based on the same principles as EIA, LSIA also addresses, beyond the common EIA activities, large-scale issues mostly related to service-oriented (free or commercial) Internet applications that manage very large data sets concurrently accessed by large numbers of users.

Why Is It Important to Understand Business Strategy and Goals?

Detailed understanding of the overall business strategy is one of the key factors in a successful EIA. Enterprise Architects conduct research to identify needs and issues from staff, which are then used to define the goals and scope of projects. Enterprise Information Architects should also provide input into business strategy development; this input concerns enterprise information and how its management affects the overall business strategy. Integrating business strategy into EIA is one of the important architectural aspects. Enterprise Information Architecture, as its name says, puts the enterprise, not a specific project, at the center of its scope. It provides an enterprise-centric model for modeling, managing, and analyzing enterprise information.

The business strategy of the enterprise should be based on a detailed analysis of the enterprise's processes and of the processes that shape the current and future directions of the market the enterprise serves. This way the enterprise ensures that it understands the current functioning of its business in the context of the current market. Based on that, the enterprise can formulate the strategies needed to clearly communicate its next direction(s).

Strategic goals and business objectives are communicated by the senior management of the enterprise. Enterprise Information Architects and other architects have to make sure that these goals and objectives are fully supported and articulated in the architecture of the enterprise systems that will support the business strategy's implementation.

Why Are Processes Important?

Processes are fundamental concepts of any architecture at any level. Everything is a product of a process. A process can be of any type (e.g., business process, manufacturing process, social process, political process, natural process). Besides the overall enterprise strategy and a clear understanding of its direction, Enterprise Information Architects need a detailed understanding of the processes they have to support, so that they can properly model, use, manage, and analyze the information created and controlled by those processes. It is important that managed information is modeled in a process-based context rather than in a disconnected, process-free, purely non-empirical information context.

IA Framework

IA needs an Information Architecture Framework (IAF) that enables full integration of information across projects and enterprise systems. An IAF provides a common, established methodology and tools to model, manage, integrate, and analyze information, in both structured and unstructured form. Best practices and standards are also part of the framework. The IAF itself is agnostic to the specific type of information architecture (i.e., enterprise or large scale).

Measure Results

Enterprise information management projects based on a solid EIA have to demonstrate measurable results. Everything should culminate in a successful implementation of information management that supports enterprise strategies and meets the enterprise's goals. All of these strategies and goals relate to people, whether in an internal enterprise context or in an external one, in the enterprise's dealings with customers and partners. Metrics to measure success have to be developed as well.