Information Modeling

Several categories of information modeling exist. The descriptions below summarize definitions from different sources [1,2,3]. The following categories are described:

  • Controlled vocabulary
  • Taxonomy
  • Thesaurus
  • Ontology
  • Metamodel

Controlled Vocabulary

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval [3]. A controlled vocabulary is a list of explicitly enumerated terms [2]. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been pre-selected by the registration authority (or designer) of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the vocabulary.

In practice, the requirement for unambiguous, non-redundant definitions is sometimes relaxed, depending on how strict the controlled vocabulary registration authority is.

At a minimum, the following two rules should be enforced:

  • If the same term is commonly used to mean different concepts in different contexts, then its name has to be explicitly qualified to resolve this ambiguity.
  • If multiple terms are used to mean the same concept, one of the terms is identified as the preferred term in the controlled vocabulary and the other terms are identified as synonyms or aliases.
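The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ControlledVocabulary`, its method names, the case-insensitive matching, and the sample terms are all hypothetical and are not taken from the cited sources.

```python
class ControlledVocabulary:
    """Toy controlled vocabulary: preferred terms plus synonyms (aliases)."""

    def __init__(self):
        self._definitions = {}  # preferred term -> definition
        self._lookup = {}       # normalized spelling -> preferred term

    @staticmethod
    def _normalize(term):
        # Case- and whitespace-insensitive matching (an assumption of this sketch).
        return " ".join(term.lower().split())

    def add_term(self, preferred, definition, aliases=()):
        """Register a preferred term and the aliases that should map to it."""
        spellings = [preferred, *aliases]
        keys = [self._normalize(s) for s in spellings]
        # Rule 1: a spelling may point to only one concept; otherwise qualify it.
        for spelling, key in zip(spellings, keys):
            if key in self._lookup:
                raise ValueError(f"ambiguous term, qualify it: {spelling!r}")
        self._definitions[preferred] = definition
        # Rule 2: every alias resolves to the single preferred term.
        for key in keys:
            self._lookup[key] = preferred

    def resolve(self, term):
        """Return the preferred term for an authorised term or alias."""
        try:
            return self._lookup[self._normalize(term)]
        except KeyError:
            raise KeyError(f"not an authorised term: {term!r}") from None


vocab = ControlledVocabulary()
# "Mercury" names two concepts, so each entry carries an explicit qualifier.
vocab.add_term("Mercury (planet)", "The planet closest to the Sun.")
vocab.add_term("Mercury (element)", "Chemical element with atomic number 80.",
               aliases=["quicksilver", "hydrargyrum"])

print(vocab.resolve("Quicksilver"))       # Mercury (element)
print(vocab.resolve("mercury (planet)"))  # Mercury (planet)
```

An unqualified lookup such as `vocab.resolve("mercury")` raises `KeyError` in this sketch, which mirrors the restriction to authorised terms.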

In information science, a controlled vocabulary is a selected list of words and phrases that are used to tag units of information so that they may be more easily retrieved by a search [3]. Controlled vocabularies reduce the ambiguity inherent in natural languages, where the same concept can be given different names, and thereby ensure consistency.

Controlled vocabulary terms used to tag documents are metadata.

The use of controlled vocabulary ensures that everyone is using the same word for the same concept throughout an organization.

A controlled vocabulary for describing Web pages can dramatically improve Web searching. This idea culminates in the Semantic Web, where the content of Web pages is described using a machine-readable metadata scheme (e.g., the Dublin Core Metadata Initiative [5]) in RDFa [4] (Resource Description Framework in Attributes). The content of the entire Web cannot be described using a single metadata scheme. More metadata schemes like Dublin Core are needed in different areas of knowledge management.

Controlled vocabularies are used in taxonomies and thesauri.

Taxonomy

A taxonomy or a taxonomic scheme is a collection of controlled vocabulary terms organized into a hierarchical relationship structure [2]. Each term in a taxonomy is in one or more relationships to other terms in the taxonomy. These relationships are called generalization-specialization relationships, or type-subtype relationships, or less formally, parent-child relationships [6]. The subtype has the same properties, behaviours, and constraints as the supertype plus one or more additional properties, behaviours, or constraints.

Most taxonomies require that all parent-child relationships under a single parent be of the same type. Some taxonomies allow poly-hierarchy, which means that a term (concept) can have multiple parents. In that case, a term that appears in multiple places in a taxonomy is the same term everywhere. Specifically, if a term has children in one place in a taxonomy, then it has the same children in every other place where it appears.

A hierarchical taxonomy is a tree structure of classifications for a given set of terms (concepts). The root of this structure is called a classification scheme and it applies to all terms. Nodes below the classification scheme are more specific classifications that apply to subsets of the total set of classified terms. Reasoning in a taxonomy proceeds from the general to the more specific.
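A small Python sketch may make the structure concrete. It assumes a hypothetical `Taxonomy` class and invented sample terms. Each term is stored once with a set of parents, so a term that appears under several parents automatically has the same children everywhere, as described above for poly-hierarchy.

```python
class Taxonomy:
    """Toy type-subtype hierarchy that permits poly-hierarchy."""

    def __init__(self, root):
        self.root = root               # the classification scheme
        self.parents = {root: set()}   # term -> set of direct parents
        self.children = {root: set()}  # term -> set of direct children

    def add(self, term, parent):
        """Place `term` under `parent`; call again to give it another parent."""
        if parent not in self.parents:
            raise KeyError(f"unknown parent: {parent!r}")
        if term == parent or term in self.ancestors(parent):
            raise ValueError("a term cannot be its own ancestor")
        self.parents.setdefault(term, set()).add(parent)
        self.children.setdefault(term, set())
        self.children[parent].add(term)

    def ancestors(self, term):
        """All more general terms, collected by walking up towards the root."""
        found, stack = set(), [term]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found


tax = Taxonomy("Publications")
tax.add("Periodicals", "Publications")
tax.add("Electronic Publications", "Publications")
tax.add("Electronic Journals", "Periodicals")
tax.add("Electronic Journals", "Electronic Publications")  # second parent
tax.add("Open Access Electronic Journals", "Electronic Journals")

print(sorted(tax.ancestors("Open Access Electronic Journals")))
print(sorted(tax.children["Electronic Journals"]))
```

The cycle check in `add` is a design choice of this sketch; it keeps the structure a directed acyclic graph so that walking from specific to general always terminates.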

Sometimes the term taxonomy is also applied to relationship schemes other than type-subtype hierarchies (e.g., network structures with other types of relationships). In these cases, a taxonomy may include a single subtype with multiple supertypes.

Thesaurus

A thesaurus is a networked collection of controlled vocabulary terms [2]. The thesaurus uses associative relationships in addition to type-subtype relationships. The expressiveness of the associative relationships in a thesaurus varies and can be as simple as “related to” (e.g., concept X is related to concept Y).

Thesauri for information retrieval are typically constructed by information specialists, and have their own unique vocabulary defining different kinds of terms and relationships [7].

Terms are the basic semantic units for conveying concepts. They are single-word or multi-word nouns. Verbs can be converted to nouns (e.g., “reads” to “reading”, “paints” to “painting”). Adjectives and adverbs are not usually used.

When a term is ambiguous, a Scope Note can be added to ensure consistency and to give direction on how to interpret the term. Scope notes are not mandatory for every term, but having them promotes correct use of the thesaurus and a correct understanding of the given field of knowledge. Generally, a Scope Note is a brief statement of the intended usage of a term.

Relationships are links between terms. The relationships can be divided into three types: hierarchical, equivalency, or associative [7].

Hierarchical relationships are used to indicate terms that are narrower and broader in scope. The Broader Term (BT) and Narrower Term (NT) notations are used to indicate a hierarchical relationship between terms [8]. Narrower terms follow the NT notation and are included in the broader class represented by the main term. For example:

Libraries
NT Academic Libraries
   Branch Libraries
   Childrens Libraries
   Depository Libraries
   Electronic Libraries
   Public Libraries
   Research Libraries
   School Libraries
   Special Libraries

The equivalency relationship is used primarily to connect synonyms and near-synonyms.

The Used For (UF) reference is used generally to resolve synonymy problems in natural languages. Terms following the UF notation are not to be used. They represent either (1) synonymous or variant forms of the main term, or (2) specific terms that, for purposes of storage and retrieval, are indexed under a more general term [8]. The example below [8] illustrates the use of UF:

Lifelong Learning
UF Continuous Learning (1967 1980)
   Education Permanente
   Life Span Education
   Lifelong Education
   Permanent Education
   Recurrent Education

The Broader Term (BT) is the opposite of the NT. Terms that follow the BT notation include as a subtype the concept represented by the main (narrower) term:

School Libraries
BT Libraries

Mathematical Models
BT Models

It is also possible for a term to have more than one broader term:

Remedial Reading
BT Reading
   Reading Instruction
   Remedial Instruction
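Because BT is the opposite of NT, one direction can be derived from the other. The following Python sketch records only the BT links from the examples above and computes the NT links by inversion. The dictionary layout and the function name `invert` are assumptions made for illustration.

```python
from collections import defaultdict

# BT links copied from the examples above: term -> list of broader terms.
broader = {
    "School Libraries": ["Libraries"],
    "Mathematical Models": ["Models"],
    "Remedial Reading": ["Reading", "Reading Instruction", "Remedial Instruction"],
}


def invert(bt_links):
    """Derive NT links (broader term -> narrower terms) from BT links."""
    nt_links = defaultdict(list)
    for term, broader_terms in bt_links.items():
        for broader_term in broader_terms:
            nt_links[broader_term].append(term)
    return dict(nt_links)


narrower = invert(broader)
print(narrower["Libraries"])            # ['School Libraries']
print(narrower["Reading Instruction"])  # ['Remedial Reading']
```

Storing a single direction and deriving the other is one way to keep the two notations from drifting apart.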

In the UF example above, the former term “Continuous Learning”, which has been downgraded to the status of a UF term, is followed by a “life span” notation in parentheses (1967 1980). This indicates the time period during which the term was used in indexing.

Sometimes a UF term needs more than one descriptor to represent it adequately [8]. In that case, a pound sign (#) following the UF term signifies that two or more main terms are to be used in coordination. For example [8]:

Folk Culture
UF Folk Drama (1969 1980) #
   Folklore
   Folklore Books (1968 1980) #
   Traditions (Culture)

Drama
UF Dramatic Unities (1970 1980)
   Folk Drama (1969 1980) #
   Outdoor Drama (1968 1980) #
   Plays (Theatrical)

The USE reference (opposite of UF) refers an indexer or searcher from a nonusable (nonindexable) term to the preferred indexable term or terms. For example [8]:

Regular Class Placement (1968 1978)
USE    Mainstreaming

Continuous Learning (1967 1980)
USE    Lifelong Learning

A coordinate or multiple USE reference supports the use of two or more main terms together to represent a single term [8]:

Folk Drama (1969 1980)
USE       Drama
AND     Folk Culture
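The USE and UF references can be treated as two views of one lookup table. The sketch below is a minimal illustration in Python: the table name `USE`, the helper names, and the decision to omit the life span notations are assumptions of the example, while the terms themselves come from the entries above.

```python
# Nonpreferred term -> preferred term(s). Two targets form a coordinate USE.
USE = {
    "Regular Class Placement": ("Mainstreaming",),
    "Continuous Learning": ("Lifelong Learning",),
    "Folk Drama": ("Drama", "Folk Culture"),
}


def index_terms(term):
    """Map a searcher's term to the descriptor(s) actually used for indexing."""
    return USE.get(term, (term,))


def uf_entries(use_table):
    """Derive UF lists from USE references; '#' marks coordinate entries."""
    uf = {}
    for nonpreferred, targets in use_table.items():
        marker = " #" if len(targets) > 1 else ""
        for target in targets:
            uf.setdefault(target, []).append(nonpreferred + marker)
    return uf


print(index_terms("Folk Drama"))   # ('Drama', 'Folk Culture')
print(index_terms("Drama"))        # ('Drama',)
print(uf_entries(USE)["Drama"])    # ['Folk Drama #']
```

A term that has no USE reference is returned unchanged, on the assumption that it is already a preferred descriptor.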

Associative relationships are used to connect two related terms whose relationship is neither hierarchical nor equivalent. This relationship is described by the indicator Related Term (RT). Terms following the RT notation have a close conceptual relationship to the main term but not the direct type/subtype relationship specified by BT/NT. Part-whole relationships, near-synonyms, and other conceptually related terms, appear as RTs.

Associative relationships should be applied with caution, since excessive use of RTs will reduce specificity in searches. Consider the following: if the typical user is searching with term “X”, would they also want resources tagged with term “Y”? If the answer is no, then an associative relationship should not be established.

This is an example of the RT relationships [8]:

High School Seniors
RT College Bound Students
   Grade 12
   High School Freshmen
   High School Graduates
   Noncollege Bound Students
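The caution about excessive RTs can be seen in a small search sketch. The RT list is taken from the example above; the tagged documents, their identifiers, and the `search` function are hypothetical and exist only to show how expanding a query with related terms widens the result set.

```python
# RT links for one main term, from the example above.
related = {
    "High School Seniors": [
        "College Bound Students",
        "Grade 12",
        "High School Freshmen",
        "High School Graduates",
        "Noncollege Bound Students",
    ],
}

# Hypothetical documents and the descriptors they are tagged with.
documents = {
    "doc-1": {"High School Seniors"},
    "doc-2": {"Grade 12"},
    "doc-3": {"High School Graduates"},
    "doc-4": {"Libraries"},
}


def search(term, expand_related=False):
    """Return ids of documents tagged with `term`, optionally also with its RTs."""
    wanted = {term}
    if expand_related:
        wanted.update(related.get(term, ()))
    return sorted(doc for doc, tags in documents.items() if tags & wanted)


print(search("High School Seniors"))                       # ['doc-1']
print(search("High School Seniors", expand_related=True))  # ['doc-1', 'doc-2', 'doc-3']
```

Every additional RT link enlarges the expanded result, which is why each link should pass the "would the user also want Y?" test.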

Ontology

What is an ontology? A very short answer from Tom Gruber [12] is:

“An ontology is a specification of a conceptualization.”

A more detailed definition follows.

An ontology is a formal representation of knowledge by a set of concepts, their properties, relationships, and other distinctions within a domain. It is used to describe the domain and to reason about the properties of the domain.

Ontologies are used in the Semantic Web, systems engineering, software engineering, process modeling, biomedical informatics, library science, enterprise bookmarking, artificial intelligence, and information architecture as a form of knowledge representation about the world or some part of it. The creation of domain ontologies is also fundamental to the definition and use of an enterprise architecture framework.

The following formal definition of an ontology, provided by Tom Gruber, is from the Encyclopedia of Database Systems [9]:

“In the context of computer and information sciences, an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application. In the context of database systems, ontology can be viewed as a level of abstraction of data models, analogous to hierarchical and relational models, but intended for modeling knowledge about individuals, their attributes, and their relationships to other individuals. Ontologies are typically specified in languages that allow abstraction away from data structures and implementation strategies; in practice, the languages of ontologies are closer in expressive power to first-order logic than languages used to model databases. For this reason, ontologies are said to be at the “semantic” level, whereas database schema are models of data at the “logical” or “physical” level. Due to their independence from lower level data models, ontologies are used for integrating heterogeneous databases, enabling interoperability among disparate systems, and specifying interfaces to independent, knowledge-based services. In the technology stack of the Semantic Web standards [1], ontologies are called out as an explicit layer. There are now standard languages and a variety of commercial and open source tools for creating and working with ontologies.”

The W3C Semantic Web standards specify a formal language for encoding ontologies (OWL), in several variants that differ in expressive power [10]. This reflects the intent that an ontology is a specification of an abstract data model (the domain conceptualization) that is independent of its particular form [9]. Tara Ontology Language [11] is an example of another ontology language.
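To illustrate the representational primitives named in the definition (classes, attributes, relationships) and the idea of reasoning over them, here is a deliberately small Python sketch. It is not OWL and does not use any ontology library. The triple layout, the predicate names `subClassOf` and `type`, and all sample classes and individuals are assumptions chosen for the example.

```python
# (subject, predicate, object) statements about a tiny, hypothetical domain.
triples = {
    ("Library", "subClassOf", "Organization"),
    ("SchoolLibrary", "subClassOf", "Library"),
    ("north_campus_library", "type", "SchoolLibrary"),
    ("north_campus_library", "openedInYear", 1998),
}


def objects(subject, predicate):
    """All objects stated for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def superclasses(cls):
    """Every class reachable from `cls` through subClassOf (transitive)."""
    found, stack = set(), [cls]
    while stack:
        for parent in objects(stack.pop(), "subClassOf"):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def types_of(individual):
    """Stated classes of an individual plus the classes inferred from them."""
    stated = objects(individual, "type")
    inferred = set(stated)
    for cls in stated:
        inferred |= superclasses(cls)
    return inferred


print(sorted(types_of("north_campus_library")))
# ['Library', 'Organization', 'SchoolLibrary']
```

Only one class membership is stated explicitly. The other two follow from the class hierarchy, which is a very simple instance of reasoning about the properties of a domain.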

Metamodel

Metamodeling is the creation of metamodels, which are collections of concepts, their relationships, and rules within a certain domain. Metamodels are created in metamodeling languages. Some of these languages are:

OWL 2 Web Ontology Language
http://www.w3.org/TR/#tr_OWL_Web_Ontology_Language

RDF Vocabulary Description Language and RDF Schema
http://www.w3.org/TR/#tr_RDF

Models, which are abstractions of phenomena in the real world, are based on metamodels. Models conform to metamodels in the same way that programs conform to the programming languages in which they are written. A similar analogy exists between a logical data model (metamodel) and a dataset (model) based on that logical data model.
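The notion of a model conforming to a metamodel can be sketched as a simple validation step. In the Python example below, the `METAMODEL` dictionary, the concept and relationship names, and the function `conformance_errors` are all hypothetical; a real metamodeling language expresses far richer rules.

```python
# Hypothetical metamodel: which concepts, attributes, and relationships exist.
METAMODEL = {
    "concepts": {
        "Term": {"label"},
        "ScopeNote": {"text"},
    },
    "relationships": {
        # relationship name: (source concept, target concept)
        "broader": ("Term", "Term"),
        "has_scope_note": ("Term", "ScopeNote"),
    },
}


def conformance_errors(model, metamodel=METAMODEL):
    """Return a list of ways in which `model` violates `metamodel`."""
    errors = []
    elements = model["elements"]
    for elem_id, elem in elements.items():
        concept = elem.get("concept")
        allowed = metamodel["concepts"].get(concept)
        if allowed is None:
            errors.append(f"{elem_id}: unknown concept {concept!r}")
            continue
        extra = set(elem) - {"concept"} - allowed
        if extra:
            errors.append(f"{elem_id}: unexpected attributes {sorted(extra)}")
    for source, relation, target in model["links"]:
        signature = metamodel["relationships"].get(relation)
        if signature is None:
            errors.append(f"unknown relationship {relation!r}")
            continue
        actual = tuple(elements.get(e, {}).get("concept") for e in (source, target))
        if actual != signature:
            errors.append(f"{source} -{relation}-> {target}: expected {signature}")
    return errors


good_model = {
    "elements": {
        "t1": {"concept": "Term", "label": "Libraries"},
        "t2": {"concept": "Term", "label": "School Libraries"},
        "n1": {"concept": "ScopeNote", "text": "Libraries operated by schools."},
    },
    "links": [("t2", "broader", "t1"), ("t2", "has_scope_note", "n1")],
}
bad_model = {
    "elements": {
        "t1": {"concept": "Term", "label": "Libraries"},
        "n1": {"concept": "ScopeNote", "text": "A note."},
    },
    "links": [("n1", "broader", "t1")],  # a scope note cannot have a broader term
}

print(conformance_errors(good_model))      # []
print(len(conformance_errors(bad_model)))  # 1
```

The check plays the role that a compiler's type checker plays for a program: the metamodel defines what may be said, and the model is accepted only if it stays within those limits.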

Common uses [1] for metamodels are:

  • As a schema for semantic data that needs to be exchanged or stored.
  • As a language that supports a particular method or process.
  • As a language to express additional semantics of existing information.

A valid metamodel is an ontology.

References

  1. http://en.wikipedia.org/wiki/Metamodeling
  2. http://infogrid.org/wiki/Reference/PidcockArticle?story=20030115211223271
  3. http://en.wikipedia.org/wiki/Controlled_vocabulary
  4. http://www.w3.org/TR/#tr_RDFa
  5. http://dublincore.org/
  6. http://en.wikipedia.org/wiki/Taxonomies
  7. http://en.wikipedia.org/wiki/Thesauri
  8. Thesaurus of ERIC (Educational Resources Information Center) Descriptors, 14th Edition: http://books.google.ca/books?id=_I8Q-DjLNToC&printsec=frontcover#v=onepage&q&f=false
  9. Ontology Definition from the Encyclopedia of Database Systems http://tomgruber.org/writing/ontology-definition-2007.htm
  10. http://www.w3.org/TR/owl-features/
  11. http://www.semantion.com/documentation/SBP/metamodeling/TaraOntologyLanguage_V1.2.pdf
  12. http://www-ksl.stanford.edu/kst/what-is-an-ontology.html