Java Code Generation from MongoDB Data Models Created in Daprota M2

Daprota has released a new version of M2, its MongoDB data modeling service, which now includes a generator that produces Java code and JSON from MongoDB data models created in M2.

The generated code includes persistence APIs based on the MongoDB Java driver, NoSQLUnit tests, and test data in JSON format.

Once the code is generated, you can download it and use Apache Maven to build the software and run the tests with a single Maven command (mvn install).

You can save a significant amount of time and effort by creating MongoDB data models in M2 and then generating Java code with a single click. Code quality also benefits, because the generated code comes with unit tests, and M2 does all of this for you in a fully automated fashion.

This kind of service is also useful for quickly creating disposable schemas (data models) in an agile environment: you create a schema, generate Java persistence code from it, test it immediately, and then repeat the cycle with an updated schema or a completely new one.

As soon as you become familiar with creating data models in M2, which is very intuitive, going from a model to built and tested software takes very little time.

All your models are fully managed in M2. You can store your models’ documentation in the M2 repository or provide links to external documentation.

Daprota has also documented its MongoDB Data Modeling Advisor, which you can consult to find out more about MongoDB schema design (data modeling) patterns and best practices.

M2 is a free service including the Java code generator.

Daprota M2 Modeling of MongoDB Manual References and DBRefs – Part 2/2

This series of posts provides details with examples for modeling MongoDB Manual References and DBRefs by Daprota M2 service. You can access M2 service using this link:

The previous part of the series (Daprota M2 Modeling of MongoDB Manual References and DBRefs – Part 1/2) covered Manual References. In this part of the series we will look at DBRefs.

Database references (DBRefs) are references from one document to another using the value of the referenced (parent) document’s _id field, its collection name, and the database name. While MongoDB allows DBRefs without the database name, M2 models require it. The reason is that a Manual Reference in an M2 model must specify the collection name for the model to be complete, so from the M2 point of view a DBRef without the database name is the same as a Manual Reference. The database name in a DBRef is more of an implementation aspect of the model, and it is needed to make the DBRef definition complete.

To resolve DBRefs, your application must perform additional queries to return the referenced documents. Many language drivers supporting MongoDB have helper methods that form the query for the DBRef automatically. Some drivers do not automatically resolve DBRefs into documents. Please refer to the MongoDB language drivers documentation for more details.

The DBRef format provides common semantics for representing links between documents if your database must interact with multiple frameworks and tools.

Most of the data model design patterns can be supported by Manual References. Generally speaking you should use Manual References unless you have a firm reason for using DBRefs.

The example below is taken from MongoDB’s DBRef documentation page:

            {
              "_id" : ObjectId("5126bbf64aed4daf9e2ab771"),
              // .. application fields
              "creator" : {
                    "$ref" : "creators",
                    "$id" : ObjectId("5126bc054aed4daf9e2ab772"),
                    "$db" : "users"
              }
            }

The DBRef in this example references the creators collection’s document that has ObjectId(“5126bc054aed4daf9e2ab772”) value for its _id field. The creators collection is stored in the users database.
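
As a rough illustration of the extra lookup a DBRef requires, the sketch below models the reference from the example as a plain Java map and resolves it against in-memory maps. It is not the MongoDB Java driver API: the class DbRefSketch, its methods, the DATABASES map, and the name field are hypothetical stand-ins, and ObjectId values are represented as hex strings to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DbRefSketch {
    // database name -> collection name -> documents keyed by _id
    // (an in-memory stand-in, not a MongoDB API)
    static final Map<String, Map<String, Map<String, Map<String, Object>>>> DATABASES =
            new LinkedHashMap<>();

    // Builds the three-field DBRef sub-document: $ref, $id, $db.
    static Map<String, Object> dbRef(String collection, String id, String database) {
        Map<String, Object> ref = new LinkedHashMap<>();
        ref.put("$ref", collection);
        ref.put("$id", id);
        ref.put("$db", database);
        return ref;
    }

    // Stores a document under database/collection, keyed by its _id.
    static void store(String database, String collection, Map<String, Object> doc) {
        DATABASES.computeIfAbsent(database, k -> new LinkedHashMap<>())
                 .computeIfAbsent(collection, k -> new LinkedHashMap<>())
                 .put((String) doc.get("_id"), doc);
    }

    // Resolves a DBRef with one additional lookup; returns null if nothing matches.
    static Map<String, Object> resolve(Map<String, Object> ref) {
        Map<String, Map<String, Map<String, Object>>> db =
                DATABASES.get((String) ref.get("$db"));
        if (db == null) {
            return null;
        }
        Map<String, Map<String, Object>> collection = db.get((String) ref.get("$ref"));
        if (collection == null) {
            return null;
        }
        return collection.get((String) ref.get("$id"));
    }

    public static void main(String[] args) {
        Map<String, Object> creator = new LinkedHashMap<>();
        creator.put("_id", "5126bc054aed4daf9e2ab772");
        creator.put("name", "sample creator"); // hypothetical application field
        store("users", "creators", creator);

        Map<String, Object> object = new LinkedHashMap<>();
        object.put("_id", "5126bbf64aed4daf9e2ab771");
        object.put("creator", dbRef("creators", "5126bc054aed4daf9e2ab772", "users"));

        @SuppressWarnings("unchecked")
        Map<String, Object> ref = (Map<String, Object>) object.get("creator");
        System.out.println(resolve(ref));
    }
}
```

A real driver would issue a query against the users database instead of a map lookup, but the shape of the work is the same: read the three DBRef fields, then fetch the parent document in a second step.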

Let us model a sample collection Object in M2.

First we will create a model with the name DBRef Sample Model:


Click the Create Model button to create the model. When the model is created, the M2 home page will be reloaded:


Click the DBRef Sample Model link to load the model page and then click the Add Collection tab to load the section for the collection creation. Enter the name and description of the Object collection:


Click the Add Collection button to create the collection. M2 will also automatically create the collection’s document:


Click the Object collection link to load the collection page and then click the Object document link in the Documents section to load the document page:


Click the Add Field tab to load the section for the field creation. Enter the name and description of the creator field and select DBRef for the field’s type. When DBRef is selected as the field’s type, M2 will also require selection of the field’s value type, which must match the type of the referenced document’s _id field. It will be ObjectId in this example:


Click the Add Field button to create the field. When the field is created it will be listed on the document page:


Click the creator field link to load the field page:


Click the DBRef tab to load the DBRef section and specify the referenced collection name (creators) and its database (users) to complete the creator field creation:


As you can see, you can either specify a collection name if it is not included in the model (as in this case) or select a collection from the Collections list if it is included in the model. Click the Add DBRef button to update the creator field definition:


Click the model link above to load the DBRef Sample Model page:


The References section of the page, as represented above, lists the reference that was just created. The Target (Child) column has the format: Collection –> Document –> Field. It contains the Object –> Object –> creator value which means that the Object is the target (child) collection and the creator is the field in the Object document of the Object collection whose value will reference the _id field value of the parent Collection (creators) document. The Database column specifies the database of the source (parent) collection.

It is also possible that the target (child) document, in the Collection –> Document –> Field value, is not the target collection document but an embedded document (on any level) in the target collection.

Daprota M2 Modeling of MongoDB Manual References and DBRefs – Part 1/2

This series of posts provides details with examples for modeling MongoDB Manual References and DBRefs by Daprota M2 service. You can access M2 service using this link:

For some data models, it is fine to model data with embedded documents (de-normalized model), but in some cases referencing documents (normalized model) is a better choice.

A referenced document can be

  • in the same collection, or
  • in a separate collection in the same database, or
  • in a separate collection in another database.

MongoDB supports two types of references:

  • Manual Reference
  • DBRef

Manual References are used to reference documents either in the same collection or in a separate collection in the same database. The parent documents are referenced via the value of their _id field, which acts as the primary key.

Database references are references from one document to another using the value of the referenced (parent) document’s _id field, its collection name, and the database name.

In this part of the series we will look at the Manual Reference only. The second part will provide insights into DBRefs.

The Manual Reference MongoDB type indicates that the associated field references another document’s _id. The _id is a unique ID field that acts as a primary key. Manual references are simple to create and they should be used for nearly every use case where you want to store a relationship between two documents.

We will use MongoDB’s Publisher-Book example of the Referenced One-to-Many model. This model comes as a pre-created public model in M2:

Referenced One-to-Many V2 Model


The Publisher document’s _id is of type String, and it is referenced in the Book document by the publisher_id field of the type Manual reference:String. This means that the values of the publisher_id field reference the values of the Publisher document’s _id field.
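
To make the relationship concrete, here is a minimal sketch of how application code might follow such a manual reference in both directions. It uses in-memory collections rather than the MongoDB Java driver, and the class name, method names, and sample values are hypothetical; only the _id and publisher_id field names come from the model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ManualReferenceSketch {
    // In-memory stand-ins for the Publisher and Book collections (not a MongoDB API).
    static final Map<String, Map<String, Object>> PUBLISHERS = new LinkedHashMap<>();
    static final List<Map<String, Object>> BOOKS = new ArrayList<>();

    static void addPublisher(String id, String name) {
        Map<String, Object> publisher = new LinkedHashMap<>();
        publisher.put("_id", id);
        publisher.put("name", name);
        PUBLISHERS.put(id, publisher);
    }

    static void addBook(String title, String publisherId) {
        Map<String, Object> book = new LinkedHashMap<>();
        book.put("title", title);
        book.put("publisher_id", publisherId); // manual reference: the parent's _id value
        BOOKS.add(book);
    }

    // Child to parent: a second lookup using the stored _id value.
    static Map<String, Object> publisherOf(Map<String, Object> book) {
        return PUBLISHERS.get((String) book.get("publisher_id"));
    }

    // Parent to children: filter Book documents by publisher_id.
    static List<String> titlesOf(String publisherId) {
        List<String> titles = new ArrayList<>();
        for (Map<String, Object> book : BOOKS) {
            if (publisherId.equals(book.get("publisher_id"))) {
                titles.add((String) book.get("title"));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        addPublisher("pub-1", "Example Press"); // hypothetical sample data
        addBook("First Sample Book", "pub-1");
        addBook("Second Sample Book", "pub-1");
        System.out.println(publisherOf(BOOKS.get(0)).get("name"));
        System.out.println(titlesOf("pub-1"));
    }
}
```

Because the reference stores only the _id value, the application has to know which collection to query, which is exactly the detail the M2 model records for the field.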

Now, we will demonstrate how we created this model in M2. We will concentrate only on the Publisher and Book collections creation and the creation of their relevant fields (_id and publisher_id) for this example.

First we will create the Referenced One-to-Many model in M2. Enter the name and description of the model and click the Create Model button to create the model as shown below:


When the model is created, the M2 models page will be loaded and we will click the Referenced One-to-Many model link to load the model’s page:


When the model page is loaded, click the Add Collection tab in order to add the Publisher collection to the model:


Enter the Publisher name and description and click the Add Collection button to create it:


Also create the Book collection.

When both collections are created


we will continue with the Publisher document’s _id field creation. Click the Publisher collection link to load the Publisher collection page and then click the Publisher document link to load the Publisher document page. When the Publisher document page is loaded click the Add Field tab to add the _id field first:


Click the Add Field button to add the field. When the field is added, the Publisher document page will be reloaded:


Click the Full Model View to load the full model view page and then click the Book document link, as depicted below, to load the Book document page:


When the Book document page is loaded, click the Add Field tab to add the publisher_id field. First we will select the Manual Reference as its type:


and then we will add String as the second part of its composite type, which is the type of its values:


Click the Add Field button to add the field. The document page will be reloaded when the field is added:


Click the publisher_id link to load the field page and then click the Manual Reference tab to specify reference details:


When the Manual Reference section is loaded, select the Publisher collection’s document and click the Reference Collection button to complete the Manual reference setup for the publisher_id field:


M2 will create the manual reference and reload the Manual Reference section:


Click the Referenced One-to-Many model link above to load the Referenced One-to-Many model page:


The References section of the page (please see above) lists the reference that was just created. Both the Source (Parent) and Target (Child) columns have the format: Collection –> Document –> Field. For example, the Target (Child) column contains the Book –> Book –> publisher_id value, which means that Book is the target (child) collection and publisher_id is the field in the Book document of the Book collection whose value will reference the _id field value of the parent collection (Publisher) document. The Database column is reserved for DBRefs only.

It is also possible that the target (child) document, in the Collection –> Document –> Field value, is not the target collection document but an embedded document (on any level) in the target collection. For example, the Role document in the User –> Role –> _id target reference value in the RBAC model below


is not related to the Role collection but to the Role embedded document of the User document’s roles field:


If you click the embedded Role document’s _id field link, the field page will be loaded with the full path for the _id field:


Daprota M2 Cloud Service for MongoDB Data Modeling

The data model design is one of the key elements of the overall application design when MongoDB is used as a back-end (database management) system. While some people would still argue that data modeling is not needed with MongoDB since it is schemaless, the more you develop, deploy and manage applications using MongoDB technology, the more obvious the need for data model design becomes. At the same time, while the format of documents in a single collection can change over time, in practice collections are in most cases highly homogeneous. Even with more frequent structural changes to collections, a modeling tool can help you document these changes properly.

Daprota just released the M2 cloud service, which is the first service for MongoDB data modeling. It enables the creation and management of data models for MongoDB.

Only a free M2 service plan is provided for now. It enables the creation and management of up to five private data models and unlimited access to public data models provided by Daprota and M2 service users. Plan upgrades supporting a larger or unlimited number of private models will be available in the near future.

The current public models include Daprota models and models based on design patterns and use cases provided by MongoDB via the MongoDB website.

M2 features include:

  • Management of models and their elements (Collections, Documents, and Fields)
  • Copying and versioning of Models, Collections and Documents via related Copy utilities
  • Export/Import Models
  • Full models view in JSON format
  • Public models sharing
  • Models documentation repository
  • Messaging between M2 users

Daprota plans on adding more features to the service in the near future.

MongoDB Data Models

When creating MongoDB data models, besides knowing the internal details of how the MongoDB database engine works, there are a few other factors that should be considered first:

  • How will your data grow and change over time?
  • What is the read/write ratio?
  • What kinds of queries will your application perform?
  • Are there any concurrency-related constraints you should look at?

These factors very much affect what type of model you should create. There are several types of MongoDB models you can create:

  • Embedding Model
  • Referencing Model
  • Hybrid Model that combines embedding and referencing models.

There are also other factors that can affect your decision regarding the type of model to create. These are mostly operational factors, and they are documented in Data Modeling Considerations for MongoDB Applications.

The key question is:

  • should you embed related objects within one another or
  • should you reference them by their identifier (ID)?

You will need to consider performance, complexity and flexibility of your solution in order to come up with the most appropriate model.

Embedding Model (De-normalization)

The embedding model enables de-normalization of data, which means that two or more related pieces of data are stored in a single document. Generally, embedding provides better read performance since data can be retrieved in a single database operation. In other words, embedding supports locality. If your application frequently accesses related data objects, the best performance can be achieved by putting them in a single document, which is what the embedding model supports.

MongoDB provides atomic operations on a single document only. If fields have to be modified together, all of them have to be embedded in a single document in order to guarantee atomicity. MongoDB does not support multi-document transactions in the releases this post covers (support was added later, in MongoDB 4.0). Distributed transactions and distributed join operations are two main challenges associated with distributed database design. By not supporting these features, MongoDB has been able to implement a highly scalable and efficient atomic sharding solution.

Embedding also has its disadvantages. If we keep embedding related data in documents or constantly update this data, the document size may grow after the document is created. This can lead to data fragmentation. At the same time, the size limit for documents in MongoDB is determined by the maximum BSON document size, which is 16 MB. For larger documents, you have to consider using GridFS.

On the other hand, the larger the documents are, the fewer of them fit in RAM and the more likely the server is to page fault when retrieving documents. Page faults lead to random disk I/O that can significantly slow down the system.

Referencing Model (Normalization)

The referencing model enables normalization of data by storing references between two documents to indicate a relationship between the data stored in each document. Generally, referencing models should be used:

  • when embedding would result in extensive data duplication and/or data fragmentation (increased data storage usage that can also lead to reaching the maximum document size) with minimal performance advantages or even negative performance implications;
  • to increase flexibility in performing queries if your application queries data in many different ways, or if you do not know in advance the patterns in which data may be queried;
  • to enable many-to-many relationships;
  • to model large hierarchical data sets (e.g., tree structures).

Using referencing requires more roundtrips to the server.
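
The difference in roundtrips can be sketched with a small simulation. The code below is only an illustration under stated assumptions: the collections are in-memory maps, findById stands in for one call to the server, and the user and address documents with their field names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class RoundTripSketch {
    static int roundTrips = 0; // counts simulated calls to the server

    static final Map<String, Map<String, Object>> USERS_EMBEDDED = new HashMap<>();
    static final Map<String, Map<String, Object>> USERS_REFERENCED = new HashMap<>();
    static final Map<String, Map<String, Object>> ADDRESSES = new HashMap<>();

    // Small helper that builds a document from alternating keys and values.
    static Map<String, Object> doc(Object... keysAndValues) {
        Map<String, Object> d = new HashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            d.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return d;
    }

    // Stand-in for one query by _id; each call counts as one roundtrip.
    static Map<String, Object> findById(Map<String, Map<String, Object>> collection, String id) {
        roundTrips++;
        return collection.get(id);
    }

    // Embedding: the address travels with the user document.
    static String cityEmbedded(String userId) {
        Map<String, Object> user = findById(USERS_EMBEDDED, userId);
        @SuppressWarnings("unchecked")
        Map<String, Object> address = (Map<String, Object>) user.get("address");
        return (String) address.get("city");
    }

    // Referencing: the user holds address_id, so a second query is needed.
    static String cityReferenced(String userId) {
        Map<String, Object> user = findById(USERS_REFERENCED, userId);
        Map<String, Object> address = findById(ADDRESSES, (String) user.get("address_id"));
        return (String) address.get("city");
    }

    static void load() {
        USERS_EMBEDDED.put("u1", doc("_id", "u1", "address", doc("city", "Toronto")));
        USERS_REFERENCED.put("u1", doc("_id", "u1", "address_id", "a1"));
        ADDRESSES.put("a1", doc("_id", "a1", "city", "Toronto"));
    }

    public static void main(String[] args) {
        load();
        roundTrips = 0;
        cityEmbedded("u1");
        System.out.println("embedded roundtrips: " + roundTrips);
        roundTrips = 0;
        cityReferenced("u1");
        System.out.println("referenced roundtrips: " + roundTrips);
    }
}
```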

Hybrid Model

The hybrid model is a combination of the embedding and referencing models. It is usually used when neither the embedding nor the referencing model alone is the best choice but their combination makes the most balanced model.

Polymorphic Schemas

MongoDB does not enforce a common structure for all documents in a collection. It is possible (but generally not recommended) for documents in a MongoDB collection to have different structures.

However, our applications evolve over time, so we have to update the document structure of the MongoDB collections they use. This means that at some point documents in the same collection can have different structures, and the application has to take care of it. Meanwhile, you can fully migrate the collection to the latest document structure, which enables the same application code to manage the whole collection.
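
One way an application might cope with two structures in the same collection is sketched below. The document shapes are hypothetical (an older structure with a single name field and a newer one with first_name and last_name), and the class and method names are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PolymorphicSchemaSketch {
    // Reads a display name from either structure (both structures are hypothetical).
    static String displayName(Map<String, Object> doc) {
        if (doc.containsKey("name")) {
            return (String) doc.get("name"); // older structure
        }
        return (doc.get("first_name") + " " + doc.get("last_name")).trim(); // newer structure
    }

    // Returns a copy upgraded to the newer structure; newer documents pass through unchanged.
    static Map<String, Object> migrate(Map<String, Object> doc) {
        if (!doc.containsKey("name")) {
            return doc;
        }
        Map<String, Object> upgraded = new LinkedHashMap<>(doc);
        String[] parts = ((String) upgraded.remove("name")).split(" ", 2);
        upgraded.put("first_name", parts[0]);
        upgraded.put("last_name", parts.length > 1 ? parts[1] : "");
        return upgraded;
    }

    public static void main(String[] args) {
        Map<String, Object> old = new LinkedHashMap<>();
        old.put("_id", "1");
        old.put("name", "Ada Lovelace");
        Map<String, Object> upgraded = migrate(old);
        System.out.println(displayName(old));
        System.out.println(upgraded);
    }
}
```

Until every document has been migrated, the reader (displayName here) has to accept both shapes; once the migration is complete, the branch for the older structure can be removed.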

You should also keep in mind that MongoDB’s lack of schema enforcement requires the document structure details to be stored on a per-document basis, which increases storage usage. In particular, you should use field names of reasonable length, since the field names add to the overall storage used for the collection.
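
A back-of-the-envelope estimate shows how field names add up. In BSON each element stores its field name as a null-terminated UTF-8 string, so a name costs its byte length plus one in every document that contains it. The sketch below computes only that name overhead; the class, the method, and the sample field names are hypothetical, and the estimate ignores values, other BSON overhead, and any compression the storage engine may apply.

```java
import java.nio.charset.StandardCharsets;

class FieldNameOverheadSketch {
    // Bytes spent on field names alone across documentCount documents:
    // (UTF-8 length + 1 terminating null byte) per name, per document.
    static long nameBytes(long documentCount, String... fieldNames) {
        long perDocument = 0;
        for (String name : fieldNames) {
            perDocument += name.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return perDocument * documentCount;
    }

    public static void main(String[] args) {
        long verbose = nameBytes(1_000_000L, "customer_first_name", "customer_last_name");
        long terse = nameBytes(1_000_000L, "fn", "ln");
        System.out.println("verbose names: " + verbose + " bytes");
        System.out.println("terse names:   " + terse + " bytes");
    }
}
```

Very short names save space but hurt readability, so "reasonable length" is a trade-off rather than a call for two-letter names everywhere.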

Information Modeling

Different categories of information modeling exist. Descriptions below summarize definitions from different sources [1,2,3]. The following categories are described:

  • Controlled vocabulary
  • Taxonomy
  • Thesaurus
  • Ontology
  • Metamodel

Controlled Vocabulary

Controlled vocabularies provide a way to organize knowledge for subsequent retrieval [3]. A controlled vocabulary is a list of explicitly enumerated terms [2]. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been pre-selected by the registration authority (or designer) of the vocabulary, in contrast to natural language vocabularies, where there is  no restriction on the vocabulary.

Sometimes this condition is eased depending on how strict the controlled vocabulary registration authority is.

The following two rules should be enforced at least:

  • If the same term is commonly used to mean different concepts in different contexts, then its name has to be explicitly qualified to resolve this ambiguity.
  • If multiple terms are used to mean the same concept, one of the terms is identified as the preferred term in the controlled vocabulary and the other terms are identified as synonyms or aliases.
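
The two rules above can be sketched as a small lookup structure. This is an illustration only: the class, its methods, and the sample terms are hypothetical, and matching is assumed to be case-insensitive.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class ControlledVocabularySketch {
    // Maps every accepted spelling (preferred terms and aliases) to its preferred term.
    static final Map<String, String> PREFERRED = new HashMap<>();

    static String key(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    // Registers a preferred term together with its synonyms (aliases).
    static void define(String preferredTerm, String... aliases) {
        PREFERRED.put(key(preferredTerm), preferredTerm);
        for (String alias : aliases) {
            PREFERRED.put(key(alias), preferredTerm);
        }
    }

    // Returns the preferred term, or empty if the input is not in the vocabulary.
    static Optional<String> normalize(String term) {
        return Optional.ofNullable(PREFERRED.get(key(term)));
    }

    public static void main(String[] args) {
        define("Automobile", "Car", "Motor Car");   // one concept, several names
        define("Mercury (planet)");                  // same word, qualified per context
        define("Mercury (element)", "Quicksilver");
        System.out.println(normalize("car"));
        System.out.println(normalize("Mercury"));
        System.out.println(normalize("quicksilver"));
    }
}
```

Here the unqualified word "Mercury" is deliberately not part of the vocabulary, so a tagger is forced to pick one of the qualified names.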

In information science, a controlled vocabulary is a selected list of words and phrases, which are used to tag units of information so that they may be more easily retrieved by a search [3]. Controlled vocabularies reduce ambiguity inherent in normal human languages where the same concept can be given different names and ensure consistency.

Controlled vocabularies tagged to documents are metadata.

The use of controlled vocabulary ensures that everyone is using the same word for the same concept throughout an organization.

A controlled vocabulary for describing Web pages can dramatically improve Web searching. This idea culminates in the Semantic Web, where the content of Web pages is described using a machine-readable metadata scheme (e.g., Dublin Core Initiative [5]) in RDFa [4] (Resource Description Framework in Attributes). The content of the entire Web cannot be described using a single metadata scheme. More metadata schemes like the Dublin Core Initiative are needed in different areas of knowledge management.

Controlled vocabularies are used in taxonomies and thesauri.

Taxonomy

A taxonomy or a taxonomic scheme is a collection of controlled vocabulary terms organized into a hierarchical relationship structure [2]. Each term in a taxonomy is in one or more relationships to other terms in the taxonomy. These relationships are called generalization-specialization relationships, or type-subtype relationships, or less formally, parent-child relationships [6]. The subtype has the same properties, behaviours, and constraints as the supertype plus one or more additional properties, behaviours, or constraints.

Most taxonomies restrict each term to a single parent and require all parent-child relationships to be of the same type. Some taxonomies allow poly-hierarchy, which means that a term (concept) can have multiple parents. This means that if a term appears in multiple places in a taxonomy, then it is the same term. Specifically, if a term has children in one place in a taxonomy, then it has the same children in every other place where it appears.

A hierarchical taxonomy is a tree structure of classifications for a given set of terms (concepts). The root of this structure is called a classification scheme and it applies to all terms. Nodes below the classification scheme are more specific classifications that apply to subsets of the total set of classified terms. The progress of taxonomy reasoning proceeds from the general to the more specific.
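
A single-parent taxonomy of this kind can be sketched as a child-to-parent map. The class and method names are hypothetical, and so is the third-level term in the sample data; the sketch only shows how reasoning moves from the general to the more specific along one path.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class TaxonomySketch {
    // child term -> parent term; each term has at most one parent (tree structure)
    static final Map<String, String> PARENT = new HashMap<>();

    static void addChild(String parent, String child) {
        PARENT.put(child, parent);
    }

    // Path from the most general term down to the given term.
    static List<String> pathFromRoot(String term) {
        LinkedList<String> path = new LinkedList<>();
        for (String t = term; t != null; t = PARENT.get(t)) {
            path.addFirst(t);
        }
        return path;
    }

    // True if ancestor appears strictly above term in the hierarchy.
    static boolean isSubtypeOf(String term, String ancestor) {
        return !term.equals(ancestor) && pathFromRoot(term).contains(ancestor);
    }

    public static void main(String[] args) {
        addChild("Libraries", "School Libraries");
        addChild("School Libraries", "Elementary School Libraries"); // hypothetical term
        System.out.println(pathFromRoot("Elementary School Libraries"));
        System.out.println(isSubtypeOf("School Libraries", "Libraries"));
    }
}
```

Supporting poly-hierarchy would mean mapping each child to a set of parents instead of a single one.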

Sometimes the term taxonomy is also applied to relationship schemes other than type-subtype hierarchies (e.g., network structures with other types of relationships). In these cases, taxonomies may include a single subtype with multiple supertypes.

Thesaurus

A thesaurus is a networked collection of controlled vocabulary terms [2]. The thesaurus uses associative relationships in addition to type-subtype relationships. The expressiveness of the associative relationships in a thesaurus varies and can be as simple as “related to” (e.g., concept X is related to concept Y).

Thesauri for information retrieval are typically constructed by information specialists, and have their own unique vocabulary defining different kinds of terms and relationships [7].

Terms are the basic semantic units for conveying concepts. They are single-word or multi-word nouns. Verbs can be converted to nouns (e.g., “reads” to “reading”, “paints” to “painting”, etc.). Adjectives and adverbs are not usually used.

When a term is ambiguous, a Scope Note can be added to ensure consistency and give direction on how to interpret the term. Scope notes are not mandatory for each term, but having them supports correct use of the thesaurus and correct understanding of the given field of knowledge. Generally, a Scope Note is a brief statement of the intended usage of a term.

Relationships are links between terms. The relationships can be divided into three types: hierarchical, equivalency, or associative [7].

Hierarchical relationships are used to indicate terms which are narrower and broader in scope. The Broader Term (BT) and Narrower Term (NT) notations are used to indicate a hierarchical relationship between terms [8]. Narrower terms follow the NT notation and are included in the broader class represented by the main term. For example:

Libraries
NT Academic Libraries

  • Branch Libraries
  • Childrens Libraries
  • Depository Libraries
  • Electronic Libraries
  • Public Libraries
  • Research Libraries
  • School Libraries
  • Special Libraries

The equivalency relationship is used primarily to connect synonyms and near-synonyms.

The Used For (UF) reference is used generally to resolve synonymy problems in natural languages. Terms following the UF notation are not to be used. They represent either (1) synonymous or variant forms of the main term, or (2) specific terms that, for purposes of storage and retrieval, are indexed under a more general term [8]. The example below [8]  illustrates the use of UF:

Lifelong Learning
UF Continuous Learning (1967 1980)

  • Education Permanente
  • Life Span Education
  • Lifelong Education
  • Permanent Education
  • Recurrent Education

The Broader Term (BT) is the opposite of the NT. Terms that follow the BT notation include as a subtype the concept represented by the main (narrower) term:

School Libraries
BT Libraries

Mathematical Models
BT Models

It is also possible for a term to have more than one broader term:

Remedial Reading
BT Reading

  • Reading Instruction
  • Remedial Instruction

In the Lifelong Learning example above, the former term “Continuous Learning”, which has been downgraded to the status of a UF term, is followed by a “life span” notation in parentheses (1967 1980). This indicates the time period during which the term was used in indexing.

Sometimes a UF term needs more than one descriptor to represent it adequately [8]. In that case, a pound sign (#) following the UF term signifies that two or more main terms are to be used in coordination. For example [8]:

Folk Culture
UF Folk Drama (1969 1980) #

  • Folklore
  • Folklore Books (1968 1980) #
  • Traditions (Culture)

UF Dramatic Utilities (1970 1980)

  • Folk Drama (1969 1980) #
  • Outdoor Drama (1968 1980) #
  • Plays (Theatrical)

The USE reference (opposite of UF) refers an indexer or searcher from a nonusable (nonindexable) term to the preferred indexable term or terms. For example [8]:

Regular Class Placement (1968 1978)
USE    Mainstreaming

Continuous Learning (1967 1980)
USE    Lifelong Learning

A coordinate or multiple USE reference supports the use of two or more main terms together to represent a single term [8]:

Folk Drama (1969 1980)
USE       Drama
AND     Folk Culture

Associative relationships are used to connect two related terms whose relationship is neither hierarchical nor equivalent. This relationship is described by the indicator Related Term (RT). Terms following the RT notation have a close conceptual relationship to the main term but not the direct type/subtype relationship specified by BT/NT. Part-whole relationships, near-synonyms, and other conceptually related terms, appear as RTs.

Associative relationships should be applied with caution, since excessive use of RTs will reduce specificity in searches. Consider the following: if the typical user is searching with term “X”, would they also want resources tagged with term “Y”? If the answer is no, then an associative relationship should not be established.

This is an example of the RT relationships [8]:

High School Seniors
RT College Bound Students

  • Grade 12
  • High School Freshmen
  • High School Graduates
  • Noncollege Bound Students
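
The BT/NT, UF/USE, and RT notations described in this section can be sketched as a small data structure. This is only an illustration: the class and method names are hypothetical, it assumes BT/NT and RT are recorded reciprocally, and it handles only single-term USE references (not coordinate ones such as Folk Drama). The sample terms are taken from the examples in this section.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ThesaurusSketch {
    enum Relation { BT, NT, UF, RT }

    // term -> relation -> related terms
    static final Map<String, Map<Relation, Set<String>>> ENTRIES = new HashMap<>();
    // non-preferred term -> preferred (indexable) term
    static final Map<String, String> USE = new HashMap<>();

    // Returns the (possibly empty) set of terms related to term by relation.
    static Set<String> related(String term, Relation relation) {
        return ENTRIES
                .computeIfAbsent(term, t -> new EnumMap<Relation, Set<String>>(Relation.class))
                .computeIfAbsent(relation, r -> new TreeSet<String>());
    }

    // Hierarchical: records BT on the narrower term and NT on the broader one.
    static void broader(String narrowerTerm, String broaderTerm) {
        related(narrowerTerm, Relation.BT).add(broaderTerm);
        related(broaderTerm, Relation.NT).add(narrowerTerm);
    }

    // Equivalency: records UF on the preferred term and a USE reference back to it.
    static void usedFor(String preferredTerm, String nonPreferredTerm) {
        related(preferredTerm, Relation.UF).add(nonPreferredTerm);
        USE.put(nonPreferredTerm, preferredTerm);
    }

    // Associative: records RT in both directions.
    static void relatedTerm(String a, String b) {
        related(a, Relation.RT).add(b);
        related(b, Relation.RT).add(a);
    }

    // Follows a USE reference if there is one; otherwise the term is already indexable.
    static String indexable(String term) {
        return USE.getOrDefault(term, term);
    }

    public static void main(String[] args) {
        broader("School Libraries", "Libraries");
        usedFor("Lifelong Learning", "Continuous Learning");
        relatedTerm("High School Seniors", "Grade 12");
        System.out.println(related("Libraries", Relation.NT));
        System.out.println(indexable("Continuous Learning"));
        System.out.println(related("Grade 12", Relation.RT));
    }
}
```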

Ontology

What is an ontology? The very short answer from Tom Gruber [12] is:

“An ontology is a specification of a conceptualization.”

A more detailed definition follows.

An ontology is a formal representation of knowledge by a set of concepts, their properties, relationships, and other distinctions within a domain. It is used to describe the domain and to reason about the properties of the domain.

Ontologies are used in Semantic Web, systems engineering, software engineering, process modeling, biomedical informatics, library science, enterprise bookmarking, artificial intelligence, and information architecture as a form of knowledge representation about the world or some part of it. The creation of domain ontologies is also fundamental to the definition and use of an enterprise architecture framework.

This is a formal ontology definition, provided by Tom Gruber, from the Encyclopedia of Database Systems [9]:

“In the context of computer and information sciences, an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse.  The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members).  The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application.  In the context of database systems, ontology can be viewed as a level of abstraction of data models, analogous to hierarchical and relational models, but intended for modeling knowledge about individuals, their attributes, and their relationships to other individuals.  Ontologies are typically specified in languages that allow abstraction away from data structures and implementation strategies; in practice, the languages of ontologies are closer in expressive power to first-order logic than languages used to model databases.  For this reason, ontologies are said to be at the “semantic” level, whereas database schema are models of data at the “logical” or “physical” level.  Due to their independence from lower level data models, ontologies are used for integrating heterogeneous databases, enabling interoperability among disparate systems, and specifying interfaces to independent, knowledge-based services.  In the technology stack of the Semantic Web standards [1], ontologies are called out as an explicit layer.  There are now standard languages and a variety of commercial and open source tools for creating and working with ontologies. “

The W3C Semantic Web standard specifies a formal language for encoding ontologies (OWL), in several variants that vary in expressive power [10]. This reflects the intent that an ontology is a specification of an abstract data model (the domain conceptualization) that is independent of its particular form [9]. Tara Ontology Language [11] is an example of another ontology language.

Metamodel

Metamodeling is the creation of metamodels, which are collections of concepts, their relationships, and rules within a certain domain. Metamodels are created in metamodeling languages. Some of these languages are:

  • OWL 2 Web Ontology Language
  • RDF Vocabulary Description Language and RDF Schema

Models, which are abstractions of phenomena in the real world, are based on metamodels. Models conform to metamodels the same way programs conform to the programming languages in which they are written. A similar analogy exists between a logical data model (metamodel) and a dataset (model) based on that logical data model.

Common uses [1] for metamodels are:

  • As a schema for semantic data that needs to be exchanged or stored.
  • As a language that supports a particular method or process.
  • As a language to express additional semantics of existing information.

A valid metamodel is an ontology.

References

  8. Thesaurus of ERIC (Educational Resources Information Center) Descriptors, 14th Edition:
  9. Ontology Definition from the Encyclopedia of Database Systems