MongoDB Data Models

When creating MongoDB data models, besides understanding how the MongoDB database engine works internally, there are a few other factors you should consider first:

  • How will your data grow and change over time?
  • What is the read/write ratio?
  • What kinds of queries will your application perform?
  • Are there any concurrency-related constraints you should look at?

These factors strongly affect which type of model you should create. There are several types of MongoDB models you can create:

  • Embedding Model
  • Referencing Model
  • Hybrid Model that combines embedding and referencing models.

Other factors can also affect your decision about which type of model to create. These are mostly operational factors, and they are documented in "Data Modeling Considerations for MongoDB Applications".

The key question is:

  • should you embed related objects within one another or
  • should you reference them by their identifier (ID)?

You will need to consider performance, complexity and flexibility of your solution in order to come up with the most appropriate model.

Embedding Model (De-normalization)

The embedding model de-normalizes data, which means that two or more related pieces of data are stored in a single document. Embedding generally provides better read performance because the data can be retrieved in a single database operation. In other words, embedding supports locality. If your application frequently accesses related data objects, the best performance can be achieved by putting them in a single document, which is what the embedding model does.
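As a minimal sketch of embedding, assume a hypothetical blog post document whose comments live inside it. The field names (title, comments, author, text) are illustrative assumptions, not part of any real schema:

```javascript
// Hypothetical blog post with its comments embedded in the same document.
const post = {
  _id: 1,
  title: "MongoDB Data Models",
  comments: [
    { author: "ann", text: "Nice overview" },
    { author: "bob", text: "What about sharding?" }
  ]
};

// Reading the post also yields its comments; no second query is needed.
function commentAuthors(doc) {
  return doc.comments.map((c) => c.author);
}

console.log(commentAuthors(post));
```

Because the comments travel with the post, a single read of the post document is enough to render both.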

MongoDB guarantees atomicity for operations on a single document. If fields have to be modified together atomically, all of them have to be embedded in a single document. MongoDB versions before 4.0 do not support multi-document transactions (later versions do, at some performance cost). Distributed transactions and distributed join operations are two main challenges associated with distributed database design. By not supporting these features, MongoDB was able to implement a highly scalable and efficient automatic sharding solution.

Embedding also has its disadvantages. If we keep embedding related data in a document, or constantly update that data, the document can grow after it is created, which can lead to data fragmentation. In addition, document size in MongoDB is limited by the maximum BSON document size, which is 16 MB. For larger documents, you have to consider using GridFS.

Furthermore, the larger the documents, the fewer of them fit in RAM and the more likely the server is to page fault when retrieving documents. Page faults lead to random disk I/O, which can significantly slow down the system.

Referencing Model (Normalization)

The referencing model normalizes data by storing references between documents to indicate a relationship between the data stored in each document. Generally, referencing should be used:

  • when embedding would result in extensive data duplication and/or data fragmentation (increased storage usage, which can also lead to reaching the maximum document size) with minimal performance advantages or even negative performance implications;
  • to increase flexibility in performing queries, if your application queries data in many different ways or if you do not know in advance the patterns in which data may be queried;
  • to enable many-to-many relationships;
  • to model large hierarchical data sets (e.g., tree structures).

The trade-off is that referencing requires more round trips to the server, because related documents are fetched with separate queries.
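The extra round trips can be illustrated with a small sketch. It uses in-memory Map objects as stand-ins for two collections; the collections, field names and the findById helper are hypothetical and only model the idea of one query per lookup:

```javascript
// In-memory stand-ins for two collections (hypothetical data).
const users = new Map([[10, { _id: 10, name: "ann" }]]);
const posts = new Map([[1, { _id: 1, title: "MongoDB Data Models", authorId: 10 }]]);

let roundTrips = 0;

// Simulates one query by _id; each call counts as one round trip.
function findById(collection, id) {
  roundTrips += 1;
  return collection.get(id);
}

// Resolving the reference takes a second lookup after the first one.
function loadPostWithAuthor(postId) {
  const post = findById(posts, postId);
  const author = findById(users, post.authorId);
  return { title: post.title, authorName: author.name };
}

const result = loadPostWithAuthor(1);
console.log(result.authorName, roundTrips);
```

The post stores only the author's identifier, so the author's name is never duplicated, but reading it costs a second lookup.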

Hybrid Model

The hybrid model is a combination of the embedding and referencing models. It is usually used when neither embedding nor referencing alone is the best choice, but their combination gives the most balanced model.

Polymorphic Schemas

MongoDB does not enforce a common structure for all documents in a collection. It is therefore possible (but generally not recommended) for documents in the same collection to have different structures.

However, applications evolve over time, so we have to update the document structure of the MongoDB collections they use. This means that at some point documents in the same collection can have different structures, and the application has to handle that. Meanwhile, you can fully migrate the collection to the latest document structure, which lets the same application code manage the whole collection.
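One way an application might cope with mixed structures is to upgrade documents as it reads them. The sketch below assumes a hypothetical schemaVersion field and a hypothetical change from a single name field to firstName and lastName; none of these names come from MongoDB itself:

```javascript
// Hypothetical evolution: version 1 stored a single "name" string,
// version 2 stores "firstName" and "lastName" plus a "schemaVersion" marker.
function upgradeUser(doc) {
  if (doc.schemaVersion === 2) {
    return doc; // already in the latest structure
  }
  const [firstName, ...rest] = doc.name.split(" ");
  return {
    _id: doc._id,
    schemaVersion: 2,
    firstName: firstName,
    lastName: rest.join(" ")
  };
}

const oldDoc = { _id: 1, name: "Ann Lee" };
const newDoc = { _id: 2, schemaVersion: 2, firstName: "Bob", lastName: "Ray" };
console.log(upgradeUser(oldDoc), upgradeUser(newDoc));
```

The same function could drive a one-time migration of the whole collection, after which the version check is no longer needed.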

Also keep in mind that MongoDB’s lack of schema enforcement means that the document structure (the field names) is stored in every document, which increases storage usage. In particular, use reasonably short field names, since field names add to the overall storage used by the collection.
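The effect of field-name length can be estimated roughly. The sketch below compares two hypothetical documents that hold the same values under long and short field names; it uses the JSON text length as a crude proxy, which is an assumption and not the actual BSON size:

```javascript
// Same values, different field-name lengths (hypothetical documents).
const longNames = { customerFirstName: "Ann", customerLastName: "Lee" };
const shortNames = { fn: "Ann", ln: "Lee" };

// JSON text length as a rough stand-in for stored size (BSON differs in detail).
function approxSize(doc) {
  return JSON.stringify(doc).length;
}

const savedPerDocument = approxSize(longNames) - approxSize(shortNames);
console.log(savedPerDocument);
```

Since the names are repeated in every document, the per-document difference is multiplied by the number of documents in the collection.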

MongoDB Indexes

MongoDB indexes are based on the B-tree data structure. Properly used indexes are important for maintaining MongoDB performance. On the other hand, indexes have associated costs, including memory usage, disk usage and slower updates. MongoDB provides an explain plan capability and a database profiler utility to collect data about database operations that can be used for database tuning.

Memory

Ideally, the entire index should be resident in RAM. If the number of distinct values for the indexed field is high, we have to ensure that the index still fits in RAM; otherwise performance will suffer.

The parts of an index related to recently inserted data will generally be in active RAM. If you query on recent data, the index will perform well and MongoDB will use less memory. For example, this can be the case when the index is based on a time/date field.

Compound Index

Besides single-field indexes, you can also create compound indexes containing more than one field. The order of fields in a compound index can significantly impact performance. Placing the more selective element of your query first in the compound index improves performance for that query. At the same time, your other queries may be affected by this choice. This shows that you have to analyze your entire application in order to make appropriate design decisions about indexes.

Each index is stored in sorted order on all of its fields, and the following rules should be followed to provide efficient indexing:

  • Fields that will be queried by equality should be the first fields in the index.
  • Next should come the fields used to sort query results. If sorting is based on multiple fields, they should occur in the same order in the index definition.
  • The last field in the index definition should be the one queried by range.
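The ordering rules above can be sketched as a small helper that assembles an index key specification. The function name and the field names (status, createdAt, amount) are hypothetical; the resulting object has the shape that the shell's createIndex method accepts as its key specification:

```javascript
// Builds an index key specification in the order described above:
// equality fields, then sort fields, then range fields.
// sortFields is an array of [fieldName, direction] pairs (1 ascending, -1 descending).
function buildIndexKeys(equalityFields, sortFields, rangeFields) {
  const keys = {};
  for (const f of equalityFields) keys[f] = 1;
  for (const [f, dir] of sortFields) if (!(f in keys)) keys[f] = dir;
  for (const f of rangeFields) if (!(f in keys)) keys[f] = 1;
  return keys;
}

// Query shape: status equals a value, sorted by createdAt descending, amount in a range.
const keys = buildIndexKeys(["status"], [["createdAt", -1]], ["amount"]);
console.log(JSON.stringify(keys)); // {"status":1,"createdAt":-1,"amount":1}
```

In the MongoDB shell, an object like keys could then be passed to createIndex on the relevant collection.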

An additional benefit of a compound index is that its leading fields can also be used on their own. So if we query with a condition on a single field that is the leading field of an index, the index will be used.

On the other hand, an index is less efficient if the range and sort conditions are not on the same set of fields.

MongoDB provides the hint() method to force the use of a specific index.
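The leading-field behavior can be modeled with a deliberately simplified check. The function below is a hypothetical illustration that only asks whether the queried fields are the first fields of the index; the real query planner considers more cases than this:

```javascript
// Returns true when the queried fields are exactly the first N fields of the index.
// Simplified model: the actual planner can also use an index in other situations.
function usesIndexPrefix(indexFields, queryFields) {
  if (queryFields.length === 0 || queryFields.length > indexFields.length) {
    return false;
  }
  const prefix = indexFields.slice(0, queryFields.length);
  return queryFields.every((f) => prefix.includes(f));
}

const indexFields = ["status", "createdAt", "amount"]; // hypothetical compound index
console.log(usesIndexPrefix(indexFields, ["status"]));              // true
console.log(usesIndexPrefix(indexFields, ["status", "createdAt"])); // true
console.log(usesIndexPrefix(indexFields, ["amount"]));              // false
```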

Unique Index

You can create a unique index that enforces uniqueness of the indexed field value. A compound index can also be specified as unique, in which case each combination of index field values has to be unique.

Fields with No Values and Sparse Index

If a document does not have a value for an indexed field, an index entry with the value null is created for it. With a unique index, this means only one document can have a null (or missing) value for the indexed field, unless the sparse option is specified for the index, in which case index entries are not created for documents that do not have the field.

You should be aware that using a sparse index will sometimes produce an incomplete result when index-based operations (e.g., sorting, filtering, etc.) are used.
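The difference between a regular and a sparse index can be pictured with a simplified model of index entries. The function and the sample documents are hypothetical; real index entries are maintained by the server, not by application code:

```javascript
// Simplified model of index entries for one field.
// A regular index stores null for documents that lack the field;
// a sparse index skips those documents entirely.
function indexEntries(docs, field, sparse) {
  const entries = [];
  for (const doc of docs) {
    if (field in doc) {
      entries.push({ key: doc[field], id: doc._id });
    } else if (!sparse) {
      entries.push({ key: null, id: doc._id });
    }
  }
  return entries;
}

const people = [
  { _id: 1, email: "ann@example.com" },
  { _id: 2 },
  { _id: 3 }
];
console.log(indexEntries(people, "email", false).length); // 3
console.log(indexEntries(people, "email", true).length);  // 1
```

In the regular case two entries share the key null, which is what a unique index would reject. In the sparse case documents 2 and 3 are absent from the index, which is why an operation driven purely by that index can miss them.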

Geospatial Index

MongoDB provides geospatial indexes that optimize queries involving locations within a two-dimensional space. To be indexed with a geospatial index, documents must have a field holding a two-element array of coordinates (longitude first, then latitude).

Array Index

Fields that are arrays can also be indexed, in which case each array value is stored as a separate index entry.
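A simplified model of this behavior is sketched below. The function and the sample documents with a tags array are hypothetical and only show how one document can contribute several index entries:

```javascript
// Expands each array value into its own index entry (simplified model).
function arrayIndexEntries(docs, field) {
  const entries = [];
  for (const doc of docs) {
    const value = doc[field];
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) {
      entries.push({ key: v, id: doc._id });
    }
  }
  return entries;
}

const articles = [
  { _id: 1, tags: ["db", "nosql"] },
  { _id: 2, tags: ["db"] }
];
console.log(arrayIndexEntries(articles, "tags").length); // 3
```

The index grows with the total number of array elements rather than the number of documents, which is worth keeping in mind for large arrays.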

Create Index Operation

An index can be created as either a foreground or a background operation. Foreground operations consume resources intensively and, in some cases, take a long time. They are blocking operations in MongoDB. When an index is created via a background operation, the build takes longer, but the database is not blocked and can be used while the index is being created.

AND and OR Query Tips

If you know that a certain criterion in a query will match fewer documents, and this criterion is indexed, make sure that it goes first in your query. This lets MongoDB select the smallest number of documents needed to retrieve the data.

OR-style queries are the opposite of AND queries. The most inclusive clauses (those returning the largest number of documents) should go first, since for every match MongoDB has to check whether the document is already part of the result set.

Useful Information

  • The MongoDB optimizer generally uses one index at a time. If more than one predicate is used in a query, a compound index should be created based on the rules described previously.
  • The maximum length of an index name is 128 characters, and an index entry cannot exceed 1,024 bytes.
  • Index maintenance during the add, remove or update operations will slow down these operations. If your application performs heavy updates you should carefully select indexes.
  • Ideally indexes should reduce a set of possible documents to select from so it is important to create high selectivity indexes. For example, an index based on a phone number is more selective than an index based on a ‘yes/no’ flag.
  • Indexes are not efficient for inequality queries (such as $ne).
  • When regular expressions are used, leading wildcards degrade query performance because indexes are ordered and can only narrow the search when the beginning of the value is known.
  • Indexes are generally useful when we are retrieving a small subset of the total data. They usually stop being useful when we return half of the data or more in a collection.
  • A query that returns only a few fields should be fully covered by an index.
  • Whenever possible, create a compound index that can be used by multiple queries.
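The point about leading wildcards can be illustrated with a sorted array standing in for an ordered index. This is a minimal sketch with hypothetical data and helper names, not how MongoDB is implemented internally:

```javascript
// A sorted array stands in for an ordered index on a string field.
const nameIndex = ["adams", "baker", "banks", "barnes", "clark", "davis"];

// Prefix match (like /^ba/): binary search for the start, stop at the first non-match.
function prefixScan(sorted, prefix) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid] < prefix) lo = mid + 1;
    else hi = mid;
  }
  const matches = [];
  let examined = 0;
  for (let i = lo; i < sorted.length; i++) {
    examined += 1;
    if (!sorted[i].startsWith(prefix)) break;
    matches.push(sorted[i]);
  }
  return { matches, examined };
}

// Unanchored match (like /ks/): ordering does not help, every entry is examined.
function containsScan(sorted, fragment) {
  const matches = sorted.filter((s) => s.includes(fragment));
  return { matches, examined: sorted.length };
}

console.log(prefixScan(nameIndex, "ba"));   // 3 matches, 4 entries examined
console.log(containsScan(nameIndex, "ks")); // 1 match, 6 entries examined
```

With a known prefix the scan touches only the matching range plus one entry; without it, the whole index has to be walked.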

Tomcat Security Realm with MongoDB

1. User and Role Document Model

[Figure: Daprota User Model]

2. web.xml

 We have two roles, ContentOwner and ServerAdmin. This is how we set up form-based authentication in web.xml:

  …
 <security-constraint>
     <web-resource-collection>
         <url-pattern>/*</url-pattern>
     </web-resource-collection>
     <auth-constraint>
         <role-name>ServerAdmin</role-name>
         <role-name>ContentOwner</role-name>
     </auth-constraint>
 </security-constraint>

 <login-config>
     <auth-method>FORM</auth-method>
     <realm-name>MongoDBRealm</realm-name>
     <form-login-config>
         <form-login-page>/login.jsp</form-login-page>
         <form-error-page>/login_error.jsp</form-error-page>
     </form-login-config>
 </login-config>

 <!-- Security roles referenced by this web application -->
 <security-role>
     <role-name>ServerAdmin</role-name>
 </security-role>
 <security-role>
     <role-name>ContentOwner</role-name>
 </security-role>

3. Create passwords for admin and test users

Use Tomcat's digest utility to generate the digests ($CATALINA_HOME/bin/digest.[bat|sh] -a {algorithm} {cleartext-password}):

os-prompt> digest -a SHA-256 manager
manager: 6ee4a469cd4e91053847f5d3fcb61dbcc91e8f0ef10be7748da4c4a1ba382d17

os-prompt> digest -a SHA-256 testpwd
testpwd: a85b6a20813c31a8b1b3f3618da796271c9aa293b3f809873053b21aec501087

Execute this JavaScript code in the MongoDB JS shell:

use mydb
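// The password values are the SHA-256 digests generated above
// for "manager" (admin user) and "testpwd" (test user).
usr = { userName: 'admin',
        password: '6ee4a469cd4e91053847f5d3fcb61dbcc91e8f0ef10be7748da4c4a1ba382d17',
        roles: [ { _id: ObjectId(),
                   name: 'ServerAdmin'}
               ]
}
db.user.insert(usr);
usr = { userName: 'test',
        password: 'a85b6a20813c31a8b1b3f3618da796271c9aa293b3f809873053b21aec501087',
        roles: [ { _id: ObjectId(),
                   name: 'ContentOwner'}
               ]
}
db.user.insert(usr);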
db.user.find().pretty();

role = { name: 'ServerAdmin',
         description: 'Server administrator role'
}
db.role.insert(role);
role = { name: 'ContentOwner',
         description: 'End-user (client) role'
}
db.role.insert(role);
db.role.find().pretty();

mydb is the name of the MongoDB database used in this example.

4. Realm element setup

Set up the Realm element, as shown below, in your $CATALINA_HOME/conf/server.xml file:

      <Host name="localhost"  appBase="webapps"
            unpackWARs="true" autoDeploy="true">
        ...
        <Realm className="com.daprota.m2.realm.MongoDBRealm"
               connectionURL="mongodb://localhost:27017/mydb"
               digest="SHA-256"/>
      </Host>
 </Engine>

5. How to hash a user’s password

The following Java code snippet is an example of how to hash (digest) a user’s password with SHA-256:

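import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

String password = "password";

// getInstance throws NoSuchAlgorithmException; the enclosing method must declare or handle it
MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
// Use an explicit charset so the digest does not depend on the platform default
messageDigest.update(password.getBytes(StandardCharsets.UTF_8));
byte[] byteData = messageDigest.digest();
// Convert byte data to hex format
StringBuilder hexString = new StringBuilder();
for (int i = 0; i < byteData.length; i++) {
    String hex = Integer.toHexString(0xff & byteData[i]);
    // Pad single-digit values so every byte yields two hex characters
    if (hex.length() == 1) {
        hexString.append('0');
    }
    hexString.append(hex);
}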

When you store the password in MongoDB, store the hex string returned by hexString.toString().
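If it is more convenient to produce the digest from JavaScript, for example when preparing the shell inserts in step 3, the same SHA-256 hex digest can be sketched with the built-in crypto module of Node.js. This is an alternative under the assumption that Node.js is available; it is not part of the realm implementation, and the function name is hypothetical:

```javascript
const crypto = require("crypto");

// Returns the SHA-256 digest of the password as a lowercase hex string.
function sha256Hex(password) {
  return crypto.createHash("sha256").update(password, "utf8").digest("hex");
}

console.log(sha256Hex("password"));
```

The output is a 64-character hex string, the same format that the digest utility prints for SHA-256.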

6. MongoDB realm source code

The source code of the Tomcat security realm implementation for MongoDB and a ready-to-use m2-mongodb-realm.jar are available at

https://github.com/gzugic/mongortom

You just need to copy m2-mongodb-realm.jar to your $CATALINA_HOME/lib directory.