HBase Schema

“billions of rows * millions of columns * thousands of versions = terabytes or petabytes of storage” (The HBase project)

Apache HBase is an open source implementation of Google’s BigTable. It is built atop Apache Hadoop and is tightly integrated with it. It is a good choice for applications requiring fast random access to very large amounts of data.

HBase stores data in a form of a distributed sorted multidimensional persistence maps called Tables. The table terminology makes it easier for people coming from the relational data management world to abstract data organization in HBase. HBase is designed to manage tables with billions of rows and millions of columns.

HBase data model consists of tables containing rows. Data is organized into column families grouping columns in each row. This is where similarities between HBase and relational databases end. Now we will explain what is under the HBase table/rows/column families/columns hood.

HBase Row

This is to summarize an HBase table’s mappings:

  • a row key maps to a list of column families
  • a column family maps to a list of column qualifiers (columns)
  • a column qualifier maps to a list of timestamps (versions)
  • a timestamp maps to a value (the cell itself)

Based on this you will get the following:

  • if you are retrieving data that a row key maps to, you’d get data from all column families related to the row that the row key identifies
  • if you are retrieving data which a particular column family maps to, you’d get all column qualifiers and associated data (maps with timestamps as keys and corresponding values)
  • if you are retrieving data that a particular column qualifier maps to, you’d get all timestamps (versions) for that column qualifier and all associated values.

Tables are declared up front at schema definition time. Row keys are arrays of bytes and they are lexicographically sorted with the lowest order appearing first.

HBASE returns the latest version of data by default but you can ask for multiple versions in your query. HBase returns data sorted first by the row key values, then by column family, column qualifier and finally by the timestamp value, with the most recent data returned first.

What is a multidimensional map?

A map is an abstract data type that represents a collection of key-value pairs. This is how it looks like presented in JSON:

{
  "name": “Baruch Spinoza",
  "email": "baruch.spinoza@spinozahome.com"
}

We can think about a multidimensional map as a map of maps. For example:

{ 
  "rowkey1": {"cf11": {"column111": {"version1111": value1111,
                                     "version1112": value1112},
                       "column112": {"version1121": value1121,
                                     "version1122": value1122,
                                     "version1123": value1123,
                                     "version1124": value1124},
                       "column113": {"version1131": value1131}
                      },
              "cf12": {"column121": {"version1211": value1211},
                       "column122": {"version1221": value1221},
                                     "version1222": value1222}
                      }
             },
  "rowkey2": {"cf11": {"column111": {"version2111": value2111,
                                     "version2112": value2112},
                       "column112": {"version2121": value2121,
                                     "version2122": value2122,
                                     "version2123": value 2123,
                                     "version2124": value2124}
                      },
              "cf12": {"column121": {"version2211": value2211},
                       "column122": {"version2221": value2221}
                      }
             }
}

In the example above, each key (“rowkey1″,”rowkey2”) points to a map with exactly two keys, “cf11” and “cf12”.

We call the top level key/value pair rows.

The “cf11” and “cf12” mappings are Column Families.

How is it distributed?

HBase is built upon distributed filesystems with file storage distributed across commodity machines. The distributed file systems HBase works with include

  • Hadoop’s Distributed File System (HDFS) and
  • Amazon’s Simple Storage Service (SS3).

HBase Arcxhitecture

HDFS provides a scalable and replicated storage layer for HBase. It guarantees that data is never lost by writing the changes across a configurable number of physical servers.

The data is stored in HFiles, which are ordered immutable key/value maps. Internally, the HFiles are sequences of blocks with a block index stored at the end. The block index is loaded when the HFile is opened and kept in memory.  The default block size is 64 KB but it can be changed since it is configurable. HBase API can be used to access specific values and also scan ranges of values given a start and end key.

Since every HFile has a block index, lookups can be performed with a single disk seek. First, HBase does a binary search in the in-memory block index to find a block containing the given key and then the block is read from disk.

When data is updated it is first written to a commit log, called a write-ahead log (WAL) and then it is stored in the in-memory memstore.

When the data in memory exceeds a given maximum value, it is flushed as an HFile to disk and after that the commit logs are discarded up to the last unflushed modification. The system can continue to serve readers and writers without blocking them while it is flushing the memstore to disk. This is done by rolling the memstore in memory where the new empty one is taking the updates and the old full one is transferred into an HFile. At the same time, no sorting or other special processing has to be performed since the data in the memstores is already sorted by keys matching what HFiles represent on disk.

The write-ahead log (WAL) is used for recovery purposes only. Since flushing memstores to disk causes creation of HFiles, HBase has a housekeeping job that merges the HFiles into larger ones using compaction. Various compaction algorithms are supported.

Other HBase architectural components include the client library (API), at least one master server, and many region servers. The region servers can be added or removed while the system is up and running to accommodate increased workloads. The master is responsible for assigning regions to region servers. It uses Apache ZooKeeper, a distributed coordination service, to facilitate that task.

Data is partitioned and replicated across a number of regions located on region servers.

HBase Logical Architecture

As mentioned above, assignment and distribution of regions to region servers is automatic. However manual management of regions is also possible. When a region’s size reaches a pre-defined threshold, the region will automatically split into two child regions. The split happens along a row key boundary. A single region always manage an entire row. It means that a rows are never divided.

How is it sorted?

Key/value pairs in HBASE maps are kept in an alphabetical order. The amount of data you can store in HBase can be huge and the data you are retrieving via your queries should be near each other.

For example, if you run a query on an HBase table that returns thousands of rows which are distributed across many machines, the latency affected by your network can be significant. This data distribution is determined by a row key of the HBase table. Because of that the row key design is one of the most important aspects of the HBase data modeling (schema design). If a row key is not properly designed it can create hot spotting where a large amount of client traffic is directed at one or few nodes of a cluster.

The row key should be defined in a way that allows related rows to be stored near each other. These related rows will be retrieved by queries and as long as they are stored near each other you should experience good performance. Otherwise the performance of your system will be impacted.

Table

Data is stored in tables that have rows and columns. Table names are strings. They are composed of characters that are safe to use in file system paths. Tables are logically grouped into namespaces by applications, or users, or access control, etc. A namespace is analogous to a database in relational database management systems. A namespace membership is determined during the table creation when tables are fully named.

As previously mentioned, HBASE tables are multi-dimensional maps. A table has multiple rows. A row consists of a row key that is sortable and one or more columns with values associated with them. Rows are uniquely identified by their row key. A row key has no data type and is always treated as an array of bytes. A row key is an equivalent of a primary key in a relational database table. The row key is the only way to access the row. The number of columns per row is arbitrary. It can vary from row to row. Columns are organized in column families. At a conceptual level, tables may be viewed as a sparse set of rows that are physically stored by a column family.

Column Family

Column families are specified when a table is created. They should be carefully designed before a table is created since it would be either impossible or difficult to change them later.

Column families’ names are strings that are composed of characters that are safe to use in file system paths.

All columns in a column family are stored and sorted together in the same HFile.

Column families group columns together physically and logically and they are usually used for a performance reason. A column family has a set of parameters that specify its storage (e.g., caching, compression, etc.). All tuning and storage specifications are done at the column family level. It is important that all column family members have the same or similar access pattern and sizes.

Some shortcomings in the current HBase implementation do not properly support large number of column families in a single table. That number should be in low tens. Most of the time up to three column families should work fine without any significant performance drawback. Ideally you should go with a single column family. The column family names should be as small as possible, preferably one character.

A column family can have an arbitrary number of columns denoted by a column qualifier which is like a column’s label. For example:


 {
 "row1": {"1": {"color": "green",
                "size": 25},
          "2": {"weight": 52,
                "size": 18}
         },
 "row2": {"1": {"color": "blue"},
          "2": {"height": 192,
                "size": 43}
         }
 }
 

As you can see in the example above, the same column family (e.g., “1”) in two rows can have different columns. In row “row1”, it has columns “color” and “size”, while in row “row2”, it has only “color” column. It can also have a column that is none of the above. Since rows can have different columns in column families there is no a single way to query for a list of all columns in all column families. This means that you have to do a full table scan.

There is no specific limit on the number of columns in a column family. Actually you can have millions of columns in the single column family.

Column

Columns are usually physically co-located in column families. A column is identified by column family and column qualifier separated by a colon character (:). For example, courses:math. The column family prefix must be composed of printable characters. The column qualifiers (columns) do not have to be defined at schema definition time and they can be added on the fly while the database is up and running.

A column qualifier is an index for a given data and it is added to a column family. Data within a column family is addressed via the column qualifier. Column qualifiers are mutable and they may vary between rows. They do not have data types and they are always treated as arrays of bytes.

A row key, column family and column qualifier form a cell that has a value and timestamp that represents the value’s version. Values also do not have data types and they are always treated as arrays of bytes. A timestamp is recorded for each value and it is the time on the region server when the value was written.

All cell’s values are stored in a descending order by its timestamp. When values are retrieved and if the timestamp is not provided then HBase will return the cell value with the latest (the most recent) timestamp. If a timestamp is not specified during the write, the current timestamp is used.

The maximum number of versions (timestamps) for a given column to store is part of the column schema. It is specified at table creation. It can be specified via alter table command as well. The default value is 1. The minimum number of versions can be also set up per column family. You can also globally set up a maximum number of versions per column.

HBase does not overwrite row values. It stores different values per row by time and column qualifier. Extra versions above the current max version setup are removed during major compactions. If it is not necessary it is not recommended to have very high maximum number of versions since it will increase the HFile size significantly.

It is worth to mention that the column metadata is only stored in internal key/value instances for a column family. You have to keep track of the column names since HBase can support very high number of columns per row and columns can differ between the rows as well. If you do not record these column names by yourself and you forget them you will have to retrieve all rows from a column family in order to find out the column names.

Supported Data Types

Everything that can be converted to an array of bytes can be stored in HBASE. The stored values could be

  • Strings
  • Numbers
  • Counters (atomic increment of numbers)
  • Complex objects
  • Images

as long as they can be rendered as arrays of bytes.

The size of values should be reasonable. Storing millions of objects with 10-50 MB in size might not make sense.

Namespace

A namespace is a logical grouping of tables analogous to a database in relational database management systems. They are useful when you are dealing with many tables. Namespaces are also used to apply security rules to all tables in a namespace. If you do not specify a namespace when you create a table, it will be automatically added to the “default” tablespace.

A namespace membership is determined during a table creation when the table is fully named as

<table namespace>:<table qualifier>

Joins

HBase does not support joins. Generally there are two options to support joins:

  • denormalize the data upon writing to HBase
  • have lookup tables and implement the join between HBase tables either in your application code or MapReduce code.

The best join approach depends on a way how you will be using your data and run your application.

HBase Schema Creation

HBASE schema can be created either by HBASE shell or by Java API. When changes are made on a table, it has to be disabled until the changes are complete.

Changes on tables and column families take place when the next major compaction is done and HFiles re-written.

Summary

HBase’s column-oriented architecture allows for huge, wide, sparse tables.

HBase is strongly consistent on a row-level since a single region always manage an entire row.

Multiversioning can help us to avoid edit conflicts caused by concurrent data access processes and also retain data for whatever time it is needed as long as enough storage is provided.

When data schemas are properly designed, HBase provides excellent random read performance and near-optimal write operations in terms of I/O channel saturation.

HBase makes an efficient use of storage by supporting pluggable compression algorithms.

HBase extends the Bigtable model, which only considers a single index. In addition, it provides push-down predicates, that is, filters, reducing data transferred over the network.

Designing the schema in a way to completely avoid explicit locking, combined with row-level atomicity, gives you the ability to scale your system without any notable effect on read or write performance.

There are many tuning techniques that can be used to improve HBase experience and they may be a topic of some future post.

Advertisements

Synchronous and Asynchronous Service Integration

Integration of services is one of the most critical solution elements of systems based on Service Oriented Architecture (SOA). There are few influencing factors which can significantly impact system capabilities depending on the type of the integration used. The most critical factors include and they are equally relevant for both regular services and microservices:

Service changes

 

From time to time, we will make service changes that may require service consumers to change as well. We have to ensure that the selected  service integration technology requires this to happen as rarely as possible.
More service consumers down the road

 

If more service consumers will be using a service in the future, the service integration capabilities should offer an efficient integration solution that will minimize service and service consumer changes and keep performance in a range of business SLAs.
Implementation locality

 

When implementing integration interfaces for a service, its internal implementation details should be hidden internally and they should not be exposed to consumers of the service. If it is not the case, it means that if something is changed inside the service, we can break the service consumers by requiring them to change as well. That increases the cost of change and  we want to avoid that. It also increases the risk associated with the changes.
Communication

 

Should communication be synchronous or asynchronous? This choice will shape the overall service integration implementation. Synchronous communication blocks until the operation completes. With asynchronous communication, a service consumer does not wait for the operation to complete before returning.
Stateful vs. Stateless

 

While the synchronous integration mostly supports stateful services, the stateless services are mostly supported by the asynchronous integration. If the statefulness has to be supported by the asynchronous integration it may increase the complexity of the solution and data replication
Complexity of service interactions

 

The complexity of the services’ interactions can further complicate overall integration especially in a case of the asynchronous integration.
Performance

 

Performance will be especially critical for the real-time processes with direct interactions with clients. The integration types supported by the services in the real-time processes can significantly impact performance during the peak hours of operations.
Technology agnostic APIs

 

We want to ensure that the service APIs used for communication with consumer services  are technology-agnostic. This means that the integration technology that dictates what technology stacks we can use should be avoided.
System breakdowns There is a crucial question about the system availability if some of its components go down. It is acceptable for some types of applications to continue functioning in a partially available state if some of its services fail.

 

Some Facts About Synchronous and Asynchronous Communications

Since it is straight forward with the synchronous communication to find out whether things completed successfully or not, the synchronous communication can be easier to reason about. However, since the synchronous communication blocks until the operation completes it can increase computing resources consumption and slow down performance during the peak periods.

Asynchronous communication is useful with the following types of executions:

  • It works very well when we need low latency since it does not block a call while waiting for the result.
  • It also works well with long-running jobs that would otherwise keep a connection between the consumer and a service open for a long period of time what would cause inefficient utilization of resources.

These two different modes of communication enable two different types of collaboration between the system components:

  • request/response or
  • event-based.

With the request/response collaboration, a client sends a request and waits for the response. This type of collaboration aligns well to synchronous communication. However it can also work for asynchronous communication. For example, a consumer can send a request to a service and register a callback, asking the service to send a corresponding response when the operation has completed.

With the event-based collaboration pattern, system components raise events when things change. Other components can listen to events and react to them. The event-based collaboration supports a very loose coupling between services and its consumers. The addition of the new consumers of a service and under assumption that the added consumers can listen to the events, would be pretty straight forward. This type of the collaboration keeps each service simple in a context of its integration. Generally event-based collaborations have advantages such as loose coupling, simplification of the system components, better performance, and it is more resilient to breakdowns. However at the same time it can increase the complexity of the interactions and data replication in a case of stateful services since this type of collaboration is stateless.

 

Service Integration Matrix

Critical Factor

Preferable Integration Type

Service changes Asynchronous

Service decoupling is the property that is supported by the asynchronous integration. Decoupling minimizes impacts caused by service changes.

Additional service consumers Asynchronous

Decoupling property of the integration becomes important when more service consumers start to use the service. The less coupling is involved the simpler and more cost effective integration solution will be implemented.

Implementation locality Asynchronous

Since the asynchronous integration generally hides internal details of the service implementation it will better support the implementation locality.

Stateful Synchronous

The synchronous integration is the best fit for the stateful services. If the statefulness has to be supported by the asynchronous integration it would make implementation more complex and also increase data replication.

Stateless Asynchronous

The stateless services are mostly supported by the asynchronous integration.

Complexity of service interactions Synchronous

More complex service interactions will make asynchronous integration less affordable option. Synchronous integrations simplify overall solution when more complex service interactions are required. However the performance aspect of the solution has to be considered and we should analyze the weight of the solution complexity factor vs. performance since the synchronous integration with complex interactions can significantly impact performance.

Performance Asynchronous

Asynchronous integration generally  provides better performance than synchronous integration.

Technology agnostic APIs Asynchronous

The technology agnostic APIs are better supported by the asynchronous integration since it minimizes exposure of the service internal details and enables either loosely or fully de-coupled services.

System breakdowns Asynchronous

A system that uses asynchronous integration is more resilient to breakdowns because its components function in a more independent mode. In some cases that can be acceptable but we have to be carefully about how much the currency of data (up-to-date information)  is required in a process using the service.

MEAN and Full-Stack Development

JavaScript followed by Node.js enables a single language use across all application layers. Before this change emerged few years ago, we had fragmented technologies and separated teams of designers and developers working in these fragmented technology fields in order to build applications. The “JavaScript everywhere” enabled appearance of the full-stack frameworks that bring common modules from different technology layers together in order to build software in a fast and more agile way making it more efficient solution for frequently changing and highly scalable systems.

Full-stack development is about developing all parts of the application by using a single framework. It includes back-end which belongs to the database, middleware where the application logic and control reside and the last but not the least part is the user interface.

MEAN is a JavaScript and Node.js full-stack framework comprised of four main technologies:

You will need some time to learn all technologies involved in MEAN but it will be rewarding and professionally exiting. At the same time, a single language, JavaScript, is used through the framework and all parts of the application can user and/or enforce Model-View-Controller (MVC) pattern. MVC is fully data oriented. Model holds data, controller processes data and view renders data. Data marshaling is done using JSON so that the serialization and deserialization of data strictures are not needed.

The big advantage of the full-stack framework is that it has a holistic approach that looks at the system as a whole with all its components that exist on their own. In order to function together as the whole the system components have some interdependencies that need to be considered as well. These  interdependences should be minimized in order to properly support decoupling between the system components which is one of the most important aspects of the overall architecture of the system.

The framework is modular what means if tomorrow some of the components become obsolete it can be replaced with the new component. It would require some changes with some of the dependable components but they should be minimal.

The full-stack approach gives you better overall control since it helps the different parts work seamlessly together since they are built by a single developer or a small team of developers. This also supports microservices way of service design and implementation especially for systems that change frequently and/or have to be web scalable. The disposable services are the way to go in these kind of environments.

This post contains overview of MEAN applications, MEAN technologies, and MEAN architectural patterns.

If you are interested in additional details about the MEAN architectural patterns, Getting MEAN with Mongo, Express, Angular and Node book authored by Simon Holmes is a good source of information.

MEAN Applications

There are two types of MEAN applications:

  • Server Applications
  • Single Page Applications (SPA)

With a server application, each user request is routed through Express. Express finds out from its routes which controller will handle the request. This is the same process for each user request. This application type supports one-way data binding.  Node.js gets the data from MongoDB, and Express then compiles this data into HTML via provided templates and finally the HTML is delivered to the server. This implies that most of the processing is done on the server and browser just renders HTML and runs JavaScript if it is provided for interactivity.

MEAN-Architecture-NodeExpress_NEA

With SPA, the application logic is moved to the front-end away from the server and that is why it is called Single Page Application (SPA). The mostly used JavaScript frameworks for SPAs are AngularJS, Backbone and Ember. MEAN uses AngularJS. While this approach has its pros and cons, it is obvious that moving the application processing from the host (server) to the users’ browsers will lower the load on the server and network and bring the cost down. In some cases it will also improve the performance of the application. A browser sends an initial user’s request to the server and server returns AngularJS application with requested data. The subsequent user’s requests are processed most of the time by the AngularJS application running in the browser while data goes back and forth between the browser and server. SPA also supports two-way data binding where the template and data are sent independently to the browser. The browser compiles the template into a view and the data into a model. The view is “live” since it is bound to the model. If the model changes the view changes as well and if the view changes then the model also changes.

MEAN-Architecture-Angular_SPA

MEAN.IO and MEAN.JS are full-stack frameworks for developing MEAN-based applications.

 

MEAN Technologies

MEAN includes five main technologies:

  • MongoDB database and Mongoose object data modeling (ODM) tool
  • Express middleware
  • AngularJS front-end
  • Node.js server platform

MongoDB is a NoSQL document-based database management system which data model includes:

  • Collections
  • Documents
  • Fields
  • References

M2_Model

A Collection is a top model element. Each model can have one or more collections. Collections are analogous to tables in a relational database. Each collection contains documents that are analogous to records in the relational database. Collections model one or more concepts (e.g., account, user, order, publisher, book, etc.) the data is based on.

Documents are JSON-like data structures containing fields that have values of different types (e.g., String, Date,  Number, Boolean, etc.). A value can also belong to another document or an array of documents embedded in a document. Documents can have different structures in a collection. However, in most cases in practice, collections are highly homogeneous.

Fields are analogous to columns in the relational database. The field/value pairs (better known as key/value pairs) construct document’s structure.

MongoDB resolves relationships between documents by either embedding related documents or referencing related documents.

Mongoose is a MongoDB object data modeling (ODM) tool designed to work in an asynchronous environment. Besides the data modeling in Node.js, Mongoose also provides a layer of CRUD features on top of MongoDB. It also makes it easier to manage connections to MongoDB databases and perform data validations.

Express is a middleware framework for Node.js that abstracts away some common web server functionalities. Some of these functionalities include session management, routing, templating, and others.

Node.js is a foundation of the MEAN stack. Node.js is not a language. It is a software platform based on JavaScript. You will use it to build your own web server and applications that will run on top of it. Node.js applications when codded correctly are fast and they efficiently use system resources. This is supported by the core Node.js feature that it is single-threaded and executes a non-blocking event loop.

The web server running on Node.js is different from traditional multi-threaded web servers (e.g., Apache, IIS, etc.). The multi-threaded servers create new thread for each new user session and allocates memory and other computing resources for it. During the peak periods when many users access the server concurrently its resources can get exhausted in which case the system could halt its operations until the load decreases and/or more machines and resources are added. The precocious approach that many systems take is to often overpower the servers even if they do not need so much resources most of the time.  This definitely increases the cost of system operations. When Node.js is used, rather than giving each user a separate thread and pool of resources, each user joins the same thread and the interaction between the user and thread exist only when it is needed. In order to ensure that this approach works Node.js supports non-blocking by making blocking operations run asynchronously.

While you can use Node.js, Express and MongoDB to build data-driven applications, the use of AngularJS will bring more sophisticated features to the interactivity element of the MVC architectural pattern supported by MEAN. AngularJS puts HTML together based on provided data. It also supports two-way data binding by immediately updating the HTML based on changed data and also by updating the data if HTML changes.

 

MEAN Architectural Patterns

When you create MEAN-based applications, you can choose any of the architectural patterns or a combination of the architectural patterns (hybrid architectural patterns) listed here.

MEAN architectural patterns are based on the Model-View-Controller (MVC) pattern.

The MVC pattern is data oriented. Model holds data, Controller processes data and view renders data. There is also a route component between the controller and users’ browsers (Web). The route component coordinates interactions with the controller.

MVC.png

A common way to architect MEAN stack is to have a REST interface feeding a single page application (SPA).  REST interface is implemented via REST API that is built with MongoDB, Node.js and Express and SPA is built with AngularJS that runs in browser.

REST API creates a stateless interface to your database. It enables other applications to work with your data. There is also one more important technology component, Mongoose, that is a liaison between the controller and MongoDB.

Mongo-Mongoose-Express-Angular-Communications.png

MongoDB communicates with Mongoose only. Mongoose communicates with Node.js and Express and AngularJS communicates with Express only.

The REST API is a common architectural element used in all MEAN architectural patterns.

The following architectural patterns are enabled by the MEAN framework:

  • Node.js and Express Application (NEA)
  • Node.js and Express application with AngularJS addition for better interactivity (NEA2)
  • AngularJS Single Page Application (SPA)
  • Hybrid Patterns:
    • NEA and SPA
    • NEA2 and SPA

Node.js and Express application (NEA)

HTML and content are directly delivered from the server. The HTML content requires data that is delivered via REST API. REST API is developed with Node.js, Express, Mongoose and MongoDB.

MEAN-Architecture-NodeExpress_NEA

Node.js and Express Application with AngularJS Addition for Better Interactivity (NEA2)

If you need a richer interactive experience for your users, you can add AngularJS to your pages.

MEAN-Architecture-NodeExpressAngular_NEA2

AngularJS Single Page Application (SPA)

In order to implement Single Page Applications, AngularJS is needed.

MEAN-Architecture-Angular_SPA

Hybrid Patterns

The three above listed architectural patterns can also be combined into hybrid architectural patterns. The two most common combinations are:

  • NEA and SPA
  • NEA2 and SPA

NEA and SPA

This pattern is for the applications that require combination of application constraints that are best supported by both NEA and SPA. For example, NEA best supported application constraints include: short duration of user interactions, low interactions, content rich, etc. SPA best supported application constraints include: feature-rich, highly interactive, long duration of user interactions, private, fast response, etc..

MEAN-Architecture-NodeExpress_NEA-SPA.png

NEA2 and SPA

Finally, NEA2 and SPA is like NEA and SPA with a bit richer interactivity on the server side (NEA2) via AngularJS addition.

MEAN-Architecture-NodeExpress_NEA2-SPA.png

 

Java Code Generation from MongoDB Data Models Created in Daprota M2

Daprota just released a new version of its MongoDB data modeling service M2 that now provides a code generator for Java and JSON from MongoDB data models created by it.

The generated code includes persistence APIs based on MongoDB Java driver, NoSQLUnit tests, and test data in JSON format.

When code is generated you can download it and then use Apache Maven to build the software and run tests via single Maven command (mvn install).

You can save a significant amount of time and effort by creating MongoDB data models in M2 and then generate Java code via just a single click. The quality of your code will also be improved. It will be unit tested and all this will be done for you by M2 in a fully automated fashion.

This kind of service can also be very useful for a quick creation of disposable schemas (data models) in an agile environment when you want to quickly create schemas, generate Java persistence code from it, immediately test it, and repeat this procedure starting over with the current schema update or a completely new schema creation.

As soon as you become familiar with data models creation in M2, which is very intuitive, the speed of the software creation and build from it will be instantaneous.

All your models are fully managed in M2 where you can also store your models’ documentation in M2 repository or you can provide links for external documentations.

Daprota has documented its MongoDB Data Modeling Advisor  which you can access to find out more about MongoDB schema design (data modeling) patterns and best practices.

M2 is a free service including the Java code generator.

Data Modeling Adviser for MongoDB

Daprota just published Data Modeling Adviser guide for MongoDB. It documents

  • MongoDB data modeling basics;
  • Key considerations with data modeling for MongoDB;
  • Key properties of MongoDB data model types (embedded, referenced, hybrid);
  • Data design optimization patterns for these model types.

This guide is a work in progress and it will be regularly updated especially with new optimization patterns.

Its content is very much based on the following sources:

All model samples in the guide are created by the Daprota M2 service and they can be accessed via provided links.

Selecting a MongoDB Shard Key

Scalability is an important non-functional requirement for systems using MongoDB as a back-end component.  A system has to be able to scale when a single server running MongoDB cannot handle a large dataset size and/or high level of data processing. In general, there are two standard approaches addressing scalability:

  • Vertical scaling
  • Horizontal scaling

Vertical scaling is achieved by adding more computing resources (CPU, memory, storage) to a single machine. This is considered to be an expensive option. At the same time computing resources on a single machine have physical limitations.

Horizontal scaling achieves scalability by horizontally extending the system by adding commodity machines in order to distribute data processing across all these machines. This option is considered to be less expensive and in most cases meets computing needs for physical resources without limitations. MongoDB supports horizontal scaling by implementing sharding (data partitioning) across all machines (shards) clustered in a MongoDB cluster.

The success of the MongoDB sharding depends on a selected shard key that is used to partition data across multiple shards. MongoDB distributes data by a shard key at the collection level.

You should carefully amalyze all options before selecting the shard key since it can significantly affect your system performance and it cannot be changed after data is inserted in MongoDB.  Shard keys cannot be arrays and you cannot shard on a geospatial index. When selecting a shard key, you should also keep in mind that its values cannot be updated. However if you still have to change the value, you will have to remove the document first, change the key value, and reinsert it.

MongoDB divides data into chunks based on values of a shard key and distribute them evenly across the shards.

Sharding

Usually you will not shard all collections but only collections that need data to be distributed over shards to improve read and/or write performance. All un-sharded collections will be held in only one shard that is called primary shard (e.g., Shard A in the picture above). The primary shard can also contain sharded collections.

MongoDB supports three types of sharding:

  • Range-based sharding
  • Hash-based sharding
  • Tag-aware sharding

With the range-based sharding MongoDB divides datasets into ranges determined by the shard key values. With the hash-based sharding MongoDB creates chunks via hash values it computes from the field’s values of the shard key. In general, range-based sharding provides better support for range queries that need query isolation while the hash-based sharding supports write operations more efficiently.

With tag-aware sharding users associate shard key values with specific shards. This type of sharding is usually used to optimize physical locations of documents for location-based applications.

In order to properly select a shard key for your MongoDB sharded cluster, it is important to understand how your application reads and writes data. Actually the main question is

        What is more critical, query isolation, or write scaling, or both?

For the query isolation an ideal situation is when the queries are routed to a single shard or a small subset of shards. In order to select an optimal shard key for query isolation you must take into consideration the following:

  • Analyze what query operations are most performance dependent;
  • Determine which fields are used the most in these operations and include them in the shard key;
  • Make sure that the selected shard key enable even (balanced) distribution of data across shards;
  • A high cardinality field is preferable. Low cardinality fields tend to group documents on a small number of shards what would require frequent rebalancing of the chunks.

MongoDB query router (mongos) will route queries to a shard or subset of shards only when a shard key or a prefix of the shard key is used in the query. Otherwise mongos will route the query to all shards. Also all sharded collections must have an index that starts with a shard key. All documents having the same value for the shard key will reside on the same shard.

For an efficient write scaling, choose a shard key that has both high cardinality and enables even distribution of write operations across the shards.

You should keep in mind that whatever shard key you choose it should be easily divisible to enable even distribution of data across shards when data grows. Shard keys that have a limited number of possible values can result in chunks that are “unsplittable.”.

The most common techniques people use to distribute data are:

  • Ascending key distribution – The shard key field is usually of Date, Timestamp or Objectld type. With this pattern all writes are routed to one shard which MongoDB will keep splitting and spending lots of time migrating data between shards to keep data distribution relatively balanced across the shards. This pattern is not definitely good for the write scaling.
  • Random distribution – This pattern is achieved by fields that do not have an identifiable pattern in the dataset. For example, these fields include usernames, UUIDs, email addresses, or any field which value has a high level of randomness. This is a preferable pattern for write scaling since it enables balanced distribution of write operations and data across the shards. However this pattern does not work well for the query  isolation if the critical queries must retrieve large amount of “close” data based on range criteria  in which case the query will be spread across the most of the shards in the cluster.
  • Location-based distribution – The idea around the location-based data distribution pattern is that the documents with some location-related similarity will fall into the same  range. The location related field could be postal address, IP, postal code, latitude and longitude, etc.
  • Compound Shard Key – Combine more than one field into a shard key in order to come up with optimal shard key  values for high cardinality and balanced distribution of data for an efficient  write scaling and query isolation.
  • Data modeling  to the rescue – Design a data model to include a field that will be exclusively used to enable balanced distribution of data with good support for write scaling and query isolation. First analyze your application read and write operations to get a full understanding of its writing and data retrieval patterns.

The table below lists key considerations for a shard key selection regarding the query isolation and write scaling requirements.

Query isolation importance
Write scaling importance
Shard Key Selection
high
low
  • Range shard key
  • If the selected key does not provide relatively even distribution of data you can either
  • use a compound shard key (containing more than one document filed); or
  • add a special purpose field to your data model that will be used as a shard key. This is an example when data modeling comes to the rescue; or
  • for location-based applications you can manually associate specific ranges of a shard key with a specific shard or subset of shards.
low
high
  • Hashed shard key with high cardinality that will efficiently distribute write operations across the shards.
  • Having a high cardinality does not guarantee an appropriate write scaling all the time. The ascending key distribution is a good example. Write operations that require a high level of scaling should be carefully analyzed to find the best field candidate for the shard key.
  • If a selected key does not provide relatively even distribution of data you can add a special purpose field to your data model that will be used as a shard key.
high
high
  • A shard key enabling mid-high randomness and relatively even distribution of data.  A compound shard keys are usually good candidates.
  • Since an ideal shard key is almost impossible in this case, determine what shard key has the least performance affect on the most critical use cases for both query isolation and write scaling.
  • Data modeling can also help with embedding, referencing and hybrid model options to consider for improving  performance.
  • If a selected key does not provide relatively even distribution of data you can add a special purpose field to your data model that will be used as a shard key.

Daprota M2 Modeling of MongoDB Manual References and DBRefs – Part 2/2

This series of posts provides details with examples for modeling MongoDB Manual References and DBRefs by Daprota M2 service. You can access M2 service using this link:

https://m2.daprota.com

The previous part of the series (Daprota M2 Modeling of MongoDB Manual References and DBRefs – Part 1/2) covered Manual References. In this part of the series we will look at DBRefs.

Database references (DBRefs) are references from one document to another using the value of the referenced (parent) document’s _id field, its collection name, and the database name. While the MongoDB allows DBRefs without the database name provided, M2 models require the database name to be provided. The reason for this is because a Manual Reference in an M2 model must specify the collection name for the model to be complete in which case the DBRef without the database name from the M2 model point of view is the same as the Manual Reference. The database name in DBRef is more of an implementation aspect of the model and it is needed in order to make the DBRef definition complete. Otherwise, without the database name, the DBRef is the same as the Manual Reference to M2.

To resolve DBRefs, your application must perform additional queries to return the referenced documents. Many language drivers supporting MondoDB have helper methods that form the query for the DBRef automatically. Some drivers do not automatically resolve DBRefs into documents. Please refer to MongoDB language drivers documentation for more details.

The DBRef format provides common semantics for representing links between documents if your database must interact with multiple frameworks and tools.

Most of the data model design patterns can be supported by Manual References. Generally speaking you should use Manual References unless you have a firm reason for using DBRefs.

The example below is taken from MongoDB’s DBRef documentation page:

        {
            “_id” : ObjectId(“5126bbf64aed4daf9e2ab771”),
            // .. application fields
            “creator” : {
                  “$ref” : “creators”,
                  “$id” : ObjectId(“5126bc054aed4daf9e2ab772”),
                  “$db” : “users”
            }
        }              

The DBRef in this example references the creators collection’s document that has ObjectId(“5126bc054aed4daf9e2ab772”) value for its _id field. The creators collection is stored in the users database.

Let us model a sample collection Object in M2.

First we will create a model with the name DBRef Sample Model:

CreateModel-DBRef

Click the Create Model button to create the model. When the model is created, the M2 home page will be reloaded:

DBRef-ListModels

Click the DBRef Sample Model link to load the model page and then click the Add Collection tab to load the section for the collection creation. Enter the name and description of the Object collection:

DBRef-ObjectCollection

Click the Add Collection button to create the collection. M2 will also automatically create the collection’s document:

DBRef-ListCollection

Click the Object collection link to load the collection page and then click the Object document link in the Documents section to load the document page:

DBRef-Document

Click the Add Field tab to load the section for the field creation. Enter the name and description of the creator field and select DBRef for the field’s type. When the DBRef is selected as the field’s type, M2 will also require selection of the field’s value type which belongs to the value type of the referenced document’s _id field. It will be ObjectId in this example:

DBRef-AddField

Click the Add Field button to create the field. When the field is created it will be listed on the document page:

DBRef-DocWithField

Click the creator field link to load the field page:

DBRef-DBRef

Click the DBRef tab to load the DBRef section and specify the referenced collection name (creators) and its database (users) to complete the creator field creation:

DBRef-Spec

As you can see, you can either specify a collection name if it is not included in the model (as in this case) or select a collection from the Collections list if it is included in the model. Click the Add DBRef button to update the creator field definition:

DBRef-Final2

Click the model link above to load the DBRef Sample Model page:

DBRef-Final3

The References section of the page, as represented above, lists the reference that was just created. The Target (Child) column has the format: Collection –> Document –> Field. It contains the Object –> Object –> creator value which means that the Object is the target (child) collection and the creator is the field in the Object document of the Object collection whose value will reference the _id field value of the parent Collection (creators) document. The Database column specifies the database of the source (parent) collection.

It is also possible that the target (child) document, in the Collection –> Document –> Field value, is not the target collection document but an embedded document (on any level) in the target collection.