NoSQL – Karim Vaes

Introduction

A lot of workloads are driven by peak consumption. In my experience, workloads that have a constant performance need are in the minority. Now here comes the interesting opportunity when leveraging serverless architectures… Here you only pay for your actual consumption. So if you tweak your architecture to leverage this, then you can get huge gains!

For today’s post, I’ll be using VMchooser once again as an example. A lot has changed since the last post on the anatomy of this application. Here is an updated drawing of the high level architecture ;

Underneath you can see the flow that’ll be used when doing a “Bulk Mapping” (aka “CSV Upload”). The webapp (“frontend”) will store the CSV as a blob on the storage account. Once a new blob arrives, a function will be triggered that will examine the CSV file and put every entry onto a queue. Once a message is published onto the queue, another function will start processing this message. By using this pattern, I’m transforming this job into a parallel processing job where each entry is handled (about) simultaneously. The downside is that there will be contention/competition for the back-end resources (being the data store). Luckily, CosmosDB can scale on the fly too… We can adapt the request units as needed; up or down! So let’s do a small PoC and see how this could work…
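To make the fan-out idea a bit more tangible, here is a minimal local sketch in Python. It is not the VMchooser code: an in-memory queue stands in for the storage queue, a thread pool stands in for the queue-triggered function instances, and the CSV columns (name, cores) and all function names are assumptions made up for the illustration.

```python
import csv
import io
import queue
from concurrent.futures import ThreadPoolExecutor


def split_csv_to_queue(csv_text: str, work_queue: queue.Queue) -> int:
    """Stand-in for the blob-triggered function: one queue message per CSV row."""
    count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        work_queue.put(row)
        count += 1
    return count


def process_message(row: dict) -> dict:
    """Stand-in for the queue-triggered function (hypothetical mapping logic)."""
    return {"name": row["name"], "cores": int(row["cores"]), "status": "mapped"}


def run_bulk_mapping(csv_text: str, workers: int = 4) -> list:
    """Fan the CSV out into messages, then handle the messages in parallel."""
    work_queue = queue.Queue()
    total = split_csv_to_queue(csv_text, work_queue)
    messages = [work_queue.get() for _ in range(total)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, even though the work runs concurrently.
        return list(pool.map(process_message, messages))


# Hypothetical sample input, only used to exercise the sketch.
SAMPLE_CSV = "name,cores\nvm-a,2\nvm-b,4\nvm-c,8\n"
results = run_bulk_mapping(SAMPLE_CSV)
```

With more workers, more entries hit the data store at the same moment, which is exactly where the contention mentioned above comes from.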

Continue reading “Serverless On-Demand Scaling : Pushing the pedal when you need it…” →

A first glance beyond the religion

When taking a look at the landscape of databases, one can only accept that there has been a lot of commotion about “SQL vs NoSQL” in recent years. But what is it really about?

SQL, which stands for “Structured Query Language”, has been around since the seventies and is commonly used in relational databases. It consists of a data definition language to define the structure and a data manipulation language to alter the data within the structure. Therefore an RDBMS will have a defined structure, and it has been a common choice for the storage of information in new databases used for financial records, manufacturing and logistical information, personnel data, and other applications since the 1980s.


NoSQL, which stands for “Not only SQL”, departs from the standard relational model since it saw its first introduction in the nineties. The primary focus of these databases was performance, or a given niche, with less focus on consistency/transactions. These databases provide a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, graph, or document) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases. The particular suitability of a given NoSQL database depends on the problem it must solve.

So it depends on your need…

Do you want NoSQL, NoSQL, NoSQL or NoSQL?

NoSQL comes in various flavors. The most common types of NoSQL databases (as portrayed by Wikipedia) ;

There have been various approaches to classify NoSQL databases, each with different categories and subcategories. Because of the variety of approaches and overlaps it is difficult to get and maintain an overview of non-relational databases. Nevertheless, a basic classification is based on data model. A few examples in each category are:

Column: Accumulo, Cassandra, Druid, HBase, Vertica
Document: Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB
Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB
Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog
Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB

Column

A column of a distributed data store is a NoSQL object of the lowest level in a keyspace. It is a tuple (a key-value pair) consisting of three elements:

Unique name: Used to reference the column
Value: The content of the column. It can have different types, like AsciiType, LongType, TimeUUIDType, UTF8Type among others.
Timestamp: The system timestamp used to determine the valid content.

Example

{
    street: {name: "street", value: "1234 x street", timestamp: 123456789},
    city: {name: "city", value: "san francisco", timestamp: 123456789},
    zip: {name: "zip", value: "94107", timestamp: 123456789},
}
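
The role of the timestamp can be sketched in a few lines of Python. This is a simplified "newest timestamp wins" illustration, not the conflict resolution of any specific product; the Column class, the resolve function and the second set of values are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int


def resolve(a: Column, b: Column) -> Column:
    """Return the version with the newer timestamp (ties keep the first argument)."""
    if a.name != b.name:
        raise ValueError("can only resolve versions of the same column")
    return b if b.timestamp > a.timestamp else a


# Two versions of the same column, e.g. as seen on two different replicas.
old = Column("city", "san francisco", 123456789)
new = Column("city", "oakland", 123456999)  # hypothetical later write
winner = resolve(old, new)
```

Whichever order the two versions arrive in, the one with the higher timestamp is treated as the valid content.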

Document

A document-oriented database is designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. The central concept of a document-oriented database is that Documents, in largely the usual English sense, contain vast amounts of data which can usefully be made available. Document-oriented database implementations differ widely in detail and functionality. Most accept documents in a variety of forms, and encapsulate them in a standardized internal format, while extracting at least some specific data items that are then associated with the document.

Example

<Article>
   <Author>
       <FirstName>Bob</FirstName>
       <Surname>Smith</Surname>
   </Author>
   <Abstract>This paper concerns....</Abstract>
   <Section n="1"><Title>Introduction</Title>
       <Para>...
   </Section>
 </Article>
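
As a rough illustration of "extracting at least some specific data items" from a document, the sketch below parses a well-formed variant of the article above with Python's standard library. The abstract and paragraph texts are placeholders I filled in to make the XML parseable, and the extract_index_fields function is a hypothetical name; real document stores do this in their own way.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the example; abstract and paragraph are placeholders.
ARTICLE_XML = """
<Article>
  <Author>
    <FirstName>Bob</FirstName>
    <Surname>Smith</Surname>
  </Author>
  <Abstract>Placeholder abstract.</Abstract>
  <Section n="1"><Title>Introduction</Title><Para>Placeholder text.</Para></Section>
</Article>
"""


def extract_index_fields(xml_text: str) -> dict:
    """Pull a few fields out of the document so they could be indexed or queried."""
    root = ET.fromstring(xml_text.strip())
    first = root.findtext("Author/FirstName")
    last = root.findtext("Author/Surname")
    return {
        "author": f"{first} {last}",
        "section_titles": [s.findtext("Title") for s in root.findall("Section")],
    }


fields = extract_index_fields(ARTICLE_XML)
```

The document itself stays intact; only the extracted fields are associated with it for lookups.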

Key-Value

A key-value store (an associative array, map, symbol table, or dictionary) is an abstract data type composed of a collection of key/value pairs, such that each possible key appears just once in the collection.

Example

{
    "Pride and Prejudice": "Alice",
    "The Brothers Karamazov": "Pat",
    "Wuthering Heights": "Alice"
}
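
In Python, the built-in dict behaves the same way, which makes for a quick sketch of the "each key appears just once" property. The variable names and the extra borrower "Sam" are assumptions added for the illustration.

```python
library_loans = {
    "Pride and Prejudice": "Alice",
    "The Brothers Karamazov": "Pat",
    "Wuthering Heights": "Alice",
}

# Writing to an existing key replaces its value; it does not add a second entry.
library_loans["Pride and Prejudice"] = "Sam"

# Lookups go straight through the key; an unknown key yields None with .get().
borrower = library_loans.get("Wuthering Heights")
missing = library_loans.get("Moby-Dick")
```

Note that the same value ("Alice") can appear under several keys; only the keys have to be unique.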

Graph

A graph database is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A graph database is any storage system that provides index-free adjacency. This means that every element contains a direct pointer to its adjacent elements and no index lookups are necessary. General graph databases that can store any graph are distinct from specialized graph databases such as triplestores and network databases.

Example
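
A minimal sketch in Python of what index-free adjacency means: every node holds direct references to its neighbours, so a traversal just follows pointers. The Node class and the sample data (Alice, Bob, Chess Club) are made up for the illustration and do not represent any particular graph database.

```python
class Node:
    """A graph node with properties and direct references to adjacent nodes."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []  # list of (label, Node) pairs

    def connect(self, label, other):
        """Add a directed, labelled edge from this node to another node."""
        self.edges.append((label, other))

    def neighbours(self, label):
        """Follow the direct pointers for one edge label; no index lookup."""
        return [node for edge_label, node in self.edges if edge_label == label]


alice = Node("Alice", age=34)
bob = Node("Bob", age=29)
group = Node("Chess Club", kind="group")

alice.connect("knows", bob)
alice.connect("member_of", group)
bob.connect("member_of", group)

# "Which groups do the people Alice knows belong to?"
friends_groups = [
    g.name
    for friend in alice.neighbours("knows")
    for g in friend.neighbours("member_of")
]
```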

Multi-model

Most database management systems are organized around a single data model that determines how data can be organized, stored, and manipulated. In contrast, a multi-model database is designed to support multiple data models against a single, integrated backend. Document, graph, relational, and key-value models are examples of data models that may be supported by a multi-model database.

And what flavor do I want?

Each type and implementation has its own advantages… The following chart from Shankar Sahai provides a good overview ;

Any other considerations I should take into account?

Be wary that most implementations were designed around performance rather than integrity. Transactions and referential integrity are not supported by most implementations. High availability designs (including on a geographic level) are possible with some implementations, though this often implies a performance impact (as one would expect).

Also check out the research made by Altoros ;

5. Conclusion
As you can see, there is no perfect NoSQL database. Every database has its advantages and disadvantages that become more or less important depending on your preferences and the type of tasks.
For example, a database can demonstrate excellent performance, but once the amount of records exceeds a certain limit, the speed falls dramatically. It means that this particular solution can be good for moderate data loads and extremely fast computations, but it would not be suitable for jobs that require a lot of reads and writes. In addition, database performance also depends on the capacity of your hardware.

They did a very decent job in performance testing various implementations!

