http://www.codeproject.com/Articles/375413/RaptorDB-the-Document-Store (Nice explanation of document store)
Presentation: An Introduction to FluidDB
Vertica See also: http://www.dba-oracle.com/oracle_news/news_vertica.htm
NoSQL does not mean No SQL. It stands for Not Only SQL.
Next-generation databases mostly address some (if not all) of the following points: being non-relational, distributed, open source, and horizontally scalable; being schema-free; easy replication support; a simple API; eventual consistency / BASE (not ACID); and handling huge amounts of data.
The term NoSQL encompasses a lot of things, and may include relational / SQL databases. Oracle also has support for NoSQL.
Wide Column Store / Column Families:
Hadoop / HBase
SciDB: Array Data Model for Scientists, paper », poster », HiScaBlog »
HPCC: from LexisNexis
Stratosphere: (research system) massive parallel & flexible execution, M/R generalization and extension.
Key Value / Tuple Store:
Dynomite: Open-source implementation of the Amazon Dynamo key-value store.
Clusterpoint (XML, geared toward full-text search)
Apache CouchDB (JSON)
MongoDB (binary JSON)
RavenDB (binary JSON, full ACID)
Key-value in-memory cache:
There is a relational database named NoSQL. This database, by design, does not use SQL, but it is a relational database nonetheless.
Aside from that, NoSQL is a generic term that describes big distributed database systems.
Before venturing into the world of distributed databases, keep this in mind: relational databases have been the backbone of banks, financial systems, big corporations, governments, and research institutions for a long time, and even of some web 2.0 companies. With proper configuration, replication, and sharding, we can still use relational databases. Learning a technology takes time. We may not know all there is to know about a technology the first time we read a book about it. We learn more when we keep reading about it, or when we gain some first-hand experience with it (which may not be a good experience). The point is "don't go blindly with a technology". Consider your skill set and resources. If we really need to use distributed databases, hire someone who has experience managing such databases. If we are interested in the technology, learn from that person so we have some experience with it before we use it for another project.
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and it also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
CloudBase is a data warehouse system for terabyte- and petabyte-scale analytics. It is built on top of the Map-Reduce architecture. The current code has been developed for Hadoop’s map-reduce implementation. CloudBase allows you to query flat log files using ANSI SQL. It comes with a JDBC driver, so you can use any JDBC database manager application (e.g. SQuirreL) as a front end.
What is polyglot persistence?
Part of the NoSQL message is: pick the right tool for the job. One part of the system can use a different database. Another part of the system can use another different database. You do not have to pick just one.
How do you pick the right database for your architecture?
To understand why NoSQL is important, consider the use cases:
- Frequently written, rarely read statistical data should use an in-memory key/value store like Redis, or an update-in-place document store like MongoDB
- Big Data (like weather stats or business analytics) will work best in a freeform, distributed db system like Hadoop
- Binary assets (such as MP3 and PDFs) find a good home in a datastore that can be served directly to the browser like Amazon S3 (or should be stored on distributed file systems)
- Transient data (web sessions, locks, or short-term stats) should be kept in a transient data store like Memcached
- If you need to be able to replicate your data set to multiple locations, you will want the replication features of CouchDB
- High availability applications, where minimizing downtime is critical, will find great utility in the automatically clustered, redundant setup of data stores like Cassandra and Riak
- MySQL for low-volume, high-value data like billing information
- MongoDB for high-volume, low-value data like hit counts and logs
- Amazon S3 for user-uploaded assets like photos and documents (Amazon S3 is not a database; it is an object store that behaves more like a file system)
- Memcached for temporary counters and rendered HTML
- Key-value stores offer fast lookup of data by key
- A document database can be a key-value store
- A column store is a fancy key-value store
- A graph database is a document database on steroids
- Some document databases can handle graphs
- Range queries can be hard
- Complex ad-hoc queries are almost impossible, and they don't scale well across N nodes
- Transactions don't scale well in a distributed system
What does BASE stand for?
Basically Available, Soft-state, Eventually consistent.
Are all implementations of eventual consistency equal?
No. Not all implementations of eventual consistency are equal. An eventually consistent database may also elect to provide the following:
- Causal consistency: A signal is sent between application sessions indicating that a change has occurred. From that point on, the receiving session will always see the updated value.
- Read your own writes: In this mode of consistency, a session that performs a change to the database will immediately see that change, even if other sessions experience a delay.
- Monotonic consistency: In this mode, a session will never see data revert to an earlier point in time. Once we read a value, we will never see an earlier value.
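The last two guarantees can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Replica` and `Session` classes, the integer version numbers, and the idea of a per-session "minimum acceptable version" are all hypothetical and are not taken from any particular database.

```python
class Replica:
    """A hypothetical replica holding one (version, value) pair per key."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Keep only the newest version seen for each key.
        if version > self.data.get(key, (0, None))[0]:
            self.data[key] = (version, value)

    def get(self, key):
        # Version 0 means "this replica has never seen the key".
        return self.data.get(key, (0, None))


class Session:
    """A client session enforcing read-your-own-writes and monotonic reads."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.min_version = {}  # key -> oldest version this session may accept

    def write(self, key, version, value, replica_index):
        # The write lands on a single replica; the others lag behind.
        self.replicas[replica_index].apply(key, version, value)
        self.min_version[key] = max(self.min_version.get(key, 0), version)

    def read(self, key):
        floor = self.min_version.get(key, 0)
        for replica in self.replicas:
            version, value = replica.get(key)
            if version >= floor:
                # Remember what we saw so later reads never go backwards.
                self.min_version[key] = version
                return value
        raise RuntimeError("no replica is fresh enough for this session")


stale, fresh = Replica(), Replica()
writer = Session([stale, fresh])
writer.write("balance", 1, 100, replica_index=1)
print(writer.read("balance"))             # the writer skips the stale replica
print(Session([stale, fresh]).read("balance"))  # another session may still see nothing
```

The writing session refuses the stale replica because it remembers its own write, while a fresh session with no such memory accepts whatever the first replica returns.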
What is the NRW notation?
NRW notation describes at a high level how a distributed database will trade off consistency, read performance, and write performance. NRW stands for:
- N: the number of copies of each data item that the database will maintain
- R: the number of copies that the application will access when reading the data item
- W: the number of copies of the data item that must be written before the write can complete
When N=W, the database will always write every copy before returning control to the client. This is more or less what traditional databases do when implementing synchronous replication. If you are more concerned about write performance, you can set W=1 and R=N. Then each read must access all copies to determine which is correct, but each write only has to touch a single copy of the data.
Most NoSQL databases use N > W > 1: more than one write must be completed, but not all nodes need to be updated immediately. You can increase the level of consistency in roughly three stages:
- If R=1, the database will accept whatever value it reads first. This might be out-of-date if not all updates have propagated through the system.
- If R>1, the database will read more than one value and pick the most recent value
- If W+R > N, then a read will always retrieve the latest value, although it may be mixed with older values.
In other words, the number of copies you write and number of copies you read is high enough to guarantee that you will always have at least one copy of the latest version in your read set. This is sometimes referred to as quorum assembly.
| Configuration | Resulting behavior |
| W=N, R=1 | Read-optimized strong consistency |
| W=1, R=N | Write-optimized strong consistency |
| W+R<=N | Weak eventual consistency. A read may not see the latest update |
| W+R>N | Strong consistency through quorum assembly. A read will see at least one copy of the most recent update |
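The rows of the table above can be checked mechanically. The following is a minimal sketch: `classify_nrw` and `quorums_always_overlap` are hypothetical helper names, and the second function simply brute-forces every possible write set and read set to confirm that they share a node exactly when W+R>N.

```python
from itertools import combinations


def classify_nrw(n, r, w):
    """Map an (N, R, W) setting to the matching row of the table."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must each be between 1 and N")
    if r + w <= n:
        return "weak eventual consistency"
    if w == n and r == 1:
        return "read-optimized strong consistency"
    if w == 1 and r == n:
        return "write-optimized strong consistency"
    return "strong consistency through quorum assembly"


def quorums_always_overlap(n, r, w):
    """True if every set of W written nodes shares a node with every set of R read nodes."""
    nodes = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, w)
        for read in combinations(nodes, r)
    )


for n, r, w in [(3, 1, 3), (3, 3, 1), (3, 2, 2), (3, 1, 2)]:
    print(n, r, w, classify_nrw(n, r, w), quorums_always_overlap(n, r, w))
```

With N=3, the setting R=2, W=2 always overlaps, while R=1, W=2 does not: a read can land on the one node that the write skipped.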
NoSQL databases generally try hard to be as consistent as possible, even when configured for weak consistency. For example, the read repair algorithm is often implemented to improve consistency when R=1. Although the application does not wait for all the copies of a data item to be read, the database will read all known copies in the background after responding to the initial request. If the application asks for the data item again, it will therefore see the latest version.
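Read repair with R=1 can be sketched as follows. This is an illustration only: the replicas are modelled as plain dictionaries mapping a key to a hypothetical `(version, value)` pair, and the repair runs inline, whereas a real database would do it in the background after answering.

```python
def read_with_repair(replicas, key):
    """Answer from the first replica (R=1), then bring stale replicas up to date.

    `replicas` is a list of dicts mapping key -> (version, value).
    """
    missing = (0, None)  # version 0 stands for "never seen this key"
    answer = replicas[0].get(key, missing)

    # Repair step: find the newest copy among all replicas...
    newest = max((rep.get(key, missing) for rep in replicas), key=lambda pair: pair[0])
    # ...and overwrite every copy that is older than it.
    for rep in replicas:
        if rep.get(key, missing)[0] < newest[0]:
            rep[key] = newest

    return answer[1]


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
print(read_with_repair(replicas, "k"))  # first read may be out of date
print(read_with_repair(replicas, "k"))  # second read sees the repaired copy
```

The first call returns the stale value because only one copy is consulted before answering; the second call benefits from the repair triggered by the first.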
What is the Vector Clock(s) algorithm?
The Vector Clock(s) algorithm can be used to ensure that the updates are processed in order (monotonic consistency). With vector clocks, each node maintains a change number (or event count) similar to the System Change Number used in some RDBMS. The "vector" is a list including the current node's change number as well as the change numbers that have been received from other nodes. When an update is transmitted, the vector is included with the update and the receiving node compares that vector with other vectors that have been received to determine if updates are being received out of sequence. Out of sequence updates can be held until the preceding updates appear.
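A minimal sketch of the idea, assuming each vector is a dictionary from node name to change number (the function names `tick`, `merge`, `compare`, and `can_deliver` are hypothetical). The `can_deliver` check follows one common rule for holding out-of-sequence updates: an update from a sender is applied only if it is the next one expected from that sender and it depends on nothing the receiver has not yet seen.

```python
def tick(clock, node):
    """Return a copy of `clock` with `node`'s change number incremented."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Combine two vectors by taking the larger change number for each node."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for vectors a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def can_deliver(local, sender, update_clock):
    """True if the update is the next one in sequence for the receiving node."""
    if update_clock.get(sender, 0) != local.get(sender, 0) + 1:
        return False  # an earlier update from this sender is still missing
    return all(
        count <= local.get(node, 0)
        for node, count in update_clock.items()
        if node != sender
    )


first = tick({}, "n1")
on_n2 = tick(first, "n2")
on_n3 = tick(first, "n3")
print(compare(first, on_n2))   # first happened before on_n2
print(compare(on_n2, on_n3))   # neither saw the other: concurrent
print(can_deliver({}, "n1", {"n1": 2}))  # held until update 1 from n1 arrives
```

Two vectors that compare as "concurrent" are exactly the case where the database cannot order the updates by itself and has to resolve the conflict some other way.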