How does column-oriented NoSQL differ from document-oriented?

mongodb cassandra nosql

The three types of NoSQL databases I've read about are key-value, column-oriented, and document-oriented.

Key-value is pretty straightforward: a key with a plain value.

I've seen document-oriented databases described as like key-value, but the value can be a structure, like a JSON object. Each "document" can have all, some, or none of the same keys as another.

Column-oriented seems to be very much like document-oriented in that you don't specify a structure.

So what is the difference between these two, and why would you use one over the other?

I've specifically looked at MongoDB and Cassandra. I basically need a dynamic structure that can change, but not affect other values. At the same time I need to be able to search/filter specific keys and run reports. With CAP, AP is the most important to me. The data can "eventually" be synced across nodes, just as long as there is no conflict or loss of data. Each user would get their own "table".

Theo

The main difference is that document stores (e.g. MongoDB and CouchDB) allow arbitrarily complex documents, i.e. subdocuments within subdocuments, lists with documents, etc. whereas column stores (e.g. Cassandra and HBase) only allow a fixed format, e.g. strict one-level or two-level dictionaries.

In that case, Mongo (document) can do what Cassandra (column) can. Why is a column store needed, then?

It's a trade-off between different features. With a column-oriented design, the storage engine can be much more efficient than a document-oriented storage engine can be. MongoDB has to rewrite the whole document on disk if it grows bigger, but Cassandra doesn't have to (this is a simplification, of course; there are lots of details to this). This makes Cassandra much faster when it comes to writing.

Correction on naming and understanding: Cassandra and HBase are column "family" stores, not column "oriented" stores (a.k.a. columnar stores). A column-family store keeps data by rows (i.e. it is row-oriented), whereas a column-oriented store keeps data by column. Ref: community.datastax.com/answers/6244/view.html

Community

In Cassandra, each row (addressed by a key) contains one or more "columns". Columns are themselves key-value pairs. The column names need not be predefined, i.e. the structure isn't fixed. Columns in a row are stored in sorted order according to their keys (names).

In some cases, you may have very large numbers of columns in a row (e.g. to act as an index to enable particular kinds of query). Cassandra can handle such large structures efficiently, and you can retrieve specific ranges of columns.
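As a rough illustration of retrieving a range of columns from a wide row, here is a minimal Python sketch. It is not Cassandra's API: `column_slice` and the `events_by_time` row are hypothetical names made up for the example, and a plain dict sorted on demand stands in for columns that a real store would already keep in sorted order.

```python
from bisect import bisect_left, bisect_right

def column_slice(row, start, end):
    """Return (name, value) pairs whose names fall in [start, end], inclusive."""
    names = sorted(row)  # assumption: a real store keeps these sorted already
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return [(name, row[name]) for name in names[lo:hi]]

# Hypothetical wide row acting as an index: column names are timestamps,
# values are event ids.
events_by_time = {
    "2011-03-01T10:00": "evt-17",
    "2011-03-01T09:00": "evt-12",
    "2011-03-02T08:30": "evt-23",
    "2011-02-28T23:59": "evt-03",
}

march_first = column_slice(events_by_time, "2011-03-01T00:00", "2011-03-01T23:59")
```

Because the column names sort chronologically, a range over names behaves like a time-range query on a single row.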

There is a further level of structure (not so commonly used) called super-columns, where a column contains nested (sub)columns.

You can think of the overall structure as a nested hashtable/dictionary, with 2 or 3 levels of key.

Normal column family:

row
    col  col  col ...
    val  val  val ...

Super column family:

row
      supercol                      supercol                     ...
          (sub)col  (sub)col  ...       (sub)col  (sub)col  ...
           val       val      ...        val       val      ...
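The nested-dictionary picture can be written out directly in Python. This is only a mental model, not how Cassandra stores or exposes data; the family names, row keys, and the `get_column` helper are all made up for illustration.

```python
# Normal column family: row key -> column name -> value
users = {
    "alice": {"email": "alice@example.com", "city": "Oslo"},
    "bob": {"email": "bob@example.com"},  # rows need not share columns
}

# Super column family: row key -> super column -> (sub)column -> value
addresses = {
    "alice": {
        "home": {"street": "1 Main St", "zip": "0150"},
        "work": {"street": "9 Dock Rd"},
    },
}

def get_column(family, row_key, *path):
    """Walk one key per level; return None if any level is missing."""
    node = family.get(row_key)
    for key in path:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node

city = get_column(users, "alice", "city")
home_zip = get_column(addresses, "alice", "home", "zip")
missing = get_column(users, "bob", "city")
```

The normal family uses two levels of key (row, column); the super column family adds a third.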

There are also higher-level structures - column families and keyspaces - which can be used to divide up or group together your data.

See also this Question: Cassandra: What is a subcolumn

Or the data modelling links from http://wiki.apache.org/cassandra/ArticlesAndPresentations

Re: comparison with document-oriented databases - the latter usually insert whole documents (typically JSON), whereas in Cassandra you can address individual columns or supercolumns, and update these individually, i.e. they work at a different level of granularity. Each column has its own separate timestamp/version (used to reconcile updates across the distributed cluster).
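To make the per-column timestamp idea concrete, here is a simplified Python sketch of reconciling two copies of a row by keeping the newest version of each column. The `merge_rows` function and the sample data are hypothetical, and the real reconciliation logic has more to it (deletions and ties between equal timestamps, for example) than this shows.

```python
def merge_rows(replica_a, replica_b):
    """Merge two copies of a row; each column maps to (timestamp, value)."""
    merged = dict(replica_a)
    for name, (ts, value) in replica_b.items():
        # Keep whichever version of the column carries the newer timestamp.
        if name not in merged or ts > merged[name][0]:
            merged[name] = (ts, value)
    return merged

# Two nodes that received different updates to the same row.
node_1 = {"email": (100, "old@example.com"), "city": (105, "Oslo")}
node_2 = {"email": (110, "new@example.com"), "phone": (102, "555-0100")}

row = merge_rows(node_1, node_2)
```

Since each column is versioned separately, the update to `email` on one node and the untouched `city` on the other merge without either overwriting the whole row.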

The Cassandra column values are just bytes, but can be typed as ASCII, UTF8 text, numbers, dates etc.

Of course, you could use Cassandra as a primitive document store by inserting columns containing JSON - but you wouldn't get all the features of a real document-oriented store.

A column family is like a table. A row is like a table row. Columns are sort of like database columns, except that they can be defined on the fly, so you may have a very sparsely-populated table in some cases, or you may have different columns populated in each row.

It depends on the database. In MongoDB (document-oriented) you can also update every single key.

If that's true, why is MongoDB classed as a document-oriented database whereas Cassandra is column-oriented? How are they different?

@Luke Column-oriented looks pretty much like a schema-less RDBMS, but besides its loose structure, the main difference is that it is not relational.

@user327961 But MongoDB is also like a schema-less RDBMS, and it's also not relational.

user327961

For "inserts", to use RDBMS terms, a document-based store is more consistent and straightforward. Note that Cassandra lets you achieve consistency through quorums, but that won't apply to all column-based systems, and it reduces availability. On a write-once, read-often system, go for MongoDB. Also consider it if you always plan to read the whole structure of the object. A document-based system is designed to return the whole document when you get it, and is not very strong at returning only parts of it.

Column-based systems like Cassandra are far better than document-based ones at "updates". You can change the value of a column without even reading the row that contains it. The write doesn't actually need to be done on the same server; a row may be spread across multiple files on multiple servers. On a huge, fast-evolving data system, go for Cassandra. Also consider it if you plan to have very big chunks of data per key and won't need to load all of them on each query. For "selects", Cassandra lets you load only the columns you need.
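One way to picture updating a column without reading its row is an append-only log of column writes that is only merged at read time. The Python sketch below is a toy model under that assumption, not a description of Cassandra's actual storage engine; `write`, `read`, `write_log`, and the counter used as a clock are all invented for the example.

```python
import itertools

_clock = itertools.count(1)  # stand-in for a real timestamp source
write_log = []               # append-only: (row_key, column, timestamp, value)

def write(row_key, column, value):
    """Record an update without reading the existing row."""
    write_log.append((row_key, column, next(_clock), value))

def read(row_key, wanted_columns):
    """Return the newest value for each requested column only."""
    newest = {}
    for key, column, ts, value in write_log:
        if key == row_key and column in wanted_columns:
            if column not in newest or ts > newest[column][0]:
                newest[column] = (ts, value)
    return {column: value for column, (ts, value) in newest.items()}

write("user:42", "email", "a@example.com")
write("user:42", "city", "Oslo")
write("user:42", "email", "b@example.com")  # blind update, no read first

result = read("user:42", {"email"})
```

The second `email` write never looks at the earlier one; the read picks the newest version and ignores columns that were not requested.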

Also consider that MongoDB is written in C++ and is at its second major release, while Cassandra needs to run on a JVM, and its first major release has only been a release candidate since yesterday (though the 0.x releases are already in production at major companies).

On the other hand, Cassandra's design was partly based on Amazon Dynamo, and it is built at its core to be a high-availability solution, but that has nothing to do with the column-based format. MongoDB scales out too, but not as gracefully as Cassandra.

What's wrong with a piece of software being written in C++ versus Java?

@Nayuki Now, I'm aware there are high-contention workloads where the lazy garbage collection of Java's memory management model will outperform C++'s "manual" management model in theory, but generally speaking, it's not usually difficult to outperform Java by writing an equivalent program in C++, at least as long as you disable Exceptions and RTTI. And if you make good use of stackless coroutines and resumable functions, well, I personally haven't seen Java beat my C++ yet.

Michael

I would say that the main difference is the way each of these DB types physically stores the data. With column types, the data is stored by columns which can enable efficient aggregation operations / queries on a particular column. With document types, the entire document is logically stored in one place and is generally retrieved as a whole (no efficient aggregation possible on "columns" / "fields").
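The storage-layout point can be sketched in Python by holding the same records two ways. This models a truly columnar layout; as the comment earlier in the thread notes, Cassandra and HBase are column-family stores and keep data by row, so treat this as an illustration of the general idea rather than of those systems. The field names and values are made up.

```python
# Row/document layout: each record is kept together.
rows = [
    {"user": "alice", "amount": 30, "country": "NO"},
    {"user": "bob", "amount": 12, "country": "SE"},
    {"user": "carol", "amount": 8, "country": "NO"},
]

# Column layout: one contiguous list per field.
columns = {
    field: [row[field] for row in rows]
    for field in ("user", "amount", "country")
}

# Aggregating one field has to visit every record in the row layout...
total_from_rows = sum(row["amount"] for row in rows)

# ...but only one list in the column layout.
total_from_columns = sum(columns["amount"])
```

Both totals are the same; the difference is how much unrelated data has to be read to compute them.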

The confusing bit is that a wide-column "row" can be easily represented as a document, but, as mentioned they are stored differently and optimized for different purposes.
