Possibility of duplicate Mongo ObjectId's being generated in two different collections?

mongodb database nosql

Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?

Without getting too specific, the reason I ask is that in an application I'm working on we show public profiles of elected officials who we hope to convert into full-fledged users of our site. We have separate collections for users and for the elected officials who aren't currently members of our site. Various other documents contain pieces of data about the elected officials, and they all map back to the person using their elected official ObjectId.

After creating the account we still highlight the data that's associated with the elected official, but they are now also part of the users collection, with a corresponding users ObjectId to map their profile to interactions with our application.

We began converting our application from MySQL to Mongo a few months ago, and while we're in transition we store the legacy MySQL id for both of these data types. We're also now starting to store the elected official Mongo ObjectId in the users document to map back to the elected official data.

I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.

Thanks for your insight.

Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.

See mongodb.org/display/DOCS/Object+IDs

I've read that page before. Ironically enough I actually linked to the same page in a previous answer. And I did see the "reasonably high probability of being unique" disclaimer but was unsure if the collection being inserted into played any factor in this. I guess what I'm unsure of is what exactly the 2 byte Process ID portion of the ObjectId really represents. If it has something to do with the collection then there would be uniqueness between two different documents created at the exact same time on the exact same machine in different collections.

The 2-byte process ID is the pid of the process generating the ObjectId. As an example, here is the code pymongo uses to generate ObjectIds: github.com/mongodb/mongo-python-driver/blob/master/bson/…

One gotcha I ran into is batch inserting. I was building batches of 10k documents, and colliding every time because the counter portion rolled over every time.

I know it's been a while, but 10K documents would not roll over the counter. The counter part is three bytes, not three digits. That's over 16 million.
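To make the arithmetic in the comment above concrete, here is a minimal sketch of a 3-byte wrapping counter. The helper name `counter_values` and the chosen start value are made up for illustration; real drivers keep this counter internally.

```python
COUNTER_BITS = 24                  # the counter portion is 3 bytes
COUNTER_SPACE = 1 << COUNTER_BITS  # 16,777,216 distinct values


def counter_values(start: int, n: int) -> list[int]:
    """Return n consecutive values of a 3-byte counter that wraps at 2^24."""
    return [(start + i) % COUNTER_SPACE for i in range(n)]


# Start just below the wrap point so the batch crosses the rollover.
batch = counter_values(start=COUNTER_SPACE - 5, n=10_000)

# Even across the rollover, 10,000 consecutive values never repeat,
# because a repeat needs at least 2^24 increments.
print(len(set(batch)) == len(batch))
```

So a batch of 10k documents cannot collide because of the counter alone; something else (for example a non-atomic or reset counter) would have to be involved.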

Raj Advani

Short Answer

Just to add a direct response to your initial question: a duplicate is theoretically possible, but if you use BSON ObjectId generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.

Long Answer

The BSON ObjectIds generated by MongoDB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers are generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.

So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):

Before this discussion, recall that the BSON Object ID consists of:

[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
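As a rough illustration of that layout, the sketch below splits a 24-character hex string into the four fields. The function name `split_legacy_object_id` and the sample ID are made up for this example, and reading every field as big-endian is an assumption (it matches how the timestamp and counter are usually described, but individual drivers may differ for the process ID).

```python
from datetime import datetime, timezone


def split_legacy_object_id(hex_id: str) -> dict:
    """Split a 12-byte ObjectId (as hex) into timestamp, machine, pid, counter."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")       # 4 bytes: seconds since epoch
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                  # 3 bytes: machine hash
        "pid": int.from_bytes(raw[7:9], "big"),     # 2 bytes: process ID (byte order assumed)
        "counter": int.from_bytes(raw[9:12], "big"),  # 3 bytes: counter
    }


# Sample value invented for illustration only.
parts = split_legacy_object_id("4d3ed089fb60ab534684b7e9")
for name, value in parts.items():
    print(name, value)
```

Note that nothing in these 12 bytes refers to a collection, which is why the collection being inserted into plays no part in the generated value.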

Here are the three possibilities, so you can judge for yourself how likely it is to get a dupe:

1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.

2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.

3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.

These are the three scenarios to watch out for. Scenarios 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.

Doesn't the 3-byte counter mean a capacity of 2^24 = 16,777,216 documents inserted per second, per process, per machine?

You're absolutely right, I accidentally halved the number of bits -- answer has been amended.

Since I just ran into this, let me add that some drivers (e.g. C), though they use an incrementing counter, do not increment it atomically, so from time to time they generate the same OID due to a race condition.

You completely skipped over the fact that in 136 years you'd have another shot to generate the same ObjectId you had before as long as the machine hash, process ID, and counter all turn out the same

@jamylak We will take care of that problem when it becomes urgent (said those people who standardized YYMMDD date formats in the 70s)
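For reference, the 136-year figure in the comments follows from the 4-byte timestamp. A quick check, assuming the timestamp is treated as an unsigned 32-bit count of seconds since the Unix epoch:

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_BITS = 32                       # the timestamp portion is 4 bytes
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days

# Number of years before the seconds field wraps back to zero.
wrap_years = 2 ** TIMESTAMP_BITS / SECONDS_PER_YEAR

# Moment at which the field wraps, counted from the Unix epoch.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
wrap_moment = epoch + timedelta(seconds=2 ** TIMESTAMP_BITS)

print(f"{wrap_years:.1f} years")
print(wrap_moment.isoformat())
```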

mstearn

ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.

Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.

In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification

DenverMatt

In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.

The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.

The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.

There are three solutions to this:

1) You can unset() the _id field from the array.

2) You can reinitialize the entire array with array() each time you loop through your dataset.

3) You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).

My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
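The pattern described in this answer can be sketched in Python as an analogy. `ToyCollection` is a hypothetical stand-in written for this example, not a real driver class; it only mimics the one behavior under discussion, namely that insert writes the generated _id back into the structure it was given.

```python
import itertools


class ToyCollection:
    """Hypothetical in-memory collection; NOT a real driver API."""

    _ids = itertools.count(1)

    def __init__(self):
        self.docs = {}

    def insert(self, doc: dict) -> None:
        if "_id" not in doc:
            doc["_id"] = next(self._ids)  # the generated _id is written into the caller's dict
        if doc["_id"] in self.docs:
            raise KeyError(f"duplicate key: {doc['_id']}")
        self.docs[doc["_id"]] = dict(doc)


def load_reusing_one_dict(coll: ToyCollection, names: list[str]) -> None:
    row = {}
    for name in names:
        row["name"] = name
        coll.insert(row)  # from the second iteration on, row still carries the old _id


def load_with_fresh_dict(coll: ToyCollection, names: list[str]) -> None:
    for name in names:
        coll.insert({"name": name})  # fresh dict each time, so a new _id is generated


good = ToyCollection()
load_with_fresh_dict(good, ["a", "b", "c"])
print(len(good.docs))

bad = ToyCollection()
try:
    load_reusing_one_dict(bad, ["a", "b", "c"])
except KeyError as err:
    print("failed:", err)
```

Reinitializing the structure on each iteration (or removing the _id key before the next insert) corresponds to the first two fixes listed above.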

See php.net/manual/en/mongocollection.insert.php: "Note: If the parameter does not have an _id key or property, a new MongoId instance will be created and assigned to it. This special behavior does not mean that the parameter is passed by reference." It's a feature, not a bug; it's meant to be that way.

I don't understand the scenario you're describing here; perhaps you could show some code that exhibits the bug?

slacy

There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.

One can easily test this in the mongo shell:

MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_  dup key: { : "abc" }

So absolutely don't rely on _id values being unique across collections. And since you don't control the ObjectId generation function, don't rely on that for cross-collection uniqueness either.

It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.

Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection? They would share the same _id space and thus would be guaranteed unique. Switching from "prospective" to "registered" would be a simple matter of flipping a field.

I think you may be confusing the _id field in general with the ObjectId type. The ObjectId type was specifically designed for uniqueness, with the goal that it could be treated like a UUID. However, the _id field can be any type, and it only guarantees uniqueness within a single collection if you use other types for the key, such as the string in your example.

@mstearn (Nitpick) The notion that a UUID is inherently unique is flawed. A good UUID/sequence generation strategy may make the collision unlikely but it needs to take unique generators (e.g. unique locations) into account to guarantee absolute uniqueness between the generators. Granted, most have probabilities so low that it is of no applicable concern :-) GUID. One issue that does come up though, is the duplication/copying of ids instead of a new generation.

@pst: MongoDB's ObjectIds include both the pid of the generating process and some bytes based on a hash of the hostname. These, combined with a timestamp and an incrementing counter, make it extremely likely that any two separately generated ObjectIds will be globally/universally unique. Of course, as you said, that only applies to freshly generated ObjectIds.

I'm referring to the ObjectId type. Not specifying a string value for '_id'. Of course they're going to be the same and conflict if you set them to the exact same string manually.

Yeah, I clarified things in my post. _id's are certainly not unique, and since you don't control the ObjectId generation function, it's probably a bad idea to rely on it.