I’m pretty excited to announce the release UMongo 1.4. While it was a long time waiting, it was probably worth the wait! You can download UMongo from its home page. See the full list of close tickets on github.

You can now manage users with the v2.4 roles. It is much cleaner to work with than the shell’s addUser.

I am very excited about this feature, which enables you to work as efficiently as with the shell. One of the most common operation in the shell is to go back in history to a previous statement, modify it slightly and run it again. The GUI typically only remembers the last operation done in the form, which is much more limiting. UMongo now has a “magic wand” button that can re-spawn the form with all original inputs from a previous result! Once you’ve tried it, it will become a must-have feature.




UMongo is free and runs on Windows, OSX and Linux. Feedback is always very welcome. Do not hesitate to open bug reports or feature requests on github. Follow @antoinegirbal for future release announcements.
There has been a growing interest in using MongoDB as an in-memory database, meaning that the data is not stored on disk at all. This can be super useful for applications like:
That would be really neat indeed if it was possible: one could leverage the advanced querying / indexing capabilities of MongoDB without hitting the disk. As you probably know the disk IO (especially random) is the system bottleneck in 99% of cases, and if you are writing data you cannot avoid hitting the disk.
One sweet design choice of MongoDB is that it uses memory-mapped files to handle access to data files on disk. This means that MongoDB does not know the difference between RAM and disk, it just accesses bytes at offsets in giant arrays representing files and the OS takes care of the rest! It is this design decision that allows MongoDB to run in RAM with no modification.
This is all achieved by using a special type of filesystem called tmpfs. Linux will make it appear as a regular FS but it is entirely located in RAM (unless it is larger than RAM in which case it can swap, which can be useful!). I have 32GB RAM on this server, let’s create a 16GB tmpfs:
# mkdir /ramdata # mount -t tmpfs -o size=16000M tmpfs /ramdata/ # df Filesystem 1K-blocks Used Available Use% Mounted on /dev/xvde1 5905712 4973924 871792 86% / none 15344936 0 15344936 0% /dev/shm tmpfs 16384000 0 16384000 0% /ramdata
Now let’s start MongoDB with the appropriate settings. smallfiles and noprealloc should be used to reduce the amount of RAM wasted, and will not affect performance since it’s all RAM based. nojournal should be used since it does not make sense to have a journal in this context!
dbpath=/ramdata nojournal = true smallFiles = true noprealloc = true
After starting MongoDB, you will find that it works just fine and the files are as expected in the FS:
# mongo
MongoDB shell version: 2.3.2
connecting to: test
> db.test.insert({a:1})
> db.test.find()
{ "_id" : ObjectId("51802115eafa5d80b5d2c145"), "a" : 1 }
# ls -l /ramdata/
total 65684
-rw-------. 1 root root 16777216 Apr 30 15:52 local.0
-rw-------. 1 root root 16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root 5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root 16777216 Apr 30 15:52 test.0
-rw-------. 1 root root 16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root 40 Apr 30 15:52 _tmp
Now let’s add some data and make sure it behaves properly. We will create a 1KB document and add 4 million of them:
> str = ""
> aaa = "aaaaaaaaaa"
aaaaaaaaaa
> for (var i = 0; i < 100; ++i) { str += aaa; }
> for (var i = 0; i < 4000000; ++i) { db.foo.insert({a: Math.random(), s: str});}
> db.foo.stats()
{
"ns" : "test.foo",
"count" : 4000000,
"size" : 4544000160,
"avgObjSize" : 1136.00004,
"storageSize" : 5030768544,
"numExtents" : 26,
"nindexes" : 1,
"lastExtentSize" : 536600560,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 129794000,
"indexSizes" : {
"_id_" : 129794000
},
"ok" : 1
}
The document average size is 1136 bytes and it takes up about 5GB of storage. The index on _id takes about 130MB. Now we need to verify something very important: is the data duplicated in RAM, existing both within MongoDB and the filesystem? Remember that MongoDB does not buffer any data within its own process, instead data is cached in the FS cache. Let’s drop the FS cache and see what is in RAM:
# echo 3 > /proc/sys/vm/drop_caches
# free
total used free shared buffers cached
Mem: 30689876 6292780 24397096 0 1044 5817368
-/+ buffers/cache: 474368 30215508
Swap: 0 0 0
As you can see there is 6.3GB of used RAM of which 5.8GB is in FS cache (buffers). Why is there still 5.8GB of FS cache even after all caches were dropped?? The reason is that Linux is smart and it does not duplicate the pages between tmpfs and its cache… Bingo! That means your data exists with a single copy in RAM. Let’s access all documents and verify RAM usage is unchanged:
> db.foo.find().itcount()
4000000
# free
total used free shared buffers cached
Mem: 30689876 6327988 24361888 0 1324 5818012
-/+ buffers/cache: 508652 30181224
Swap: 0 0 0
# ls -l /ramdata/
total 5808780
-rw-------. 1 root root 16777216 Apr 30 15:52 local.0
-rw-------. 1 root root 16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root 5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root 16777216 Apr 30 16:00 test.0
-rw-------. 1 root root 33554432 Apr 30 16:00 test.1
-rw-------. 1 root root 536608768 Apr 30 16:02 test.10
-rw-------. 1 root root 536608768 Apr 30 16:03 test.11
-rw-------. 1 root root 536608768 Apr 30 16:03 test.12
-rw-------. 1 root root 536608768 Apr 30 16:04 test.13
-rw-------. 1 root root 536608768 Apr 30 16:04 test.14
-rw-------. 1 root root 67108864 Apr 30 16:00 test.2
-rw-------. 1 root root 134217728 Apr 30 16:00 test.3
-rw-------. 1 root root 268435456 Apr 30 16:00 test.4
-rw-------. 1 root root 536608768 Apr 30 16:01 test.5
-rw-------. 1 root root 536608768 Apr 30 16:01 test.6
-rw-------. 1 root root 536608768 Apr 30 16:04 test.7
-rw-------. 1 root root 536608768 Apr 30 16:03 test.8
-rw-------. 1 root root 536608768 Apr 30 16:02 test.9
-rw-------. 1 root root 16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root 40 Apr 30 16:04 _tmp
# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvde1 5905712 4973960 871756 86% /
none 15344936 0 15344936 0% /dev/shm
tmpfs 16384000 5808780 10575220 36% /ramdata
And that verifies it! :)
You probably want to use replication since a server loses its RAM data upon reboot! Using a standard replica set you will get automatic failover and more read capacity. If a server is rebooted MongoDB will automatically rebuild its data by pulling it from another server in the same replica set (resync). This should be fast enough even in cases with a lot of data and indices since all operations are RAM only :)
It is important to remember that write operations get written to a special collection called oplog which resides in the local database and takes 5% of the volume by default. In my case the oplog would take 5% of 16GB which is 800MB. In doubt, it is safer to choose a fixed oplog size using the oplogSize option. If a secondary server is down for a longer time than the oplog contains, it will have to be resynced. To set it to 1GB, use:
oplogSize = 1000
Now that you have all the querying capabilities of MongoDB, what if you want to implement a large service with it? Well you can use sharding freely to implement a large scalable in-memory store. Still the config servers (that contain the chunk distribution) should be disk based since their activity is small and rebuilding a cluster from scratch is not fun.
RAM is a scarce resource, and in this case you definitely want the entire data set to fit in RAM. Even though tmpfs can resort to swapping the performance would drop dramatically. To make best use of the RAM you should consider:
Sweet, you can now use MongoDB and all its features as an in-memory RAM-only store! Its performance should be pretty impressive: during the test with a single thread / core I was achieving 20k writes per second, and it should scale linearly over the number of cores.
There are situations where your documents have many different fields and you want to be able to query efficiently on any of them. Say you have a document describing a person:
{
_id: 123,
firstName: "John",
lastName: "Smith",
age: 25,
height: 6.0,
dob: Date,
eyes: "blue",
sign: "Capricorn",
...
}
Then you may want to find all people with blue eyes, a certain height, by last name, etc. Say you have dozens of such properties, or you may not even know in advance what they will be, or they vary on a per document basis… How can you use indexing to quickly resolve these queries? Obviously it would be expensive and impractical to create an index on each of those fields.
Let’s start with the schema design and leverage the power of JSON by using a list to store all the properties:
{
_id: 123,
props: [
{ n: "firstName", v: "John"},
{ n: "lastName", v: "Smith"},
{ n: "age", v: 25},
...
]
}
The index to create here is a compound on the name and value fields within the list. To illustrate this, let’s create millions of documents with dummy properties (“prop0” through “prop9”) that take a random value between 0 and 1000.
> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { arr.push({n: "prop" + j, v: Math.floor(Math.random() * 1000) }) }; db.generic.insert({props: arr}) }
> db.generic.findOne()
{
"_id": ObjectId("515dd3b4f0bd676b816aa9b0"),
"props": [
{
"n": "prop0",
"v": 40
},
{
"n": "prop1",
"v": 198
},
...
{
"n": "prop9",
"v": 652
}
]
}
> db.generic.ensureIndex({"props.n": 1, "props.v": 1})
> db.generic.stats()
{
"ns": "test.generic",
"count": 5020473,
"size": 1847534064,
"avgObjSize": 368,
"storageSize": 2600636416,
"numExtents": 19,
"nindexes": 2,
"lastExtentSize": 680280064,
"paddingFactor": 1,
"systemFlags": 1,
"userFlags": 0,
"totalIndexSize": 1785352240,
"indexSizes": {
"_id_": 162898624,
"props.n_1_props.v_1": 1622453616
},
"ok": 1
}
As you can see the index size is quite large at 1.6GB, since we are storing both the name of the property and its value in the index. That is the price of getting a generic index! Now on to querying… Let’s find documents where “prop1” is 0:
> db.generic.findOne({"props.n": "prop1", "props.v": 0})
{
"_id": ObjectId("515dd4298bff7c34610f6ae8"),
"props": [
{
"n": "prop0",
"v": 788
},
{
"n": "prop1",
"v": 0
},
...
{
"n": "prop9",
"v": 788
}
]
}
> db.generic.find({"props.n": "prop1", "props.v": 0}).explain()
{
"cursor": "BtreeCursor props.n_1_props.v_1",
"isMultiKey": true,
"n": 49822,
"nscannedObjects": 5020473,
"nscanned": 5020473,
"nscannedObjectsAllPlans": 5020473,
"nscannedAllPlans": 5020473,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 252028,
"indexBounds": {
"props.n": [
[
"prop1",
"prop1"
]
],
"props.v": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
]
},
"server": "agmac.local:27017"
}
This does not give the expected result: it matches 50k records and takes 252s! The reason is that, as per the query language, n=”prop1” and v=0 do not need to be within the same subdocument as long as they are within the same document. Basically it considers all combinations of “n” and “v” within a document and matches more than it should. While you can debate this choice for the query language, the way to force a specific subdocument to match is to use “$elemMatch”:
> db.generic.findOne({"props": { $elemMatch: {n: "prop1", v: 0} }})
Now, how is the index used and how long does it take MongoDB v2.2 to find the documents? let’s see:
> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
"cursor": "BtreeCursor props.n_1_props.v_1",
"isMultiKey": true,
"n": 5024,
"nscannedObjects": 5020473,
"nscanned": 5020473,
"nscannedObjectsAllPlans": 5020473,
"nscannedAllPlans": 5020473,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 278784,
"indexBounds": {
"props.n": [
[
"prop1",
"prop1"
]
],
"props.v": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
]
},
"server": "agmac.local:27017"
}
It now returns the proper 5024 documents… But the result is just as slow! As you can see in the explain output, the reason is that the range on the “v” field is still open-ended. Why is that? Let’s back up for a second: if not using $elemMatch, then all combinations of fields can be a match. It would be impossible to build an index to back it up since there would be way too many combinations. So the choice was made for MongoDB to put the values of a subdocument in the same btree bucket and to ignore combinations (basically the behavior of $elemMatch). But so why is the $elemMatch query still slow then? That is due to a shortcoming that got fixed in v2.4, see SERVER-3104. Upgrading to 2.4, you will see this:
> db.generic.find({"props": { $elemMatch: {n: "prop1", v: 0} }}).explain()
{
"cursor": "BtreeCursor props.n_1_props.v_1",
"isMultiKey": true,
"n": 5024,
"nscannedObjects": 5024,
"nscanned": 5024,
"nscannedObjectsAllPlans": 5024,
"nscannedAllPlans": 5024,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 21,
"indexBounds": {
"props.n": [
[
"prop1",
"prop1"
]
],
"props.v": [
[
0,
0
]
]
},
"server": "agmac.local:27017"
}
Alright we are down to 21 milliseconds, that’s more like it!! To do a range query on a given property, it works properly with $elemMatch:
db.generic.find({ props: { $elemMatch: {n: "prop1", v: { $gte: 6, $lte: 9 } }}})
To do “and”/”or” queries use the $all/$in operators respectively. Note that in the case of $all only the 1st item will be used to lookup in the btree, so put the most restrictive value first if you know it.
db.generic.find({"props": { $all: [{ $elemMatch: {n: "prop1", v: 0} },{ $elemMatch: {n: "prop2", v: 63} } ]}})
Another approach to the problem is to just put the “property: value” combination as-is inside a subobject of a list. This solution works both with v2.2 and v2.4. Let’s create documents like this:
> for (var i = 0; i < 5000000; ++i) { var arr = []; for (var j = 0; j < 10; ++j) { var doc = {}; doc["prop" + j] = Math.floor(Math.random() * 1000); arr.push(doc) }; db.generic2.insert({props: arr}) }
> db.generic2.findOne()
{
"_id": ObjectId("515e5e6a71b0722678929760"),
"props": [
{
"prop0": 881
},
{
"prop1": 47
},
...
{
"prop9": 717
}
]
}
The index should be on the list itself since the property name varies:
> db.generic2.ensureIndex({props: 1})
> db.generic2.stats()
{
"ns": "test.generic2",
"count": 5000000,
"size": 1360000032,
"avgObjSize": 272.0000064,
"storageSize": 1499676672,
"numExtents": 19,
"nindexes": 2,
"lastExtentSize": 393670656,
"paddingFactor": 1,
"systemFlags": 1,
"userFlags": 0,
"totalIndexSize": 2384023488,
"indexSizes": {
"_id_": 162269072,
"props_1": 2221754416
},
"ok": 1
}
As you can see the index is even larger than solution #1 by almost 40%, since the BSON subdocuments themselves are stored in the index as BLOBs. Now on to querying:
> db.generic2.find({"props": {"prop1": 0} }).explain()
{
"cursor": "BtreeCursor props_1",
"isMultiKey": true,
"n": 4958,
"nscannedObjects": 4958,
"nscanned": 4958,
"nscannedObjectsAllPlans": 4958,
"nscannedAllPlans": 4958,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 0,
"nChunkSkips": 0,
"millis": 15,
"indexBounds": {
"props": [
[
{
"prop1": 0
},
{
"prop1": 0
}
]
]
},
"server": "agmac.local:27017"
}
The result is even faster than solution #1 at 15ms! But one caveat is that the predicate must always use a full JSON object. To match “prop1” between 0 and 9, the query would be:
> db.generic2.find({"props": { $elemMatch: { $gte: {"prop1": 0}, $lte: {"prop1": 9} }})
Slightly awkward. Also if there are other fields in the subobject, those would have to be part of the JSON predicate when matching (remember that the subobject is just a BLOB for MongoDB). Now say you want an open range where “prop1” is greater than 6, you still should specify an upper bound otherwise it will match many more documents than expected (you can use MaxKey as a bound):
db.generic2.find({"props": { $elemMatch: {$gte: {"prop1": 6}, $lt: {"prop1": MaxKey } }}})
One more caveat: you cannot index independently just the values. For example with solution #1, if you want to find any document that has a value “10” for any property, you can just create an index on “props.v”. This is not possible with solution #2 since the field name varies.
As a conclusion, you can see that MongoDB v2.4 now offers a straightforward and efficient way to build a generic index over many properties. You are now free to index and query all of the many properties of your Big Data project :)
Doing counts in MongoDB has always been a slow operation even on an indexed field… until now. To do the count, it would iterate through every single element in the index and try to match the key, giving a response time of several seconds for just a million documents. It would be especially slow on values with high cardinality, meaning that the count is high.
MongoDB v2.4 brings 2 improvements:
The details can be seen in the following ticket: SERVER-1752
Let’s test this by creating a collection with about 10 million elements containing just a field “a” which value is between 0 and 9. Also this value should be indexed.
> for (var i = 0; i < 10000000; ++i) { db.countTest.insert({a: i % 10}) }
> db.countTest.ensureIndex({a:1})
> db.countTest.stats()
{
"ns": "test.countTest",
"count": 10079572,
"size": 362864652,
"avgObjSize": 36.0000059526337,
"storageSize": 563376128,
"numExtents": 17,
"nindexes": 2,
"lastExtentSize": 155209728,
"paddingFactor": 1,
"systemFlags": 1,
"userFlags": 0,
"totalIndexSize": 581199136,
"indexSizes": {
"_id_": 327620496,
"a_1": 253578640
},
"ok": 1
}
Let’s do the count. With MongoDB 2.2.3, this takes almost 2 seconds…
agmac(mongod-2.4.1) test> db.countTest.count({a: 1})
1007957
agmac(mongod-2.2.3) test> start = new Date(); db.countTest.count({a: 1}); print (new Date() - start);
1729
Now let’s see how MongoDB v2.4 fares… time is down to under 200ms!
agmac(mongod-2.4.1) test> start = new Date(); db.countTest.count({a: 1}); print (new Date() - start);
169
That is great news, as one of MongoDB’s shortcomings is mostly gone. But note that this new optimization is only applied when the predicate is entirely covered by an index with equality or exact range.
Make sure you read up on the previous tip to understand the following post!
Faceted search is any type of search that can involve a wide variety of predicates and sort criteria. It typically involves a web UI where the user can pick from many fields (user, account id, date range, etc) and also sort on most of those fields. It is a hard problem to solve for a database, since in theory you would have to create all the possible combinations of compound indices.
The goals are:
An important first step is to identify the fields that are required for any search. Those will be the anchor to your indices. A typical example would be customer id, account id, project id, etc.
The 2nd step is to define an acceptable default time range that the search should use (e.g. 6 months). Even though the search may be applied since the beginning of times, you should define how far you want to look back by default, and that will make the default screen not get slower over time.
Last, you need to decide which fields are actually useful to sort on. You will need at least one index per sortable field, so while 10 fields is doable, 100 is not. Any extra index will take more RAM, storage and slow down the write operations.
All above points may force you to change the available features of the system and can make project managers and users slightly disappointed. But they will thank you when the page loads in milliseconds or seconds rather than minutes!
The index should be structured as follows:
If sorting on “price“ for a customer id “cId“, the index may look like:
{ cId:1 , month: 1, price: 1, field1: 1, field2: 1, … }
To look at the 100 highest prices over 3 months for a few customers, the query would look like:
find({ cId: { $in: [1, 2, … N] }, month: { $in: [“2012-10”, “2012-11”, “2012-12”] }).sort({ price: -1 }).limit(-100)
Note the negative limit is important. You would expect the server to just have to merge sort a few discrete branches of the btree, as illustrated below.

Unfortunately there is one big hurdle to this working well, and it is described in great lengths here. The bottom line is that you should revert to one of the alternate solutions, but most of them are not applicable when dealing with faceted search:
One solution that works really well is the unwind the “$in“ predicates into multiple queries that cover all “required” fields combination (e.g. customer id and month). This may seem problematic but in reality it rarely leads to more than a dozen queries. For example:
find({ cId: { 1, month: “2012-10” }).sort({ price: -1 }).limit(-100)
find({ cId: { 1, month: “2012-11” }).sort({ price: -1 }).limit(-100)
…
find({ cId: { 2, month: “2012-10” }).sort({ price: -1 }).limit(-100)
…
Each query will be very fast, under a millisecond even if you use extra predicates (especially if covered by the index). Consequently even complex search should be able to fetch 100 items for each combination within a few milliseconds overall, and then the client code can merge sort those results very quickly. A little bit of pain, yes, but yielding great rewards.
Pagination is slightly tricky… But there are very efficient and accurate ways to implement it. It will probably be part of another post.
Faceted search is a difficult problem to implement efficiently, and most databases are not good at it. With MongoDB you will encounter a few hurdles that should be removed during index API refactoring of v2.6. In the meantime you should use the above tips!
Problem
A very common feature of any social site (and other types of application) is to find the latest activity of one’s friends. Another related example is to fetch the latest data for people or topics that one follows. You may think that databases would make it easy to implement such a feature, but unfortunately it is far from being easy.
Many new NoSQL stores do not really support secondary indexes, which are necessary for this feature. In the case of MongoDB it is possible to do it efficiently but it presents many pitfalls.
The first step is to fetch the list of the user’s friends ids. The friendship relationship is typically either embedded into the user document (as a list of user ids) or in a mapping table (SQL style). In both cases it is assumed the application can quickly put together the list of friends’ ids.
The intuitive query should look like this:
db.posts.find({ userId: { $in: [12, 24, 56, …] } }).sort({ date: -1 }).limit(100)
Notes:
For such a query the best index is a compound “{ userId: 1, date: 1}“. Note that it is not important whether it is ascending or descending on the date.
What is expected? Ideally MongoDB will take the following steps:
The following diagram shows what should ideally happen.

One debatable design decision was to not pass the “limit“ value into the query protocol. Instead only the “batchSize“ is communicated, which tells the server how many results to return in the next batch but omits to say that no more results are needed.
If you set a limit of 100 then the “batchSize“ gets set to 100 but the server assumes you may ask for more than 100 records, and keeps the cursor open on the server. In turn it makes it impossible for the server to reduce the amount of records sorted: it must sort in-memory *all* your friends’ documents *since the beginning of times*.
The fix is to use a negative limit: if set to -100 then it will be passed to the server as a negative “batchSize“ value. It tells the server: send me only 1 batch with a maximum of 100 documents and do not keep the cursor open. One caveat to this solution is that a batch cannot exceed 16MB, so the 100 documents must fit within 16MB (which is likely).
With MongoDB 2.0 (which hopefully you’re not using anymore at this time) the behavior still won’t be the one expected. It will just continue to disregard the limit and always resort to sort all documents since day 1. As a consequence the queries will get slower and slower over time, which is a big no-no.
The following diagram shows the bad behavior.

A limited fix was implemented in 2.2. The following query will work as expected and return within milliseconds:
db.posts.find({ userId: { $in: [12, 24, 56, …] } }).sort({ date: -1 }).limit(-100)
So you may think that it is all good… unfortunately a big limitation is that any extra predicate may turn off the optimization. Any extra predicate must be an equality (or a “$in“) on a field that is part of the index to the left side of the field sorted on (“date“ in this case). If you want to only see posts that have specific properties (e.g. “private: False“) you must be very careful how you build the index:
db.posts.find({ userId: { $in: [12, 24, 56, …], private: False } }).sort({ date: -1 }).limit(-100)
It must use an index on “{ userId: 1, private: 1, date: 1 }“. A different type of predicate, or against a field not covered by the correct portion of the index will put you back in the bad slow case.
This issue will be fixed as part of the 2.6 index internal API refactoring, which you should look forward to!
One solution is to also specify a date range, which will properly limit how much data gets sorted.
For example:
db.posts.find({ userId: { $in: [12, 24, 56, …] }, date: { $gt: x, $lt: y } }).sort({ date: -1 }).limit(-100)
The obvious problem is that you don’t know what date range will yield enough / too much results… You can first query for the past day, then go back in time with increasing ranges if not enough data is returned.
To properly make full use of the index, a solution is to split the query into many queries, one per friend id. In that case MongoDB can properly iterate through the sorted index without doing extra work.
The queries would be:
db.posts.find({ userId: 12, … } }).sort({ date: -1 }).limit(-100)
db.posts.find({ userId: 24, … } }).sort({ date: -1 }).limit(-100)
…
The client code will then do the sort of the separate results, which is fast to do. Any predicate can be added to filter the data further, and should be part of the right side of the “date“ in the index.
Unfortunately this solution only works well if the number of friends is fairly low. Each query should return within a millisecond, but with 100 friends you can be looking at several milliseconds already. The good news is that the speed will remain the same over time, but there is some extra overhead for sure. Overall it is a good solution if the number of friends under 100 on average.
An alternative strategy is to duplicate the data into each of the friend’s own data. When a user posts an update, it is duplicated into 1 document for each of the friends. Consequently the problematic query becomes easy:
db.posts.find({ userId: 7 }).sort({ date: -1 }).limit(-100)
The obvious downside is the large duplication of content. If a user has 10k followers, the data will generate 10k documents written to the database.
In conclusion, either:
Expect MongoDB 2.6 to give you full intuitive support for this! One ticket to watch is SERVER-3310.
When planning my trip to Germany to attend my sister’s wedding in Berlin, I was faced with the absence of any direct flight from SF. Since I absolutely hate a stop over after 10h of closed cabin life, I decided to turn this into a roadtrip from Munich and drive around the famous autobahn. The last time I had the opportunity to drive on it was when I was 16, driving a diesel Mercedes station wagon packed to the rim with my grandparents, other family members and luggage - not an ideal speeding situation. I was especially looking forward to it since I’ve always found the US speed limits to be way too low, making people asleep while giving one a false sense of safety.
I was originally planning to get a nice car, but time slipped away and I showed up at the rental office expecting something like a small Peugeot. Fortunately the agency assumed that my american self (resident) could only drive an automatic car, which was not available, and proposed to upgrade me if I was willing to drive a stick.. sure no problem. What cars do you have? A mini cooper, a chevy, an opel, a bmw, a peu… stop right there, I’ll take the bimmer of course. I am in Munich after all. Turned out to be a nice and recent 3-series, albeit with the small 1.8 engine.
After much delay I finally got myself to see the no speed limit sign when leaving Munich for Stuttgart. The car sluggishly worked it’s way from 120 to 180kph since it’s at a dead spot in the car’s gears. Over 180 it wakes up and things get interesting. My first feeling was a great sense of freedom, which materialized into a huge smile on my face. I felt like in a video game, it’s just you and the road, no rules. I quickly realized that it’s not just about stepping on the gas: straight lines do not last forever and soon you find yourself at 200kph in a turn where you do not see what is coming to you further than 50 meters. It takes some amount of courage to maintain the speed. Back in the straight line, I finally push the car to the limit, which seems to be around 230kph based on the current wind. At that speed, you are not comfortable: any input into the wheel must be very gentle or you will experience a large deviation on the road - you want to be smooth with both hands on the wheel. Also the car controls become vague and you feel like the car is drifting when entering a turn. That was experienced with a bmw which is praised for its chassis, so I can only imagine what it would feel like in a small hatchback. At 230kph I wished the car was lower, with stiffer suspension and wider tires (and maybe a few wings).
Now that was the fun part of the experience. Unfortunately the autobahn is not just unrestricted driving. The 1st huge issue is: trucks. I have never seen that many trucks in my life. It was pretty much a continuous line of trucks for about 500km. There are as many trucks on the autobahn as there are yellow cabs in NY. They go about 100kph max and mostly stick to the right lane, but not always - they are allowed to overtake when there are 3 lanes. Now a truck overtaking another truck is a process that takes 5min because the overtaking truck thinks it’s going much faster until it gets into the wind and realizes why the other truck was slow. Meanwhile the small Opel decides it had enough and overtakes the truck race. It enters the fast lane at little over 100kph, usually in a turn and without checking its mirrors much. It forces you to slam on the brakes from 200kph in a turn, and it’s rather scary. Now for the 2nd problem: constructions. I don’t know what the deal is, but there is construction going on every 10min or so. And this is not some kind of “Putting Germany to work during recession” gig, it’s just that there always are constructions, cleanups, gardening, etc going on, with some of their signs looking rather permanent. All in all probably 20% of the time was unlimited, 60% at 120kph, and 20% below 80mph.
Now for some interesting development, it turns out that most other people drive slowly. I was expecting to have my tourist mobile overtaken by dozens of speeding Germans. Not so much, I was only overtaken twice in 6 hours, with most people going way slower. Now why would that be? I think one of the reason stems from what I stated before, that it is not a very comfortable daily experience in most cars. Another reason is that there are not many nice cars around - the nicest one I ever saw was an M3, no Porsches or other heavy armed german cars. I guess a lot of them are destined to the US roads where they can only be driven to a glorious 65mph.. The last reason I think is the fuel consumption. The more speed, the more wind, the higher the revs, and the less efficient it all becomes. At 1.6 euros per liter (about 6.4 euros per gallon, more than double US prices) the bill builds up really fast. I poured close to 200 euros worth of gas into my small 1.8l engine within a day, ooch. With the lame economy and the green trend, it may explain why people stick to reasonable speeds.
So in conclusion, what is my take on no speed limits? Well taking into account all the above, I think it would have been a much better experience with less construction, fewer trucks and a fourth lane. In terms of risks, speeding is a lot like a gun: put it in the wrong hands and it becomes really unsafe. In the right hands it can actually make you safer, since you pay full attention to what is going on. I believe that most accidents come from lack of attention rather than speed. Also from a pure productivity point of view, it does make your trips potentially much faster. And Let’s not forget that cars have been engineered and tested at those speeds for years. Last but not least, it gives you a great sense of freedom and some much deserved fun… Let’s turn some rather straight and lonely roads into freedom highways. Keep people entertained and awake. Going from SF to LA or Vegas could become a much shorter and enjoyable affair.
cheers,
AG
I’ve come across several users who experience poor performance when using Map/Reduce with MongoDB version 1.8 and older, and it turns out that in many cases it is easily fixable. Today I will focus on the “sort” parameter of the MapReduce command, which is often overlooked but critical.
Here is how the M/R works in the general case, assuming there is no query filter:
Now if you have many documents, and the key distribution is fairly random, it can result in following: all docs get inserted to map but it is not useful for reduction, and most documents will end up in the “inc” collection on disk that needs to be read back in order. The particular issue to understand is that since mongod has no idea what key you will use to emit, it cannot presort the data to make it efficient.
To fix this issue:
This can result in 100x performance in some cases. Note that in mongo 1.9 and above, some works has been done to improve performance:
But in any case mongod is still not aware of your emit key, so use sort!
cheers
AG