Two weeks ago I took a three-day MongoDB course at CAPSiDE. Really interesting. If you are planning to use this technology, I strongly recommend taking one of these courses if you can. During the classes I took a few notes. Nothing structured, but I'll comment here on what I think may be interesting:

[--nojournal]

Interesting option if you are playing with MongoDB on your laptop and you don't want it to preallocate the whole journal. You may be interested in reducing the size of the oplog too (the --oplogSize option, in MB). By default it takes 5% of your free disk space, and if you have 1TB it can take quite a few minutes to preallocate 50GB just "for tests". You can set it to 50MB without problems.
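To put numbers on that, here is a minimal Python sketch. The 5% figure is the one from my notes; the function names, the /tmp path and the "test-rs" replica set name are just made up for the example. The flags themselves (--dbpath, --nojournal, --replSet, --oplogSize) are regular mongod options, but check that your version still accepts them before relying on this.

```python
def default_oplog_size_gb(free_disk_gb, fraction=0.05):
    """Oplog size preallocated by default, using the 5% figure from the notes."""
    return free_disk_gb * fraction


def laptop_mongod_args(dbpath, journal=False, replset=None, oplog_size_mb=50):
    """Build a mongod command line for a throwaway test instance (hypothetical helper)."""
    args = ["mongod", "--dbpath", dbpath]
    if not journal:
        # Skip the journal (and its preallocation) on a test box.
        args.append("--nojournal")
    if replset is not None:
        # The oplog only exists on replica set members, so the size
        # option is only added when a replica set name is given.
        args += ["--replSet", replset, "--oplogSize", str(oplog_size_mb)]
    return args


if __name__ == "__main__":
    print(default_oplog_size_gb(1000), "GB preallocated by default on 1TB free")
    print(" ".join(laptop_mongod_args("/tmp/mongo-test", replset="test-rs")))
```

The helper only builds the argument list; it does not start anything, so you can inspect the command before running it.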

[mongo --eval]

We have the mongo command to access the console, but what if we want to write a bash script and retrieve information from mongo? We can use something like this:

mongo --eval 'printjson(db.adminCommand( {shutdown:1}))'

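From my point of view it is ugly as hell. I don't like it. I would rather use a driver such as pymongo, or the equivalent in Perl. You can still use --eval in bash scripts to retrieve information about the status of MongoDB, though. (Careful: the command above actually shuts the server down; swap in something like db.serverStatus() if you only want to read the status.)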
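If you are stuck with --eval but still want structured data out of it, one option is to wrap the call from Python and parse what the shell prints. This is only a sketch under a few assumptions: the mongo shell is on the PATH, a server is listening on localhost:27017, and the helper names are mine. It relies on the shell's --quiet and --host options and on JSON.stringify to get clean JSON on stdout.

```python
import json
import subprocess


def build_eval_command(javascript, host="localhost:27017"):
    """Return the argv for a one-shot `mongo --eval` call."""
    # --quiet suppresses the shell banner so stdout only carries our output.
    return ["mongo", "--quiet", "--host", host, "--eval", javascript]


def parse_eval_output(stdout_text):
    """Parse the JSON document printed by the shell."""
    return json.loads(stdout_text)


def connection_stats(host="localhost:27017"):
    """Ask a running mongod for its connection counters (needs the mongo shell)."""
    js = "print(JSON.stringify(db.serverStatus().connections))"
    result = subprocess.run(
        build_eval_command(js, host),
        capture_output=True,
        text=True,
        check=True,  # raise if the shell exits with an error
    )
    return parse_eval_output(result.stdout)
```

Calling connection_stats() against a live server should give back a dictionary with the connection counters; nothing runs at import time, so the two small helpers can be tried without a server.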

[Good stuff about having a replica set]

HA. The people at Mongo tell you that if you care about your data, you must have it replicated. Not as a backup (that's a different thing) but to have the data in High Availability. If a node crashes, you'll have another one taking the lead. If you need to do maintenance on your MongoDB machines, you can upgrade the secondary nodes first, then promote one of the secondaries to primary and apply the changes to the ex-primary node.

And that's great because you are (almost) not losing service. I say almost because when you execute rs.stepDown() on the master there is an election, and maybe for a second or so all the nodes are secondaries. But that's good enough.
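The maintenance order described above can be written down as a tiny planner. It is just a sketch of the ordering (secondaries first, then step down the master, then the ex-master); the function name and the host names are hypothetical, and it only produces a checklist, it does not touch any server.

```python
def rolling_upgrade_plan(members):
    """Return the ordered steps for upgrading a replica set with (almost) no downtime.

    `members` maps a host name to its state, "PRIMARY" or "SECONDARY".
    """
    primaries = [host for host, state in members.items() if state == "PRIMARY"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary, got %d" % len(primaries))
    primary = primaries[0]

    # 1. Upgrade every non-primary node first (sorted for a stable order).
    steps = ["upgrade %s" % host for host in sorted(members) if host != primary]
    # 2. Force an election so that an already upgraded secondary takes over.
    steps.append("run rs.stepDown() on %s" % primary)
    # 3. The old primary is now a secondary and can be upgraded too.
    steps.append("upgrade %s" % primary)
    return steps


if __name__ == "__main__":
    for step in rolling_upgrade_plan(
        {"db1": "PRIMARY", "db2": "SECONDARY", "db3": "SECONDARY"}
    ):
        print(step)
```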

Dirty reads. We'll talk about data persistence modes, but the replicas are probably not up to date with exactly the same information as the master. Of course you cannot write to a replica, and by default you cannot read from it either: you have to allow it explicitly, for example with rs.slaveOk() in the shell. Once you do, you can read from the replicas, and that can be great if you don't mind that some data may be stale. This, of course, has the benefit of offloading the master.

Backups. You can use a replica for doing backups. Because most backup methods lock MongoDB, you probably don't want to run them on the master, so that's why you have slaves.

Final note: the teacher didn't tell us this, but when I was using MongoDB I learned from the forums and the support team the sentence: if you don't have a replica set and you lose data or have downtime, it's your problem. What did you expect?

[Consistency levels] (or Write concern)

There are 4 levels (better read the official documentation):

  1. Network: The client only waits for a socket ACK and pretends that the information is stored. Of course you can lose data, but it is fast.
  2. On RAM: the acknowledgement is made by the server saying: "OK, I have your information". You can lose information if there is a power cut.
  3. Disk: The information has been flushed to the disk (or journal).
  4. The replicas: The information has been flushed to disk at the replicas. You can also define how many replicas must confirm your write. "majority" is always a good choice.
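As a rough guide, the four levels above can be mapped to write concern documents built from the standard w and j options. The mapping is my own reading, not something taken from the course slides, and the level names and the helper are invented for the example, so double check it against the official documentation.

```python
# Assumed mapping from the four levels in the notes to write concern options.
WRITE_CONCERNS = {
    "network": {"w": 0},              # do not wait for the server at all
    "ram": {"w": 1},                  # server acknowledged the write
    "disk": {"w": 1, "j": True},      # wait until it is in the journal
    "replicas": {"w": "majority"},    # wait for a majority of the set
}


def write_concern_for(level, replicas=None):
    """Return a write concern document for one of the four levels.

    For the "replicas" level, `replicas` can override "majority" with an
    explicit number of nodes that must confirm the write.
    """
    if level not in WRITE_CONCERNS:
        raise ValueError("unknown level: %r" % (level,))
    concern = dict(WRITE_CONCERNS[level])  # copy, so callers cannot mutate the table
    if level == "replicas" and replicas is not None:
        concern["w"] = replicas
    return concern


if __name__ == "__main__":
    for name in ("network", "ram", "disk", "replicas"):
        print(name, write_concern_for(name))
```

With pymongo, a document like these can be turned into a WriteConcern object and attached to a collection through with_options(write_concern=...).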

[Monitoring]

You want to monitor your Mongo, the replica sets and the shards. The question is what to monitor. I took a few notes about it:

  • % of write lock. It is bad if it is above 50%, because it means that a lot of time is being spent in writes. And as you may know, a write in Mongo locks the whole DB, so no reads can be performed during the write.
  • Know the data_size (size of the data), the index_size and the file size.
  • Calculate the fragmentation of your database. You can do that with the operation file_size - (data_size + index_size). Maybe you want to compact your data if it is very fragmented.
  • B-tree misses. They mean that the indexes don't fit in RAM. That's really bad.
  • Background flush average: if it is high, it means that the disks are suffering or that you may have a RAID problem.
  • Number of connections: while not one of the most important metrics, it can be handy to know how many connections you have to mongo.
  • Oplog hours: given the rate at which you are writing data, how long it will take to fill the oplog (and then you'll have a problem).
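Two of the metrics above are plain arithmetic, so here is a small Python sketch of them. The function names are mine, and I am assuming you already have the sizes in bytes (for example from the output of db.stats()) and an estimate of how many bytes per hour you write to the oplog.

```python
def fragmentation_bytes(file_size, data_size, index_size):
    """Space on disk that holds neither data nor indexes."""
    return file_size - (data_size + index_size)


def fragmentation_ratio(file_size, data_size, index_size):
    """Same figure as a fraction of the file size (0.0 means no fragmentation)."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    return fragmentation_bytes(file_size, data_size, index_size) / file_size


def oplog_window_hours(oplog_size_bytes, write_rate_bytes_per_hour):
    """Hours of writes that fit in the oplog at the given write rate."""
    if write_rate_bytes_per_hour <= 0:
        # Nothing is being written, so the oplog never fills up.
        return float("inf")
    return oplog_size_bytes / write_rate_bytes_per_hour


if __name__ == "__main__":
    mb = 1024 ** 2
    print(fragmentation_bytes(100 * mb, 60 * mb, 20 * mb) // mb, "MB fragmented")
    print(oplog_window_hours(50 * mb, 10 * mb), "hours of oplog")
```

How much fragmentation is "too much" is a judgment call; the sketch only gives you the number to put on a graph or an alert.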

Of course you can also create an account on MongoDB Management Service. It is a free service and you can see all your statistics in a beautiful dashboard. The support people at Mongo can see your statistics and can connect via console to your mongo, though. If you are OK with that....