Eliot's Ramblings

Mongo’s New Matcher

MongoDB 2.5.0 (an unstable dev build) has a new implementation of the “Matcher”. The Matcher is the bit of code in Mongo that takes a query and decides if a document matches a query expression. It also has to understand indexes so that it can do things like create subsets of queries suitable for index covering. However, the structure of the Matcher code hasn’t changed significantly in more than four years and, until this release, it could not be easily extended. It was also structured in such a way that its knowledge could not be reused for query optimization. It was clearly ready for a rewrite.

The “New Matcher” in 2.5.0 is a total rewrite. It contains three separate pieces: an abstract syntax tree (hereafter ‘AST’) for match expressions, a parser from BSON into said AST, and a Matcher API layer that simulates the old Matcher interface while using all new internals. This new version is much easier to extend, easier to reason about, and will allow us to use the same structure for matching as for query analysis and rewriting.
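To make those three pieces concrete, here is a toy sketch in Python. It is not the real implementation: the node classes (Eq, Gt, And), the parse function, and the matches wrapper are all names I made up for illustration, a plain dict stands in for BSON, and only equality and $gt are handled.

```python
from dataclasses import dataclass
from typing import Any, List


# Hypothetical AST node types; illustrative only, not MongoDB's internals.
@dataclass
class Eq:
    path: str
    value: Any

    def matches(self, doc: dict) -> bool:
        return self.path in doc and doc[self.path] == self.value


@dataclass
class Gt:
    path: str
    value: Any

    def matches(self, doc: dict) -> bool:
        return self.path in doc and doc[self.path] > self.value


@dataclass
class And:
    children: List[Any]

    def matches(self, doc: dict) -> bool:
        # An empty AND matches every document, like an empty query.
        return all(child.matches(doc) for child in self.children)


def parse(query: dict):
    """Parse a query dict (standing in for BSON) into an AST."""
    children = []
    for path, cond in query.items():
        is_operator_doc = (
            isinstance(cond, dict)
            and cond
            and all(key.startswith("$") for key in cond)
        )
        if is_operator_doc:
            for op, operand in cond.items():
                if op == "$gt":
                    children.append(Gt(path, operand))
                else:
                    raise ValueError(f"unsupported operator: {op}")
        else:
            # A bare value means equality.
            children.append(Eq(path, cond))
    return children[0] if len(children) == 1 else And(children)


def matches(query: dict, doc: dict) -> bool:
    """Thin API layer: same call shape for callers, AST underneath."""
    return parse(query).matches(doc)
```

Because the query is now a tree of small nodes, adding an operator means adding one node class and one parser branch, and the same tree can later be inspected by a query planner instead of only being executed.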

This matcher rewrite is part of a larger project to restructure query execution, to optimize queries, and to lay the groundwork for more advanced queries in the future. One planned optimization is index intersection. For example, if you have an index on each of the ‘a’ and ‘b’ attributes, we want a query of the form { a : 5 , b : 6 } to do an index intersection of the two indexes rather than just use one index and discard the documents from that index that don’t match. Index intersection would also be suitable for merging geo-spatial, text and regular indexes together in fun and interesting ways (e.g. a query to return all the users within a 3.5 mile radius of a location, with a reputation greater than x, who have RSVP’ed ‘yes’ for an event).
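The idea behind index intersection can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each "index" is an in-memory map from a value to a set of document ids, only equality is supported, and build_index and intersect_lookup are hypothetical helper names, not anything in the server.

```python
from collections import defaultdict


def build_index(docs: dict, field: str) -> dict:
    """Map each value of `field` to the set of ids of documents holding it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        if field in doc:
            index[doc[field]].add(doc_id)
    return index


def intersect_lookup(indexes: dict, query: dict) -> set:
    """Answer an equality query by intersecting one id set per indexed field."""
    candidate_sets = [
        indexes[field].get(value, set()) for field, value in query.items()
    ]
    if not candidate_sets:
        return set()
    # Start from the smallest set so the intersection does the least work.
    candidate_sets.sort(key=len)
    result = set(candidate_sets[0])
    for ids in candidate_sets[1:]:
        result &= ids
    return result


docs = {
    1: {"a": 5, "b": 6},
    2: {"a": 5, "b": 7},
    3: {"a": 4, "b": 6},
}
indexes = {"a": build_index(docs, "a"), "b": build_index(docs, "b")}
matching_ids = intersect_lookup(indexes, {"a": 5, "b": 6})
```

Here only document 1 is in both the a = 5 set and the b = 6 set, so it is the only document that needs to be fetched. Using the ‘a’ index alone would fetch documents 1 and 2 and then throw document 2 away.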

A good example of an extension we’d like to enable is self-referential queries, such as finding all documents where a = b + c. (This would be written { a : { $sum : [ "$b" , "$c" ] } }.) With the new Matcher, such queries are easy to implement as a native part of the language.
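As a rough sketch of what evaluating such an expression involves, here it is in Python. The $sum match syntax above is a proposal, so everything here is hypothetical: resolve and matches_sum are made-up names, and treating a missing field as a non-match is my assumption rather than defined behavior.

```python
def resolve(operand, doc: dict):
    """Turn a "$field" reference into that field's value; pass literals through."""
    if isinstance(operand, str) and operand.startswith("$"):
        return doc.get(operand[1:])
    return operand


def matches_sum(path: str, operands: list, doc: dict) -> bool:
    """Check doc[path] == sum(operands), resolving field references first."""
    values = [resolve(operand, doc) for operand in operands]
    # Assumption: a document missing any referenced field does not match.
    if path not in doc or any(value is None for value in values):
        return False
    return doc[path] == sum(values)


# Corresponds to { a : { $sum : [ "$b" , "$c" ] } }
is_match = matches_sum("a", ["$b", "$c"], {"a": 5, "b": 2, "c": 3})
```

The point is that the right-hand side of the comparison is computed from the document being tested instead of being a constant in the query, which the old Matcher structure made awkward to express.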

Now that the Matcher rewrite is ready for testing, we’d love people to help test it by trying out MongoDB 2.5.0. (Release Notes)


Why Fly to London for 48 Hours

I visited London a few weeks ago to attend and speak at MongoDB London. The event was very successful, and I enjoyed many conversations with attendees and staff during the event. But having the opportunity to spend time with our 10gen London team makes the value of the trips far exceed my contribution to the conference.

Although my time with the team was relatively short since my entire trip to the UK lasted only two days, it provided yet another example of “no substitute for in-person collaboration”.

While I was in the office with the team, some of us began discussing a particular technical topic (related to mutability vs immutability for a specific class hierarchy). This discussion had actually started several weeks before, when a working group was attempting to get a specification for a new feature finalized. However, the geographical distance and time zone differences between the participants had meant that the discussion was drawn out and hard to finalize. During this phase, I had been persuaded of a particular viewpoint.

Working together in person, however, means more than just lower latency. It means better instantaneous understanding. When we met face-to-face, we were able to move rapidly from discussion to quick prototypes and, rather surprisingly, I found myself changing my point of view (as did one of the engineers in London). We therefore changed the spec.

10gen is a very distributed company, with offices in 7 cities and more to come. Maintaining our agility would not be possible without the benefits of teleconferencing in all of its forms; yet as useful as it is, I find no replacement for being in the same room with someone. It may be that I am particularly bad at remote communication. Regardless, I know my frequent trips to other 10gen offices are well worth the air time.

10gen’s New Office

Monday was a big day for 10gen in New York; we moved into our new offices on West 43rd Street. The last time we moved (about 16 months ago), our then-new office seemed quite spacious, and we expected it to last quite a while. That turned out to be a bit shortsighted. By January of this year we were bursting at the seams, with every desk full, expansion space taken, and competition for conference rooms straining everyone’s patience.

Our new office is one we built ourselves, and I’m happy to say that because of that, it represents more than just an end to the constraints on our resource scheduling for the moment. It means we had the opportunity to build the type of space that suits our culture – an environment for serious work, but with enough comforts to make life at the office very enjoyable. In some future posts I’ll cover some of the choices we made and why, but for now I’d just like to say “phew!”

Streaming Twitter Into MongoDB

curl http://stream.twitter.com/1/statuses/sample.json -u: | mongoimport -c twitter_live

One thing that you can do with Mongo is run one streaming master and one read/write master:

server A:

./mongod --master --dbpath /tmp/a

server B:

./mongod --dbpath /tmp/b --master --slave --source localhost:27017 --port 9999

You can then pipe the stream into server A, and it will only process the live stream.

Server B will replicate all changes. You can also write to it, query it, and so on. This way you can run operations that block writes on server B, but server A will never backlog.