Eliot's Ramblings

My Fireside Chat at Data Driven NYC

A couple of weeks ago I did a great fireside chat with Matt Turck at Data Driven NYC.

I’ve always found that the fireside chat is a format with a lot of potential to be boring, but Matt is a great interviewer, and interacting with him on stage definitely adds to the event. For example, when I was talking about the headline features of our 3.2 release, I omitted a significant pair – the BI connector and Compass – and he reminded me to talk about them. It’s things like that which enhance the experience for the audience. At their best, fireside chat interviewers take care of the setup, make sure you’re staying on track, and grab opportunities to dig deeper.

One thing that Matt brought up (at around 12:40 in the video) was how, after an explosion of alternatives to relational databases, it’s starting to feel like things are converging again. Now, when you do one of these, you get a list of topics to prepare for in advance, but that’s a question that emerged organically from our conversation. I appreciated the opportunity to address that by citing a core tenet of MongoDB (at 14:45):

“We really want you to be able to configure yourself into different use cases, rather than having to use different kinds of products.”

All the other speakers were very interesting. I was particularly taken with Dr. Kieran Snyder’s Textio presentation – that’s a cool product, with cool tech, and a lot of potential. Think about it: there’s an enormous amount of text in the world. You will never be able to read it all. Algorithms that understand those vast swaths of text you will never be able to personally consume have the potential to revolutionize knowledge. Or consider influential articles that wind up with thousands – or maybe some day millions – of comments… stuff like this can sift signal from noise. There’s a lot of promise there. Kieran and I had a really interesting chat about this stuff after the event, and I’m looking forward to seeing where they take Textio.

All in all, it was a really enjoyable evening, and I would definitely recommend Data Driven NYC to anyone in New York with any interest in tech. The events happen monthly, so it should be relatively easy to catch one.

After Parse: Where Should MBaaS Go?

Last week I talked about Parse shutting down and how unfortunate that was, but also how outstanding a job they have done providing a transition path for their current users. MongoDB also published a very detailed post on how to migrate a Parse app onto MongoDB Cloud Manager and AWS Elastic Beanstalk. Since that day, the amount of activity on the open source Parse Server has been phenomenal, and many have suggested, as did one commenter on my last post, that this means it’s time for MongoDB and Parse to work even better together.

All this discussion I’ve had about Parse has got me thinking about the nature of the Mobile Backend-as-a-Service space and MongoDB’s role there. I’m also interested in hearing directly from customers. If your MBaaS-backed application deals with a decent amount of data and load, leave a comment or shoot me an email (eliot@mongodb.com); I want to talk about making MongoDB and the ecosystem around it even better for the MBaaS use case.

Farewell Parse

Updated 2/3/2016 to reflect the publication of MongoDB’s migration guide.

I was sad to hear about Parse shutting down last week.

Parse made a big push towards serverless architectures, which I think is a great goal. Serverless architectures are the ultimate in letting developers focus on making great products for their users and letting other people make the plumbing work.

In the early days of web and mobile application development, backends were a thing that every team had to write themselves from scratch. Over time, common patterns were encapsulated into application frameworks. Parse was a glimpse farther into the future, providing app developers an abstraction for an entire backend.

I’m a fan of this approach. MongoDB’s number one focus has always been on making developers more productive, so they don’t have to work so hard to use their databases. As more and more applications are built using a set of services, the next step will be around making those services composable, their data composable, and making those pieces really easy to build, deploy, and maintain.

I’m also really happy to see the way that Parse is handling their shutdown. They are giving users one year to move off, providing tools to migrate, and releasing an open-source version of the backend so the migration can be as painless as possible.

MongoDB has published a guide on how to use Cloud Manager and Elastic Beanstalk to help those migrating off of Parse. We’ll also offer a consulting package for those who want a bit more customized help.

Mayor De Blasio Announces Comprehensive NYC K-12 CS Education Program

For the past 6 months, I’ve been participating in the NYC Tech Talent Pipeline Advisory Board, a partnership between New York City and technology companies in New York. From the press release announcing this board’s formation:

Mayor Bill de Blasio today announced 14 initial industry commitments to support the delivery of technology education, training, and job opportunities to thousands of New Yorkers as part of the Administration’s NYC Tech Talent Pipeline initiative. Announced by the Mayor in May 2014, the NYC Tech Talent Pipeline is a first-of-its-kind, $10 million public-private partnership designed to support the growth of the City’s tech ecosystem and prepare New Yorkers for 21st century jobs. The commitments were announced at today’s inaugural convening of the NYC Tech Talent Pipeline Advisory Board, during which Mayor de Blasio and 25 executives representing the City’s leading companies came together to help define employer needs, develop technology training and education solutions, and deliver quality jobs for New Yorkers and quality talent for New York’s businesses.

The board has been working since it was convened to devise, fund, and institute programs that train New Yorkers in the technology skills needed to drive innovation by businesses which operate here in NYC. And yesterday Mayor Bill de Blasio had a big announcement to make: within 10 years, every public school in NYC will offer computer science education.

I’m not going to belabor the obvious point that educating New Yorkers in technology skills is a win-win scenario. What is so great about this unprecedented commitment to computer science education is that it brings these benefits to all young children in New York.

I started programming at a very young age. I went to computer camp when I was 7. I took computers apart and tried to make them better. I was lucky enough to be exposed to computer science early, and to have a father who encouraged and helped me when I was young. That early experience made a huge difference in my life and played a large role in where I am today. But I was a rare exception, and that’s not how it should be.

Learning computer science requires access to a computer, which back then was not ubiquitous, but today, everyone has one in their pocket. It’s high time to adapt to this new reality and to stop thinking of computer science as an elective suitable for a small slice of the population. There are many reasons to expect students from all backgrounds to take to computer science with gusto. Software provides immediate gratification, which is great for fostering excitement in learning. It requires very little capital to write software, so anyone with dedication should be able to build something great. But it’s hard to do well, so developing understanding and excitement early makes a big difference. We need to give kids a chance to love CS before they hear or assume that they’re not the right type of person to be a software engineer.

I fully expect this program to lead to huge, positive changes in the lives of the children of NYC, and to bring to the companies that need software engineers a large, vital, diverse pool of them.

Document Validation and What Dynamic Schema Means

When we first published a mongodb.org homepage, we sloppily described MongoDB as “schema free”. That description over-emphasizes the baggage MongoDB left behind, at the expense of true clarity. At the time, however, document databases were brand new, and it was simple to describe them in terms of what they were not (witness the prevalence of the terms “non-relational” and “nosql”). This over-simplification was much more than an oversight. As you can see by reviewing this old blog post, it reflects an immaturity in our thinking. By 2011 we had come to see that calling MongoDB “schema free” reflected an old way of thinking about what “schemas” actually are, so we changed the homepage to say “dynamic schema”.

To appreciate the context for this evolution, recall that when we launched MongoDB, “schema” meant the tables your data was stored in, and the rules that governed the relationship between those tables. Relational schemas have a fixed structure, with strongly typed fields, so complex entities can only be modeled as collections of tables, with their relationships to each other also strongly defined. So schemas are fixed, and altering them is a high cost operation. It seemed correct to say that MongoDB was free of schema.

The DDL used to define a relational schema affords a few additional usability benefits as a side effect of how it requires data to conform to the relational model. Two key benefits: schemas provide documentation of what data is in a table (if you’ve seen one row, you’ve seen ‘em all!), and validation of the fields, by their very definition.

At this point it seems needlessly reductionist to call MongoDB schema-free, since of course, MongoDB and the apps built on it have always had schemas; they just embodied them in their queries and the indexes built to support them, rather than in a table definition. Furthermore, we did plan to offer our users the documentation and validation aspects of schema, but wanted to focus on developing the document model first. When MongoDB was created, we saw more value in doing away with the restrictive elements of tables than keeping them for their side effects, especially when they could be delivered as features, deliberately designed to suit the needs of developers and operators.

In MongoDB 3.2 we are following through on that plan, and one of those features is document validation. To use it, you attach a validation document to a collection. Validation documents use the MongoDB query language to add constraints to the documents inserted into that collection. An example validator might be:

{ age : { $gte : 0, $lte : 150 } }

If someone tried to insert a document with a null or missing age, the document would be rejected. If you tried to insert 32 as a string or -5, it would also be rejected. This allows the database to enforce some simple constraints about the content of the documents, similar to PostgreSQL’s check constraints.
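The matching behavior is easy to sketch outside the database. This is plain JavaScript illustrating the semantics described above, not MongoDB’s implementation:

```javascript
// Sketch of how a validator like { age : { $gte : 0, $lte : 150 } }
// treats candidate documents. MongoDB's comparisons are type-bracketed,
// so a string like "32" does not satisfy a numeric $gte/$lte.
function passesAgeValidator(doc) {
  const age = doc.age;
  if (typeof age !== "number") return false; // missing, null, or non-numeric
  return age >= 0 && age <= 150;
}

passesAgeValidator({ age: 32 });     // true: accepted
passesAgeValidator({ age: -5 });     // false: out of range
passesAgeValidator({ age: "32" });   // false: wrong type
passesAgeValidator({ name: "ann" }); // false: age missing
```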

One common use case for MongoDB is aggregating data from different sources. With document validation, you’ll be able to ensure that all of the sources have some common fields (like ‘email’) so they can be linked.

You can attach a validation document to a collection at creation time, by including it as a validator field in the db.createCollection command, or by using the collMod database command:

db.runCommand( {
   collMod: "contacts",
   validator: { $or: [ { phone: { $exists: true } }, { email: { $exists: true } } ] }
} )

There are a number of options that can be used to tune the behavior of validation, such as warn-only mode, and how to handle updates that don’t pass validation, so have a look at the dev-series documentation for the complete picture.
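For instance, warn-only mode is controlled by the validationAction option, and validationLevel governs how updates to pre-existing documents are treated; both can be set with the same collMod command:

```javascript
db.runCommand( {
   collMod: "contacts",
   validationAction: "warn",    // log violations instead of rejecting writes
   validationLevel: "moderate"  // skip validation on updates to pre-existing invalid documents
} )
```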

Along with the rest of the 3.2 “schema when you need it” features, document validation gives MongoDB a new, powerful way to keep data clean. This is definitely not the final set of tools we will provide, but rather an important step in how MongoDB handles schema.

Under the Hood With Partial Indexes

Partial indexes allow you to create an index that only includes documents in a collection that conform to a filter expression. These indexes can be much smaller, cutting down index overhead in storage space and update time, and by matching against the filter criteria, queries can use this slimmed-down index and run much faster. This is one of the new lightweight “schema where you need it” features we’re bringing to MongoDB in 3.2. The idea for this feature came from discussion with our users who are accustomed to it from relational databases like PostgreSQL, which introduced the feature in version 7.2. With partial indexes, you use your knowledge of your application to adjust the space/time tradeoff you make when creating indexes to fit your needs.

One great example is a collection whose documents go through an active phase and then move into an archival state, marked by a state field update (like “billed” going from “false” to “true”). The archived documents come to occupy the bulk of the collection’s footprint, and since you’re unlikely to access them outside the context of looking up a single record by its primary key or an analytical collection scan, they would just clutter up your index, consume RAM, and make your other operations run slower.
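In 3.2, that knowledge is expressed with the partialFilterExpression index option (collection and field names here are illustrative):

```javascript
// Index only the active documents; once "billed" flips to true,
// a document drops out of the index entirely.
db.orders.createIndex(
   { customer_id: 1 },
   { partialFilterExpression: { billed: false } }
)
```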

So, here’s an architecture question… is this a storage engine change?

Well, that’s a trick(y) question. From a design standpoint it absolutely should not be. Storage engines are simple (conceptually) and need to be focused on one thing: storing and retrieving data efficiently. Indexing concerns belong to layers above a storage engine.

But in the pre-3.0 days, this would have had to be a storage engine change, because we had not yet created a nice separation of concerns. A ton of work had to be done behind the scenes as we built 2.4, 2.6, and 3.0 to make this possible, but now we’re seeing all that hard work pay off. Pluggable storage engines are a big part of the future of MongoDB, and a sane architecture separating these layers turned partial indexes from a nightmare into some code that’s actually really pleasant to read. So pleasant, in fact, that I’m going to tour you through some of it, by tracing the path of an insert into a collection.

At a high level, an interaction with a MongoDB collection traverses several layers to get down to the storage engine. For this tour, we’ll skip the networking and user layers, and trace the path from the Collection object to the IndexManager to the StorageEngine.

(Note: all links here are to the 3.1.7 branch to make sure they are stable, so this code is already slightly out of date - see master for newer code. Line numbers will have changed, but the general flow will be the same. (For the next year at least!))

The entry point is Collection::insertDocument, which hoists out error handling (including document validation, another one of our 3.2 features, but that’s for another post), and passes down to Collection::_insertDocument.

This code contains a transition across areas of concern:

A Collection calling down to a RecordStore
StatusWith<RecordId> loc = _recordStore->insertRecord(
    txn, docToInsert.objdata(), docToInsert.objsize(), _enforceQuota(enforceQuota));

_recordStore is an instance of our abstraction around storage engines (more detail can be found here), and you can see that we just hand the data for the document over to the _recordStore to handle.

The architecture detail of note is that this code doesn’t deal with indexes, nor is indexing buried below that call to insertRecord. Rather, after doing a little collection housekeeping, _insertDocument just calls IndexCatalog::indexRecord, which in turn calls _indexRecord for every index on the collection.

There, we simply do not index entries that do not match:

Does the index filter match the document?
const MatchExpression* filter = index->getFilterExpression();
if (filter && !filter->matchesBSON(obj)) {
    return Status::OK();
}
For each index where the expression matches (or there is no filter), it calls IndexAccessMethod::insert, which generates the keys (0 to many, typically 1) and inserts each one. IndexAccessMethod is a superclass abstracting how indexes are used, since there are many types, such as geospatial, btree, and full text, and each will have their own implementation.

(Those of you following along in the code might notice the abstraction for the index itself is stored as the _newInterface member of the IndexAccessMethod class. At some point that will get a better name!)

So now the storage layer doesn’t know about partial indexes at all.

The reason that this works is that the storage engine layer is required to expose a transactional key/value API, through which all interactions pass, accompanied by a transaction descriptor. The layer above that treats both collections and their indexes as sets of key/value maps. So inserting a document into a collection with 2 indexes is 3¹ separate table insert calls to storage engine code from higher layers, with atomicity ensured by the storage engine’s transaction system.
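As a sketch of that layering (plain JavaScript standing in for the storage engine’s key/value tables; none of these names come from the actual codebase, and the real engine also wraps the inserts in a transaction, which this omits):

```javascript
// Model the storage layer as plain key/value maps: one "table" for the
// collection's records, plus one per index. The layer above decides what
// goes where; the "engine" just stores keys and values.
function makeCollection(indexes) {
  return { records: new Map(), indexes: indexes, nextId: 1, tableInserts: 0 };
}

function insertDocument(coll, doc) {
  const id = coll.nextId++;
  coll.records.set(id, doc); // one table insert for the record store
  coll.tableInserts++;
  for (const index of coll.indexes) {
    // A partial index simply skips documents that don't match its filter.
    if (index.filter && !index.filter(doc)) continue;
    index.entries.set(doc[index.field] + ":" + id, id); // one insert per index
    coll.tableInserts++;
  }
  return id;
}

const coll = makeCollection([
  { field: "customer_id", entries: new Map(), filter: null },
  { field: "status", entries: new Map(), filter: (d) => d.billed === false },
]);

insertDocument(coll, { customer_id: 7, status: "new", billed: false });
// record store + 2 index entries = 3 table inserts
```

The second index here is partial, so inserting an already-billed document would touch only two tables: the record store and the unfiltered index.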

  1. or more, in the case of multi-key indexing of arrays

AWS Pop-up Loft Talk

On August 25th I will be delivering a talk at the AWS Pop-Up Loft in NYC. The talk is entitled: “Behind the Scenes with MongoDB: Lessons from the CTO and Cofounder on Deploying MongoDB with AWS.” The AWS lofts combine hack days, talk series, bootcamps, and “ask an architect” opportunities, and mainly target engineers working on startup projects that are built on AWS, although other people do attend the talks.

Since this is a technical crowd, the talk will be highly technical, and since it’s an AWS event, I’ll be emphasizing MongoDB’s uses in the AWS environment. Here’s the abstract:

Meet Eliot Horowitz, CTO and Co-Founder of MongoDB, the next gen database built for the cloud. Eliot will share his experience founding and scaling a successful startup, discuss the value of community, and urge you to throw away code as fast as you can.

Then he’ll get into specifics regarding how to deploy MongoDB in an AWS context. To focus the discussion, he will use the example of a MongoDB-backed, multiplayer mobile game hosted on AWS, and follow it from inception as a prototype to a global infrastructure spread across multiple regions and availability zones. You will learn specific methods enabling you to start lean while being prepared to scale massively, such as tag-aware sharding for geo-aware data residence, and using multiple storage engines to optimize for particular use cases.


I’m looking forward to it, and if you’re going to be there, let me know.

Extending the Aggregation Framework

The aggregation framework is one of my favorite tools in MongoDB. It’s a clean way to take a set of data and run it through a pipeline of steps to modify, analyze, and process it.

At MongoDB World, one of the features we talked about that is coming in MongoDB 3.2 is $lookup. $lookup is an aggregation stage that lets you run a query on a different collection and put the results into a document in your pipeline. This is a pretty powerful feature that we’ll talk more about in a later post.
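In its 3.2 form, the stage takes the collection to query, the fields to join on, and the name of the array field to add to each output document (collection and field names here are illustrative):

```javascript
db.orders.aggregate( [
   { $lookup: {
       from: "customers",          // the collection to query
       localField: "customer_id",  // field in the input documents
       foreignField: "_id",        // field in the "from" collection
       as: "customer"              // array field added to each output document
   } }
] )
```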

In order to make writing $lookup a bit cleaner, we’ve done some work to make adding aggregation stages easier. While this is largely for MongoDB developers, it could also be used by anyone to add a custom stage to do some cool processing on documents inside of MongoDB. Now, given that this requires compiling your own version of mongod, and writing C++ that could corrupt data, this is not for the faint of heart, but it is quite fun :)

For example, if you wanted to write an aggregation stage that injected a new field into every document that came through the pipe, you could do it like this:


Now, you could use $project for this, but my new stage makes all the values into my birthday. So, that’s better.

In the end, not too bad. If anyone has some cool ideas, please share!

I Want an Apple Watch

A lot of people I talk to are unsure about the Apple Watch, and the category in general. Me, I’m counting down the days till I get my Apple Watch. In fact, at this point my impatience is so great, the prospect of having to wait another month to get one almost makes me want to go out and buy a Pebble. So, score one for the Apple marketing team, I guess.

Before we get into why, I first want to talk about Apple’s VIP feature. You can mark certain people as VIP, and then you can see emails from just them, limit email notifications to just them, and probably more things I haven’t even tried yet. I have emails from VIPs appear on my phone lock screen. This allows me to quickly glance to see if there is anything I want to read. For better or worse, my habit (addiction) is that I need to look at that fairly often.

So the only things on my lock screen are VIP emails, text messages and my next calendar item. All of those are things I generally want to see very often. Right now, that involves either pulling my phone out of my pocket and looking at it, or keeping it on a table and pressing a button. Oh, and I do like to look at the time on my phone pretty often too.

Those four things all seem to be pretty well served by the basic functionality of the Apple Watch. Time, check. Upcoming appointment, I think check. Text messages, check. VIP emails… well, they haven’t been specific about that, but I’d be really surprised if they didn’t integrate that awesome feature into the watch. For me, being able to accomplish those four things without the interruption of going to the phone seems really appealing. Time will tell if it actually works, but I’m hoping. And being able to dismiss a call while keeping my phone in my pocket will also be really nice.

For these reasons, my excitement is currently all about the core feature set, but I’m also intrigued by all the interesting apps that are likely to appear over the next few years. For a lark I’ve done a little daydreaming about that; maybe I’ll write up a few ideas for a later post.

Gmail Jira Decorator

As discussed in other posts, I spend a lot of time in email, and much of the email I get is related to MongoDB’s Jira. I’ve written before about my Jira summarizer, which maintains a single message in your inbox with a summary of recent activity in projects you watch. In my continuing quest to make Jira email easier to deal with, I wrote a tool to make it easier to quickly assess the email notifications about individual issues.

The tool is a Chrome extension that operates on my Gmail inbox. Every 30 seconds it scrapes the subjects of emails and makes a Jira request to get some basic information. (It offloads most of this work to a separate server I wrote.) It then munges the HTML to decorate the subject of the email with the status, assignee, severity, and fix version.

This allows me to quickly see things that are blockers or critical, skip over things that are already assigned to someone, and know whether someone has decided it should be fixed in the next point release vs. at some point in the future.

Gmail Jira Decorator in action

Interested in the project? Feedback on my email-centered workflow? Let me know!