Eliot's Ramblings

Mayor De Blasio Announces Comprehensive NYC K-12 CS Education Program

For the past 6 months, I’ve been participating in the NYC Tech Talent Pipeline Advisory Board, a partnership between New York City and technology companies in New York. From the press release announcing this board’s formation:

Mayor Bill de Blasio today announced 14 initial industry commitments to support the delivery of technology education, training, and job opportunities to thousands of New Yorkers as part of the Administration’s NYC Tech Talent Pipeline initiative. Announced by the Mayor in May 2014, the NYC Tech Talent Pipeline is a first-of-its-kind, $10 million public-private partnership designed to support the growth of the City’s tech ecosystem and prepare New Yorkers for 21st century jobs. The commitments were announced at today’s inaugural convening of the NYC Tech Talent Pipeline Advisory Board, during which Mayor de Blasio and 25 executives representing the City’s leading companies came together to help define employer needs, develop technology training and education solutions, and deliver quality jobs for New Yorkers and quality talent for New York’s businesses.

The board has been working since it was convened to devise, fund, and institute programs that train New Yorkers in the technology skills needed to drive innovation by businesses which operate here in NYC. And yesterday Mayor Bill de Blasio had a big announcement to make: within 10 years, every public school in NYC will offer computer science education.

I’m not going to belabor the obvious point that educating New Yorkers in technology skills is a win-win scenario. What is so great about this unprecedented commitment to computer science education is that it brings these benefits to all young children in New York.

I started programming at a very young age. I went to computer camp when I was 7. I took computers apart and tried to make them better. I was lucky enough to be exposed to computer science early, and to have a father who encouraged and helped me when I was young. That early experience made a huge difference in my life and played a large role in where I am today. But I was a rare exception, and that’s not how it should be.

Learning computer science requires access to a computer; back then computers were not ubiquitous, but today everyone has one in their pocket. It’s high time to adapt to this new reality and to stop thinking of computer science as an elective suitable for a small slice of the population. There are many reasons to expect students from all backgrounds to take to computer science with gusto. Software provides immediate gratification, which is great for fostering excitement in learning. It requires very little capital to write software, so anyone with dedication should be able to build something great. But it’s hard to do well, so developing understanding and excitement early makes a big difference. We need to give kids a chance to love CS before they hear or assume that they’re not the right type of person to be a software engineer.

I fully expect this program to lead to huge, positive changes in the lives of the children of NYC, and to bring to the companies that need software engineers a large, vital, diverse pool of them.

Document Validation and What Dynamic Schema Means

When we first published a mongodb.org homepage, we sloppily described MongoDB as “schema free”. That description over-emphasizes the baggage MongoDB left behind, at the expense of true clarity. At the time, however, document databases were brand new, and it was simple to describe them in terms of what they were not (witness the prevalence of the terms “non-relational” and “nosql”). This over-simplification was much more than an oversight. As you can see by reviewing this old blog post, it reflects an immaturity in our thinking. By 2011 we had come to see that calling MongoDB “schema free” reflected an old way of thinking about what “schemas” actually are, so we changed the homepage to say “dynamic schema”.

To appreciate the context for this evolution, recall that when we launched MongoDB, “schema” meant the tables your data was stored in, and the rules that governed the relationships between those tables. Relational schemas have a fixed structure, with strongly typed fields, so complex entities can only be modeled as collections of tables, with their relationships to each other also strongly defined. Schemas are therefore fixed, and altering them is a high-cost operation. It seemed correct to say that MongoDB was free of schema.

The DDL used to define a relational schema affords a few additional usability benefits as a side effect of how it requires data to conform to the relational model. Two key benefits: schemas provide documentation of what data is in a table (if you’ve seen one row, you’ve seen ‘em all!), and validation of the fields, by their very definition.

At this point it seems needlessly reductionist to call MongoDB schema-free, since, of course, MongoDB and the apps built on it have always had schemas; they just embodied them in their queries and the indexes built to support them, rather than in a table definition. Furthermore, we did plan to offer our users the documentation and validation aspects of schema, but wanted to focus on developing the document model first. When MongoDB was created, we saw more value in doing away with the restrictive elements of tables than in keeping them for their side effects, especially when those side effects could be delivered as features, deliberately designed to suit the needs of developers and operators.

In MongoDB 3.2 we are following through on that plan, and one of those features is document validation. To use it, you attach a validation document to a collection. Validation documents use the MongoDB query language to add constraints to the documents inserted into that collection. An example validator might be:

{ age : { $gte : 0, $lte : 150 } }

If someone tried to insert a document with a null or missing age, the document would be rejected. If you tried to insert 32 as a string, or -5, it would also be rejected. This allows the database to enforce some simple constraints on the content of documents, similar to PostgreSQL’s check constraints.
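
As a minimal sketch of that behavior in the shell (the “people” collection and its documents are invented for illustration), you can attach the validator above at creation time and watch it reject non-conforming inserts:

db.createCollection( "people", {
    validator: { age : { $gte : 0, $lte : 150 } }
} )

db.people.insert( { name: "ok", age: 32 } )      // accepted
db.people.insert( { name: "bad", age: "32" } )   // rejected: a string never matches $gte : 0
db.people.insert( { name: "bad", age: -5 } )     // rejected: out of range
db.people.insert( { name: "bad" } )              // rejected: age is missing entirely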

One common use case for MongoDB is aggregating data from different sources. With document validation, you’ll be able to ensure that all of the sources have some common fields (like ‘email’) so they can be linked.

You can attach a validation document to a collection at creation time, by including it as a validator field in the db.createCollection command, or to an existing collection by using the collMod database command:

db.runCommand( {
   collMod: "contacts",
   validator: { $or: [ { phone: { $exists: true } }, { email: { $exists: true } } ] }
} )

There are a number of options that can be used to tune the behavior of validation, such as a warn-only mode and control over how updates that don’t pass validation are handled, so have a look at the dev-series documentation for the complete picture.
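
For instance, here is a rough sketch of those knobs layered onto the collMod example above (validationAction and validationLevel are the option names 3.2 exposes):

db.runCommand( {
   collMod: "contacts",
   validator: { $or: [ { phone: { $exists: true } }, { email: { $exists: true } } ] },
   validationAction: "warn",     // log a warning and accept the write instead of rejecting it
   validationLevel: "moderate"   // don't apply the validator to updates of documents that already fail it
} )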

Along with the rest of the 3.2 “schema when you need it” features, document validation gives MongoDB a new, powerful way to keep data clean. This is definitely not the final set of tools we will provide, but rather an important step in how MongoDB handles schema.

Under the Hood With Partial Indexes

Partial indexes allow you to create an index that only includes documents in a collection that conform to a filter expression. These indexes can be much smaller, cutting down index overhead in storage space and update time, and by matching against the filter criteria, queries can use this slimmed-down index and run much faster. This is one of the new lightweight “schema where you need it” features we’re bringing to MongoDB in 3.2. The idea for this feature came from discussion with our users who are accustomed to it from relational databases like PostgreSQL, which introduced the feature in version 7.2. With partial indexes, you use your knowledge of your application to adjust the space/time tradeoff you make when creating indexes to fit your needs.

One great example of this is a collection where documents go through an active phase and then move into an archival state, marked by a state field update (like “billed” flipping from false to true), in which they make up the bulk of the collection’s footprint. Since you’re unlikely to access archived documents outside the context of looking up a single record by its primary key or running an analytical collection scan, they would just clutter up your index, consume RAM, and make your other operations run slower.
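
Here’s a minimal sketch of that example (the collection and field names are made up): only unbilled orders ever enter the index, so the archived majority of the collection adds nothing to the index’s size or maintenance cost.

db.orders.createIndex(
    { customerId: 1, created: -1 },
    { partialFilterExpression: { billed: false } }
)

// This query can use the partial index, because its predicate implies billed: false:
db.orders.find( { customerId: 12345, billed: false } ).sort( { created: -1 } )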

So, here’s an architecture question… is this a storage engine change?

Well, that’s a trick(y) question. From a design standpoint it absolutely should not be. Storage engines are simple (conceptually) and need to be focused on one thing: storing and retrieving data efficiently. Indexing concerns belong to layers above a storage engine.

But in the pre-3.0 days, this would have had to be a storage engine change, because we had not yet created a nice separation of concerns. A ton of work had to be done behind the scenes as we built 2.4, 2.6, and 3.0 to make this possible, but now we’re seeing all that hard work pay off. Pluggable storage engines are a big part of the future of MongoDB, and a sane architecture separating these layers turned implementing partial indexes from a nightmare into code that’s actually really pleasant to read. So pleasant, in fact, that I’m going to walk you through some of it, by tracing the path of an insert into a collection.

At a high level, an interaction with a MongoDB collection traverses several layers to get down to the storage engine. For this tour, we’ll skip the networking and user layers, and trace the path from the Collection object to the IndexManager to the StorageEngine.

(Note: all links here are to the 3.1.7 branch to make sure they are stable, so this code is already slightly out of date - see master for newer code. Line numbers will have changed, but the general flow will be the same. (For the next year at least!))

The entry point is Collection::insertDocument, which hoists out error handling (including document validation, another of our 3.2 features, but that’s for another post) and passes down to Collection::_insertDocument.

This code contains a transition across areas of concern:

A Collection calling down to a RecordStore:
StatusWith<RecordId> loc = _recordStore->insertRecord(
    txn, docToInsert.objdata(), docToInsert.objsize(), _enforceQuota(enforceQuota));

_recordStore is an instance of our abstraction around storage engines (more detail can be found here), and you can see that we just hand the data for the document over to the _recordStore to handle.

The architecture detail of note is that this code doesn’t deal with indexes, nor is indexing buried below that call to insertRecord. Rather, after doing a little collection housekeeping, _insertDocument just calls IndexCatalog::indexRecord, which in turn calls _indexRecord for every index on the collection.

There, we simply do not index entries that do not match:

Does the index filter match the document?
const MatchExpression* filter = index->getFilterExpression();
if (filter && !filter->matchesBSON(obj)) {
    return Status::OK();
}
For each index where the expression matches (or there is no filter), it calls IndexAccessMethod::insert, which generates the keys (0 to many, typically 1) and inserts each one. IndexAccessMethod is a superclass abstracting how indexes are used, since there are many types, such as geospatial, btree, and full text, and each has its own implementation.

(Those of you following along in the code might notice the abstraction for the index itself is stored as the _newInterface member of the IndexAccessMethod class. At some point that will get a better name!)

So now the storage layer doesn’t know about partial indexes at all.

The reason that this works is that the storage engine layer is required to expose a transactional key/value API, through which all interactions pass, accompanied by a transaction descriptor. The layer above that treats both collections and their indexes as sets of key/value maps. So inserting a document into a collection with 2 indexes is 3¹ separate table insert calls to storage engine code from higher layers, with atomicity ensured by the storage engine’s transaction system.

  1. or more, in the case of multi-key indexing of arrays

AWS Pop-up Loft Talk

On August 25th I will be delivering a talk at the AWS Pop-Up Loft in NYC. The talk is entitled: “Behind the Scenes with MongoDB: Lessons from the CTO and Cofounder on Deploying MongoDB with AWS.” The AWS lofts combine hack days, talk series, bootcamps, and “ask an architect” opportunities, and mainly target engineers working on startup projects that are built on AWS, although other people do attend the talks.

Since this is a technical crowd, the talk will be highly technical, and since it’s an AWS event, I’ll be emphasizing MongoDB’s uses in the AWS environment. Here’s the abstract:

Meet Eliot Horowitz, CTO and Co-Founder of MongoDB, the next gen database built for the cloud. Eliot will share his experience founding and scaling a successful startup, discuss the value of community, and urge you to throw away code as fast as you can.

Then he’ll get into specifics regarding how to deploy MongoDB in an AWS context. To focus the discussion, he will use the example of a MongoDB-backed, multiplayer mobile game hosted on AWS, and follow it from inception as a prototype to a global infrastructure spread across multiple regions and availability zones. You will learn specific methods enabling you to start lean while being prepared to scale massively, such as tag-aware sharding for geo-aware data residence, and using multiple storage engines to optimize for particular use cases.
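
As a small taste of one of those methods, here’s a hedged sketch of what tag-aware sharding could look like for such a game; the shard name, tag, namespace, and shard key below are all invented:

// Pin European players to shards tagged "EU" (assuming a shard key of { region: 1, userId: 1 }):
sh.addShardTag( "shard-eu-1", "EU" )
sh.addTagRange(
    "gamedb.players",
    { region: "EU", userId: MinKey },
    { region: "EU", userId: MaxKey },
    "EU"
)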


I’m looking forward to it, and if you’re going to be there, let me know.

Extending the Aggregation Framework

The aggregation framework is one of my favorite tools in MongoDB. It’s a clean way to take a set of data and run it through a pipeline of steps that modify, analyze, and process it.

At MongoDB World, one of the features we talked about that is coming in MongoDB 3.2 is $lookup. $lookup is an aggregation stage that lets you run a query on a different collection and put the results into a document in your pipeline. This is a pretty powerful feature that we’ll talk more about in a later post.
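
For a flavor of the syntax, here’s a minimal sketch (the collection and field names are invented) that pulls matching customer documents into each order as it flows through the pipeline:

db.orders.aggregate( [
    { $lookup: {
        from: "customers",           // the other collection to query
        localField: "customerId",    // field in the orders documents
        foreignField: "_id",         // field in the customers documents
        as: "customer"               // array of matches, embedded in each output document
    } }
] )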

In order to make writing $lookup a bit cleaner, we’ve done some work to make adding aggregation stages easier. While this is largely for MongoDB developers, it could also be used by anyone to add a custom stage that does some cool processing on documents inside of MongoDB. Now, given that this requires compiling your own version of mongod, and writing C++ that could corrupt data, it is not for the faint of heart, but it is quite fun :)

For example, writing an aggregation stage that injects a new field into every document that comes through the pipe now takes only a small amount of C++.


Now, you could use $project for this, but my new stage makes all the values into my birthday. So, that’s better.
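
To make that concrete, here’s a rough sketch of how the two approaches compare from the shell; the stage name, collection name, and birthday are all made up, and the custom stage would only exist in a mongod you compiled yourself:

// Hypothetical custom stage, registered in a custom build of mongod:
db.people.aggregate( [
    { $injectBirthday: {} }   // made-up stage name: adds a constant birthday field to every document
] )

// The stock $project alternative can add a constant field via $literal,
// but you have to enumerate every field you want to keep:
db.people.aggregate( [
    { $project: { name: 1, email: 1, birthday: { $literal: "1970-01-01" } } }
] )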

In the end, not too bad. If anyone has some cool ideas please share!

I Want an Apple Watch

A lot of people I talk to are unsure about the Apple Watch, and the category in general. Me, I’m counting down the days till I get my Apple Watch. In fact, at this point my impatience is so great, the prospect of having to wait another month to get one almost makes me want to go out and buy a Pebble. So, score one for the Apple marketing team, I guess.

Before we get into why, I first want to talk about Apple’s VIP feature. You can mark certain people as VIP, and then you can see emails from just them, limit email notifications to just them, and probably more things I haven’t even tried yet. I have emails from VIPs appear on my phone lock screen. This allows me to quickly glance to see if there is anything I want to read. For better or worse, my habit (addiction) is that I need to look at that fairly often.

So the only things on my lock screen are VIP emails, text messages and my next calendar item. All of those are things I generally want to see very often. Right now, that involves either pulling my phone out of my pocket and looking at it, or keeping it on a table and pressing a button. Oh, and I do like to look at the time on my phone pretty often too.

Those four things all seem to be pretty well served by the basic functionality of the Apple Watch. Time, check. Upcoming appointment, I think check. Text messages, check. VIP emails… well, they haven’t been specific about that, but I’d be really surprised if they didn’t integrate that awesome feature into the watch. For me, being able to accomplish those four things without the interruption of going to the phone seems really appealing. Time will tell if it actually works, but I’m hoping. And being able to dismiss a call while keeping my phone in my pocket will also be really nice.

For these reasons, my excitement is currently all about the core feature set, but I’m also intrigued by all the interesting apps that are likely to appear over the next few years. For a lark I’ve done a little daydreaming about that; maybe I’ll write up a few ideas for a later post.

Gmail Jira Decorator

As discussed in other posts, I spend a lot of time in email, and much of the email I get is related to MongoDB’s Jira. I’ve written before about my Jira summarizer, which maintains a single message in your inbox with a summary of recent activity in projects you watch. In my continuing quest to make Jira email easier to deal with, I wrote a tool to make it easier to quickly assess the email notifications about individual issues.

The tool is a Chrome extension that operates on my Gmail inbox. Every 30 seconds it scrapes the subjects of emails and does a Jira request to get some basic information. (It offloads most of this work to a separate server I wrote.) It then munges the HTML to decorate the subject of the email with the status, assignee, severity, and fix version.
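
Roughly, the content-script side of that loop might look like the sketch below; the Gmail selector, the proxy URL, and the response fields are all guesses for illustration, not the extension’s actual code:

// content_script.js -- a hedged sketch of the decorate loop described above
const JIRA_KEY = /\b[A-Z]+-\d+\b/;   // e.g. SERVER-12345

async function decorateSubjects() {
    // Gmail subject elements; the selector is approximate and changes over time.
    for (const el of document.querySelectorAll("span.bog")) {
        const match = el.textContent.match(JIRA_KEY);
        if (!match || el.dataset.jiraDecorated) continue;

        // Ask a small proxy server (hypothetical URL) for issue metadata,
        // so the page itself never needs Jira credentials.
        const resp = await fetch("https://jira-proxy.example.com/issue/" + match[0]);
        if (!resp.ok) continue;
        const issue = await resp.json();   // assumed shape: { status, assignee, severity, fixVersion }

        el.textContent += " [" + issue.status + " | " + issue.assignee +
                          " | " + issue.severity + " | " + issue.fixVersion + "]";
        el.dataset.jiraDecorated = "true";
    }
}

setInterval(decorateSubjects, 30 * 1000);   // every 30 seconds, as described above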

This allows me to quickly see things that are blockers or critical, not focus on things that are assigned to someone already, or know that someone has decided that it should be fixed in the next point release vs. at some point in the future.

Gmail Jira Decorator in action

Interested in the project? Feedback on my email-centered workflow? Let me know!

Dengue Fever

Last week I went to Las Vegas for MongoDB’s sales kickoff. The night before I left, Sunday, I came down with a decently high fever. I got a bit nervous, as it came on strong and fast, but I took some Advil, went to bed, and the next morning felt ok to get on a plane. That whole Monday was pretty good with the help of some more Advil. On Tuesday morning the Advil was giving ground, on Tuesday evening it was in full retreat, and Wednesday at 5am I found a helpful MongoDB employee in the hotel to take me to the ER.

Apparently, while in the Dominican Republic for a family vacation, a.k.a. playing with my kids in the water, I was bitten by a mosquito carrying Dengue fever. So I have now officially crossed “Get a tropical disease” off of my bucket list. I’m very excited about that.

I have two take-aways from this experience:

First, I don’t recommend getting Dengue fever. It’s not pleasant. Use a lot of bug spray, really.

Second, if you do have to get Dengue fever, make sure when you get really sick, and are given a fair amount of morphine, that a) you do not write any code, and b) you be administered said morphine in the presence of co-workers, who then have blackmail material for life.

I’m still a bit under the weather, but at least I’m not contagious.

Seriously, though, don’t get Dengue fever.

MongoDB 3.0: Seizing Opportunities

MongoDB 3.0 has landed.

The development cycle for 3.0 has been the most eventful of my entire career. As originally planned, it would have been great, but still incremental in nature. Instead, we wound up acquiring our first company, integrating their next-gen storage engine, and by capitalizing on that unlooked-for opportunity, delivering a release so beyond its original conception that we revved its version number.

Renaming a release in-flight is out of the ordinary, so I wrote about our reasoning when we announced the change. We had originally planned to deliver document-level locking built into MMAPv1, and a storage engine API as an investment in the future, not part of a fully developed integration. That would have been our incremental improvement, in line with our storage engine efforts throughout the 2.x release series. We had already added database-level locking, iterated over many improvements to yielding and scheduling behavior, and refactored a ton of code to decouple components.

At the outset of this development cycle we did several things in parallel. We carved out the code layers to support our storage engine API, started building collection-level locking into MMAPv1, and started designing document-level locking. At the same time, we worked with storage engine builders to put our API through its paces. By the summer of 2014, we had an MMAPv1 prototype for document-level locking, which we demonstrated at MongoDB World. While this was not going to make our use of disks more efficient or solve other MMAPv1 problems, it was nonetheless a huge improvement, and exactly what we were aiming for.

Then the WiredTiger team called us and demonstrated a working integration with MongoDB’s storage engine API. Before long, we realized we had before us an opportunity to shoot the moon. We would have to scale back our plans for MMAPv1 to just collection-level locking, but by doing so, we could completely leapfrog our roadmap and supercharge our team. By delivering MongoDB with WiredTiger, we could offer our users everything we had promised, along with performance MMAPv1 will never match, and features it would take years more to build in. After all, WiredTiger was developed with laser focus on the raw fundamentals of data storage in a modern environment, allowing it to support massive concurrency and other great features like compression.

For all its magnificence, WiredTiger is not yet the default storage engine. We have every confidence in its ability: it is a shipping product in its own right, and has proven its mettle to customers with the most demanding production environments, such as Amazon. We are using it ourselves in production to back MMS. However, the use cases for MongoDB are so broad and varied that we need to gather a wide range of feedback. With that data, we’ll be able to optimize and tune the integration and provide robust guidance on the role of specific metrics in capacity planning, leading to better, more predictive monitoring, and a healthy collection of best practices.

The acquisition of WiredTiger marks an important transition for me as well. Storage engines are incredibly interesting components of a database, but as much as I might like to dig further into them, our goal to make MongoDB the go-to database requires me to be more pragmatic. With a team of world-renowned experts available who know more about (for example) how to implement MVCC than I ever will, it makes sense to leave storage engines in their capable hands so I can focus on other areas.

MongoDB 3.0 is a great release. I am very proud of the massive team effort that produced it. We will not be resting on our laurels though. There is still a long list of features and improvements our users need to be successful, and with MongoDB 3.0, we expect MongoDB to be used in even more demanding and mission critical projects. Many of those projects will surprise us, and these surprises will create new demands. We are excited to get started on these challenges, further optimizing MongoDB, and extending its capabilities so the pioneers can continue to surprise us.

LiveScribe vs. Phone Camera Update: The NOOP Edition

In my first post on this topic, I said I’d post an update in a week or so. Ok, so that was about 7 weeks ago.

I abandoned the trial of both of these techniques because 2.8.0 is, frankly, more important than my experiments in productivity. I’m going to get back to it, but this is actually an opportunity to say something important about getting derailed from productivity projects by urgent items.

This happens to all of us from time to time when the pressure mounts, and that’s a good thing. The key is to keep your head, focus on the most urgent thing while it’s urgent, and remember to revisit those productivity projects. They are important in the long run, or you wouldn’t want to start on them in the first place. If you find yourself constantly saying “I had to drop that, I got too busy”, it’s time to re-evaluate.