Eliot's Ramblings

Refreshing My Client-Side JavaScript Chops

My overarching goal as CTO of MongoDB is to make building applications easier and easier. Given my day job, it’s actually been a little while since I did any work on the front end.

Since we started MongoDB, client-side JavaScript has really taken off. This pleases me, because our decision to make JavaScript a core part of the MongoDB experience was based on a view that JavaScript would continue to rise in capability and prominence. Back in 2005, I was spending quite a bit of time in JavaScript, but the state of the art has changed dramatically since then.

Wanting to refresh myself, I wrote a proof-of-concept real-time dashboard for MongoDB using JavaScript and Handlebars.

Sure, it’s not much to look at, but I was really excited by how easy it was to do full client-side rendering with just API calls for data, compared to the last time I tried. In addition, I finally took ES6 for a test drive, and I have to say classes and modules live up to what I had hoped for.
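
To give a flavor of it, here's a heavily simplified sketch of the pattern (the endpoint, template, and class names are made up, not the actual dashboard code): an ES6 module exports a class that pulls stats from a REST endpoint and re-renders a Handlebars template entirely on the client.

// dashboard.js - illustrative only; '/api/stats' and the stats fields are hypothetical
import Handlebars from 'handlebars';

// compile the template once at module load
const template = Handlebars.compile(
  '<ul>{{#each ops}}<li>{{name}}: {{count}}</li>{{/each}}</ul>');

export class Dashboard {
  constructor(element, statsUrl) {
    this.element = element;    // DOM node to render into
    this.statsUrl = statsUrl;  // endpoint that returns JSON stats
  }

  refresh() {
    // fetch the latest stats and re-render on the client; no server-side templating
    return fetch(this.statsUrl)
      .then(response => response.json())
      .then(stats => { this.element.innerHTML = template({ ops: stats.ops }); });
  }

  start(intervalMs) {
    // poll to keep the dashboard "real time"
    this.refresh();
    setInterval(() => this.refresh(), intervalMs);
  }
}

Wiring it up is a one-liner along the lines of new Dashboard(document.getElementById('ops'), '/api/stats').start(5000).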

I think next up is React.

Innovate vs. Appropriate

One theme I kept harping on at MongoDB World a few weeks ago was knowing when to innovate around new ideas and when to simply reuse what has already worked well for successful products. This comes up continuously at MongoDB, because having a good understanding of it is a significant competitive advantage. I attribute a large part of MongoDB’s success to our unbending adherence to this discipline.

When we started MongoDB, we had a clear goal - make data and databases easier for developers and operators, so that data and databases serve their users, not the other way around. To that end, there were two key things we wanted to change. First, we wanted to reduce the impedance mismatch between application code and data, which we addressed by building our database around collections of documents, rather than tables of rows. Second, we wanted to make distributed systems accessible and usable by most organizations, which we have done by making them first-class components and intuitive to use, instead of leaving them to higher layers of the stack to build.

Everything else we wanted to leave the same. We would ask ourselves, “Does this need to be different in a document-based distributed database?” For example, indexes in MongoDB have the same semantics as their relational counterparts, because those semantics have worked quite well for decades and map directly onto querying a document database. Query results composed of documents need the same filtering and/or sorting as those composed of rows, and an index that covers an ordered set of fields (a, b, c) makes an index on a same-ordered prefix of those fields (a, b) redundant. MongoDB provides a shell that mimics a relational shell, because in our world the need to explore databases and collections, run ad hoc queries, create indexes, get stats, and perform analytics is identical to the need in a relational world. Even the mongodump name came about because I had been using mysqldump for a decade, and making data easy to work with and distributed systems accessible would not be furthered in any way by changing that aspect of a database.
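
To make the index-prefix point concrete, here's roughly what that looks like in the shell (collection and field names are placeholders):

// a single compound index on (a, b, c) ...
db.events.createIndex({ a: 1, b: 1, c: 1 })

// ... also serves queries and sorts on the prefix (a, b),
// so a separate index on { a: 1, b: 1 } would be redundant
db.events.find({ a: 42 }).sort({ b: 1 })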

Today, we are constantly improving features and adding new ones to MongoDB. Every time we do, the question is: do we need to invent something new to make this feature fit into the document world or a distributed systems world? If the answer is yes, we innovate to try to make it suitable for MongoDB. If not, we try to find the best solution out there and apply it to MongoDB (in a patent-safe way, of course).

Why is this so important? First off, it takes a lot more thought to invent something than to copy something. Taking semantics or ideas from successful systems focuses design and architecture work where it is needed most. Focusing innovation also makes it easier for users to learn a new system. If everything is needlessly different, it will be more frustrating for your users, so there had better be a good reason. And lastly, every innovation involves risk. You think you are improving something, but if you’re wrong, you’ve wasted time and have to do it all over again.

This is an important concept for all companies to master, both new and old. Like the adage “Is this a core competency” for helping decide if you should build or buy, all product teams adding features should be asking themselves “Does this need to be different in our domain?”

Backend as a Service: Security and Privacy

Almost all modern applications are composed of presentation layers, services executing business logic, and backing stores where the data resides. Developers could be more productive and agile if they could work more directly with the backing data without having to build specific APIs for every access type, but this is quite a challenging problem. An emerging class of solutions known as Backend as a Service (BaaS) has tried to address this problem over the last few years, but hasn’t become the norm yet.

In an ideal world, it would be great if your web or mobile app could talk directly to a database. In the real world, though, this is never done, for several reasons. Let’s start today with security and privacy: fine-grained access control needs to be built in from the ground up, and it also has to be expressive enough to let arbitrarily complex applications be built.

Security and privacy challenges are about allowing different users to have access to different data, different documents, and maybe even different fields. One might need to query on computed values of fields without being allowed to see those fields directly. A famous example of this is Yao’s Millionaires’ Problem, in which two millionaires want to determine which one is richer without revealing their net worth. Solving problems like that requires the kind of fine-grained access control that lets a user run queries such as “show me all documents where a % 5 == 1” without being able to see the actual value of a. A broad category of problems, of which Yao’s Millionaires’ is one, is called secure multi-party computation, and its solutions all rely on offering that kind of access control. If you are building your own REST API for your web app, building in that logic is trivial. If you are trying to build a generic BaaS, it’s a lot more complex.
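
To make that concrete, here's roughly the shape of query such an access rule has to permit, against a hypothetical accounts collection. The query itself is easy; the hard part for a generic BaaS is guaranteeing, on the server side, that the raw field can never leak back to the client:

// match on a computed property of the field a (a % 5 == 1) ...
db.accounts.find(
  { a: { $mod: [5, 1] } },  // documents where a % 5 == 1
  { a: 0 }                  // ... while never returning the raw value of a
)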

There are a few BaaS providers working on this problem; Parse and Firebase are probably the best examples at the moment. They have both definitely pushed this along pretty well, but I think another big step function is needed. Pushing the security and privacy model further, so that apps can be more expressive, will let BaaS radically improve time to market for many applications.

CSNYC, the 2016 NYC CS Fair, and Bootstrap

Last Thursday (4/7/2016) I spoke at the 2016 NYC CS Fair. Their number one goal is to encourage public high school students who study CS to stick with it, by showcasing all the great opportunities that await them should they pursue a career there. I talked about being a hacker, how to negotiate CS studies in higher education, the difference between CS and software engineering, and the importance of a good mentor. It was a great group of kids, and if even 25% of them go on to become CS students, I think the industry is going to see some positive change.

The CS Fair is put on by CSNYC, with the mission of “Computer Science for all”. One of the reasons I support CSNYC is that I believe it actually has a chance of moving the needle for diversity in tech. My personal belief is that the only way to make meaningful, lasting change in this area is to get kids excited about the field earlier and earlier. The more children think of CS as their thing, the better they can resist the fear, negative pressure, or bias that might otherwise push them to drop it.

In order to get kids excited about computer science, a few things need to happen, but the most interesting one to me at the moment is Bootstrap. Bootstrap is a program that integrates computer science into the math curriculum: it is meant to drop into an existing math class and be a better way to teach concepts like functions and variables, while also teaching computer science. Quite a few schools are starting to use Bootstrap, and I’m trying to help drive adoption. Others are doing far more than I am, but I’ll keep trying anyway. If you’re involved in CS education, leave me a comment. How can I help?

My Fireside Chat at Data Driven NYC

A couple of weeks ago I did a great fireside chat with Matt Turck at Data Driven NYC.

I’ve always found that the fireside chat is a format with a lot of potential to be boring, but Matt is a great interviewer, and interacting with him on stage definitely adds to the event. For example, when I was talking about the headline features of our 3.2 release, I omitted a significant pair – the BI connector and Compass – and he reminded me to talk about them. It’s things like that which enhance the experience for the audience. At their best, fireside chat interviewers take care of the setup, make sure you’re staying on track, and grab opportunities to dig deeper.

One thing that Matt brought up (at around 12:40 in the video) was how, after an explosion of alternatives to relational databases, it’s starting to feel like things are converging again. Now, when you do one of these, you get a list of topics to prepare for in advance, but that’s a question that emerged organically from our conversation. I appreciated the opportunity to address that by citing a core tenet of MongoDB (at 14:45):

“We really want you to be able to configure yourself into different use cases, rather than having to use different kinds of products.”

All the other speakers were very interesting. I was particularly into Dr. Kieran Snyder’s Textio presentation – that’s a cool product, with cool tech, and a lot of potential. Think about it: there’s an enormous amount of text in the world. You will never be able to read it all. Algorithms that understand the vast swaths of text you will never be able to personally consume have the potential to revolutionize knowledge. Or consider influential articles that wind up with thousands – or maybe some day millions – of comments… stuff like this can sift signal from noise. There’s a lot of promise there. Kieran and I had a really interesting chat about this after the event, and I’m looking forward to seeing where they take Textio.

All in all, it was a really enjoyable evening, and I would definitely recommend Data Driven NYC to anyone in New York with any interest in tech. The events happen monthly, so it should be relatively easy to catch one.

After Parse: Where Should MBaaS Go?

Last week I talked about Parse shutting down and how unfortunate that was, but also how outstanding a job they have done providing a transition path for their current users. MongoDB also published a very detailed post on how to migrate a Parse app onto MongoDB Cloud Manager and AWS Elastic Beanstalk. Since that day, the amount of activity on the open source Parse Server has been phenomenal, and many have suggested, as did one commenter on my last post, that this means it’s time for MongoDB and Parse to work even better together.

All this discussion I’ve had about Parse has got me thinking about the nature of the Mobile Backend-as-a-Service space and MongoDB’s role there. I’m also interested in hearing directly from customers. If your MBaaS-backed application deals with a decent amount of data and load, leave a comment or shoot me an email (eliot@mongodb.com), I want to talk about making MongoDB and the ecosystem around it even better for the MBaaS use case.

Farewell Parse

Updated 2/3/2016 to reflect the publication of MongoDB’s migration guide.

I was sad to hear about Parse shutting down last week.

Parse made a big push towards serverless architectures, which I think is a great goal. Serverless architectures are the ultimate in letting developers focus on making great products for their users and letting other people make the plumbing work.

In the early days of web and mobile application development, backends were a thing that every team had to write themselves from scratch. Over time, common patterns were encapsulated into application frameworks. Parse was a glimpse farther into the future, providing app developers an abstraction for an entire backend.

I’m a fan of this approach. MongoDB’s number one focus has always been on making developers more productive, so they don’t have to work so hard to use their databases. As more and more applications are built using a set of services, the next step will be around making those services composable, their data composable, and making those pieces really easy to build, deploy, and maintain.

I’m also really happy to see the way Parse is handling their shutdown. They are giving users one year to move off, providing tools to migrate, and releasing an open-source version of the backend so the migration can be as painless as possible.

MongoDB has published a guide on how to use Cloud Manager and Elastic Beanstalk to help those migrating off of Parse. We’ll also offer a consulting package for those who want a bit more customized help.

Mayor De Blasio Announces Comprehensive NYC K-12 CS Education Program

For the past 6 months, I’ve been participating in the NYC Tech Talent Pipeline Advisory Board, a partnership between New York City and technology companies in New York. From the press release announcing this board’s formation:

Mayor Bill de Blasio today announced 14 initial industry commitments to support the delivery of technology education, training, and job opportunities to thousands of New Yorkers as part of the Administration’s NYC Tech Talent Pipeline initiative. Announced by the Mayor in May 2014, the NYC Tech Talent Pipeline is a first-of-its-kind, $10 million public-private partnership designed to support the growth of the City’s tech ecosystem and prepare New Yorkers for 21st century jobs. The commitments were announced at today’s inaugural convening of the NYC Tech Talent Pipeline Advisory Board, during which Mayor de Blasio and 25 executives representing the City’s leading companies came together to help define employer needs, develop technology training and education solutions, and deliver quality jobs for New Yorkers and quality talent for New York’s businesses.

The board has been working since it was convened to devise, fund, and institute programs that train New Yorkers in the technology skills needed to drive innovation by businesses which operate here in NYC. And yesterday Mayor Bill de Blasio had a big announcement to make: within 10 years, every public school in NYC will offer computer science education.

I’m not going to belabor the obvious point that educating New Yorkers in technology skills is a win-win scenario. What is so great about this unprecedented commitment to computer science education is that it brings these benefits to all young children in New York.

I started programming at a very young age. I went to computer camp when I was 7. I took computers apart and tried to make them better. I was lucky enough to be exposed to computer science early, and to have a father who encouraged and helped me when I was young. That early experience made a huge difference in my life and played a large role in where I am today. But I was a rare exception, and that’s not how it should be.

Learning computer science requires access to a computer, which back then was not ubiquitous, but today, everyone has one in their pocket. It’s high time to adapt to this new reality and to stop thinking of computer science as an elective suitable for a small slice of the population. There are many reasons to expect students from all backgrounds to take to computer science with gusto. Software provides immediate gratification, which is great for fostering excitement in learning. It requires very little capital to write software, so anyone with dedication should be able to build something great. But it’s hard to do well, so developing understanding and excitement early makes a big difference. We need to give kids a chance to love CS before they hear or assume that they’re not the right type of person to be a software engineer.

I fully expect this program to lead to huge, positive changes in the lives of the children of NYC, and to bring to the companies that need software engineers a large, vital, diverse pool of them.

Document Validation and What Dynamic Schema Means

When we first published a mongodb.org homepage, we sloppily described MongoDB as “schema free”. That description over-emphasizes the baggage MongoDB left behind, at the expense of true clarity. At the time, however, document databases were brand new, and it was simple to describe them in terms of what they were not (witness the prevalence of the terms “non-relational” and “nosql”). This over-simplification was much more than an oversight. As you can see by reviewing this old blog post, it reflects an immaturity in our thinking. By 2011 we had come to see that calling MongoDB “schema free” reflected an old way of thinking about what “schemas” actually are, so we changed the homepage to say “dynamic schema”.

To appreciate the context for this evolution, recall that when we launched MongoDB, “schema” meant the tables your data was stored in, and the rules that governed the relationships between those tables. Relational schemas have a fixed structure, with strongly typed fields, so complex entities can only be modeled as collections of tables, with their relationships to each other also strongly defined. So schemas are fixed, and altering them is a high-cost operation. It seemed correct to say that MongoDB was free of schema.

The DDL used to define a relational schema affords a few additional usability benefits as a side effect of how it requires data to conform to the relational model. Two key benefits: schemas provide documentation of what data is in a table (if you’ve seen one row, you’ve seen ‘em all!), and validation of the fields, by their very definition.

At this point it seems needlessly reductionist to call MongoDB schema-free, since of course MongoDB and the apps built on it have always had schemas; they just embody them in their queries and the indexes they build to support them, rather than in a table definition. Furthermore, we did plan to offer our users the documentation and validation aspects of schema, but wanted to focus on developing the document model first. When MongoDB was created, we saw more value in doing away with the restrictive elements of tables than in keeping them for their side effects, especially when those side effects could be delivered as features deliberately designed to suit the needs of developers and operators.

In MongoDB 3.2 we are following through on that plan, and one of those features is document validation. To use it, you attach a validation document to a collection. Validation documents use the MongoDB query language to add constraints to the documents inserted into that collection. An example validator might be:

{ age : { $gte : 0, $lte : 150 } }

If someone tried to insert a document with a null or missing age, the document would be rejected. If you tried to insert 32 as a string, or -5, it would also be rejected. This allows the database to enforce some simple constraints on the content of the documents, similar to PostgreSQL’s check constraints.
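
For instance, assuming that validator were attached to a hypothetical people collection, the shell behavior would look roughly like this:

db.people.insert({ name: "A", age: 32 })    // accepted
db.people.insert({ name: "B", age: -5 })    // rejected: out of range
db.people.insert({ name: "C", age: "32" })  // rejected: a string never satisfies the numeric bounds
db.people.insert({ name: "D" })             // rejected: a missing age doesn't match the validator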

One common use case for MongoDB is aggregating data from different sources. With document validation, you’ll be able to ensure that all of the sources have some common fields (like ‘email’) so they can be linked.

You can attach a validation document to a collection at creation time, by including it as a validator field in the db.createCollection command, or by using the collMod database command:

db.runCommand( {
   collMod: "contacts",
   validator: { $or: [ { phone: { $exists: true } }, { email: { $exists: true } } ] }
} )

There are a number of options that can be used to tune the behavior of validation, such as a warn-only mode and control over how updates that don’t pass validation are handled, so have a look at the dev-series documentation for the complete picture.
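
As a rough sketch of those options (using the names from the dev-series docs), something like the following would put the contacts collection into a more lenient mode: validationAction: "warn" logs violations instead of rejecting the write, and validationLevel: "moderate" skips validation for updates to documents that already fail the validator.

db.runCommand( {
   collMod: "contacts",
   validationAction: "warn",
   validationLevel: "moderate"
} )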

Along with the rest of the 3.2 “schema when you need it” features, document validation gives MongoDB a new, powerful way to keep data clean. This is definitely not the final set of tools we will provide, but rather an important step in how MongoDB handles schema.

Under the Hood With Partial Indexes

Partial indexes allow you to create an index that only includes the documents in a collection that match a filter expression. These indexes can be much smaller, cutting down index overhead in both storage space and update time, and queries that match the filter criteria can use the slimmed-down index and run much faster. This is one of the new lightweight “schema where you need it” features we’re bringing to MongoDB in 3.2. The idea for this feature came from discussions with users who are accustomed to it from relational databases like PostgreSQL, which introduced the feature in version 7.2. With partial indexes, you use your knowledge of your application to adjust the space/time tradeoff you make when creating indexes to fit your needs.

One great example is a collection where documents go through an active phase and then move into an archival state, marked by a state field update (like “billed” going from false to true), in which they make up the bulk of the collection’s footprint. Since you’re unlikely to access them in that state outside of looking up a single record by its primary key or running an analytical collection scan, their entries would just clutter up your indexes, consume RAM, and make your other operations run slower.
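
A quick sketch of what that might look like in the shell (the field names are made up). Only unbilled documents ever enter the index, and a query that includes the filter condition is eligible to use it:

// index only the active, unbilled orders
db.orders.createIndex(
   { custId: 1, createdAt: 1 },
   { partialFilterExpression: { billed: false } }
)

// this query can use the partial index, since it includes the filter condition
db.orders.find({ custId: 12345, billed: false })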

So, here’s an architecture question… is this a storage engine change?

Well, that’s a trick(y) question. From a design standpoint it absolutely should not be. Storage engines are simple (conceptually) and need to be focused on one thing: storing and retrieving data efficiently. Indexing concerns belong to layers above a storage engine.

But in the pre-3.0 days, this would have had to be a storage engine change, because we had not yet created a nice separation of concerns. A ton of work had to be done behind the scenes as we built 2.4, 2.6, and 3.0 to make this possible, but now we’re seeing all that hard work pay off. Pluggable storage engines are a big part of the future of MongoDB, and a sane architecture separating these layers turned implementing partial indexes from a nightmare into some code that’s actually really pleasant to read. So pleasant, in fact, that I’m going to tour you through some of it, by tracing the path of an insert into a collection.

At a high level, an interaction with a MongoDB collection traverses several layers to get down to the storage engine. For this tour, we’ll skip the networking and user layers, and trace the path from the Collection object to the IndexManager to the StorageEngine.

(Note: all links here are to the 3.1.7 branch to make sure they are stable, so this code is already slightly out of date - see master for newer code. Line numbers will have changed, but the general flow will be the same. (For the next year at least!))

The entry point is Collection::insertDocument, which hoists out error handling (including document validation, another one of our 3.2 features, but that’s for another post) and passes down to Collection::_insertDocument.

This code contains a transition across areas of concern:

A Collection calling down to a RecordStore
StatusWith<RecordId> loc = _recordStore->insertRecord(
    txn, docToInsert.objdata(), docToInsert.objsize(), _enforceQuota(enforceQuota));

_recordStore is an instance of our abstraction around storage engines (more detail can be found here), and you can see that we just hand the data for the document over to the _recordStore to handle.

The architecture detail of note is that this code doesn’t deal with indexes, nor is indexing buried below that call to insertRecord. Rather, after doing a little collection housekeeping, _insertDocument just calls IndexCatalog::indexRecord, which in turn calls _indexRecord for every index on the collection.

There, we simply do not index entries that do not match:

Does the index filter match the document?
const MatchExpression* filter = index->getFilterExpression();
if (filter && !filter->matchesBSON(obj)) {
    return Status::OK();
}

For each index where the expression matches (or where there is no filter), it calls IndexAccessMethod::insert, which generates the keys (0 to many, typically 1) and inserts each one. IndexAccessMethod is a superclass abstracting how indexes are used, since there are many types, such as geospatial, btree, and full text, and each has its own implementation.

(Those of you following along in the code might notice the abstraction for the index itself is stored as the _newInterface member of the IndexAccessMethod class. At some point that will get a better name!)

So now the storage layer doesn’t know about partial indexes at all.

The reason this works is that the storage engine layer is required to expose a transactional key/value API, through which all interactions pass, accompanied by a transaction descriptor. The layer above that treats both collections and their indexes as sets of key/value maps. So inserting a document into a collection with 2 indexes is 3¹ separate table insert calls to storage engine code from higher layers, with atomicity ensured by the storage engine’s transaction system.


  1. or more, in the case of multi-key indexing of arrays