Why You (Probably) Shouldn't Use the MongoDB Aggregation Framework

By Mastering JS Weekly • Issue #37
As a young undergraduate studying systems programming, I often wondered why threads were necessary. An individual process could run much faster if the OS didn’t have to context switch all the time, so why are threads “better” for performance?
Turns out, that thought isn’t entirely without merit. At my first job out of college, I worked on algorithmic trading software for a company where we joked that “milliseconds are so 2008” because we were optimizing at the nanosecond level. It was standard practice to run our trading software on dedicated CPUs with `nice -19` to minimize context switching and squeeze every last drop of performance out of our hardware.
The lesson is that multi-threading isn’t necessarily better for performance. If you’re careful, minimizing OS-level multi-threading can make your code much faster. But multi-threading is more about minimizing performance variance in a potentially crowded system, rather than maximizing performance in isolation. This tradeoff is similar to what happens with MongoDB queries given slow trains.

Slow Trains and Speed vs Variance
Here’s an example that demonstrates why variance often matters more than speed. Suppose you have a computer that runs one program at a time. Alice is studying astrophysics and running a complex planetary motion simulation that takes hours to run. Bob is learning to code and wants to run his first “Hello, World” script.
In isolation, Bob’s program will be fast. But if Alice is running her simulation at the same time, Bob may need to wait for hours for his code to run! In a multi-threaded OS, Bob’s code will still run about as quickly, regardless of whether Alice is running her simulation.
MongoDB currently has a limitation: one operation per socket at a time. By default, a Mongoose connection has 5 sockets open to the MongoDB server, so only 5 operations can make progress at the same time. You can configure this using the `poolSize` option.
For example, `poolSize = 1` means only one operation can run at a time. If you fire off a `find()` that sleeps for 1 second and then a `findOne()`, the `findOne()` takes about 1 second, because it is blocked behind the “slow train” of the `find()`.
A simple `findOne()` on a 1-document collection can take over 1 second!
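To make the slow train effect concrete, here is a minimal sketch that models a connection pool in plain JavaScript. It does not talk to MongoDB or use Mongoose: `simulatePool()` and the operation list are hypothetical, and the model assumes every operation is submitted at time 0 and handed to whichever socket frees up first.

```javascript
// Toy model of a connection pool: each socket runs one operation at a time.
// `simulatePool` is a hypothetical helper for illustration, not a Mongoose API.
function simulatePool(poolSize, operations) {
  // socketFreeAt[i] is the time (in ms) when socket i finishes its queued work.
  const socketFreeAt = new Array(poolSize).fill(0);
  const finishedAt = {};

  for (const { name, durationMs } of operations) {
    // Hand the operation to the socket that becomes free earliest.
    let socket = 0;
    for (let i = 1; i < poolSize; i++) {
      if (socketFreeAt[i] < socketFreeAt[socket]) socket = i;
    }
    socketFreeAt[socket] += durationMs;
    finishedAt[name] = socketFreeAt[socket];
  }

  return finishedAt;
}

// Assumed workload: a slow find() followed immediately by a trivial findOne().
const operations = [
  { name: 'find', durationMs: 1000 }, // the slow train
  { name: 'findOne', durationMs: 1 }
];

console.log(simulatePool(1, operations).findOne); // 1001: stuck behind find()
console.log(simulatePool(5, operations).findOne); // 1: runs on a free socket
```

With one socket, the 1ms `findOne()` finishes after 1001ms. With five sockets, it finishes after 1ms, regardless of the slow `find()`.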
What This Means for Aggregations
MongoDB’s aggregation framework is very powerful and lets you do a lot of amazing things. But one `aggregate()` call counts as one operation. And, because the aggregation framework lets you do so much, it also lets you shoot yourself in the foot with slow trains.
Lessons in MongoDB from Spiderman
A single query can only be so bad. If worst comes to worst, you execute an unindexed query on a massive collection, and MongoDB has to scan every document in the collection to answer it.
Once you introduce the `$lookup` aggregation stage, things can get a whole lot worse. You can get O(n^2) or O(n^3) collection scans, leading to performance issues with relatively small collections. At Mastering JS, we consider a collection with fewer than 10k documents “small” and a collection with more than 100k documents “large”. Even a small collection can produce a slow aggregation, because `$lookup` performance degrades as O(n^2) in the absence of indexes.
Slow because MongoDB has to `$lookup` for every document
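Here is a rough sketch of where the O(n^2) comes from. The `customers` and `orders` collections, the `customerId` field, and both helper functions are assumptions made up for illustration. The code counts documents examined in memory rather than running a real aggregation, and a `Map` stands in for an index that already exists on the server.

```javascript
// Shape of a `$lookup` stage. Collection and field names are hypothetical.
const pipeline = [
  { $lookup: { from: 'orders', localField: '_id', foreignField: 'customerId', as: 'orders' } }
];

// No index on `customerId`: every customer triggers a full scan of `orders`.
function lookupWithoutIndex(customers, orders) {
  let docsExamined = 0;
  const joined = customers.map(customer => {
    const matches = orders.filter(order => {
      docsExamined++;
      return order.customerId === customer._id;
    });
    return { ...customer, orders: matches };
  });
  return { joined, docsExamined };
}

// With an index, each lookup only touches the matching orders.
// Building the Map is not counted, since a real index already exists.
function lookupWithIndex(customers, orders) {
  const index = new Map();
  for (const order of orders) {
    if (!index.has(order.customerId)) index.set(order.customerId, []);
    index.get(order.customerId).push(order);
  }
  let docsExamined = 0;
  const joined = customers.map(customer => {
    const matches = index.get(customer._id) || [];
    docsExamined += matches.length;
    return { ...customer, orders: matches };
  });
  return { joined, docsExamined };
}

// Two "small" collections of 1,000 documents each, one order per customer.
const n = 1000;
const customers = Array.from({ length: n }, (_, i) => ({ _id: i }));
const orders = Array.from({ length: n }, (_, i) => ({ _id: i, customerId: i }));

console.log(lookupWithoutIndex(customers, orders).docsExamined); // 1000000
console.log(lookupWithIndex(customers, orders).docsExamined);    // 1000
```

Two collections of 1,000 documents each already mean a million documents examined without an index, and that work all happens inside a single operation that holds onto its socket.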
So an aggregation that uses `$lookup` without indexes can lead to extremely slow trains, which can lead to performance degradation on other queries.
When To Use Aggregations
If you’re building a classic web app or REST API, you should prefer to use multiple queries rather than aggregations. That’s because many fast queries provide better concurrency and less performance variance under load than one potentially heavy aggregation. In other words, using queries rather than aggregations leads to fewer slow trains and more consistent performance under unpredictable workloads.
However, aggregations can be useful in cases where you have control over the workload. In a REST API, you don’t have much control over the workload: it depends on what your users want. But if you’re executing ad-hoc analytical queries, or even loading data from MongoDB when compiling a Jamstack app, aggregations can give you better performance.
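As one possible sketch of the multiple-queries approach, the functions below replace a single `$lookup` with two short queries and a join in application code. The `Customer` and `Order` models, the `customerId` field, and both function names are hypothetical. The models are passed in as arguments and assumed to be Mongoose models, so only `find()` and `lean()` are used.

```javascript
// Pure helper: group orders by `customerId` and attach them to each customer.
// String() keeps the Map keys comparable when ids are ObjectIds.
function attachOrders(customers, orders) {
  const byCustomer = new Map();
  for (const order of orders) {
    const key = String(order.customerId);
    if (!byCustomer.has(key)) byCustomer.set(key, []);
    byCustomer.get(key).push(order);
  }
  return customers.map(customer => ({
    ...customer,
    orders: byCustomer.get(String(customer._id)) || []
  }));
}

// Two short operations instead of one potentially heavy aggregation.
// `Customer` and `Order` are assumed to be Mongoose models.
async function loadCustomersWithOrders(Customer, Order, filter) {
  const customers = await Customer.find(filter).lean();
  const ids = customers.map(customer => customer._id);
  const orders = await Order.find({ customerId: { $in: ids } }).lean();
  return attachOrders(customers, orders);
}
```

Each query gives its socket back as soon as it completes, so other requests can get a turn between the two. The second query is only fast if `customerId` is indexed, so this pattern does not remove the need for indexes.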
That’s because, in those cases, you have more control over what queries you execute when. You don’t have to worry about a potentially crowded connection pool.
Get Mastering Mongoose Today
As a Mastering JS subscriber, you can buy a discounted copy of Mastering Mongoose here. If you leave a review on GoodReads after purchasing, we will give you an extra $5 back. Thanks for your continued support!
Other Interesting Reads
Coding efficient MongoDB joins. The aggregation framework allows joins… | by Guy Harrison | dbKoda | Medium