The subject line of this issue is a bit of prognostication, of course, but it’s nonetheless remarkable how fast the next big thing comes along. Apache Spark likely has a long road ahead of it as the data-processing engine of choice for a wide variety of jobs, ranging from stream processing to machine learning. However, researchers are working fast to overcome some of Spark’s limitations.
The latest effort, called Flare, comes from researchers at Purdue and Stanford. They have developed a system they claim can outperform Spark on SQL queries, while still maintaining solid performance across other processing tasks. Importantly, they have also designed Flare to take advantage of the prevalence of resource-rich cloud computing instances, which can increase efficiency by letting users scale up rather than scaling out.
As they explain:
“Today, machines with dozens of cores and memory in the TB range are readily available, both for rent and for purchase. At the time of writing, Amazon EC2 instances offer up to 2 TB main memory, with 64 cores and 128 hardware threads. Built-to-order machines at Dell can be configured with up to 12 TB, 96 cores and 192 hardware threads. NVIDIA advertises their latest 8-GPU system as a ‘supercomputer in a box,’ with compute power equal to hundreds of conventional servers. With such powerful machines becoming increasingly commonplace, large clusters are less and less frequently needed. Many times, ‘big data’ is not that big, and often computation is the bottleneck. As such, a small cluster or even a single large machine is sufficient, if it is used to its full potential.”
I don’t know about the work at RISELab, but the researchers working on Flare have retained support for Spark’s APIs, which are user-friendly and powerful, and also widely used at this point.
Basically, what we’re seeing right now across the big data space is what happens when innovation in cloud computing mixes with open source development. The result is that platforms like Hadoop and Spark, designed for yesterday’s use cases using yesterday’s infrastructure, end up showing their age faster than some folks might expect.
This isn’t necessarily a big deal for the current big data software market. Enterprise adoption is still the yardstick by which companies trying to commercialize these technologies are judged, and enterprises are typically slower to adopt new technologies and have lots of legacy systems that any new technology needs to integrate with. They also tend to prefer proven, commercially supported technologies to research projects. For Cloudera or Hortonworks or MapR or Databricks, there’s still plenty of work to do and plenty of opportunity to bring those large companies up to speed at their own pace.
Spark, and even Hadoop, will have a long and fruitful life inside large companies as the focal point of data efforts ranging from analytics to artificial intelligence.
What I always wonder, though, is what the pace of innovation means for the startups of today and tomorrow, born of the cloud and open source eras. They’re all aiming to become the next large enterprise, and they don’t have technical debt tied to legacy databases or even big data systems. Some newer large companies with open source roots, like Facebook, continue to develop their own technologies when the legacy stack can’t keep up.
It seems like we’re poised to start hearing about the next big thing in big data rather soon. I’m excited to find out what that will be, and also curious to see how long it lasts before its replacement comes along.