As data sources and volumes grow, and becoming data-driven is increasingly seen as a competitive imperative, the war among platform vendors to provide the primary repository for our data is intense. The war has several fronts, one of which is analytics. On that front, the main adversaries are the data warehouse and data lake camps.
The data warehouse side is strong: it combines proven incumbent vendors such as Teradata and Vertica (now part of Micro Focus), all three major cloud providers, and industry darling Snowflake. On the data lake side, independent providers such as Cloudera and Databricks are perhaps the most emblematic competitors. A few months ago, Databricks announced that it had achieved record results on performance benchmarks, which, in its telling, made it the winner of the battle, defeating the data warehouse model and the vendors who champion it. While this is no longer breaking news, I thought an analysis of the announcement was still in order.
Don’t just stand still
While proponents of the data lake (and “Lakehouse,” as Databricks likes to call its own platform) may criticize the warehouse as outdated, the latter is tried and tested and enjoys a degree of dominance. That puts the onus on the data lake side to show it can handle the same workloads as the warehouse with competitive performance.
Databricks now believes it has that proof. Last November, the company announced the results of a series of benchmarks created by, and based on standards from, the Transaction Processing Performance Council (TPC). The tests were conducted against the relatively new and recently improved Databricks SQL platform, the company’s foundation for the aforementioned Lakehouse architecture. Specifically, the benchmark configuration used Databricks SQL 8.3, which includes Databricks’ proprietary Photon engine, a vector-processing query engine written in C++ that serves as an optimized replacement for Spark SQL’s query processor.
Databricks SQL in particular, and the Lakehouse architecture in general, use data lake technology at their core, combined with enhancements (such as ACID compliance, write-backs, and vector processing) that help provide capability parity with data warehouse platforms. Databricks SQL continues to use clusters of machines running the Spark-based Databricks runtime, but optimizes the nodes in these clusters for the types of queries and user request patterns that are common in data warehouse and business intelligence (BI) use cases.
Databricks used the TPC-DS suite of tests, long an industry standard for benchmarking data warehouse systems. The benchmarks were run on a very powerful 256-node, 2112-core Databricks SQL cluster whose cloud infrastructure was priced by Databricks at over $5 million. Incidentally, “DS” stands for “Decision Support”, a precursor to the term Business Intelligence, which is quite appropriate given the design and mission of Databricks SQL.
Databricks characterizes the benchmark results as setting a new world record for TPC-DS performance on any platform, be it a warehouse, lake or lakehouse. The previous record holder for performance at the scale of Databricks’ TPC-DS benchmark runs was Alibaba. The Chinese internet and e-commerce giant had achieved a result of 14,861,137 QphDS @ 100 TB (decision support queries per hour, based on queries against 100 TB of data) using its own bespoke, and also quite powerful, data warehouse system.
Databricks, meanwhile, announced that it achieved a result of 32,941,245 QphDS @ 100 TB, more than twice Alibaba’s result. This was done on a system that the company said was 10% less expensive than Alibaba’s homegrown platform. And while the benchmarks were run by Databricks itself, the results were checked by the TPC.
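The arithmetic behind those two figures is easy to check. The short Python sketch below uses only the QphDS numbers quoted here; treating "10% less expensive" as a relative cost of 0.90 is an assumption made for illustration, not a figure from the TPC reports, and the function names are hypothetical.

```python
# QphDS @ 100 TB figures as quoted in the article.
ALIBABA_QPHDS = 14_861_137
DATABRICKS_QPHDS = 32_941_245

# Assumption for illustration: "10% less expensive" is read as
# a system cost equal to 0.90 of the Alibaba system's cost.
RELATIVE_COST = 0.90


def speedup(new: float, old: float) -> float:
    """Throughput ratio of the new result over the old one."""
    return new / old


def price_performance_gain(perf_ratio: float, cost_ratio: float) -> float:
    """Gain in throughput per unit of cost, given performance and cost ratios."""
    return perf_ratio / cost_ratio


perf_ratio = speedup(DATABRICKS_QPHDS, ALIBABA_QPHDS)
gain = price_performance_gain(perf_ratio, RELATIVE_COST)

print(f"throughput ratio: {perf_ratio:.2f}x")
print(f"implied price-performance gain: {gain:.2f}x")
```

Under that reading, the throughput ratio works out to roughly 2.2x, and the implied price-performance gain to a little under 2.5x.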
According to Databricks, it set an all-time record. The company further believes that any obstacles that were preventing customers from using a Lakehouse platform instead of a warehouse platform should now be removed. That’s significant because Databricks itself, in its advocacy of the Lakehouse approach, previously acknowledged that warehouse platforms performed better for certain workloads, and the company understood that this performance gap prevented customers from switching to the Lakehouse side.
Confrontation with Snowflake
Databricks clearly felt that these benchmark results would help it compete with data warehouse darling Snowflake. Speaking of which, apart from the TPC benchmark results themselves, Databricks is promoting the Barcelona Supercomputing Center (BSC) comparison of Databricks SQL and Snowflake. Databricks says this work, based on TPC-DS benchmarks but not TPC verified, shows Databricks SQL is 2.7 times faster (see image below, from a Databricks blog entry on the subject). BSC also reports that a Databricks SQL cluster is 12x better than a similarly sized Snowflake setup in terms of price performance.
There’s a lot of spin here, but what the TPC and BSC results show is that the Lakehouse architecture can handle these BI workloads, and does so respectably. This is significant because most Spark-based systems, including Databricks, were previously best suited for data engineering, machine learning, and intermittent analytics queries. It was more difficult to get such a system to serve ongoing analytics workloads, or ad hoc analysis involving multiple queries that build on one another.
If the question is whether this means that the lakehouse is now a full-fledged replacement for a warehouse, then the answer is unclear. The main reason for this lack of clarity has to do with customer criteria and customer sentiment as to why a lake or lakehouse was not an adequate replacement before. Yes, for some customers the reason to stick with a warehouse has been performance, and this series of TPC benchmarks can address those concerns and sway the customers who have raised them.
A matter of formality
For other customers, the criteria are more about the paradigm (including data modeling and, to some extent, data governance) than performance itself. The ethos of a lake is to store data in the form of named files in open formats, so that the data is compatible with, and can be used by, a range of database and analysis engines. And because the data is stored as files on disk or in cloud storage, there is less need (and willingness) to model it. This makes the data less formal, often less scrutinized, and less verified. Control is more delegated, which makes data ingestion easier. (These characteristics of a data lake also apply to lakehouse scenarios.)
A data warehouse is more formal, controlled, and typically enforces a more explicit and comprehensive data model. It’s less agile, which frustrates users, but it also has more checks in place, which can correlate with higher overall data quality and user confidence.
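The difference between the two paradigms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for a warehouse that enforces its model on write, a JSON-lines string stands in for files in a lake, and the `orders` records are hypothetical.

```python
import json
import sqlite3

# Hypothetical records; the second one is missing a required amount.
records = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": None},
]

# Warehouse-style: the model is declared up front and enforced on write.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

accepted, rejected = [], []
for rec in records:
    try:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?)", (rec["order_id"], rec["amount"])
        )
        accepted.append(rec["order_id"])
    except sqlite3.IntegrityError:
        # The NOT NULL constraint blocks the incomplete record at write time.
        rejected.append(rec["order_id"])

# Lake-style: every record is written as-is in an open format (JSON lines,
# held in a string here instead of a real file); checks are left to readers.
lake_file = "\n".join(json.dumps(rec) for rec in records)
readable = [json.loads(line) for line in lake_file.splitlines()]
usable_on_read = [r["order_id"] for r in readable if r["amount"] is not None]

print("warehouse accepted:", accepted, "rejected:", rejected)
print("lake stored:", len(readable), "usable on read:", usable_on_read)
```

The warehouse-style table refuses the incomplete record up front, while the lake-style store accepts everything and leaves it to each reader to decide what is usable, which is the trade-off between agility and data quality described above.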
Great benchmarks for big data
A system with $5 million in infrastructure and massive data volumes may be able to beat Alibaba’s benchmark, but it’s not typical of what most customers need or can afford. It does show that Databricks SQL can handle massive workloads, and for some customers that will be important in itself.
The best way to understand the meaning of the Databricks benchmark results is to frame the question correctly. Databricks would frame it in terms of which model comes out on top. But perhaps the better question is which model appeals more to certain customers in certain applications, and whether the performance of both models is sufficient.
Ultimately, most businesses can probably benefit from both a data warehouse and a data lake (or lakehouse). The warehouse can be a repository of highly vetted, carefully curated and modeled data to drive reports, operational dashboards and ad hoc queries in the realm of “known unknowns”. Lakes and lakehouses, on the other hand, can hold more data with a shorter onboarding process and less modeling-on-write, and can be used for exploratory analysis and on-the-fly visualization.
The victory, not the winner
The TPC results make it clear that both models perform well, deliver excellent results, interoperate when needed, work with the same BI tools, and are cost-effective, cloud-first, elastic, and agile. But while the warehouse/lakehouse question doesn’t have to be an either/or decision, there is an advantage in vendors seeing it that way: their competition for the same customers and the same workloads leads to continuous innovation that benefits the customer.
Whether TPC benchmarks are the ultimate arbiter of what is best depends on the buyer’s criteria. But Databricks’ TPC-DS results are still impressive. They’re a milestone for the industry and a forcing function that pushes vendors toward continuous improvement, whether they’re promoting the lake, lakehouse or warehouse.