These are exciting times for InfluxDB, the world’s most popular time-series database, part of what has been the fastest-growing category of databases for the past two years, according to DB-Engines.com. But when Paul Dix and his co-founder started the company behind it a decade ago, both the company and the product looked very different. In fact, InfluxDB went through several transformations to get where it is today, reflecting the evolution of the time-series database category. And more changes are on the horizon.
Dix and Todd Persen co-founded Errplane, the predecessor of InfluxData, back in June 2012 with the idea of building a SaaS metrics and monitoring platform à la Datadog or New Relic. The company graduated from Y Combinator’s Winter 2013 batch, attracted some seed funding, and had about 20 paying customers.
Finding the right underlying technology would be critical to Errplane’s success. Dix, who had worked with large-scale time-series data at a fintech company a few years earlier, evaluated the technology available at the time.
Of course, shrink-wrapped time-series databases didn’t exist at the time, so he essentially built one from scratch using open-source components. He used Apache Cassandra as the persistence layer and Redis as the fast real-time indexing layer, wired it all together with Scala, and exposed it as a web service. This was version 1 of the Errplane backend.
By the end of 2013, Dix was well into a version 2 rewrite intended to lift Errplane out of the morass of SaaS metrics and monitoring providers that had suddenly emerged. He again looked for answers in the vast toolbox known as open source.
“I picked up LevelDB, which was a storage engine written at Google, originally by Jeff Dean and Sanjay Ghemawat, the two greatest programmers Google has ever seen,” says Dix. He wrote V2’s functionality in a new language called Go and exposed it as a REST API.
While the product worked, it became increasingly apparent that the company was struggling. “I said, ‘You know what, the Errplane app isn’t doing well. It’s not going to take off. We’re not going to reach escape velocity,’” says Dix. “But I think there’s something in the infrastructure here.”
It turned out that Errplane’s customers were less interested in the server metrics and monitoring aspects of the product and more interested in its ability to handle large amounts of time-series data. Intrigued, Dix attended the Monitorama conference in Berlin that fall, where his suspicions were confirmed.
“What I saw from a back-end technology perspective was that everyone was trying to solve the same problem,” Dix tells Datanami. “People at the big companies were trying to build their own stack. They were looking for a solution to store, query, and process time-series data at scale. It was the same with vendors, who were trying to build higher-level applications, which is what we had been trying to do ourselves.”
“Degenerate” Use Cases
Time-series data in and of itself is nothing special. Any piece of data can be part of a time series simply by virtue of having a timestamp. But the types of applications that use time-series data have specific attributes, and those attributes drive the demand for specialized databases to manage it.
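InfluxDB’s ingest format, line protocol, makes this timestamp-centric shape explicit: each point is a measurement name, optional tags, one or more fields, and a timestamp. As a rough illustration (the helper below is hypothetical and skips the escaping and field-typing rules of the real format), a point can be serialized like this:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Serialize one point in a simplified InfluxDB line protocol:
    measurement,tag=value field=value timestamp(ns).
    The real format also escapes commas/spaces and types its fields."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

print(to_line_protocol("cpu", {"host": "server01"}, {"usage_idle": 92.5},
                       1465839830100400200))
# cpu,host=server01 usage_idle=92.5 1465839830100400200
```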
From Dix’s point of view, applications that make extensive use of time-series data fall into a category that mixes elements of OLAP and OLTP workloads but fits squarely into neither.
“The real-time aspect makes it look a bit like a transactional workload, but the fact that it’s historical data that’s being processed at scale and you’re analyzing a lot makes it look like an OLAP workload,” he says.
You could use a transactional database for time-series data, and people do that a lot, Dix says. But scaling quickly becomes an issue. “You’re adding millions, if not billions, of new records every day before you even get to really large scale,” he says.
OLAP systems, such as MPP columnar databases and Hadoop-style systems, are designed to handle large volumes of time-series data, says Dix. The downside, however, is that OLAP systems are not designed to provide continuous, real-time analysis of up-to-date data.
“They have a window of time where you take the data and convert it into a format that’s easier to query at scale, and then run the report once an hour or once a day,” says Dix.
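The batch pattern Dix describes boils down to pre-aggregating raw points into coarser windows. A minimal sketch of that downsampling step (function names and data are illustrative, not InfluxDB’s implementation):

```python
from collections import defaultdict

def downsample(points, window_s):
    """Average (timestamp, value) points into fixed-width windows,
    keyed by each window's start time."""
    buckets = defaultdict(list)
    for ts, val in points:
        buckets[ts - ts % window_s].append(val)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(buckets.items())}

raw = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0)]
print(downsample(raw, 60))  # {0: 15.0, 60: 50.0}
```

A time-series database effectively runs this kind of aggregation continuously as data arrives, rather than in hourly or daily batches.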
Data expiration is another aspect of time-series data that requires special treatment. The value of time-series data typically decreases over time, and to keep costs down, users typically delete older data.
“Well, a transactional database is not designed to essentially delete every single record you’ve ever entered into it,” says Dix. “Most transactional databases are actually designed to store data forever. Like, you never want to lose it. So this idea of automatically removing data when it becomes obsolete is not something these databases were designed for.”
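This is exactly what InfluxDB’s retention policies automate: data older than a configured duration is expired for you. For example (the database name here is a placeholder), a policy that keeps only one week of data looks like this in InfluxQL:

```sql
-- Keep one week of data in "mydb"; older points are expired automatically.
CREATE RETENTION POLICY "one_week" ON "mydb" DURATION 7d REPLICATION 1 DEFAULT
```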
Because of these challenges, all the major server metrics and monitoring companies eventually built their own proprietary time-series databases, says Dix. “They don’t even use commercial databases anymore because of these different things that make time series what I call a ‘degenerate’ use case.”
Timely new beginning
By 2015, Dix and his partner were ready to abandon Errplane and sell the backend as a time-series database instead. The good news was that they were already ahead of the game: they already had a database to sell.
“When we released InfluxDB, there was immediate interest,” says Dix. “It was obvious that we had struck something straight away. Developers had this problem, and they needed this technology to solve it. How do you store and query time-series data? Not just at scale, but how do you do it easily, even at smaller scale?”
InfluxDB didn’t take much work to get going. The biggest difference between that first release and the Errplane V2 backend was the need for a query language, which Dix likened to “syntactic sugar” sprinkled on top. It came in the form of a SQL-like language called InfluxQL, which allowed users to write queries instead of just using the REST API.
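A typical InfluxQL query reads much like SQL with time-oriented extensions (the measurement and field names here are illustrative):

```sql
-- Average idle CPU per 10-minute window over the last hour
SELECT MEAN("usage_idle") FROM "cpu"
WHERE time > now() - 1h
GROUP BY time(10m)
```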
But the work wasn’t done. After raising additional funds, InfluxData began developing a new version of the database and the surrounding tools (ETL, visualization, alerting) that would eventually become part of the TICK stack.
Dix also set about rebuilding the database, and InfluxDB 1.0 debuted in September 2016.
“For this version of InfluxDB, we built our own storage engine from the ground up,” explains Dix. “It was heavily influenced by LevelDB and that type of design, but we called ours the Time Structured Merge Tree. LevelDB uses what’s called a log-structured merge tree. So we had our own storage engine, but everything else was Go, and that was the open-source piece.”
Open to the cloud
InfluxData shares the source code for its InfluxDB database under the MIT license, so anyone can download and run it. The San Francisco-based company has also developed a cloud offering on AWS, giving customers a fully managed experience for time-series data, as well as a closed-source version that is distributed for high availability and scale-out clustering.
Meanwhile, InfluxData’s platform ambitions grew. Introduced in 2018, the TICK stack consisted of four parts: the data collector Telegraf; the InfluxDB database itself; the visualization tool Chronograf; and the processing engine Kapacitor.
“We wanted to find a way to connect the four different components in a more meaningful way, where they have a single unified language,” says Dix. That need was addressed with a new language called Flux. “We also wanted to move to a cloud-first delivery model.”
While the majority of InfluxData’s customers ran the software on premises at the time, Dix told the development team that he wanted to be able to push out updates for every part of the platform every day of the year. The transition was completed in 2019, and today the cloud is the fastest-growing component of InfluxData’s business.
This week, InfluxData unveiled several new features designed to help customers work with data from the Internet of Things (IoT), including better replication from the edge to a central instance of InfluxDB Cloud; support for MQTT in Telegraf; and better management of data payloads via Flux.
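In Telegraf, MQTT ingestion is handled by the `mqtt_consumer` input plugin; a minimal configuration (the broker address and topics below are placeholders) looks like this:

```toml
# Subscribe to an MQTT broker and parse messages as InfluxDB line protocol
[[inputs.mqtt_consumer]]
  servers = ["tcp://127.0.0.1:1883"]
  topics = ["sensors/#"]
  data_format = "influx"
```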
Another rewrite of InfluxDB’s underlying core technology is also on the horizon.
“The big thing that I’m personally focused on and looking forward to is that we’re basically building a new core of storage technology that’s going to replace everything within our cloud environment this year,” says Dix. “In this case, it’s written in Rust, while almost everything else is written in Go. And it makes extensive use of Apache Arrow.”
While InfluxDB has a massive lead in the time series database category according to DB-Engines.com, the category as a whole is fairly young and still growing. For Dix, this means that the possibilities for new use cases are diverse and growing.
“For me, the key takeaway from creating InfluxDB was that time series is a useful abstraction for solving problems in a number of different areas,” he says. “Server monitoring is one thing, user analytics is another. Financial market data, sensor data, business intelligence – the list goes on.”
When you add things like IoT and machine learning, the potential for time-series analysis becomes even greater. How big will it end up being? Only time will tell.