🤔 Data Explained: OLAP vs. OLTP
Is the data warehouse dead? Or is the only thing dead the horse being beaten? 🐴
📚 What's a Warehouse?
The universe of data jargon is pretty large: today, we have data warehouses, data lakes, and lakehouses. While I simply want to hike to Delta Lake, it's now a storage technology, too.
Data warehouses are the OG of data jargon. The term was coined in the IBM Systems Journal back in 1988, and a data warehouse is commonly described as "a system used for the analysis and reporting of structured and semi-structured data."
We'll go even simpler than that: it's just a relational database. Good, at least that part was easy.
📦 Storage Wars
The biggest choice in data might just be database selection. If you're starting a data team, whether to use a cloud-native data warehouse (BigQuery, Snowflake) or a cloud-hosted traditional database (Postgres, MySQL) will occupy a large portion of your time.
These are more commonly referred to as OLAP (Online Analytical Processing) and OLTP (Online Transactional Processing) databases, respectively. We'll talk about why you might choose one or the other and the future of data warehousing.
📝 Why OLTP?
OLTP systems are engineered to handle transactional data originating from multiple users. This usually takes the form of a row-oriented database. Many traditional database systems are OLTP: Postgres, MySQL, etc.
If you currently work at a company and you hear engineers talk about a "prod" database (or you work on the prod database 😠), it's likely an OLTP store. Why? Because OLTP is great for inserting and retrieving transactional data (one row at a time).
To insert a transaction, you need to write an entire row. This is perfect, since OLTP systems operate by row: data is written and retrieved one row at a time. Historically, databases were used to power production systems, so they were designed to write and retrieve one record at a time, really well. Cool, right?
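For instance, the bread and butter of an OLTP database looks something like this (a sketch, assuming a hypothetical orders table):

```sql
-- Write one row: a single transaction lands as a single record.
INSERT INTO orders (order_id, customer_id, order_date, revenue)
VALUES (42, 7, '2022-03-01', 19.99);

-- Read one row: a point lookup by primary key is cheap and fast.
SELECT *
FROM orders
WHERE order_id = 42;
```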
⚙️ Why OLAP?
As data became big, this started to present a problem. What happens when you need to query 1 million rows? 10 million rows? Since we just talked about how row-oriented databases work, let's assume you want to sum revenue by month over a table with 10 million rows. Since OLTP sources read data one row at a time, you'll have to process every column of every row in the table. Not good. 😬
Furthermore, traditional OLTP databases require configuration. You'll have to understand horizontal vs. vertical scaling, provisioning, and a whole bunch of DevOps stuff. That's not a bad thing per se, but it's not data. It's DevOps. So now, as a data engineer, you'll have to go through a DevOps team, which can be a huge bottleneck, or learn DevOps, which isn't part of your competitive advantage (hint: it's data).
From this need, we saw the rise of "serverless" OLAP systems. These systems, think BigQuery, Snowflake, and Redshift (ew), allow for elastic scale in both storage and compute. That means they can scale up to process a demanding query, then return to baseline once it finishes. Running low on space? You can add more storage with a single click. They're highly optimized for analytic workloads.
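In Snowflake, for example, scaling is just SQL (a sketch; the warehouse name is made up):

```sql
-- Bump compute up for a heavy workload; storage is unaffected.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Let the warehouse pause itself when idle and wake on the next query,
-- so you stop paying for compute you aren't using.
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```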
What do we mean by "analytic workloads"? Well, tasks that are common in analytics: aggregating data, joining data, etc. These correspond to GROUP BY, JOIN, and WINDOW in SQL.
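Something like this (a sketch, assuming hypothetical orders and customers tables):

```sql
-- One analytic query hitting all three: an aggregate, a join,
-- and a window function for a running total by region.
SELECT
  c.region,
  DATE_TRUNC('month', o.order_date) AS month,
  SUM(o.revenue) AS revenue,
  SUM(SUM(o.revenue)) OVER (
    PARTITION BY c.region
    ORDER BY DATE_TRUNC('month', o.order_date)
  ) AS running_revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY 1, 2;
```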
In practice, analytic workloads are not evenly distributed. You might have a complex query that takes 10 minutes to run, then 5 hours of idle time, then another complex query and an ETL pipeline. Serverless makes sense here: scale up and down to match uneven loads throughout the day (and hopefully no load at night).
Now let's look back at our revenue query. Before, we were doing something like this:
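(A sketch, assuming a hypothetical orders table with 10 columns, order_date and revenue among them, and 10 million rows.)

```sql
-- Monthly revenue over ~10M rows. On a row-oriented store, the engine
-- reads every column of every row just to get at two of them.
SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(revenue) AS monthly_revenue
FROM orders
GROUP BY 1
ORDER BY 1;
```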

No bueno. But now, we're reading one column at a time. So that ten-million-row aggregation? You only need to scan two columns, date and revenue. In a table with 10 columns, that's 20% of the original data! Now it looks something like:
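(The SQL itself is identical; what changes is what the engine actually reads. Same sketch as above.)

```sql
-- Same query on a column-oriented store: only the order_date and
-- revenue column segments are scanned. The other eight columns are
-- never touched, so only ~20% of the bytes are read.
SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(revenue) AS monthly_revenue
FROM orders
GROUP BY 1
ORDER BY 1;
```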

So now we're crazy efficient… and our systems can automatically scale. That's the power of serverless OLAP systems. 🤯
OLAP systems are most commonly used by analytics and data science teams for their speed, stability, and low maintenance cost. To summarize:

OLTP: row-oriented, built to write and read one record at a time, typically powering production apps (Postgres, MySQL).

OLAP: column-oriented, built to scan and aggregate many rows at once, typically serverless, powering analytics (BigQuery, Snowflake, Redshift).
🔮 The Future of Data
So, there are a lot of folks out there saying "data warehouses are dead," but most of them sell products built entirely on data lakes and compete directly with data warehousing solutions.
Now, don't get me wrong, data lakes are pretty cool, but most data teams still need a data warehouse. More than the architecture itself, warehouses provide massive amounts of computing power behind a SQL interface that anyone can understand. That means analysts and analytics engineers can write SQL workflows that parallelize and process relatively large amounts of data, with virtually zero upkeep.
Once ubiquitous, tech like Spark and Hadoop is now reserved for the largest data teams: Facebook, Netflix, etc. So, unless you're working with petabyte-scale data, data warehouses are still very much useful.
But data is a fast-moving space and the future is bright. Tools like DuckDB and open table formats like Delta Lake, Hudi, and Iceberg are paving the way for systems that can read Parquet directly into a relational engine, or create virtualized databases that leverage in-memory, relational operations.
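DuckDB can already do a version of this today; querying Parquet in place looks something like the following (a sketch; the file path is hypothetical):

```sql
-- Scan Parquet files directly: no load step, no server.
SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(revenue) AS monthly_revenue
FROM read_parquet('data/orders/*.parquet')
GROUP BY 1;
```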
Personally, I'm excited for the day we can have "virtual" (in quotes because, well, everything is virtual now) data warehouses that sit atop Parquet files and act nearly identically to current warehousing solutions. I welcome our new Duck overlords. 🦆 I think we're far from that point, but until then, a boy can dream. 💭