Reading time: 4 min
I had an earlier post on the techniques of keeping a data warehouse updated all the time – even intra-day. Now let’s take a look at the other end: how to make a DWH respond quickly to user queries, be they direct or issued via some BI tool.
There are many use cases where response time is not critical, e.g. regulatory reports or agent fee reports created monthly, in bulk. But as business gets faster, there are more and more cases where data warehouses are expected to respond very fast – not only (or sometimes not even primarily) precisely.
“Very fast” is a relative concept here as well, similarly to the update side. Bank card payment processing allows the DWH at most 0.1-0.2 seconds to fetch and transfer fraud-detection data to the card system, or else the user experience gets terrible. The data warehouse has to serve frequently exercised business workflows within a couple of seconds at most to feel reasonably smooth. And users get understandably upset if they are shown the hourglass for double-digit seconds when updating an interactive dashboard after a change – no matter how complex that dashboard may be.
It is also important to pinpoint our focus: do we have some specific scenario that we need to target or do we aim at speeding “everything” up? In general, the specific optimizations are easier and cheaper than the more general ones.
There are several tools and techniques to tackle the response-time challenges – but unfortunately, none of them is the silver bullet. Here are some of the most frequently used solutions.
One of the most obvious directions to speed up the system is to add more hardware. Sometimes the resources are indeed far below the requirements, and right-sizing the hardware does solve the problem. If one does indeed have big data, using many servers in parallel can radically shorten response times. This is primarily where cloud and Hadoop-based solutions shine, but they are rarely encountered in traditional data warehouses, so they are not discussed in this article. Generally, however, adding more raw power does not make the issues go away; it may even mask a deeper problem that keeps growing.
At the same time, hardware performance details are important inputs to software optimizations: it is essential to know which element of the system is overloaded (e.g. disk I/O, memory, CPU). Sometimes knowing only this much allows for simple-to-build optimizations with a significant yield. Also, software solutions often intentionally use hardware components in an imbalanced way to gain speed.
Nearly all software-based speed optimizations are built into or upon the data mart layer (i.e. the one built to directly serve external data access). When needed, specific schemes are devised to support speedy data servicing; collectively they are often referred to as the “speed layer”.
The easiest software solution is to “do nothing”, or as a well-known “bon mot” of the software world puts it: “the fastest code is no code”. Of course, we actually do have code – it’s just that we run it at load time instead of at query time: certain values are pre-calculated and stored when the data mart is updated. This is useful because it is nearly always faster to fetch a value than to re-calculate it. It is especially effective when a few values are the results of complex calculations, and it gets progressively less so as the number of query parameters grows. For a while, it may help to store subtotals and only add up the necessary ones at query time. The kings of this approach are the OLAP engines, but on a smaller scale the same idea is often coded directly into relational data marts as well. There are stubborn cases, though, where this technique fails; the most notable one is, perhaps, when distinct items need to be counted.
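To make the subtotal idea concrete, here is a minimal sketch in plain Python; the table, column and function names are made up for illustration, and a real data mart would of course do this in its loading jobs or in an OLAP engine. Subtotals per (region, month) are computed once at load time, and a query only has to add a handful of them.

```python
from collections import defaultdict

# Load time: pre-aggregate raw sales rows into subtotals per (region, month).
# Table and column names are illustrative only.
def build_subtotals(sales_rows):
    subtotals = defaultdict(float)
    for row in sales_rows:
        subtotals[(row["region"], row["month"])] += row["amount"]
    return subtotals

# Query time: answer "total for a region" by adding a few stored subtotals
# instead of re-scanning every raw transaction.
def region_total(subtotals, region):
    return sum(total for (r, _month), total in subtotals.items() if r == region)

sales_rows = [
    {"region": "north", "month": "2023-01", "amount": 120.0},
    {"region": "north", "month": "2023-02", "amount": 80.0},
    {"region": "south", "month": "2023-01", "amount": 200.0},
]
subtotals = build_subtotals(sales_rows)
print(region_total(subtotals, "north"))  # 200.0
```

The distinct-count caveat is visible even in this toy: subtotals of sums can simply be added together, but subtotals of distinct customer counts cannot, because the same customer may appear behind several subtotals.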
A different method is to avoid repeated calculations. (For example, when the scenario is such that if somebody requested a piece of information, it is very likely that others will fetch it as well.) This can be supported via caching. When processing a query, we first check whether its result has already been calculated and return the stored result if so; otherwise we run the calculation and store the query together with its result for the next time it is requested. Thus the first query will be relatively slow, but subsequent ones will be lightning-fast, and architects do not have to decide at design time what to pre-calculate. Although this technique has a few aspects that are inherently hard to implement properly, and it only pays dividends from the second identical query on, it is still often worth considering.
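A minimal sketch of such a query cache, assuming a hypothetical run_query callable that stands in for the actual data warehouse access (all names and parameters here are illustrative, not a specific product’s API):

```python
import hashlib
import json

# A toy in-process query-result cache.
_cache = {}

def cache_key(sql, params):
    # Normalize the query text and its parameters into a stable key.
    payload = json.dumps({"sql": sql, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_query(sql, params, run_query):
    key = cache_key(sql, params)
    if key in _cache:                 # second and later identical queries
        return _cache[key]            # are answered from memory
    result = run_query(sql, params)   # first occurrence: pay the full cost
    _cache[key] = result              # ...and remember the result
    return result

def slow_run_query(sql, params):
    # Stand-in for real data warehouse access.
    return {"rows": [params["customer_id"] * 2]}

print(cached_query("SELECT ...", {"customer_id": 21}, slow_run_query))  # computed
print(cached_query("SELECT ...", {"customer_id": 21}, slow_run_query))  # from cache
```

One of the inherently hard aspects mentioned above is keeping the cache honest: whenever the data mart is reloaded, the stored results must be invalidated (or the keys versioned), otherwise users get fast but stale answers.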
The next step in speed optimization: if we cannot avoid query-time calculation, let’s at least make it faster. Reading from (and writing to) disk is by a wide margin the slowest component of modern data warehouse technologies. Although SSDs have improved this a lot, it is still much slower to access data on disk than in memory, therefore disk access should be minimized. Indexing has long been used to this effect. Lately, however, two other techniques, in-memory data storage and “column store”, have become popular as well.
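As a toy illustration of how an index reduces the data touched per query, here is a sketch in plain Python with made-up names; real engines use far more sophisticated on-disk structures (e.g. B-trees), but the principle of jumping straight to the matching rows is the same.

```python
from collections import defaultdict

# A toy hash index: it maps a customer_id to the positions of its rows,
# so a lookup touches only the matching rows instead of scanning the table.
def build_index(rows, column):
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def lookup(rows, index, value):
    return [rows[pos] for pos in index.get(value, [])]

rows = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 25.0},
    {"customer_id": 1, "amount": 5.0},
]
idx = build_index(rows, "customer_id")
print(lookup(rows, idx, 1))  # only the two rows of customer 1 are touched
```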
The idea of in-memory data storage is trivial, but even five years ago it was often economically infeasible to keep a meaningful amount of data in memory at data warehouse scale. This has fundamentally changed owing to falling memory prices and a surge of specialized software solutions. It has got to the point where all major data warehouse toolsets now have good support for it, and some even use in-memory data as their core model of operation.
Mostly independently, but in parallel, column-based storage has also become a huge hit. Its gist is that data is not physically stored in rows (“records”), as is usual in relational databases, but in columns (as if a standard data table were rotated by 90 degrees), i.e. values of the same attribute are placed physically side by side. This organization is especially favorable in analytical settings, e.g. in data warehouses. One advantage is that it allows for much higher data compression (and thus more of it fits in memory). Also, as nearly all data warehouse operations are column-oriented (e.g. filtering, aggregating), it is a more natural fit, and sequential access is faster than random access. Another benefit is that it makes it much easier to drive modern multi-core processors to their fullest potential, which also has a huge positive speed impact.
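A minimal contrast of the two layouts in plain Python (made-up data; real column stores add compression, vectorized execution and much more) shows why an aggregation over one attribute touches so much less data in the columnar form:

```python
# Row-oriented layout: every query drags whole records along.
row_store = [
    {"product": "A", "region": "north", "amount": 10.0},
    {"product": "B", "region": "south", "amount": 25.0},
    {"product": "A", "region": "south", "amount": 5.0},
]

# The same table "rotated by 90 degrees": one contiguous list per column.
column_store = {
    "product": ["A", "B", "A"],
    "region":  ["north", "south", "south"],
    "amount":  [10.0, 25.0, 5.0],
}

# Aggregating one attribute only reads that single column...
total_columnar = sum(column_store["amount"])

# ...whereas the row layout walks every attribute of every record.
total_rows = sum(r["amount"] for r in row_store)

print(total_columnar, total_rows)  # 40.0 40.0
```

The columnar sum reads one compact, homogeneous array sequentially, which is exactly the access pattern that compresses well and keeps modern multi-core processors busy.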
These two techniques have been just about absent in legacy data warehouse codebases; introducing them would most likely result in significant speed improvements.
As you can see, there are many ways to reduce the response times of data warehouse queries in most cases. There is huge potential in this, because latencies very often strongly influence users’ perception of the data warehouse as a whole. Therefore it is worth ending the implementation project with an optimization phase, or launching a separate optimization project for data warehouses already in place. Although tuning for speed is a complex endeavor and can be resource-intensive as well, the tools mentioned above are capable of wonders in professional hands. Thus this investment usually pays off in hugely improved user experience and substantially reduced response times.
András Zimmer, CTO