Data warehouse development: “Top-down, bottom-up, or outside-in”?

Reading time: 2 min

In the initial (typically the bidding) phase of a data warehouse project, a key question is whether the project is to advance in a top-down or bottom-up fashion.

The top-down method – by using the example of cooking – is, when a precisely known food is systematically decomposed into a recipe with ingredients and quantities are determined, before eventually ingredients are purchased. (Of course, a “tiny” problem is if the store happens to be out of stock of some of them…)

Here we expect the user to tell up front exactly what reports (outputs) the system is supposed to produce, then we can deduce the necessary data mart tables, build the supporting normalized data model and see what source data it needs.

This – in theory – minimizes “redundant” development work by eliminating all but what is actually needed. The problem is, that in most cases it simply doesn’t work in practice. Customers very rarely can tell what they need well enough at the beginning of the project. (One of the rare exceptions is mandatory reports, eg. ones to the supervising authorities as they are usually well defined.) Another problem may be that it turns out that the required source data is not quite available; but in practice, this usually comes to light pretty soon. In the long run, this development model almost certainly runs into the problem that the customer wants more than what was originally agreed on; but the data model may not be able to serve these – ever so slightly different – needs.

The other method, going bottom-up, is similar to an experienced housewife: she has something to cook on her mind, she then goes to the store, buys what she thinks is necessary (and of course what she can get), then she goes home and starts cooking. She will have prepared some lunch by meal time; most likely not exactly the preconceived food, but still delicious. (If there’s no corn on the cob at the grocery’s to cook, a homemade cream corn from deep frozen corn kernels may be a nice variation.)

For data warehouses, we first collect what accessible data may be meaningful for a given use, and then design the normalized layer. Building on this “core” layer the data mart is then created with tables or views to support the reporting needs. This gives more room to manoeuvre when the needs are refined or new ones appear. However it comes at a price, too: there is a chance that we work with data (or – in the worst case – with whole data sources) which will eventually not be needed. It is much more art than science to determine which pieces of data to load into the normalized layer and how to prepare the data mart to serve queries that are not yet known. (In fact, it is “educated guessing”; where “education” mostly comes from experience and from the knowledge of the business domain.)

Both of these mechanisms have their place: top-down when serving well pre-defined outputs, bottom-up when data sources are few and reporting needs are blurry. Many times, however, we have found it to be more effective to use an in-between method; let’s call it “outside-in”.

When applying the “outside-in” methodology, the implementation design team is split into two: one team works with the client on many iterations to design the output they want (e.g. the dashboards). This goes to the point where the dashboards are actually created; their source being database views on temporary tables. The data in these tables is set up mostly from Excel or other pre-prepared sources with individual loaders or straight manual data entry.

In parallel, the other half of the team maps out what source data is available and what kind of normalized structure would be useful to store it – while keeping in mind that they will eventually have to produce the view (“interface”) determined by the dashboard team. (It is generally true here as well that more data elements have to be fetched from the source than what satisfies the current requirements, so the “visionaries” experienced in building data warehouses cannot be completely eliminated here either.) When both teams are ready and tested their respective parts, dashboards’ data sources are switched over to the data mart tables filled with the standard automated procedures, and the system is ready for a full live start.

The greatest benefit of this mixed method is that it has much shorter design feedback cycles, while at the same time – because of the parallelism – both branches have longer lead times. The users can work on and with the reports they are really interested in right from the beginning of the project. Also it is quite possible that they can actually use the reports to some extent well before the data warehouse is complete. (Of course, during this period, the data is still manually generated and loaded into the temporary tables.) An extra advantage is that by the time data mart tables are created usage patterns will have been known to some extent, which also simplifies performance optimization.

Leave a Reply Cancel reply