I have been a practitioner of the “Kimball Methodology” for more than 20 years, so naturally, when I received this provocative email, I had to go and read the “ebook”. Those who know me know I am opinionated, and when something fires me up, I like to offer my opinion, so here it is.
The title is provocative and for IT managers trying to clear their backlogs and get into the modern analytics game, the promise is tempting. This book offers the hope that there is an application that makes accessing and integrating data for analytics easy. For those who dream that there is a tool or application or suite that will solve these problems with little effort, this book was made for you.
One of the first arguments offered to support the claim that the star schema is dead is based on a faulty assumption about the approach.
Sort of correct, but the conclusion is not. A simple and direct approach – yes. Based on the questions that needed to be answered – sort of. Incorta seems to believe that the Kimball method is to gather reporting requirements and create a logical and physical data model based on the reports users want. The assumption seems to be that the purpose of the method is simply to prepare, pre-process, and organize data so that the reports run fast and are easy to build. This is not exactly the case, though it is a common misunderstanding.
An equally important priority of the Kimball approach is to create a platform for integrating data from various applications, and this matters when it comes to designing data models. The point of understanding the business reporting requirements is to map those requirements back to the source – the processes that generate the data. Once we have identified those processes, we use a dimensional approach to create and deliver a data model in the data warehouse based on the processes themselves (not the reports).
This is a simple and direct approach, and it does, in fact, prepare and pre-process the data so that reports run fast and are easy to build. It does not, however, presuppose anything about the “questions to answer and problems to solve”. The point of the approach is to capture all of the useful data and store it in a physical model at the level at which the data is generated. This is what is meant by an “atomic level data mart” in the Kimball approach. If you have an atomic-level data mart for any business process (sales, for example), there isn’t a report you can’t make, because all of the data is captured at the level it is created and stored in the dimensional model where it is relevant.
Meanwhile, the star schema simplifies the data structure for the report author while providing a physical data model that is optimized for performance. Any combination or aggregation of this atomic data can be computed by the database and reporting platform at run time. And here, Incorta is correct: technology has improved by leaps and bounds, and all of this happens incredibly fast on modern database platforms.
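To make this concrete, here is a minimal sketch of my own (illustrative table and column names, not anything prescribed by Kimball or claimed by Incorta) of an atomic-grain sales star schema, using SQLite. The fact table holds one row per order line, and a report that was never anticipated by the model is simply a run-time query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimensions describe the who/what of the process; the fact table
# captures the sales process at its atomic grain: one row per order line.
cur.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer,
    product_key  INTEGER REFERENCES dim_product,
    order_date   TEXT,
    quantity     INTEGER,
    amount       REAL
);
""")

cur.executemany("INSERT INTO dim_customer VALUES (?,?,?)",
                [(1, "Acme", "West"), (2, "Globex", "East")])
cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(10, "Widget", "Hardware"), (11, "Gadget", "Hardware")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)", [
    (1, 10, "2024-01-05", 2, 20.0),
    (1, 11, "2024-01-06", 1, 15.0),
    (2, 10, "2024-02-01", 5, 50.0),
])

# Nothing about a "revenue by region" report was pre-supposed in the model;
# it is just an aggregation of atomic facts, computed at run time.
rows = cur.execute("""
    SELECT c.region, SUM(f.amount) AS revenue
    FROM fact_sales f JOIN dim_customer c USING (customer_key)
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('East', 50.0), ('West', 35.0)]
```

The same atomic rows could just as easily be rolled up by product category, by month, or by any future question, which is the whole point of modeling the process rather than the report.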
The approach is also fast and cost-effective because you don’t need to bring in all the data at once. You don’t even have to know about all the potential data up front. You can prioritize the data you need, want, or know about now, because future requirements will have a logical home in the star schema, as long as they relate to the business process captured by the model.
What do I mean? You don’t need to gather all of your customer data and deliver it to the customer dimension at the outset, because you can always add it to the customer dimension if and when you need it. The same principle lets you add or change attributes on the product or sales-organization dimensions to accommodate changes to those organizations in the future.
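Extending a dimension later really is this mundane in practice. Here is a small sketch (again with illustrative names) of a new customer attribute arriving after the model is already in production:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO dim_customer VALUES (1, 'Acme')")

# A new requirement arrives: segment customers by loyalty tier.
# The attribute has a logical home -- the customer dimension -- so we
# add the column; existing facts and existing reports are untouched.
con.execute("ALTER TABLE dim_customer ADD COLUMN loyalty_tier TEXT DEFAULT 'unknown'")
con.execute("UPDATE dim_customer SET loyalty_tier = 'gold' WHERE customer_key = 1")

row = con.execute("SELECT name, loyalty_tier FROM dim_customer").fetchone()
print(row)  # ('Acme', 'gold')
```

Because every fact row already carries the customer key, the new attribute is immediately usable in any report that joins to the dimension.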
This approach is also the most effective way to integrate data from multiple applications. If your orders are in Oracle EBS, your CRM is Salesforce.com, your email marketing is in Marketo, and your calls are in Cisco, you very likely need some logic to integrate the data in order to see your conversion analytics. That work is best done in a data warehouse.
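The core of that integration logic is conforming identifiers: every source system has its own customer ID, and the warehouse holds the cross-reference that ties them together. A toy sketch, with entirely hypothetical field names and IDs:

```python
# Hypothetical extracts: each system identifies the same customer differently.
ebs_orders = [{"ebs_customer_id": "C-100", "amount": 250.0}]
sfdc_leads = [{"sfdc_account_id": "001xx", "campaign": "Spring"}]

# The warehouse's cross-reference conforms the source IDs to one key.
customer_xref = {
    ("ebs", "C-100"): 1,   # -> warehouse customer_key 1
    ("sfdc", "001xx"): 1,
}

def conform(source, rows, id_field):
    """Attach the conformed warehouse customer_key to each source row."""
    return [dict(r, customer_key=customer_xref[(source, r[id_field])]) for r in rows]

orders = conform("ebs", ebs_orders, "ebs_customer_id")
leads = conform("sfdc", sfdc_leads, "sfdc_account_id")

# Lead-to-order conversion is now a simple set operation on one key.
converted = {o["customer_key"] for o in orders} & {l["customer_key"] for l in leads}
print(converted)  # {1}
```

In a real warehouse this cross-reference lives in (or beside) the conformed customer dimension, but the principle is the same: conform the keys once, and every downstream question gets easier.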
And with all this data flowing through the organization, there is always a need to drill down to the lowest level of detail to verify and validate. Atomic-level data marts based on these core business processes are the perfect foundation for meeting that need. Maybe that’s why people demonstrating Incorta on YouTube use governed data warehouses in their product demos. It’s as if they are saying, “Hey look, you don’t need to build a data warehouse,” while using a data warehouse in the demo to make the product look easy to use and powerful. Perhaps it would be better to say that you don’t need to build a data warehouse if you already have one.
But there is so much more to a data warehouse. The approach is flexible, portable, and useful for analytics, and it is valuable far beyond reporting. It can be deployed on-premises or in the cloud, in any database. You can use any applications you want with it: managed reporting in Cognos, Tableau, or Power BI, and R, SAS, or SPSS for advanced models. You are free to choose and assemble tools based on your technical skills, project needs, cost, or whatever makes sense to you.
You can also use the warehouse as a data hub for all of your applications. You can automate the process of extracting prospect data from Salesforce, delivering it to the warehouse, and then using that data to populate Oracle EBS, Marketo, or some application you haven’t even created yet. None of this is preordained, and there is no limit to the number of problems this can solve.
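The hub pattern is structurally simple: the warehouse rows are the source of truth, and each consuming application gets its own thin shaping step. A sketch under stated assumptions (the row shape and payload fields here are invented for illustration, not real Marketo or EBS APIs):

```python
# Conformed prospect rows as they might sit in the warehouse.
warehouse_prospects = [
    {"customer_key": 1, "email": "buyer@example.com", "stage": "qualified"},
]

def to_marketo_payload(row):
    # Each downstream application gets its own small formatting step;
    # the warehouse row itself never changes shape.
    return {"email": row["email"], "leadStatus": row["stage"]}

def to_erp_payload(row):
    return {"party_id": row["customer_key"], "contact_email": row["email"]}

marketo_batch = [to_marketo_payload(r) for r in warehouse_prospects]
erp_batch = [to_erp_payload(r) for r in warehouse_prospects]
print(marketo_batch[0]["leadStatus"])  # qualified
```

Adding a new consumer means adding one more small shaping function, not re-plumbing the sources; that is what makes the hub open-ended.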
But this isn’t a tool or an application sold by a marketing department. It’s a discipline, a method, and a strategy, based on a vision that sees the data landscape of your organization and seeks to optimize it: data flowing holistically through applications and processes, rather than tools patching problems at stops along the way.
A vision like this will help you select applications that fit and avoid integration problems before they happen, rather than merely solving the specific challenge of the day. You have a strategy to make sense of the data in your organization and are guided by it, rather than reacting to problems as they arise. Perhaps it’s counterintuitive, but I actually love the dis-integration of Extract, Transform, and Load, rather than the integration that Incorta is selling. Using separate tools to create data pipelines makes managing a data lake significantly easier, and modern database platforms make a data lake strategy cost-effective. A well-managed data lake opens up possibilities for data warehouse operations, not only for analytics but for anything in the enterprise that needs data.
In this environment you can use any reporting application, exploration tool, visualization engine, ETL tool, advanced analytics platform, reverse-ETL tool, master data management system… the list goes on. And the flexibility to switch is greater, and the cost of switching much lower, when you have small parts of your data operations spread across several simple-to-use applications rather than the whole kit and caboodle in one application.
Why is that? Because with this approach, you are integrating the data, not the tools.
In writing this blog, I watched several demos and read several other articles and white papers, and I have been in this industry for 25 years. I can say with confidence that there is no end-to-end tool that will make this work easy without doing the right things with the data. Technology is changing, and some of the changes are genuinely significant, but none of it will replace the need for a sound strategy, a solid approach, and skilled people capable of and committed to doing good work. In my experience, organizations that rely too heavily on applications to solve problems end up with more complexity, less flexibility, more applications, and higher costs than organizations that rely on talented, hard-working people with a good strategy.
Let me know what you think.