Data modeling tools are just dumb and they don’t fit into data engineering workflows anymore.  And they’re expensive.

Well, at least when it comes to analytics using dbt and Snowflake.

Imagine you’re doing the work of a data engineer. You’ve developed a model in dbt, but before you can deploy it, you have to change it so that it won’t drop and recreate the table or view you are delivering. You also have to use a data modeling tool to define the table, columns, data types and constraints, then deploy the DDL. And you have to orchestrate deploying that DDL alongside the data transformation ETL/ELT. Why? Because you used Informatica for years and that was how it worked? Dumb.

Suffer no more. 

Here are a few things you should try NOT to do with dbt, in no particular order:

  1. Try not to use incremental models. If you do, you might save a little time in your dbt builds, but you will spend more time developing. Every change you make in the future will require you to figure out how to make it work for the initial load and then again for the ongoing incremental load. That’s at least 2x the work, and ½ of it is useless after implementation. If you use the full refresh option it’s at most ½ the development effort, and all of it is value add (there’s a sketch of the difference after this list).
  2. Don’t use a hashing algorithm for your surrogate keys:
    • They look ugly when you look at the data.
    • They take up way more space in exactly the tables where it hurts you the most – the fact tables.
    • They reduce query and ETL performance.
    • They are not universally supported in downstream applications.
    • Integer keys are much better, and there’s an easy way to create them (see the integer-key sketch after this list). Soon I will show you a way to make them temporarily permanent, for all you haters.
  3. Do not use a data modeling tool like Erwin to define and deploy your data models. It just is not necessary and adds no value. dbt will create any view or table you need, will define primary keys and foreign keys, and will test, warn and fail as needed (see the test sketch after this list). It will also implement much more sophisticated tests, and it will log all of this so you can figure out what happened. If you use a separate tool to create the DDL, it essentially means you are not following rule #1 or this rule, so in addition to adding a bunch of development effort you are creating a workflow that is slow and does not add value. You are probably breaking rule #2 as well.
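
To make rule #1 concrete, here’s a minimal sketch of what the incremental version of a model costs you, assuming a hypothetical stg_orders staging model and column names. Everything inside the is_incremental() block exists only to support the ongoing load and has to be revisited on every change:

```sql
-- models/fct_orders.sql (hypothetical model and column names)
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- extra code path: only runs after the initial load,
  -- and has to be re-thought every time the model changes
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

The full refresh version is the select statement alone with {{ config(materialized='table') }}; that’s the half of the work that actually adds value.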
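
And here’s the easy way to create integer keys mentioned in rule #2: a minimal sketch of a full-refresh dimension, assuming a hypothetical stg_customers staging model with a customer_number natural key.

```sql
-- models/dim_customer.sql (hypothetical model and column names)
{{ config(materialized='table') }}

select
    -- compact integer surrogate key instead of an md5/sha hash
    row_number() over (order by customer_number) as customer_key,
    customer_number,
    customer_name,
    customer_segment
from {{ ref('stg_customers') }}
```

Because the model is fully refreshed, these keys are regenerated on every run; that objection is exactly what the “temporarily permanent” follow-up is for.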
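
And for rule #3, a rough sketch of how dbt covers what the modeling tool was doing for you: primary key and foreign key behavior, plus warn-versus-fail, expressed as tests in a schema.yml (model and column names are hypothetical).

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: dim_customer
    columns:
      - name: customer_key
        tests:
          - unique        # primary key behavior
          - not_null
  - name: fct_orders
    columns:
      - name: customer_key
        tests:
          - not_null
          - relationships:          # foreign key behavior
              to: ref('dim_customer')
              field: customer_key
              config:
                severity: warn      # warn instead of fail, if that’s what you want
```

dbt build then creates the objects, runs the tests and logs the results, which is the whole workflow the DDL hand-off was standing in for.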

So why do people do this? The Informatica explanation is real, and here is a more thorough one.

Let me start with a parable about why the solid rocket boosters on the space shuttle were sized by a horse’s ass 2,000 years ago. It’s true: the booster segments had to travel by rail, rail gauge descends from Roman chariot ruts, and chariots were built to fit two horses side by side. Constraints outlive the reasons for them.

Tools like Erwin wrote DDL for DBAs. That’s it; that’s all they did. They saved you the trouble of writing potentially complex DDL statements and made them portable – you could deploy the objects to different database platforms without altering the DDL. Now, with dbt and Snowflake, we don’t really need dedicated DBAs, and much of the DDL is irrelevant. This in itself is a big benefit. Analytics teams rarely get the attention and support they need from DBAs, and DBAs are traditionally application focused and frequently do not understand the different workloads associated with data engineering and analytics. In the pre-SaaS days they almost always slowed analytics development down. It usually came off like they were ruling their fiefdoms – they would not allow analytics teams to manage databases or schemas, or to create, truncate and drop tables and indexes. What a pain. If I could get back the time spent convincing DBAs to trust that our team could manage the database automatically with scripting and ETL, it would add up to days.

So you do not need a modeling tool to define and deploy database objects. I want to be clear, however: you do need a data modeling approach (a dimensional warehouse bus, for example). You do need a data model and you do need to design it, but it does not need to be so detailed that it includes every name, every datatype and every relationship, and it does not need to live in Erwin or ER/Studio or anything like that. All of that detail is a waste of time anyway, because these details are best discovered while you are working with the data. If you have to hand off a data model that detailed to engineering, you could have already done the engineering with just a little more effort. Yeah, dbt is also a great interface for exploring and investigating the data during the engineering process.

So why do people still do this? 

They have people on their teams whose roles and skills are far too narrow. They hire data engineers who have no idea how or why their work product will be used. They don’t know how analytics applications work, how the applications will be deployed, who will use them or what the business purpose is. In its worst form, an analyst works with a business owner and documents requirements, then communicates them to an architect who designs the solution; an SME defines the mappings, and the mappings and design are handed to an engineer who codes the ETL. Transcription and communication errors abound, and the development and testing cycles churn and churn. Most errors are code fixes, but some go back to the mappings to fix, and others go back to requirements and design for clarification. Wonder why these projects spin out of control? Hopefully this helps explain.

Why do we do things like this? Industrial-revolution thinking, of course. We have been breaking down big jobs into little tasks since Henry Ford figured out how to make Model Ts cheap. I think the basic idea is to make tasks and jobs so simple that anybody can do them reliably and consistently. Management is stuck in this paradigm.

This approach does not work for software development, because defining the tasks is the work, not doing them. The data engineer isn’t the person on the shop floor bolting a door onto an F-150 – they are the person designing the assembly line, breaking down each of the tasks and defining how they’ll be done. The computer is the assembly-line worker, and the software developer is the manufacturing engineer. If designing the tasks were the kind of work a computer could do, we would use a computer to do it. Maybe someday, but that day is not today.

EmL

elealos@quantifiedmechanix.com