It’s 7 AM and you check your email over a cup of coffee in the kitchen. There’s a message from a director in Finance: her reports aren’t reflecting yesterday’s activity, and something must be wrong. It’s day 3 of the close, so this is important and urgent. You scan your other emails; there are no alerts from the warehouse. You go to your laptop, connect to the ETL server, and see that the overnight batch jobs have not executed. You look through the logs and find an error you haven’t seen before.
DataManager has been out of regular support for five years, but many companies still have implementations running in production. The risk of a critical issue grows day by day, month by month and year by year. At some point an OS or database update, or some other surprise, will cause a failure that is difficult, maybe impossible, to recover from or mitigate. Hunting down backups and supporting software installers in order to reconstruct an environment that can limp along is no picnic. And while your users may not notice the benefit of upgrading to a supported platform, they will surely notice the impact of a data warehouse failure and mission-critical analytics that are out of commission.
ETL is sticky. Most companies are not looking for reasons to rip out a bunch of ETL and replace it with something new. Between the day-to-day demands of supporting the data warehouse and the new demands of the business, there is not a lot of time or budget for work that, in the best outcome, would be unnoticeable to users. But, like car maintenance, this sort of work is best done at the time and in the manner of your choosing, rather than after something fails and leaves you on the side of the road.
When IBM announced end of life and end of support for DataManager, existing licenses and entitlements were migrated to DataStage, giving DataManager customers a no-cost option to make a significant upgrade to a powerful, best-in-class ETL platform. But because of the gaps in capability between the two tools, there are some important things to consider.
The DataManager software installation was very simple; you could be up and running in minutes. The DataStage installation and environment are potentially far more complex, and it will take some time to get the environment set up correctly.
Managing different environments (i.e., dev/test/prod) and the tools to manage them.
Again, DataManager was very simple and essentially left it to the administrator to define the process and procedure for migrating code between dev, test and prod. Rolling back code was essentially done with backups of the catalog. DataStage has tools to create and move ETL packages between project environments, and parameters are far more effective at managing the differences between dev/test/prod (e.g., database connection info, credentials). It isn’t difficult, just different.
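As a rough illustration of the parameter idea (this is generic Python, not DataStage syntax, and the environment names, hosts and the ETL_ENV variable are all hypothetical), the point is that jobs reference a parameter name while each environment supplies its own values:

```python
import os

# One parameter set per environment; the job itself never hard-codes
# a host or database name, it only references parameter names.
PARAM_SETS = {
    "dev":  {"db_host": "dev-db.internal",  "db_name": "dw_dev"},
    "test": {"db_host": "test-db.internal", "db_name": "dw_test"},
    "prod": {"db_host": "prod-db.internal", "db_name": "dw"},
}

def resolve_params(env=None):
    """Pick the parameter set for the current environment at run time."""
    env = env or os.environ.get("ETL_ENV", "dev")
    return PARAM_SETS[env]

# The same "job" code runs unchanged in every environment.
params = resolve_params("test")
conn_str = f"host={params['db_host']} dbname={params['db_name']}"
```

Promoting a job from dev to prod then changes nothing in the job itself, only which parameter set is bound to it.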
This isn’t an upgrade from a Toyota to a Lexus; it’s more like a transition from a car to a factory that could build a car, or a boat, or a jet, or things you haven’t considered before. That flexibility means you need to think carefully about the approach you take.
Option 1 is a simple migration with essentially a one-to-one mapping of DataManager builds to DataStage jobs. This approach has the lowest migration effort, but also delivers the least benefit from DataStage. It is a good option when there are no glaring issues with the ETL and no need for, or value in, changing ETL performance.
Option 2 is a complete redo of the ETL, the way you would have done it if you had started with DataStage. This approach has the highest migration cost and effort, but delivers the greatest benefit from DataStage. It is the best option if analytics has become a strategic priority for the organization and making this investment delivers business value beyond ETL performance.
Option 3 is something in between: finding the sweet spot that maximizes value by balancing migration cost and effort against the benefit from DataStage. On some level, some version of this option will always be the rational choice, and I think the following discussion will help you find that balance.
The first option is a simple migration with essentially a one-to-one mapping of DataManager builds to DataStage jobs.
This is probably the fastest approach and will allow you to keep the existing databases and tables used in the ETL process. The individual builds in DataManager would map to individual jobs in DataStage and they would be gathered into Sequence jobs that essentially mirror the DataManager JobStreams. This is also the lowest risk option because validating results should be very easy – the data structure and content from both processes should be identical.
This will get you to a supported platform and will most likely provide incremental performance benefits from the parallel engine in DataStage.
But there are potential issues:
DataManager is much less stringent about data types and is much less likely to fail or error when there are minor data type mismatches between the source and target databases and the data type definitions in the ETL jobs, or when there are nulls in the data. This will add to the effort of migrating the jobs to DataStage, but these issues should be resolved easily as jobs are migrated and bench tested by developers.
DataManager is in many ways a fit-for-purpose ETL tool built specifically for delivering star schemas. As a result, some things just can’t be migrated without thinking them through. DataManager has constructs like dimension builds that simply do not translate well into DataStage. Managing surrogate keys, defining type 1 and type 2 attributes, and handling late-arriving facts, for example, will all need to be thought through. It is very useful to develop a template for how you want dimension and fact jobs to generally look and work, so that you are not reinventing the wheel each time you move a job.
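To make the dimension-template idea concrete, here is a minimal in-memory sketch of the type 1 vs. type 2 logic a dimension job has to encode. The column names are hypothetical (cust_id as the natural key, email as a type 1 attribute, region as a type 2 attribute), and a real job would of course work against database tables rather than Python lists:

```python
from datetime import date

def apply_dimension_change(dim_rows, next_key, incoming, today):
    """Apply one incoming source record to a slowly changing dimension.

    Type 1 attributes (email) are overwritten in place; a change to a
    type 2 attribute (region) expires the current row and inserts a new
    version with a fresh surrogate key and effective dates.
    Returns the next unused surrogate key.
    """
    current = next(
        (r for r in dim_rows
         if r["cust_id"] == incoming["cust_id"] and r["is_current"]),
        None,
    )
    new_row = {
        "sk": next_key, "cust_id": incoming["cust_id"],
        "email": incoming["email"], "region": incoming["region"],
        "valid_from": today, "valid_to": None, "is_current": True,
    }
    if current is None:
        # New natural key: insert the first version.
        dim_rows.append(new_row)
        return next_key + 1
    if current["region"] != incoming["region"]:
        # Type 2 change: expire the old row, insert a new version.
        current["is_current"] = False
        current["valid_to"] = today
        dim_rows.append(new_row)
        return next_key + 1
    # Only type 1 changes: overwrite in place, keep the surrogate key.
    current["email"] = incoming["email"]
    return next_key

dim = []
nk = apply_dimension_change(
    dim, 1, {"cust_id": "C1", "email": "a@x.com", "region": "East"},
    date(2020, 1, 1))
nk = apply_dimension_change(
    dim, nk, {"cust_id": "C1", "email": "b@x.com", "region": "West"},
    date(2020, 6, 1))
# dim now holds two versions of C1: the expired East row and the current West row.
```

Once a template like this is agreed, every dimension job follows the same shape and only the key and attribute lists change.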
Meanwhile, DataStage has the parallel engine, one of its most important benefits. This alone can improve ETL performance dramatically, but it has to be incorporated into the new jobs as builds are migrated from DataManager. This makes lookups, merges and aggregations a little more complicated to migrate, but the additional cost is more than paid back through spectacular ETL throughput.
DataManager generally has a rigid build structure. The most flexible build type, the “fact build”, consists of a data source, a datastream, a transformation layer and a delivery layer, and with very limited exceptions (partitioned fact deliveries) all of the data passes through each of these layers together. DataStage, by contrast, is essentially unlimited, allowing developers to use as many steps and as many separate dataflows as needed. Because of this, there are very likely more builds and staging tables in the DataManager environment than will be needed in the new DataStage environment. This is one of the most important differences between the two tools. DataManager requires more “scratch space” for staging and spends more time writing data to, and reading it back from, the database. Beyond the extra run time, this creates more complexity in the database, more builds in DataManager and more complexity managing job dependencies.
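The structural difference can be sketched in a few lines of Python (purely conceptual, not DataManager or DataStage code, with made-up stage table names): the build-per-hop pattern persists a stage table between every step, while a single job chains the same transforms in one pass:

```python
staging_area = {}  # stands in for database staging tables

def build_per_hop(rows):
    """DataManager-style: each build writes a stage table the next build re-reads."""
    # Hop 1: cleanse, then persist to a stage table.
    staging_area["stg_clean"] = [r.strip().lower() for r in rows]
    # Hop 2: a separate build re-reads the stage table to dedupe.
    staging_area["stg_dedup"] = sorted(set(staging_area["stg_clean"]))
    # Hop 3: another build re-reads again to deliver.
    return list(staging_area["stg_dedup"])

def single_pass(rows):
    """DataStage-style: the same three steps chained in one dataflow, in memory."""
    return sorted({r.strip().lower() for r in rows})

source = [" East ", "west", "EAST"]
result = single_pass(source)  # same output, no intermediate stage tables
```

Same result, but the second version skips two database writes and two reads per run, which is where much of the performance headroom comes from.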
This leads to the second option: a complete redo of the ETL, the way you would have done it if you had started with DataStage.
For an Analytics Architect with a tendency toward obsessive-compulsive disorder, this option holds special appeal. Over time, requirements change, people change and the business changes, and as a result the ETL and data warehouse environments inevitably become much less efficient and orderly. In a nutshell, this is the option where you get to say, “If I had known then everything I know now, and I was starting out with DataStage, how would I do everything?”
This holds the promise of ripping out old code that is no longer needed, or that you can simply do better for any number of reasons. Perhaps you understand the source data more thoroughly, maybe some source applications have changed considerably, or possibly there are new needs because of acquisitions or other significant company changes. Maybe all of these things have happened and you don’t want anyone coming in and seeing what a mess things are!
This is the best option for capturing all of the efficiencies from all of the learning and evolving, all of the cumulative changes, and the full capabilities of DataStage. Specifically, when it comes to migrating from DataManager to DataStage, this approach means fewer discrete jobs, fewer dependencies between them, fewer database reads and writes, less complexity, and potentially significantly improved ETL performance in terms of both speed and reliability. But it comes at a cost. Over time there may be hundreds, even thousands, of discrete ETL tasks, and in some environments it may be difficult to understand how they all fit together. Analyzing all of this and refactoring the ETL is potentially too much to take on in one project.
Which leads to the third option: Something in between.
This is the option where you choose the sweet spot to maximize value. By analyzing the dependencies between DataManager builds and stage tables and collapsing them into the smallest number of DataStage jobs, you avoid the interim reads and writes to the database staging area. You focus on particularly poor performers, especially where the cost lies in database reads or writes, and commit to doing more refactoring each time you perform ETL maintenance in the future. Over time this works, and with all of the other changes that are always happening, the DataManager heritage will become hard to detect sooner than you think.
Doing this will require more analysis during the migration in order to understand the interdependencies, but it will pay back through faster ETL execution, fewer dependencies, fewer jobs and less complexity. Use the DataManager JobStreams as a guide to the interdependencies between builds. This approach can be implemented in degrees; it is not all or nothing. For example, if there is a linear, one-to-one dependence between two DataManager builds simply because the transformation was too complex for a single build, those two builds can easily become one DataStage job, reducing complexity.
Meanwhile, if two or more jobs depend on a single stage, it may be an opportunity to combine all of them into one job, but as the dependencies get more complex, they take more time to analyze and test. If you had started out with DataStage, the jobs and workflows would no doubt be different; how close you want to get to that comes down to how much effort you want to put into the migration. Weigh the costs and benefits, make a decision and move on. Expertise with both DataManager and DataStage can make the decision much easier.
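The chain-collapsing analysis described above can be sketched as a small graph routine. The build names and dependency edges here are hypothetical, standing in for what you would recover from the JobStream definitions:

```python
from collections import defaultdict

def linear_chains(edges):
    """Find runs of builds where each feeds exactly one successor.

    These linear chains are the easy candidates for collapsing several
    DataManager builds into a single DataStage job; builds with multiple
    feeders or multiple consumers need closer analysis.
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for a, b in edges:
        succs[a].append(b)
        preds[b].append(a)
    nodes = {n for e in edges for n in e}
    # A chain starts at any node that is not the sole continuation
    # of exactly one other node's sole output.
    starts = [n for n in nodes
              if not (len(preds[n]) == 1 and len(succs[preds[n][0]]) == 1)]
    chains = []
    for n in starts:
        chain = [n]
        while len(succs[chain[-1]]) == 1:
            nxt = succs[chain[-1]][0]
            if len(preds[nxt]) != 1:
                break  # nxt has multiple feeders; it starts its own unit
            chain.append(nxt)
        if len(chain) > 1:
            chains.append(chain)
    return chains

# Hypothetical dependencies: deliver_fact has two feeders, so only the
# extract -> cleanse pair collapses cleanly into one job.
edges = [("extract_orders", "cleanse_orders"),
         ("cleanse_orders", "deliver_fact"),
         ("extract_custs", "deliver_fact")]
```

Even this crude pass gives you a shortlist of merge candidates before anyone opens a single build, which keeps the analysis effort proportional to the benefit.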
At the end of the day, the perfectionist in me would like to choose option two, but I recognize that it really is not a practical choice in terms of cost, and even more so in terms of time. And at this point it really is becoming imperative to decommission DataManager as soon as possible; the risk of failure is getting too great. However, if you can tap into an experienced partner to help you through the process, you can capture more of the benefits without many of the costs, and complete this migration faster and at a lower cost than doing the work yourself.