Wednesday, November 4, 2009

ETL - 10 tips to a healthy data lifecycle

In my job, I do a lot of ETL.  A lot.  At my company, I handle more than 50 scheduled file receipts, and likely 50 additional one-off receipts a year.  For the uninitiated, ETL stands for extract, transform, and load.  It is the generic term used to describe the process of getting data from one IT system (typically a mainframe or database) and putting it into another (typically a database or data warehouse).  ETL usually involves things like data cleansing, conforming the data to a new model, key management, data type conversions, aggregation, and error logging.

There are a lot of different camps on ETL best practices.  In my experience, here are 10 of the most important tips for doing ETL right:

1.) Expect the unexpected

What you knew to be true yesterday is not necessarily true today.  Consider data from an outside source to always be suspect.  ETL adheres to the old adage - be liberal in what you accept, and conservative in what you emit.  You will get junky data, and you have to know when you can recover gracefully and when you have to fail hard.

2.)  Data should get cleaned upstream whenever possible

ETL happens downstream, and you know what flows downstream.  The source systems housing the data you're being fed may be under the control of your organization, or they may not.  But if at all possible, foster a good relationship with the maintainers of the source systems sending you data.  Garbage in, garbage out.  If they send you junk, your job will forever be cleaning up data and coding band-aids to mask problems at the source.  But, if you can get problems fixed at the source, you and your users will have more confidence in the data and fewer problems.  Also, it's always a good idea to go ahead and store the data as it was received, even if you store a scrubbed version too.  That way, if there's a flaw in your cleansing routines, you can go back and re-run against the raw data.  It's sometimes advantageous to show your users the data as it was received too.  Often they can apply some additional pressure when you are not able to make headway with the source.
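As a minimal sketch of what I mean (table and column names here are made up for illustration), keep the data exactly as it was received in one place, and build the scrubbed version separately so you can always re-run your cleansing or show users what actually arrived:

```sql
-- Hypothetical example: store the feed exactly as received, scrub in a separate step.
CREATE TABLE dbo.VendorFeed_Raw (
    ReceivedFileName varchar(260)  NOT NULL,
    ReceivedDate     datetime      NOT NULL DEFAULT GETDATE(),
    RawLine          varchar(4000) NOT NULL      -- the record exactly as it arrived
);
GO

-- A scrubbed view (or a table rebuilt from the raw one) for downstream use.
-- If a cleansing rule turns out to be wrong, just rebuild from the raw table.
CREATE VIEW dbo.VendorFeed_Scrubbed AS
SELECT ReceivedFileName,
       ReceivedDate,
       LTRIM(RTRIM(RawLine)) AS CleanLine
FROM   dbo.VendorFeed_Raw;
```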

3.)  Get file extracts into a database table or tables first before doing ANY data conversions

The worst thing you can do is try to do too much in your initial process for loading the file.  It's tempting to start converting data types and begin scrubbing and conforming data, but you have to avoid that temptation.  There are plenty of things that can go wrong just picking up the file, decrypting or decompressing it, parsing it, and reading it in the right character encoding.  The last thing you want is your ETL failing at 2 AM because some random field that's supposed to be an integer contains an alpha character, blowing up the load in some difficult-to-find place in a very large file.  I suggest making a simple set of staging tables that match the file format with only (n)varchar fields.  Don't convert dates, or packed fields with implied decimals, or anything else.  Those files need to get into your database in the raw ASAP.  Yes - it may be an additional step, but it's so much easier to analyze and debug problem data fields if you get your files into a staging table first.  You can convert and validate the data in a later step, and at that point you'll have better options for recovering from the unexpected.
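Something like the following is what I have in mind (the schema, field names, and file path are invented for the example): every column in the staging table is varchar, wide enough to hold whatever shows up, and the file gets bulk-loaded before any conversion is attempted.

```sql
-- Hypothetical raw staging table: every field is varchar, no conversions yet.
CREATE TABLE stg.ClaimsFile (
    ClaimNumber  varchar(50) NULL,
    ServiceDate  varchar(50) NULL,   -- still text; convert to a date later
    PaidAmount   varchar(50) NULL,   -- still text; implied decimals handled later
    MemberId     varchar(50) NULL
);
GO

-- Get the file into the database as-is; validate and convert in a later step.
BULK INSERT stg.ClaimsFile
FROM 'D:\feeds\claims_20091104.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', FIRSTROW = 2);
```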

4.)  Choose your toolkit wisely

I've looked at Informatica, DataStage, and Pentaho.  I've used DTS and SSIS extensively.  Personally, I still prefer to do my ETL in .NET and T-SQL.  Call me old fashioned.  I'll use an ETL tool when appropriate, but I have an extensive library of good, well-tested, performant code that works well for my dev team.  There's nothing wrong with finding a good ETL tool.  But you have to ask yourself:
  • How do we do version control?
  • How do we deploy/rollback?
  • How do we restart the process after fixing a problem?  Must we start over at the very beginning?
  • How do we document the ETL?
  • Is this tool really worth what they're charging me?
  • How do I call out from my tool to other tools if my primary ETL tool is inadequate or unable to perform a particular function?
  • Can I reuse custom-built transformation functionality, or do I have to recreate my transforms to use them in other ETL projects?
  • Even if the tool is drag-and-drop, is it still easy to see what is really going on, or are there lots of hidden black-box operations?
  • To borrow from Larry Wall, are the easy things easy and the hard things possible?
  • How do we find developers with experience with the tools we're using, or train newbies?
  • How can we get the best performance in the places where performance is key?
  • How will we handle all the different data or file formats we expect to receive?  EDI?  XML?  Mainframe files with copybooks containing redefines, occurs, and packed fields?
  • How long will it take to implement a file from a new vendor or source system?
  • Do I need massive parallelism or distributed ETL when processing, or can I make do with a single threaded ETL?
Your choice of toolkit has a direct impact on all these factors, and lots more.  Choose wisely.

5.)  Fail Gracefully

Consider every point at which your ETL can fail, and decide what you want to do about it.  Do you log an error and move on to the next record?  Who monitors the log and what log entries are actionable?  Do you wake up an operator in the night to handle the error immediately?  Do you fail and wait until the morning?  What's the process for starting back up after the problem is fixed?  Do you have to start from the beginning, or can you pick up where the failure occurred?  Is there a risk of duplicating your data or otherwise creating a bigger mess?  Is it possible that your load could silently fail without you finding out about it until there's a big fiasco?  Does your process simply have to complete every time it runs?  Can you miss a file?  You have to put some thought and effort into failing gracefully.
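One sketch of the logging piece (the table and procedure names are hypothetical): wrap each conversion step in TRY/CATCH and record what failed and where, so the decision to page someone or wait until morning is deliberate rather than accidental.

```sql
-- Hypothetical error log plus a step wrapped so failures are recorded with context.
CREATE TABLE dbo.EtlErrorLog (
    LogId       int identity(1,1) PRIMARY KEY,
    BatchName   varchar(100),
    StepName    varchar(100),
    ErrorNumber int,
    ErrorText   varchar(4000),
    LoggedAt    datetime DEFAULT GETDATE()
);
GO

BEGIN TRY
    EXEC dbo.ConvertClaimsStaging;    -- hypothetical conversion step
END TRY
BEGIN CATCH
    INSERT INTO dbo.EtlErrorLog (BatchName, StepName, ErrorNumber, ErrorText)
    VALUES ('ClaimsLoad', 'ConvertClaimsStaging', ERROR_NUMBER(), ERROR_MESSAGE());

    -- Decide deliberately: swallow the error and move to the next step, or
    -- re-raise so the scheduler alerts an operator.
    RAISERROR('ClaimsLoad failed in step ConvertClaimsStaging', 16, 1);
END CATCH;
```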

6.)  Be a pack-rat

Space is cheap.  Buy some cheap disks and keep every file you ever load.  You just never know when you'll have to reload something or recreate a series of data.

7.)  Build in a buffer

Sometimes a file will go missing or some data will not be received.  Build in a buffer so that you can miss a file and the data will appear in the next couple of files.  It's a little extra work to build your ETL to be able to process the data overlap, but it pays for itself on the very first issue.
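Handling the overlap mostly comes down to making each load idempotent for the window it covers.  One way to do that (all names here are invented): delete whatever is already loaded for the dates in the incoming file, then insert, so overlapping or reprocessed files can't create duplicates.

```sql
-- Hypothetical idempotent load for an overlapping window of data.
-- If each file covers the last 7 days, reloading the same dates is harmless.
DECLARE @WindowStart datetime, @WindowEnd datetime;
SELECT @WindowStart = MIN(ServiceDate), @WindowEnd = MAX(ServiceDate)
FROM   dbo.ClaimsConverted;           -- converted/validated staging data

BEGIN TRANSACTION;

DELETE FROM dbo.ClaimsFact            -- clear anything already loaded for this window
WHERE  ServiceDate BETWEEN @WindowStart AND @WindowEnd;

INSERT INTO dbo.ClaimsFact (ClaimNumber, ServiceDate, PaidAmount, MemberId)
SELECT ClaimNumber, ServiceDate, PaidAmount, MemberId
FROM   dbo.ClaimsConverted
WHERE  ServiceDate BETWEEN @WindowStart AND @WindowEnd;

COMMIT TRANSACTION;
```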

8.)  Get a balancing report from an independent source and build a set of watchdog processes

For goodness' sake, don't try to balance financials from the same files you loaded.  Of course you loaded what you received.  Those checksums and trailer records help verify that.  It's what you didn't know you didn't receive that will bite you.  Get an independent balancing file sent to you on a regular basis and double-check against it.  Have a watcher process re-balance historic data nightly to be sure it isn't somehow inadvertently changing.  Have a watcher process look at your FTP or file share location for additional files you received that didn't get loaded.  Perhaps they were named incorrectly.
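The nightly re-balance can be as simple as comparing your loaded totals to the independent control feed, something like this (the control table and column names are hypothetical):

```sql
-- Hypothetical watchdog: re-balance loaded totals against the independent control feed.
SELECT c.BusinessDate,
       c.ControlAmount,
       ISNULL(SUM(f.PaidAmount), 0) AS LoadedAmount
FROM   dbo.BalancingControl c                  -- populated from the independent balancing file
       LEFT JOIN dbo.ClaimsFact f
              ON f.ServiceDate = c.BusinessDate
GROUP BY c.BusinessDate, c.ControlAmount
HAVING ISNULL(SUM(f.PaidAmount), 0) <> c.ControlAmount;   -- any rows returned means out of balance
```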

9.)  Practice good accounting

Do not change financials!  Ever!  Even if the source system is stupid enough to do that, you should not.  As part of your ETL, create offset transactions to negate the numbers of a canceled transaction.  Don't just overwrite the old record with new data.
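In practice that looks something like this (table and column names are illustrative): when a transaction is canceled or restated, insert a negating row that points back to the original rather than updating it, so the history always adds up.

```sql
-- Hypothetical offset: negate a canceled transaction instead of overwriting it.
DECLARE @CanceledTransactionId int;
SET @CanceledTransactionId = 12345;   -- the transaction being reversed

INSERT INTO dbo.Transactions (AccountId, TransactionDate, Amount, OffsetsTransactionId)
SELECT AccountId,
       GETDATE(),            -- when the reversal was processed
       -1 * Amount,          -- exact negation of the original amount
       TransactionId         -- link back to the record being reversed
FROM   dbo.Transactions
WHERE  TransactionId = @CanceledTransactionId;
```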

10.)  Have a need for speed

This one is hard for me to advocate.  Typically I'm the guy who'll tell you not to sacrifice maintainability for performance.  I also like the KISS principle - Keep It Simple, Stupid.  And I'll also tell you that premature optimization is the root of all evil.  These things are still very much true, but when it comes to ETL you can bet that you'll have to address performance.  The key is to discover what is slow.  Don't assume.  Build tests.  Do benchmarks.  Figure out what to generally avoid - T-SQL cursors, UDFs, and over-architected OO designs are no-nos.  Cache data if you need to.  If loading a data warehouse, figure out a good, performant key lookup strategy for surrogate keys.  Figure out how to minimize I/O operations, and do work in memory.  Let the database do a lot of the heavy lifting, and think of how to process data as sets rather than as individual records.  Stage data in memory (after you've written to raw tables, of course), and bulk-load production tables where possible.  Look at your processes critically, and decide whether something should really be taking as long as it is based on what it's doing, comparing it to other processes.  Ask for another set of eyes from a trusted developer.
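On the surrogate key point, the set-based version of the lookup is usually where the big win is.  As a sketch (the dimension, staging, and fact names are made up), a single join resolves every key at once instead of a cursor or UDF doing one lookup per row:

```sql
-- Hypothetical set-based surrogate key lookup: one join resolves every key at once,
-- instead of a cursor or UDF doing one dimension lookup per row.
INSERT INTO dbo.SalesFact (CustomerKey, SaleDate, Amount)
SELECT ISNULL(d.CustomerKey, -1),    -- -1 = "unknown" dimension member for unmatched rows
       s.SaleDate,
       s.Amount
FROM   dbo.SalesConverted s                     -- converted/validated staging data
       LEFT JOIN dbo.DimCustomer d
              ON d.CustomerNaturalKey = s.CustomerNumber;
```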

I hope this is helpful to someone.  I've not found a lot of good resources on the web for common-sense, practical ETL advice.  Mostly, it's pie-in-the-sky, big expensive tool, enterprisey gibberish.  Feel free to comment with anything you've found to be helpful when it comes to ETL.  And I would just like to take a moment to point out that you've now made it through a 1500+ word essay on ETL without once having to suffer the word metadata, no thanks to TDWI.

1 comments:

Kaleidoscope said...

Wow! You do all that at work?! I had no idea. And I see you
(practically) every day. I am totally ignorant about the topic of this post, but I read (almost) every word. Your phraseology is fascinating. Really. I admire the way you put words together. When I read your explanation, I feel like I understand it (even though I don't), due to the fluid writing. Amazing! And congratulations. NG