Friday, June 18, 2010

SQL Server Job performance

I've been fighting a mis-perception at work that ETL (data transformation and loads) is one of our company's threats to database scalability.  It certainly was when I started here a few years ago, but I've worked hard to change all that.  At the risk of sounding like I'm tooting my own horn, my ETL processes are really well optimized.  I'm cranking through 50+ different large file loads a night in well under an hour.  As far as I'm concerned, I've got each individual ETL process smokin' fast, while occasionally making conscious trade-offs between performance and maintainability.

But the perception is still there in part because I've not made my SQL job stats transparent enough.  Stuff just works, and so outdated opinions never got challenged.  The reality today is that the jobs that are causing us the most trouble aren't the ones where we're loading external vendor files, but the ones where some developer is doing internal data processing and they're doing it BADLY.  It's the linked servers and the cursors and the UDFs and the over-architected single-object-inflate-then-commit patterns that threaten our scalability as our business grows.  So I set out to prove it.

Job reports are available for everyone to see in SSMS, but you have to consciously go after it and interpret the results.  That wasn't working for me, because as the DBA I'm the only one who was doing that regularly.  It's only in my face, and my jobs run fine.  So I developed the "SQL Server Job Report", which is a daily e-mail report to my whole dev team highlighting the following info:
  • Job name
  • Number of job runs in the past month
  • Average duration during past month
  • Success rate during past month
  • Most recent run date
  • Most recent run duration
  • Most recent run status
While everyone has been concerned with what happened during last night's runs, I wanted to put the spotlight on two other key metrics:
  1. What's the average job duration?  Is a job taking over an hour consistently?  That's a threat when it fails and we have to re-run it during business hours.
  2. What's the success rate for a job?  Jobs that aren't robust or well tested sap developer time.  Do we have jobs that fail more than 25% of the time?  Why?
This did the trick!  I got the responses I needed which were, "I didn't know that job was failing that often!" and the "I didn't know that job always took that long!".  Progress...

Using this as empirical evidence, I was able to devote some time today to the job that was biggest offender.  It happened to be a series of stored procs with linked servers, UDFs, and cursors that went against our largest tables and consistently took 3+ hours a night.  With less than 1/2 a day's work, I was able to help the developer get the job down to 20 minutes!  That's pretty satisfying.  We'll see how much traction I can maintain on the other problem children...