Friday, June 19, 2009

Reality Bytes

I receive and process a lot of data from external vendors. In a life immersed in extracting, transforming, and loading data, the thing I find most unbelievable is how often people worry about the size of the data when it can't possibly be a real concern. Take this morning - I received a historical file from a vendor containing 1,500 records. The file is being sent incorrectly: it contains 12 months of 'history', but we've discovered that changes are happening on records older than 12 months. What we really want is 12 months of change history, not 12 months based on some termination date. Since they don't track change history, we'll have to settle for widening the time window to 24 months.

Anyway, that file is 225 KB in size. Seriously - my shoelaces have that much memory! But, predictably, I'm getting pushback because "adding to the history file would make it larger". "You really want to double the size of that file!?" Well, technically, yes. True, but irrelevant. A gram of dust landing on another gram of dust does double the mass, but you can't seriously expect me to believe you care! If there's some other technical reason, fine. But don't try to sell me on data size concerns.

This is an extreme (though true) example, but it always amazes me how often people worry needlessly about the size of their data. I see this in coders too, when they fret over O(n^2) looping. Okay, yes, your comp-sci 101 class taught you to watch out for that, but they were talking about processing millions of rows, not thousands.
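
To put rough numbers on that, here is a minimal Python sketch that counts the inner-loop iterations of a naive nested loop at a few row counts. The function name pairwise_comparisons and the row counts other than 1,500 are illustrative assumptions, and the sketch only counts iterations; it makes no claim about actual run times on any particular machine.

```python
def pairwise_comparisons(n: int) -> int:
    """Inner-loop iterations of a naive nested loop over n rows (n * n)."""
    return n * n


# 1,500 is the vendor file's record count; the other sizes are illustrative.
for rows in (1_500, 3_000, 1_000_000):
    print(f"{rows:>9,} rows -> {pairwise_comparisons(rows):>17,} comparisons")
```

A few thousand rows lands in the millions of iterations. A million rows lands at a trillion, and that is the case the comp-sci class was warning about.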

Maybe I'm jaded because my last project was on a 4 TB data warehouse whose largest fact table had 18 billion rows. Many of the dimension tables were 60-80 million rows. Somehow, a couple of thousand extra records just don't faze me. People - get some perspective! Bytes are cheap.
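
For a sense of scale, here is a back-of-the-envelope sketch using only the figures above. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 1,000^4 bytes) and takes the vendor's "double the size" at face value for the 24-month file; the variable names are just for illustration.

```python
FILE_BYTES = 225 * 1000            # 225 KB file, assuming decimal kilobytes
RECORDS = 1_500                    # records in the 12-month file
WAREHOUSE_BYTES = 4 * 1000**4      # 4 TB warehouse, assuming decimal terabytes

bytes_per_record = FILE_BYTES / RECORDS
doubled_file_bytes = 2 * FILE_BYTES  # assumes 24 months doubles the file
files_per_warehouse = WAREHOUSE_BYTES / doubled_file_bytes

print(f"Bytes per record:       {bytes_per_record:,.0f}")
print(f"Doubled file size:      {doubled_file_bytes:,} bytes")
print(f"Doubled files per 4 TB: {files_per_warehouse:,.0f}")
```

Under those assumptions each record is about 150 bytes, and the "doubled" file would fit into that warehouse nearly nine million times over.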