Tuesday, 12 April 2011

Waste - Why you shouldn't bring all your data into your Data Warehouse

In lean, there are three types of work: value-creating work, incidental work and waste.  Applied to a data warehouse, these become customer value add data, business value add data and waste.


Customer Value Add Data
This is any data collected that is directly used in a business workflow to provide a product/service to a customer.  Without this data, you would not know who ordered what and how much, where to deliver or whom to bill, and you could not manage inventory or tell whether you were ever paid.


Business Value Add Data
Any data collected that the business uses to better understand or know their customers.  This is data that helps the business create customer profiles, understand buying patterns, behaviors and preferences.  To understand your customers better, as a business, you need data to answer questions like: 
  • How long has this person been a customer? 
  • How valuable is this customer to me (is he/she in my best/next best/low value customer segment)?
  • Are my products/services more attractive to teens/adults/singles/couples?
  • Are customers buying products together?  Would I be better serving my customers by offering a bundle?
  • What are my top sales regions?
  • Are my customers buying from me only because I have the lowest prices, because of my loyalty program or because they like my customer service?
  • How can I get better return on my marketing dollars?
    • What can I offer this person so that he/she buys more of the same product, or which complementary product can I cross-sell to him/her?
Without this information, you could still run your business, since the data used to answer these questions isn't necessarily needed in your workflow to provision a product/service; you just wouldn't be able to understand and meet the needs of your customers as well.

Waste
Any data collected that isn't used to provide the customer with value, or by the business to understand or serve customers better.  This seems obvious, yet a lot of this type of data ends up in data warehouses.
On previous projects I have asked business stakeholders to define what data needs to be moved into the data warehouse.  All of them have said, "Move all data into the warehouse.  We don't need it now, but may need it in the future".  Their belief is that if we're already moving data in from the various tables in the operational systems, it's easier and cheaper to do it all now.  That way, they don't have to wait for data to be brought into the warehouse before creating a new report if a field ever becomes important.  What's the impact of this request?  From a Project Manager's perspective:
  • Scope
    • Create a plan that estimates what it would take to move all data, table by table, into the data warehouse
  • Time
    • Additional time for coding, processing and testing the waste data
  • Cost
    • Development costs increase with the additional time and scope needed to migrate, integrate and validate the waste data in the warehouse
    • More storage is needed to accommodate the waste data
    • Processing costs increase - more data needs to be loaded in the overnight load window


Waste creates more waste 
Beyond the initial sticker price of getting your data warehouse to a point where you can finally begin reporting, there are also ongoing costs associated with adding and maintaining waste data.  Think of it as a waste tax (or several waste taxes).


Waste testing tax
Developing the ETL code for the waste data is a one-time effort, but every time your code is updated, you need to retest it, waste included, to make sure you didn't break anything.


Storage improvement tax
Waste data grows as fast as your value add data.  Since you're storing waste data on the same storage as your value add data, you will need to increase your storage more often.
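
To put a rough number on this tax, here is a minimal Python sketch.  Every figure in it (the starting sizes, the 5% monthly growth rate, the storage capacity) is a made-up assumption for illustration, and the function name months_until_full is hypothetical; plug in your own numbers.

```python
def months_until_full(initial_gb, monthly_growth_rate, capacity_gb):
    """Count the months until data growing at a fixed monthly rate exceeds capacity."""
    if monthly_growth_rate <= 0:
        raise ValueError("monthly_growth_rate must be positive")
    size_gb = initial_gb
    months = 0
    while size_gb <= capacity_gb:
        size_gb *= 1 + monthly_growth_rate  # compound growth, month over month
        months += 1
    return months


# Hypothetical figures, for illustration only.
VALUE_ADD_GB = 500      # data you actually use
WASTE_GB = 300          # data loaded "just in case"
MONTHLY_GROWTH = 0.05   # assumed identical for waste and value add data
CAPACITY_GB = 2000      # storage currently provisioned

months_value_only = months_until_full(VALUE_ADD_GB, MONTHLY_GROWTH, CAPACITY_GB)
months_with_waste = months_until_full(VALUE_ADD_GB + WASTE_GB, MONTHLY_GROWTH, CAPACITY_GB)

print(f"Value add data only: storage is full after {months_value_only} months")
print(f"With waste data:     storage is full after {months_with_waste} months")
```

With these assumed numbers, the warehouse that carries waste needs its next storage upgrade about ten months sooner (19 months instead of 29).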


Network traffic tax
Whether your organization performs nightly batch data loads or you load data in real time, waste data, along with value add data, is moved from your source system(s) over the network to the DW (or staging area if applicable).  The more your data grows, the more traffic you have on the network and the more waste you are moving.


Data load/refresh tax
If your organization performs nightly batch loads, the load window is a fixed length of time.  For most organizations, data loads, if all goes well, just fit within the window.  If you then decide to add a new source of data, you will need to spend time optimizing current load times in order to fit the new source into the same window.  That means spending time, people and ultimately money on optimizing waste.
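
The squeeze on the load window can be sketched the same way.  This is a simplified Python illustration that assumes jobs run one after another; the job names, durations, window length and the LoadJob and window_headroom names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadJob:
    name: str
    minutes: int
    is_waste: bool


def window_headroom(jobs, window_minutes):
    """Minutes left in the load window, assuming the jobs run sequentially."""
    return window_minutes - sum(job.minutes for job in jobs)


# Hypothetical nightly jobs and a 6-hour window, for illustration only.
WINDOW_MINUTES = 360
jobs = [
    LoadJob("orders", 120, is_waste=False),
    LoadJob("customers", 90, is_waste=False),
    LoadJob("unused_audit_columns", 80, is_waste=True),
    LoadJob("legacy_flags", 60, is_waste=True),
]
value_add_jobs = [job for job in jobs if not job.is_waste]

headroom_with_waste = window_headroom(jobs, WINDOW_MINUTES)
headroom_value_only = window_headroom(value_add_jobs, WINDOW_MINUTES)

NEW_SOURCE_MINUTES = 45  # assumed load time of the new data source
fits_with_waste = NEW_SOURCE_MINUTES <= headroom_with_waste
fits_value_only = NEW_SOURCE_MINUTES <= headroom_value_only

print(f"Headroom with waste loaded: {headroom_with_waste} min, new source fits: {fits_with_waste}")
print(f"Headroom without waste:     {headroom_value_only} min, new source fits: {fits_value_only}")
```

In this made-up scenario the new source only fits once the waste loads are gone; otherwise someone has to tune the existing jobs first.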




Nobody likes paying taxes or wants to wait 18+ months to see value.  So apply agile and lean principles to your BI and DW initiatives. Prioritize and focus on the high value data, build the reports you need and you will see a return on your investment sooner.