“The time has come,” the Walrus said,
“To talk of many things:
Of shoes–and ships–and sealing-wax–
Of cabbages–and kings–
And why the sea is boiling hot–
And whether pigs have wings.”
– Lewis Carroll –
Recent experience has reminded me how important the fundamentals of data reconciliation are, and how easy it is to forget them when we are moving quickly on a project. For practitioners of Business Intelligence, reconciliation is foundational to what we do. After all, what is the value of serving up “intelligence” if the underlying data are incorrect? So with apologies to Mr. Dodgson, let us talk of sprockets and cubbies.
Before starting our journey, allow me to clarify the scope of this article. It is strictly about data reconciliation, which is only one aspect of the larger topic of data quality. And while data reconciliation intersects with data quality characteristics like completeness and accuracy, it does not necessarily do so with others such as validity, consistency, or relevance.
There are essentially two forms of data reconciliation, both of which follow the same rules of engagement. The first is the tying out of data in a source system to an end point application. Say you operate a sprocket manufacturing company. Reconciling the data in a sprocket reporting system back to the purchase order system would be an example of this form. The other form is the tying out of data between two end point applications. In this case, you want to ensure that your new Management Essentials Sprocket Sales (MESS) dashboard is not telling a conflicting story to your legacy Sprocket Customer Utilization Metrics (SCUM) reporting application.
There are also two levels of data reconciliation. The first evaluates the totality of the data being moved. In other words, if I have 500 sprockets in the source, do I also have 500 sprockets at the end point? This is the data property of completeness. The other evaluates the quality of the attribution or transformation that has been applied. Do all 500 sprockets fall into the correct cubbies and sub-cubbies that represent how your company does business?
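The two levels can be sketched in a few lines of Python. This is a minimal illustration with hypothetical sprocket records, not a production reconciliation harness: each record is an `(id, cubby)` pair, and the two checks answer, in turn, “did all the sprockets arrive?” and “did they land in the right cubbies?”

```python
from collections import Counter

# Hypothetical sprocket records: (sprocket_id, cubby) pairs.
source = [(1, "A"), (2, "A"), (3, "B"), (4, "C")]
end_point = [(1, "A"), (2, "A"), (3, "B"), (4, "B")]  # one sprocket mis-attributed

def completeness_ok(src, dst):
    """Level 1: the same sprockets exist at both ends."""
    return sorted(r[0] for r in src) == sorted(r[0] for r in dst)

def attribution_variances(src, dst):
    """Level 2: per-cubby tallies, reporting only the cubbies that differ."""
    src_counts = Counter(cubby for _, cubby in src)
    dst_counts = Counter(cubby for _, cubby in dst)
    return {c: dst_counts[c] - src_counts[c]
            for c in src_counts.keys() | dst_counts.keys()
            if dst_counts[c] != src_counts[c]}

print(completeness_ok(source, end_point))        # True: all four sprockets arrived
print(attribution_variances(source, end_point))  # e.g. {'B': 1, 'C': -1}
```

Note that the data here pass the completeness check and still fail attribution, which is exactly why the two levels are evaluated separately.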
There are five rules of engagement governing the data tie out process. These are common sense rules, but time and again I see them ignored due to haste or enthusiasm. I have done so myself, always to my inevitable chagrin.
Rule 1: Establish the Source
The very first step in reconciliation is to identify and understand your data source. Is it the operational system where the data originate, or is it a subject area in your data warehouse? Does the enterprise accept the quality of these data? What transformations have been performed before you receive them? What issues within these data form the beginning axioms of your reconciliation? For instance, if you know that the data or the attribution for a record changes from day to day, may be incomplete, or carries different meanings depending on context, how will this affect your reconciliation process? Understanding the data source is fundamental to quality data reconciliation.
Rule 2: Address Completeness Before Attribution
We all like to see a sea of zeros on the variance sheet. For those of us who enjoy paddling about in data, it is like a field of daisies on a spring morning. But life doesn’t work that way. Particularly in complex solutions, it is important first to validate that you are working with the same data set with which you began. If somewhere between the source system and the solution end point you have dropped (or worse, added!) data, it doesn’t matter that your cubbies and sub-cubbies don’t tie. Validate first that every instantiation of the data from source to end point contains the same 500 sprockets you began with. You can waste a lot of time trying to tie your cubbies and sub-cubbies if you don’t have all your sprockets.
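A completeness check along the whole pipeline can be as simple as comparing record counts at each instantiation against the source baseline. The stage names and counts below are hypothetical, just to show the shape of the check:

```python
# Hypothetical record counts at each instantiation between source and end point.
stages = {
    "source":       500,
    "import_table": 500,
    "work_table":   498,   # two sprockets went missing here
    "end_point":    498,
}

baseline = stages["source"]
# Report only the stages whose count departs from the source baseline;
# this points you at the step where completeness first broke.
drops = {name: count - baseline for name, count in stages.items()
         if count != baseline}
print(drops)  # {'work_table': -2, 'end_point': -2}
```

Only once this report is empty is it worth opening the cubby-level variance sheet.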
Rule 3: Tie Data at Every Step
Never just reconcile your source directly to your end point. It doesn’t reveal the entire picture. If you tie out each worktable and validate that at each step your data remain complete and accurate, then they will have demonstrable integrity. This is key when you do need to transform or eliminate something from the data before they enter the solution. For example, your source system may create a sprocket record with a quantity and other attributes, but may not require a sub-cubby attribute until later. Your analysts don’t want those unattributed records in your reporting system yet because they will create noise or confusion. Nevertheless, you would want to validate first that all sprocket units are accounted for in your import table. This ensures that you now have ultimate control of the end result. Then eliminate the unwanted records and validate again. This now becomes the data set that is reconciled to your solution end point.
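The validate–eliminate–revalidate pattern above can be sketched as follows. The rows, quantities, and the `sub_cubby` attribute are invented for illustration; the point is that the filtering step sits between two explicit validations, so the variance it introduces is known rather than discovered later:

```python
# Hypothetical import rows: (sprocket_id, qty, sub_cubby). The source system
# does not require sub_cubby at creation time, so it may be None.
import_rows = [
    (1, 10, "X1"),
    (2, 5,  None),  # not yet attributed; analysts don't want it in reporting
    (3, 7,  "X2"),
]
source_units = 22  # total units known from the source system

# Step 1: validate completeness of the raw import BEFORE any filtering.
assert sum(qty for _, qty, _ in import_rows) == source_units

# Step 2: eliminate the unwanted (unattributed) records...
reporting_rows = [row for row in import_rows if row[2] is not None]

# Step 3: ...and validate again. The filtered set, with its known and
# explainable variance, is what gets reconciled to the solution end point.
excluded_units = source_units - sum(qty for _, qty, _ in reporting_rows)
print(excluded_units)  # 5 units deliberately held back, not mysteriously lost
```

When someone later asks why the end point shows 17 units instead of 22, the answer is already written down at the step where the difference was created.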
Rule 4: Avoid Reconciling End Point to End Point
It is rarely productive to try to tie out data in two end solutions directly with each other without first having reconciled the source to end point paths on both solutions. Even if they appear to tie, they might not. Often, business users will only look at that aspect of the data in which they are interested. As a BI practitioner, you are interested in the totality of the data. In order to avoid being asked back in six months to fix MESS (or worse, SCUM), apply the first three rules to both solutions you wish to reconcile. This will save time because you will uncover issues more efficiently, while ensuring data integrity at every step. At the very end, deploy a reconciliation mechanism that looks at both the totality of the data as well as the attribution across the two solutions.
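The danger of “they appear to tie” is easy to demonstrate. In this hypothetical sketch, extracts from the two end points agree on the grand total, which is often the only number a business user checks, yet diverge at the full attribute grain:

```python
from collections import Counter

# Hypothetical full-grain extracts from the two end points:
# (cubby, sub_cubby) -> sprocket count.
mess = Counter({("cubby_A", "sub_1"): 300, ("cubby_A", "sub_2"): 100,
                ("cubby_B", "sub_1"): 100})
scum = Counter({("cubby_A", "sub_1"): 290, ("cubby_A", "sub_2"): 110,
                ("cubby_B", "sub_1"): 100})

# The headline totals tie, so a casual comparison sees no problem...
print(sum(mess.values()) == sum(scum.values()))  # True

# ...but comparing at full grain exposes the attribution drift between them.
diffs = {key: scum[key] - mess[key]
         for key in mess.keys() | scum.keys()
         if scum[key] != mess[key]}
print(diffs)
```

A final end-point-to-end-point mechanism should therefore compare both the totality and the full attribution grain, and it only carries weight after Rules 1 through 3 have been applied to each solution’s own source-to-end-point path.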
Rule 5: Never Assume End Point Data Integrity
This rule really should be, never assume data integrity. But practically speaking, there is more inherent danger in assuming end point integrity. I don’t know how many times I’ve been told, “My data need to tie to SCUM,” because SCUM has been the go-to source for information for years. Then it turns out that a flaw was introduced into SCUM during a recent enhancement and eluded User Acceptance Testing (largely because Rule 4 was ignored), and everyone is now reconciling to incorrect data. Only by following Rules 1-4 in reconciling MESS did we discover the flaw in SCUM. Without them, we would still be wasting time trying to fix MESS when the mess is actually SCUM.
On the surface, this all might seem mundane. It isn’t really, and I am surprised at the high number of data analysts who are only interested in the finished metric, not the data underlying it. There have been cases where I have brought a sample of raw data into a meeting to demonstrate an underlying condition that was driving a hot issue. Routinely, all eyes will glaze over and I will be asked the question, “That’s all very well, I suppose, but can you fix it?” Do pigs have wings? Without some appreciation for data tectonics, how can one expect to understand – much less analyze – what one is seeing in MESS and SCUM?
It is down in the data tectonics where reconciliation really occurs, and why the rules of engagement are critical to success when delivering or servicing BI solutions. I spend almost as much time writing code to reconcile data as I do coding the solution. At the end of the day, it helps me explain why the sprocket tooth count metric in MESS doesn’t tie precisely to its putative counterpart in SCUM. It also allows me to validate quickly at deployment.
Again, these are common sense rules that we all know and all forget to follow from time to time. Nevertheless, they are a key success factor and consequently a tool to be kept at hand. Use them liberally and your MESS and SCUM will tell a consistent story of sprockets and cubbies.
Do you have a sprockets and cubbies story that illustrates these rules? Do you have other tried and true data reconciliation techniques to share?
“But wait a bit,” the Oysters cried,
“Before we have our chat;
For some of us are out of breath,
And all of us are fat!”
“No hurry!” said the Carpenter.
They thanked him much for that.
– Ibid –