Strategies for Improving Data Quality
Some say that Data Scientists and Data Analysts spend 80% of their time cleaning data — some say it’s 60%. Whatever the definite number, the consensus is that data cleansing takes a significant amount of time. With data, it’s Garbage In, Garbage Out.
Having your data in a good state before analysis is key to being able to deliver credible, meaningful insights.
Many organizations are eager to work on the actual analysis: building reports, developing recommendations, and generating predictive models. This is understandable – these activities produce tangible output. And they hold the promise of contributing towards increased efficiency, cost savings, and other results which help to justify the expense of data teams.
But organizations that rush into ‘building mode’ without spending adequate time on data quality are setting themselves up for failure. Analysis built using poor data is like trying to build a house with poor materials. No matter how well thought out the design, the structure will not hold. And at the end of the day, a lot of money will have been spent, but the roof leaks and the floors sag and the house is uninhabitable.
Data quality initiatives can feel frustrating, because they’re never really ‘done.’ In many cases, 100% accuracy is unachievable. We recognize and accept that. But 80% is better than 60% and 90% is better than 80%. At each higher threshold, a greater number of analytics initiatives are possible. Time must therefore be spent on both building and cleaning. New levels of data cleanliness will unlock more analytics potential until the organization recognizes that higher quality data is needed before the next thing can be built.
Our focus now is on various approaches available to improve data quality. Spend the time to develop your data quality strategy and set your organization up for success in its analytics initiatives.
Add controls to prevent the creation of problematic data
If we can clearly define what types of data should be accepted, then in some cases the user can be prevented from creating problematic data in the first place. This can look like:
Adding in data validation that prevents the user from inputting data that doesn’t follow a specified format (ex: valid email address)
Rejecting the entry of data based on other predefined rules. For example, rejecting the input of prices above a reasonable range or not allowing an individual item that has been purchased to be purchased again (unless there has been a prior return).
Replacing fields where the user enters text freely (which is more prone to typos and errors) with predefined fields through drop downs. Or allow the user to scan in an
ID
instead of typing in.
The trade-off with data validations is that while the data ultimately received is ‘clean,’ care needs to be taken because:
A user can be prevented from submitting data at all
A user can be forced to submit inaccurate data.
Data validation should not block business critical activities from taking place nor prevent critical information from being recorded.
Automate the Data Cleanup
For instances where prevention of the creation of problematic data is not possible or desirable, data cleansing activities are necessary. Data cleansing activities can be automated or manual. Automated cleansing activities can take various shapes. They can occur before or after data load, be handled via scheduled jobs, or result from transformations within the database.
Automated cleaning is desirable when data problems follow patterns that are easily defined and when a solution can be standardized.
A key benefit of automated cleanup is efficiency — we don’t need to wait or depend on user actions. And if the problematic pattern can be clearly scoped and defined, automatic cleanup can create dependable and consistent output.
Issues like removing duplicates and replacing NULL
values with zeros are examples of very simple and easily automated solutions.
A risk of automated data cleansing is if the data does not fall into expected patterns. In this case, we can unintentionally create new problems and confusion in the data by misunderstanding the original problem or by including errors in our automated solution.
Therefore, for any solution, consider if it’s possible to recover the original data if we need to.
Give users a way to fix the data
There are cases when giving the user a way to fix the data is preferable to automated data cleanup.
Manual data cleansing may be preferable when:
The type of error does not follow a clear pattern and cannot be resolved in a standardized and repeatable way
The information needed to resolve the problem is not accessible without the input of a user
Challenges with manual data cleaning are that it can be inefficient and may require a user to repeatedly carry out tedious actions. And not all users may be incentivized to carry out data cleanup, the result being that it may not happen at all or may happen very slowly.
One method for ‘fixing’ problematic data is to surface it to the user and give them tools and instructions on how to fix it. The first requirement for this strategy is being able to define and identify what data is problematic. The user then needs to be alerted to the existence of problematic data. This ‘alerting’ can look different depending on the type of user and data in question.
In the case of the user being an internal operations user, a path may be to surface the data to the user through a report or dashboard. Or we might generate and send an email to the user alerting them to the problem.
Then, the user must have a means to fix the problem. On the more sophisticated end, this could look like tooling within an application, perhaps directly linked to from where the user was alerted. On the more primitive end of the spectrum, an internal user may pass on instructions through a ticket or email for the data or development team to edit, delete, or create records.
Write tests to identify new patterns of problematic data
Perhaps most importantly is that you need to be able to identify new patterns of problematic data quickly so that you can respond with new controls and solutions as necessary. At Patch Monkey, we use DBT (Data Build Tool) for tests to help flag problematic data. Whether using DBT, or another means, tests should alert your team when data falls outside of what is expected. Examples of occurrences that might be flagged by tests are:
A value is not unique
A value is
NULL
A numeric value is above or below a certain range
A value is not in a list of accepted values
A value is missing an expected relationship (ex: The customer_id on an invoice record does not match a customer record in the database)
Once a test has failed, your team can then determine if a new solution needs to be implemented. Can new controls be added to prevent the creation of problematic data falling in this pattern? Can we automatically clean up future data of this pattern? Or can we give users a way to fix data of this pattern?
More information on how tests work in DBT can be found here.
Conclusion
Regardless of which industry you are in, how much data you have, and what it looks like, data quality concerns are almost certainly a part of your tech stack. We hope the strategies listed here are helpful in your data quality journey.
If you’d like more information on any of the topics mentioned, or if you’d like to find out how Patch Monkey can be a partner in helping you address your data quality issues, contact us!
Happy data cleansing!