Easy, not Simple: Diagnosing Data Quality Issues

No data science program will be successful if its source data quality issues aren’t addressed. Anyone who says their source data doesn’t have quality issues hasn’t looked at it hard enough, or hasn’t talked enough with business users or data warehouse analysts. Every source system will have endemic quality issues, and it is the duty of the data professional to address them in a meaningful way.

Many organizations struggle even to define what the issues are, because it seems like an overly simple exercise in Who, What, When, and Where. But it’s easy, not simple.


To diagnose your data quality issues, you’ll need to enlist some aid. If you are approaching this problem from an IT standpoint, you can’t rely on IT analysts alone; you’ll need to listen to all of the business and downstream users of the data to understand what they might be doing to transform that data into something usable. If you are a downstream analyst, internal data broker, or citizen data developer, you’ll need to listen to more than the concerns of those within your data consumer circle and take into account the upstream vulnerabilities that IT may know of.

The more serious your data quality issues and the larger your organization, the larger the downstream systems will be. Throughout any data exercise, you’ll find many workarounds in the wild. Data is where business and IT collide, and it’s often a huge friction point of misunderstandings and poorly strategized communications.

The first set of people that you identify should probably be easy — they will be the loudest about all of the information pain points they are facing. Gather those people together and start talking about what processes they are having problems with, and what they have to do to work around them.

You don’t want to be disorganized about these conversations. Ultimately, you are seeking to sum up your experience into a simple summary paragraph that can be used in executive summaries of your data governance efforts.

Over the past [period of time], data quality issues have impacted the accuracy of [type of data] in [system], which has impacted analyses in [business units impacted by data quality]. These issues have caused costly manual rework and distrust in our data integrity, as well as an overall decrease in productivity and an increase in costs.
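If you record the answers from your sessions in a structured form, the first sentence of that summary can be filled in mechanically. Here is a minimal sketch in Python. The class name, field names, and example values are all hypothetical placeholders, not part of any real system.

```python
from dataclasses import dataclass


@dataclass
class QualityFinding:
    """Answers gathered from stakeholder sessions (field names are illustrative)."""
    period: str                 # the [period of time] slot
    data_type: str              # the [type of data] slot
    system: str                 # the [system] slot
    business_units: list[str]   # the [business units impacted] slot


def executive_summary(finding: QualityFinding) -> str:
    """Fill the opening sentence of the summary template from one finding."""
    units = ", ".join(finding.business_units)
    return (
        f"Over the past {finding.period}, data quality issues have impacted "
        f"the accuracy of {finding.data_type} in {finding.system}, "
        f"which has impacted analyses in {units}."
    )


# Made-up example values, purely for illustration.
finding = QualityFinding(
    period="six months",
    data_type="customer address data",
    system="the billing warehouse",
    business_units=["Finance", "Marketing"],
)
print(executive_summary(finding))
```

Keeping the slots as named fields makes it obvious when a session has left one of them unanswered.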

To be able to construct such a simple summary, you’ll need to systematically explore some questions. These don’t have to be asked in order — but they should all be asked.


Sit with data stakeholders and ask the following questions:

What are your symptoms of poor data quality?

What are the top risks you are facing due to those issues?

What points of friction are hurting the most?

Think of this more as a brainstorming session, though you can often throw in a healthy dose of therapy, because these sessions can be painful. There are built-up emotions in those who work with the data day in and day out. When all of the stakeholders are sitting in a room together (or in smaller cross-functional subgroups of data SMEs, held to foster focused, meaningful conversation), their passions for the data will be laid bare. It can be both exhilarating and difficult, and if you are moderating, make sure everyone keeps their empathy for each other. This isn’t a time for us vs. us; it’s a time for us vs. the data issue.

Once you have a handle on the fundamentals of the ‘what,’ you’ll be able to ask this question:

What are the top opportunities you can see?

Here is where you can brainstorm short-term and long-term cross-functional projects to address data quality issues.


When and with what frequency do data quality issues present?

How long do the data quality issues last?

These time-based questions help to identify issues that may have been closed as ‘working as designed’ but were never truly resolved for your users. You’ll find some data issues that only happened with a historical update, others that happen once a year, and others that are far more complex and perplexing. If you are looking to summarize the impact of the data quality issues for your executives, understanding the time frames for possible errors will be vital. While this step often seems to be a no-brainer, it’s important that you not skip it. It can be surprising to hear what some people are experiencing and correcting within the data, especially if you’ve never asked such questions before.
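One way to make the answers to the frequency and duration questions comparable is to keep a simple log of each occurrence and summarize it. The sketch below is in Python and assumes a hypothetical log of (issue, date first seen, date resolved) entries. The issue names and dates are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical issue log: (issue name, date first seen, date resolved).
issue_log = [
    ("duplicate customer IDs", date(2021, 1, 4), date(2021, 1, 6)),
    ("duplicate customer IDs", date(2021, 2, 1), date(2021, 2, 5)),
    ("missing year-end totals", date(2020, 12, 31), date(2021, 1, 15)),
]


def summarize(log):
    """Return {issue: (number of occurrences, average duration in days)}."""
    durations = defaultdict(list)
    for name, started, resolved in log:
        durations[name].append((resolved - started).days)
    return {
        name: (len(days), sum(days) / len(days))
        for name, days in durations.items()
    }


for name, (count, avg_days) in summarize(issue_log).items():
    print(f"{name}: {count} occurrence(s), {avg_days:.1f} days on average")
```

Even a rough log like this separates the once-a-year issue from the one that recurs every month, which is the distinction executives tend to ask about.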



Where is the data stored?

Where do you see the data quality issues?

How many downstream processes are impacted?

Again, the simpler questions are going to be some of the most enlightening. It was surprising to IT executives to find out that one of the most important stores of information was trapped within Microsoft Access (and it was devastating to find this out because an upgrade to the software broke functionality within the system) or in Excel spreadsheets. Many Master Data Management systems are still remediated in Excel and Access behind the scenes. To properly identify those, and any errors they are attempting to address (or might introduce!), it’s important to return to the simple questions.

You’ll often discover that data prepared by a carefully engineered process is being replaced downstream with yet another Excel spreadsheet that someone’s admin assistant has to maintain. Be prepared for these types of situations to come up! Remember, it doesn’t matter how graceful your solution is if your C-suite is still getting their admins to write up a summary in an email for them.
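Once stakeholders have told you who consumes what, the question of how many downstream processes are impacted becomes a walk over that map. The following is a minimal sketch in Python. The `consumers` dictionary and every name in it are hypothetical examples of what interviews might surface, including the spreadsheets and Access databases nobody documented.

```python
from collections import deque

# Hypothetical map gathered in interviews: each key is a data source,
# each value lists the processes that read from it directly.
consumers = {
    "orders_db": ["nightly_etl", "access_tracker"],
    "nightly_etl": ["sales_dashboard", "finance_workbook"],
    "access_tracker": ["finance_workbook"],
    "finance_workbook": ["exec_email_summary"],
}


def downstream_of(source, graph):
    """Return every process reachable from `source`, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already counted via another path
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


impacted = downstream_of("orders_db", consumers)
print(len(impacted), sorted(impacted))
```

Here a quality issue in the hypothetical `orders_db` touches five downstream processes, two of which are the kind of hand-maintained workaround that rarely appears on an architecture diagram.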

Getting a handle on what data is being used to create what information returns us to simple questions, which often do not have easy answers.

By Jamie Toth, The Somewhat Cyclops.


Exported from Medium on December 5, 2021.