PRACTICAL WAYS TO CHECK DATA QUALITY

Data quality is frequently cited as a problem:

  • Analysts spend 20 to 60% of their time trying to understand and fix poor data [1]
  • Inaccurate data wastes up to 27% of an organization’s revenue [2]
  • 1 in 3 business leaders don’t trust the information they use to make decisions [3]

Given how much time and effort is spent checking data, and knowing the flow-on impacts that poor data quality can create, this short article shares some experience in the ‘how-to’ practices and techniques that work!

Whilst working alongside a State Finance group made up of 5 finance business partners, I noticed that the team seemed to spend a rather constant effort on ‘journal entries’ (adjustments to finance records). Whilst this is a common practice, the amount of time & effort being expended seemed excessive. A quick tally of the activity over the last quarter revealed that 20% of the team’s time (effectively 1 FTE) was being consumed making journal entries, all caused by the operational team’s initial incorrect coding of expenses!

This effort was leaving the team with little to no time to provide better value-add through commentary on their financial reports, as they were consumed reworking poor quality data. A very real example of the pain that poor data quality can cause!

Two key practices identify data quality issues and ensure the data is Complete, Correct and Consistent: Practical Observation and Graphical review.

Practical Observation

First, check the data is Complete & Technically Correct: which columns are present, and do they contain the data required? Is the data in each column in a usable form? Within the data columns, are the expected types of data (categorical and continuous formats) present?

Figure: Example of data with technical issues (missing values and inconsistent date formats)
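
If your data lives outside Minitab, the same checks can be scripted. Below is a minimal pandas sketch; the file name ‘expenses.csv’ and the ‘order_date’ column are hypothetical, for illustration only:

    import pandas as pd

    # Hypothetical file and column names, for illustration only
    df = pd.read_csv("expenses.csv")

    # Completeness: which columns are present, and how many values are missing?
    print(df.columns.tolist())
    print(df.isna().sum())

    # Technical correctness: coerce the date column; values that fail to
    # parse become NaT and can then be inspected directly
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    print(df[df["order_date"].isna()])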

Good questions to ask yourself at this stage include:

  • Does the data you’re looking at make sense?
  • Is the data that exists in the columns in a useful form?
  • Is the data format consistent with & aligned to the column heading? e.g. customer names are in the ‘customer name’ column
  • Do the columns that contain numbers have the expected magnitude & decimal places?
  • Scanning down the columns, are there patterns of repeated or duplicate values present?
Figure: Practical observation of this data reveals a free text column that may contain useful information, but is not in a usable format

Minitab software includes some very useful tools to assist in conducting these types of observations: Data summary (Menu: data>worksheet information) and Tally table (Menu: stat>tables>tally individual variables).
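
For those working in code rather than Minitab, rough pandas equivalents of these two tools might look like the sketch below (file and column names are again hypothetical):

    import pandas as pd

    df = pd.read_csv("expenses.csv")  # hypothetical file

    # Rough equivalent of Minitab's data summary: types, counts, statistics
    df.info()
    print(df.describe(include="all"))

    # Rough equivalent of a tally table: counts per category, including missing
    print(df["expense_code"].value_counts(dropna=False))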

Graphical review

Next, check the data is Consistent: technically correct data that truly represents the process it relates to. Conduct a Graphical review, ‘rolling in the data’: is it as expected and representative?

This includes identifying issues such as:

  • Inconsistent values
  • Sustained patterns of repeated values
  • Outliers or extreme values
  • Not representing the full range of categories, time, geography etc.
  • Truncated values, excluded data
Figure: Timeseries chart of ‘elapsed times’ revealing the presence of negative values and some other extreme outliers; the data integrity needs checking
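
A hedged sketch of this review in pandas and matplotlib, assuming a hypothetical ‘elapsed_minutes’ column; the 1.5 x IQR rule used to flag outliers is a common rule of thumb, not something prescribed here:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("process_times.csv")  # hypothetical file

    # Time series view: plot the values in record order to spot drift,
    # steps, repeated values and extreme points
    df["elapsed_minutes"].plot(title="Elapsed time by record order")
    plt.show()

    # Negative elapsed times are physically impossible: flag them for checking
    print(df[df["elapsed_minutes"] < 0])

    # Flag extreme outliers using the 1.5 x IQR rule of thumb
    q1, q3 = df["elapsed_minutes"].quantile([0.25, 0.75])
    iqr = q3 - q1
    print(df[(df["elapsed_minutes"] < q1 - 1.5 * iqr) |
             (df["elapsed_minutes"] > q3 + 1.5 * iqr)])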

Minitab includes graphical tools such as the timeseries chart, bar chart, histogram (Menu: Graph) and Pareto chart (Menu: stat>quality tools>Pareto chart) that are quick & ideally suited to conducting these types of Graphical reviews.

Figure: Pareto chart of fault categories revealing that ~11% of the data is suspect, as it has been recorded with an “Unknown” reason
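
A minimal Pareto-style sketch in pandas, assuming a hypothetical ‘fault_category’ column, including a check on the share of records recorded as “Unknown”:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("faults.csv")  # hypothetical file

    # Category counts sorted largest-first: the essence of a Pareto chart
    counts = df["fault_category"].value_counts()
    counts.plot(kind="bar", title="Pareto of fault categories")
    plt.show()

    # Share of records with no usable reason recorded
    print(counts.get("Unknown", 0) / counts.sum())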

Data needs to be representative of the process it came from; beyond just checking the values, this requires a ‘logic and sanity check’. More often than not, it is simply assumed that the data is representative and this check is skipped, potentially resulting in missed insights or incorrect conclusions. Does the data in the dataset show the ‘full range’ that is expected?

Figure: Despatch warehouse data. The count of loads despatched is plotted against time of day; in this example the warehouse operates from 7am to 11pm, yet a significant amount of data is missing after 6pm (1800 hours)
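
A sketch of this ‘full range’ check in pandas, assuming a hypothetical ‘despatch_time’ column and the 7am to 11pm operating hours from the example:

    import pandas as pd

    df = pd.read_csv("despatch.csv")  # hypothetical file
    df["despatch_time"] = pd.to_datetime(df["despatch_time"], errors="coerce")

    # The warehouse operates 07:00-23:00, so despatch records should
    # appear in every hour from 7 through 22
    expected = set(range(7, 23))
    observed = set(df["despatch_time"].dt.hour.dropna().astype(int))
    print("Operating hours with no despatch records:", sorted(expected - observed))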

Conclusion

Before diving into data analysis and reports, check for data quality issues using Practical and Graphical techniques such as those covered in this article. This will ensure you understand how Complete, Correct and Consistent your data is, and will identify any data quality issues that need to be addressed.

  1. sas.com
  2. Tadhg Nagle, Thomas Redman and David Sammon, Harvard Business Review
  3. ibmbigdatahub.com
