In this second of two posts about data quality, I’d like to delve into the challenge of building and maintaining evolving datasets, i.e., datasets that are a function of changing inputs and fuzzy algorithms, and are therefore subject to constant modification.
At Semantics3, we deal with many such datasets: we work with changing inputs, like product and image URLs that vary depending on when they’re crawled, and machine learning algorithms that regularly encounter data patterns they might not have seen before. As a result, output datasets can change over time, and from one run to the next.
Run-by-run volatility of this kind is not inherently a bad thing, as long as the aggregate precision and recall of the dataset are kept in check. To do this, we have, over the years, developed a set of approaches that can be broadly divided into two groups: statistical and algorithmic techniques to detect issues, and human-in-the-loop processes to resolve them. In this second part, I’d like to share some of our learnings from the latter group.
In Part 1, we looked at automated techniques for detecting data quality issues. In this post, we’ll look at how we resolve these problems and try to ensure that they don’t crop up again.