In this first of two posts about data quality, I’d like to delve into the challenge of building and maintaining evolving datasets, i.e., datasets that are a function of changing inputs and fuzzy algorithms and are therefore subject to constant modification.
At Semantics3, we deal with many such datasets; we work with changing inputs like product and image URLs, which vary depending on when they’re crawled, and with machine learning algorithms that regularly encounter data patterns they might not have seen before. As a result, output datasets can change with time, and from one run to the next.
Run-by-run volatility of this kind is not inherently a bad thing, as long as the aggregate precision and recall of the dataset are kept in check. To do this, we have, over the years, developed a set of approaches that can be broadly divided into two groups: statistical and algorithmic techniques to detect issues, and human-in-the-loop processes to resolve them. In this first part, I’d like to share some of our learnings from the former group.
In any given month, we have billions of data points that undergo some sort of change. At this volume, human review of every data point is infeasible. Therefore, we rely on automated techniques to direct our human reviewers to the pockets of data that are most likely to be problematic.
Below, I’ll run through some of the most powerful techniques that can also be generalized across domains.