Of the thousands of attributes that we handle while curating product catalogs, brand is the hardest and perhaps the most important. Consumers often begin their searches with the brand names of the products they’re looking for, which is why our customers (marketplaces, retailers, brands and logistics companies) are keen on high coverage of standardized values for this field.
Unfortunately, many of the data sources from which we build our catalogs fail to provide brand as an explicit field, or unwittingly carry an incorrect value in the brand field. In these cases, the challenge falls to us to build datasets that are robust to such issues. Specifically, this entails:
- extracting brand from unstructured text in other fields of the listing, such as name or description,
- inferring brand where altogether absent,
- standardizing the brand string to a unique representation consistent across the catalog.
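To make these tasks concrete, here is a minimal Python sketch of the extraction and standardization steps. It is only an illustration, not a description of our production system: the alias table, the function names and the listing field names (`brand`, `name`, `description`) are all hypothetical, and inference for listings that mention no brand at all is left out.

```python
import re

# Hypothetical alias table: lowercase variant -> canonical brand string.
BRAND_ALIASES = {
    "hp": "HP",
    "hewlett packard": "HP",
    "hewlett-packard": "HP",
    "samsung": "Samsung",
    "sam sung": "Samsung",
}


def standardize_brand(raw):
    """Map a raw brand string to its canonical form, or None if unknown."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return BRAND_ALIASES.get(key)


def extract_brand(text):
    """Find a known brand alias in free text, trying longer aliases first."""
    lowered = text.lower()
    for alias in sorted(BRAND_ALIASES, key=len, reverse=True):
        # Match whole tokens only, so "hp" does not fire inside "toothpaste".
        if re.search(r"(?<!\w)" + re.escape(alias) + r"(?!\w)", lowered):
            return BRAND_ALIASES[alias]
    return None


def resolve_brand(listing):
    """Use the explicit brand field if it is recognized, else scan text fields."""
    explicit = listing.get("brand")
    if explicit:
        canonical = standardize_brand(explicit)
        if canonical:
            return canonical
    for field in ("name", "description"):
        found = extract_brand(listing.get(field) or "")
        if found:
            return found
    return None
```

For example, `resolve_brand({"name": "Hewlett-Packard LaserJet Pro"})` and `resolve_brand({"brand": " Hewlett  Packard "})` both resolve to the same canonical string, while a listing with no recognizable brand yields `None` and would need to go through inference.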
Over the last 7 years, we’ve tried to tackle this problem in many different ways: hacks, statistical methods, NLP, heuristics, algorithms, annotation and more. In this article, I’d like to chronicle some of the approaches that have worked for us.