Mapping the Universe of E-commerce Brands

Of the thousands of attributes that we handle while curating product catalogs, the hardest and perhaps most important attribute is brand. Consumers often begin their searches with brand names of the products they’re looking for … which is why our customers (marketplaces, retailers, brands and logistics companies) are keen on having high coverage of standardized values for this field.

Unfortunately, many of the data sources from which we build our catalogs fail to provide brand as an explicit field … or unwittingly carry an incorrect value in it. In these cases, the challenge falls to us to build datasets that are robust to such issues. Specifically, this entails:

  • extracting brand when it appears as unstructured text in other listing fields such as name or description,
  • inferring brand where it is altogether absent,
  • standardizing the brand string to a unique representation consistent across the catalog.
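
To make these three tasks concrete, here is a minimal, illustrative sketch of the extraction and standardization steps. The alias dictionary and field names are hypothetical stand-ins, and the approaches chronicled in the full article go well beyond a lookup table:

```python
import re

# Hypothetical alias map: raw brand strings seen in the wild -> one canonical
# form. In practice this would be a large, continuously curated dictionary.
BRAND_ALIASES = {
    "p&g": "Procter & Gamble",
    "procter and gamble": "Procter & Gamble",
    "hp": "HP",
    "hewlett-packard": "HP",
    "apple": "Apple",
}

def normalize(text):
    """Lowercase and collapse whitespace so raw strings can be compared."""
    return re.sub(r"\s+", " ", text.strip().lower())

def standardize_brand(raw_brand):
    """Map a raw brand string to its canonical representation, if known."""
    return BRAND_ALIASES.get(normalize(raw_brand))

def extract_brand(listing):
    """Prefer an explicit brand field that standardizes cleanly; otherwise
    scan the product name for any known alias (longest alias first)."""
    if listing.get("brand"):
        canonical = standardize_brand(listing["brand"])
        if canonical:
            return canonical
    name = normalize(listing.get("name", ""))
    for alias in sorted(BRAND_ALIASES, key=len, reverse=True):
        if re.search(rf"\b{re.escape(alias)}\b", name):
            return BRAND_ALIASES[alias]
    return None

print(extract_brand({"name": "Hewlett-Packard LaserJet Pro M404n Printer"}))  # -> HP
```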

Over the last 7 years, we’ve tried to tackle this problem in many different ways – hacks, statistical methods, NLP, heuristics, algorithms, annotation and more. In this article, I’d like to chronicle some of the approaches that’ve worked for us.

Click here to read the rest of this article on the Semantics3 blog

How we do Data QA @ Semantics3: Processes & Humans-in-the-Loop (Part 2)

In this second of two posts about data quality, I’d like to delve into the challenge of building and maintaining evolving datasets, i.e., datasets that are a function of changing inputs and fuzzy algorithms and are therefore subject to constant modification.

At Semantics3, we deal with many such datasets; we work with changing inputs like product and image URLs, whose contents vary depending on when they’re crawled, and machine learning algorithms that are regularly exposed to data patterns they might not have seen before. As a result, output datasets can change with time, and from one run to the next.

Run-by-run volatility of this kind is not inherently a bad thing, as long as the aggregate precision and recall of the dataset is kept in check. To do this, we have, over the years, developed a set of approaches that can be broadly divided into two groups – statistical & algorithmic techniques to detect issues, and human-in-the-loop processes to resolve them. In this second part, I’d like to share some of our learnings from the latter group.


In Part 1, we looked at automated techniques to detect data quality issues. In this post, we’ll look into how we resolve these problems, and try to ensure that they don’t crop up again.

Click here to read the rest of this article on the Semantics3 blog

How we do Data QA @ Semantics3: Statistics & Algorithms (Part 1)

In this first of two posts about data quality, I’d like to delve into the challenge of building and maintaining evolving datasets, i.e., datasets that are a function of changing inputs and fuzzy algorithms and are therefore subject to constant modification.

At Semantics3, we deal with many such datasets; we work with changing inputs like product and image URLs, whose contents vary depending on when they’re crawled, and machine learning algorithms that are regularly exposed to data patterns they might not have seen before. As a result, output datasets can change with time, and from one run to the next.

Run-by-run volatility of this kind is not inherently a bad thing, as long as the aggregate precision and recall of the dataset is kept in check. To do this, we have, over the years, developed a set of approaches that can be broadly divided into two groups – statistical & algorithmic techniques to detect issues, and human-in-the-loop processes to resolve them. In this first part, I’d like to share some of our learnings from the former group.


In any given month, we have billions of data points that undergo some sort of change. At this volume, human review of every data point is infeasible. Therefore, we rely on automated techniques to direct our human reviewers to the pockets of data that are most likely to be problematic.

Below, I’ll run through some of the most powerful techniques that can also be generalized across domains.
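
As a flavor of what such an automated check might look like (a simplified illustration, not necessarily one of the techniques covered in the full article), here is a sketch that compares the fill rate of a field between two pipeline runs and flags categories whose coverage dropped sharply, so reviewers can be pointed at just those pockets:

```python
from collections import defaultdict

def fill_rates(records):
    """Fraction of records in each category that carry a non-empty 'brand'."""
    totals, filled = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        filled[r["category"]] += bool(r.get("brand"))
    return {c: filled[c] / totals[c] for c in totals}

def flag_for_review(prev_run, curr_run, max_drop=0.10):
    """Flag categories whose brand coverage fell by more than max_drop
    between runs -- these are the pockets routed to human reviewers."""
    prev, curr = fill_rates(prev_run), fill_rates(curr_run)
    return [c for c in curr if prev.get(c, 0.0) - curr[c] > max_drop]

# Toy data: coverage in 'shoes' silently drops from 95% to 70% between runs.
prev_run = [{"category": "shoes", "brand": "Nike"}] * 95 + [{"category": "shoes"}] * 5
curr_run = [{"category": "shoes", "brand": "Nike"}] * 70 + [{"category": "shoes"}] * 30
print(flag_for_review(prev_run, curr_run))  # -> ['shoes']
```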

Click here to read the rest of this article on the Semantics3 blog

How to Launch and Maintain Enterprise AI Products

There is a significant disconnect between the perception and reality of how enterprise AI products are built.

The narrative seems to be that given a business problem, the data science team sets about gathering a training dataset (for supervised problems) that reflects the desired output; when the dataset has been built, the team switches to the modelling phase during which various networks are experimented with, at the conclusion of which the network with the best metrics is deployed to production … job done, give it enough time and watch the money roll in.

https://xkcd.com/1838/

It’s Never That Simple

My experience building enterprise B2B products has been that this is only where the work begins. Freshly minted models rarely satisfy customers’ needs, out of the box, for two common reasons.

Click here to read the rest of this article on the Semantics3 blog

Deriving Meaning through Machine Learning: The Next Chapter in Retail

Three slides from Benedict Evans’ brilliant talk, The End of the Beginning, really caught my attention.

The Old and the New

Across industries, machine learning is helping us reach successive levels of meaning about what a thing is. It’s helping us explicitly understand what things are, a leap forward from the existing state of affairs, which relies on extrapolation through indirect inference. In the context of retail, the implications of this are significant. Here’s why.

Click here to read the rest of this article on the Semantics3 blog

Product Matching — A Visual Tribute

Product matching is a challenging data-science problem that we’ve been battling for several years at Semantics3. The variety of concepts and nuances that need to be taken into consideration to tame this problem has reduced our data scientists to tears on more than one occasion.

In this week’s post, we decided to pay a visual tribute to product matching by showcasing some of the particularly difficult examples that we’ve come across over the years. Enjoy!

Click here to read the rest of this article on Medium

The GIGO Principle in Machine Learning

And its implications for PMs, designers, salespeople and data scientists

Garbage-In-Garbage-Out is the idea that the output of an algorithm, or any computer function for that matter, is only as good as the quality of the input that it receives.

The principle underlying GIGO is essential when it comes to the real-world deployment of algorithms. And with the increasing use of ML in everything from public-facing APIs to the underlying services that power public-facing applications, awareness and assimilation of this principle is as important now as it has ever been.
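
As a contrived, toy-scale illustration of the principle (an assumption-laden sketch, not an experiment from the article), train the same classifier twice, once on clean labels and once on partially corrupted ones, and compare held-out accuracy; the degradation you observe is GIGO in miniature:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A toy dataset with a held-out test set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# "Garbage in": flip 35% of the training labels at random.
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.35
y_noisy = np.where(flip, 1 - y_tr, y_tr)

clean_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
noisy_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
print(f"held-out accuracy -- clean labels: {clean_acc:.2f}, noisy labels: {noisy_acc:.2f}")
```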

Click here to read the rest of this article on Medium

“Hot Dog and a Not Hot Dog”: The Distinction Matters (Code Included)

And why Periscope Should’ve Held Out for a Little Longer

Spoiler Alert: This article references a recent episode of the show Silicon Valley. It only refers to material already provided in HBO released previews, but if you’d like to stay completely out of the know, look away now.

In a recent episode of HBO’s “Silicon Valley”, one of the characters, Jian-Yang, builds an app called “Not Hotdog”. The app allows users to identify whether objects are or are not hot dogs. At face value, it seems to be of little use, but it turns out to have very interesting wider applicability (watch the episode to find out more).

One of the comedic quirks is that Jian-Yang insists the app performs two different tasks:

  1. Identifies whether an object is a hot dog.
  2. Identifies whether an object is not a hot dog.
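
The joke, of course, is that these are not two tasks at all: a single binary classifier answers both questions with one output. As a rough sketch of what such a model could look like (assuming a standard transfer-learning setup in PyTorch; this is not the code referenced in the article’s title), consider:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained backbone with a single-logit head: sigmoid(logit) is P(hot dog),
# and 1 minus that is, by definition, P(not hot dog). The new head would of
# course need fine-tuning on labeled hot-dog images before it is useful.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 1)

def classify(image_batch):
    """image_batch: float tensor of shape (N, 3, 224, 224), ImageNet-normalized."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(image_batch)).squeeze(1)
    return ["hot dog" if p > 0.5 else "not hot dog" for p in probs]

# One model, one output: "is it a hot dog?" and "is it not a hot dog?" are
# answered by the same computation.
print(classify(torch.randn(2, 3, 224, 224)))
```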

Questions & Intuition for Tackling Deep Learning Problems

Working on data-science problems can be both exhilarating and frustrating. Exhilarating because the occasional insight that boosts your algorithm’s performance can leave you with a lasting high. Frustrating, because you’ll often find yourself at the dead-end of a one-way street, wondering what went wrong.

In this article, I’d like to recount five key lessons that I’ve learned after one too many walks down dead alleyways. I’ve framed these as five questions that I’ve learned to ask myself before taking on new problems or approaches:

  • Question #1: Never mind a neural network; can a human with no prior knowledge, educated on nothing but a diet of your training dataset, solve the problem?
  • Question #2: Is your network looking at your data through the right lens?
  • Question #3: Is your network learning the quirks in your training dataset, or is it learning to solve the problem at hand?
  • Question #4: Does your network have siblings that can give it a leg-up (through pre-trained weights)?
  • Question #5: Is your network incapable or just lazy? If it’s the latter, how do you force it to learn?

Click here to read the rest of this article on Medium