Using AI to Automate Web Crawling

Writing crawlers to extract data from websites is a seemingly intractable problem. While it’s easy to build a one-off crawler, writing systems that generalize across sites is hard, since each website tends to have its own underlying patterns. What’s more, website structures change with time, so these systems have to be robust to change.

In the age of machine learning, is there a smarter, more hands-off way of doing crawling? This is a goal that we’ve been chipping away at for years now, and over time we’ve made decent progress in doing automated generalizable crawling for a specific domain — ecommerce. In this article, I’d like to describe the system that we’ve built and the algorithms behind it; this work is the subject of a recent patent filing by our team.

The goal of our automated crawling project

Our goal in this project is to extract the entire catalog of an ecommerce site given just its homepage URL (see image above). This involves three key challenges.

Click here to read the rest of this article on the Semantics3 blog

Experiments with Shrinking my Garbage Footprint

I wrote this post back in 2017, but left it languishing in my drafts folder. A present-day reflection follows at the end of this article.

“Garbage City”. The city that I live in, Bengaluru, has been conferred this unceremonious moniker in one too many articles of late. An increase in waste generation, poor waste segregation practices, non-operational processing plants and apathy from the citizenry have spawned a growing environmental and health crisis in the city, which has in turn affected the aesthetic beauty of the “Garden City of India”.

And the rest of urban India isn’t far behind. Mumbai, Chennai (link to a previous post on garbage in Chennai), Delhi and Kolkata face their own equally daunting challenges. “According to a World Bank 2015 report, India produces 109,589 tonnes of municipal solid waste a day which is projected to triple to 376,639 tonnes a day by 2025.” [Ref.]

Confronted with these terrifying facts, what is a concerned citizen to do?

Click here to read the rest of this article on Medium

Neutralizing Emissions from International Air Travel — On Project CORSIA and its Shortcomings

If you were to take a flight from India to the United States, to which country would the carbon emissions produced be attributed? To India, since that’s where the flight took off? An equal split between the two countries? Or to all the countries along the route that the flight takes?

This is not just a hypothetical question. It matters, because it informs graphs like this:

Credit: Union of Concerned Scientists

And these graphs are in turn important, because they tell us where the clock on the metaphorical time bomb of climate doom stands. They determine the targets that each country needs to achieve to keep temperature rise below 2 degrees.

So which country is it? The answer is … drumroll …

Click here to read the rest of this article on Medium

The Problem with the Way We Measure Carbon Emissions

I have been struck by how important measurement is to improving the human condition. You can achieve incredible progress if you set a clear goal and find a measure that will drive progress toward that goal. — Bill Gates

In the global effort to tackle climate change, volume of greenhouse gas emissions is perhaps the most important measure. This metric helps inform the targets that nations agree to in international arenas, and serves as a barometer for ongoing assessment of the impact of policy initiatives. That’s why it’s important to build a strong understanding of how this metric is measured, to understand any inherent biases that its measurement may carry, and to counteract any perverse economic incentives that these biases might create.

Click here to read the rest of this article on Medium

The Ecommerce Knowledge Graph – Semantics3 Labs

Over the past 7 years, we’ve built an extensive Universal Product Catalog by curating and understanding data from across the public e-commerce web. This includes information about hundreds of millions of products, ~1000 standardized attribute types, billions of attribute values and tens of billions of pricing and ranking signals.

Now, as part of our latest research initiative, we’ve built an Ecommerce Knowledge Graph to harness the value of the relationships between the entities in our datasets. At the core of this graph is the set of relationships between the structured attributes that describe products in the catalog; the graph is also layered with the billions of relationships between products themselves through characteristics like shoppability, browsability and compatibility.

Click here to read the rest of this article on the Semantics3 blog

AI-based HTS Code Classification: 5 Technical Ideas for Building Solutions that Work

Imports, exports and tariffs are quite the theme in the news these days, be it in the context of Brexit, the US-China trade war or the Iran nuclear deal. Executive decisions on what duties should be levied on goods crossing borders are the order of the day. Have you ever wondered how these decisions are practically implemented at the ground level though? The answer – Harmonized Tariff Schedules (HTS), a taxonomy built by the World Customs Organization (WCO) to classify and define internationally traded goods. Semantics3 offers automated HTS code classification solutions to help logistics providers modernize their customs workflows.

Harmonized Tariff Schedule (HTS) code classification is a surprisingly challenging machine learning problem – while at face value it is a simple multi-label classification, the real-world specifics are often deceptively intractable:

  • For starters, the quality of data available from most sources is rather poor, so automated decision making systems have to learn to pull in external knowledge, and to develop a good grasp of commonly understood norms.
  • In addition, target code classes change across geographies and with time, requiring algorithms to keep an eye out for stale data.
  • What’s more, it’s surprisingly difficult to have trained human annotators agree on what the right HS code for a given product should be – in datasets annotated by trained professionals, we usually see differing labels for the same product at least 30% of the time.
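The annotator disagreement mentioned in the last point can be measured in a few lines. The following is only a minimal sketch, assuming annotations arrive as (product_id, code) pairs; the function name, the data layout and the sample codes are illustrative assumptions, not the method used at Semantics3.

```python
from collections import defaultdict

def disagreement_rate(annotations):
    """Fraction of multiply-annotated products that received more than one distinct code.

    `annotations` is an iterable of (product_id, code) pairs (an assumed layout).
    Products with a single annotation are ignored, since they cannot disagree.
    """
    labels = defaultdict(list)
    for product_id, code in annotations:
        labels[product_id].append(code)

    multi = [codes for codes in labels.values() if len(codes) > 1]
    if not multi:
        return 0.0
    disagreeing = sum(1 for codes in multi if len(set(codes)) > 1)
    return disagreeing / len(multi)

# Placeholder data for illustration only.
sample = [
    ("p1", "6109.10"), ("p1", "6109.10"),  # two annotators agree
    ("p2", "6109.10"), ("p2", "6110.20"),  # two annotators disagree
    ("p3", "6110.20"),                     # single annotation, ignored
]
rate = disagreement_rate(sample)
```

A rate like this gives a rough ceiling on the accuracy that a model trained on the same labels can be expected to show against any single annotator.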

How do you build automated systems that can deal with these challenges? In this article, I’ll cover five techniques that have helped us deal with these problems.

Click here to read the rest of this article on the Semantics3 blog

The State of Ecommerce – 2019 Report

Over the course of two weeks in February 2019, two Semantics3 engineers crawled the entire universe of dotcom domains looking for ecommerce sites. We built our numbers bottom up, surveying over 138.2 million dotcom sites and analyzing ~6 million merchants. This report presents the observations that we gathered. Scroll to the end to access a downloadable PDF and interactive infographic.

Most reports on the state of ecommerce inevitably focus on dynamics associated with the largest players alone, typically built on financial reports from listed companies. In this report, we aim to provide a different perspective of the industry in two key ways:

  • We’ve holistically looked at all stakeholders in the industry, not just a sampled non-random subset, by analyzing all active ecommerce dotcom websites.
  • We’ve explored diverse aspects of the industry, including third-party marketplaces, social media platforms, hosting platforms, promotional channels, product catalogs & categories and technical intricacies.

Click here to read the rest of this article on the Semantics3 blog

Principles for Managing Data & Product Quality

A couple of weeks ago, I posted a two-part series detailing how we do data QA at Semantics3. In the days since, I’ve had people get in touch to discuss the best ways to establish or improve data/product quality initiatives at their own organizations. For some, the impetus was a desire to stem customer churn, and for others just to make customers happier.

Here were some of the common discussion points from these conversations:

  • Should we hire a dedicated QA analyst to solve our data quality issues?
  • What sort of profile should such a hire have?
  • How do you decide what your data QA budget should be?
  • How do you draw a line between automation and manual effort?
  • We don’t want to build on heuristics because they don’t scale – any alternatives?

The underlying theme in all of these questions was that everyone was looking for a framework to think about quality analysis. So, while my previous posts addressed specific ways in which we do data QA at Semantics3, in this article, I’d like to speak to the philosophy behind them.

Click here to read the rest of this article on the Semantics3 blog

Mapping the Universe of E-commerce Brands

Of the thousands of attributes that we handle while curating product catalogs, the hardest and perhaps most important attribute is brand. Consumers often begin their searches with brand names of the products they’re looking for … which is why our customers (marketplaces, retailers, brands and logistics companies) are keen on having high coverage of standardized values for this field.

Unfortunately, many of the data sources from which we build our catalogs often fail to provide brand as an explicit field … or unwittingly carry an incorrect value in the brand field. In these cases, the challenge falls to us to build datasets that are robust to such issues. Specifically, this entails:

  • extracting brand where it appears as unstructured text in other fields of the listing, such as name or description,
  • inferring brand where altogether absent,
  • standardizing the brand string to a unique representation consistent across the catalog.
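To make the first and third of these steps concrete, here is a minimal sketch of dictionary-based extraction and standardization. It assumes a small hand-built alias table and a listing represented as a dict with "brand", "name" and "description" keys; all of these names are hypothetical and chosen for illustration, and the sketch does not attempt the harder inference step or reflect the approaches described in the full article.

```python
import re
from typing import Optional

# Hypothetical alias table: normalized variant -> canonical brand string.
BRAND_ALIASES = {
    "hp": "HP",
    "hewlett packard": "HP",
    "hewlett-packard": "HP",
    "sony": "Sony",
    "sony corp": "Sony",
}

def normalize(text: str) -> str:
    """Lowercase, turn punctuation (except hyphens) into spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s-]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def standardize_brand(raw: Optional[str]) -> Optional[str]:
    """Map a raw brand string to its canonical form, or None if it is unknown."""
    if not raw:
        return None
    return BRAND_ALIASES.get(normalize(raw))

def extract_brand(listing: dict) -> Optional[str]:
    """Prefer the explicit brand field; otherwise scan name, then description."""
    brand = standardize_brand(listing.get("brand"))
    if brand:
        return brand
    # Try longer aliases first so "sony corp" wins over "sony".
    aliases = sorted(BRAND_ALIASES, key=len, reverse=True)
    for field in ("name", "description"):
        text = normalize(listing.get(field) or "")
        for alias in aliases:
            # Match only whole tokens, so "hp" does not fire inside other words.
            if re.search(rf"(?<![\w-]){re.escape(alias)}(?![\w-])", text):
                return BRAND_ALIASES[alias]
    return None
```

For example, a listing whose only usable field is the name "Hewlett-Packard LaserJet Pro M404n" resolves to "HP" under this alias table. A lookup table like this is easy to reason about but only covers brands that are already known, which is why it can serve as no more than a first layer.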

Over the last 7 years, we’ve tried to tackle this problem in many different ways – hacks, statistical methods, NLP, heuristics, algorithms, annotation and more. In this article, I’d like to chronicle some of the approaches that’ve worked for us.

Click here to read the rest of this article on the Semantics3 blog