Writing crawlers to extract data from websites is a seemingly intractable problem. Building a one-off crawler is easy, but writing systems that generalize across sites is hard, since each website has its own underlying structure. What’s more, website structures change over time, so these systems have to be robust to change.
In the age of machine learning, is there a smarter, more hands-off way to crawl? This is a goal that we’ve been chipping away at for years, and over time we’ve made decent progress on automated, generalizable crawling for one specific domain: ecommerce. In this article, I’d like to describe the system that we’ve built and the algorithms behind it; this work is the subject of a recent patent filing by our team.
Our goal in this project is to extract the entire catalog of an ecommerce site given just its homepage URL (see image above). This involves three key challenges.