Building a Web Scraping Based SaaS Business, part 5 – Data Matching Module

So, after a while, here’s a new post about my business. Lately I’ve been trying to tackle two of my biggest challenges: getting new clients on board and creating the product matching module. I can’t say much about the first one yet because I’m still figuring out how to sell this kind of thing to anyone. Currently, I’m cold-calling small and medium-sized e-commerce companies. I’ve written a short script that I use to quickly introduce PriceMind and spark some curiosity. Eventually, I hope to show a real, working demo to the prospective client. Anyway, in this post I want to talk about something else that I also have only a little clue about, but that is more challenging technically: data matching.


Data matching module

As you may know, I’m building a price intelligence platform. The base of everything is the database, which contains prices and other product information. Now, the difficult part is not web scraping, because in most cases you can write any number of web spiders to get any kind of data from any website. You can also clean and standardize the data so the whole database (data warehouse) is nice to work with in the future. The really hard part is creating a system that finds the same product on various websites, because a price intelligence platform is worth nothing if it doesn’t compare your prices with your competitors’ prices. That’s the whole point.

Product matching

To solve that, we have two options. The first one is easy but lame: you gather the product page URLs of the same product from multiple websites, then save these URLs in a database so the scraper can read them and scrape the websites accordingly. This is called manual matching. It’s really time-consuming and not much fun. But you only have to do it once, and then it will work properly (hopefully). Currently, this is how PriceMind works: I have to gather product match URLs manually. The second option is automatic matching, and for the last two weeks I’ve been working on a module that does it. I feel like it will save me countless hours of URL gathering. I’m going to give you an introduction to how I’ve been building this module.
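To make manual matching concrete, here’s a minimal sketch of what such a URL table could look like, using Python’s built-in sqlite3. The table name, column names, product key and example.com URLs are all made up for illustration; this is not PriceMind’s actual schema.

```python
import sqlite3


def create_match_table(conn):
    # One row per (product, shop) pair; product_key is a hypothetical internal ID.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS product_match (
            product_key TEXT NOT NULL,
            shop        TEXT NOT NULL,
            url         TEXT NOT NULL,
            PRIMARY KEY (product_key, shop)
        )
        """
    )


def add_match(conn, product_key, shop, url):
    # Re-adding the same (product, shop) pair overwrites the old URL.
    conn.execute(
        "INSERT OR REPLACE INTO product_match (product_key, shop, url) VALUES (?, ?, ?)",
        (product_key, shop, url),
    )


def urls_for(conn, product_key):
    # Returns {shop: url} so a scraper can visit every page of one product.
    rows = conn.execute(
        "SELECT shop, url FROM product_match WHERE product_key = ?",
        (product_key,),
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
create_match_table(conn)
add_match(conn, "focus-dailies-toric", "SHOP A", "https://shop-a.example.com/focus-dailies-toric")
add_match(conn, "focus-dailies-toric", "SHOP B", "https://shop-b.example.com/p/12345")
print(urls_for(conn, "focus-dailies-toric"))
```

The painful part is obviously not the table, it’s filling it in by hand for every product.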

Data record matching

So let’s imagine this scenario: I have two tables in the DB. One contains the products of SHOP A, the other one contains the products of SHOP B. Some products can be found in both shops. The problem is that the data records are not exactly the same, although they refer to the same product. Let’s take an example from the contact lens industry. These are the same product but with different product names:

SHOP A: Focus  dailies toric

SHOP B: Focus  dailies all day comfort toric

How can we figure out if they refer to the same product? String comparison!

String comparison

So I researched how to compare two strings and get their similarity value. There are some well-known algorithms for this:

  • Levenshtein distance:

This method gives you the minimum number of edits (insertions, deletions, substitutions) required to change one string into the other.

  • Jaro-Winkler:

This returns a value between 0 and 1: 0 means the two strings are completely different, 1 means they are exactly the same. It measures how many characters the two strings have in common (and in what order), and it gives extra weight to strings that share a common prefix, assuming that agreement near the start of the string matters more than at the end.
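To show what the Levenshtein distance actually counts, here’s a minimal pure-Python sketch of the classic dynamic programming approach. The function names are my own, and turning the distance into a 0 to 1 similarity by dividing by the longer length is just one common convention, not the only one.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] = distance between the first i-1 chars of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i chars of a into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1 (1 = identical). Assumed convention."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # → 3
print(levenshtein("Focus dailies toric", "Focus dailies all day comfort toric"))  # → 16
```

The second result is 16 because SHOP B’s name is SHOP A’s name with the 16 characters of " all day comfort" inserted, which shows how an extra marketing phrase inflates the edit distance even when the product is the same.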

 

There are a bunch of other ways to measure the similarity between two strings (q-grams, phonetic encoding, and other variations of the above), but these two are the ones I’ve been experimenting with lately. I prefer the Jaro-Winkler method because it seems to be more accurate when comparing product names.
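For the curious, here’s a hedged pure-Python sketch of Jaro and Jaro-Winkler. It follows the textbook definition with the usual defaults (prefix weight 0.1, at most 4 prefix characters); real libraries may differ in details such as boost thresholds, so don’t expect bit-identical scores everywhere.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in 0..1 based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order; out-of-order pairs are
    # half-transpositions.
    half_transpositions = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    transpositions = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score plus a bonus for a shared prefix of up to max_prefix characters."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))          # → 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # → 0.9611
print(jaro_winkler("PureVision", "PureVision"))    # → 1.0
```

The prefix bonus is the only difference between the two functions, which is why Jaro-Winkler tends to favor names that start the same way, like a brand followed by a model.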

The great thing is that you don’t have to implement these algorithms yourself, because numerous libraries already do that. I’ve been experimenting with these:

A Python lib containing a lot of string comparison methods. It can also compare strings based on phonetic encoding. It’s a really useful library for trying out different approaches and just playing around.

This is another pretty useful Python lib. It’s cool because it has not only pure implementations of the algorithms I mentioned before, but it can also compare strings by tokenizing them, which is very useful when dealing with multi-word strings. It can also compare not just two but multiple strings. It’s a fun library for sure.
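To illustrate why tokenizing helps with multi-word names, here’s a small standard-library sketch. It is not the API of either library above; the two measures (Jaccard and the overlap coefficient) and the function names are my own picks for illustration.

```python
def tokens(name: str) -> set:
    # Lowercase and split on whitespace; repeated spaces are ignored by split().
    return set(name.lower().split())


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 1.0  # two empty names: treat as identical
    return len(ta & tb) / len(union)


def overlap_coefficient(a: str, b: str) -> float:
    """Shared tokens divided by the size of the smaller token set."""
    ta, tb = tokens(a), tokens(b)
    smaller = min(len(ta), len(tb))
    if smaller == 0:
        return 0.0
    return len(ta & tb) / smaller


shop_a = "Focus dailies toric"
shop_b = "Focus dailies all day comfort toric"
print(jaccard(shop_a, shop_b))              # → 0.5
print(overlap_coefficient(shop_a, shop_b))  # → 1.0
```

An overlap coefficient of 1.0 says every word in SHOP A’s name also appears in SHOP B’s name, which is a useful hint that the longer name is the same product with extra words added.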

 

Multi-parametric matching

Moving on from string matching, what I found is that looking only at product names is not enough to build a module that provides accurate product matches. Normally, a product has several parameters, for example size, color, type, etc. Of course, these are all domain-specific properties. We should make use of them when searching for product matches! Again, I’m going to stick to contact lenses because this is the domain I’ve been experimenting with right now. Contact lenses have many parameters aside from the name: diameter, use time, oxygen content, water content, toric, multifocal, etc. These are all valid parameters, and we can often figure out whether two products are the same based on the parameters alone.
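Here’s a rough sketch of what a product record with parameters could look like, and how you could compare two records on parameters alone. The class, field names and every value below are hypothetical, not real product specs.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ContactLens:
    # Hypothetical record layout; None means the shop did not publish the value.
    name: str
    diameter_mm: Optional[float] = None
    use_time_days: Optional[int] = None
    water_content_pct: Optional[float] = None
    toric: Optional[bool] = None
    multifocal: Optional[bool] = None


def differing_parameters(a: ContactLens, b: ContactLens) -> List[str]:
    """Names of parameters that are known on both records but have different values."""
    diffs = []
    for field in fields(ContactLens):
        if field.name == "name":
            continue  # names are handled by string comparison, not here
        value_a, value_b = getattr(a, field.name), getattr(b, field.name)
        if value_a is not None and value_b is not None and value_a != value_b:
            diffs.append(field.name)
    return diffs


# Made-up values, only to show the mechanics.
lens_a = ContactLens("Focus dailies toric", diameter_mm=14.2, use_time_days=1, toric=True)
lens_b = ContactLens("Focus dailies all day comfort toric", diameter_mm=14.2, use_time_days=1, toric=True)
lens_c = ContactLens("Focus dailies", diameter_mm=14.2, use_time_days=1, toric=False)

print(differing_parameters(lens_a, lens_b))  # → []
print(differing_parameters(lens_a, lens_c))  # → ['toric']
```

An empty list doesn’t prove a match, it only says nothing contradicts one, so this works best combined with the name comparison.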


Similarity vector

In data matching there are some fundamental practices we have to follow to create an accurate matching system. One of them is the similarity vector. A similarity vector contains a number of similarity values. Let’s look at this product as an example:

SHOP A: PureVision 2
SHOP B: PureVision 2 HD

Sab = [brand, name, size, water content, oxygen content]
Sab = [1, 0.75, 1, 0, 1]

Each value inside the similarity vector refers to a parameter. 1 means that the two products have exactly the same value for that parameter. 0 means that they have totally different values for it. In this example the name value is 0.75, which means the names are somewhat similar but not exactly the same. Using the Jaro-Winkler algorithm we get a similarity value of 1 for these two strings:

PureVision
PureVision

These are brand names and it’s obvious they are the same brand. Let’s see something not so obvious, the product names:

2
2 HD

These are the product names. Using the Jaro-Winkler method we get a similarity value of roughly 0.75. Knowing only the similarity values for the brand names and the product names, and without deep knowledge of this specific domain, I would have a hard time figuring out if these products are the same. Maybe HD means they are totally different products, who knows. That’s why multi-parametric matching is so useful. The next parameter we’re matching is the size. We get 1 as the similarity value, which means it matches. Now we are getting more confident that these two products are a match.

Moving on, the next parameter is water content. Its value is zero, which means the products have totally different values or the data field is simply NULL for one of them. Unfortunately, sometimes it’s not possible to gather the same kind and amount of information from various websites, so there will be empty data fields that make the data matching process more difficult.

The last parameter we measure is oxygen content. The similarity value is 1, which means it’s the same. Cool. Knowing these similarities, we can make a pretty good guess whether the products are the same or not. Still, without domain knowledge it’s pretty hard. For example, maybe if the water content is different there’s no way these products are the same.
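Pulling the walkthrough together, here’s a minimal sketch of building a similarity vector from two records and turning it into a match decision. Several things are assumptions on my part: the record values are made up, difflib’s SequenceMatcher stands in for Jaro-Winkler (so the name score differs from 0.75), and the weights, threshold and veto rule are illustrative rather than a tested process.

```python
from difflib import SequenceMatcher

FIELDS = ["brand", "name", "size", "water_content", "oxygen_content"]
TEXT_FIELDS = {"brand", "name"}


def field_similarity(field, a, b):
    if a is None or b is None:
        return 0.0  # as in the post, a NULL field scores 0
    if field in TEXT_FIELDS:
        # Stand-in string similarity from the standard library, not Jaro-Winkler.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 1.0 if a == b else 0.0


def similarity_vector(record_a, record_b):
    return [field_similarity(f, record_a.get(f), record_b.get(f)) for f in FIELDS]


def weighted_score(vector, weights):
    # Weighted average of the similarity values, still in 0..1.
    return sum(w * s for w, s in zip(weights, vector)) / sum(weights)


def is_match(vector, weights, threshold, veto_indexes=()):
    # Domain rule: a 0 on any veto field rules the pair out regardless of the score.
    if any(vector[i] == 0.0 for i in veto_indexes):
        return False
    return weighted_score(vector, weights) >= threshold


# Hypothetical records; the numbers are not real product specs.
shop_a = {"brand": "PureVision", "name": "2", "size": 6, "water_content": 36, "oxygen_content": 130}
shop_b = {"brand": "PureVision", "name": "2 HD", "size": 6, "water_content": None, "oxygen_content": 130}
print(similarity_vector(shop_a, shop_b))

# The vector from the post, with the name weighted twice as much as the rest.
post_vector = [1, 0.75, 1, 0, 1]
weights = [1, 2, 1, 1, 1]
print(weighted_score(post_vector, weights))                       # → 0.75
print(is_match(post_vector, weights, 0.7))                        # → True
print(is_match(post_vector, weights, 0.7, veto_indexes=(3,)))     # → False
```

The last two lines show the domain knowledge problem in miniature: the same vector is a match or not depending on whether water content is treated as a deal-breaker.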

 

Anyway, I will continue my journey of learning data matching. Hopefully in the next post I will be able to show you a solid process for how I do it. For now, I’ve just quickly rambled about some interesting stuff I’ve researched.

 

Some libraries and frameworks that I’ve been working with recently:

Awesome data matching learning material:
