Web Scraping With Python: Scrapy, SQL, Matplotlib To Gain Web Data Insights

I’m here today to give you an example of how you too can use web scraping to gain insights from data available online. I’m gonna be using Scrapy, the web scraping framework, to scrape football stats from the web, then process the scraped data, and finally make it really useful by creating some basic, simple analysis with matplotlib and pandas. Let’s jump right into it!

Now I’m going to show you a comprehensive example of how you can make raw web data useful and interesting using Scrapy, SQL and Matplotlib. It’s really supposed to be just an example, because there are so many types of data out there and so many ways to analyze them; it really comes down to what is best for you and your business.

Scraping And Analyzing Soccer Data

Briefly, this is the process I’m going to be using now to create this example project:

  • Task Zero: Requirements Of Reports

Figuring out what really needs to be done. What are our (business) goals and what reports should we create? What would a proper analysis look like?

  • Task One: Data Fields And Source Of Data

Planning ahead what data fields and attributes we’ll need to satisfy the requirements. Also, looking for websites where I can get data from.

  • Task Two: Scrapy Spiders

Creating scrapers for the website(s) that we’ve chosen in the previous task.

  • Task Three: Process Data

Cleaning, standardizing, normalizing, structuring and storing data into a database.

  • Task Four: Analyze Data

Creating reports that help you make decisions or help you understand data more.

  • Task Five: Conclusions

Draw conclusions based on analysis. Understand data.

Storytime is over. Start working!

Task Zero: Requirements Of Reports

As I said, we’re gonna analyze soccer data. I think, though, that “soccer data” is too big a scope for us right now, so let’s say we wanna analyze only goals and match results. We consider matches that have been played in the latest soccer season (2016-17). Regarding soccer leagues, it really depends on how good a data source we can find. If we could find a website with roughly all the European soccer leagues, not just the major ones (Italy, England, Germany, Spain, etc.), that would be the best.

What kind of reports to produce?

This is a key question and must be answered before moving on. We have to know what kind of reports we’re gonna create to know exactly what to scrape. So let’s figure it out!

We’re focusing on goals and results. It would be interesting to see some basic overall reports like:

  • Average number of goals
  • Average number of 1st half/2nd half goals
  • Number of home wins/away wins/draws
  • The biggest goal difference between two teams
  • The biggest comeback in the 2nd half
  • Distribution of average goals along the season

So these are just some ad hoc ideas; we could think of thousands of other ways to analyze football data. We’re gonna scrape data only to be able to produce these reports. Again, without knowing exactly what you want as an end result, writing the scraper is just a waste of time. This task is done. Let’s move on to Task One.

Task One: Source Of Data And Data Fields

We want to find a website which has all the data fields we need. What are our data fields? We can figure them out by having a close look at the report requirements we just wrote. Data fields are the data points that will be scraped by our future scraper. Putting the data fields together, we get a Scrapy item or a record in the database.

Going through the report requirements we will need at least these fields:

  • Home goals 1st half
  • Home goals 2nd half
  • Away goals 1st half
  • Away goals 2nd half
  • Home team
  • Away team
  • Match date
  • League
  • Country

Scraping only these fields will be enough to generate some interesting reports about soccer matches in the latest season.

Looking for a data source

We are getting closer and closer to being able to start writing our scraper. The next step is to find the source of our raw data, aka a website we can scrape. So let’s do some research!

I simply start off googling soccer stats.

[Screenshot: Google search results for “soccer stats”]

There are a bunch of websites providing soccer stats. We should keep in mind that we’re NOT looking for the fanciest and most modern website. Usually Google shows the best sites on its first page for the given keywords, but now we are searching not for the best one but for the one which just has the data we need and is simple. Simple means, in this case, that the HTML of the site is nicely structured and relatively clean, and that if we visit the page with JavaScript disabled it still displays the data fields we need, because we don’t wanna do unnecessary JS handling in our scraper. Our goal now is to find a website which has all the data, preferably on one page or a few pages. We are focusing on getting our data as simply and as fast as possible.

After about 20-25 minutes of research on Google, I found a website that fully meets the criteria mentioned above: http://www.stat-football.com/en/

Next, have a look at the robots.txt of the website. This file contains information about how our scraper or bot should behave on the site. It usually defines rules like disallowing some bots from visiting specific pages or setting the minimum delay between two requests. To be ethical, we have to follow the rules defined there. So let’s check our chosen website’s robots.txt (it should be located in the root directory of the server):

It mentions a few pages that should be avoided. I’m okay with that; we don’t wanna mess with those pages anyway.

This kind of website is the perfect website to scrape: consistent HTML, relatively fast to scrape through the pages, no Terms of Use that explicitly prohibit web scraping on the site, a robots.txt that is fine, and no JavaScript/AJAX in the background. Nothing. That’s what I like to see!

As a side note, in a real-world project you rarely get to choose which website to scrape, because you probably won’t have a choice. This time we got lucky and found a website which is easy to scrape.

So far, we’ve made a clear picture of what we really want to get as an outcome, and we found a great site to scrape. It took a little time to find it, but it was worth the effort. We can move forward and design our Scrapy spider. Finally!

Task Two: Create Scrapy Spiders

This is what we’re gonna be doing in this task:

  • Setting up Scrapy
  • Plan the path of our scraper
  • Inspecting the website
  • Playing around with selectors in Scrapy Shell
  • Writing the spider
  • Writing cleaning pipelines
  • Exporting, saving to SQL database
  • Deploying to Scrapy cloud

Setting Up Scrapy

We need to install Scrapy, and we’re gonna set it up inside a virtual environment, making sure that our project is somewhat isolated. Let’s create a project directory and set up a virtual environment:

Before installing Scrapy we have to install its dependencies:

Finally, install the latest scrapy:
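All of the above in one go might look roughly like this on an Ubuntu/Debian box (the directory name and the dependency package list are assumptions, adapt them to your system):

```bash
# project directory + isolated virtual environment
mkdir soccer_scraping && cd soccer_scraping
virtualenv env
source env/bin/activate

# typical system-level dependencies for Scrapy on Ubuntu/Debian
sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

# install Scrapy inside the virtualenv (note: no sudo here)
pip install scrapy
```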

If you have another operating system or any problems installing Scrapy, see the installation docs here.

Now we have Scrapy installed in our environment. Each Scrapy project has the same file structure. We can create a Scrapy project with scrapy startproject command.
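For example (soccer is just the project name I’m using here):

```bash
scrapy startproject soccer
cd soccer
```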

This way, we created an empty scrapy project, the file structure looks like this:

[Screenshot: the generated Scrapy project file structure]

Inspecting the website

This is the part where we plan the path of our scraper. We figure out the path our scraper should follow and recognize the easiest way to get our data fields. As we open the web page, make sure JS is disabled in the browser so we can interact with the page the same way Scrapy does. This is what we can see right on the home page:

[Screenshot: stat-football.com home page with JavaScript disabled]

There are a bunch of leagues in the sidebar; we definitely don’t wanna copy-paste the URL of each one. Let’s play around on the site a little bit and see how we can get to the page where our data is laid out.

We could open a league page to see what data is there:

[Screenshot: a league page]

On the league page there’s not much data we are interested in. But at the end of the page you can see there are several other pages containing data about the given league. Let’s try them out and see which page is the best one to fetch data from.

[Screenshot: links to the league’s other pages (results, fixtures, fixtures-2, etc.)]

The results page would be great for us, but it doesn’t contain data about 1st half/2nd half goals, so it doesn’t fit us now. The fixtures page shows nothing, probably because I have JS disabled and it uses JS to load data. Never mind, the next page, fixtures-2, is perfect for us. It contains all the matches that have been played in the latest season, along with detailed information about the goals in each match. All of this is on one page, so we don’t even have to request too many pages to scrape everything we need.

So far, we’ve figured out which page contains the data we can access the most easily. The next step is to inspect the HTML of the pages to write the proper selectors. We’re gonna need multiple selectors:

  • League URLs on the sidebar
  • Fixtures-2 URL at the bottom of the page
  • Data fields in the fixtures table

The first two are for link following purposes. The last one is to fetch actual data.

Playing around with selectors in Scrapy Shell

Scrapy shell is an awesome command line tool to test your selectors quickly without running your whole spider. We have to specify a URL, and Scrapy shell gives us some objects we can work with, like response, request, etc. So first of all, let’s figure out the selector for the league page URLs on the sidebar.

[Screenshot: the sidebar menu in the browser inspector]

Having a look at the inspector of our browser, we could try “ul#menuA > li” as a selector and see if it selects the proper elements. In order to test it, we have to launch scrapy shell and use the response object:
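In Scrapy shell that looks something like this (I pull the href attributes just to see what the selector actually matches):

```python
# launched with: scrapy shell "http://www.stat-football.com/en/"
>>> response.css("ul#menuA > li a::attr(href)").extract()
# -> the href of every link in the sidebar, country links included
```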

It selects all the URLs on the sidebar, which is not good. We don’t need country URLs, we only need URLs for actual soccer leagues. We have to slightly modify the previous selector to exclude the elements we don’t need:

I use the :not CSS selector to exclude the elements on the sidebar that have the “.f11.t0.tar” classes, which means they are country URLs. So this is the full selector to select league URLs: “ul#menuA > li:not(.f11.t0.tar) > a”. I save it to a text file because we will need it in our spider.

Next, we have to figure out a selector for “fixtures-2” page.

[Screenshot: the fixtures-2 link at the bottom of the league page]

This is exactly the element we need; let’s see the HTML.

[Screenshot: the fixtures-2 link’s HTML in the browser inspector]

To be honest, I have never used or even heard of the nobr HTML tag before, but that doesn’t matter, we can write a selector anyway. As I see it, the element we want to select is consistently the 4th nobr element. So let’s try this selector in Scrapy shell:
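Something along these lines (nth-of-type(4) is my reading of “the 4th nobr element”; adjust it if the markup differs):

```python
>>> response.css("td.im34.p0612 > nobr:nth-of-type(4) a::attr(href)").extract_first()
# -> the relative URL of the fixtures-2 page
```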

First, it selects the td tag that has the “im34.p0612” classes, then the 4th nobr tag inside that td. Also, I use extract_first() and not extract() because I want to return only one element, not a list of elements like previously. As a side note, when you write selectors based on class and an element has multiple classes, you have to separate them with a dot (.) in the selector, like I just did.

Well, we’ve got all the selectors to follow links to our end destination. Now let’s figure out how to select the actual data fields we want to scrape. This is an example page our scraper will visit:

[Screenshot: a fixtures-2 page with the match data]

First of all, we are going to write the selectors for the easily selectable fields: league, country. We can select the league field using the title right at the top.

[Screenshot: the page title in the browser inspector]

Test in scrapy shell:
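For example:

```python
>>> response.css("title::text").extract_first()
# -> the page title, which contains the league name and the season
```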

It selects the correct title tag. That weird \u2013 is just an encoding artifact (an en dash); the selector works properly.

Now let’s write a selector for the country field. I inspected the page a little bit, and probably the easiest way to fetch it is at the bottom of the page.

It’s laid out in a separate tag; selector:
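I won’t reproduce the exact markup here, but the idea is a one-off selector along these lines (the class name below is purely illustrative, use whatever tag you actually find at the bottom of the page):

```python
>>> response.css("p.country::text").extract_first()  # .country is a made-up class name
```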

Okay, let’s move on to the fields that are left: home_team, away_team, home_goals1 (first half), home_goals2 (second half), away_goals1, away_goals2.

At first sight, I hoped all the matches would be in separate divs or somehow separated, because in that case it would be easy to scrape each field. I hit up the inspector:

[Screenshot: the match table HTML (one big pre tag) in the browser inspector]

All the stuff is inside one pre tag, which is not cool. Another problem is that the position of the date field is not consistent: sometimes it’s inside a span, sometimes it’s not. Here we need some trickery. Simple selectors like the ones we just wrote are not going to be enough now. A possible solution would be to iterate over each element one by one and recognize the item we want to scrape. Let’s hit up Scrapy shell once more. Keep in mind that we want to select all the elements. One selector is not enough to do that; we have to write multiple selectors and put them together to select every element of the table, something like this:
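One way to do it, assuming everything we care about lives inside that pre tag (grab every text node in document order and drop the empty ones):

```python
>>> elements = [e.strip() for e in response.css("pre ::text").extract() if e.strip()]
```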

This way we have all the elements of the table in a list. Now we just need to figure out a way to recognize when we should populate a new item. Let’s have a look at the table again:

[Screenshot: the match table on the page]

There’s one thing which is sure: each match has one date. After the date, we’ve got the teams and the result in the HTML structure. So we should iterate over the list of elements and recognize whether the element is a date. If it is, it’s time to start populating a new item. This way we will need further data processing, because we cannot scrape each field separately. So this is what I’m trying to do:

The elements variable contains all the text shown in the HTML table as a list. The loop goes through the elements (date, teams, result), and I start assigning the fields to a match item when I recognize that the field is a date by searching for “.201” in it (like 07.12(.201)6, for example). I could use a regex to recognize the date format, but I guess there’s no soccer team in the world with “.201” in its name. When I find the date, I know that the next field contains the teams and the one after that is the result.

We can test it in scrapy shell:
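A rough, runnable sketch of that idea, pasted straight into Scrapy shell (the assumption being that the teams come right after the date and the result right after the teams, as described above):

```python
matches = []
for i, element in enumerate(elements):
    # a date like 07.12.2016 marks the start of a new match row
    if ".201" in element and i + 2 < len(elements):
        matches.append({
            "date": element,
            "teams": elements[i + 1],   # e.g. "Home Team - Away Team"
            "result": elements[i + 2],  # e.g. "2:1 (1:0)"
        })

matches[:3]  # eyeball the first few parsed rows
```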

Writing the spider

We’ve played around with the selectors and figured out what path our scraper will follow. But how will we actually fetch the data? In this section we’re gonna put all these things together and create a working spider. First, create a new Python file inside the spiders folder in our Scrapy project. This file will contain our spider; let’s call it MatchSpider. This is what the project structure looks like now:

[Screenshot: the project structure with the new spider file]

Before writing our spider, we have to create an item. In Task One we figured out all the fields we need so now we just have to paste them into an item class that we define in items.py:
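A sketch of what that item class can look like in items.py (the field names are the ones we listed in Task One; the class name MatchItem is just my choice):

```python
import scrapy


class MatchItem(scrapy.Item):
    league = scrapy.Field()
    country = scrapy.Field()
    date = scrapy.Field()
    home_team = scrapy.Field()
    away_team = scrapy.Field()
    home_goals1 = scrapy.Field()   # home goals, 1st half
    home_goals2 = scrapy.Field()   # home goals, 2nd half
    away_goals1 = scrapy.Field()   # away goals, 1st half
    away_goals2 = scrapy.Field()   # away goals, 2nd half
```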

As we analyzed earlier, we need to follow links twice before scraping the item. Fortunately, Scrapy has a module that makes it easy to define rules for following URLs: the CrawlSpider module, which we are going to use now. We have to create a class and derive it from CrawlSpider. Scrapy also has two attributes we have to define to be able to run the scraper, name and start_urls:
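The skeleton can look like this (the spider name match is my choice, use whatever you like):

```python
from scrapy.spiders import CrawlSpider


class MatchSpider(CrawlSpider):
    name = "match"                                      # used later to run the spider
    start_urls = ["http://www.stat-football.com/en/"]   # the site's home page
```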

We will be able to run the scraper later using the name defined above. We also have to define the URL(s) Scrapy will request first in start_urls.

In the previous section we figured out all the selectors we need. As our scraper needs to follow two URLs before scraping an item, let’s define two rules for how the link following should be done:
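Roughly like this, reusing the selectors we saved earlier (restrict_css limits the LinkExtractor to the part of the page we tested; the callback name populate_item is the one we implement below):

```python
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# inside the MatchSpider class body:
rules = (
    # 1) follow every league link in the sidebar
    Rule(LinkExtractor(restrict_css="ul#menuA > li:not(.f11.t0.tar) > a")),
    # 2) from each league page follow the fixtures-2 link and scrape it
    Rule(LinkExtractor(restrict_css="td.im34.p0612 > nobr:nth-of-type(4) a"),
         callback="populate_item"),
)
```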

You can see that I use the previously tested selectors here. LinkExtractor is used to find links inside the CSS-selected piece of HTML. In the second rule we define a callback function, “populate_item”. This is the function we’re going to implement to scrape the actual item. As we previously tested the selectors, now we just have to paste them into the scraping function. We are supposed to populate one item at a time. First, assign the fields that are not in the main table on the page, league and country:
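Inside populate_item that can start out something like this (the country selector is the illustrative one from before):

```python
def populate_item(self, response):
    # these two are the same for every match on the page
    league = response.css("title::text").extract_first()
    country = response.css("p.country::text").extract_first()  # illustrative selector
    # ... the per-match fields come from the table, see the loop below
```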

Again, we use the previously tested selectors, so we are sure it’s gonna work. Finally, use the loop to iterate over the table elements and scrape most of the fields:

The last step we should take is to make this function return (or yield) an item. As we want each match to be a separate item, we yield an item at the end of the loop. The whole populate_item function:
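Putting it all together, a sketch of the whole callback under the assumptions above (each match becomes one yielded item; the raw team and result strings are assigned to several fields on purpose and get split apart later in the cleaning pipeline):

```python
# at the top of the spider file: from soccer.items import MatchItem
# ("soccer" being whatever you named your project)

def populate_item(self, response):
    # fields that are the same for every match on the page
    league = response.css("title::text").extract_first()
    country = response.css("p.country::text").extract_first()   # illustrative selector

    # every text node of the fixtures table, in document order
    elements = [e.strip() for e in response.css("pre ::text").extract() if e.strip()]

    for i, element in enumerate(elements):
        # a date like 07.12.2016 marks the start of a new match row
        if ".201" in element and i + 2 < len(elements):
            item = MatchItem()
            item["league"] = league
            item["country"] = country
            item["date"] = element
            # raw strings; the cleaning pipeline extracts the real values
            item["home_team"] = elements[i + 1]    # "Home Team - Away Team"
            item["away_team"] = elements[i + 1]
            item["home_goals1"] = elements[i + 2]  # "2:1 (1:0)"
            item["away_goals1"] = elements[i + 2]
            item["home_goals2"] = elements[i + 2]
            item["away_goals2"] = elements[i + 2]
            yield item
```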

We run the scraper with this command:
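With the spider name assumed above, that’s simply:

```bash
scrapy crawl match
```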

Output:

We are halfway done, though we’ve got plenty of work to do in the next step: processing data. The HTML we scraped data from was really messed up, so we couldn’t fetch the data fields separately. We need to separate them in the pipelines.

Process Data

Okay, so now we’re gonna clean each field. We need to implement a cleaning Scrapy pipeline. Go to pipelines.py and create a class called CleaningPipeline:

In every Scrapy pipeline we can use the item and spider objects. A pipeline has to have a process_item function which returns the modified item. Now we’re gonna clean each field of the item, except country, then return it. To make it clear and simple, let’s create a separate cleaning function for each field that we wanna modify:
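A skeleton like this is what I have in mind (the clean_* methods are filled in one by one below; note that home_goals1 and away_goals1 are cleaned first because the 2nd half values are derived from them):

```python
class CleaningPipeline(object):

    def process_item(self, item, spider):
        item["league"] = self.clean_league(item["league"])
        item["date"] = self.clean_date(item["date"])
        item["home_team"] = self.clean_home_team(item["home_team"])
        item["away_team"] = self.clean_away_team(item["away_team"])
        item["home_goals1"] = self.clean_home_goals1(item["home_goals1"])
        item["away_goals1"] = self.clean_away_goals1(item["away_goals1"])
        # 2nd half goals = full-time goals minus 1st half goals
        item["home_goals2"] = self.clean_home_goals2(item["home_goals2"], item["home_goals1"])
        item["away_goals2"] = self.clean_away_goals2(item["away_goals2"], item["away_goals1"])
        return item
```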

To make it simple, I usually use IPython to quickly test the cleaning functions I’m about to add. I’m gonna go through each field that needs to be cleaned, copy an example scraped, uncleaned value, then paste it into IPython and play around with text cleaning functions. Let’s see an example of a typical scraped item:

Obviously, we need to do some cleaning to be able to store the data correctly. Let’s start with the league field. An example of an uncleaned league field value:

We only want it to be “English League Two”. We need to remove the years at the beginning and the part after the first parenthesis. Remember I use IPython:

We skip everything before the first space (the years), then stop before reaching the first parenthesis (“(”). Finally, we remove all unnecessary spaces at the beginning and at the end with strip(). Now jump back into our IDE and add the implementation of clean_league().
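Which gives us something like this (the example value in the comment is illustrative):

```python
def clean_league(self, value):
    # e.g. "2016-2017 English League Two (...)"  ->  "English League Two"
    return value[value.find(" "):value.find("(")].strip()
```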

value is the uncleaned data (in this case item[“league”]) that we pass from the process_item function to the cleaning function.

Next one, the date field. This is how this field looks:

In each case there’s a number before the date that we have to remove, and also a bunch of unnecessary spaces. Fortunately, the date format is consistent, so we can do something like this:

We find the first dot (.) and step only two characters back to get the full date, removing the unnecessary number at the beginning. We also remove spaces from the end using strip(). In the pipeline implementation we also need to parse the string as a date and then change the format, so we can seamlessly insert it into our MySQL database.

Cleaning function:
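A sketch of it, assuming the dates really do come in as day.month.year like the 07.12.2016 example above:

```python
# at the top of pipelines.py: from datetime import datetime

def clean_date(self, value):
    # e.g. "12  07.12.2016 " -> "07.12.2016" -> "2016-12-07" (a format MySQL accepts)
    raw = value[value.find(".") - 2:].strip()
    return datetime.strptime(raw, "%d.%m.%Y").strftime("%Y-%m-%d")
```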

Moving on, the next field is home_team. This field should contain only the name of the home team. Right now it contains both teams’ names in a messy way:

Home team is always the first one. We could do something just like this:

We find the index of “-”, select everything before it, and remove the unnecessary spaces.

Cleaning function:
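Something like this (use whatever dash character actually separates the two team names on the site):

```python
def clean_home_team(self, value):
    # e.g. "Home Team - Away Team" -> "Home Team"
    return value[:value.find("-")].strip()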

With away_team we’re gonna do essentially the same thing but now select the part after the “—” character:

Cleaning function:
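Mirroring the previous one, just taking the other side of the separator:

```python
def clean_away_team(self, value):
    # e.g. "Home Team — Away Team" -> "Away Team"
    return value[value.find("—") + 1:].strip()
```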

The next field is home_goals1. This field is intended to store how many goals were scored in the first half of the match by the home team only.

The result inside the parentheses indicates the result of the first half, and now we need the first number before the colon (:). So first, remove all the characters that are not inside the parentheses. Try this in IPython:

This piece of code selects only the characters between the parentheses. Now process it further and remove the characters that are after the colon:

And that’s it; now we only have to convert the string to an int:

Cleaning function:
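Put together (the example result string in the comment is illustrative):

```python
def clean_home_goals1(self, value):
    # e.g. "2:1 (1:0)" -> "1:0" -> "1" -> 1
    first_half = value[value.find("(") + 1:value.find(")")]
    return int(first_half.split(":")[0])
```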

We’re gonna do the same with away_goals1, which contains the same kind of data about the away team, so now we select the part after the colon inside the parentheses:

Cleaning function:
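```python
def clean_away_goals1(self, value):
    # e.g. "2:1 (1:0)" -> "1:0" -> "0" -> 0
    first_half = value[value.find("(") + 1:value.find(")")]
    return int(first_half.split(":")[1])
```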

One of the two remaining fields is home_goals2. It contains how many goals the home team scored in the second half. To get this, we need to calculate it from the match result and the first half goals. We’re gonna use the previously cleaned home_goals1 field here. But first, get the match result from a field value like this:

We should be okay selecting the part before the first parenthesis, removing the spaces, and converting only the home goals to an integer:

home_goals_full contains the number of goals scored by the home team in the whole match. We’re gonna subtract home_goals1 from home_goals_full to get the second half goals of the home team. Cleaning function:
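A sketch that takes the already cleaned home_goals1 as a second argument (as wired up in the process_item skeleton above):

```python
def clean_home_goals2(self, value, home_goals1):
    # e.g. "2:1 (1:0)": full-time home goals 2, minus 1st half goals 1 -> 1
    home_goals_full = int(value[:value.find("(")].split(":")[0].strip())
    return home_goals_full - home_goals1
```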

The last field to clean is away_goals2. We have to do it similarly to home_goals2, though now we calculate with the away goals, thus selecting the number after the colon (:).

Cleaning function:
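```python
def clean_away_goals2(self, value, away_goals1):
    # e.g. "2:1 (1:0)": full-time away goals 1, minus 1st half goals 0 -> 1
    away_goals_full = int(value[:value.find("(")].split(":")[1].strip())
    return away_goals_full - away_goals1
```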

Now we’ve created all the cleaning functions we need. Let’s try to run our scraper but before that we have to enable our CleaningPipeline in settings.py to make it work:
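In settings.py that’s a dictionary mapping the pipeline’s path to a priority number (the soccer. prefix depends on your project name; the DatabasePipeline we add later gets registered the same way with a higher number, so it runs after cleaning):

```python
ITEM_PIPELINES = {
    "soccer.pipelines.CleaningPipeline": 100,
}
```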

Run the scraper:

After running it scrapy shows that there are some errors:

We got 120 errors, which is not cool. After a little bit of investigation, I figured out that these caused the errors:

[Screenshot: postponed, annulled, and awarded matches in the table]

These are some cases we didn’t expect: matches that were postponed, annulled, or awarded to one of the teams without playing the actual match. We definitely have to drop these items so we don’t store information about them. There are also some matches that don’t have a result at all for some reason. We should add this piece of code at the beginning of our process_item function. This way, none of the pipeline steps will be executed unnecessarily and the item will be dropped if needed.
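One way to guard against these, assuming the raw result string ends up in home_goals1 as in the spider sketch above (anything whose result doesn’t look like “x:y (x:y)” gets dropped):

```python
from scrapy.exceptions import DropItem

# at the very beginning of CleaningPipeline.process_item:
result = item.get("home_goals1", "")
if "(" not in result or ":" not in result:
    raise DropItem("No usable result (postponed/annulled/awarded?): %r" % result)
```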

Running the spider again, it throws no errors.

We’ve scraped 17407 items (or records). We had to drop 120 items because of the reasons mentioned above.

Database Pipeline

As our scraper works and cleans data properly, let’s create a pipeline that stores all the data into a MySQL database. First, create a new class in pipelines.py under CleaningPipeline called DatabasePipeline with the usual process_item function:

Also, we have to install MySQLdb for python with this command if it isn’t already installed:

So first of all, we have to initiate a MySQL connection. Create a dictionary in settings.py containing the database settings:
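For example (all values are placeholders):

```python
DATABASE = {
    "host": "localhost",
    "user": "soccer_user",        # placeholder
    "passwd": "soccer_password",  # placeholder
    "db": "soccer",               # placeholder
}
```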

Then we can invoke these custom settings in the DatabasePipeline using from_crawler constructor.

Now, we have to initiate the MySQL connection inside __init__ constructor using the passed parameters.

Next, in the process_item function, we have to create an SQL query string that inserts the proper data, which will then be executed on the MySQL db:
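Putting the three pieces together, the pipeline can look roughly like this (the matches table and its column names are assumptions based on our item fields; I’m using a parameterized query rather than string formatting to stay safe):

```python
import MySQLdb


class DatabasePipeline(object):

    def __init__(self, db_settings):
        self.conn = MySQLdb.connect(
            host=db_settings["host"],
            user=db_settings["user"],
            passwd=db_settings["passwd"],
            db=db_settings["db"],
            charset="utf8",
        )
        self.cursor = self.conn.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        # pull the DATABASE dict we defined in settings.py
        return cls(crawler.settings.get("DATABASE"))

    def process_item(self, item, spider):
        self.cursor.execute(
            """INSERT INTO matches
               (league, country, date, home_team, away_team,
                home_goals1, home_goals2, away_goals1, away_goals2)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)""",
            (item["league"], item["country"], item["date"],
             item["home_team"], item["away_team"],
             item["home_goals1"], item["home_goals2"],
             item["away_goals1"], item["away_goals2"]),
        )
        self.conn.commit()
        return item
```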

If we’ve done everything correctly, we should see about 17407 records in the database in a well-formatted way.

[Screenshot: scraped records in the MySQL database]

Scrapy Cloud

If you don’t know Scrapy cloud, it’s a platform that you can use to deploy scrapy spiders in the cloud quickly. For free!

ScrapingAuthority readers get an exclusive 50% off discount on Scrapy Cloud units for 6 months using this discount code: SASOCCER

Now we will set up our project on Scrapy Cloud, so in the future we’ll be able to schedule and run our spider from the cloud whenever we want to. First of all, we have to register for Scrapy Cloud here. Then create a new Scrapy project:

[Screenshot: creating a new project on Scrapy Cloud]

Deploying the spider

The next step is to deploy our Scrapy spider. Install Scrapinghub’s command line tool, shub:

After you’re done, cd into your scrapy project folder and run this:
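Altogether that’s roughly:

```bash
pip install shub   # Scrapinghub's command line client
shub login         # paste your API key when prompted
shub deploy        # asks for the project ID the first time
```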

The first time you deploy you will be asked to provide your API key and project ID. You can find all these by clicking on Code & Deploys on the sidebar of your scrapinghub project page.

That’s it. On Job Dashboard we can run our scraper and see the scraped items right away.

[Screenshot: the Job Dashboard with the scraped items]

Accessing the data

We’ve just set up our spider in the cloud. But how do we access the scraped data directly? Scrapinghub provides a convenient API so you can get your data in CSV or JSON, for example:

https://storage.scrapinghub.com/items/*project_id*?apikey=*your_api_key*&format=json

You have to insert your project_id and api key.

Or, if you want to make the dataset publicly available without much further work, you can just publish it in Scrapinghub’s datasets catalog. By selecting a finished job and clicking the Items page, you can easily publish your dataset so it’s available for everyone to download.

[Screenshot: publishing a dataset in the Scrapinghub datasets catalog]

You can see what the end result looks like, as I’ve published our soccer project’s dataset here. You can see all the data in your browser or download it in JSON format.

Analyze Data

Finally, here comes the fun part: querying the database to generate insightful reports and stats. Let’s recall what reports we wanna create:

  • Average number of goals per match per league
  • Average number of 1st half/2nd half goals per match per league
  • Number of home wins/away wins/draws per league
  • The biggest goal difference between two teams
  • The biggest comeback in the 2nd half
  • Distribution of average goals per match per league along the season
  • Average number of goals per match per country (if multiple leagues are available)

Before we get into this, we need to install some libs we’re gonna use to deal with data and to help us create reports. First, install pandas to handle data tables:

Install matplotlib to generate actual reports:

Average amount of goals

Okay, so the first topic we step into is the average number of goals across these seventeen thousand matches. We’re going to create a table which contains the country, league, average goals per match, and average 1st and 2nd half goals.
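Here is a minimal sketch of how such a table can be produced with pandas (the table and column names match the ones assumed in the database pipeline above; the connection string is a placeholder):

```python
import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials - use your own
engine = create_engine("mysql://soccer_user:soccer_password@localhost/soccer")

query = """
    SELECT country, league,
           AVG(home_goals1 + home_goals2 + away_goals1 + away_goals2) AS avg_goals,
           AVG(home_goals1 + away_goals1) AS avg_1st_half_goals,
           AVG(home_goals2 + away_goals2) AS avg_2nd_half_goals
    FROM matches
    GROUP BY country, league
    ORDER BY avg_goals DESC
"""

df = pd.read_sql(query, engine)
print(df.head(10))   # top 10 highest-scoring leagues
```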

 

[Table: average goals per match per league, split into 1st and 2nd half]

I have never watched Finnish soccer, but it looks like I should, because they score a ton of goals. They score the most goals per match in the world (at least according to my database, which contains the 2016-2017 season matches of 62 European and American leagues, though of course there are some minor leagues this project doesn’t cover). The Swiss Super League is a close second, and a Dutch league is third, which is not surprising to me. In the last three positions we can see two Russian leagues and the Israeli Premier League; they hardly score 2 goals per match. From this table it looks like teams are more active in the second half as far as goals are concerned.

Main European soccer leagues

Now let’s have a look at the major European soccer leagues only. What about their 1st/2nd half goal scoring? We’re gonna query the db to generate a stacked report about how many goals they score and when.
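A sketch of the stacked chart with matplotlib via pandas (the league names in the WHERE clause are just the usual “big five” as an example; they have to match whatever is actually stored in your league column):

```python
import matplotlib.pyplot as plt
import pandas as pd

query = """
    SELECT league,
           SUM(home_goals1 + away_goals1) AS first_half_goals,
           SUM(home_goals2 + away_goals2) AS second_half_goals
    FROM matches
    WHERE league IN ('English Premier League', 'Spanish Primera Division',
                     'German Bundesliga', 'Italian Serie A', 'French Ligue 1')
    GROUP BY league
"""

df = pd.read_sql(query, engine, index_col="league")   # engine from the previous snippet
df.plot(kind="bar", stacked=True)
plt.ylabel("goals in the 2016-17 season")
plt.tight_layout()
plt.show()
```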

[Chart: stacked 1st half vs 2nd half goals in the major European leagues]

All the leagues have about a thousand goals in the season; the German Bundesliga is the only one slightly under it. This diagram shows, as the previous table did too, that teams tend to score a little more in the 2nd half of the match.

Average goals per country

Moving on, let’s talk about countries, because our freshly collected data contains several leagues from the same country. Let’s see which country scores the most goals in its league matches. We’re gonna do a query similar to the one for the first report, but now we’re grouping by country and not by league.

[Table: average goals per match per country]

While we previously saw the Finnish league at the top, now we see Switzerland in first place with 3.09 goals per match, which is impressive as well. It’s also interesting that while most countries score significantly more goals in the second half than in the first, in Ireland it’s practically the same amount. Israel is last, which is not surprising considering our first report about leagues. Apparently, they go for the win and not for goals.

Distribution of goals

Next, it would be great to see a distribution report about goals. We are probably going to see a huge elevation on the diagram near two and three goals. Let’s see!

[Chart: distribution of goals per match]

There were more than 4000 matches last season in which the teams scored 2 goals, slightly more than 3500 in which they scored 3 times, and almost 3500 matches in which they scored only one goal. These values have the highest density. The fourth most common outcome is 4 goals per match, which happened in about 2300 matches.

Season finish

I wanna show you one more report about goals. It should be interesting to see how teams perform as the season is reaching its end.

[Chart: average goals per match over the course of the season]

It looks like near Christmas and January teams soften up a little and score the fewest goals of the season. But later, as the season is ending in May, they do whatever it takes and score the most goals in this period. If you take only one thing away from this report: you should rather bet on over 2.5 goals in May than in January, for sure.

Biggest blowouts

In soccer there are matchups where we’re ready for a blowout, and sometimes one comes out of nowhere. Let’s see the biggest goal differences between two teams last season.

[Table: biggest goal differences between two teams]

Somewhere in Ukraine two teams played, and I don’t know how it happened, but one team scored 13 goals and the other team only one. Wow. In second place I see another matchup from the same league, with a goal difference of eight. I should really check out this league. The third one is from Germany, an 8-0 blowout by Bayern Munich, I guess.

Comebacks

In the next report, we’re going to see some comebacks: when a team was down by multiple goals at the end of the 1st half and then managed to finish on top.

[Table: biggest 2nd half comebacks]

Well, it’s sort of a shame that we cannot see unbelievable comebacks here, like winning after conceding 5 goals in the first half and such. Never mind, data is not always surprising. The biggest comeback several teams have been able to make, apparently, is to get up off the floor from a 2-goal deficit. If you think about it, being behind the other team by two goals and eventually winning the match is actually an amazing performance.

Home Advantage

Honestly, soccer is really not about goals; it’s about winning. If a team wins, they don’t really care how many goals they scored. So as a final report, let’s see how often a team won the matchup when playing at home or away.

[Chart: home wins vs draws vs away wins]

So the teams that played at home won the match almost half the time. Also, the numbers of draws and away wins are almost the same, which is pretty interesting. From this perspective, it looks like home teams have an enormous advantage over the other team. Actually, it’s sort of obvious.

Wrapping Up

This is the process one can take to find meaning in web data. First, figure out exactly what data points you really need. Try to think of the end result and reverse engineer the data you have to have in order to create it. Then find a website, or even multiple sites, that have the data you need, and make sure that website is okay with you scraping it, or that you’ll be able to deal with the consequences. Second, write spiders that pull your data. I showed you an example of how you can extract messy data from HTML. Third, process and clean the data. This step is optional, but from my experience you almost always have to clean data to meet your needs. Fourth, create the “end product”: simple yet meaningful analysis. Fifth, draw conclusions. Act upon the facts. Make data-driven decisions.

Thanks for reading!
