I have a project where a script dynamically updates a database with URLs the scraper has to scrape. This database contains hundreds of URLs, so I had to find a way to fetch all of them from the db with Scrapy and then run the spider on them.
Gathering URLs To Scrape From Database
First of all, Scrapy spiders have an attribute called start_urls. In most cases you define the URLs you want to scrape there. Now we need to set these URLs dynamically from a MySQL database. How should we go about this? In Scrapy there's a method called start_requests. This method returns an iterable of Request objects, whose responses are then parsed inside the spider. Normally, these Request objects are made from the start_urls attribute. We need to override this method to query the database for URLs and create Request objects from them.
So first we need to make a connection to our db, then collect all the URLs into a list. I use a dict cursor so I can access the url column in the table by name.
# needs: import scrapy, MySQLdb, MySQLdb.cursors
def start_requests(self):
    conn = MySQLdb.connect(host="localhost", user="user", passwd="password",
                           db="mydb",  # your own connection details here
                           cursorclass=MySQLdb.cursors.DictCursor)
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM links')
    rows = cursor.fetchall()
    for row in rows:
        url = row["url"]
        yield scrapy.Request(url=url)
As I mentioned, start_requests has to return an iterable (or be a generator). So I iterate over the fetched rows and yield Request objects built from the plain URLs.
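To keep the fetch logic testable without a running MySQL server, here is the same idea sketched with the stdlib sqlite3 module; MySQLdb's DictCursor gives you the same row["url"] access. The table and column names match the ones assumed above, and the demo database is hypothetical:

```python
import sqlite3

def fetch_urls(conn):
    # sqlite3.Row allows dict-style access, much like MySQLdb's DictCursor
    conn.row_factory = sqlite3.Row
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM links')
    return [row["url"] for row in cursor.fetchall()]

# demo with an in-memory database standing in for the real one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT)")
conn.executemany("INSERT INTO links VALUES (?)",
                 [("http://example.com/a",), ("http://example.com/b",)])
urls = fetch_urls(conn)
```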
Getting URLs From API
This way, we can grab URLs from virtually any source. For example, in one of my projects the spider gets its URLs to scrape from an API, like this:
rows = requests.get("http://mywebapp.com/api/urls_to_scrape").json()
for row in rows:
    yield scrapy.Request(url=row["url"])  # assumes each item carries a "url" field
As a quick side note: if your host requires you to whitelist the IP addresses that access your db, and you're running on Scrapy Cloud, you won't be able to get a static IP for your Scrapy Cloud spider (as far as I know), so you'll need to go through an API to reach the URLs in your db.