Scrapy Settings:
If you’re looking to customize your scraper, you’ll want to learn how to do it without hassle. Using Scrapy settings, you can conveniently tune how your crawler behaves. That’s not all: Scrapy also allows you to customize other components such as the core mechanism, pipelines, and spiders. You’ll typically find a settings.py file in your project directory that lets you easily adjust your scraper’s settings.
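Most of these settings live in the settings.py file, which we will walk through next, but you can also override them for a single spider through its custom_settings attribute. Here is a minimal sketch; the spider name, URL, and values are just placeholders:

import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, shown only to illustrate per-spider overrides
    name = "example"
    start_urls = ['https://example.com']

    # These values override the project-wide settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 8,
    }

    def parse(self, response):
        # Yield the page title just so the spider produces something visible
        yield {'title': response.css('title::text').get()}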
Settings.py:
Customizing your scraper is easy: you can make all the changes you want in the settings.py file, which you’ll find when you navigate to your project directory.
USER-AGENT: The User-Agent header tells a website which client is making the request, along with its version and operating system, so servers use it to identify who is visiting them. In fact, most websites will not respond properly to requests that lack one. Thankfully, adding a user agent is easy: simply set it in the settings.py file.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
ROBOTS.TXT: By default, this setting is set to True, which means your scraper will follow the rules defined in a site’s robots.txt every time you crawl it. You can change it from True to False if you need to bypass a site’s robots.txt rules.
ROBOTSTXT_OBEY = True
Item pipeline: Pipelines are designed to process items immediately after they are scraped, so it’s important to make clear which pipelines you’d like to apply while scraping. Only the pipelines listed in ITEM_PIPELINES are activated; commenting an entry out deactivates it, so it won’t be invoked.
ITEM_PIPELINES = {
    'myscraper.pipelines.ExportPipeline': 100,
}
Spider Modules: Here, you have to tell Scrapy where the spiders live in your project.
SPIDER_MODULES = ['myscraper.spiders']
Concurrent Requests: This determines the number of requests Scrapy can make at a time; by default, it is set to 16. We advise that you don’t go overboard with this so you don’t end up overloading the target website.
CONCURRENT_REQUESTS = 16
Download Delay: This setting controls the delay between requests, given in seconds; by default, it is set to 0. Feel free to raise it to be gentler on the target website.
DOWNLOAD_DELAY = 5
Default Request Headers: Some websites are designed to recognize Scrapy’s default request headers, so it wouldn’t be a bad idea to customize them.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
Those are the basic settings you need to fully set up the Scrapy framework. While working on a scraping project at excellencetechnologies, I found these settings very helpful for getting past scraping restrictions.
The main goal when scraping is to extract structured data from unstructured sources, typically web pages. With Scrapy spiders, you can easily return the extracted data as Python dicts.
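To make that concrete, here is a minimal sketch of a spider whose parse method yields plain Python dicts; the site and CSS selectors are placeholders taken from the public quotes.toscrape.com sandbox:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # Each matched element becomes one structured record (a dict)
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
            }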
Declaring Items:
Declaring items lets us specify which pieces of data we want to scrape from a website and in what format we would like to store the extracted data in the database.
Items are declared using a simple class definition syntax and Field objects. Here is an example:
import scrapy


class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field()
At the end of this article, I’ll show you a full working example of items and pipeline.
Item Pipeline:
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.
Here is why you should use item pipelines (a minimal component sketch follows this list):
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database.
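To make the flow concrete, here is a minimal sketch of a pipeline component (the class name and the 'title' field are hypothetical); every component is simply a class with a process_item method that either returns the item or raises DropItem:

from scrapy.exceptions import DropItem


class CleanTitlePipeline(object):
    # Hypothetical component: cleans up a 'title' field and
    # drops items whose title is missing or empty
    def process_item(self, item, spider):
        title = (item.get('title') or '').strip()
        if not title:
            raise DropItem("Missing title in %s" % item)
        item['title'] = title
        return item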
Activating an Item Pipeline component:
To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, like this:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
Let’s move on to a proper example of items and a pipeline using MySQL.
We will begin by writing a simple scraper in the spider.py file.
import scrapy

from ..items import ScraperItem  # ScraperItem is defined in the project's items.py


class Spider(scrapy.Spider):
    name = "runspider"

    def start_requests(self):
        urls = [
            'https://blog.scrapingexamplexyz.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Build and yield one item per matched element
        for address in response.css(".external::text").extract():
            items = ScraperItem()
            items['address'] = address
            items['name'] = "XYZ"
            items['url_coming_from'] = response.url
            items['tag_name'] = 'NA'
            items['Tx_count'] = 'NA'
            items['type_id'] = '2'
            yield items
In this code, we fill the values scraped from the website into the items class and yield them. Feel free to replace the selector and website endpoint accordingly; for our example, we used a dummy endpoint.
Next, we will define our items in the items.py file; here is how to do that.
import scrapy


class ScraperItem(scrapy.Item):
    # Fields must match what the spider sets and what the pipeline reads
    address = scrapy.Field()
    name = scrapy.Field()
    coin = scrapy.Field()
    url_coming_from = scrapy.Field()
    tag_name = scrapy.Field()
    Tx_count = scrapy.Field()
    type_id = scrapy.Field()
    address_risk_score = scrapy.Field()
This is the same method that I explained at the start of this blog when defining items.
Now that we are done with the item structure, all we need to do is add a pipeline for validating the scraped data and storing the scraped items in a database.
We will write this code in the pipelines.py file.
import mysql.connector


class ScraperPipeline(object):
    def __init__(self):
        self.create_connection()

    def create_connection(self):
        self.conn = mysql.connector.connect(
            host='198.38.93.156',
            user='dexter',
            password='cafe@wale1',
            database='db',
            auth_plugin='mysql_native_password'
        )
        self.curr = self.conn.cursor()

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        # Parameterized query so values are escaped by the MySQL driver
        # (address_id is assumed to be auto-increment, so it is not inserted here)
        self.curr.execute(
            'INSERT INTO sws_known_address '
            '(name, address, type_id, address_risk_score, tag_name, source, tx_count) '
            'VALUES (%s, %s, %s, %s, %s, %s, %s)',
            (
                item['name'],
                item['address'],
                item['type_id'],
                item.get('address_risk_score', 'NA'),
                item['tag_name'],
                item['url_coming_from'],
                item['Tx_count'],
            )
        )
        self.conn.commit()
With that, the pipeline is done. Note that we first open a MySQL connection for storing data in the database; process_item then receives each item from the item class and store_db writes it to the database with a simple MySQL INSERT query.
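One last step before running: the pipeline has to be enabled, otherwise Scrapy will never call it. A minimal sketch of the setting, assuming the project package is named myscraper as in the earlier settings example:

ITEM_PIPELINES = {
    'myscraper.pipelines.ScraperPipeline': 300,
}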
After that, we will run our spider.
scrapy crawl runspider
Pipelines also give us the opportunity to add more validation. We will see this at work in the examples below.
Item pipeline example:
Price validation and dropping items with no price:
from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
This uses the DropItem exception: if an item has no price set, the whole item is dropped; otherwise the price is adjusted (applying the VAT factor when the price excludes VAT) and the item is returned.
Duplicates filter:
This duplicates filter looks for items that have already been processed and drops them; it assumes that our items have a unique id.
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
Finally, we are done with the intermediate level of the Scrapy framework. I hope this article was helpful to you. Feel free to leave us a comment.