Web scraping is the process of extracting data from websites.
DIFFERENT LIBRARIES/FRAMEWORKS FOR SCRAPING:
Scrapy: If you are dealing with complex scraping operations that require high speed and low resource consumption, then **Scrapy** would be a great choice.
Beautiful Soup: If you are new to programming and want to work on web scraping projects, you should go for **Beautiful Soup**. It is easy to learn, and you will be able to perform operations quickly, up to a certain level of complexity.
Selenium: When you are dealing with core JavaScript-based web applications and need browser automation with AJAX/PJAX requests, then **Selenium** is a great choice.
CHALLENGES WHILE SCRAPING DATA:
Pattern Changes:
Problem: Websites periodically change their UI. Scrapers usually need modification every week or so to keep up with these changes; otherwise they will return incomplete data or crash. Solution: Write test cases for your parsing and extraction logic and run them regularly. You can also use a continuous integration tool to catch failures early.
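As a minimal sketch of that idea (the extract_quotes() helper and the sample markup below are illustrative, not part of a real project), a pytest-style test for extraction logic might look like this:

# test_extraction.py - a minimal sketch of a parsing test
from scrapy.selector import Selector

# a small HTML snippet mimicking the markup the spider is expected to parse
SAMPLE_HTML = """
<html><body>
  <span class="text">A sample quote.</span>
  <span class="text">Another sample quote.</span>
</body></html>
"""

def extract_quotes(html):
    # stands in for your real extraction logic
    return Selector(text=html).css('span.text::text').getall()

def test_extract_quotes():
    assert extract_quotes(SAMPLE_HTML) == ['A sample quote.', 'Another sample quote.']

In practice you would run such tests against saved snapshots of the live pages, so a layout change shows up as a failing test instead of silently broken data.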
Anti-Scraping Technologies:
Problem: Some websites use anti-scraping technologies, LinkedIn for instance. If you hit a particular website repeatedly from the same IP address, there is a high chance the target website will block that IP address. Solution: Proxy services with rotating IP addresses help in this regard. Proxy servers mask your IP address and can improve crawling speed. Scraping frameworks like Scrapy provide easy integration with several rotating proxy services.
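As a rough sketch (the proxy address below is a placeholder, not a real endpoint), a Scrapy spider can route a request through a proxy via the request's meta; packages such as scrapy-rotating-proxies can manage a whole pool of proxies for you:

import scrapy

class ProxiedQuotesSpider(scrapy.Spider):
    name = "proxied_quotes"

    def start_requests(self):
        # placeholder address; a rotating-proxy service would supply real ones
        proxy = 'http://user:password@proxy.example.com:8000'
        yield scrapy.Request(
            'http://quotes.toscrape.com/page/1/',
            meta={'proxy': proxy},   # picked up by Scrapy's HttpProxyMiddleware
            callback=self.parse,
        )

    def parse(self, response):
        self.log('Fetched %s through the proxy' % response.url)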
JavaScript-based Dynamic Content:
Problem: Websites that rely heavily on JavaScript and AJAX to render dynamic content make data extraction difficult. Scrapy and similar frameworks/libraries only extract what they find in the HTML document; AJAX calls and JavaScript run in the browser at page-load time, so their output can't be scraped directly. Solution: This can be handled by rendering the web page in a headless browser like headless Chrome, which essentially allows running Chrome in a server environment. Another alternative is to use Selenium for JavaScript-heavy pages.
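For example, here is a minimal Selenium sketch that renders a JavaScript-heavy page in headless Chrome and hands the resulting HTML to your parsing code (it assumes Chrome and a matching chromedriver are available on the machine):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')            # run Chrome without a visible window
driver = webdriver.Chrome(options=options)    # assumes chromedriver is on PATH

driver.get('http://quotes.toscrape.com/js/')  # a JavaScript-rendered demo page
html = driver.page_source                     # HTML after the scripts have executed
driver.quit()

# html can now be parsed with Scrapy selectors, BeautifulSoup, etc.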
Quality of Data:
Problem: Records that do not meet quality guidelines affect the overall integrity of the data. Ensuring the data meets quality guidelines while crawling is hard because the checks need to run in real time, and faulty data can cause serious problems. Solution: Write test cases. Make sure whatever your spiders extract is correct and that they are not scraping badly structured data.
Captchas:
Problem: Captchas serve a great purpose in keeping spam away. However, they also pose a major accessibility challenge for web crawling bots. When captchas are present on a page you need to scrape, basic web scraping setups will fail and cannot get past this barrier. Solution: You need a middleware that can take the captcha, solve it, and return the response.
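As a heavily simplified sketch of that idea, a Scrapy downloader middleware could detect the captcha page and re-schedule the request once it is solved; solve_captcha() below is a hypothetical stub standing in for a third-party solving service:

# middlewares.py (sketch) - detect a captcha page and retry the request
def solve_captcha(response):
    # hypothetical helper; a real implementation would call a solving service
    raise NotImplementedError('plug in a captcha-solving service here')

class CaptchaMiddleware:
    def process_response(self, request, response, spider):
        if b'captcha' not in response.body.lower():
            return response                        # normal page, pass it through
        token = solve_captcha(response)            # hand the captcha to the solver
        retry = request.replace(dont_filter=True)  # re-schedule the original request
        retry.meta['captcha_token'] = token        # carry the token if the site needs it
        return retry

The middleware would then be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.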
Maintaining Deployment:
Problem: If you're scraping millions of websites, you can imagine the size of the codebase; even running all the spiders becomes hard. Solution: You can Dockerize your spiders and run them at regular intervals.
Scraping Guidelines/Best Practices:
Robots.txt file: Robots.txt is a text file webmasters create to instruct robots (typically search engine robots) how to crawl and index pages on their website, so this file generally contains instructions for crawlers. Robots.txt should be the first thing to check when you are planning to scrape a website; every website sets out rules on how bots/spiders should interact with it in its robots.txt file.
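For example, Python's standard library can check a URL against robots.txt before you crawl it; a minimal sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')
rp.read()

# True if the given user agent is allowed to fetch the URL
print(rp.can_fetch('*', 'http://quotes.toscrape.com/page/1/'))

Scrapy can also respect robots.txt automatically when ROBOTSTXT_OBEY = True is set in settings.py.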
Do not hit the servers too frequently: Web servers are not fail-proof. Any web server will slow down or crash if the load on it exceeds the limit it can handle. Sending too many requests too frequently can bring the website's server down or make the site too slow to load.
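In Scrapy this is usually handled in settings.py; a minimal sketch of polite-crawling settings (the exact values are just examples to tune for your target site):

# settings.py (excerpt)
DOWNLOAD_DELAY = 2                  # wait ~2 seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server response times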
User Agent Rotation: The User-Agent header in a request identifies which browser is being used, which version, and on which operating system. Every request made from a web browser contains a user-agent header, and using the same user agent for every request quickly gives a bot away. Rotating user agents is the best solution for this.
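A minimal sketch of the idea with the requests library (the user-agent strings are just examples); inside Scrapy, middlewares such as scrapy-fake-useragent can do this rotation automatically:

import random
import requests

# a small pool of example user-agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('http://quotes.toscrape.com/page/1/', headers=headers)
print(response.status_code)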
Do not follow the same crawling pattern: Humans browse unpredictably, while programmed bots follow logic that is usually very specific, so only bots hit pages in the exact same pattern. Sites with intelligent anti-crawling mechanisms can easily detect such spiders, so vary your crawl pattern, for example with random delays between requests.
Scrapy vs. BeautifulSoup
In this section, you will get an overview of one of the most popular web scraping tools, BeautifulSoup, and its comparison with Scrapy, Python's most used scraping framework.
Functionality:
Scrapy: Scrapy is a complete package for downloading web pages, processing them, and saving the results to files and databases. BeautifulSoup: BeautifulSoup is an HTML and XML parser and requires additional libraries such as requests or urllib2 to open URLs and store the results.
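For example, a minimal BeautifulSoup sketch that pairs it with requests to download and parse a page:

import requests
from bs4 import BeautifulSoup

# requests does the downloading; BeautifulSoup only parses the HTML
response = requests.get('http://quotes.toscrape.com/page/1/')
soup = BeautifulSoup(response.text, 'html.parser')

for quote in soup.select('span.text'):
    print(quote.get_text())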
Learning Curve:
Scrapy: Scrapy is a powerhouse for web scraping and offers many ways to scrape a web page. It takes more time to learn and understand how Scrapy works, but once mastered, it becomes easy to build web crawlers and run them with a single command. BeautifulSoup: BeautifulSoup is relatively easy for newcomers to programming to understand and gets smaller tasks done in no time.
Speed and Load:
Scrapy: Scrapy can get big jobs done very easily. It can crawl a large group of URLs quickly and smoothly, depending on the size of the group. BeautifulSoup: BeautifulSoup handles simple scraping jobs efficiently, but it is slower than Scrapy.
Extending Functionality:
Scrapy: Scrapy provides item pipelines that let you write components to process your scraped data, such as validating it, cleaning or dropping records, and saving data to a database. BeautifulSoup: BeautifulSoup is good for smaller jobs, but if you require more customization such as proxies, managing cookies, and data pipelines, Scrapy is the better option.
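As a rough sketch of the pipeline idea (the 'text' field is an assumption about how the scraped item is laid out):

# pipelines.py - a minimal sketch of a validating item pipeline
from scrapy.exceptions import DropItem

class ValidateQuotePipeline:
    def process_item(self, item, spider):
        # drop records that do not meet our quality guidelines
        if not item.get('text'):
            raise DropItem('Missing text field in %s' % item)
        item['text'] = item['text'].strip()
        return item

The pipeline is then switched on by adding it to the ITEM_PIPELINES setting in settings.py.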
For this blog, we are going to focus on the Scrapy framework, as it has more use cases in real-world scraping problems.
Scrapy: Scrapy is a fast high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages.
Key features of Scrapy:
Scrapy has built-in support for extracting data from HTML sources using XPath and CSS expressions.
It is a portable library: written in Python, it runs on Linux, Windows, and macOS.
It is easily extensible.
It is faster than many existing scraping libraries and can extract data from websites significantly faster than most comparable tools.
It consumes less memory and CPU than comparable tools.
It helps us build robust and flexible applications with a rich set of built-in functionality.
It has excellent community support for developers, but the documentation is not great for beginners, since it lacks beginner-friendly content.
I have done many web scraping projects at Excellence Technologies using the Scrapy framework.
Let's start with the Scrapy framework.
Before we start installing Scrapy, make sure you have Python and pip set up on your system.
Using Pip: Just run this simple command.
pip install Scrapy
We'll now assume that Scrapy is installed on your system. If you still get an error, you can follow the official installation guide.
To start, we will walk you through these tasks:
Creating a new Scrapy project.
Writing a spider to crawl a site and extract data.
First, we will create a project using this command:
scrapy startproject tutorial
This will create a **tutorial** directory. Next, move into the tutorial/spiders directory and create a file named quotes_spider.py:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders.
start_requests(): returns the initial requests the Spider will start crawling from.
parse(): a method that will be called to handle the response downloaded for each of the requests made.
yield: here yield works like a return, letting the spider hand back requests (and later, data) one at a time.
Now we will run our first spider. Go to the project's top-level directory and run this command with the spider's name:
scrapy crawl quotes
This command runs the spider with the name quotes that we've just added, which will send some requests to the quotes.toscrape.com domain. You will see log output in the terminal as the spider runs.
Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
This is the basic spider we discussed above. Now that we have a basic idea of how the Scrapy framework works, let's discuss some important basics.
Extracting data: We can use CSS selectors and XPath selectors to extract data from web pages.
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Now we will see some examples of extracting data with a selector using Scrapy shell.
CSS selector: syntax - response.css('...')
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
To extract the text from the title above, you can use this:
>>> response.css('title::text').getall()
['Quotes to Scrape']
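The same extraction can also be written with an XPath selector instead of CSS, for example:
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'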