Scrapy

Scrapy is a free, open-source, high-level Python library for fast web crawling and web scraping. It can be used for a variety of tasks, from automated testing to other use cases that involve extracting structured data from web pages.

Scrapy is popular with both small and enterprise-level developer teams that want to crawl and scrape web pages for use cases ranging from indexing to data analytics.

The “batteries-included” philosophy that Scrapy implements makes it a breeze to work with: developers do not have to worry about link gathering, data exporting or persistence, so they can focus on the logic of what data to gather and how to process it.

Scrapy was initially developed by Zyte¹ and is maintained by Zyte and many other contributors, who continue to develop the framework.

¹ www.zyte.com

Project Site | Project Source

Authors

Ana, Tatabitovska

Master's student at TU Delft, following the Artificial Intelligence Technology Track

Florentin, Arsene

First year MSc student following the Software Technology track

Orestis, Kanaris

MSc Software Technology student aiming for the Distributed Systems research group

Thomas, Gielbert

MSc Complex Systems Engineering and Management student following the Information and Communication track

Scalability

Scalability is a function of how performance is affected by changes in workload or in the resources allocated to run your architecture¹. Scalability has several dimensions: administrative, functional, geographical, load, generation and heterogeneous scalability². Since Scrapy is a single-process library, it can be shipped to each new user individually, and thus it is administratively scalable. Additionally, since Scrapy’s design allows every component to be extended, it is also functionally scalable.

Scrapy

March 27, 2022

From Vision to Architecture

Scrapy, as explained in our previous essay, is a framework that allows for easy, fast and custom web crawling for a variety of tasks. More information about Scrapy can be found on the official documentation page.

Scrapy’s Architectural Style

Scrapy’s architectural style does not follow a single pattern, but rather draws characteristics from several architectural styles. Scrapy’s maintainer Adrián Chaves informed us that Scrapy was initially based on Django for the networking engine, but the team very soon switched to Twisted, a popular open-source, asynchronous and event-driven networking engine written in Python¹, and Scrapy still runs on it.

Scrapy

March 13, 2022

Product Vision and Problem Analysis

Introduction

Scrapy is a “batteries included” framework that lets users crawl the web through an easy-to-use, programmable Spider API, which can also scrape pages and structure the extracted data in well-known formats. Scrapy’s goal is to provide its users, out of the box, with most of the functions they may need for crawling, scraping and structuring data from web pages, as well as for error handling.

Scrapy

March 7, 2022

Quality and Evolution

Key Quality Attributes

Scrapy, as explained in the previous essays, is a framework that allows for easy, fast and custom web crawling for a variety of tasks. The main selling points of Scrapy are performance, extensibility and portability¹. To achieve high performance, Scrapy bases its architecture on Twisted², a networking framework that facilitates asynchronous and event-driven programming. Twisted is built on the Reactor pattern and provides a construct that waits for events and dispatches them to callback functions.
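The Reactor pattern mentioned above can be illustrated with a toy event loop. This is only a sketch under stated assumptions, not how Twisted is implemented: `MiniReactor`, its methods and the event names are all hypothetical, and no real network I/O takes place.

```python
from collections import deque


class MiniReactor:
    """Toy event loop (hypothetical): queues events and dispatches them to callbacks."""

    def __init__(self):
        self._handlers = {}      # event name -> list of callbacks
        self._pending = deque()  # events waiting to be dispatched

    def register(self, event, callback):
        self._handlers.setdefault(event, []).append(callback)

    def emit(self, event, payload):
        self._pending.append((event, payload))

    def run(self):
        # Dispatch until no events remain; callbacks may emit further events.
        while self._pending:
            event, payload = self._pending.popleft()
            for callback in self._handlers.get(event, []):
                callback(payload)


log = []
reactor = MiniReactor()


def on_request(url):
    # Pretend the page was downloaded, then signal that a response is ready.
    log.append(f"fetched {url}")
    reactor.emit("response", url)


def on_response(url):
    log.append(f"parsed {url}")


reactor.register("request", on_request)
reactor.register("response", on_response)
reactor.emit("request", "https://example.com/a")
reactor.emit("request", "https://example.com/b")
reactor.run()
```

Because the loop never blocks on a single request, both requests are handled before either response is parsed, which is the interleaving that event-driven designs rely on.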

Scrapy

March 21, 2021

Contributions

Document Scrapy Component Api

scrapy/scrapy

There are certain classes in Scrapy that may optionally define the from_crawler and from_settings class methods.

Up to this PR, this fact was documented in every single one of them. There was a lot of repetition, and the documentation sometimes failed to cover how to properly subclass a class that defines one of these methods.

Remove the duplicate definitions of the from_crawler and from_settings methods. Make them refer to the centralized definitions in class-methods.rst, where the explanation of the methods is elaborated. Closes issue number 5110.
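The subclassing concern can be sketched with plain Python classes. Everything here is illustrative: `FakeCrawler`, `ThrottleExtension`, `LoggingThrottleExtension` and the `THROTTLE_VERBOSE` key are hypothetical stand-ins rather than Scrapy classes or settings, and only the idea of a `from_crawler` class method that receives a crawler comes from the text above.

```python
class FakeCrawler:
    """Hypothetical stand-in for a crawler; it only exposes a settings mapping."""

    def __init__(self, settings):
        self.settings = settings


class ThrottleExtension:
    """Hypothetical component built from crawler settings."""

    def __init__(self, delay):
        self.delay = delay

    @classmethod
    def from_crawler(cls, crawler):
        # Using cls instead of the class name keeps this factory usable by subclasses.
        return cls(delay=crawler.settings.get("DOWNLOAD_DELAY", 0))


class LoggingThrottleExtension(ThrottleExtension):
    """Subclass that adds one extra, hypothetical setting."""

    def __init__(self, delay, verbose=False):
        super().__init__(delay)
        self.verbose = verbose

    @classmethod
    def from_crawler(cls, crawler):
        # Reuse the parent factory, then apply the subclass-specific setting.
        obj = super().from_crawler(crawler)
        obj.verbose = crawler.settings.get("THROTTLE_VERBOSE", False)
        return obj


crawler = FakeCrawler({"DOWNLOAD_DELAY": 2, "THROTTLE_VERBOSE": True})
ext = LoggingThrottleExtension.from_crawler(crawler)
```

Calling the parent factory through `super()` means the subclass does not need to repeat how the base settings are read, which is the kind of guidance a single centralized explanation can give once.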

open

Open PR

Improve cookie handling

scrapy/scrapy

Adds storage of cookies in a local file, which allows cross-spider access to cookies, and provides an interface method for spiders to retrieve them. Spiders automatically load the saved cookies when they are opened and write the new cookies to the file when they are closed, so that a newly opened spider can reuse the existing cookies.
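The load-on-open, save-on-close behaviour described above might look roughly like the sketch below. It is not the code from the PR: `CookieFileStore`, its method names and the JSON file layout are assumptions made purely for illustration.

```python
import json
import os
import tempfile


class CookieFileStore:
    """Hypothetical cookie store shared between spiders through a local JSON file."""

    def __init__(self, path):
        self.path = path
        self.cookies = {}  # domain -> {cookie name: value}

    def open_spider(self):
        # Load cookies saved by a previous spider, if the file exists.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as fh:
                self.cookies = json.load(fh)

    def close_spider(self):
        # Persist the current cookies so the next spider can reuse them.
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.cookies, fh)

    def get_cookies(self, domain):
        # Return a copy so callers cannot mutate the stored cookies by accident.
        return dict(self.cookies.get(domain, {}))


path = os.path.join(tempfile.mkdtemp(), "cookies.json")

first = CookieFileStore(path)
first.open_spider()
first.cookies["example.com"] = {"session": "abc123"}
first.close_spider()

second = CookieFileStore(path)
second.open_spider()
shared = second.get_cookies("example.com")
```

The second store never talks to the first one directly; the file is the only shared state, which is what allows a spider started later to pick up the cookies.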

open

Open PR

Recommend Common Crawl instead of Google Cache

scrapy/scrapy

Closes issue number 3582. Updated the Scrapy documentation to recommend Common Crawl instead of Google Cache. The terms of use allow scraping as long as Scrapy honours the restrictions of robots.txt files and nofollow meta tags.
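As a rough illustration of what honouring robots.txt means in practice, the sketch below checks two URLs against a small rule set with Python's standard `urllib.robotparser` module. The rules, the `example-bot` user agent and the URLs are made up for the example, and this is not how Scrapy itself performs the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks one directory for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed = parser.can_fetch("example-bot", "https://example.com/index.html")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data.html")
```

A crawler that respects these rules would fetch the first URL and skip the second.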

merged

Open PR
