DESOSA 2022

Scrapy

Figure: Scrapy

Scrapy is a free, open-source, high-level Python library that allows for fast web scraping and web crawling. It can be used for a variety of tasks, such as automated testing and other use cases that involve extracting structured data from web pages.

Scrapy is popular with both small and enterprise-level developer teams that wish to crawl and scrape web pages for use cases ranging from indexing to data analytics.

The “batteries-included” philosophy that Scrapy implements makes it a breeze to work with, since developers don’t have to worry about anything related to link gathering, data exporting or persistence; they can focus on the logic of what data to gather and how to process it.
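As a sketch of what this looks like in practice, the short spider below crawls the public scraping sandbox quotes.toscrape.com (an illustrative target of ours, not one used in the essays), extracts structured records and follows pagination links, while Scrapy takes care of scheduling, deduplication and export.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Example target and field names are illustrative only.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract structured data with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link; Scrapy schedules and deduplicates requests.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running `scrapy crawl quotes -o quotes.json` then exports the scraped items to JSON without any extra code, which is exactly the kind of built-in persistence the framework handles for the developer.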

Scrapy was initially developed by Zyte1 and is maintained by them and many other contributors who continue to develop the framework.


  1. www.zyte.com

Authors

Ana Tatabitovska

Master's student at TU Delft, following the Artificial Intelligence Technology Track

Florentin Arsene

First-year MSc student following the Software Technology track

Orestis Kanaris

MSc Software Technology student aiming for the Distributed Systems research group

Thomas Gielbert

MSc Complex Systems Engineering and Management student following the Information and Communication track

Scalability

Scalability is a function of how performance is affected by changes in workload or in the resources allocated to run your architecture1. Scalability has several dimensions: Administrative, Functional, Geographical, Load, Generation and Heterogeneous scalability2. Since Scrapy is a single-process library, it can be shipped to each new user individually, and is therefore Administratively scalable. Additionally, since Scrapy’s design allows every component to be extended or replaced, it is also Functionally scalable.
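As a concrete, hedged illustration of that extensibility, the pipeline below is a hypothetical component (the class name and field are ours, not from the essays) that plugs into Scrapy’s item-processing chain without modifying the framework itself.

```python
from scrapy.exceptions import DropItem


class RequiredFieldsPipeline:
    """Hypothetical item pipeline: drops scraped items missing a 'title' field."""

    def process_item(self, item, spider):
        # Called once for every item a spider yields.
        if not item.get("title"):
            raise DropItem(f"Missing title in {item!r}")
        return item
```

Enabling such a component is a one-line settings change (e.g. `ITEM_PIPELINES = {"myproject.pipelines.RequiredFieldsPipeline": 300}`), which is what lets a project grow its functionality component by component.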

From Vision to Architecture

Scrapy, as explained in our previous essay, is a framework that allows for easy, fast and custom web crawling for a variety of tasks. More information about Scrapy can be found in the official documentation page.

Scrapy’s Architectural Style

Scrapy’s architectural style does not follow a single pattern, but rather borrows characteristics from different architectural styles. Scrapy’s maintainer Adrián Chaves informed us that Scrapy was initially based on Django for the networking engine, but the team very soon decided to switch to - and still remains on - Twisted, a popular open-source, asynchronous and event-driven networking engine written in Python1.

Product Vision and Problem Analysis

Scrapy is a “batteries included” framework which allows users to crawl the web through an easy-to-use and programmable Spider API, and which can also scrape pages and structure the extracted data in well-known formats. Scrapy’s goal is to provide its users, out of the box, with most of the functions they may need for crawling, scraping and structuring data from webpages, as well as error handling.
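One way to picture the “structure the extracted data in well-known formats” part: items can be declared up front, and Scrapy’s feed exports turn them into JSON, CSV or XML without custom serialisation code. The item below is a hypothetical example of ours, not taken from the essay.

```python
import scrapy


class BookItem(scrapy.Item):
    # Hypothetical fields: declare the structure once, then yield
    # BookItem instances from a spider's parse() callbacks.
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
```

A command such as `scrapy crawl books -o books.csv` (with `books` being a hypothetical spider name) would then serialise every yielded BookItem into CSV via the built-in feed exporters.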

Quality and Evolution

Key Quality Attributes

Scrapy, as explained in the previous essays, is a framework that allows for easy, fast and custom web crawling for a variety of tasks. The main selling points of Scrapy are performance, extensibility and portability1. In order to achieve high performance, Scrapy bases its architecture on Twisted2, a networking framework that facilitates asynchronous and event-driven programming. Twisted is built around the Reactor pattern: a construct that waits for events and dispatches callback functions in response.
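A minimal sketch of that event-driven style, assuming nothing about Scrapy’s internals beyond what the essay states: a Twisted Deferred stands in for a result that is not available yet, and the reactor fires the registered callback once the simulated “response” event arrives.

```python
from twisted.internet import defer, reactor


def fetch_page():
    # Stand-in for an asynchronous network call: the Deferred is returned
    # immediately, and its result is delivered later by the reactor.
    d = defer.Deferred()
    reactor.callLater(1, d.callback, "<html>...</html>")
    return d


def on_response(body):
    # Callback executed by the reactor once the "response" event fires.
    print(f"received {len(body)} characters")
    reactor.stop()


d = fetch_page()
d.addCallback(on_response)
reactor.run()  # The Reactor pattern: a loop that waits for events and dispatches callbacks.
```

Nothing blocks while the “request” is in flight; the program simply registers what should happen when the event occurs, which is the same style Scrapy uses to keep many downloads in progress at once.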