DESOSA 2022

Product Vision and Problem Analysis

Introduction

Scrapy is a “batteries included” framework that lets users crawl the web through an easy-to-use, programmable Spider API, which can also scrape pages and structure the extracted data in well-known formats. Scrapy’s goal is to provide, out of the box, most of the functionality users need for crawling, scraping and structuring data from webpages, including error handling. Scrapy has a clear, well-documented pipeline with a simple API that can be used as-is, or extended to implement missing functionality and adapt existing components to the task at hand.

Scrapy aims to be the crawling and scraping software of choice for all types of jobs, ranging from solo developers building simple projects to large companies1 scraping the web for large-scale analytics.

Scrapy’s underlying domain primarily concerns itself with extracting data from web pages. Its main characteristics are extracting user-specified data types from a given website and storing that data in a user-specified format. The domain model also captures a crawler’s ability to continue its crawl to the next webpage by following a URL found on the current one.

Aside from these key concepts, the domain should represent special cases of crawling, such as web pages hidden behind paywalls, web pages with specific crawling policies, or web pages that require authentication. An important characteristic of Scrapy’s domain is thus a crawler’s ability to bypass such blockages while respecting a website’s terms of use.

Capabilities and Context

The main use case of web crawlers that comes to mind is webpage indexing by a search engine bot, such as Google’s Googlebot, Microsoft’s Bingbot and Yahoo!’s Yahoo! Slurp. Scrapy can be used to replicate such bots and create one’s own search engine.

As mentioned before, Scrapy follows a “batteries included” philosophy, making it an appealing tool for smaller teams, or even solo developers, who do not want to spend much time developing the crawler itself or the extraction and storage pipelines. These small teams prefer to spend their time on the logic of which components the crawler must interact with and how, or on processing the data once it has been extracted.

Scrapy can be used for a variety of use cases and is used in over 28k repositories on GitHub2. A walkthrough of that list revealed use cases from projects of widely varying complexity.

Scrapy can be employed for such a variety of tasks because it is agnostic of the use case the scraped data will serve. Users establish a context by implementing a Spider that incorporates the algorithms their application requires, feeding it a starting set of URLs and some configuration, and then waiting for the results. These interactions are depicted in the Context diagram. Scrapy itself only interacts with the webpages or APIs it scrapes or crawls, and with the receiver(s) of the structured output data.

Figure: Context diagram

Stakeholder Analysis

Considering stakeholders of a system requires clear determination of its boundaries. For this analysis we define stakeholders as actors vital to Scrapy: without these actors, Scrapy would not be able to function properly.

Stakeholders are placed in a Power/Interest grid, depicted below, which prioritizes them by their power over and interest in the project. Each quadrant of the grid is discussed in turn.

Figure: Power/Interest Grid for Scrapy

High Power, High Interest

This category contains Scrapy’s main contributors, including its founder, Zyte3. Zyte and the main contributors decide the priority of changes made to the Scrapy system, placing significant power into their hands. Zyte also has a high level of interest in the system, as Scrapy is an important component of their business model.

Low Power, High Interest

Owners of websites being scraped fit in this category: scraping can significantly degrade a website’s performance, yet owners can do little about it beyond stating that they do not wish their pages to be crawled, a wish that cannot be enforced.

Other low-power, high-interest stakeholders are enterprise-level clients and researchers who depend heavily on Scrapy for their products or research. Examples of such enterprise clients are Arbisoft4, which provides web-scraping solutions for its end-users’ needs, and companies that aggregate data from different sources to present an overview of what is available in a certain area of interest, e.g. third parties that show the cheapest flights.

Low Power, Low Interest

In this category, we find Scrapy’s other contributors, such as Computer Science students and other developers5 with a variety of lower-interest objectives: gaining experience with open-source contribution, participating in Google Summer of Code (GSoC), and so on.

High Power, Low Interest

The last category consists of government regulators, as these entities have limited interest in specific systems such as Scrapy, but have considerable power through legislation such as data and privacy regulations.

Quality Attributes

Now that we have looked at Scrapy’s external factors, we shift our focus to its internal attributes. Since our time and resources as software architects are limited, we focus on Scrapy’s most important quality attributes in this section. Scrapy’s website markets it as fast and powerful, simple, easily extensible, and portable6; from this we deduce that Scrapy aims to achieve the following quality attributes7:

  • Performance
  • Extensibility
  • Usability
  • Portability

Performance is a vital quality attribute, as it determines the interaction time with the system. Generally, users prefer a system that responds quickly, preferably with high throughput. For Scrapy, performance is important, as web-scraping relies on submitting many requests to webpages per second; any gain on these individual requests can drastically improve the overall performance experienced by the user.
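Much of this throughput comes from issuing requests concurrently, which Scrapy exposes through its settings. A sketch of the relevant knobs (the values shown are illustrative, not recommendations from this essay):

```python
# Concurrency-related Scrapy settings; illustrative values only.
performance_settings = {
    "CONCURRENT_REQUESTS": 32,             # total parallel requests (default: 16)
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,  # per-domain cap (default: 8)
    "AUTOTHROTTLE_ENABLED": True,          # adapt the request rate to server latency
}
```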

Extensibility determines how well Scrapy can evolve with new features. This is an important attribute, since Scrapy is used for a wide variety of use-cases. Scrapy offers out-of-the-box ‘Spiders’ which can be used for common scenarios, while also allowing developers to extend them by defining their own rules, which improves the system’s extensibility.

Usability concerns how intuitive and convenient the system is to use. By providing a complete tool that deals with most of web-scraping’s difficulties, Scrapy is a good tool for beginners. Thanks to its other quality attributes, such as its high levels of performance and extensibility, Scrapy is also a suitable tool for experts, as evidenced by the number of companies using it. This use by a range of users, from beginners to experts, shows that Scrapy has a high level of usability.

Lastly, portability is vital for Scrapy, as it determines how easily Scrapy can be moved between platforms. Scrapy is portable in this sense because it is written in Python and runs on any major platform, be it Linux, Windows, macOS or BSD. A related difficulty of web-scraping is that changes in a website’s user interface require changes in the scraping script8; dealing with such changes also demands that spiders be easy to adapt.

Ethical Dilemmas

Web-scraping can serve a variety of useful (societal) purposes. Some ethical aspects should however be considered, as web-scraping can have negative effects.

For instance, website owners can be negatively affected by web-scraping in terms of performance, due to the high volume of requests web-scrapers generate. Additionally, website owners may prefer their information not to be scraped for security, privacy or copyright reasons.

Guidelines established under the term ‘ethical web-scraping’ aim to minimize the harm of web-scraping9. For example, the request rate should be kept low, since too many requests can overload a server. Additionally, web-scrapers should leave information about the purpose of the scrape and contact details, so that website owners know what is going on.
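These guidelines map naturally onto Scrapy’s configuration. A sketch, with placeholder contact details and illustrative values:

```python
# 'Ethical web-scraping' guidelines expressed as Scrapy settings.
ethical_settings = {
    "DOWNLOAD_DELAY": 2.0,          # keep the request rate low
    "AUTOTHROTTLE_ENABLED": True,   # back off further when the server slows down
    # Identify the scraper and leave contact information (placeholders):
    "USER_AGENT": "research-bot (+https://example.org/about; contact@example.org)",
}
```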

Website owners can use the robot exclusion protocol by placing a ‘robots.txt’ file at the root of their website. This file states how the website may be crawled and which parts should not be accessed10. These exclusion protocols are not legally binding, however, but simply a statement of the owner’s wishes. Interestingly, Scrapy ignores such robots.txt files by default11. This causes Scrapy users to ignore robot exclusion protocols, often without even being aware of it, potentially harming the websites they scrape. It shows that Scrapy prioritizes the functionality and user-friendliness of its tool, even at the potential expense of the websites being scraped.
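Opting back in to the robot exclusion protocol is a one-line project setting:

```python
# Make Scrapy fetch and respect robots.txt before crawling a site;
# the library default is False (see the Scrapy settings reference11).
ROBOTSTXT_OBEY = True
```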

At the moment there are no clear, universal laws on the legality or illegality of web-scraping, which is why this balance between web-scrapers and website owners is mostly based on trust and respect. Web-scraping lawsuits are handled on a case-by-case basis, although there are some common laws under which one may be prosecuted in the case of “unethical web-scraping”12. Since web-scraping is not illegal, and Scrapy is freely distributed under a BSD licence13 which clearly states that its authors are not liable for any damages arising from the distribution of Scrapy, the legal aspects of web-scraping are left to Scrapy’s users.

A broader ethical consideration concerns the analysis of aggregated data. Web-scraping facilitates the gathering of large quantities of data for analysis, which raises ethical dilemmas of its own: aggregated data gathered through scraping can, for example, lead to invasions of privacy14, as even anonymized data can be re-identified by cross-referencing datasets15.

The Future of Scrapy

Scrapy has been around for a long time - its initial release was in 2008 - and can thus be considered a near-finished product. It is still regularly being updated, but these updates consist mainly of bug fixes and small new features16.

In its wider context, however, there could be interesting developments on the horizon for Scrapy in the longer term. As the application of data analytics grows in a variety of business domains, such as marketing, finance and strategy, web-scraping is expected to grow along with it. Such developments could mean that Scrapy’s user base will grow considerably in the years to come, which would require more focus on quality attributes such as scalability.

References


  1. Companies that use Scrapy. Retrieved 22 February 2022, from https://stackshare.io/scrapy ↩︎

  2. Network Dependents · scrapy/scrapy. (n.d.). Retrieved 4 March 2022, from https://github.com/scrapy/scrapy/network/dependents?package_id=UGFja2FnZS01MjU3Mjk3NA%3D%3D ↩︎

  3. Zyte (Formerly Scrapinghub). (2022, January 26). Zyte (formerly Scrapinghub) We’re Data Driven. Retrieved 24 February 2022, from https://www.zyte.com/meet-zyte/ ↩︎

  4. About Arbisoft. (2021, February 8). Retrieved 23 February 2022, from https://arbisoft.com/about-us/ ↩︎

  5. Contributors to scrapy/scrapy. Retrieved 24 February 2022, from https://github.com/scrapy/scrapy/graphs/contributors ↩︎

  6. Scrapy | A Fast and Powerful Scraping and Web Crawling Framework. (n.d.). Retrieved 1 March 2022, from https://scrapy.org/ ↩︎

  7. Pautasso, C. (2021). *Software Architecture - Visual Lecture Notes* (1st ed.). LeanPub. Retrieved from https://leanpub.com/software-architecture ↩︎

  8. Jünger, J. (2018). Mapping the field of automated data collection on the web: Collection approaches, data types, and research logic. Computational social science in the age of big data. Concepts, methodologies, tools, and application. Herbert van Halem, Köln, 104-130. ↩︎

  9. Zahringer, D. (2021, May 13). Ethical Issues When Scraping the Web. Retrieved 23 February 2022, from https://www.tmprod.com/blog/2021/ethical-issues-when-scraping-the-web/ ↩︎

  10. Schellekens, M. H. M. (2013). Robot.txt: balancing interests of content producers and content users. Bridging Distances in Technology and Regulation, 173. ↩︎

  11. Settings — Scrapy 2.5.1 documentation. Retrieved 25 February 2022, from https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey ↩︎

  12. Krotov, V., & Silva, L. (2020). Legality and ethics of web scraping. Communications of the Association for Information Systems, 47, 539–563. https://doi.org/10.17705/1cais.04724  ↩︎

  13. Scrapy. (n.d.). License — scrapy/scrapy GitHub. Retrieved 25 February 2022, from https://github.com/scrapy/scrapy/blob/master/LICENSE ↩︎

  14. Luscombe, A., Dick, K., & Walby, K. (2021). Algorithmic thinking in the public interest: navigating technical, legal, and ethical hurdles to web scraping in the social sciences. Quality & Quantity, 1-22. ↩︎

  15. Lubarsky, B. (2010). Re-identification of “anonymized” data. Georgetown Law Technology Review. Available online: https://www.georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017 ↩︎

  16. Scrapy. (n.d.). Release notes — Scrapy 2.6.1 documentation. Retrieved 2 March 2022, from https://docs.scrapy.org/en/latest/news.html ↩︎

Scrapy
Authors
Ana Tatabitovska
Florentin Arsene
Orestis Kanaris
Thomas Gielbert