Jiaxi Liu (Jesse)

Master’s Graduate

Software Engineer | Scalable APIs · Web Scraping · Data Integration · Code Quality & Refactoring

Scrapy Review: Engine, Scheduler, Downloader, Spider, Pipeline, and Selectors

Scrapy is not just an HTML parser. It is a full crawling framework for request scheduling, downloading, parsing, pipelines, and middleware.

Core Components

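Scrapy Engine controls the data flow between all other components.

Scheduler receives Requests from the engine, queues them, filters out duplicates by default, and hands them back when the engine asks for the next one.

Downloader fetches pages and returns Responses to the engine.

Spider parses Responses, extracts Items, and yields new Requests for the URLs it wants to follow.

Item Pipeline cleans, validates, filters, and stores Items.

Middleware hooks into the request/response path: downloader middleware sits between the engine and the Downloader (useful for proxies, User-Agent headers, cookies, or browser rendering), while spider middleware sits between the engine and the Spider.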
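
As a rough illustration of the last two components, here is a minimal sketch of an item pipeline and a downloader middleware. The class names, the "text" field, and the User-Agent string are hypothetical. Scrapy looks for the `process_item` and `process_request` method names, and both classes would still need to be enabled through the `ITEM_PIPELINES` and `DOWNLOADER_MIDDLEWARES` settings.

```python
class StripTextPipeline:
    """Hypothetical item pipeline: trims whitespace from a 'text' field."""

    def process_item(self, item, spider):
        # Scrapy calls process_item for every item a spider yields.
        text = item.get("text")
        if isinstance(text, str):
            item["text"] = text.strip()
        # Returning the item passes it to the next pipeline; raising
        # scrapy.exceptions.DropItem here would discard it instead.
        return item


class FixedUserAgentMiddleware:
    """Hypothetical downloader middleware: sets a User-Agent header."""

    USER_AGENT = "example-crawler/0.1"  # placeholder value

    def process_request(self, request, spider):
        # Called for each request on its way to the Downloader.
        request.headers["User-Agent"] = self.USER_AGENT
        # Returning None tells Scrapy to keep processing the request.
        return None
```

Neither class inherits from a Scrapy base class; the framework relies on the method names alone, which also makes both easy to unit-test with plain dictionaries and stub objects.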

Runtime Flow

  1. Spider provides the initial Requests (built from its start URLs).
  2. Engine sends Requests to Scheduler.
  3. Scheduler queues them.
  4. Engine asks Scheduler for the next Request.
  5. Engine passes the Request to Downloader, which downloads the page.
  6. Response goes back through Engine to Spider.
  7. Spider extracts Items and new Requests.
  8. Items go to Pipeline, new Requests go to Scheduler.
  9. The crawl stops when Scheduler has no pending Requests and nothing is still being downloaded or processed.
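
The numbered flow above can be modeled in a few lines of plain Python. This is a toy, single-threaded sketch rather than Scrapy's real engine, which is asynchronous and built on Twisted. The `PAGES` dictionary stands in for the Downloader, and every name in it is made up for illustration.

```python
from collections import deque

# Fake "web": URL -> (page title, outgoing links). Stands in for the Downloader.
PAGES = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b"]),
    "/b": ("Page B", ["/"]),
}


def parse(url, page):
    """Spider callback: yield one item, then the URLs to follow."""
    title, links = page
    yield {"url": url, "title": title}
    for link in links:
        yield link  # a plain str plays the role of a new Request


def crawl(start_urls):
    queue = deque(start_urls)  # Scheduler queue
    seen = set(start_urls)     # duplicate filter
    items = []                 # what the Item Pipeline would receive
    while queue:               # stop once nothing is left to schedule
        url = queue.popleft()  # Engine asks Scheduler for a Request
        page = PAGES[url]      # Downloader returns a Response
        for result in parse(url, page):  # Response goes to the Spider
            if isinstance(result, dict):
                items.append(result)     # Items go to the Pipeline
            elif result not in seen:
                seen.add(result)         # new Requests go to the Scheduler
                queue.append(result)
    return items


print([item["title"] for item in crawl(["/"])])
# prints ['Home', 'Page A', 'Page B']
```

The `seen` set is what keeps the loop from cycling forever between pages that link to each other, which is the job Scrapy's duplicate filter does in the real Scheduler.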

Selectors

Scrapy uses Selectors to extract data from HTML/XML, with either XPath or CSS expressions.

def parse(self, response):
    # Text of the <title> element, or None if the page has none.
    title = response.xpath("//title/text()").get()
    yield {"title": title}

Common XPath:

//tag
//div[@class="content"]
//a/@href
//h1/text()
//ul/li[position()=1]

CSS selectors:

title = response.css("title::text").get()
links = response.css("a::attr(href)").getall()

get and getall

.get() returns the first match as a string, or None if nothing matches.

title = response.xpath("//title/text()").get()

.getall() returns all matches as a list, which is empty if nothing matches.

tags = response.css("a.tag::text").getall()

Pagination

def parse(self, response):
    # Each quote block on the page becomes one item.
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
        }

    # Follow the "next" link if there is one; the last page has none.
    next_page = response.css("li.next a::attr(href)").get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
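
`response.follow` accepts the relative `href` as-is because it resolves it against the URL of the current page before building the Request. The standard-library call below approximates that resolution step. The page URL and the `href` value are assumed examples, not something the spider above depends on.

```python
from urllib.parse import urljoin

# Assumed URL of the page currently being parsed.
current_url = "https://quotes.toscrape.com/page/1/"

# A relative href such as the one extracted from "li.next a".
next_href = "/page/2/"

# Roughly what response.follow does before scheduling the Request.
absolute_url = urljoin(current_url, next_href)
print(absolute_url)  # https://quotes.toscrape.com/page/2/
```

With a plain `scrapy.Request` the URL would have to be made absolute by hand, for example with `response.urljoin(next_page)`.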

Tool Selection

Typical priority:

  1. Static HTML: Scrapy XPath/CSS
  2. Embedded JSON: parse JSON directly
  3. Light JavaScript: scrapy-playwright
  4. Heavy interactions: Selenium
  5. Messy HTML: BeautifulSoup for cleanup
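
For the embedded-JSON case, the data often sits inside a `<script>` tag and can be loaded with the standard `json` module instead of being scraped from rendered markup. A minimal sketch, assuming the script text has already been pulled out with a selector such as `response.css('script[type="application/ld+json"]::text').get()`; the sample payload and the `parse_embedded_json` helper are hypothetical.

```python
import json


def parse_embedded_json(script_text):
    """Return the decoded JSON value, or None if the text is missing or invalid."""
    if not script_text:
        return None  # the selector found nothing
    try:
        return json.loads(script_text)
    except json.JSONDecodeError:
        return None


# Hypothetical contents of a <script type="application/ld+json"> tag.
sample = '{"@type": "Product", "name": "Widget", "offers": {"price": "9.99"}}'

data = parse_embedded_json(sample)
print(data["name"], data["offers"]["price"])
```

When this works it is usually more stable than CSS or XPath selectors, since the JSON structure tends to change less often than the page layout.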

Scrapy's value lies in its framework structure, scalability, and extensibility.