Scrapy

General

Performance

Settings

To increase concurrency, look at the following settings:

  • REACTOR_THREADPOOL_MAXSIZE
  • CONCURRENT_REQUESTS

DNS resolution

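Scrapy performs DNS resolution in a blocking manner, running the lookups on the Twisted reactor's thread pool. Thus, simply increasing the number of concurrent requests might not yield the performance increase you expect: once every thread in the pool is busy resolving a hostname, further lookups have to wait for a free thread.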

This is where REACTOR_THREADPOOL_MAXSIZE comes in: it sets the maximum size of the thread pool, and thus the number of threads available for the blocking DNS resolutions.
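As a minimal sketch, both settings can be raised together in a project's settings.py. The values below are illustrative assumptions, not recommendations; sensible numbers depend on the target sites and the machine.

```python
# settings.py
# Illustrative values only (assumptions); tune them for your own crawl.

# Maximum number of requests Scrapy performs concurrently (default: 16).
CONCURRENT_REQUESTS = 64

# Maximum size of the Twisted reactor thread pool, which also serves
# the blocking DNS lookups (default: 10).
REACTOR_THREADPOOL_MAXSIZE = 32
```

Raising only CONCURRENT_REQUESTS while leaving the thread pool at its default size is the situation described above, where lookups queue up behind a small number of threads.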

Why is it performed in a blocking manner? Not sure. I know that using asyncio we can actually perform DNS resolutions, well, asynchronously!
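For illustration, here is a rough sketch of resolving several hostnames concurrently with asyncio. The helper names resolve and resolve_many are made up for this example. One caveat: with the stock event loop, loop.getaddrinfo is awaitable but still hands the lookup to a thread pool executor behind the scenes, so a fully asynchronous resolver would need a third-party library.

```python
import asyncio
import socket


async def resolve(host, port=80):
    """Return the sorted, de-duplicated IP addresses found for host."""
    loop = asyncio.get_running_loop()
    # Awaiting here lets other coroutines run while the lookup is pending.
    infos = await loop.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


async def resolve_many(hosts):
    """Resolve all hosts concurrently; failures are returned, not raised."""
    results = await asyncio.gather(
        *(resolve(host) for host in hosts), return_exceptions=True
    )
    return dict(zip(hosts, results))
```

It could be called as asyncio.run(resolve_many(["localhost", "example.com"])), which returns a dict mapping each hostname to its list of addresses, or to the exception raised if the lookup failed.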

Troubleshooting

Too many open files

Follow these instructions to raise the limit on the number of open file descriptors.
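The limit can also be inspected, and the soft limit raised, from within Python on Unix-like systems using the standard resource module. This is only a sketch: the function name raise_nofile_soft_limit is made up for this example, and it only affects the current process and its children, not the system-wide configuration.

```python
import resource
from typing import Tuple


def raise_nofile_soft_limit(target: int) -> Tuple[int, int]:
    """Try to raise the soft open-files limit to target.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot exceed the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft, hard

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some platforms reject values above an internal maximum;
        # in that case the existing limits are left untouched.
        pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling raise_nofile_soft_limit(4096) before starting the crawl would request a soft limit of 4096, capped at whatever hard limit is in place.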

You can also control the number of file descriptors Scrapy uses through the settings REACTOR_THREADPOOL_MAXSIZE and CONCURRENT_REQUESTS. Lowering either or both of them ought to make the error go away.