Scrapy
General
Performance
Settings
To increase concurrency, one ought to have a look at the following settings:
REACTOR_THREADPOOL_MAXSIZE
CONCURRENT_REQUESTS
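As a rough illustration, both settings can be raised in a project's settings.py. The values below are placeholders picked for the example, not recommendations; the defaults noted in the comments are the ones I believe scrapy ships with, so check them against your version.

```python
# settings.py (sketch; the numbers are illustrative assumptions)

# Maximum number of requests scrapy performs concurrently.
# Believed default: 16.
CONCURRENT_REQUESTS = 64

# Maximum size of the Twisted reactor threadpool, which (among
# other things) runs the blocking DNS lookups.
# Believed default: 10.
REACTOR_THREADPOOL_MAXSIZE = 20
```

Raising only CONCURRENT_REQUESTS leaves the threadpool at its old size, which is the situation the DNS resolution notes describe.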
DNS resolution
scrapy performs DNS resolution in a blocking manner, using the Twisted reactor's threadpool. Thus, simply increasing the number of concurrent requests might not lead to the performance increase you might expect. This is where REACTOR_THREADPOOL_MAXSIZE comes in; it allows you to adjust the size of the threadpool, thus increasing the number of threads available for the blocking DNS resolutions.
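To see why the pool size matters, here is a toy model of blocking lookups running in a threadpool. It is not scrapy's or Twisted's code: `fake_blocking_lookup` is a hypothetical stand-in that just sleeps, and the hostnames and the returned address are made up. It only shows that with blocking work, throughput is capped by the number of threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_blocking_lookup(hostname, delay=0.05):
    """Hypothetical stand-in for a blocking DNS lookup: sleeps, then
    returns a placeholder address from the documentation range."""
    time.sleep(delay)
    return hostname, "192.0.2.1"


def resolve_all(hostnames, pool_size):
    """Resolve every hostname using a pool of `pool_size` threads and
    return the results together with the elapsed wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(fake_blocking_lookup, hostnames))
    return results, time.perf_counter() - start


hostnames = [f"host{i}.example" for i in range(20)]

# 20 lookups on 2 threads run in about 10 sequential batches;
# on 20 threads they all run at once.
results_small, elapsed_small = resolve_all(hostnames, pool_size=2)
results_large, elapsed_large = resolve_all(hostnames, pool_size=20)

print(f"pool_size=2:  {elapsed_small:.2f}s")
print(f"pool_size=20: {elapsed_large:.2f}s")
```

The same reasoning applies when many concurrent requests each need a lookup: once all threads are busy, the remaining lookups wait in line regardless of how high CONCURRENT_REQUESTS is.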
Why is it performed in a blocking manner? Not sure. I know that using asyncio
we can actually perform DNS resolutions, well, asynchronously!
Troubleshooting
Too many open files
Follow these instructions to raise the limit on the number of open file descriptors.
For finer control over the number of file descriptors used by scrapy, you can change the settings REACTOR_THREADPOOL_MAXSIZE and CONCURRENT_REQUESTS. Lowering either or both of them ought to resolve the issue.
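The open-file limit itself can also be inspected from within Python using the standard `resource` module (Unix only). The sketch below is an assumption about how one might check and, if needed, raise the soft limit for the current process; `raise_soft_nofile_limit` is a hypothetical helper, not part of scrapy, and it does not replace raising the limit at the OS level.

```python
import resource  # standard library, Unix only


def raise_soft_nofile_limit(desired):
    """Hypothetical helper: raise this process's soft limit on open
    file descriptors to `desired`, capped by the hard limit.
    Returns the (soft, hard) limits in effect afterwards."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot go above the hard limit.
        desired = min(desired, hard)
    if soft != resource.RLIM_INFINITY and desired > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")
```

Calling, say, `raise_soft_nofile_limit(4096)` before starting the crawler would lift the soft limit for that process only; if the hard limit is lower, the OS-level instructions are still needed.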