How to scrape data
Refer to the How to set up guide.
The project currently has two working spiders:
- who_iris: searches articles and fetches every document up to the last results page
- who_iris_single_page: fetches every document from a single search results page
To run a spider, activate the virtual environment first:
```
source env/bin/activate
scrapy crawl [name]
```
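You can check which spiders are available on the current branch with Scrapy's built-in listing command:
```
scrapy list
```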
Available spiders:
- master branch:
  * who_iris
  * who_iris_single_page (for limited scraping, with the results-per-page setting as a limiter)
- nice_scraping branch:
  * nice
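For example, to run the nice spider, switch branches first (a sketch assuming a standard git workflow):
```
git checkout nice_scraping
scrapy crawl nice
```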
If the settings are in their initial state, running a crawl will output a JSON file in the results folder.
Individual settings can be overridden at launch:
```
scrapy crawl [name] -s [SETTING]=[value]
```
Settings available to change:
```
# Change the results-per-page number, i.e. the number of results fetched by the who_iris_single_page spider:
scrapy crawl [name] -s WHO_IRIS_RPP=1000
```
```
# Choose one or more years to scrape from the WHO website:
scrapy crawl [name] -s WHO_IRIS_YEARS=[2012, ..]
```
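Because the value contains brackets and a space, it is safest to quote the whole setting so the shell passes it through intact; an illustrative example with assumed years:
```
scrapy crawl who_iris -s 'WHO_IRIS_YEARS=[2012, 2013]'
```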
```
# Change the job folder (e.g. to restart a scrape from the beginning):
scrapy crawl [name] -s JOBDIR=crawl/[job_name]
```
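JOBDIR is Scrapy's standard persistence directory: re-running the same command with the same JOBDIR resumes an interrupted crawl, while pointing at a fresh directory starts over. For instance (directory names are illustrative):
```
scrapy crawl who_iris -s JOBDIR=crawl/who_iris-run1   # interrupt with Ctrl-C; rerun to resume
scrapy crawl who_iris -s JOBDIR=crawl/who_iris-run2   # a new directory restarts from scratch
```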
```
# Change the logging settings:
scrapy crawl [name] -s LOG_LEVEL=INFO|WARNING|DEBUG|ERROR
scrapy crawl [name] -s LOG_ENABLED=True|False
```
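Several settings can be combined in one command by repeating the -s flag, e.g.:
```
scrapy crawl who_iris -s LOG_LEVEL=WARNING -s LOG_ENABLED=True
```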
```
# Change the location of the exported file (local, Amazon S3 or DSX; S3 and DSX require credentials):
scrapy crawl [name] -s FEED_CONFIG=DSX|S3|LOCAL
```
```
# Choose whether to keep the PDF file on a keyword match:
scrapy crawl [name] -s KEEP_PDF=True|False
```
```
# Choose to download only the PDF and the metadata, without looking for keywords and sections:
scrapy crawl [name] -s DOWNLOAD_ONLY=True|False
```
(Note: you should clone the settings file and modify the clone; the settings file used by Scrapy can be changed in the scrapy.cfg file.)
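A minimal sketch of the relevant scrapy.cfg section (the module path is an assumption; point it at your cloned settings file):
```
[settings]
; hypothetical module path for the cloned settings file
default = project.settings_local
```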
To change the output method to an AWS S3 bucket:
First, change the values of the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and FEED_URI fields to match yours. Then, set FEED_CONFIG to 'S3'.
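A hedged sketch of what those fields might look like in the settings file (the bucket name and path are placeholders; FEED_CONFIG is a project-specific setting, while %(name)s is expanded by Scrapy to the spider name):
```python
# settings.py (sketch) -- replace the placeholders with your own values
AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'
FEED_URI = 's3://your-bucket/results/%(name)s.json'
FEED_CONFIG = 'S3'
```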