
How to scrape data


If you don't have a functional dev environment

Refer to the How to set up guide.

If you have your dev environment ready

The project currently has two working spiders:

  • who_iris: searches articles and retrieves every document, following pagination through to the last page
  • who_iris_single_page: retrieves every document from a single search results page

Activate the virtual env and run a spider:

source env/bin/activate
scrapy crawl [name]

Available spiders:
 - master branch:
   * who_iris
   * who_iris_single_page (for limited scraping, using the results-per-page setting as a limiter)
 - nice_scraping branch:
   * nice
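
For example, to run the main WHO IRIS spider from the master branch:

scrapy crawl who_iris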

If the settings are left in their default state, this will output a JSON file in the results folder.

Changing settings from the command line:

scrapy crawl [name] -s [SETTING_NAME]=[value]

Settings available to change:

# Change the number of results per page (i.e. how many results the who_iris_single_page spider fetches):
scrapy crawl [name] -s WHO_IRIS_RPP=1000

# Choose one or more years to scrape from the WHO website:
scrapy crawl [name] -s 'WHO_IRIS_YEARS=[2012, ..]'

# Change the job directory (e.g. to restart a crawl from the beginning):
scrapy crawl [name] -s JOBDIR='crawl/[job_name]'
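
Scrapy uses JOBDIR to persist the crawl state between runs, so reusing the same directory resumes an interrupted crawl, while pointing it at a fresh directory starts over. A concrete example (the job name here is arbitrary):

scrapy crawl who_iris -s JOBDIR=crawl/who_iris-run1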

# Change the logging settings:
scrapy crawl [name] -s LOG_LEVEL=INFO|WARNING|DEBUG|ERROR
scrapy crawl [name] -s LOG_ENABLED=True|False

# Change the location of the exported file (local, Amazon S3 or DSX; S3 and DSX require credentials):
scrapy crawl [name] -s FEED_CONFIG=DSX|S3|LOCAL

# Choose whether to keep the PDF file on a keyword match:
scrapy crawl [name] -s KEEP_PDF=False|True

# Choose to download only the PDF and the metadata, without looking for keywords and sections:
scrapy crawl [name] -s DOWNLOAD_ONLY=False|True
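
Several of these settings can be combined by repeating the -s flag. As a sketch, a metadata-only run with quieter logging, using only the settings listed above:

scrapy crawl who_iris -s DOWNLOAD_ONLY=True -s LOG_LEVEL=WARNING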

Settings to change in the settings file:

(Note: you should clone the settings file and modify the clone. The settings module used by Scrapy can be changed in the scrapy.cfg file.)
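
As a sketch, assuming the cloned settings module is named local_settings inside a hypothetical project package called myproject, the scrapy.cfg entry would look like:

[settings]
default = myproject.local_settings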

# Change the output method to an AWS S3 bucket:
First, change the values of the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and FEED_URI fields to match your own.
Then, change FEED_CONFIG to 'S3'.
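
A minimal sketch of those settings in the settings file; the key placeholders and bucket name are made up for this example and must be replaced with your own values:

# AWS credentials used to write the exported feed to S3 (placeholders)
AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'

# Destination of the exported file; the bucket and path are hypothetical
FEED_URI = 's3://your-bucket-name/results/%(name)s.json'

# Tell the project to export to S3 instead of a local file
FEED_CONFIG = 'S3'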