Before using this software, read the article: Is web scraping perfectly legal?
Licence: MIT
- VPC networking (IPv4, IPv6, public and private subnets)
- Private DNS zone
- MongoDB instance
- Elasticsearch domain
- Lambda CloudFormation custom resource
- AWS CodeBuild
- AWS CloudWatch (Events, Logs)
- AWS CloudFormation
- Debugging mode:
docker-compose up -d
sbt run -jvm-debug 5005 -J-Xmx4G
- Prod mode:
Before running in production mode, edit prod.conf and set the logging level to ERROR.
docker-compose up -d
sbt runProd -J-Xmx4G
- Docker:
docker-compose up -d
sbt docker:publishLocal
docker run -d -p 9000:9000 sphere-api-crawlers:1.0-SNAPSHOT
- HOST: http://localhost:9000
- PATH: /crawler/v2
Multiple jobs:
[
  {
    "url": "",
    "config": {
      "concurrency": 1,
      "throttle": 1000
    }
  },
  {
    "url": "",
    "include": [],
    "config": {
      "concurrency": 1,
      "throttle": 1000
    }
  },
  {
    "url": "",
    "exclude": [],
    "config": {
      "concurrency": 1,
      "throttle": 1000
    }
  }
]
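As a rough illustration of how such a body could be assembled from a script, here is a minimal Python sketch. It assumes the endpoint accepts an HTTP POST with a JSON array of jobs; neither the method nor the top-level shape is stated here, so treat both as guesses. The helper names `build_job` and `build_request` and the `example.com` URLs are hypothetical.

```python
import json
import urllib.request

# Assumptions (not confirmed by this README): the endpoint accepts an HTTP
# POST whose body is a JSON array of jobs, and the API listens on the host
# and path listed above.
CRAWLER_URL = "http://localhost:9000/crawler/v2"


def build_job(url, include=None, exclude=None, concurrency=1, throttle=1000):
    """Build one job object using the fields from the example body."""
    job = {"url": url}
    if include is not None:
        job["include"] = list(include)
    if exclude is not None:
        job["exclude"] = list(exclude)
    job["config"] = {"concurrency": concurrency, "throttle": throttle}
    return job


def build_request(jobs, endpoint=CRAWLER_URL):
    """Prepare (but do not send) a request carrying a list of jobs."""
    body = json.dumps(jobs).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method
    )


# Hypothetical jobs: plain crawl, include-only crawl, crawl with exclusions.
jobs = [
    build_job("https://example.com"),
    build_job("https://example.com", include=["/blog"]),
    build_job("https://example.com", exclude=["/login"]),
]
request = build_request(jobs)
# To actually submit the jobs: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before anything reaches the crawler.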
- HOST: http://localhost:9000
- PATH: /crawler/v2
Multiple jobs:
[
  {
    "url": ""
  }
]
| Task | Type | Description |
|------|------|-------------|
| url | String | Link to crawl |
| include | List[String] | Crawl only the paths in the include list |
| exclude | List[String] | Crawl everything except the paths in the exclude list |
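To make the include and exclude semantics concrete, here is a small Python sketch of a link filter. It is not the crawler's actual implementation: it assumes list entries are matched as path prefixes and that include takes precedence when both lists are given. The function name `should_crawl` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def should_crawl(link, include=None, exclude=None):
    """Decide whether a discovered link passes the include/exclude lists.

    Assumption: entries are matched as path prefixes, and include wins if
    both lists are supplied. The real crawler may match differently.
    """
    path = urlparse(link).path or "/"
    if include:
        # Crawl only paths that fall under one of the include entries.
        return any(path.startswith(prefix) for prefix in include)
    if exclude:
        # Crawl everything except paths under one of the exclude entries.
        return not any(path.startswith(prefix) for prefix in exclude)
    return True
```

For example, `should_crawl("https://example.com/about", include=["/blog"])` rejects the link, while the same link with `exclude=["/login"]` is accepted.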
| Config key | Type | Description |
|------------|------|-------------|
| throttle | Integer | Crawling delay in ms |
| concurrency | Integer | Number of concurrent operations |
| withIndexThrottle | Boolean | See the index throttle formula; default false |
| withStripOtherQueries | Boolean | Used in combination with include |
withIndexThrottle = true (default is false): delay = current pending queue size * throttle. With concurrency 5 and throttle 1000, Sphere crawls 5 pages at a time and waits at least one second between rounds; the wait grows with the pending queue. With 100 pending links, Sphere crawls 5 pages every 100 seconds. For pages with many links, keep throttle at 10 or lower when withIndexThrottle is enabled.

withIndexThrottle = false (default behavior): delay = throttle. With concurrency 5 and throttle 1000, Sphere crawls exactly 5 pages every second.
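The two delay rules can be checked with a few lines of Python. This is only a worked example of the arithmetic described here, not code from the crawler; the function name `crawl_delay_ms` is hypothetical.

```python
def crawl_delay_ms(throttle, pending, with_index_throttle=False):
    """Return the delay in milliseconds between crawl rounds.

    with_index_throttle=True:  delay = pending queue size * throttle
    with_index_throttle=False: delay = throttle
    """
    if with_index_throttle:
        return pending * throttle
    return throttle


# throttle 1000 ms with 100 pending links: 100 * 1000 ms = 100 seconds.
slow = crawl_delay_ms(1000, 100, with_index_throttle=True)

# Same settings without index throttle: a flat 1000 ms.
flat = crawl_delay_ms(1000, 100)

# throttle 10 ms with 100 pending links: 100 * 10 ms = 1 second.
tuned = crawl_delay_ms(10, 100, with_index_throttle=True)
```

The last case shows why a small throttle is advisable with index throttle enabled: even at 10 ms, a queue of 100 links already stretches the delay to a full second.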
Statistics and crawl info are stored in Elasticsearch and can be queried at /crawler/_search.
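A minimal Python sketch of such a query follows. It assumes Elasticsearch is reachable at `http://localhost:9200` (the Elasticsearch default port, not stated here) and uses a plain `match_all` query, since the fields of the index are not documented in this README. The function name `build_search_request` is hypothetical.

```python
import json
import urllib.request

# Assumption: Elasticsearch runs locally on its default port.
ES_URL = "http://localhost:9200"


def build_search_request(size=10, base_url=ES_URL):
    """Prepare (but do not send) a match_all search on the crawler index."""
    query = {"size": size, "query": {"match_all": {}}}
    return urllib.request.Request(
        f"{base_url}/crawler/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # the _search API accepts a request body via POST
    )


search_request = build_search_request(size=5)
# To run the query: urllib.request.urlopen(search_request)
```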
All crawled pages are saved in the data/storage/ folder.