Spider based on the Storm platform

Each allowed URL pattern carries its own settings:
- limitation
- reset interval
- expire time
- parallelism
The system re-fetches these settings after a certain time (they are cached), so settings can be updated dynamically at runtime.
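A minimal sketch of that cache behaviour, assuming the expiry is measured in milliseconds; the class name and the `fetchFromRedis()` helper are hypothetical placeholders, not the project's actual code:

```java
import java.util.Map;

public class CachedSettings {
    private final long expireMillis;
    private volatile Map<String, String> settings;
    private volatile long loadedAt;

    public CachedSettings(long expireMillis) {
        this.expireMillis = expireMillis;
    }

    // Returns the cached settings, re-fetching them once the cache
    // expires so that changes take effect without restarting the topology.
    public Map<String, String> get() {
        long now = System.currentTimeMillis();
        if (settings == null || now - loadedAt > expireMillis) {
            settings = fetchFromRedis(); // hypothetical loader
            loadedAt = now;
        }
        return settings;
    }

    private Map<String, String> fetchFromRedis() {
        // Placeholder: real code would read the settings keys from Redis.
        return Map.of();
    }
}
```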
The topology has one spout (URLReader) and five bolts: URLFilter, Downloader, HTMLParser, HTMLSaver, and URLSaver.
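A minimal sketch of how this topology might be wired with Storm's TopologyBuilder; the component classes are the ones named above, but the component IDs, the stream groupings, and the assumption that URLReader emits a field named "url" are all illustrative:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class SpiderTopology {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        // One spout feeding URLs into the topology.
        builder.setSpout("url-reader", new URLReader());
        // Group by URL so the same URL always reaches the same filter task.
        builder.setBolt("url-filter", new URLFilter())
               .fieldsGrouping("url-reader", new Fields("url"));
        builder.setBolt("downloader", new Downloader())
               .shuffleGrouping("url-filter");
        builder.setBolt("html-parser", new HTMLParser())
               .shuffleGrouping("downloader");
        // Parsed pages are persisted; extracted links are saved for later crawling.
        builder.setBolt("html-saver", new HTMLSaver())
               .shuffleGrouping("html-parser");
        builder.setBolt("url-saver", new URLSaver())
               .shuffleGrouping("html-parser");
        return builder;
    }
}
```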
The URLFilter bolt acts as the controller, in charge of:
- dropping repeated URLs
- counting downloads per pattern and ignoring URLs whose pattern has exceeded its limitation (see the sketch after this list)
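A minimal sketch of those two checks, assuming Redis (via Jedis) holds both the seen-URL set and the per-pattern counters; the key names and the way `limitation` and `interval` are passed in are assumptions:

```java
import redis.clients.jedis.Jedis;

public class URLFilterLogic {
    // Returns true if the URL should be passed on to the Downloader.
    public static boolean accept(Jedis jedis, String url, String pattern,
                                 long limitation, int intervalSeconds) {
        // Drop repeated URLs: SADD returns 0 when the member already exists.
        if (jedis.sadd("seen_urls", url) == 0) {
            return false;
        }
        // Count downloads per pattern; the counter key expires after the
        // interval, which is what resets the count.
        String counterKey = "count:" + pattern;
        long count = jedis.incr(counterKey);
        if (count == 1) {
            jedis.expire(counterKey, intervalSeconds);
        }
        // Ignore the URL once the pattern has exceeded its limitation.
        return count <= limitation;
    }
}
```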
Some settings must (or can) be configured:
- **allowed_url_patterns** (Redis sorted set, required): URL patterns allowed to be downloaded, read with ZREVRANGEBYSCORE in priority order from the highest score (5) to the lowest (1), as sketched after this list. Each pattern carries:
  - **limitation**: download count limit within an interval
  - **interval**: duration after which the count resets
  - **expire**: how long the setting stays cached
  - **parallelism**: maximum number of workers fetching this pattern (host) concurrently
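A minimal sketch of loading these patterns, using the `allowed_url_patterns` key and the ZREVRANGEBYSCORE call named above; the use of Jedis as the client and the score bounds of 5 and 1 follow the description, while the class and method names are hypothetical:

```java
import redis.clients.jedis.Jedis;

public class PatternLoader {
    // Reads allowed URL patterns ordered from highest priority (score 5)
    // down to lowest (score 1) using ZREVRANGEBYSCORE.
    public static Iterable<String> loadPatterns(Jedis jedis) {
        return jedis.zrevrangeByScore("allowed_url_patterns", 5, 1);
    }
}
```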
TODO:
- ignore non-text pages (binary files)
- consume faster