The import script has gotten pretty crazy and messy over time, and we could remove a lot of the complexity. Some of it is there because it’s taken us a while to learn the ins and outs of the Wayback APIs and their peculiarities, and the rest is there because something was expedient at the time.
A while back, I had a bunch of ideas about how this script could be clearer and more pipeline-y, with a series of generator-based tasks that run on threads connected by FiniteQueue. I’ve played that out somewhat in the task sheets script. We sort of do that here, but various filtering, summarization, and error handling bits that should be separate workflow items are mixed together, and what’s actually happening isn’t always clear.
(There might also be some better tools for this now. Things like Databay and Prefect either didn’t exist or I didn’t know about them at the time. Bonobo looked to be in a messy total rewrite and didn’t have some of the facilities we needed, but may be better now.)
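To make the pipeline idea concrete, here is a rough sketch of generator-based tasks running on threads and connected by queues. It is not the import script's actual code: `run_stage`, `run_pipeline`, `skip_errors`, and `to_summary` are hypothetical names, the record shape is made up, and a plain `queue.Queue` with an end-of-stream sentinel stands in for `FiniteQueue`, whose API isn't shown here.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream (stand-in for FiniteQueue's behavior)


def run_stage(task, source, sink):
    """Feed items from `source` through generator function `task` into `sink`."""
    def read():
        while True:
            item = source.get()
            if item is _DONE:
                return
            yield item

    try:
        for result in task(read()):
            sink.put(result)
    finally:
        # Always signal the end, even if the task raised, so downstream stages stop.
        sink.put(_DONE)


def run_pipeline(items, *tasks):
    """Run each task on its own thread, chained by queues, and collect the output."""
    queues = [queue.Queue() for _ in range(len(tasks) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(task, queues[i], queues[i + 1]))
        for i, task in enumerate(tasks)
    ]
    for thread in threads:
        thread.start()

    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    for thread in threads:
        thread.join()
    return results


# Hypothetical workflow items: filtering and summarization each get their own step.
def skip_errors(records):
    for record in records:
        if record["status"] < 400:
            yield record


def to_summary(records):
    for record in records:
        yield f'{record["url"]} ({record["status"]})'


if __name__ == "__main__":
    sample = [{"url": "a", "status": 200}, {"url": "b", "status": 500}]
    print(run_pipeline(sample, skip_errors, to_summary))
```

The point of the shape is that each step only knows about the iterator it consumes and the values it yields, so filtering, summarization, and error handling can be reordered or tested separately.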
This probably isn’t high-priority enough to fit on the 2020 roadmap, but would be some nice cleanup to do if there’s time.