Import script should follow more of a pipeline style #669

Open
Mr0grog opened this issue Nov 16, 2020 · 1 comment

Mr0grog commented Nov 16, 2020

The import script has gotten pretty crazy and messy over time, and we could remove a lot of the complexity. Some of it is there because it’s taken us a while to learn the ins and outs of the Wayback APIs and their peculiarities, and some is there just because something was expedient at the time.

A while back, I had a bunch of ideas about how this script could be clearer and more pipeline-y, with a series of generator-based tasks that run on threads connected by FiniteQueue. I’ve played that out somewhat in the task sheets script. We sort of do that here, but various filtering, summarization, and error handling bits that should be separate workflow items are mixed together, and what’s actually happening isn’t always clear.
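As a rough illustration of the "generator-based tasks on threads connected by queues" idea, here is a minimal sketch. It is not the project's code: `FiniteQueue`'s actual API isn't shown in this thread, so the sketch substitutes the standard library's `queue.Queue` plus an end-of-stream sentinel, and the names `run_stage`, `run_pipeline`, `keep_ok`, and `summarize` (as well as the record shape) are all hypothetical.

```python
import queue
import threading

# Sentinel marking the end of a stream. This stands in for whatever
# "finished" signal FiniteQueue provides (an assumption, not its real API).
_END = object()


def run_stage(task, source, sink):
    """Run a generator-based `task` on its own thread, reading items from
    the `source` queue and writing whatever it yields to the `sink` queue."""
    def read():
        # Expose the source queue to the task as a plain iterator.
        while True:
            item = source.get()
            if item is _END:
                return
            yield item

    def work():
        try:
            for result in task(read()):
                sink.put(result)
        finally:
            # Always tell the next stage that this one is done.
            sink.put(_END)

    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    return thread


# Hypothetical workflow items: each does exactly one job.
def keep_ok(records):
    """Filtering step: pass along only records with a 200 status."""
    for record in records:
        if record["status"] == 200:
            yield record


def summarize(records):
    """Summarization step: turn each record into a one-line summary."""
    for record in records:
        yield f'{record["url"]} ({record["status"]})'


def run_pipeline(items, *tasks):
    """Chain `tasks` together with queues, feed in `items`, and collect
    the output of the final stage into a list."""
    first = queue.Queue()
    source = first
    threads = []
    for task in tasks:
        sink = queue.Queue()
        threads.append(run_stage(task, source, sink))
        source = sink

    for item in items:
        first.put(item)
    first.put(_END)

    results = []
    while True:
        item = source.get()
        if item is _END:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    records = [
        {"url": "https://example.gov/a", "status": 200},
        {"url": "https://example.gov/b", "status": 404},
    ]
    print(run_pipeline(records, keep_ok, summarize))
```

The point of the shape is that filtering, summarization, and (eventually) error handling each live in their own generator, so adding or reordering a step means changing the argument list to `run_pipeline` rather than untangling one large function.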

(There might also be some better tools for this now. Things like Databay and Prefect either didn’t exist or I didn’t know about them at the time. Bonobo looked to be in a messy total rewrite and didn’t have some of the facilities we needed, but may be better now.)

This probably isn’t high-priority enough to fit on the 2020 roadmap, but it would be some nice cleanup to do if there’s time.

Mr0grog commented Nov 16, 2020

Potentially useful sketch I did of this a while back (left is current flow, right is broken up into more pipeline-y bits):

[Image: import-flow]
