-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generalizable handling of slashes in DOI URLs #455
Comments
Hi @remrama! Thanks a lot for putting all these thoughts into tackling this issue. I totally agree that the current way we handle this is not great: as you said the patch I included in #340 is just a workaround. If at any point we would like to include a DOI service provider that makes our conditions to clash when trying to figure out the I haven't thought about using a different separator between the DOI and the filename, but I like the idea! I'm curious why choosing
I would like to hear your thoughts on these. BTW, I also don't like the other possible solutions you mentioned (thanks for adding them as well!). The only one I could consider is putting the filename between brackets... but only if using another character as separator has a major flaw. I foresee two possible options to implement such idea:
The good thing of option 1 is that we would allow our users to make a smooth change, rather than braking their workflows as soon as we release Pooch 2.0.0 (assuming the don't pin Pooch to < 2.0.0). But, I suspect it might require more work. We would also need to raise FutureWarnings informing the users of the new separator, and about the deprecation of the slash as separator for Pooch 2.0.0 (maybe it could be a DeprecationWarning, since end users don't need to know about it, but developers do). Let me know what do you think! I would like to continue the conversation. I'd also like to hear @leouieda's ideas on this (although I don't expect him to be around for a few weeks). |
Description of problem
In trying to build a
DOIDownloader
for the PhysioNet repository, I was introduced to the issue of slashes in DOI URLs. This seems to be an ongoing concern for the use of DOI URLs in Pooch. A summary of the issue is that with a structure likedoi:<doi>/<filename>
, there is no reliable way to parse the repository DOI from the filename. DOIs and filenames both allow an undefined number of slashes.This issue was brought up in #336 when it was realized that Zenodo filenames might have slashes in them. A fix was proposed in #337 but a long conversation there highlighted the difficulties and caveats to consider. A solution was merged in #340 (see also #341), but this is more of a workaround, as it only handles the Zenodo case (with a quick
if repo == 'zenodo'
type of fix on L187).The
DOIDownloader
seems to be rather popular, with open requests to add support for other repositories (Dryad in #381, NIST in #441, and re3data and PANGEA in #351). Given that there is not really a reliable way to predict which DOIs will have slashes in the suffix and/or the filename, it seems like a generalizable solution to parsing Pooch's DOI URL format is warranted.Proposed solution
Could DOI URLs be structured with a
?
separating the repository DOI and the filename, instead of (yet another) slash?Things I like about this:
pooch.retrieve
,DOIDownloader()
, andpup.fetch
after loading registry with DOI).pooch.retrieve
using eitherdoi:10.11588/data/TKCFEF/B6S0HJ
ordoi:10.11588/data/TKCFEF?tiny-data.txt
. The proposed format seems clearer to me that the filename is not part of the actual DOI.doi:10.1000/123%3F567
instead ofdoi:10.1000/1234567
.pup.fetch
afterload_registry_from_doi()
. The only change is when downloading a single DOI file withpooch.retrieve
, where they would have to replace their repo-fname separator with'?'
.downloaders.py
, stating that DOI handling should not rely on the non-existence of trailing slashes. With the proposed solution, you would not need trailing slashes on the DOI at all.Other potential solutions that I don't like as much:
doi:<repository_doi>?fname=tiny-data.txt
(seems overkill).archive_url
with consecutive slash splits (seems possible this could get a user to a repo they did not intend).10.11588/data/TKCFEF/tiny-data.txt
would become10.11588/data%2FTKCFEF/tiny-data.txt
. (This seems weird if you're a casual user not understanding the issue.)Are you willing to implement this?
Yes, I've put the small changes required for this here on a fork. I have not run pytests or updated documentation, but I would if there was interested in this being merged. But I have tested it manually and it seems to work on new cases without breaking old ones.
The text was updated successfully, but these errors were encountered: