generalizable handling of slashes in DOI URLs #455

remrama · 2024-12-27T19:49:12Z

Description of problem

In trying to build a DOIDownloader for the PhysioNet repository, I was introduced to the issue of slashes in DOI URLs. This seems to be an ongoing concern for the use of DOI URLs in Pooch. A summary of the issue is that with a structure like doi:<doi>/<filename>, there is no reliable way to parse the repository DOI from the filename. DOIs and filenames both allow an undefined number of slashes.
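To make the ambiguity concrete, here is a minimal sketch that lists every way a slash-separated DOI URL could be divided into a repository DOI and a filename. The function name `candidate_splits` is hypothetical and is not part of Pooch; the only assumption is that a DOI has at least a prefix and one suffix component.

```python
def candidate_splits(doi_url):
    """Return every (doi, filename) pair a slash-separated DOI URL could mean.

    Hypothetical helper for illustration only; it is not part of Pooch.
    """
    # Drop the "doi:" protocol part and break the rest on slashes.
    _, _, path = doi_url.partition(":")
    parts = path.split("/")
    # Assume a DOI needs at least "<prefix>/<suffix>", so the earliest
    # possible boundary is after the second component.
    return [
        ("/".join(parts[:i]), "/".join(parts[i:]))
        for i in range(2, len(parts))
    ]


for doi, filename in candidate_splits("doi:10.11588/data/TKCFEF/tiny-data.txt"):
    print(f"{doi!r} + {filename!r}")
```

Both candidates are syntactically plausible, so nothing in the string itself says which one is right.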

This issue was brought up in #336 when it was realized that Zenodo filenames might have slashes in them. A fix was proposed in #337 but a long conversation there highlighted the difficulties and caveats to consider. A solution was merged in #340 (see also #341), but this is more of a workaround, as it only handles the Zenodo case (with a quick if repo == 'zenodo' type of fix on L187).

The DOIDownloader seems to be rather popular, with open requests to add support for other repositories (Dryad in #381, NIST in #441, and re3data and PANGEA in #351). Given that there is not really a reliable way to predict which DOIs will have slashes in the suffix and/or the filename, it seems like a generalizable solution to parsing Pooch's DOI URL format is warranted.

Proposed solution

Could DOI URLs be structured with a ? separating the repository DOI and the filename, instead of (yet another) slash?

# Current implementation
# (Where do DOIs end and filenames begin?)
url = "doi:10.6084/m9.figshare.14763051.v1/tiny-data.txt"
url = "doi:10.11588/data/TKCFEF/tiny-data.txt"
url = "doi:10.5281/zenodo.4924875/tiny-data.txt"
url = "doi:10.5281/zenodo.7632643/santisoler/pooch-test-data-v1.zip"
url = "doi:10.13026/C2X676/SC-subjects.xls"
url = "doi:10.13026/C2X676/sleep-cassette/SC4001EC-Hypnogram.edf"

# Proposed implementation
url = "doi:10.6084/m9.figshare.14763051.v1?tiny-data.txt"
url = "doi:10.11588/data/TKCFEF?tiny-data.txt"
url = "doi:10.5281/zenodo.4924875?tiny-data.txt"
url = "doi:10.5281/zenodo.7632643?santisoler/pooch-test-data-v1.zip"
url = "doi:10.13026/C2X676?SC-subjects.xls"
url = "doi:10.13026/C2X676?sleep-cassette/SC4001EC-Hypnogram.edf"
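As a rough sketch of why the proposed format is easy to parse (this is not the code on the fork, and `split_doi_url` is a hypothetical name), splitting on the first occurrence of the separator is enough, however many slashes appear on either side:

```python
def split_doi_url(url, separator="?"):
    """Split "doi:<doi><separator><filename>" into (doi, filename).

    Hypothetical helper for illustration only; it is not part of Pooch.
    """
    if not url.startswith("doi:"):
        raise ValueError(f"Not a DOI URL: {url!r}")
    # partition() splits on the FIRST separator, so slashes in the DOI
    # suffix or in the filename never affect where the boundary falls.
    doi, found, filename = url[len("doi:"):].partition(separator)
    # If the separator is absent, the URL names a repository only.
    return doi, (filename if found else None)


print(split_doi_url("doi:10.11588/data/TKCFEF?tiny-data.txt"))
print(split_doi_url("doi:10.5281/zenodo.7632643?santisoler/pooch-test-data-v1.zip"))
```

The `separator` argument is only there so the same sketch can be tried with other candidate characters.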

Things I like about this:

I've tested this and it works on all possible slash situations (no extra slash, extra slash in DOI, extra slash in filename).
It requires minimal updates to Pooch's codebase.
It works in all 3 use cases (pooch.retrieve, DOIDownloader(), and pup.fetch after loading registry with DOI).
It might help reduce the confusion that has previously arisen between using a repository DOI (the current use case) and a file DOI (see DataVerse file DOIs in pooch.retrieve #356). A solution like this might help clarify to the user what they are requesting with the DOI, where a user could get a single file from pooch.retrieve using either doi:10.11588/data/TKCFEF/B6S0HJ or doi:10.11588/data/TKCFEF?tiny-data.txt. The proposed format makes it clearer to me that the filename is not part of the actual DOI.
In a strange case where a DOI actually included a question mark (I've never seen this), it could be hex coded, per DOI recommendations. For example, they could use doi:10.1000/123%3F567 instead of doi:10.1000/123?567.
Nothing really changes on the user's end if they are using the DOI downloader through pup.fetch after load_registry_from_doi(). The only change is when downloading a single DOI file with pooch.retrieve, where they would have to replace their repo-fname separator with '?'.
Maybe this helps address a comment on L676 in downloaders.py, stating that DOI handling should not rely on the non-existence of trailing slashes. With the proposed solution, you would not need trailing slashes on the DOI at all.
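The hex-coding point above can be sketched with the standard library's `urllib.parse`. This only illustrates the encoding round trip for the made-up DOI from the example; how Pooch would decode the DOI before resolving it is an assumption here.

```python
from urllib.parse import quote, unquote

# A made-up DOI whose suffix contains a literal question mark.
doi = "10.1000/123?567"

# Percent-encode everything except "/", turning "?" into "%3F".
encoded = quote(doi, safe="/")
url = f"doi:{encoded}?tiny-data.txt"

# Splitting on the first "?" now finds the real separator, and the DOI
# is decoded afterwards (assumed step, not existing Pooch behavior).
encoded_doi, _, filename = url[len("doi:"):].partition("?")
decoded_doi = unquote(encoded_doi)

print(url)
print(decoded_doi, filename)
```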

Other potential solutions that I don't like as much:

Filename in brackets or something (just uglier to me).
A more parameter-focused approach, specifying the filename more directly like doi:<repository_doi>?fname=tiny-data.txt (seems overkill).
Retry resolving the archive_url with consecutive slash splits (seems possible this could get a user to a repo they did not intend).
Require extra slashes in the repository DOI to be hex coded. For example, 10.11588/data/TKCFEF/tiny-data.txt would become 10.11588/data%2FTKCFEF/tiny-data.txt. (This seems weird if you're a casual user not understanding the issue.)

Are you willing to implement this?

Yes, I've put the small changes required for this here on a fork. I have not run pytest or updated the documentation, but I would if there were interest in this being merged. I have tested it manually, though, and it seems to work on new cases without breaking old ones.

santisoler · 2025-01-08T20:08:29Z

Hi @remrama! Thanks a lot for putting all these thoughts into tackling this issue.

I totally agree that the current way we handle this is not great: as you said, the patch I included in #340 is just a workaround. If at any point we want to include a DOI service provider that makes our conditions clash when trying to figure out the netloc, then we just won't be able to add it.

I hadn't thought about using a different separator between the DOI and the filename, but I like the idea! I'm curious why you chose ?. Just adding some ideas:

What about using a character not reserved by RFC 3986? Something like |? The examples you showed before would look like this:

url = "doi:10.6084/m9.figshare.14763051.v1|tiny-data.txt"
url = "doi:10.11588/data/TKCFEF|tiny-data.txt"
url = "doi:10.5281/zenodo.4924875|tiny-data.txt"
url = "doi:10.5281/zenodo.7632643|santisoler/pooch-test-data-v1.zip"
url = "doi:10.13026/C2X676|SC-subjects.xls"
url = "doi:10.13026/C2X676|sleep-cassette/SC4001EC-Hypnogram.edf"

Why not use another : as the separator? Although : is reserved by RFC 3986, as long as no DOI uses :, we should be OK. This choice would make : a consistent separator (we already use it to specify the protocol). Moreover, it resembles the syntax of scp (scp user@server:/home/user/myfile .), so it would feel more familiar to some users.

url = "doi:10.6084/m9.figshare.14763051.v1:tiny-data.txt"
url = "doi:10.11588/data/TKCFEF:tiny-data.txt"
url = "doi:10.5281/zenodo.4924875:tiny-data.txt"
url = "doi:10.5281/zenodo.7632643:santisoler/pooch-test-data-v1.zip"
url = "doi:10.13026/C2X676:SC-subjects.xls"
url = "doi:10.13026/C2X676:sleep-cassette/SC4001EC-Hypnogram.edf"

Another idea (but not for Pooch < 2.0.0) would be to use a space. The catch with this option is that we would need to move away from registry.txt files to something like JSON (an idea we've been discussing for Pooch >= 2.0.0), where we could handle spaces in the URLs better. The previous URLs would look like this:

url = "doi:10.6084/m9.figshare.14763051.v1 tiny-data.txt"
url = "doi:10.11588/data/TKCFEF tiny-data.txt"
url = "doi:10.5281/zenodo.4924875 tiny-data.txt"
url = "doi:10.5281/zenodo.7632643 santisoler/pooch-test-data-v1.zip"
url = "doi:10.13026/C2X676 SC-subjects.xls"
url = "doi:10.13026/C2X676 sleep-cassette/SC4001EC-Hypnogram.edf"

I would like to hear your thoughts on these. BTW, I also don't like the other possible solutions you mentioned (thanks for adding them as well!). The only one I could consider is putting the filename between brackets... but only if using another character as separator has a major flaw.

I foresee two possible options for implementing such an idea:

We could ship it on Pooch > 1.8.2 < 2.0.0, but we would have to keep backward compatibility (and be very careful about it). That will force us to duplicate tests: we should keep the current tests with the slash as separator, while adding tests for the implementation with the new separator.
We could start working on it, but ship it on Pooch 2.0.0, so we don't need to worry about backward compatibility.

The good thing about option 1 is that we would allow our users to make a smooth change, rather than breaking their workflows as soon as we release Pooch 2.0.0 (assuming they don't pin Pooch to < 2.0.0). But I suspect it might require more work. We would also need to raise FutureWarnings informing users of the new separator and of the deprecation of the slash as separator in Pooch 2.0.0 (maybe it could be a DeprecationWarning, since end users don't need to know about it, but developers do).
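A rough sketch of what the backward-compatible path in option 1 could look like, using ? as the new separator purely for illustration. The helper `split_doi_url_compat` is hypothetical, and its legacy branch (splitting on the last slash) is a deliberate simplification rather than Pooch's current parsing logic.

```python
import warnings


def split_doi_url_compat(url, separator="?"):
    """Split a DOI URL, accepting both the new and the legacy form.

    Hypothetical helper for illustration only; it is not Pooch's code.
    Assumes `url` starts with "doi:".
    """
    path = url[len("doi:"):]
    if separator in path:
        # New form: the first separator marks the end of the repository DOI.
        doi, _, filename = path.partition(separator)
        return doi, filename
    # Legacy form. Splitting on the LAST slash is a simplification made
    # for this sketch; it is wrong for filenames that contain slashes.
    warnings.warn(
        "Separating the DOI and the filename with '/' is deprecated; "
        f"use '{separator}' instead.",
        FutureWarning,
        stacklevel=2,
    )
    doi, _, filename = path.rpartition("/")
    return doi, filename


# Record warnings to show that only the legacy form triggers one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    new_style = split_doi_url_compat("doi:10.13026/C2X676?SC-subjects.xls")
    old_style = split_doi_url_compat("doi:10.13026/C2X676/SC-subjects.xls")

print(new_style, old_style, [str(w.message) for w in caught])
```

Swapping FutureWarning for DeprecationWarning is a one-line change; the difference is that Python hides DeprecationWarning by default outside of code run directly in `__main__`, so end users would rarely see it.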

Let me know what you think! I would like to continue the conversation.

I'd also like to hear @leouieda's ideas on this (although I don't expect him to be around for a few weeks).
