
feat: Add checksums calculation support #188

Open
kouylekov-usit opened this issue Aug 28, 2023 · 7 comments

@kouylekov-usit
Contributor

kouylekov-usit commented Aug 28, 2023

Proposal: Create an API for checksum calculation.

Description:

The goal is to create an API that calculates checksums of imported files. The API will listen to a RabbitMQ message queue, to which the file API will add files that need checksums. Which files get checksummed is decided by the import request: if the request (PUT or END PATCH) has a ?checksum URL parameter, the file API will queue a checksum job. Once calculated, the checksum will be stored in a .filename.checksum file, and the result can be displayed in the file info when the user lists the imported files.
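A minimal sketch of the producer side, assuming a hypothetical queue name and message shape (the real file API internals may differ):

```python
# Hypothetical sketch: the file API queues a checksum job when an import
# request (PUT or END PATCH) carries the ?checksum URL parameter.
# CHECKSUM_QUEUE and the message shape are assumptions for illustration.
import json
import pika

CHECKSUM_QUEUE = "checksum-jobs"  # assumed queue name

def queue_checksum_job(file_path: str, algorithm: str = "sha256") -> None:
    """Publish a checksum job for a newly imported file."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=CHECKSUM_QUEUE, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=CHECKSUM_QUEUE,
        body=json.dumps({"path": file_path, "algorithm": algorithm}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()
```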

@leondutoit
Collaborator

Where would it be provided?

@kouylekov-usit
Contributor Author

> Where would it be provided?

I did not have time to flesh out the proposal last evening. I started thinking about it during a meeting with a project.

@leondutoit
Collaborator

The complicating factor is that the response headers are written before the body is sent, so if you calculate the checksum while reading the data, you can only return it in another response.

The only feasible solution I've been able to come up with is to store the calculated hash in a persistent cache if the client requests it while downloading the file; the client can then fetch the checksum after the download with another request. Since the value would already be calculated and cached, that second request would be efficient.
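A minimal sketch of the persistent-cache idea, assuming a sqlite-backed table (the schema and function names are made up for illustration):

```python
# Sketch of a persistent checksum cache: the download handler stores the
# hash it computed while streaming, and a later request reads it back.
# The sqlite schema and function names are assumptions.
import sqlite3

def init_cache(db_path: str = "checksums.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checksums "
        "(path TEXT PRIMARY KEY, algorithm TEXT, digest TEXT)"
    )
    return conn

def store_checksum(conn: sqlite3.Connection, path: str, algorithm: str, digest: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO checksums VALUES (?, ?, ?)",
        (path, algorithm, digest),
    )
    conn.commit()

def fetch_checksum(conn: sqlite3.Connection, path: str):
    return conn.execute(  # None if not (yet) calculated
        "SELECT algorithm, digest FROM checksums WHERE path = ?", (path,)
    ).fetchone()
```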

@kouylekov-usit changed the title from "Add checksums in output post response" to "feat: Add checksums calculation support" on Aug 29, 2023
@leondutoit
Collaborator

I actually started addressing this when I wrote https://github.com/unioslo/tsd-api-lib a while ago, but never took it further.

@egiltane
Contributor

A couple of thoughts:

General:

  • Calculating cryptographic hashes (i.e. hashes that satisfy cryptographic properties, rather than merely securing transport) is noticeably expensive, especially when performed on files comprising multiple GiB.

Real-time calculation:

  • For PATCHing, initialisation vectors, if any, will have to be saved as part of the transactional state across requests.

Post-hoc calculation:

  • Expensive operations call for asynchronicity (read: background tasks or spooling).
  • Asynchronicity calls for transactions (transaction IDs and the management thereof).
  • To make concurrent access safe, one will most likely need exclusive locking (write locks).
  • Locking on NFS requires at least `fcntl()` (via `LOCK_EX`), as `flock()` won’t work reliably across file systems; see the sketch after this list. Preferably one would even go for something more robust and custom, such as globally unique identifiers.
  • In essence, hashing will increase the complexity of the code significantly.
  • Complexity impedes robustness.
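
A minimal sketch of the `fcntl()`-based exclusive lock mentioned above (Python's `fcntl.lockf()` uses POSIX record locks, which work over NFS; everything else here is illustrative):

```python
# Hash a file while holding an exclusive POSIX record lock. fcntl.lockf()
# is backed by fcntl(2), which NFS supports; an exclusive (write) lock
# requires the descriptor to be open for writing, hence "r+b".
import fcntl
import hashlib

def hash_locked(path: str, chunk_size: int = 1 << 20) -> str:
    with open(path, "r+b") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            h = hashlib.sha256()
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
            return h.hexdigest()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```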

@kouylekov-usit
Contributor Author

I agree with the sentiment.

> A couple of thoughts:
>
> General:
>
> * Calculating cryptographic hashes (i.e. hashes that satisfy cryptographic properties, rather than merely securing transport) is noticeably expensive, especially when performed on files comprising multiple GiB.
>
> Real-time calculation:
>
> * For PATCHing, initialisation vectors, if any, will have to be saved as part of the transactional state across requests.
>
> Post-hoc calculation:
>
> * Expensive operations call for asynchronicity (read: background tasks or spooling).
> * Asynchronicity calls for transactions (transaction IDs and the management thereof).
> * To make concurrent access safe, one will most likely need exclusive locking (write locks).
> * Locking on NFS requires at least `fcntl()` (via `LOCK_EX`), as `flock()` won’t work reliably across file systems. Preferably one would even go for something more robust and custom, such as globally unique identifiers.
> * In essence, hashing will increase the complexity of the code significantly.
> * Complexity impedes robustness.

The point of my proposal is that the file API simply queues a request for calculation; all the calculation can be done by a separate service that handles these issues.
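
A minimal sketch of what that separate service could look like, reusing the assumed queue name and message shape from the producer sketch above:

```python
# Hypothetical consumer: drain the queue, hash each file, and write the
# result next to it as .filename.checksum, as described in the proposal.
import hashlib
import json
import os
import pika

CHECKSUM_QUEUE = "checksum-jobs"  # assumed queue name

def handle_job(ch, method, properties, body):
    job = json.loads(body)
    path = job["path"]
    h = hashlib.new(job.get("algorithm", "sha256"))
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    directory, name = os.path.split(path)
    with open(os.path.join(directory, f".{name}.checksum"), "w") as out:
        out.write(h.hexdigest())
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the write

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=CHECKSUM_QUEUE, durable=True)
channel.basic_consume(queue=CHECKSUM_QUEUE, on_message_callback=handle_job)
channel.start_consuming()
```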

@leondutoit
Collaborator

I spent quite a lot of time thinking about this roughly a year ago, and ended up abandoning any implementation work because I felt that the added complexity would not be worth it.

That said, calculating checksums for downloads is much simpler than for uploads - files are streamed by a single process, and are read from disk and written to the network in chunks. It would be trivial to pass each chunk to a hash function before flushing it to the network. That is what I made a proof-of-concept for here: https://github.com/unioslo/tsd-api-lib - the idea was that the final checksum would be stored in a cache (backed by a database), for fast retrieval in a separate request.
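A minimal sketch of that per-chunk hashing, where `write_chunk` stands in for whatever the API uses to flush data to the network (an assumption here):

```python
# Hash each chunk just before flushing it to the network; the extra
# work per chunk is a single hash update. The final digest can then be
# stored in the cache for retrieval in a separate request.
import hashlib

def stream_with_checksum(path, write_chunk, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            write_chunk(chunk)
    return h.hexdigest()
```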

For uploads @petterreinholdtsen and I explored many possible ideas. Calculating checksums on-the-fly, while the API is handling the upload, is basically just too complex to be worth it. Since you have multiple processes writing different chunks of the same file, and since hashing is stateful, you would need to multiplex the incoming data to an external hashing service. And for that to be correct, you would have to send the data to the hashing service after reading back what has been written to disk, because hashing what is handled by the API is not enough - you have to hash what is on disk.

This means that upload hashing has to be async, handled by a RabbitMQ listener service. It would still potentially take a very long time to hash, say, a 500+ GB file, which we sometimes see in production. That service could write hashes to the same DB in which the download hashes are kept, so that they could be requested via the file API, e.g. with a HEAD request and some header indicating that one wants the latest hash value of the file. If the hash has not yet been completely calculated, one could return https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202 or https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, and the client just has to try again later.
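A client-side sketch of that retry loop; the endpoint, header names, and backoff are all assumptions:

```python
# Poll for the latest hash with HEAD requests, backing off while the
# server answers 202 or 503 (hash not yet calculated).
import time
import requests

def get_checksum(url: str, attempts: int = 10, delay: float = 30.0):
    for _ in range(attempts):
        r = requests.head(url, headers={"Accept-Checksum": "sha256"})  # assumed header
        if r.status_code == 200:
            return r.headers.get("X-Checksum")  # assumed response header
        if r.status_code in (202, 503):
            time.sleep(delay)  # not ready yet; try again later
            continue
        r.raise_for_status()
    return None
```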

But all in all, I am not sure this is really worth the effort.
