feat: Add checksum calculation support #188
Comments
Where would it be provided?
I did not have time to make the proposal last evening. I got to thinking about it during a project meeting.
The complicating factor is that the headers are written before the response body is sent, so if you calculate the checksum while reading the data, you can only return it in another response. The only feasible solution I've been able to come up with is: if the client requests it, calculate the hash while the file is being downloaded, store it in a persistent cache, and let the client fetch the checksum with a separate request after the download. Since it would already have been calculated and cached, that second request would be efficient.
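A minimal sketch of that idea: hash each chunk as it is streamed to the client, then store the digest so a later request can fetch it. The names (`stream_with_checksum`, `CHECKSUM_CACHE`, the `send` callback) are hypothetical stand-ins, not part of any existing API; a real implementation would use a persistent cache rather than a dict.

```python
import hashlib

# Stand-in for a persistent cache (the comments above suggest a DB-backed one).
CHECKSUM_CACHE = {}

def stream_with_checksum(path, send, chunk_size=64 * 1024):
    """Read `path` in chunks, pass each chunk to `send` (the network write),
    hashing it first, then cache the final SHA-256 for a follow-up request."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # hash the chunk before flushing it
            send(chunk)
    CHECKSUM_CACHE[path] = digest.hexdigest()
    return CHECKSUM_CACHE[path]
```

The client would then hit a second, cheap endpoint that reads `CHECKSUM_CACHE` instead of re-reading the file.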
I actually started addressing this when I wrote https://github.com/unioslo/tsd-api-lib a while ago, but never took it further. |
A couple of thoughts:
- General:
- Real-time calculation:
- Post-hoc calculation:
I agree with the sentiment.
The point of my proposal is that the file API simply queues a request for calculation. All the calculation can be done by a separate service that handles all these issues.
I spent quite a lot of time thinking about this about a year ago, and ended up abandoning any implementation work because I felt that the added complexity would not be worth it.

That said, calculating checksums for downloads is much simpler than for uploads: files are streamed by a single process, read from disk and written to the network in chunks. It would be trivial to pass each chunk to a hash function before flushing it to the network. That is what I made a proof-of-concept for here: https://github.com/unioslo/tsd-api-lib. The idea was that the final checksum would be stored in a cache (backed by a database) for fast retrieval in a separate request.

For uploads, @petterreinholdtsen and I explored many possible ideas. Calculating checksums on the fly, while the API is handling the upload, is basically just too complex to be worth it. Since you have multiple processes writing different chunks of the same file, and since hashing is stateful, you would need to multiplex the incoming data to an external hashing service. And for that to be correct, you would have to send the data to the hashing service after reading back what has been written to disk, because hashing what is handled by the API is not enough: you have to hash what is on disk.

This means that upload hashing has to be async, handled by a RabbitMQ listener service. It would still potentially take a very long time to hash, say, a 500+ GB file, which we sometimes see in production. That service could write hashes to the same DB in which the download hashes are kept, so that they could be requested via the file API, e.g. with a

But all in all, I am not sure this is really worth the effort.
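The async, post-hoc approach described above reduces to one core operation: hashing what is actually on disk, chunk by chunk, so memory use stays constant even for very large files. A hedged sketch (the function name and chunk size are illustrative choices, not anything from the project):

```python
import hashlib

def hash_file_on_disk(path, algo="sha256", chunk_size=1024 * 1024):
    """Hash the on-disk contents of `path` in fixed-size chunks.

    Reading in chunks keeps memory bounded, so the same loop works for a
    small file or a 500+ GB one; only wall-clock time grows with size.
    """
    digest = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A queue-listener service would call something like this after an upload completes, then write the result to the shared checksum DB.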
Proposal: Create an API for checksum calculation.
Description:
The goal is to create an API that will calculate the checksum of imported files. The API will listen to a RabbitMQ message queue, to which the file API will add files for checksum calculation. Which files get a checksum is decided by the import request: if the request (PUT or END PATCH) has a `?checksum` URL parameter, the file API will queue a checksum job. Once calculated, the checksum will be stored in a `.filename.checksum` file. The result can then be displayed in the file info when the user lists the imported files.