Failed to put block: snapshot does not exist #43
Another one today. Most of the same block numbers...
While trying to reproduce this, I also saw the following error; not sure if it's related, but I'd guess not. I'm putting it here because the same fix might help, if we have to impose waits / retries.
I ran 5 upload tests each with a 10MB file and a ~1GB file, each time rerunning coldsnap until I got a failure.

For the 10MB file, it failed after 31, 177, 68, 48, and 968 attempts respectively, all but one with the "snapshot does not exist" error; the last gave [...]

For the 1GB file, it failed after 27, 7, 4, 33, and 130 attempts respectively, always with the "connection closed before message completed" error. I hadn't seen this before, but now it seems consistent, at least when running somewhat intensive tests with the bigger file. The connection could be closed for any number of reasons... (For reference, I did also see [...])

I'm going to try to instrument the error case a bit more thoroughly to see if I can get better clues about the state of the connection and the snapshot at the point of error. For example, we only keep the error message from the last retry for each block, and I'll try keeping all of those; I'm also going to check the result of the start_snapshot call, which we currently just check was successful, but there's a separate 'status' field inside that could be [...]
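To make the "keep every retry's error" idea concrete, here is a minimal sketch using only the standard library. It is not coldsnap's actual code: `try_block`, `BlockFailure`, and the closure standing in for the block upload are hypothetical names chosen for illustration.

```rust
use std::fmt::Display;

/// Hypothetical record of a block that never uploaded: every error seen,
/// in attempt order, instead of only the last one.
#[derive(Debug)]
pub struct BlockFailure {
    pub block_index: u64,
    pub errors: Vec<String>,
}

/// Calls `attempt` up to `max_attempts` times (passing the 1-based attempt
/// number). On success, returns the value plus any errors seen on earlier
/// attempts; on exhaustion, returns all collected errors for the block.
pub fn try_block<T, E, F>(
    block_index: u64,
    max_attempts: u32,
    mut attempt: F,
) -> Result<(T, Vec<String>), BlockFailure>
where
    E: Display,
    F: FnMut(u32) -> Result<T, E>,
{
    let mut errors = Vec::new();
    for n in 1..=max_attempts {
        match attempt(n) {
            Ok(value) => return Ok((value, errors)),
            // Record the message instead of overwriting the previous one.
            Err(e) => errors.push(format!("attempt {}: {}", n, e)),
        }
    }
    Err(BlockFailure { block_index, errors })
}
```

Returning the earlier errors even on success means blocks that eventually uploaded after a not-found error still show up in the logs.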
I did learn a few things from the extra instrumentation, running with the 1GB file. Retry counts:
Retry causes:
It'd be great to figure out why we get the connection closed errors. It's tempting to raise the retry limit to 5 to work around it.

Obviously, none of these errors are the same as the original one in this issue, where EBS thinks the snapshot ID doesn't exist. Increasing retries seems less likely to fix that issue, since it's almost surely a timing issue, and we're not waiting between retries. It's tempting to add a wait.

In the past, when faced with similar resource timing issues, we've used the strategy of building a separate client object, ideally one that talks to a different endpoint, and checking the existence of the resource with that. It's unfortunate but it might help here.
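The "check existence with a separate client" strategy could look roughly like the sketch below. It is only an illustration under assumptions: `wait_until_visible` is a hypothetical helper, and the `is_visible` closure stands in for whatever describe-style call the second client would make. The sleep function is passed in so the loop can be exercised without real delays.

```rust
use std::time::Duration;

/// Polls `is_visible` until it returns true or `max_checks` checks have been
/// made, sleeping `delay` between checks. Returns the number of checks used,
/// or `None` if the resource never became visible.
pub fn wait_until_visible<C, S>(
    max_checks: u32,
    delay: Duration,
    mut is_visible: C,
    mut sleep: S,
) -> Option<u32>
where
    C: FnMut() -> bool,
    S: FnMut(Duration),
{
    for n in 1..=max_checks {
        if is_visible() {
            return Some(n);
        }
        // No point sleeping after the final check.
        if n < max_checks {
            sleep(delay);
        }
    }
    None
}
```

In real use, `sleep` would be something like `std::thread::sleep`, and `is_visible` would query the snapshot through the second client before any block uploads start.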
As a test, I bumped block retry count from 3 to 5, and added an increasing wait time after each block retry - 1 second, 2, 3, 4, 5. I was finally able to catch the interesting case, "snapshot does not exist." (I don't think it was related to those changes, I just happened to catch it.)

As you'd expect, when you see that error, you see it a lot - hundreds of blocks continued to fail with that error after >15s of total delay, killing the upload. However, very near the end, a handful of later blocks did manage to succeed in uploading after 1 not-found error. (I only logged ones that had some kind of failure, not currently the status of every block, so there could have been more.)

This implies that we could get some mileage out of confirming we can describe the snapshot after starting it, before we start uploading. However, I'm still not confident that it's a linear ordering, meaning that once we see an upload without a not-found error, I think we can still see a not-found error. I believe rusoto uses a connection pool, and we're not necessarily sending requests on the same connection that saw the resource... but in combination with backoff-retries it may be enough.

[edit] Actually, I'm fairly sure the ordering isn't linear, and that the waits wouldn't be enough in combination. I had another failed run with only one block failing due to the snapshot not being found. The per-connection results seem pretty inconsistent.
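For reference, a retry loop with a linearly increasing wait might be sketched as follows. This is not the change that was actually tested: `retry_with_linear_backoff` is a hypothetical helper, `op` stands in for a single block upload, and the sleep function is injected so the schedule can be checked without waiting. Note that this version does not wait after the final failure, which may differ from the experiment described above.

```rust
use std::time::Duration;

/// Runs `op` until it succeeds or `max_attempts` attempts have failed
/// (it always runs at least once). After the k-th failed attempt it waits
/// `step * k`, so with a 1s step the waits are 1s, 2s, 3s, ...
pub fn retry_with_linear_backoff<T, E, F, S>(
    max_attempts: u32,
    step: Duration,
    mut op: F,
    mut sleep: S,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    S: FnMut(Duration),
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(step * attempt);
                attempt += 1;
            }
        }
    }
}
```

As the comment above notes, waits alone may not be sufficient if different pooled connections see inconsistent views of the snapshot; the loop only bounds how long each block keeps trying.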
We do still occasionally see this error, even with the retries from #56 - twice in the past two weeks, I believe, during CI runs.
Saw this error when trying to upload a ~2GB file. I'm not sure if there's a timing issue in coldsnap, where it needs to wait for some confirmation before uploading blocks, or if it's on the EBS side, or...