
Add retries #110

Open
steve-chavez opened this issue Sep 25, 2023 · 9 comments
@steve-chavez
Member

Hey there! I'd like a new feature that doesn't seem to be present in Supabase regarding webhooks. Sometimes my API gets backed up or is down for a few seconds, and webhooks may fail. Some functions rely on webhooks to work and provide accurate data to customers, so, like Stripe does, I would like an automated retry system.

If a webhook fails, it would automatically retry after some interval, say 30 minutes. If that fails, it retries in an hour, up to maybe 5 retries total. That way, no webhook is missed.

Originally posted by @Wamy-Dev in supabase/supabase#17664

@steve-chavez
Member Author

This is already possible with some work, as demonstrated in https://github.com/supabase/pg_net#retrying-failed-requests, but it would be better to have retries built in.

The curl CLI does retries (https://everything.curl.dev/usingcurl/downloads/retry), but libcurl doesn't provide this functionality, so we need to implement our own.
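
In the meantime, the gist of the manual approach is to keep a copy of the request parameters and re-send any request whose response shows a transport-level failure. A minimal sketch, not the README's exact code: app.webhook_requests is a hypothetical application-side table storing the original url/body and the id returned by net.http_post, and the net._http_response columns (timed_out, error_msg) are assumed to match current pg_net.

```sql
-- Re-send requests whose responses failed at the transport level,
-- replacing the stored id with the id of the newly enqueued request.
update app.webhook_requests w
set request_id = net.http_post(url := w.url, body := w.body)
from net._http_response r
where r.id = w.request_id
  and (r.timed_out or r.error_msg is not null);
```

Run periodically (e.g. from a pg_cron job), this keeps re-sending until the request goes through, though it has no attempt cap on its own.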

@TheOtherBrian1
Contributor

TheOtherBrian1 commented Sep 25, 2023

In issue #62, you mentioned the following:

An alternative [to error recovery] could be logging the failed http requests. Then the user would have to search the pg logs.

Originally, the plan was to provide users with the option to log error messages to a designated logger for handling:

... We [can] optionally log something based on the status code when the response is handled?

We should have a config to turn this on/off, maybe named like pg_net.log_http_errors (off by default).
— #49

However, this feature has not been implemented. The config variable for enabling logging is absent in the worker.c codebase. Nonetheless, there may be potential benefits in expanding upon this concept.

Rather than enabling the recording of request failure messages in a log file, the config variable could instruct the extension to automatically reintegrate all dropped requests back into the "net.http_request_queue" table. This approach, unlike the existing solution, would demand the addition of just a single variable and the replacement of one of the error logs with the C equivalent of an INSERT statement.

It is important to emphasize that this suggestion would exclusively monitor requests that failed due to an extension-related error. Requests resulting in 400 or 500 HTTP status errors are already logged in net._http_response and should be managed by users. Such failures may stem from human errors in request endpoint formatting, potentially leading to an endless loop of rejected retry attempts. For users wishing to oversee these specific failed requests, the existing README suggestion provides a safer mechanism, albeit with greater overhead.
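
To make the distinction concrete, assuming the net._http_response columns status_code, timed_out, and error_msg (as in current pg_net), the two kinds of failure can be separated with something like:

```sql
-- Extension-level failures: the request never completed. These would be the
-- candidates for automatic re-enqueueing under the proposed config variable.
select id, error_msg
from net._http_response
where timed_out or error_msg is not null;

-- HTTP-level failures: the request completed but the server rejected it.
-- These stay in the user's hands, since blindly retrying a malformed endpoint
-- could loop forever.
select id, status_code
from net._http_response
where status_code >= 400;
```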

@steve-chavez
Member Author

@TheOtherBrian1 Great points.

Requests resulting in 400 or 500 HTTP status errors are already logged in net._http_response and should be managed by users. Such failures may stem from human errors in request endpoint formatting, potentially leading to an endless loop of rejected retry attempts

True. I think the only safe way to auto-retry would be to follow the Retry-After header; however, I don't think all services implement it.
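
Assuming response headers are stored as jsonb in net._http_response (header-name casing and the HTTP-date form of Retry-After are glossed over here), honoring the header from SQL could look roughly like the sketch below; this is not existing pg_net behavior:

```sql
-- Pick a delay per retryable response: honor Retry-After (seconds form) when
-- the server sent it, otherwise fall back to a default of 30 minutes.
select id,
       case
         when headers ->> 'retry-after' ~ '^\d+$'
           then (headers ->> 'retry-after')::int * interval '1 second'
         else interval '30 minutes'  -- header absent, or in HTTP-date form
       end as retry_delay
from net._http_response
where status_code in (429, 503);
```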

We should have a config to turn this on/off, maybe named like pg_net.log_http_errors(off by default).
— #49 (comment)
However, this feature has not been implemented.

Agree, that would be higher priority than this idea.

Rather than enabling the recording of request failure messages in a log file, the config variable could instruct the extension to automatically reintegrate all dropped requests back into the "net.http_request_queue" table
It is important to emphasize that this suggestion would exclusively monitor requests that failed due to an extension-related error.

Hm, do you have examples of extension-related errors? I don't recall one. IIUC, it's an internal error rather than, say, a 400 Bad Request.

@TheOtherBrian1
Contributor

Hm, do you have examples of extension-related errors?

I haven't experienced any yet, but I've thought about the potential for them to happen.

@steve-chavez
Member Author

Such failures may stem from human errors in request endpoint formatting, potentially leading to an endless loop of rejected retry attempts.

We run the retries on an exponential scale with increasing periods, which ensures resilience for up to 24 hours in case of connectivity downtime due to updates, issues, etc.

^ From https://github.com/orgs/supabase/discussions/17664#discussioncomment-7208243

We could cap the retries as suggested above; that way we don't end up with endless loops.
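
As an illustration of what a capped exponential schedule covering roughly a day could look like (the base delay, factor, and cap here are assumptions, not a settled design):

```sql
-- Capped exponential backoff: delay(n) = 30 min * 2^n, capped at 8 hours.
-- Six attempts spaced this way add up to about 23.5 hours of coverage.
select attempt,
       least(interval '30 minutes' * power(2, attempt), interval '8 hours') as delay
from generate_series(0, 5) as attempt;
```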

@lookevink

Currently using Supabase to fire off webhooks to Vercel endpoints.

Vercel has been running into issues with Undici when there's CPU contention, so retries with exponential backoff are almost mandatory for event-driven applications.

Wasn't quite sure how to implement the official retry solution:
https://github.com/supabase/pg_net#retrying-failed-requests

Instead, I used QStash; so far it's working like a charm.

@shawnmclean

shawnmclean commented Oct 6, 2024

@lookevink is QStash still the way to go for an event driven system or have you learnt anything new so far?

I need to get DB changes to a background worker on trigger.dev. I was hoping to send them straight there, or at least through my app on Vercel and then to trigger.dev.

Going to QStash for durability seems like adding more complexity.

Do you mind explaining the Vercel Undici CPU issue? I thought each request was its own serverless call?

@lookevink

lookevink commented Oct 6, 2024 via email

@TheOtherBrian1
Contributor

TheOtherBrian1 commented Oct 8, 2024

@shawnmclean, I've been thinking about rewriting the pg_net retry section. I have an early draft. I'm fairly confident in stages 1-6 and 8. It can be used as a basis for any retry system.

I'm personally not a huge fan of using PL/pgSQL for retries, even though that's what I'm currently working on. It defeats the purpose of using unlogged tables for performance gains - a foundational advantage of pg_net. However, this approach does make the solution configurable for Postgres developers and gives them the most control.
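
Not the draft itself, but a minimal sketch of the general shape such a PL/pgSQL layer could take, extending the re-send idea from earlier in the thread. The app.retryable_requests table, the function name, and the backoff numbers are hypothetical; net.http_post's url/body parameters and the net._http_response columns are assumed to match current pg_net, and pg_cron is assumed for scheduling.

```sql
-- Hypothetical bookkeeping table. Note it is a regular (logged) table, which
-- is exactly where pg_net's unlogged-table performance advantage is given up.
create table if not exists app.retryable_requests (
  request_id  bigint primary key,          -- id returned by net.http_post
  url         text        not null,
  body        jsonb       not null,
  attempts    int         not null default 0,
  next_try_at timestamptz not null default now()
);

create or replace function app.retry_failed_requests()
returns void
language plpgsql
as $$
declare
  r record;
begin
  for r in
    select q.request_id, q.url, q.body, q.attempts
    from app.retryable_requests q
    join net._http_response resp on resp.id = q.request_id
    where (resp.timed_out or resp.error_msg is not null)
      and q.attempts < 5                 -- cap total attempts
      and q.next_try_at <= now()
  loop
    update app.retryable_requests
    set request_id  = net.http_post(url := r.url, body := r.body),
        attempts    = r.attempts + 1,
        -- exponential backoff, capped at 8 hours
        next_try_at = now() + least(interval '30 minutes' * power(2, r.attempts),
                                    interval '8 hours')
    where request_id = r.request_id;
  end loop;
end;
$$;

-- Sweep every five minutes, for example:
-- select cron.schedule('pg_net_retries', '*/5 * * * *',
--                      'select app.retry_failed_requests()');
```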
