Hi, thank you for the quick response.
We have been chasing this issue ourselves for quite a while. For the past few months we only had user reports and screenshots of the duplicate messages, until a few days ago, when it hit our admins’ accounts simultaneously.
Here is what we know so far:
- We are not using the Expo server SDK; we wrote our own sender in Go on top of the standard net/http package. It sends a plain POST request with no retry (on error we log it and move on), yet the requests sent to the Expo server seem to be processed multiple times.
- Previously we sent about 2–3k pushes a day, dispatching every second (skipping when nothing is waiting in the queue), throttled to a maximum of 10 concurrent requests.
- Yesterday we lowered the throttle to a maximum of 2 concurrent requests and still got quite a few 502s (not sure whether any duplicates occurred; we have been building more detailed end-to-end monitoring on the client side).
- About 30 seconds after each send we check the push receipts with the Expo server, disable push tokens that have been deactivated, and stop sending to those devices.
- It tends to happen during our peak time (around 2pm).
- It seems to happen on days when we send more than 4,000 push notifications (which can temporarily mean 2–3 concurrent requests to the Expo server).
- It seems (we are still trying to confirm this) that on the days when this happens, we get a lot of this error from the Expo endpoint:
<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.17.8</center>\r\n</body>\r\n</html>
- I’m not 100% sure the 502s are the cause, but they could be.
Here is our plan:
We plan to build an end-to-end test: the server sends a push containing a timestamp to a test device every second; the device checks whether it ever receives a duplicate timestamp and reports any hit back to the server. The server will also log the times of the 502 errors, so we can see whether the duplicates cluster around them.
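The device-side half of that test is just bookkeeping over received timestamps. A rough Go sketch of the idea (names like `dupDetector` are made up for illustration; on the actual device this would live in the notification handler):

```go
package main

import "fmt"

// dupDetector is the test device's bookkeeping: it remembers every
// timestamp received in a push and flags repeats.
type dupDetector struct {
	seen map[string]int
}

func newDupDetector() *dupDetector {
	return &dupDetector{seen: make(map[string]int)}
}

// receive records one timestamp payload and reports true when that
// timestamp has already been seen, i.e. the push arrived twice.
func (d *dupDetector) receive(ts string) bool {
	d.seen[ts]++
	return d.seen[ts] > 1
}

func main() {
	d := newDupDetector()
	for _, ts := range []string{"10:00:01", "10:00:02", "10:00:02"} {
		if d.receive(ts) {
			// in the real test this would be reported to the server
			fmt.Println("duplicate push detected:", ts)
		}
	}
	// prints "duplicate push detected: 10:00:02"
}
```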
I can report back with the results in a few days.
In the meantime, if you have a script that sends >5k messages over a 24-hour period, with maybe 2–3 concurrent requests around 2pm (I'm not sure whether the bottleneck is the push server overall or the volume per account), you should be able to log some of these errors on your side too.
My guess is that this has something to do with the 502s. I wonder whether the nginx server in front of the push service is configured to retry failed requests against another upstream, which would resend the same request?