Impact
Between 3:30am PT and 7:08am PT chatgpt.com and a small number of APIs experienced elevated error rates.
Root Cause & Remediation
The root cause of the initial issue was a failure with Cosmos DB in ChatGPT. Our service provider recovered the affected DBs at 4:25am PT.
The second issue was caused by web service pods crash-looping due to their health checks failing, marking pods as unhealthy and subsequently not having enough resources available to service all the requests to the web layer. At this time we believed this was due to users retrying requests. These issues recovered after we implemented a fix to the web server’s health checking logic.
Prevention
We know that extended outages can impact your products, businesses, and lives. We will prioritize the preventative measures outlined above to continue improving our reliability.