Production API Error Elevation - Cloud provider issues

Incident Report for Stash Alloy

Resolved

We're pleased to share that all system latencies and performance metrics have returned to expected levels. All internal tests are passing, and we are no longer seeing errors related to the third-party component.

Our teams are working closely with AWS and the third-party vendor to conduct a comprehensive post-mortem of the incident. Please email support@alloy.com if you would like the root cause analysis (RCA) once it's available.

We appreciate your patience and understanding throughout this event.

Posted Oct 20, 2025 - 21:30 EDT

Monitoring

We’ve completed a release to production after confirming that all errors were resolved in staging. The production environment is now showing signs of recovery. We are actively monitoring progress.

Posted Oct 20, 2025 - 20:47 EDT

Update

Following a best-practice recommendation from our third-party vendor, we are deploying a change to our staging environment for validation and testing. We’ll proceed to production once we confirm the fix is effective. We’ll post the next update after we complete the staging release and confirm results, which we expect to be in about 30 minutes.

Posted Oct 20, 2025 - 20:17 EDT

Update

We believe we have identified a third-party component that is throttling requests as part of recovery efforts to mitigate the ongoing impact from the AWS outage. This throttling is contributing to elevated network latency across Alloy services.

We are optimistic that we can mitigate some of this impact with changes on our side and are actively working on those adjustments now. We will post another update at 8 PM ET or once substantial updates become available.

Posted Oct 20, 2025 - 18:41 EDT

Update

AWS continues to experience residual latency. All components are currently seeing elevated latency.

Our engineering team is recycling service pods, and we are beginning to see signs of recovery. We are also investigating additional areas to further mitigate the impact.

We will continue to closely monitor performance and share updates as they become available.

Posted Oct 20, 2025 - 17:57 EDT

Update

At this point, we again unfortunately have no substantial update to share.

We continue to investigate this with AWS at the highest priority.

Posted Oct 20, 2025 - 17:03 EDT

Update

We are continuing to investigate connectivity issues within our cloud services. Customers may also experience Dashboard slowness while the issue persists.
We’ll provide further updates as more information becomes available.

Posted Oct 20, 2025 - 16:31 EDT

Update

We are continuing to investigate connectivity issues within our cloud services. Customers may experience API timeouts while the issue persists.
We’ll provide further updates as more information becomes available.

Posted Oct 20, 2025 - 16:22 EDT

Investigating

Our monitoring has detected an increase in automated test failures, and our engineering team is currently investigating.

Posted Oct 20, 2025 - 15:42 EDT

Monitoring

Network connectivity issues have been resolved. The Alloy Engineering team has rebuilt internal queues, and all internal tests are now passing. Operations have been restored as of 2:40 PM ET.

We are continuing to monitor system performance to ensure full stability.

Posted Oct 20, 2025 - 14:59 EDT

Update

AWS has reported progress in EC2 recovery, but Alloy’s full recovery remains dependent on AWS systems stabilizing. At this time, we are observing the following:

- Webhooks: Backlog processing has begun, but a large queue remains. Real-time webhook delivery continues to experience delays.
- Journeys: Intermittent Journeys failures continue. Some Journey Applications are failing to write to S3 and will need to be retried once the incident is resolved.
- Customer Dashboard: Intermittent latency may occur when loading or navigating the dashboard.
- API: Intermittent latency persists for some requests.
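
For client code that has to tolerate the intermittent latency and failures listed above, a common pattern is retrying with exponential backoff and jitter. The following is a minimal sketch only: `TransientError`, `call_with_backoff`, and all default values are hypothetical names and numbers chosen for illustration, not part of the Alloy API or any recommendation from this report.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. a timeout or 5XX)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with capped backoff.

    The delay ceiling doubles on each attempt (base_delay, 2x, 4x, ...)
    up to max_delay; the actual wait is drawn uniformly from
    [0, ceiling] so that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the behavior can be exercised without real waiting. Retrying is only safe for operations that are idempotent or otherwise safe to repeat.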

Our team continues to closely monitor AWS recovery and system performance. We’ll provide further updates as additional progress is made, at intervals as close to 30 minutes as possible.

Posted Oct 20, 2025 - 14:32 EDT

Update

At this point, we unfortunately have no substantial update to share.

We continue to investigate this with AWS at the highest priority.

Thank you for your ongoing patience. We will keep you updated every 30 minutes or as more substantial information becomes available.

Posted Oct 20, 2025 - 13:35 EDT

Update

AWS continues to investigate ongoing EC2 and networking issues. Alloy services remain impacted.

Our team remains in contact with AWS and will share updates every 30 minutes or as more substantial information becomes available.

Posted Oct 20, 2025 - 13:07 EDT

Update

We are still investigating the issue.

AWS has implemented additional mitigation steps and is observing some recovery, but operations remain degraded and have not been fully restored.

Alloy systems have not yet stabilized, and we are working closely with AWS support to restore full functionality.

Posted Oct 20, 2025 - 12:31 EDT

Investigating

We are still addressing the issues following the recent AWS outage - you can follow their status at https://health.aws.amazon.com/health/status.

Customers may experience issues in the following areas:

- API: We are investigating intermittent unavailability in the API. Requests may return 5XX errors.
- Journeys and Webhooks: Latency is ongoing as the Alloy Engineering team continues to work on resolution. Any Journey Applications that were in a pending state at 11:29 AM ET will not complete successfully, and any associated asynchronous tasks have been lost. Any Journey Application that was run during this time and is not in a terminal state (e.g., Approved or Denied), such as those waiting on step-ups or webhook actions, will remain in an incomplete state. API calls for these applications will need to be re-run.
- Logging into Alloy and Dashboard: Intermittent latency may occur while logging in, navigating the dashboard, or submitting reviews. Some users may intermittently see gateway timeout errors or blank screens.
- Third party integrations: Services that depend on AWS may also experience degraded performance, causing applications to fail or return partial results.
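
To work out which applications need to be re-run, one approach is to filter your own records for applications created during the affected window that are not in a terminal state. The sketch below is illustrative only: the `JourneyApplication` record shape, its field names, and the window bounds are hypothetical and supplied by the caller, and the terminal statuses are limited to the two examples named above.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumption: only the two terminal states named in this update.
# Your account may define additional terminal outcomes.
TERMINAL_STATUSES = {"Approved", "Denied"}


@dataclass
class JourneyApplication:
    """Hypothetical local record of an application; not an Alloy API type."""
    token: str
    status: str
    created_at: datetime


def needs_rerun(app, window_start, window_end):
    """True if the application ran in the window and never reached a terminal state."""
    in_window = window_start <= app.created_at <= window_end
    return in_window and app.status not in TERMINAL_STATUSES


def select_for_rerun(apps, window_start, window_end):
    """Return tokens of applications to re-run, in their original order."""
    return [a.token for a in apps if needs_rerun(a, window_start, window_end)]
```

The window bounds are left as parameters because this update states only that applications pending at 11:29 AM ET are affected, not an exact start or end time.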

Our engineering team is actively working to restore all services to full capacity as quickly as possible. We’ll continue to provide updates as more information becomes available.

Posted Oct 20, 2025 - 12:03 EDT

Identified

We are currently experiencing degraded performance affecting both the API and Dashboard. The issue is linked to the ongoing AWS outage - you can follow their status here: https://health.aws.amazon.com/health/status. Third party integrations are also affected by the outage.

Our engineering team is actively working to restore all services to full capacity as quickly as possible. We'll continue to provide updates as more information becomes available.

Posted Oct 20, 2025 - 11:14 EDT

This incident affected: Production API, Customer Dashboard, and Webhooks.