Post-mortem: March 30th, 2022 Downtime
Today at around 13:20 UTC, we experienced downtime due to networking issues and several instance terminations at our cloud provider, in our Region/AZ.
The issue was resolved at 14:30 UTC.
At 15:06 UTC, sync failures occurred for another 8 minutes, and the issue was permanently resolved at 15:14 UTC.
The initial investigation was completed at 16:00 UTC.
Background
Aniview's serving infrastructure is a combination of hundreds of servers working cohesively with real-time databases to deliver the best possible performance in real time.
Every update made in the UI is immediately propagated to our serving instances, and every impression or piece of data we receive is immediately propagated for serving and analytics. This is what makes our product perform in a superior manner.
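To illustrate the propagation model in simplified form (the components and names below are hypothetical and are not our actual systems), an update is published once and immediately fanned out to both the serving and analytics paths:

```python
# Simplified, hypothetical sketch of the fan-out described above:
# a single event (a UI change or an impression) is pushed to every
# subscriber, so serving and analytics see it immediately.
from typing import Callable, Dict, List

Event = Dict[str, str]
Subscriber = Callable[[Event], None]

class EventBus:
    def __init__(self) -> None:
        self.subscribers: List[Subscriber] = []

    def subscribe(self, handler: Subscriber) -> None:
        self.subscribers.append(handler)

    def publish(self, event: Event) -> None:
        # Propagate the event to every consumer as soon as it arrives.
        for handler in self.subscribers:
            handler(event)

bus = EventBus()
bus.subscribe(lambda e: print("serving cache updated:", e))
bus.subscribe(lambda e: print("analytics recorded:", e))
bus.publish({"type": "ui_update", "entity": "ad_tag", "id": "1234"})
```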
Today at around 13:20 UTC, there was a network disruption in one of our cloud provider's data centers. This caused a chain reaction in which data required for serving went missing.
We discovered an anomaly one minute after the issue had begun and started investigating the root cause.
It took us another 13 minutes (until 13:34 UTC) to identify the networking issue and the termination of our real-time database instances.
Our next step was to reload the data back into the serving infrastructure, but due to networking problems and the high synchronization load, this failed.
After 20 minutes, at 13:55 UTC, we decided not to wait for the serving infrastructure to resync from the old cached data, so we purged the data and reloaded it.
It took another 25 minutes to redeploy all of the cache servers and the serving infrastructure, and 7 minutes later, at 14:30 UTC, the data was fully loaded back.
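For readers interested in what "purge and reload" involves, the sketch below shows the idea in simplified form: the stale cache is flushed and then repopulated from the primary data store. The connection details, keys, and loader function are hypothetical and do not reflect our production tooling:

```python
# Hypothetical sketch of a purge-and-reload step: flush the stale cache,
# then repopulate it from the primary (source-of-truth) database.
import redis  # requires the redis-py package

def purge_and_reload(cache: redis.Redis, load_from_primary) -> int:
    """Flush the cache and reload every record from the primary store.

    `load_from_primary` is assumed to return an iterable of (key, value)
    pairs read from the source-of-truth database.
    """
    cache.flushdb()  # drop the stale or partial data
    reloaded = 0
    pipe = cache.pipeline()
    for key, value in load_from_primary():
        pipe.set(key, value)
        reloaded += 1
    pipe.execute()
    return reloaded

# Example usage with made-up data standing in for the primary database:
# cache = redis.Redis(host="localhost", port=6379)
# purge_and_reload(cache, lambda: [("tag:1234", "{...}"), ("tag:5678", "{...}")])
```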
While monitoring the servers, we saw that some of them were still out of sync, so between 15:06 and 15:14 UTC we loaded the entire data set again, and the issue was permanently resolved.
Mitigation & Next Steps
In unpredictable situations like these, we always look for ways to improve and to make our infrastructure more resilient and fault tolerant to whatever surprises may come.
We want to assure you, our partners, that we learn from these issues and will perform better next time.
We have already deployed several improvements to our sync capabilities, so that recovery from a network or data disaster can be completed in minutes rather than in half hours or hours.
We have improved our ability to clone almost any component to any region or cloud provider more quickly, so we can work around cloud providers' internal issues.
We are working to improve our communication during such incidents and to provide more visibility, not only through our management console.
Our R&D and Support teams are still working on internal processes to minimize the risks involved and to improve our resolution time.
Summary
Aniview maintains strong SLAs for its Serving and Analytics components.
Our Analytics SLA is 99.95%.
Our Serving SLA is 99.9%, and last year we closed the year with over 99.99% actual uptime.
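As a rough illustration of what these percentages mean in practice, the short snippet below converts an uptime SLA into the maximum downtime it allows over a year (an illustrative calculation only, not part of our tooling):

```python
# Illustrative only: convert an uptime SLA percentage into the
# maximum downtime it allows over a full (non-leap) year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def allowed_downtime_hours(sla_percent: float) -> float:
    """Maximum yearly downtime (in hours) permitted by an uptime SLA."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)

for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% uptime -> {allowed_downtime_hours(sla):.2f} hours of downtime per year")
```

At 99.9% this works out to roughly 8.8 hours of allowed downtime per year, at 99.95% roughly 4.4 hours, and at 99.99% under an hour.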
We work as hard as we can to make everything that you, as a valued partner, do with us matter. It is important to us, we do not take it lightly, and we invest our utmost resources in remaining market leaders.
Thank you for your understanding; we don’t take it for granted.