We apologize for this incident. We know that you choose to run your applications on Google App Engine to obtain flexible, reliable, high-performance service, and in this incident we have not delivered the level of reliability for which we strive. Our engineers have been working hard to analyze what went wrong and ensure incidents of this type will not recur. In addition, some Flexible Environment applications were unable to deploy new versions during this incident. As part of this procedure, we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources.
|Published (Last):||23 March 2016|
We sincerely apologize for the impact of this incident on your application or service. We recognize the severity of this incident and will be undertaking a detailed review to fully understand the ways in which we must change our systems to prevent a recurrence.
Some customers experienced elevated Datastore latency and errors while Memcache was unavailable. At this time, we believe that all the Datastore issues were caused by surges of Datastore activity due to Memcache being unavailable.
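To illustrate why a cache outage can turn into a Datastore surge, here is a minimal sketch of a read path that falls back to the datastore whenever the cache cannot answer. Every name in it (`read_through`, `CacheUnavailable`, `count_datastore_reads`) is hypothetical and is not the App Engine API; it assumes applications were reading through Memcache in roughly this way, which the report does not state explicitly.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache service cannot be reached."""


def read_through(key, cache_get, datastore_get):
    """Return the value for key, preferring the cache over the datastore."""
    try:
        value = cache_get(key)
    except CacheUnavailable:
        value = None  # treat an outage like a cache miss
    if value is None:
        # Every miss or outage becomes a datastore read.
        value = datastore_get(key)
    return value


def count_datastore_reads(keys, cache_get, datastore):
    """Serve each key once and report how many reads reached the datastore."""
    reads = 0

    def datastore_get(key):
        nonlocal reads
        reads += 1
        return datastore[key]

    for key in keys:
        read_through(key, cache_get, datastore_get)
    return reads
```

With a warm cache, none of the requests in this sketch reach the datastore; with the cache unavailable, all of them do, which is the kind of load shift the paragraph above describes. The sketch deliberately omits cache repopulation and request throttling.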
When the outage ended, a surge in traffic placed elevated load on Datastore servers. Some applications in the US experienced elevated latency on gets between and , and elevated latency on puts between and
ROOT CAUSE
The App Engine Memcache service requires a globally consistent view of the current serving datacenter for each application in order to guarantee strong consistency when traffic fails over to alternate datacenters.
The configuration which maps applications to datacenters is stored in a global database. The incident occurred when the specific database entity that holds the configuration became unavailable for both reads and writes following a configuration update. App Engine Memcache is designed in such a way that the configuration is considered invalid if it cannot be refreshed within 20 seconds.
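The 20-second validity rule can be sketched as a client-side holder that refuses to serve a configuration it has not refreshed recently enough. This is an illustration under stated assumptions, not Memcache's actual implementation: the class `MappingConfigCache`, the exception `ConfigUnavailable`, and the `fetch` callable are hypothetical, and only the 20-second figure comes from the report.

```python
import time

REFRESH_DEADLINE_SECONDS = 20  # validity window stated in the report


class ConfigUnavailable(Exception):
    """Hypothetical error: no sufficiently fresh configuration is held."""


class MappingConfigCache:
    """Hypothetical holder for the application-to-datacenter mapping."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch  # callable that reads the global database
        self._clock = clock  # injectable so the rule can be tested
        self._config = None
        self._last_refresh = None

    def refresh(self):
        """Try to re-read the mapping; report whether the read succeeded."""
        try:
            config = self._fetch()
        except Exception:
            return False  # keep the old copy, which keeps ageing
        self._config = config
        self._last_refresh = self._clock()
        return True

    def get(self):
        """Return the mapping, or raise if it is older than the deadline."""
        if self._last_refresh is None:
            raise ConfigUnavailable("configuration was never fetched")
        age = self._clock() - self._last_refresh
        if age > REFRESH_DEADLINE_SECONDS:
            raise ConfigUnavailable("configuration is stale")
        return self._config
```

In this sketch a failed refresh does not fail requests immediately; the holder keeps serving the last good copy until it ages past the deadline, after which every `get` raises. That mirrors the trade-off described above, where a stale mapping is treated as worse than no mapping because it could break strong consistency during failover.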
When the configuration could not be fetched by clients, Memcache became unavailable. Following normal practices, our engineers immediately looked for recent changes that may have triggered the incident. At , we attempted to revert the latest change to the configuration file.
This configuration rollback required an update to the configuration in the global database, which also failed. At , engineers were able to update the configuration by sending an update request with a sufficiently long deadline. This caused all replicas of the database to synchronize and allowed clients to read the mapping configuration. As a temporary mitigation, we have reduced the number of readers of the global configuration, which avoids the contention during writes that led to the unavailability during the incident.
Engineering projects are already under way to regionalize this configuration and thereby limit the blast radius of similar failure patterns in the future. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation. This is the final update for this incident.
Nov 06, The Memcache service is still recovering from the outage. The rate of errors continues to decrease and we expect a full resolution of this incident in the near future.
Nov 06, The issue with Memcache and MVM availability should be resolved for the majority of projects and we expect a full resolution in the near future. At this time we are gradually ramping up traffic to Memcache and we see that the rate of errors is decreasing.
Other services affected by the outage, such as MVM instances, should be normalizing in the near future. Our Engineering Team believes they have identified the root cause of the errors and is working to mitigate.
Current data indicates that all projects using Memcache are affected by this issue. For everyone who is affected, we apologize for any inconvenience you may be experiencing.
We are investigating an issue with Google App Engine and Memcache.
Google App Engine Incident #19013