Service Image Registry seems to be down
On 2022-05-30, between 14:38 and 15:23 UTC+2 (CEST), an incident in the cloudscale.ch - LPG 2 zone led to limited service quality on APPUiO Cloud.
The root cause was that the load balancer hit a configured limit. This limit protects APPUiO Cloud resources from over-usage and guards against denial-of-service scenarios: it keeps the load balancer from maintaining more connections than the server’s memory resources allow, which would block all access to the systems. In this situation, it caused the load balancer to stop accepting some connections. As a result, some services running on the platform experienced reduced service quality, including some customer applications as well as VSHN internal and external services.
We received the first alerts at 14:38 UTC+2, and our Responsible Ops started investigating the issue. After the initial situation assessment, a task force of 4 engineers worked to narrow down the issue and find the root cause. At 15:23, a service restart, the best workaround available at that point, allowed us to stabilize the services.
The task force then actively monitored the situation and kept working on the root cause to provide a permanent solution. At 17:31, the configuration change was applied to the load balancer, putting the permanent solution in place.
In the aftermath of the incident, we found that the configured limit was very restrictive and was triggered by only slightly abnormal traffic. With limits like this, it is always difficult to find the balance between protecting the platform and being overprotective. In this case, we were too aggressive. As a follow-up, we will implement better monitoring for the configuration limits so that we get an early warning if they are too strict.
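To illustrate the kind of early warning meant here, the following is a minimal sketch of a check that compares the current connection count against the configured limit. It is not our actual monitoring setup: the function name `check_connection_limit`, the `LimitStatus` type, and the 70% and 90% thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LimitStatus:
    usage_ratio: float  # current connections divided by the configured limit
    level: str          # "ok", "warning" or "critical"


def check_connection_limit(current: int, limit: int,
                           warn_at: float = 0.7,
                           crit_at: float = 0.9) -> LimitStatus:
    """Classify how close the connection count is to the configured limit.

    The thresholds are illustrative assumptions, not values from the incident.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = current / limit
    if ratio >= crit_at:
        level = "critical"
    elif ratio >= warn_at:
        level = "warning"
    else:
        level = "ok"
    return LimitStatus(usage_ratio=ratio, level=level)
```

A check like this would raise a warning while connections are still being accepted, which leaves time to review the limit before the load balancer starts refusing traffic.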
This incident was handled internally with the ticket ISMS-1070.
Service is back up.
According to our monitoring system, this service has become unresponsive. We’re investigating.