We had two issues following our release. They were investigated at around the same time, but their causes and fixes were different. One prevented some customers from uploading or previewing Shared Health Summaries; the other prevented some customers from logging in.
Impact
Shared Health Summary. All customers were unable to upload or preview Shared Health Summaries. The logs show that 10 customers encountered the issue because they used that area of Helix during this time.
Login. Customers who had not previously logged in were unable to log in. They saw a loading spinner that never completed, which is what was reported. The issue may also have manifested in other ways, such as an inability to proceed.
Response
The SRE team contacted other engineering team members on Teams to investigate promptly.
Timeline
All times in 24h AEDT
Root Cause
Shared Health Summary. Caused by a mismatch in the library required by components within Helix.
Login. Caused by an unhealthy server instance that was created automatically in response to load.
Resolution
Shared Health Summary. We fixed the issue within 1 hour and 30 minutes. However, we decided to deploy the hotfix at 22:00 on 16/09/21.
Login. We fixed the issue within 8 minutes by restarting the app service.