Degraded performance on Show My Homework
Incident Report for Satchel One
Postmortem

Cause

With the vast majority of schools having returned over the past week or so, the Engineering team at Satchel anticipated increasing activity on our website. Our examination of past activity on the site, both for last week and for last year, suggested we had sufficient capacity within our infrastructure to accommodate the expected traffic.

Despite this, the amount and scale of user activity on Show My Homework on the afternoon of 11th September caused our database performance to degrade to the point where we were no longer able to serve any information to users. Furthermore, as performance slowed, users could not log in either to the website or to our mobile applications. In short, part of our database infrastructure ground to a standstill. The changes made on the 11th to overcome the performance issues proved insufficient, as witnessed by the outage on the 12th.

Resolution

Underpinning Show My Homework is a collection of API Services which ultimately reference information inside the Show My Homework database. One of the challenges for any heavily used site like ours is to figure out the infrastructure needed to run these services in such a way that they respond quickly to user requests whilst at the same time not overwhelming our database. Our existing configuration suggested we had the balance between user activity, available API services and our database capacity correct; clearly, however, we got this wrong.

To summarise in brief the changes made over the duration of the two outages:

* We introduced a further three read replicas on our MySQL database to handle incoming reads. This effectively means we can now handle substantially more database queries at a given point in time compared with the available capacity at the start of the week. Part of the delay experienced by users was in getting these built and made ready for use by the site.
* We increased the infrastructure running the API services by 50%, again allowing us to handle more user requests from the website.
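For readers unfamiliar with read replicas, the general idea can be sketched as follows. This is an illustration only and not a description of how Show My Homework is actually built: the class name, the host names and the round-robin policy are all assumptions made for the example. Writes go to a single primary database, while reads are spread across the replicas so that no one server has to answer every query.

```python
import itertools


class ReadWriteRouter:
    """Illustrative router: writes go to the primary, reads rotate over replicas.

    Hypothetical sketch; the names and the round-robin policy are assumptions.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._replicas = list(replicas) or [primary]
        self._cycle = itertools.cycle(self._replicas)

    def target_for(self, sql):
        """Return the host that should run this statement."""
        words = sql.split()
        first_word = words[0].upper() if words else ""
        # Naive classification: only plain SELECT statements count as reads.
        # A real system would also need to account for transactions,
        # locking reads and replication lag.
        if first_word == "SELECT":
            return next(self._cycle)
        return self.primary


# Hypothetical host names: one primary plus three read replicas.
router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2", "db-replica-3"])
reads = [router.target_for("SELECT * FROM homework") for _ in range(4)]
write = router.target_for("UPDATE homework SET title = 'x' WHERE id = 1")
```

In this sketch each additional replica takes a share of the read traffic, which is why adding replicas raises the number of queries that can be served at once. Writes still depend on the single primary.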

At no time was any data lost or compromised.

Post-Mortem

Despite our planning, we underestimated the amount of activity we were likely to receive, especially when schools started the new academic year. Therefore, the key lesson learnt by Satchel is in recognising that our assessment of available operational capacity was insufficient. In addition, whilst we have done much over the past year to improve site performance, there are still parts of the website that need to be tuned to improve performance and reduce the occurrence of such issues as we experienced yesterday. The Engineering team will ensure the work in optimising our site continues.

We believe what we now have in place will be sufficient for our needs over the coming months, but longer term, as we continue to grow our user base and product line, we recognise that we need a more scalable solution. This is something the CTO will ensure the Engineering team puts in place well in advance of September 2018.

Again, please accept our apologies for yesterday’s outage; Satchel appreciates that at this early point in the academic year, the problems with Show My Homework were unacceptable to all our users. The Engineering team here will continue to work to improve matters and ensure there is no repeat of this issue.

Posted Sep 13, 2017 - 22:05 BST

Resolved
Following incidents on both Monday 11th and Tuesday 12th, we have now revised aspects of our infrastructure to ensure we are able to accommodate user traffic on the site. All aspects of the site are operational.

A full post-mortem regarding these incidents has been circulated to all Show My Homework schools.

Again, please accept our apologies for this outage; Satchel appreciates that at this early point in the academic year, the problems with Show My Homework were unacceptable to all our users. The Satchel Engineering team will continue to work to improve matters and ensure there is no repeat of this issue.
Posted Sep 13, 2017 - 21:58 BST
Monitoring
A number of changes have been made to the operational environment to handle increased traffic volumes to Show My Homework. The site is now accessible but you may see the occasional reduction in performance. The Satchel Engineering team are monitoring the situation to ensure no further downtime is experienced but all features are available for use.
Posted Sep 12, 2017 - 20:10 BST
Investigating
We are investigating a degradation in performance of the site; users will experience difficulty either logging into the site or using it. This issue also affects users of the SMHW mobile applications at present. Please accept our apologies for this interruption to service.
Posted Sep 12, 2017 - 16:24 BST