With the vast majority of schools having returned over the past week or so, the Engineering team at Satchel anticipated increased activity on our website. Our examination of past activity on the site, both for last week and for last year, suggested we had sufficient capacity within our infrastructure to accommodate the expected traffic.
Despite this, the volume of user activity on Show My Homework on the afternoon of 11th September degraded our database performance to the point where we were no longer able to serve any information to users. As performance slowed, users could not log in to either the website or our mobile applications. In short, part of our database infrastructure ground to a standstill. The changes we made on the 11th to overcome the performance issues proved insufficient, as shown by the outage on the 12th.
Underpinning Show My Homework is a collection of API Services which ultimately reference information inside the Show My Homework database. One of the challenges for any heavily used site like ours is to work out the infrastructure needed to run these services so that they respond quickly to user requests without overwhelming our database. Our existing configuration suggested we had the right balance between user activity, available API services and our database capacity. Clearly, we got this wrong.
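As a rough illustration of the balance described above, the sketch below checks whether a fleet of API instances could, in the worst case, open more database connections than the database is configured to accept. All names and numbers are hypothetical and are not taken from Satchel's actual configuration; it only shows the kind of arithmetic involved.

```python
def peak_connections(api_instances: int, pool_size_per_instance: int) -> int:
    """Worst-case connections if every API instance fills its connection pool."""
    return api_instances * pool_size_per_instance


def has_headroom(
    api_instances: int,
    pool_size_per_instance: int,
    db_connection_limit: int,
    reserve_percent: int = 20,
) -> bool:
    """Return True if peak API demand fits within the database connection limit,
    keeping reserve_percent of the limit free for admin and background work.
    The 20% default is an arbitrary illustrative value."""
    usable = db_connection_limit * (100 - reserve_percent) // 100
    return peak_connections(api_instances, pool_size_per_instance) <= usable


# Hypothetical figures: 10 API instances, 20 pooled connections each,
# and a database that accepts at most 300 connections.
print(has_headroom(10, 20, 300))
# Adding 50% more API instances without adding database capacity
# raises peak demand from 200 to 300 connections.
print(has_headroom(15, 20, 300))
```

The point of the check is that adding API capacity on its own can move the bottleneck to the database, so the two generally need to be sized together.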
To summarise in brief the changes made over the duration of the two outages:

* We introduced a further three read replicas on our MySQL database to handle incoming reads. This effectively means we can now handle substantially more database queries at a given point in time compared with the available capacity at the start of the week. Part of the delay experienced by users was in getting these built and made ready for use by the site.
* We increased the infrastructure running the API services by 50%, again allowing us to handle more user requests from the website.
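For readers unfamiliar with read replicas, the following is a minimal sketch of how an application might spread read queries across replicas while sending writes to the primary database. It is an assumption-laden illustration, not a description of how Show My Homework routes its queries: the `ReadWriteRouter` class, the server names, and the simple "starts with SELECT" rule are all hypothetical.

```python
from itertools import cycle


class ReadWriteRouter:
    """Hypothetical router: reads go round-robin to replicas, all else to the primary."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        # cycle() yields the replicas in order, forever; None means no replicas exist.
        self._replicas = cycle(replicas) if replicas else None

    def route(self, sql: str) -> str:
        """Return the name of the server that should run this statement."""
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        if is_read and self._replicas is not None:
            return next(self._replicas)
        # Writes (and anything we cannot classify) go to the primary.
        return self.primary


# Illustrative server names only.
router = ReadWriteRouter("primary", ["replica-1", "replica-2", "replica-3"])
print(router.route("SELECT * FROM homework WHERE id = 1"))
print(router.route("UPDATE homework SET title = 'x' WHERE id = 1"))
```

Each additional replica takes a share of the read traffic, which is why adding replicas raises the number of queries that can be served at once. A production setup would also need to account for replication lag, since a replica may briefly return data that is older than what is on the primary.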
At no time was any data lost or compromised.
Despite our planning, we underestimated the amount of activity we were likely to receive, especially as schools started the new academic year. The key lesson for Satchel is therefore that our assessment of available operational capacity was insufficient. In addition, whilst we have done much over the past year to improve site performance, there are still parts of the website that need to be tuned to improve performance and reduce the occurrence of issues such as those we experienced yesterday. The Engineering team will ensure the work on optimising our site continues.
We believe what we now have in place will be sufficient for our needs over the coming months, but longer term, as we continue to grow our user base and product line, we recognise that we need a more scalable solution. This is something the CTO will ensure the Engineering team puts in place well in advance of September 2018.
Again, please accept our apologies for yesterday’s outage; Satchel appreciates that at this early point in the academic year, the problems with Show My Homework were unacceptable to all our users. The Engineering team here will continue to work to improve matters and ensure there is no repeat of this issue.