Our main website for work is hosted in the cloud and maintained by a company we’ve worked with for over 15 years. It’s generally very reliable, and so are the periodic patches and updates they apply.
There was a patch scheduled for a recent Wednesday night and I made a mental note to check on things on Thursday morning – just in case. This check is usually nothing more than logging in to the website and making sure things are where I expect them to be – then updating the team.
Thursday morning I opened my email to a flurry of system messages indicating an extended downtime – much longer than we usually see. I quickly checked, and the system let me log in; we appeared to be up and running. So my message to the team was that things looked a little bumpier than usual and that we should be extra vigilant in our review.
And then things started to get weird.
I edited one page just fine, but the second page wouldn’t save. I could download an image, but not upload it to a new folder – though if I changed the image format and tried again, it would work. Some pages would update and some wouldn’t.
I reached out to my colleagues and their testing was all over the map. Some things that worked for me didn’t work for them. In a couple of cases, an entire function wouldn’t load.
Then I got a note from another editor on campus reporting a new problem.
But there was no pattern. My notes continued to grow with problems and exceptions – making it challenging to figure out even how to report the issue. I opened a ticket with our support, assumed the patch was to blame, and shared all the conflicting notes.
I updated the team and we continued testing. It wasn’t yet time to update all the editors across campus, but I was starting to think about that email.
Support got back to me asking for screenshots and noting they were looking into it – which suggested that this was just us, and not a buggy patch they would roll back for all their clients.
I sent out a note to the editors and explained where we were and that they should stop editing until we had more information. I promised to keep them in the loop.
My boss was out that week, so I sent a note to his peers to update them in case they got any questions.
And then, things got worse.
I got another update from support – another person this time – reporting that the night before (and continuing into the day) our website had been the subject of a DDoS attack.
DDoS stands for Distributed Denial of Service – an attack where hundreds or thousands of compromised computers in a botnet are directed to send potentially millions of requests to a website.
That’s usually more than enough to crash the site, and to keep it down until the attackers stop.
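The “distributed” part is what makes these attacks so hard to filter. Here’s a minimal sketch – my own illustration with made-up numbers, not anything specific to our setup – of why a naive per-IP rate limiter, the obvious first defense, barely helps when the load is spread across thousands of machines:

```python
from collections import defaultdict

PER_IP_LIMIT = 10        # req/sec allowed per client IP (assumed)
SERVER_CAPACITY = 5_000  # req/sec the site can actually serve (assumed)

def admitted_load(n_bots: int, rate_per_bot: int) -> int:
    """Requests per second that slip past a naive per-IP rate limiter."""
    per_ip_counts = defaultdict(int)
    admitted = 0
    for bot in range(n_bots):
        for _ in range(rate_per_bot):
            if per_ip_counts[bot] < PER_IP_LIMIT:
                per_ip_counts[bot] += 1
                admitted += 1
    return admitted

# 2,000 compromised machines, each staying politely under the per-IP cap:
load = admitted_load(n_bots=2_000, rate_per_bot=10)
print(f"{load:,} req/s admitted vs. {SERVER_CAPACITY:,} req/s capacity")
# -> 20,000 req/s admitted vs. 5,000 req/s capacity: the site falls over.
```

No single attacker ever trips the limit, but the combined total is still four times what the server can handle.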
Our support had been battling this overnight and managed to restore some degree of stability by blocking traffic from 14 different countries. That cut the attack roughly in half and got the site up and running again, but since the attack was ongoing, editing remained unstable.
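Country-level blocking is a blunt instrument, but the idea itself is simple. A rough sketch – I don’t actually know what tooling our host used; `country_of()` here is a hypothetical stand-in for a GeoIP lookup such as MaxMind’s GeoLite2, and the country codes are placeholders:

```python
BLOCKED_COUNTRIES = {"AA", "BB"}  # placeholders; they blocked 14 in total

def should_block(client_ip: str, country_of) -> bool:
    """Drop the request if its source IP resolves to a blocked country."""
    return country_of(client_ip) in BLOCKED_COUNTRIES

# Toy lookup table standing in for a real GeoIP database:
demo_geoip = {"203.0.113.7": "AA", "198.51.100.9": "US"}.get
print(should_block("203.0.113.7", demo_geoip))   # True  -> dropped
print(should_block("198.51.100.9", demo_geoip))  # False -> served
```

The bluntness is the catch: every legitimate visitor from a blocked country disappears along with the bots – which, as you’ll see, came back to bite us.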
And they had tried to notify us of the problem by emailing a staff member who hadn’t worked at UA in over five years. Sigh.
I was supposed to lead a training session/office hours meeting with about 30 people that afternoon, but knew there was no way I could do that and keep on top of the problem.
I hadn’t created the meeting so I couldn’t cancel it in the system, but sent out an email to the potential participants to let them know the situation – and promised to join the call anyway for a few minutes in case anyone missed the cancellation notice.
I updated the team again, then got on a call with support about the situation and our options – which ran long, so I had to have a colleague hop on the training call in my place to explain what was going on.
With the support call finished, I joined the training call and used that time to copy the information from the support tickets into a separate system used by our IT department.
When I ended that call, our IT department reached out and suggested that the solution support had proposed sounded like a scam.
The only good news was that, during the support call, the attack appeared to have halted – for the time being.
I sent another note to the editors explaining where we were and told them I would do more testing in the evening. I finished up the workday with a call to the team on our next steps.
After a short break to eat, I logged back in, did another round of testing, and tried to catch up on updates.
On Friday, I did more testing with the team and then sent a note to the editors to let them back into the system.
As the morning went on, support started to pressure me to make the switch to the system they had proposed. My own research turned up some serious problems with it, and I needed a consultation with our IT department to clarify them. We ultimately decided to leave things as they were over the weekend and regroup on Monday.
In the meantime, a prospective law student in another country reached out and reported through our channels that they couldn’t access our website.
Support confirmed that the error they were seeing matched their country being on the blocked list. I explained what was going on and promised a quick resolution.
It was an exhausting couple of days, but I did get some good feedback from the editors about how I handled the situation and I did my best with what I had at every step.
Still, I feel responsible as the caretaker and webmaster of the site.
I had some homework over the weekend to get a better handle on the solution – a CDN (Content Delivery Network) – and we regrouped on Monday to keep working the problem.
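The short version of why a CDN helps, in a toy model of my own (nothing vendor-specific): requests hit the CDN’s edge servers first, and anything the edge can answer from its cache never reaches the origin server at all.

```python
cache: dict[str, str] = {}  # the edge cache, keyed by request path

def origin_fetch(path: str) -> str:
    """Stub standing in for the real origin server doing expensive work."""
    return f"<html>content for {path}</html>"

def edge_request(path: str) -> str:
    """Serve from the edge cache when possible; fall back to the origin."""
    if path not in cache:
        cache[path] = origin_fetch(path)  # only a cache miss hits the origin
    return cache[path]

# A million identical attack requests now cost the origin exactly one fetch.
for _ in range(1_000_000):
    edge_request("/admissions")
```

The flood lands on the CDN’s much bigger shoulders instead of ours.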
Though, I suspect I won’t sleep well until we’re finally past this hurdle since we could easily get attacked again – and there’s no way of knowing why.
Update: the CDN is working, though not without some bumps. Our log-in stopped working for a bit and we had to work around that. There’s also an update delay we’re trying to get used to. And I’ve got some subsites to add to a list before they’ll resolve.
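Both bumps are classic caching growing pains, and in the toy model above the usual fix is easy to see – bypass the cache for dynamic paths. The prefixes here are my guesses, not our actual rules:

```python
cache: dict[str, str] = {}
NEVER_CACHE = ("/login", "/admin", "/edit")  # assumed dynamic paths

def origin_fetch(path: str) -> str:
    return f"<html>content for {path}</html>"

def edge_request(path: str) -> str:
    """Dynamic paths skip the cache; everything else is cached."""
    if path.startswith(NEVER_CACHE):
        return origin_fetch(path)  # always fresh, so log-ins keep working
    if path not in cache:
        cache[path] = origin_fetch(path)  # cached once, then reused – hence the update delay
    return cache[path]
```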
But, we’re generally safer and now more stable.
Though, I kinda wish it had just been a buggy patch we could roll back.