[dba-Tech] backup, backup, backup

John Bartow jbartow at winhaven.net
Thu Sep 27 16:12:47 CDT 2018


wow

-----Original Message-----
From: dba-Tech <dba-tech-bounces at databaseadvisors.com> On Behalf Of Peter Brawley
Sent: Saturday, September 22, 2018 11:35 AM
To: Discussion of Hardware and Software issues <dba-tech at databaseadvisors.com>
Subject: [dba-Tech] backup, backup, backup

If you rely on hosting provider backups, think again. I got the email quoted below from a hosting provider with a good reputation that I use as a backup provider ...

PB

While I was hoping to save some of this for the official RFO [Reason For Outage] - enough people are getting tremendously upset over this that I'm going to spell out what I can now - keeping in mind that I will provide more details when I can.

**What happened?**

First and foremost - this failure is not something that we planned on or expected. A server administrator, the most experienced administrator we have, made a big mistake. During some routine maintenance where they were supposed to perform a _file system trim_ they mistakenly performed a _block discard_.

**What does this mean?**

The server administrator essentially told our storage platform to drop all data rather than simply dropping data that had been marked as _deleted_ by our servers.

**Why is restoration taking so long?**

Initially we believed that only the primary operating system partition of the servers was damaged - so we worked to bring new machines online to connect to our storage to bring accounts back online. Had our initial belief been correct - we'd have been back online in a few hours at most.

As it turns out our local data was corrupted beyond repair - to the point that we could not even mount the file systems to attempt data recovery.

Normally we would rely on snapshots in our storage platform - simply mounting a snapshot from prior to the incident and booting servers back up. It would have taken minutes - if maybe an hour. We are not sure as of yet, and will need to investigate, but snapshots were disabled. I wish I could tell you why - and I wish I knew why - but we don't know yet and will have to look into it.

We are working to restore cPanel backups from our off-site backup server in Phoenix, Arizona. While you would think the distance and connectivity was the issue - the real issue is the amount of I/O that backup server has available to it. While it is a robust server with 24 drives - it can only read so much data so fast. As these are high capacity spinning drives - they have limits on speed.

Our disaster recovery server is our **last resort** to restore client data and, as it stands, is the _only_ copy we have remaining of all client data - except that which has already been restored which is back to being stored in triplicate.

**What will you do to prevent this in the future?**

We have, as we've been working on this and running into issues getting things back online quickly, been discussing what changes we need to make to ensure that this both doesn't happen again as well as that we can restore quicker in the future should the need arise. I will go into more detail about this once we are back online.

**We are sorry - we don't want you to be offline any more than you do.**

Personally I'm not going to be getting any sleep until every customer affected by this is back online. I wish I could snap my fingers and have everybody back online or that I could go into the past and make a couple of _minor_ changes that would have prevented this. I do wish, now that this has happened, that there was a quick and easy solution.

I understand you're upset / mad / angry / frustrated. Believe me - I am sitting here listening to each and every one of you about how upset you are - I know you're upset and I am sorry. We're human - and we make mistakes. In this case **thankfully** we do have a last resort disaster recovery that we can pull data from. There are _many_ providers that, having faced this many failures - a perfect storm so to speak - would have simply lost your data entirely.

This is the **first** major outage we've had in over a decade and while this is definitely major - our servers are online and we are actively working as quickly as possible to get all accounts restored and back online. For clarity - the bottleneck here is not a staffing issue. We evaluated numerous options to speed up the process and unfortunately short of copying the data off to faster disks - which we did try - there's nothing we can do to speed this up. The process of copying the data off to faster disks was going to take just as long, if not longer, than the restoration process is taking on its own.

Once everybody is back online - and there are accounts coming online every minute - we will be performing a complete post-mortem on this and will be writing a clear and transparent Reason For Outage [RFO] which we will be making available to all clients.

I hope that you understand that while this restoration process is ongoing there really isn't much to report beyond, "Accounts are still being restored as quickly as possible." I wish there was some interesting update I could provide you like, "Suddenly things have sped up 100x!" but that's not the case.

I am personally doing my best to reach out to clients that have opened tickets so that they are updated as to when their accounts are in the active restoration queue. While we do have thousands of accounts to restore - our disaster recovery system actually transfers data substantially faster with fewer simultaneous transfers. While it sounds counter-intuitive - we're actively watching the restoration processes and balancing the number of accounts being restored at once against the performance of the disaster recovery system to get as many people back online as quickly as possible.

Most sites are coming back online after restoration without issues, however, if once your account is restored you are still having issues - we are here to help. While we are quite overwhelmed by tickets like, "WHY IS THIS NOT UP YET!?!?!" "WHY ARE YOU DOWN SO LONG!?!??!!" "FIX THIS NOWWWW!" - we are still trying to wade through all of that to help those that have come back online and are having issues - as few and far between as it has been.

If you have any questions - we will definitely answer them - but please understand that while we're restoring accounts we're really trying to focus on the restoration of services as well as resolving issues for those that are already resolved.

Again - I am sorry for the trouble this is causing you - we definitely don't want you offline any more than you do and will have all services restored as quickly as we can.

Sincerely,

_______________________________________________
dba-Tech mailing list
dba-Tech at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-tech
Website: http://www.databaseadvisors.com