[dba-Tech] backup, backup, backup

Peter Brawley peter.brawley at earthlink.net
Sat Sep 22 11:34:57 CDT 2018


If you rely on hosting provider backups, think again. I got the email 
quoted below from a hosting provider with a good reputation that I use 
as a backup provider ...

PB

/While I was hoping to save some of this for the official RFO [Reason 
For Outage] - enough people are getting tremendously upset over this 
that I'm going to spell out what I can now - keeping in mind that I will 
provide more details when I can.//
//
//**What happened?**//
//
//First and foremost - this failure is not something that we planned on 
or expected.  A server administrator, the most experienced administrator 
we have, made a big mistake.  During some routine maintenance where they 
were supposed to perform a _file system trim_ they mistakenly performed 
a _block discard_.//
//
//**What does this mean?**//
//
//The server administrator essentially told our storage platform to drop 
all data rather than simply dropping data that had been marked as 
_deleted_ by our servers.//
//
//**Why is restoration taking so long?**//
//
//Initially we believed that only the primary operating system partition 
of the servers was damaged - so we worked to bring new machines online 
to connect to our storage to bring accounts back online.  Had our 
initial belief been correct - we'd have been back online in a few hours 
at most.//
//
//As it turns out our local data was corrupted beyond repair - to the 
point that we could not even mount the file systems to attempt data 
recovery.//
//
//Normally we would rely on snapshots in our storage platform - simply 
mounting a snapshot from prior to the incident and booting servers back 
up.  It would have taken minutes - maybe an hour at most.  However, 
snapshots were disabled.  We are not sure why as of yet - I wish I could 
tell you, and I wish I knew - but we will have to investigate.//
//
//We are working to restore cPanel backups from our off-site backup 
server in Phoenix, Arizona.  While you would think the distance and 
connectivity were the issue - the real issue is the amount of I/O that 
backup server has available to it.  While it is a robust server with 24 
drives - it can only read so much data so fast.  As these are high 
capacity spinning drives - they have limits on speed.//
//
//Our disaster recovery server is our **last resort** to restore client 
data and, as it stands, is the _only_ copy we have remaining of all 
client data - except that which has already been restored which is back 
to being stored in triplicate.//
//
//**What will you do to prevent this in the future?**//
//
//While working on this and running into issues getting things back 
online quickly, we have been discussing what changes we need to make to 
ensure both that this doesn't happen again and that we can restore more 
quickly in the future should the need arise.  I will go into 
more detail about this once we are back online.//
//
//**We are sorry - we don't want you to be offline any more than you do.**//
//
//Personally I'm not going to be getting any sleep until every customer 
affected by this is back online.  I wish I could snap my fingers and 
have everybody back online or that I could go into the past and make a 
couple of _minor_ changes that would have prevented this.  I do wish, 
now that this has happened, that there was a quick and easy solution.//
//
//I understand you're upset / mad / angry / frustrated. Believe me - I 
am sitting here listening to each and every one of you about how upset 
you are - I know you're upset and I am sorry. We're human - and we make 
mistakes.  In this case **thankfully** we do have a last resort disaster 
recovery that we can pull data from.  There are _many_ providers that, 
having faced this many failures - a perfect storm so to speak - would 
have simply lost your data entirely.//
//
//This is the **first** major outage we've had in over a decade and 
while this is definitely major - our servers are online and we are 
actively working as quickly as possible to get all accounts restored and 
back online.  For clarity - the bottleneck here is not a staffing 
issue.  We evaluated numerous options to speed up the process and 
unfortunately short of copying the data off to faster disks - which we 
did try - there's nothing we can do to speed this up.  The process of 
copying the data off to faster disks was going to take just as long, if 
not longer, than the restoration process is taking on its own.//
//
//Once everybody is back online - and there are accounts coming online 
every minute - we will be performing a complete post-mortem on this and 
will be writing a clear and transparent Reason For Outage [RFO] which we 
will be making available to all clients.//
//
//I hope that you understand that while this restoration process is 
ongoing there really isn't much to report beyond, "Accounts are still 
being restored as quickly as possible."  I wish there was some 
interesting update I could provide you like, "Suddenly things have sped 
up 100x!" but that's not the case.//
//
//I am personally doing my best to ensure that clients who have opened 
tickets are updated as to when their accounts are in the active 
restoration queue.  While we do have thousands of accounts to restore - 
our disaster recovery system actually transfers data substantially 
faster with fewer simultaneous transfers.  While it sounds 
counter-intuitive - we're actively watching the restoration processes 
and balancing the number of accounts being restored at once against the 
performance of the disaster recovery system to get as many people back 
online as quickly as possible.//
//
//Most sites are coming back online after restoration without issues, 
however, if once your account is restored you are still having issues - 
we are here to help.  While we are quite overwhelmed by tickets like, 
"WHY IS THIS NOT UP YET!?!?!"  "WHY ARE YOU DOWN SO LONG!?!??!!"  "FIX 
THIS NOWWWW!" - we are still trying to wade through all of that to help 
those that have come back online and are having issues - as few and far 
between as they have been.//
//
//If you have any questions - we will definitely answer them - but 
please understand that while we're restoring accounts we're really 
trying to focus on the restoration of services as well as resolving 
issues for those that are already restored.//
//
//Again - I am sorry for the trouble this is causing you - we definitely 
don't want you offline any more than you do and will have all services 
restored as quickly as we can.//
//
//Sincerely,//
/
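
The provider's remark that the disaster recovery server moves data 
faster with fewer simultaneous transfers is plausible for spinning 
drives, where many parallel reads tend to turn into seeks. Below is a 
minimal Python sketch of a restore queue with a hard cap on parallelism. 
It is only an illustration under stated assumptions: restore_all, 
restore_one and max_parallel are hypothetical names, not anything from 
the provider's tooling, and restore_one stands in for whatever actually 
copies one account.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def restore_all(accounts, restore_one, max_parallel=2):
    """Restore every account, never running more than max_parallel at once.

    accounts     -- iterable of account identifiers (hypothetical)
    restore_one  -- callable that restores a single account (hypothetical)
    max_parallel -- cap on simultaneous transfers; tune it by measurement
    Returns a dict mapping each account to its result, or to the exception
    raised, so one failed restore does not stop the rest of the queue.
    """
    results = {}
    # The pool size is the concurrency cap: extra accounts wait in the queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(restore_one, acct): acct for acct in accounts}
        for fut in as_completed(futures):
            acct = futures[fut]
            try:
                results[acct] = fut.result()
            except Exception as exc:  # record the failure and keep going
                results[acct] = exc
    return results
```

In practice one would time a few batches at different max_parallel 
values and keep whichever gives the best total throughput, which is 
roughly the balancing act the provider describes.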


