[dba-Tech] backup, backup, backup
Peter Brawley
peter.brawley at earthlink.net
Sat Sep 22 11:34:57 CDT 2018
If you rely on hosting provider backups, think again. I got the email
quoted below from a hosting provider with a good reputation that I use
as a backup provider ...
PB
/While I was hoping to save some of this for the official RFO [Reason
For Outage] - enough people are getting tremendously upset over this
that I'm going to spell out what I can now - keeping in mind that I will
provide more details when I can.//
//
//**What happened?**//
//
//First and foremost - this failure is not something that we planned on
or expected. A server administrator, the most experienced administrator
we have, made a big mistake. During some routine maintenance where they
were supposed to perform a _file system trim_ they mistakenly performed
a _block discard_.//
//
//**What does this mean?**//
//
//The server administrator essentially told our storage platform to drop
all data rather than simply dropping data that had been marked as
_deleted_ by our servers.//
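
To make the distinction concrete, here is a toy Python model. It is only a sketch: `ToyVolume` and its method names are invented for illustration and are not the provider's tooling. On Linux the two operations are commonly exposed as `fstrim` (discard blocks the filesystem no longer uses) and `blkdiscard` (discard sectors on a device regardless of content); whether those were the commands involved is an assumption, since the email does not name them.

```python
class ToyVolume:
    """Hypothetical model of a thin-provisioned volume: block number -> data."""

    def __init__(self, blocks, in_use):
        self.blocks = dict(blocks)   # everything physically stored
        self.in_use = set(in_use)    # blocks the filesystem still references

    def filesystem_trim(self):
        """Release only the blocks the filesystem has marked as deleted."""
        freed = [b for b in self.blocks if b not in self.in_use]
        for b in freed:
            del self.blocks[b]
        return len(freed)

    def block_discard(self):
        """Release every block, live data included."""
        freed = len(self.blocks)
        self.blocks.clear()
        self.in_use.clear()
        return freed


def demo():
    # Blocks 0 and 1 hold live data; 2 and 3 were deleted but not yet reclaimed.
    stored = {0: "boot", 1: "customer-db", 2: "old-log", 3: "tmp"}
    live = {0, 1}
    trimmed = ToyVolume(stored, live)
    trimmed.filesystem_trim()
    discarded = ToyVolume(stored, live)
    discarded.block_discard()
    return sorted(trimmed.blocks), sorted(discarded.blocks)
```

In this model the trim leaves the live blocks in place, while the discard leaves nothing behind, which matches the failure the email describes.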
//
//**Why is restoration taking so long?**//
//
//Initially we believed that only the primary operating system partition
of the servers was damaged - so we worked to bring new machines online
to connect to our storage to bring accounts back online. Had our
initial belief been correct - we'd have been back online in a few hours
at most.//
//
//As it turns out our local data was corrupted beyond repair - to the
point that we could not even mount the file systems to attempt data
recovery.//
//
//Normally we would rely on snapshots in our storage platform - simply
mounting a snapshot from prior to the incident and booting servers back
up. It would have taken minutes - maybe an hour at most. We are not sure as
of yet, and will need to investigate, but snapshots were disabled. I
wish I could tell you why - and I wish I knew why - but we don't know
yet and will have to look into it.//
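
A disabled snapshot schedule is the kind of problem a routine check can catch before it matters. The following is a minimal sketch of such a check, not anything the provider is known to run: the function name `snapshot_status` and the 24-hour threshold are assumptions, and the list of timestamps would have to come from whatever the storage platform reports.

```python
from datetime import datetime, timedelta, timezone


def snapshot_status(snapshot_times, now, max_age=timedelta(hours=24)):
    """Classify snapshot coverage as 'missing', 'stale', or 'ok'.

    snapshot_times: datetimes of existing snapshots (may be empty).
    now: the current time, passed in so the check is easy to test.
    max_age: assumed tolerance; pick a value that fits your schedule.
    """
    if not snapshot_times:
        return "missing"          # snapshots disabled or never taken
    newest = max(snapshot_times)
    if now - newest > max_age:
        return "stale"            # schedule stopped some time ago
    return "ok"


def example():
    now = datetime(2018, 9, 22, 12, 0, tzinfo=timezone.utc)
    recent = [now - timedelta(hours=3), now - timedelta(days=2)]
    old = [now - timedelta(days=5)]
    return (snapshot_status(recent, now),
            snapshot_status(old, now),
            snapshot_status([], now))
```

Anything other than "ok" should page someone; the point is to learn that snapshots are off on a quiet day rather than during a restore.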
//
//We are working to restore cPanel backups from our off-site backup
server in Phoenix, Arizona. While you would think the distance and
connectivity were the issue - the real issue is the amount of I/O that
backup server has available to it. While it is a robust server with 24
drives - it can only read so much data so fast. As these are high
capacity spinning drives - they have limits on speed.//
//
//Our disaster recovery server is our **last resort** to restore client
data and, as it stands, is the _only_ copy we have remaining of all
client data - except that which has already been restored which is back
to being stored in triplicate.//
//
//**What will you do to prevent this in the future?**//
//
//As we've been working on this and running into issues getting things
back online quickly, we have been discussing what changes we need to
make to ensure both that this doesn't happen again and that we can
restore more quickly in the future should the need arise. I will go into
more detail about this once we are back online.//
//
//**We are sorry - we don't want you to be offline any more than you do.**//
//
//Personally I'm not going to be getting any sleep until every customer
affected by this is back online. I wish I could snap my fingers and
have everybody back online or that I could go into the past and make a
couple of _minor_ changes that would have prevented this. I do wish,
now that this has happened, that there was a quick and easy solution.//
//
//I understand you're upset / mad / angry / frustrated. Believe me - I
am sitting here listening to each and every one of you about how upset
you are - I know you're upset and I am sorry. We're human - and we make
mistakes. In this case **thankfully** we do have a last resort disaster
recovery that we can pull data from. There are _many_ providers that,
having faced this many failures - a perfect storm so to speak - would
have simply lost your data entirely.//
//
//This is the **first** major outage we've had in over a decade and
while this is definitely major - our servers are online and we are
actively working as quickly as possible to get all accounts restored and
back online. For clarity - the bottleneck here is not a staffing
issue. We evaluated numerous options to speed up the process and
unfortunately short of copying the data off to faster disks - which we
did try - there's nothing we can do to speed this up. The process of
copying the data off to faster disks was going to take just as long, if
not longer, than the restoration process is taking on its own.//
//
//Once everybody is back online - and there are accounts coming online
every minute - we will be performing a complete post-mortem on this and
will be writing a clear and transparent Reason For Outage [RFO] which we
will be making available to all clients.//
//
//I hope that you understand that while this restoration process is
ongoing there really isn't much to report beyond, "Accounts are still
being restored as quickly as possible." I wish there was some
interesting update I could provide you like, "Suddenly things have sped
up 100x!" but that's not the case.//
//
//I am personally doing my best to make sure that clients that have
opened tickets are updated as to when their accounts are in the active
restoration queue. While we do have thousands of accounts to restore -
our disaster recovery system actually transfers data substantially
faster with fewer simultaneous transfers. While it sounds
counter-intuitive - we're actively watching the restoration processes
and balancing the number of accounts being restored at once against the
performance of the disaster recovery system to get as many people back
online as quickly as possible.//
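
Capping the number of simultaneous restores, as described above, can be sketched with a bounded worker pool. This is an illustration under assumptions, not the provider's restore system: `restore_all`, `restore_one`, and `max_parallel` are hypothetical names, and the right cap for a given backup server has to be found by measurement.

```python
from concurrent.futures import ThreadPoolExecutor


def restore_all(accounts, restore_one, max_parallel):
    """Restore every account, running at most max_parallel restores at once.

    restore_one is whatever callable performs a single restore (hypothetical
    here). Results come back in the same order as the input accounts.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(restore_one, accounts))
```

On spinning disks, many parallel readers tend to force the heads to seek between files, so a small `max_parallel` can give better total throughput than a large one. Tuning it while watching the backup server's I/O is the balancing act the email describes.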
//
//Most sites are coming back online after restoration without issues,
however, if once your account is restored you are still having issues -
we are here to help. While we are quite overwhelmed by tickets like,
"WHY IS THIS NOT UP YET!?!?!" "WHY ARE YOU DOWN SO LONG!?!??!!" "FIX
THIS NOWWWW!" - we are still trying to wade through all of that to help
those that have come back online and are having issues - as few and far
between as it has been.//
//
//If you have any questions - we will definitely answer them - but
please understand that while we're restoring accounts we're really
trying to focus on the restoration of services as well as resolving
issues for those that are already restored.//
//
//Again - I am sorry for the trouble this is causing you - we definitely
don't want you offline any more than you do and will have all services
restored as quickly as we can.//
//
//Sincerely,//
/