For our purposes, a hot failover meant that data was constantly being replicated.
On a five-minute delay, our data was continuously replicated to our hot
failover site, even though live incoming production requests were not going there.
The resulting data was replicated across multiple sites,
so our hot failover site was sitting there, waiting for transactions.
That meant we could use what was effectively a redirector, you could call it a
load balancer or call it something else, and merely change its configuration
to redirect incoming traffic to our hot failover site.
So, in a matter of minutes, we could make the call, decide, and hit a switch,
potentially a value in a configuration file or a single database field, that
automatically redirected input from our external services to the hot failover site.
We estimated the whole switch would take less than 30 minutes.
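As a rough illustration, here is a minimal sketch of what that single-field switch might look like; the flag name, site URLs, and config file are assumptions for the example, not our actual setup:

```python
# Minimal sketch of a config-driven redirector. The "active_site" flag,
# the site URLs, and the config path are illustrative assumptions.
import json

SITES = {
    "production": "https://prod.example.internal",
    "hot_failover": "https://failover.example.internal",
}

def active_site(config_path: str = "redirector.json") -> str:
    """Read the single field that controls the cutover; flipping it
    from "production" to "hot_failover" reroutes all traffic."""
    with open(config_path) as f:
        config = json.load(f)
    return SITES[config["active_site"]]

def route(request_path: str) -> str:
    """Build the upstream URL every incoming request is forwarded to."""
    return f"{active_site()}{request_path}"
```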
The 30-minute estimate came from two windows. First, we had a 15-minute
response window for on-call production support. For example, when it was my turn
to be on call, I had to be within 15 minutes of having my machine up and me logged
in, so no matter where I went, I couldn't be more than 15 minutes, minus however
long it takes to get my computer ready, away from my computer while on call.
Second, it took another 15 minutes to decide and flip the switch.
We decided that 30 minutes was good enough,
and that the data we would have to re-replicate would be small enough.
Only one or two 15-minute windows, which was as granular as our time windows went,
would need to be replicated again if data was lost between
production going down and hot failover receiving the redirect.
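As a sanity check on that claim, here is the back-of-the-envelope arithmetic as a short script; the constants are the figures above, and the helper function is purely illustrative:

```python
# Back-of-the-envelope check on the recovery math above. The figures are
# the ones from the talk; the helper function is purely illustrative.
import math

ONCALL_RESPONSE_MIN = 15     # on-call must be at a keyboard within this
DECISION_AND_FLIP_MIN = 15   # time to decide and flip the redirect switch
WINDOW_MIN = 15              # granularity of the replication time windows

def windows_to_rereplicate() -> int:
    """Worst-case count of 15-minute windows lost during a cutover."""
    exposure_min = ONCALL_RESPONSE_MIN + DECISION_AND_FLIP_MIN
    return math.ceil(exposure_min / WINDOW_MIN)

print(windows_to_rereplicate())  # 2, i.e. the "one or two windows" above
```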
This is something we used quite a bit.
And that brings me to BCP, our business continuity plan.
None of these cutover strategies actually works unless you've tested it.
The worst thing you can do is have to go through one of these cutovers for real and
have it not work.
So you have to test it.
In fact, for our purposes, we ran a hot failover test once every quarter.
Every three months we actually hit the switch,
sent everything over to hot failover, and ran there for a full day.
We usually coordinated this with updates as well, and
that's how our production servers got updated.
We would hot failover and make sure that worked; that was our BCP test.
Then we would update production,
hot failover back to production, and make sure the updates worked properly, so
that we could roll back just by firing back over to hot failover.
If we decided that production was good, we would make the same
set of updates to hot failover, and again it would just sit there running:
production taking data, and hot failover waiting for information.
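To make the ordering of that cycle concrete, here is a sketch of the quarterly runbook; the step functions are hypothetical stand-ins for whatever real deployment tooling you have:

```python
# A minimal sketch of the quarterly BCP-plus-update cycle described
# above. The step functions are hypothetical stand-ins for real tooling.

def flip_traffic_to(site: str) -> None:
    print(f"redirecting incoming traffic to {site}")

def verify_site(site: str) -> None:
    print(f"verifying {site} handles live traffic correctly")

def apply_updates(site: str) -> None:
    print(f"applying this quarter's updates to {site}")

def quarterly_cycle() -> None:
    flip_traffic_to("hot_failover")   # the BCP test: run here for a day
    verify_site("hot_failover")
    apply_updates("production")       # update prod while it is idle
    flip_traffic_to("production")     # fail back onto updated production
    verify_site("production")         # failover remains the rollback path
    apply_updates("hot_failover")     # bring failover to parity last
    # End state: production takes data; failover sits updated, waiting.

if __name__ == "__main__":
    quarterly_cycle()
```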