Good afternoon. My name is John Whitman.
I'm a staff System Engineer for the Networking and Security Business Unit here at VMware.
Today, I'm gonna be talking about disaster recovery using NSX and SRM with automation,
data center recovery and what it's like today in
2018 for your software-defined data center.
Now, a quick agenda, we're going to talk about disaster recovery and
explain its challenges today that you might be facing in your current environment.
We'll talk about NSX multi-site networking in detail,
understanding Site Recovery Manager which is known as VMware SRM.
We'll talk about NSX and SRM and how to make DR successful.
Then, we'll talk about planned migrations,
physical infrastructure maintenance without an outage,
testing DR and NSX in a real-world environment and then a summary with some final notes.
So, let's talk about disaster recovery.
A lot of people put disaster recovery into an individualized bucket.
It means, "I've got my backups.
I've got my replication and I'm able to recover between two sites."
Some other customers also put disaster recovery into the bucket of "Well,
I've got five 9s.
I'm able to bring my environment up.
I'm able to actually work with my data."
But at the same time, what does it really mean to recover from an interruption?
Now, some of the things we'll talk about is RTO and RPO,
but there's other functions that come along with
a successful disaster recovery plan and runbook as well.
So, when you have a disaster,
it's not just an infrastructure failure,
it's recovering applications and getting that momentum and getting
that infrastructure available to the end-users and customers out in your environment.
Now, one thing to talk about is availability versus reliability.
Just because your infrastructure is available it doesn't mean that it's reliable.
You might be able to bring up the environment and have it functional, but is it reliable?
Once you're on the DR site and you're running in your secondary data center,
are you able to actually function and work in that environment and be
productive without lag or without infrastructure outages in your secondary site?
Now, today, applications fail because of human errors.
You rarely see a data center crash into a hole in the ground or the entire grid fail;
it's generally because of human errors,
code that gets pushed out that's incorrect,
infrastructure outages and server issues inside of the data center.
So, alleviating that human error in
the disaster recovery environment is something that we want to
do when you're recovering on the DR side.
Now, most of you know what RTO and RPO means.
RTO is the recovery time objective,
that's how long it takes for me to actually recover
the environment and bring it back online when I have an outage.
RPO is the recovery point objective.
Where did my data last stop being recorded?
Is it five minutes? Is it 15 minutes?
Or is it near real time?
So, when I do come up on the secondary site,
how much of that data have I lost?
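The RPO idea above can be sketched as simple arithmetic. This is a hedged illustration, not VMware tooling; the function names and the interval-based replication model are assumptions made just for the example.

```python
# Worst-case data loss under periodic (interval-based) replication:
# if you replicate every N minutes, a failure just before the next
# cycle loses up to N minutes of writes.

def worst_case_loss_minutes(replication_interval_min: float) -> float:
    """Worst-case data-loss window for periodic replication."""
    return replication_interval_min

def meets_rpo(replication_interval_min: float, rpo_min: float) -> bool:
    """Does this replication schedule satisfy the mandated RPO?"""
    return worst_case_loss_minutes(replication_interval_min) <= rpo_min
```

For example, replicating every 5 minutes satisfies a 15-minute RPO, while replicating every 30 minutes does not.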
But a lot of people don't understand what ROC and WOC are.
So, ROC is the rate of change.
Now, just because you have real-time asynchronous replication
between two sites or you have reliable backups,
doesn't necessarily mean that you have the bandwidth or
physical infrastructure capability to actually be able to record all of those changes.
If you have a large environment and you're having terabytes of
data every day that's changing and you're trying to push that over, let's say,
a one gig or even a 10 gig pipe that's shared with other services,
you might not be able to recover at
the mandated business RPO based off of the rate of change, ROC.
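That bandwidth question can be checked with back-of-the-envelope math. The numbers and function below are hypothetical, just to show the shape of the calculation: convert the daily rate of change into a sustained bit rate and compare it to the replication bandwidth you actually get on a shared link.

```python
def replication_feasible(roc_tb_per_day: float,
                         link_gbps: float,
                         replication_share: float = 1.0) -> bool:
    """True if the usable bandwidth can sustain the rate of change (ROC).

    roc_tb_per_day:    how much data changes per day, in terabytes
    link_gbps:         raw link speed in gigabits per second
    replication_share: fraction of the link available for replication
    """
    # TB/day -> Gb/s: 1 TB = 8e12 bits, 86400 seconds per day.
    required_gbps = roc_tb_per_day * 8e12 / 86400 / 1e9
    return link_gbps * replication_share > required_gbps
```

With 2 TB of change per day on a 1 Gb link where replication gets half the bandwidth, you can keep up (about 0.19 Gb/s required against 0.5 Gb/s usable); at 20 TB per day the same link can never catch up, no matter what RPO the business mandates.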
Now, WOC is write order consistency, and this is something that normally isn't taken into
consideration with disaster recovery and
business continuity solutions, because write consistency means:
okay, my data's there.
It's available, but is it corrupt?
Is it real, and can I actually go and access it
and work in that database that just had a live failure?
So, write order consistency is very
important, and that's usually at the storage and array-based level, but
it really plays as a primary factor into your disaster recovery plan and scenario.
Now, let's talk about traditional challenges.
In a traditional DR environment,
you have to basically create an entire mirror of your environments.
You have your physical infrastructure.
You have your edge routers.
You have your application that might sprawl across multiple different racks.
And then, you have the infrastructure components,
the firewalls, the gateways, your security policies.
Everything that makes that site run,
available and gives high availability and reliability to that primary site.
Now, you have to duplicate it.
The problem is, you don't always necessarily have the ability to build out and
design an infrastructure that's just sitting somewhere idle,
available and standing by.
So, taking and being able to recover that application on
dissimilar hardware or dissimilar networking infrastructure
can be a key component that allows
you to not only diversify your disaster recovery site
but also allow you to have different type of
design implementations for that secondary environment.
Now, one of the key challenges today is having to Re-IP that environment.
Having to actually relocate physical infrastructure, your L2,
your L3 stack between the two sites,
being able to recreate security policies,
being able to recreate all of your routing in
your infrastructure environment and on top of that you might have load balancers,
DNS and application IP dependencies that you don't necessarily
know about or can't see when you're going through your testing of your DR runbook.
Now, moving an application between two sites isn't always necessarily the difficult part.
Sometimes, it's bringing up the infrastructure,
bringing up your MPLS backbone.
Using OTV, which is
a very complex technology, to bring that L2 stretch over without having to Re-IP.
There are other services and other pieces of hardware out there that
allow you to do this, but they're complex, and now you're hardware dependent;
you're locked into that environment.
It's expensive. It's complex.
It's generally proprietary.
It doesn't give you any type of
flexibility, and it lacks automation in that environment.
It's not a holistic solution, and it's really focused on networking at a per-device basis,
and back 10 years ago when you had a physical environment this was fine
because you had that physical infrastructure you had to rely on.
Today, with a mostly virtualized environment,
having to be coupled and rely on that physical infrastructure just doesn't
give you the flexibility that's required today in a software-defined data center.
Now, traditional networking solutions include OTV over Dark Fiber and MPLS or a
VPLS over some sort of carrier backbone
and it is a hardware-based solution that is complex and challenging to maintain.
It's not holistic and really only focuses on a specific part of your infrastructure.
Now, networking with NSX decouples you from that.
So, what's needed for a software-defined approach?
You need to be able to decouple from the physical hardware.
You want ease of use and ease of deployment.
You want to have flexibility.
You want to be able to have hardware and infrastructure diversity and
not be locked down to a key specific vendor,
and you want to be able to have a high degree of automation that you can rapidly
deploy and recover when needed in your environment.
Also having an extensive partner ecosystem gives you the diversity to not just have
one infrastructure component fail and be recovered
but have it all fail and be recovered at the same time.