Nutanix Metro: Advanced Deployment, Configuration, and Best Practices for Production Environments

Table of Contents

  1. Introduction to Nutanix Metro
  2. Architecture Overview and Core Concepts
  3. Prerequisites and Environmental Planning
  4. Step-by-Step Configuration (GUI, CLI, API)
  5. Advanced Workflows and Automation
  6. Best Practices for Production Deployments
  7. Troubleshooting Common and Complex Issues
  8. Real-World Use Cases
  9. Frequently Asked Questions (FAQ)
  10. Conclusion

1. Introduction to Nutanix Metro

Nutanix Metro, also called Nutanix Metro Availability, is a business continuity and disaster recovery solution built into the Nutanix platform. It replicates data synchronously between two geographically separated Nutanix clusters, so a site outage causes no data loss (a recovery point objective, or RPO, of zero) and applications can fail over quickly to the surviving site. Metro is intended for mission-critical workloads that demand maximum uptime and must meet stringent RPOs.


2. Architecture Overview and Core Concepts

At its core, Nutanix Metro extends the data protection and high availability features of Nutanix AOS by synchronously mirroring data between two separate sites.

Key architectural concepts:

  • Metro Clusters: Two Nutanix clusters, each at a different site, interconnected for synchronous replication.
  • Stretched Volume Groups: Application volumes mirrored synchronously between both sites.
  • Witness VM: An out-of-band component for split-brain avoidance and quorum.

[Figure: Architecture diagram]


3. Prerequisites and Environmental Planning

Hardware and Software Requirements

  • Nutanix clusters running AOS 6.x or later
  • Minimum of one Prism Central managing both clusters
  • Supported hypervisor (AHV or ESXi), with identical hypervisor type and version required on both clusters
  • Dedicated, low-latency, high-bandwidth network between sites
  • Witness VM deployed at a third location (preferably cloud or a separate site)

Licensing

  • Metro Availability is included in the Nutanix Ultimate Edition license.
  • Both clusters must be licensed appropriately with Ultimate Edition or equivalent to enable Metro features.
  • Always verify current licensing status and feature entitlements via Nutanix Support or your account representative.

Hypervisor Uniformity

  • Both clusters must run the same hypervisor type and version (either AHV or ESXi).
  • Mixed-hypervisor Metro configurations are not supported and will prevent proper Metro Availability operation.
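
The uniformity rule above lends itself to a small pre-flight check. The following is a minimal sketch, not a Nutanix tool: the function name `check_hypervisor_match` is hypothetical, and it assumes you have already collected each cluster's hypervisor type and version string (for example from Prism) and pass them in as plain arguments.

```shell
# check_hypervisor_match TYPE_A VERSION_A TYPE_B VERSION_B
# Hypothetical helper: compares values you collected yourself; it does not
# query either cluster. The body runs in a subshell, so "exit" only ends
# the function call, not the calling script.
check_hypervisor_match() (
  if [ "$1" != "$3" ]; then
    echo "MISMATCH: hypervisor type differs ($1 vs $3)"
    exit 1
  fi
  if [ "$2" != "$4" ]; then
    echo "MISMATCH: $1 version differs ($2 vs $4)"
    exit 1
  fi
  echo "OK: both clusters run $1 $2"
)
```

Calling it with the values for both sites prints a single OK or MISMATCH line and returns a non-zero status on any mismatch, so it can gate a change window or an upgrade script.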

Network & Latency

  • Recommended latency: Less than 5 ms round-trip time (RTT) between clusters.
  • Bandwidth: Sized for the peak write throughput of all protected workloads, because every write must reach the remote site before it is acknowledged.
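
Both requirements can be checked with a few lines of shell. This is a rough sketch under stated assumptions: the helper names are hypothetical, the RTT parser expects the `min/avg/max` summary line that common Linux and BSD `ping` implementations print, and the headroom percentage passed to the bandwidth estimate is an illustrative planning factor rather than a Nutanix figure.

```shell
# Extract the average RTT in ms from ping output read on stdin.
# Assumes a summary line such as: rtt min/avg/max/mdev = 1.2/1.5/2.1/0.3 ms
avg_rtt_ms() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# rtt_within_limit RTT_MS LIMIT_MS
# Succeeds only when RTT is strictly below the limit. An empty RTT (no
# summary line was found) is treated as a failure. awk handles decimals.
rtt_within_limit() {
  [ -n "$1" ] || return 1
  awk -v rtt="$1" -v limit="$2" 'BEGIN { exit !(rtt + 0 < limit + 0) }'
}

# required_mbps PEAK_WRITE_MB_PER_S HEADROOM_PERCENT
# Converts peak write throughput (MB/s) to link bandwidth (Mbit/s) and adds
# headroom. Integer arithmetic only; 1 MB/s is taken as 8 Mbit/s.
required_mbps() {
  echo $(( $1 * 8 * (100 + $2) / 100 ))
}
```

For example, `ping -c 20 <remote-cluster-ip> | avg_rtt_ms` yields the average RTT, which can then be passed to `rtt_within_limit` with a limit of 5. A measured peak of 120 MB/s of writes with 30 percent headroom gives `required_mbps 120 30`, that is 1248 Mbit/s.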

Security and Connectivity

  • Ensure secure, firewalled network paths between both clusters and the Witness VM.
  • Consistent VLAN/subnet planning for stretched networks.

4. Step-by-Step Configuration (GUI, CLI, API)

4.1 Initial Setup via Prism (GUI)

  1. Log into Prism Central.
  2. Navigate to Protection Domains & Metro Availability.
  3. Select Create Metro Availability.
  4. Add both clusters to the Metro configuration.
  5. Select the volumes or VMs to protect.
  6. Configure the stretched network and Witness details.

4.2 Witness VM Deployment

  • Download and deploy the Witness OVA (for VMware) or QCOW2 (for AHV) at a third site.
  • Power on and configure IP/networking.
  • Register the Witness in Prism Central.

[Figure: Witness placement]

4.3 Advanced Configuration (CLI)

A. Check Metro Readiness:

ncli metro-cluster ls

B. Enable Metro on a Protection Domain:

ncli pd metro-availability-enable \
  name="prod-db-protect" \
  remote-cluster-name="Cluster-B"

C. Add VMs to Metro Domain:

ncli pd add-entity \
  name="prod-db-protect" \
  entity-type=vm \
  entity-names="AppServer01,DB01"

D. API Example: Create Metro Protection

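# Passing only the user name makes curl prompt for the password,
# which keeps it out of shell history and the process list.
curl -u admin -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "remote_cluster": "Cluster-B",
    "entities": ["AppServer01", "DB01"]
  }' \
  https://prism-central-ip:9440/api/nutanix/v3/metro_availability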
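
When the request above is issued from a script, it helps to generate the JSON body instead of hand-editing it. The helper below is a minimal sketch: `build_metro_payload` is a hypothetical name, the field names are copied from the example above, and the function assumes cluster and entity names contain no double quotes or backslashes because it performs no JSON escaping.

```shell
# build_metro_payload REMOTE_CLUSTER ENTITY...
# Prints a JSON body with the same fields as the request above.
# Assumption: names need no JSON escaping (no quotes or backslashes).
# The body runs in a subshell so its variables stay local.
build_metro_payload() (
  remote="$1"
  shift
  entities=""
  for name in "$@"; do
    # Append a comma separator only when the list is already non-empty.
    entities="${entities:+$entities, }\"$name\""
  done
  printf '{"remote_cluster": "%s", "entities": [%s]}\n' "$remote" "$entities"
)
```

The output can be piped straight into curl with `-d @-`, which reads the request body from standard input, for example `build_metro_payload Cluster-B AppServer01 DB01 | curl -u admin -X POST -H "Content-Type: application/json" -d @- <endpoint>`.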

5. Advanced Workflows and Automation

Automated Failover (CLI Example)

ncli metro-cluster failover \
  name="prod-db-protect" \
  force=true
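
Because a forced failover is disruptive, automation around it usually benefits from a confirmation step and a dry-run mode. The wrapper below is a sketch of that idea, not part of any Nutanix tooling: the function names and the `DRY_RUN` variable are hypothetical, and the command it wraps is the one shown above, unchanged.

```shell
# failover_command PD_NAME
# Prints the failover command from this section without running it.
failover_command() {
  printf 'ncli metro-cluster failover name="%s" force=true\n' "$1"
}

# guarded_failover PD_NAME TYPED_CONFIRMATION
# Proceeds only when the operator has retyped the protection domain name
# exactly. DRY_RUN defaults to 1, so nothing is executed unless the caller
# explicitly sets DRY_RUN=0.
guarded_failover() (
  if [ -z "$1" ] || [ "$1" != "$2" ]; then
    echo "refusing failover: confirmation does not match protection domain name" >&2
    exit 1
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: $(failover_command "$1")"
  else
    ncli metro-cluster failover name="$1" force=true
  fi
)
```

Running `guarded_failover prod-db-protect prod-db-protect` prints the command that would run. Defaulting to a dry run means a mistyped or copy-pasted invocation cannot trigger a failover by accident.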

Automated Monitoring (Script Example)

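#!/bin/bash
# Nutanix Metro health check: query Metro status on each cluster.
CLUSTERS=("Cluster-A" "Cluster-B")
for cluster in "${CLUSTERS[@]}"
do
  echo "== ${cluster} =="
  # Report a failed query instead of silently moving on.
  ncli --cluster="${cluster}" metro-cluster get-status \
    || echo "WARNING: status query failed for ${cluster}" >&2
done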

Scheduled Metro Health Checks

  • Use Nutanix Prism Central Scheduled Reports to send daily Metro health status to administrators.
  • API endpoint: /api/nutanix/v3/metro_availability/status

6. Best Practices for Production Deployments

  • Network Health: Regularly monitor latency and bandwidth between sites.
  • Witness Isolation: Place Witness VM in a neutral third site or cloud, not within either primary cluster’s data center.
  • Test Failover: Conduct quarterly planned failover and failback drills to validate business continuity.
  • Protection Domain Design: Group related workloads (app and database) in a single domain for consistent failover.
  • Alerting: Enable proactive alerting for Metro status changes or witness failures.
  • Version Alignment: Keep both clusters at the same AOS version.
  • Hypervisor Consistency: Keep both Metro clusters at identical hypervisor versions and patch levels, and upgrade both sites in the same maintenance cycle to avoid configuration drift.
  • Licensing Compliance: Ensure both clusters are always covered by Nutanix Ultimate Edition licensing for uninterrupted Metro protection.
  • Runbooks: Maintain clear runbooks for manual failover, failback, and troubleshooting.
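
The quarterly drill cadence above is easy to let slip, so a scheduled job can flag an overdue test. This is a small sketch with assumptions: `drill_overdue` is a hypothetical helper, "quarterly" is approximated as 90 days, and the date of the last drill is assumed to be recorded somewhere as a Unix epoch timestamp.

```shell
# drill_overdue LAST_DRILL_EPOCH NOW_EPOCH [MAX_DAYS]
# Succeeds (exit 0) when more than MAX_DAYS whole days have passed since
# the last drill. MAX_DAYS defaults to 90 as a stand-in for "quarterly".
drill_overdue() {
  [ $(( ($2 - $1) / 86400 )) -gt "${3:-90}" ]
}
```

A cron job could call `drill_overdue "$last_drill" "$(date +%s)" && echo "Metro failover drill is overdue"`, where `last_drill` is read from wherever your runbook records it.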

7. Troubleshooting Common and Complex Issues

Witness VM Unavailability and Failover Automation

  • Critical Note:
    If the Witness VM is unavailable, automated failover is disabled.
    Manual intervention is required to ensure data integrity and prevent split-brain scenarios.
  • Operational Planning:
    Always monitor the status of the Witness VM and ensure high availability for its underlying infrastructure.

Witness Connectivity Problems

  • Symptom: Metro state shows “Degraded” or “Disconnected”
  • Check:
    • Witness VM network interface up?
    • Firewall ports open between the Witness VM and both clusters?
    • Witness service running?
  • CLI: ncli metro-cluster get-status
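
The symptom check above can be scripted so that a monitoring job raises an alert on the states named in this section. The following is a heuristic sketch only: `classify_metro_status` is a hypothetical helper, it matches the words "Degraded" and "Disconnected" in whatever status text it is given, and it makes no assumption about the exact output format of the status command.

```shell
# classify_metro_status
# Reads status text on stdin and prints UNHEALTHY, UNKNOWN, or OK.
# Exit status: 0 for OK, 1 for UNHEALTHY, 2 for UNKNOWN (empty input).
# Heuristic: OK only means neither keyword was found in the text.
classify_metro_status() (
  text=$(cat)
  if [ -z "$text" ]; then
    echo "UNKNOWN"
    exit 2
  fi
  if printf '%s\n' "$text" | grep -Eiq 'degraded|disconnected'; then
    echo "UNHEALTHY"
    exit 1
  fi
  echo "OK"
)
```

Piping the output of the status command into `classify_metro_status` gives a single word and an exit code that a scheduler or alerting hook can act on. Treating empty output as UNKNOWN avoids reporting a healthy state when the query itself failed.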

Split-Brain Condition

  • Cause: A cluster loses communication with both the Witness VM and its peer cluster, so it cannot confirm which site should remain active
  • Action:
    • Identify which cluster is active
    • Restore connectivity or perform controlled failover as per runbook

Resync Failures

  • Symptom: Protection domain fails to resync after network outage
  • Check:
    • Sufficient bandwidth?
    • Disk space on both clusters?
    • Review logs via Prism or CLI

Performance Impact

  • Monitor:
    • Storage latency metrics in Prism Central
    • Impacted VMs with high IOPS

8. Real-World Use Cases

Financial Services

  • Zero RPO database failover for core banking systems between two metropolitan data centers

Healthcare

  • Synchronous EMR application protection across two hospitals for HIPAA compliance

Retail

  • 24/7 e-commerce workload protection with rapid recovery from a data center outage

Public Sector

  • Metro clusters for critical infrastructure with automated disaster drills

9. Frequently Asked Questions (FAQ)

Q: Is Nutanix Metro included in my existing license?
A: Metro Availability requires Nutanix Ultimate Edition licensing. Both participating clusters must have the correct license level to enable Metro.

Q: Can I mix hypervisors between Metro clusters?
A: No. Metro clusters require the same supported hypervisor type and version on both sites.

Q: What happens if the Witness VM is unavailable?
A: Automated failover is disabled, and manual intervention is necessary. Operational continuity planning must account for this scenario.

Q: How often should I test failover?
A: At least quarterly, or after any major infrastructure changes.

Q: Is Metro suitable for asynchronous replication?
A: No. Metro is for synchronous use cases. For asynchronous replication, use Nutanix Async DR; for near-synchronous replication, use NearSync.


10. Conclusion

Nutanix Metro is a powerful tool for ensuring data resilience and business continuity across mission-critical environments. By following advanced configuration steps, enforcing best practices, and regularly testing your setup, you can achieve near-zero downtime and seamless recovery. Stay proactive with monitoring, licensing, and up-to-date runbooks to maximize your Metro deployment’s effectiveness.

Disclaimer: The views expressed in this article are those of the author and do not represent the opinions of Nutanix, my employer or any affiliated organization. Always refer to the official Nutanix documentation before production deployment.
