VCF 9.0 GA Mental Model Part 2: Fleet Services vs Instance Management Planes (and Who Owns What)

Paul Bryant

2 months ago

TL;DR

Standardize on the official hierarchy: VCF private cloud -> VCF fleet -> VCF instance -> VCF domain -> vSphere clusters. A VCF fleet is managed by one set of fleet-level management components (notably VCF Operations and VCF Automation), while each VCF instance keeps its own management domain and domain-level control planes.

Your fastest path to org alignment is separating two things people constantly mix up:

Fleet-level services: centralized operations, lifecycle for management components, automation, and SSO integration.
Instance management planes: SDDC Manager, management vCenter, management NSX, plus the vCenter and NSX that belong to each workload domain.

Scope and code levels referenced (VCF 9.0 GA core):

VCF Installer: 9.0.1.0 build 24962180 (required to deploy VCF 9.0.0.0 components)
SDDC Manager: 9.0.0.0 build 24703748
vCenter: 9.0.0.0 build 24755230
ESXi: 9.0.0.0 build 24755229
NSX: 9.0.0.0 build 24733065
VCF Operations: 9.0.0.0 build 24695812
VCF Operations fleet management: 9.0.0.0 build 24695816
VCF Automation: 9.0.0.0 build 24701403
VCF Identity Broker: 9.0.0.0 build 24695128

Architecture Diagram

Legend:

Fleet-level management components give you centralized governance, inventory, and services.
Instance management planes are not shared. Each instance still owns its own SDDC Manager, vCenter, and NSX boundaries.

Scenario
Assumptions
Core vocabulary recap
Core concept: separate fleet services from instance management planes
What runs where in VCF 9.0 GA
Who owns what
Day-0, day-1, day-2 map
Identity and SSO boundaries that actually matter
Topology patterns for single site, two sites, and multi-region
Failure domain analysis
Operational runbook snapshot
Anti-patterns
Summary and takeaways
Conclusion

Scenario

You need architects, operators, and leadership to agree on:

What VCF 9.0 actually manages.
What is centralized at fleet level vs isolated per instance or domain.
Who owns which parts of lifecycle, identity, and day-2 operations.

Assumptions

You are deploying greenfield VCF 9.0 GA (core components at 9.0.0.0, deployed via the documented installer level).
You deploy both VCF Operations and VCF Automation from day-1.
You want patterns for:
- Single site
- Two sites in one region
- Multi-region
You need guidance for both:
- Shared identity
- Separate identity and SSO boundaries for regulated isolation

Core vocabulary recap

Use these terms consistently in meetings, designs, and runbooks:

VCF private cloud: the highest-level management and consumption boundary; can contain one or more fleets.
VCF fleet: managed by one set of fleet-level management components (notably VCF Operations and VCF Automation); contains one or more instances.
VCF instance: a discrete VCF deployment containing a management domain and optionally workload domains.
VCF domain: a lifecycle and isolation boundary inside an instance (management domain and VI workload domains).
vSphere cluster: where ESXi capacity lives; clusters exist inside domains.
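
One way to keep the vocabulary honest in design reviews is to write the hierarchy down as a data model. The following is a minimal sketch, not a VCF API: the class names mirror the terms above, and every object name (`corp`, `fleet-a`, `inst-01`, and so on) is a hypothetical placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str


@dataclass
class Domain:
    name: str
    kind: str  # "management" or "workload"
    clusters: list[Cluster] = field(default_factory=list)


@dataclass
class Instance:
    name: str
    domains: list[Domain] = field(default_factory=list)

    def management_domain(self) -> Domain:
        # An instance contains one management domain and, optionally, workload domains.
        mgmt = [d for d in self.domains if d.kind == "management"]
        if len(mgmt) != 1:
            raise ValueError(f"{self.name}: expected exactly one management domain")
        return mgmt[0]


@dataclass
class Fleet:
    name: str
    instances: list[Instance] = field(default_factory=list)


@dataclass
class PrivateCloud:
    name: str
    fleets: list[Fleet] = field(default_factory=list)


# Hypothetical example: one fleet, one instance, a management domain and one workload domain.
cloud = PrivateCloud("corp", [
    Fleet("fleet-a", [
        Instance("inst-01", [
            Domain("mgmt", "management", [Cluster("mgmt-cl01")]),
            Domain("wld-01", "workload", [Cluster("wld-cl01")]),
        ]),
    ]),
])
```

Calling `cloud.fleets[0].instances[0].management_domain()` returns the `mgmt` domain. An instance modelled with no management domain, or with more than one, raises an error, which is the point of writing the rule down.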

Core concept: separate fleet services from instance management planes

You get clean operations when you stop trying to force everything into a single “management plane” blob.

Instead, run this mental separation:

Fleet services

These are the things you deploy once per fleet to provide centralized capabilities:

VCF Operations: inventory, observability, and the console where centralized lifecycle and identity workflows surface.
VCF Operations fleet management appliance: lifecycle management operations for the fleet management components.
VCF Automation: self-service consumption, organization constructs, and automation.
VCF Identity Broker + VCF Single Sign-On: centralized authentication configuration across components (with important exclusions).

Practical implication: if fleet services are impaired, governance and workflows degrade, but the instance-level control planes do not magically disappear.

Instance management planes

Every instance retains its own control plane boundaries:

SDDC Manager
Management domain vCenter
Management domain NSX

This is where most “core infrastructure lifecycle” actually executes.

Domain-level control planes

Each workload domain is its own lifecycle and isolation boundary, typically with:

Its own vCenter
An NSX Manager (dedicated to the domain, or shared with other workload domains, depending on design)

What runs where in VCF 9.0 GA

A clean greenfield deployment is intentionally opinionated:

The management domain of the first instance hosts the fleet-level management components (VCF Operations and VCF Automation).
Additional instances still have their own instance-level management components (SDDC Manager, vCenter, NSX), and may deploy collectors as needed.

Two other details matter for design reviews:

VCF Operations fleet management is treated as a first-class appliance and should be protected with vSphere HA in the default management cluster.
VCF Single Sign-On can provide one-login access for many components, but not SDDC Manager and not ESXi.

Who owns what

This table is meant to stop “that’s not my job” loops during incidents and upgrades.

Component or capability	Platform team (VCF)	VI admin (domains and clusters)	App and platform teams
Fleet bring-up (VCF Installer, fleet creation)	Own	Consult	Inform
Fleet-level management components (VCF Operations, fleet management appliance, VCF Automation)	Own	Consult	Inform
VCF Identity Broker and VCF Single Sign-On configuration	Own	Consult	Inform
SDDC Manager (per instance)	Own (platform governance)	Own (day-2 execution)	Inform
Management domain vCenter and NSX	Shared	Own	Inform
Workload domain lifecycle (create domain, add clusters, remediate hosts)	Shared	Own	Inform
Workload consumption (Org structure, projects, templates, quotas, policies)	Shared (guardrails)	Consult	Own
Backup and restore for fleet management components	Own	Consult	Inform
Backup and restore for instance components (SDDC Manager, vCenter, NSX)	Shared (standards)	Own	Inform
Day-2 password lifecycle (rotation, remediation)	Own (policy + tooling)	Shared	Inform
Certificates and trust (CA integration, renewal cadence)	Own	Shared	Inform
DR plans for management components and identity	Own	Consult	Inform
DR plans for workload domains and applications	Shared (platform)	Shared (infra)	Own

Ownership rule of thumb:

Platform team owns the fleet services and guardrails.
VI admins own domain lifecycle execution and capacity.
App teams own how they consume resources and what SLAs they require.
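
During an incident it can help to have the table in a form that a script or chatbot can query. The sketch below encodes a subset of the rows above. The qualifiers in parentheses are dropped for brevity, and the function name `owners` is an illustrative choice rather than part of any VCF tooling.

```python
# Column order matches the table: platform team, VI admin, app and platform teams.
ROLES = ("Platform team", "VI admin", "App and platform teams")

# Subset of the ownership table; qualifiers such as "(guardrails)" are omitted.
OWNERSHIP = {
    "Fleet bring-up": ("Own", "Consult", "Inform"),
    "Fleet-level management components": ("Own", "Consult", "Inform"),
    "Management domain vCenter and NSX": ("Shared", "Own", "Inform"),
    "Workload domain lifecycle": ("Shared", "Own", "Inform"),
    "Workload consumption": ("Shared", "Consult", "Own"),
    "DR plans for workload domains and applications": ("Shared", "Shared", "Own"),
}


def owners(capability: str) -> list[str]:
    """Return the roles marked Own; if none, return the roles marked Shared."""
    row = dict(zip(ROLES, OWNERSHIP[capability]))
    owning = [role for role, level in row.items() if level == "Own"]
    return owning or [role for role, level in row.items() if level == "Shared"]
```

For example, `owners("Workload domain lifecycle")` returns `["VI admin"]`, which is the answer you want available before the bridge call starts.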

Day-0, day-1, day-2 map

This matters because VCF 9.0 pushes more workflows into a centralized console, but it does not eliminate domain-level responsibilities.

Day-0

Design-time decisions that are expensive to change later:

How many fleets you need (governance and isolation boundary).
How many instances you need (location and operational boundary).
Identity design:
- VCF Identity Broker deployment mode (embedded vs appliance).
- SSO scope (single instance vs cross-instance vs fleet-wide).
- Shared vs separate IdPs and SSO boundaries.
Network and IP plan:
- Subnet sizing for growth matters because changing subnet masks for infrastructure networks is not supported.
- Decide whether fleet-level components share the management VM network or get a dedicated network or NSX-backed segment.
Management domain sizing:
- Management domains must be sized to host the management components plus future workload domain growth.
Lifecycle blast radius strategy:
- How you segment domains, instances, and fleets to control upgrade and incident scope.
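
Because subnet masks for infrastructure networks cannot be changed later, it is worth checking headroom with arithmetic rather than intuition. The following is a rough sketch using Python's standard `ipaddress` module. The CIDR ranges, host counts, and growth factor are hypothetical, and the calculation does not reserve addresses for gateways or other fixed allocations.

```python
import ipaddress
import math


def usable_hosts(cidr: str) -> int:
    """Usable IPv4 addresses in a subnet, excluding network and broadcast."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - 2, 0)


def fits_growth(cidr: str, current_hosts: int, growth_factor: float) -> bool:
    """True if the subnet can hold the current hosts multiplied by the growth factor."""
    return usable_hosts(cidr) >= math.ceil(current_hosts * growth_factor)
```

With 40 addresses in use today and a planning assumption of doubling, a /26 (62 usable addresses) fails the check while a /24 (254 usable addresses) passes.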

Day-1

Bring-up and initial enablement:

Deploy the VCF Installer appliance, download binaries, and start a new VCF fleet deployment.
Bring up the first instance and its management domain.
Deploy the fleet-level management components (VCF Operations, fleet management appliance, VCF Automation).
Deploy VCF Identity Broker (often appliance mode for multi-instance SSO scenarios) and configure VCF Single Sign-On.
Create initial workload domains, and connect them into VCF Automation as needed.

Day-2

Ongoing operations:

Lifecycle management:
- Management component lifecycle through VCF Operations fleet management.
- Cluster lifecycle through vSphere lifecycle tooling, with VCF coordinating.
Identity operations:
- Adding components and instances into SSO scope.
- Re-assigning roles and permissions inside vCenter and NSX after SSO configuration changes.
Security hygiene:
- Password rotation and remediation flows.
- Certificate replacement with CA-signed certs across both management components and instance components.
Platform resilience:
- Backup scheduling to an SFTP target for management components and instance components.
- Shutdown and startup runbooks that preserve authentication and cluster integrity.

Identity and SSO boundaries that actually matter

What VCF Single Sign-On does (and does not)

VCF Single Sign-On is designed to streamline access across multiple VCF components with one authentication source configured from the VCF Operations console.

Key operational detail:

It supports SSO across components like vCenter, NSX, VCF Operations, VCF Automation, and other VCF management components.
It explicitly excludes SDDC Manager and ESXi, which means you still need local access patterns and break-glass workflows for those systems.
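
The exclusion list is short enough to encode directly in runbook tooling. This is a minimal sketch based only on the components named above. The function names and the returned strings are illustrative, and components not listed here should be checked against the product documentation.

```python
# Components the text names as covered by VCF Single Sign-On, and the two it excludes.
SSO_COVERED = frozenset({"vCenter", "NSX", "VCF Operations", "VCF Automation"})
SSO_EXCLUDED = frozenset({"SDDC Manager", "ESXi"})


def access_path(component: str) -> str:
    """Return how operators are expected to log in to a component."""
    if component in SSO_EXCLUDED:
        return "local account (break-glass)"
    if component in SSO_COVERED:
        return "VCF Single Sign-On"
    return "unknown: check the product documentation"


def break_glass_inventory(components: list[str]) -> list[str]:
    """List the components that need local credentials in the runbook."""
    return sorted(c for c in components if c in SSO_EXCLUDED)
```

Running `break_glass_inventory` over your component list gives the set of systems whose local credentials must be vaulted and tested regardless of how well SSO is working.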

Identity pillars in VCF

Your identity design is built on three pillars:

External IdP (SAML/OIDC or directory)
VCF Identity Broker (brokers authentication and maintains SSO tokens)
VCF Single Sign-On (centralized authentication configuration and user management)

Important constraint:

Each VCF Identity Broker is configured with a single identity provider.

VCF Identity Broker deployment modes

Here’s the practical decision point.

Decision point	Embedded (vCenter service)	Appliance (3-node cluster)
Where it runs	Inside management domain vCenter	Stand-alone appliances deployed via VCF Operations fleet management
Multi-instance recommendation	One per instance	Up to five instances per Identity Broker appliance
Availability characteristics	Tied to management vCenter availability	Designed for higher availability; handles node failure
Typical fit	Single instance, simpler environments	Multi-instance, larger environments, stronger availability targets

Change management warning: moving from appliance to embedded mode requires resetting the VCF Single Sign-On configuration and re-adding users and groups. Treat the deployment mode decision as day-0.
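
The table's sizing guidance reduces to simple arithmetic. The sketch below assumes the figures above (one embedded broker per instance, up to five instances per appliance-mode broker) and counts broker deployments, not individual appliance nodes. The function name and mode strings are hypothetical.

```python
import math

# From the comparison table: up to five instances per Identity Broker appliance deployment.
MAX_INSTANCES_PER_APPLIANCE_BROKER = 5


def identity_brokers_needed(instance_count: int, mode: str) -> int:
    """Estimate how many Identity Broker deployments a fleet needs."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    if mode == "embedded":
        # Embedded mode runs inside each instance's management domain vCenter.
        return instance_count
    if mode == "appliance":
        return math.ceil(instance_count / MAX_INSTANCES_PER_APPLIANCE_BROKER)
    raise ValueError(f"unknown deployment mode: {mode}")
```

For a seven-instance fleet this gives seven embedded brokers or two appliance-mode deployments, which is a useful number to have on the table when the mode is being debated at day-0.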

Challenge: You need shared identity for convenience, but regulated isolation for some tenants

Solutions:

A) Shared enterprise IdP with fleet-wide SSO

Best when you want one login experience across instances in the same fleet.
Biggest tradeoff is blast radius: the SSO scope is large.

B) Cross-instance SSO with multiple Identity Brokers in one fleet

Each Identity Broker serves a defined set of instances.
Reduces SSO blast radius compared to a single broker for the whole fleet.
Constraint: VCF management components (for example, VCF Operations and VCF Automation) can connect to only one Identity Broker instance for SSO, so you must design carefully if you are aiming for multiple identity boundaries under one fleet.

C) Separate fleets for regulated isolation

Strongest governance and identity boundary.
Higher operational overhead: multiple sets of fleet-level management components.

Topology patterns for single site, two sites, and multi-region

Use the design blueprints as your baseline mental model. Then tune.

Challenge: Your topology is not one-size-fits-all

Solutions:

A) Single site with minimal footprint

Best when you need to start small and accept tighter fault domains.
Typical posture:
- Single fleet, single instance.
- Management components and workloads can be co-located in one cluster for footprint reduction.
Operational reality:
- You are trading physical failure-domain isolation for speed and cost.
- Plan early if you intend to adopt organization models in VCF Automation that require additional clusters.

B) Two sites in one region

In VCF terms, a region is a group of sites close enough to each other to stay within synchronous replication latencies.
Typical posture:
- Single fleet, single instance.
- Stretched clusters across the two sites for higher availability.
- A dedicated workload domain for workloads, with management components protected in the management domain cluster.
Day-2 consequences:
- You are now dependent on stretched network and storage behaviors for management plane availability.
- You must design first-hop gateway resilience across availability zones for stretched segments.

C) Multi-region

Typical posture:
- Single fleet, multiple instances (at least one per region or per major site).
- Fleet-level management components run in the management domain of the first instance.
- Additional instances bring their own management domain control planes.
Practical design statement:
- Recovery between regions is a disaster recovery process. Do not confuse “multi-region” with “active-active without DR work”.

Quick comparison:

Topology	Fleet count	Instance count	Typical SSO scope	Primary operational risk
Single site	1	1	Single instance or fleet-wide	Small fault domain, tight coupling
Two sites, one region	1	1	Fleet-wide (common)	Stretched dependencies for management availability
Multi-region	1+	2+	Cross-instance or fleet-wide	Governance dependency on where fleet services run

Failure domain analysis

This is the conversation leadership actually needs.

Fleet services failure

If VCF Operations, fleet management, or VCF Automation are impaired:

You lose or degrade centralized lifecycle workflows, automation workflows, and centralized observability.
Instance control planes still exist, but day-2 operations may become more manual.

If VCF Identity Broker is down:

Users from external identity providers cannot authenticate.
You must fall back to local accounts for subsequent operations until Identity Broker is restored.

Instance management domain failure

If an instance management domain is down:

That instance’s domain lifecycle operations are impacted.
Workloads in workload domains may remain running, but you have reduced ability to manage and remediate.

Workload domain failure

If a workload domain’s vCenter or NSX is degraded:

Workloads in that domain take the blast radius.
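Workload domains that have their own vCenter and NSX, whether in the same instance or in other instances, are unaffected; domains that share the degraded NSX Manager are not.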

Example RTO/RPO targets you can start with

These are practical starting points to drive a discussion. Adjust to your business requirements.

Fleet services (VCF Operations, fleet management, VCF Automation):
- RTO: 4 hours
- RPO: 24 hours (aligned to daily backups)
Identity Broker:
- RTO: 1 to 2 hours
- RPO: 24 hours (align to backup cadence, plus local break-glass accounts)
Instance management domain:
- RTO: 2 to 4 hours
- RPO: 24 hours
Workload domain:
- Driven by application SLAs and data replication strategy
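
An RPO target only holds if the backup cadence supports it. The following sketch checks a proposed backup interval against the starting targets listed above. The dictionary keys and function name are illustrative, and the check uses the common simplification that worst-case data loss equals the time between two successful backups.

```python
from datetime import timedelta

# Starting RPO targets from the list above; adjust to your business requirements.
RPO_TARGETS = {
    "fleet services": timedelta(hours=24),
    "identity broker": timedelta(hours=24),
    "instance management domain": timedelta(hours=24),
}


def meets_rpo(component: str, backup_interval: timedelta) -> bool:
    """True if backups run at least as often as the component's RPO target."""
    return backup_interval <= RPO_TARGETS[component]
```

A daily schedule passes for all three entries; a schedule that slips to every 36 hours does not, and the check makes that visible before an audit does.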

Operational runbook snapshot

Shutdown order matters

In multi-instance environments:

Shut down instances that do not run VCF Operations and VCF Automation first.
The instance running the fleet-level management components should be last.

Within the management domain that hosts fleet services, a typical shutdown sequence starts with:

VCF Automation
VCF Operations
VCF Identity Broker
Instance management components (NSX, vCenter, SDDC Manager)
ESXi hosts
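
The ordering rules above are easy to get wrong under time pressure, so it can help to generate the plan instead of remembering it. This is a sketch only: the instance names are hypothetical, the component list simply restates the sequence above, and the function does not call any VCF API.

```python
# Sequence inside the management domain that hosts fleet services, as listed above.
FLEET_HOST_SEQUENCE = [
    "VCF Automation",
    "VCF Operations",
    "VCF Identity Broker",
    "Instance management components (NSX, vCenter, SDDC Manager)",
    "ESXi hosts",
]


def instance_shutdown_order(instances: list[str], fleet_host: str) -> list[str]:
    """Order instances so the one hosting fleet services is shut down last."""
    if fleet_host not in instances:
        raise ValueError(f"{fleet_host} is not in the instance list")
    others = [name for name in instances if name != fleet_host]
    return others + [fleet_host]
```

With instances `inst-01`, `inst-02`, and `inst-03`, where `inst-01` hosts the fleet-level components, the function returns `inst-02`, `inst-03`, `inst-01`. `FLEET_HOST_SEQUENCE` then applies to that final instance.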

Operational gotcha:

Taking the VCF Operations cluster offline can take significant time. Plan your maintenance windows accordingly.

Backups: get the SFTP target right early

You should treat SFTP backup targets as day-1 prerequisites, not an afterthought.

Configure SFTP settings for VCF management components.
Configure backup schedules for VCF Operations and VCF Automation.
Configure backup schedules for SDDC Manager, NSX Manager, and vCenter at the instance level.

Password lifecycle: know which system is authoritative

You can change passwords for many local users through VCF Operations.
Some password expiration and status information is updated on a schedule; real-time status often requires checking at the instance source (SDDC Manager and related APIs).
You can retrieve default passwords from SDDC Manager using the lookup_passwords command on the appliance.

To run the lookup_passwords command, log in to the SDDC Manager appliance and use this Bash example:

# On the SDDC Manager appliance
sudo lookup_passwords

Fast validation: confirm build levels in your environment

Before you start production onboarding, validate you are actually running the expected code level.

Use this PowerShell example with PowerCLI to validate vCenter and ESXi versions:

# Connect to vCenter
Connect-VIServer -Server <vcenter_fqdn>

# vCenter build and version
$about = (Get-View ServiceInstance).Content.About
[PSCustomObject]@{
  Product = $about.FullName
  Version = $about.Version
  Build   = $about.Build
}

# ESXi hosts build and version
Get-VMHost | Sort-Object Name | Select-Object Name, Version, Build
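
Once you have collected build numbers, comparing them against the expected levels is a small lookup. The sketch below uses the build numbers listed in the scope section at the top of this article. The `observed` input is assumed to be assembled by hand or exported from the commands above, and only build numbers are compared because the version string format can differ between tools.

```python
# Expected build numbers from the scope list at the top of this article.
EXPECTED_BUILDS = {
    "SDDC Manager": "24703748",
    "vCenter": "24755230",
    "ESXi": "24755229",
    "NSX": "24733065",
}


def build_drift(observed: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {component: (expected_build, observed_build)} for every mismatch."""
    drift = {}
    for component, build in observed.items():
        expected = EXPECTED_BUILDS.get(component)
        if expected is not None and build != expected:
            drift[component] = (expected, build)
    return drift
```

An empty result means every component you supplied matches the expected build. Components missing from `EXPECTED_BUILDS` are ignored, not flagged.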

Anti-patterns

Avoid these early, or you will pay in incident response time later:

Treating fleet and instance as synonyms
- Fleet is centralized governance and services.
- Instance is a discrete VI footprint with its own management domain.
Designing SSO as if SDDC Manager participates
- It does not. Plan break-glass access and operational runbooks accordingly.
Choosing embedded Identity Broker for multi-instance and then being surprised by availability coupling
- If multi-instance SSO matters, appliance mode is commonly the safer default.
Using one fleet for regulated tenants without validating identity and governance blast radius
- Separate fleets remain the cleanest isolation boundary when governance separation is required.
Under-sizing management domains
- Fleet services and management components are not free. You will scale them and patch them like any other production system.

Summary and takeaways

Use the official construct hierarchy to keep conversations consistent: private cloud -> fleet -> instance -> domains -> clusters.
Fleet-level management components centralize governance, but they do not collapse instance control planes into a single shared management plane.
Identity design is a day-0 decision. Choose Identity Broker deployment mode and SSO scope intentionally.
Align topology to operations:
- Single site is about speed and footprint.
- Two-site in one region is about availability with stretched dependencies.
- Multi-region is about DR posture and multiple instance management planes.

Conclusion

VCF 9.0 becomes dramatically easier to operate when everyone can point to the same boundaries:

Fleet boundaries for centralized services and governance.
Instance boundaries for discrete infrastructure footprints.
Domain boundaries for lifecycle and workload isolation.

That shared mental model is what lets you scale without scaling confusion.

Sources

VMware Cloud Foundation 9.0 Documentation (VCF 9.0 and later): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0.html
