VM Network Troubleshooting from Guest OS to Uplink: A Layer by Layer VMware Runbook

Paul Bryant

3 weeks ago

Virtual machine network problems rarely arrive with a clean label.

The ticket usually says something like “the VM is unreachable,” “the application cannot connect,” “ping fails,” “internet access is down,” or “VMs on different hosts cannot talk.” The underlying cause might be inside the guest OS, on the VM’s virtual NIC, in the port group, on the VLAN trunk, in the distributed switch, on a bad uplink, at the physical switch, in routing, or at a firewall boundary.

That is why a useful VMware troubleshooting process needs to be layered.

Broadcom KB 324542 frames VM network troubleshooting as a sequence of checks that should not be skipped, covering port group names, VM adapter connection state, guest OS networking, TCP/IP stack behavior, P2V hidden adapters, uplink isolation, VLAN configuration, jumbo frames, and packet capture. This article turns that KB into an operational ladder that an engineer can use during a real incident.

The goal is not to prove that the network, virtualization layer, firewall, or guest OS is “the problem.” The goal is to narrow the failure domain without creating a second outage.

Scenario

A virtual machine running on VMware vSphere has lost network connectivity.

The symptom may be isolated to one VM, several VMs on the same port group, VMs on one ESXi host, VMs after vMotion, VMs on different VLANs, or traffic to a specific destination. Broadcom’s KB lists common symptoms such as unreachable VMs, failed VM-to-VM communication across hosts, high latency, failed inbound or outbound traffic, unavailable internet access, and TCP/IP connection failures.

The runbook starts inside the guest OS and works outward to the physical and policy boundaries.

Why This Matters Operationally

The fastest way to waste time on a VM network issue is to start in the middle.

Changing a VLAN before checking the guest IP configuration can hide a simple OS issue. Rebuilding a port group before checking an uplink can create a broader outage. Blaming routing before testing the default gateway can pull the wrong team into the incident.

That matters in VCF and vSphere operations because VM networking crosses ownership boundaries. The same packet can touch the guest OS, vNIC, port group, vDS, host uplink, top-of-rack switch, default gateway, firewall, and routing domain before the application ever sees a response.

Symptoms and Risk

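Use this runbook when you see symptoms like unreachable VMs, failed VM-to-VM communication across hosts, high latency, failed inbound or outbound traffic, unavailable internet access, or TCP/IP connection failures.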

The operational risk is not just downtime. It is accidental blast radius.

Do not change VLANs, uplink teaming, LACP, distributed switch policies, firewall rules, or physical switch trunks until you have captured the current state and identified the smallest safe test.

Troubleshooting Ladder at a Glance

The diagram below is the troubleshooting path. The important thing to notice is that the checks move from the VM outward. Each layer should either prove connectivity, identify the break, or provide the evidence needed to hand off to the next owner.

This should be treated as a ladder, not a checklist of random ideas. If the VM cannot reach its default gateway, focus on Layer 2, VLAN, port group, uplink, and physical switch evidence first. If the VM can reach the gateway but cannot reach another subnet, Broadcom’s default-gateway troubleshooting guidance points toward Layer 3 routing rather than the local virtual switch path.

Prerequisites and Safety Checks

Before changing anything, collect the basics.

You need:

VM name
Guest OS type
VM IP address, subnet mask, default gateway, DNS servers
Destination IP, port, and protocol being tested
ESXi host currently running the VM
Cluster and vDS or standard vSwitch name
Port group name and VLAN ID
Physical uplinks used by the host
Whether NSX/vDefend Distributed Firewall applies
Whether this is a single VM, port group, host, cluster, or site-wide symptom

There is one important exception: if the unreachable VM is vCenter Server, be careful. Broadcom’s KB specifically calls out vCenter reachability as a scenario where opening a networking support case may be the best path, especially when vCenter networking is delivered through a vSphere Distributed Switch.

That warning exists for a reason. A vDS-backed vCenter outage can turn normal remediation into a control-plane recovery problem.

Runbook Stages

Stage 1: Define the Failure Domain

Start by proving the scope.

Ask four questions:

Is this one VM or multiple VMs?
Is it one port group or multiple port groups?
Is it one ESXi host or every host in the cluster?
Is the failure limited to one destination, one subnet, or all traffic?

This first step decides where the runbook branches.

A single VM problem usually starts with the guest OS, VM vNIC, or VM-specific policy. A port group-wide issue points toward VLAN, port group policy, or upstream trunking. A host-specific issue points toward that ESXi host’s uplinks, physical switch ports, or LACP/team configuration. A cross-subnet-only issue points toward routing or firewall policy.

Document the failure in plain terms:

Source VM:      APP01
Source IP:      10.20.30.41
Source Host:    esxi07
Port Group:     PG-App-Prod
VLAN:           230
Destination:    10.20.30.1 default gateway
Result:         Ping fails from APP01, succeeds from APP02 on same port group
Scope:          Single VM

That simple record prevents the incident from drifting.
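
The record and the branching logic above can also be kept in a small structure so every incident starts the same way. This is an illustrative sketch only: `FailureRecord`, `FIRST_SUSPECTS`, and the scope labels are hypothetical names, not part of any VMware tool.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical scope labels mapped to the layers to examine first,
# following the branching described in this stage.
FIRST_SUSPECTS = {
    "single_vm": ["guest OS", "VM vNIC", "VM-specific policy"],
    "port_group": ["VLAN", "port group policy", "upstream trunking"],
    "host": ["ESXi uplinks", "physical switch ports", "LACP/team configuration"],
    "cross_subnet": ["routing", "firewall policy"],
}


@dataclass
class FailureRecord:
    source_vm: str
    source_ip: str
    source_host: str
    port_group: str
    vlan: int
    destination: str
    result: str
    scope: str  # one of the FIRST_SUSPECTS keys

    def first_suspects(self) -> List[str]:
        """Return the layers to check first for this record's scope."""
        return FIRST_SUSPECTS[self.scope]


record = FailureRecord(
    source_vm="APP01",
    source_ip="10.20.30.41",
    source_host="esxi07",
    port_group="PG-App-Prod",
    vlan=230,
    destination="10.20.30.1",
    result="Ping fails from APP01, succeeds from APP02 on same port group",
    scope="single_vm",
)
print(record.first_suspects())
```

An unknown scope raises a `KeyError`, which is deliberate in this sketch: it forces the scope question to be answered before the runbook branches.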

Stage 2: Check the Guest OS First

A VM can be perfectly connected to the right port group and still fail because the guest OS is misconfigured.

From inside the guest, verify:

IP address
Subnet mask or prefix length
Default gateway
DNS settings
Static routes
Duplicate IP warnings
OS firewall profile
NIC driver state
Whether the OS thinks the cable is disconnected
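
One of the checks above, whether the default gateway actually sits inside the VM's own subnet, is easy to verify offline once the guest values are written down. This is a minimal sketch using Python's standard `ipaddress` module. The function name is hypothetical, and the second example assumes a mistyped prefix length.

```python
import ipaddress


def gateway_is_on_link(vm_ip: str, prefix_len: int, gateway: str) -> bool:
    """Return True if the default gateway falls inside the VM's own subnet."""
    network = ipaddress.ip_interface(f"{vm_ip}/{prefix_len}").network
    return ipaddress.ip_address(gateway) in network


# Gateway inside 10.20.30.0/24
print(gateway_is_on_link("10.20.30.41", 24, "10.20.30.1"))
# A /25 on the same address covers only .0 to .127, so .129 is off-link
print(gateway_is_on_link("10.20.30.41", 25, "10.20.30.129"))
```

If the check fails, the guest cannot reach its gateway no matter how the port group or VLAN is configured.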

For Windows (Test-NetConnection requires PowerShell):

ipconfig /all
route print
ping 127.0.0.1
ping <vm-ip>
ping <default-gateway-ip>
tracert <destination-ip>
Test-NetConnection <destination-ip> -Port <tcp-port>

For Linux:

ip addr
ip route
ping -c 4 127.0.0.1
ping -c 4 <vm-ip>
ping -c 4 <default-gateway-ip>
traceroute <destination-ip>
nc -vz <destination-ip> <tcp-port>

Interpret the results carefully.

If loopback fails, the problem is inside the OS TCP/IP stack. If the VM cannot ping its own IP, the guest stack or interface configuration is suspect. If the VM can ping itself but not the gateway, move outward to the vNIC, port group, VLAN, and uplink path. If the VM can ping the gateway but not a remote subnet, shift toward routing or firewall boundaries.
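
The interpretation above is a short decision chain, which can be written as a function to keep the order of checks explicit. This is a sketch only: `next_layer` and its parameter names are hypothetical, and the returned strings paraphrase the guidance in this stage.

```python
from typing import Optional


def next_layer(loopback_ok: bool, self_ok: bool,
               gateway_ok: bool, remote_ok: bool) -> Optional[str]:
    """Map guest ping results to the next layer worth investigating."""
    if not loopback_ok:
        return "guest OS TCP/IP stack"
    if not self_ok:
        return "guest stack or interface configuration"
    if not gateway_ok:
        return "vNIC, port group, VLAN, and uplink path"
    if not remote_ok:
        return "routing or firewall boundaries"
    return None  # all four ICMP tests passed


print(next_layer(True, True, False, False))
```

The order matters. A failed loopback test makes the later results meaningless, so the function returns at the first failure.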

Broadcom’s KB explicitly includes guest OS networking and TCP/IP stack validation as part of the VM network troubleshooting sequence.

Stage 3: Verify the VM vNIC and Port Group Assignment

Next, confirm the virtual NIC exists, is connected, and is attached to the intended network.

In vSphere Client, check:

VM > Edit Settings
Network Adapter status
Connected checkbox
Connect at power on
Port group name
Adapter type
MAC address
Any recent network adapter changes

Broadcom’s KB starts the vSphere-side troubleshooting sequence by ensuring the VM’s port group exists on the vSwitch or vDS, is spelled correctly, and that the VM’s adapter is connected. It also notes that standard switches require VMkernel adapters to use their own port groups, so a VM should not be placed on a VMkernel port group.

This stage catches common mistakes:

Finding	Likely Cause	Action
Adapter disconnected	Manual change, automation issue, migration artifact	Reconnect only after confirming correct port group
Wrong port group	Template, clone, restore, or migration mistake	Move to correct port group
Port group missing on target host	Host not attached to vDS, standard switch inconsistency	Fix host/vDS membership or port group placement
Duplicate or stale guest NIC	P2V or OS-level hidden adapter	Clean up hidden adapter/IP conflict

If the VM was converted from physical to virtual, pay attention to hidden adapters. Broadcom’s KB calls out P2V hidden network adapters as a specific condition to check when troubleshooting VM networking.

Stage 4: Validate VLAN and Subnet Alignment

Many “VM network” incidents are really VLAN consistency problems.

Confirm:

VM IP subnet matches the intended VLAN
Port group VLAN ID is correct
Physical switch port mode matches the VMware tagging model
The VLAN is allowed on the trunk
Native VLAN expectations are understood
The same VLAN is available on every host where the VM can run

Broadcom’s VLAN configuration article describes three ESXi VLAN tagging methods: External Switch Tagging, Virtual Switch Tagging, and Virtual Guest Tagging. In EST, tagging is done on the physical switch and the ESXi port group VLAN ID is set to 0. In VST, tagging is done by the virtual switch and the ESXi uplinks connect to physical trunk ports with the appropriate VLAN configured on the port group. In VGT, tagging is done inside the guest OS and VLAN tags are preserved through the virtual switch.

Most enterprise VM port groups use VST. That means the usual check is:

VM subnet  -> expected VLAN
Port group -> same VLAN ID
ESXi uplink -> physical trunk
Switchport -> VLAN allowed on trunk
Gateway -> SVI/router for that VLAN reachable

Do not assume the VLAN is correct because the port group name looks right. Validate the actual VLAN ID.
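
The VST check above can be sketched as a comparison between the expected VLAN, the port group VLAN ID, and the allowed-VLAN list on each trunk. The input format is an assumption: the allowed list is taken to be a comma-separated string with ranges (for example `100-110,230`), collected by hand from the physical switch. Function names are hypothetical.

```python
def parse_allowed_vlans(spec: str) -> set:
    """Expand a list such as "100-110,230" into a set of VLAN IDs."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def vlan_findings(expected_vlan: int, port_group_vlan: int,
                  trunk_allowed: dict) -> list:
    """Return human-readable mismatches; an empty list means aligned."""
    findings = []
    if port_group_vlan != expected_vlan:
        findings.append(
            f"port group VLAN {port_group_vlan} != expected {expected_vlan}")
    for uplink, spec in trunk_allowed.items():
        if expected_vlan not in parse_allowed_vlans(spec):
            findings.append(
                f"VLAN {expected_vlan} not allowed on trunk for {uplink}")
    return findings


# vmnic3's trunk is missing VLAN 230 in this made-up example
print(vlan_findings(230, 230, {"vmnic2": "100-110,230", "vmnic3": "100-110"}))
```

A finding on only one uplink is the pattern to watch for, because it produces failures that follow the teaming decision rather than the VM.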

Stage 5: Check the vSwitch or Distributed Switch Path

Now move from the VM object to the switching layer.

For a standard vSwitch, confirm:

Port group exists on the host where the VM is running
Correct VLAN ID
Correct uplinks assigned
Teaming and failover settings
Security policy settings if relevant
MTU alignment if jumbo frames are required

For a vSphere Distributed Switch, confirm:

Host is attached to the correct vDS
Distributed port group exists
VM is connected to the expected distributed port
Port group VLAN policy is correct
Teaming and failover policy is correct
Active uplinks map to physical NICs that carry the required VLAN
No per-port override is changing the expected policy

This is where a lot of post-vMotion issues show up. The VM may land on a host where the distributed port group exists, but the physical uplink path does not actually carry the VLAN.

A clean test is to compare a working VM and a failing VM:

Comparison Point	Working VM	Failing VM
Same port group?	Yes/No	Yes/No
Same VLAN ID?	Yes/No	Yes/No
Same ESXi host?	Yes/No	Yes/No
Same active vmnic?	Yes/No	Yes/No
Same default gateway result?	Yes/No	Yes/No
Same firewall policy?	Yes/No	Yes/No

Broadcom’s default gateway troubleshooting guidance recommends comparing affected VMs against other VMs in the same port group/subnet, and using esxtop networking view when only some VMs have gateway connectivity issues.
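
The working-versus-failing comparison can be reduced to a diff over the same set of comparison points. This is a minimal sketch with hypothetical key names. The values are entered by hand from vSphere Client and esxtop, not collected automatically.

```python
def differing_points(working: dict, failing: dict) -> dict:
    """Return only the comparison points where the two VMs differ."""
    keys = sorted(set(working) | set(failing))
    return {key: (working.get(key), failing.get(key))
            for key in keys
            if working.get(key) != failing.get(key)}


working_vm = {"port_group": "PG-App-Prod", "vlan": 230, "host": "esxi08",
              "active_vmnic": "vmnic3", "gateway_ping": True}
failing_vm = {"port_group": "PG-App-Prod", "vlan": 230, "host": "esxi07",
              "active_vmnic": "vmnic2", "gateway_ping": False}

# Each value is (working, failing); identical points are dropped
print(differing_points(working_vm, failing_vm))
```

Whatever remains after the identical points are removed is the shortlist for the next stage.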

Stage 6: Isolate the ESXi Uplink and Teaming Path

If the problem appears host-specific or intermittent, check the uplink path.

On the ESXi host, use esxtop and press n for networking. Broadcom’s KB recommends using esxtop networking output to see which physical NIC a VM is using, then isolating physical switch ports one at a time to determine where connectivity is lost.

Useful ESXi checks:

esxtop
# Press n for networking view

net-stats -l

esxcli network nic list

esxcli network nic stats get -n vmnicX

Look for:

VM mapped to a different uplink than working VMs
Link down or speed/duplex mismatch
RX/TX errors
Dropped packets
Incorrect standby/active uplink order
LACP or EtherChannel mismatch
VLAN missing on one trunk but present on another

If the port group uses Route Based on Originating Virtual Port ID, a VM may consistently use one uplink until it moves or reconnects. If one uplink path is misconfigured, only a subset of VMs may fail. That symptom often looks random until you map VM traffic to the active pNIC.
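
Mapping VMs to their active pNIC is what makes the "random" pattern visible. The sketch below groups hand-collected observations by uplink and flags any uplink where every VM fails its gateway test. The function name and the tuple layout are assumptions, and APP03 is a made-up third VM.

```python
from collections import defaultdict


def suspect_uplinks(observations) -> list:
    """observations: iterable of (vm_name, vmnic, gateway_ok) tuples.

    Returns the uplinks on which no observed VM can reach its gateway.
    """
    results_by_uplink = defaultdict(list)
    for _vm_name, vmnic, gateway_ok in observations:
        results_by_uplink[vmnic].append(gateway_ok)
    return sorted(nic for nic, results in results_by_uplink.items()
                  if not any(results))


observed = [
    ("APP01", "vmnic2", False),
    ("APP03", "vmnic2", False),
    ("APP02", "vmnic3", True),
]
print(suspect_uplinks(observed))
```

An uplink that carries both passing and failing VMs is not flagged, since that result points back toward per-VM causes rather than the uplink path.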

If LACP or EtherChannel is in use, validate both sides. Broadcom’s VM network troubleshooting KB calls out port-channel techniques and recommends verifying that the physical switch ports are configured correctly for the channel.

Stage 7: Validate the Physical Switch Edge

At this stage, the virtualization team should have enough evidence to engage the network team with specifics.

Provide:

ESXi host:        esxi07
VM:               APP01
Port group:       PG-App-Prod
VLAN:             230
Active vmnic:     vmnic2
Switchport:       ToR-A Eth1/17
Test:             APP01 cannot ping 10.20.30.1 gateway
Working path:     APP02 on esxi08 via vmnic3 can ping gateway
Request:          Confirm switchport trunk allows VLAN 230 and MTU matches

Ask the network team to validate:

Access vs trunk mode
Allowed VLAN list
Native VLAN behavior
Port-channel membership
STP/portfast configuration
MTU
MAC address learning
ARP behavior
Interface errors or drops
ACLs on the switchport or SVI

This is also the right stage to check jumbo frames. Broadcom’s KB notes that if VMs require MTU 9000 and the VM network is configured for jumbo frames, the physical switch ports must also be configured for jumbo frames.
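
The arithmetic behind a jumbo frame test is worth writing down. This is a minimal sketch, assuming IPv4 without options. The largest ICMP payload that fits in one unfragmented packet is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The function name is hypothetical.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(icmp_payload_for_mtu(9000))  # 8972
print(icmp_payload_for_mtu(1500))  # 1472
```

For MTU 9000 that gives 8972, which is the size to use with a don't-fragment ping, for example `vmkping -d -s 8972 <ip>` on ESXi, `ping -M do -s 8972 <ip>` on Linux, or `ping -f -l 8972 <ip>` on Windows. If the 1472-byte test passes and the 8972-byte test fails, some hop in the path is not configured for jumbo frames.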

Stage 8: Test Default Gateway, Routing, and Remote Subnets

Separate Layer 2 reachability from Layer 3 reachability.

Use this logic:

Can VM ping itself?
  No -> guest OS / TCP/IP stack

Can VM ping another VM on same subnet?
  No -> port group / VLAN / uplink / local firewall

Can VM ping default gateway?
  No -> VLAN / uplink / physical switch / gateway SVI

Can VM ping remote subnet?
  No -> routing / firewall / ACL / asymmetric path

Can VM open the TCP port on the remote host?
  No -> service listener / firewall / security policy / application path

Broadcom’s default gateway article states that if VMs on the same subnet and host cannot reach the gateway, check VLAN configuration on the port group and physical switch. It also states that if gateway connectivity succeeds but other subnets fail, the issue is likely routing/Layer 3 and the network team should investigate.

For TCP checks from ESXi or supporting hosts, nc is useful when you need to test whether a TCP port is reachable. Broadcom’s host network troubleshooting KB lists ping/vmkping, nc, openssl, tcpdump-uw, and esxcli network as ESXi troubleshooting tools, and notes that nc helps determine whether a TCP port is online or possibly blocked by a firewall.

Example:

nc -z <destination-ip> <tcp-port>

For guest-level testing, use tools appropriate to the OS:

# Windows (PowerShell)
Test-NetConnection <destination-ip> -Port 443

# Linux
nc -vz <destination-ip> 443

A successful ping does not prove the application path is open. It only proves ICMP reachability.
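
Where neither `nc` nor `Test-NetConnection` is available in the guest but Python is, the same TCP reachability test can be sketched with the standard `socket` module. The function name and the default timeout are assumptions. A `False` result does not distinguish between a closed port and a firewall drop.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks
        return False


# Example call, left commented so that nothing is probed on import:
# tcp_port_open("<destination-ip>", 443)
```

Run it from the affected VM and from a known-good VM in the same port group so that the two results can be compared directly.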

Stage 9: Check Firewall and Security Policy Boundaries

Firewall troubleshooting belongs near the end of the ladder, but it should not be ignored.

There may be multiple enforcement points:

Boundary	What to Check
Guest OS firewall	Windows Defender Firewall, Linux firewalld/iptables/nftables
NSX/vDefend Distributed Firewall	Rule match, applied-to scope, rule order, realization, exclusion list
Upstream firewall	Source/destination zones, service object, NAT, route symmetry
Physical ACL	SVI ACL, switchport ACL, routed interface ACL
Application listener	Service bound to correct IP and port

For NSX/vDefend DFW, Broadcom’s DFW troubleshooting guidance recommends checking rule source, destination, services, profiles, actions, applied-to scope, rule order, whether the rule is enabled, Traceflow, packet logs, and realized rules on ESXi hosts.

Do not “test” a firewall theory by broadly disabling security controls in production.

Safer tests include:

Verify rule hit counters.
Temporarily enable logging on the suspected rule.
Test a narrow source/destination/service tuple.
Use Traceflow where NSX applies.
Compare the VM against a known-good VM in the same security group.
Use a temporary allow rule only with change approval, scope, owner, and rollback.

If adding the VM to an exclusion list appears to remediate the problem, treat that as a diagnostic result, not the final fix. Broadcom’s DFW troubleshooting article includes the exclusion list as one troubleshooting step, but the durable fix should be a corrected policy, group membership, service definition, or rule order.

Stage 10: Use Packet Capture When the Evidence Is Still Ambiguous

Packet captures are the escalation tool that turns “it should work” into evidence.

Use them when:

The VM sends traffic but never receives replies.
The gateway ARP does not resolve.
One uplink works and another does not.
A firewall team needs proof of source, destination, and port.
The physical network team needs to know whether frames leave the ESXi host.
The application team says traffic never arrives.

Broadcom documents pktcap-uw as an ESXi packet capture tool included in ESXi 5.5 and later, capable of capturing traffic at multiple points in the hypervisor. The same Broadcom article warns not to store packet captures in /tmp; use an appropriate datastore path instead.

A practical pattern is to capture near the VM and near the uplink at the same time.

First identify the VM’s switchport and active uplink:

net-stats -l
esxtop
# Press n for networking view

Then capture at the VM vNIC side and uplink side:

mkdir /vmfs/volumes/<datastore>/Packet_Captures

pktcap-uw --switchport <switchport-id> \
  --capture VnicTx,VnicRx \
  -s 256 \
  --ip <gateway-or-destination-ip> \
  -o /vmfs/volumes/<datastore>/Packet_Captures/<host>.<vm>.switchport.pcapng &

pktcap-uw --uplink vmnicX \
  --capture UplinkSndKernel,UplinkRcvKernel \
  -s 256 \
  --ip <gateway-or-destination-ip> \
  -o /vmfs/volumes/<datastore>/Packet_Captures/<host>.vmnicX.uplink.pcapng &

Stop captures cleanly:

kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)

Broadcom’s pktcap-uw guidance describes --switchport as the capture point closest to the VM vNIC and --uplink as the capture point closest to the physical infrastructure.

Interpretation is straightforward:

Capture Result	Likely Meaning
Packet leaves VM vNIC but not uplink	vSwitch/vDS policy, port state, security filter, teaming path
Packet leaves uplink but no reply returns	Physical switch, VLAN, gateway, firewall, routing
Request and reply seen on uplink but not VM vNIC	Host switching, DFW/security filter, port state
Nothing leaves VM vNIC	Guest OS, application, local firewall, vNIC disconnected
ARP request leaves but no ARP reply	VLAN, gateway, physical switch, duplicate IP, upstream filtering

Packet capture should be short, scoped, and tied to an active test. Long, unscoped captures create noise and operational risk.
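
The first four rows of the interpretation table can be expressed as an ordered check over what each capture point saw. This is a sketch with hypothetical names. The four flags are filled in by hand after reading the two capture files, and the final return string is an addition for the case where no break is visible.

```python
def interpret_capture(vnic_tx: bool, uplink_tx: bool,
                      uplink_rx: bool, vnic_rx: bool) -> str:
    """Map what was seen at each capture point to the likely failure area.

    vnic_tx:   request seen leaving the VM vNIC
    uplink_tx: request seen leaving the uplink
    uplink_rx: reply seen arriving on the uplink
    vnic_rx:   reply seen arriving at the VM vNIC
    """
    if not vnic_tx:
        return "Guest OS, application, local firewall, vNIC disconnected"
    if not uplink_tx:
        return "vSwitch/vDS policy, port state, security filter, teaming path"
    if not uplink_rx:
        return "Physical switch, VLAN, gateway, firewall, routing"
    if not vnic_rx:
        return "Host switching, DFW/security filter, port state"
    return "No break seen at these capture points"


print(interpret_capture(True, True, False, False))
```

The ARP row is not modeled here. It follows the same pattern as the third check, with an ARP request in place of the IP packet.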

Command Reference

Task	Command / Tool	Where
Show Windows IP configuration	`ipconfig /all`	Guest OS
Show Windows routes	`route print`	Guest OS
Test Windows TCP port	`Test-NetConnection <ip> -Port <port>`	Guest OS
Show Linux IP configuration	`ip addr`	Guest OS
Show Linux routes	`ip route`	Guest OS
Test Linux TCP port	`nc -vz <ip> <port>`	Guest OS
Test gateway	`ping <gateway-ip>`	Guest OS
Trace routed path	`tracert` / `traceroute`	Guest OS
Show ESXi networking view	`esxtop`, then `n`	ESXi
List VM switchports	`net-stats -l`	ESXi
Show physical NIC stats	`esxcli network nic stats get -n vmnicX`	ESXi
Capture VM-side traffic	`pktcap-uw --switchport <id>`	ESXi
Capture uplink traffic	`pktcap-uw --uplink vmnicX`	ESXi
Test ESXi TCP connectivity	`nc -z <ip> <port>`	ESXi

Validation Steps

Do not close the incident after the first successful ping.

For vMotion-sensitive issues, validate on more than one host. A VM that works only on one ESXi host is not fixed; it is pinned to a working path.
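
The multi-host rule can be made explicit as a closing check. This is a minimal sketch, assuming the same connectivity test has been repeated after moving the VM to each host. The function name and the two-host minimum are illustrative choices.

```python
def validated_across_hosts(results_by_host: dict, minimum_hosts: int = 2) -> bool:
    """True only if the test passed on every host tried and on enough hosts."""
    if len(results_by_host) < minimum_hosts:
        return False
    return all(results_by_host.values())


print(validated_across_hosts({"esxi07": True}))                  # False
print(validated_across_hosts({"esxi07": True, "esxi08": True}))  # True
```

A single passing host returns `False` on purpose: that result describes containment, not resolution.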

Rollback and Fallback Guidance

Troubleshooting should not leave the environment in a more fragile state.

Before changing a network setting, capture:

Object changed:
Original value:
New value:
Reason:
Approver:
Validation test:
Rollback step:
Rollback owner:

Safe fallback options include:

Reconnect the VM to the previously working port group.
Move the VM back to the previously working ESXi host.
Revert a port group VLAN change.
Restore original uplink teaming order.
Remove temporary firewall allow rules.
Revert guest firewall test changes.
Remove temporary static routes.
Stop packet captures and clean up capture files.

Avoid fallback actions that hide the root cause. For example, pinning a VM to one host might restore service, but it should be documented as a containment action, not the final resolution.

Practical Troubleshooting Patterns

Conclusion

VM network troubleshooting works best when it is boring.

Start in the guest. Validate the vNIC. Confirm the port group. Prove the VLAN. Check the distributed switch and uplink path. Validate the physical switch. Separate gateway reachability from routing. Then test firewall and application boundaries with specific source, destination, protocol, and port evidence.

The operational mistake is jumping layers too quickly. The operational discipline is proving where the packet stops.

Broadcom KB 324542 provides the vendor-backed troubleshooting sequence. The runbook above turns that sequence into a practical ladder for vSphere and VCF operations: guest OS to vNIC, port group to VLAN, distributed switch to uplink, physical network to routing, and firewall policy to final application validation.
