Digital Thought Disruption

VM Network Troubleshooting from Guest OS to Uplink: A Layer by Layer VMware Runbook

Virtual machine network problems rarely arrive with a clean label.

The ticket usually says something like “the VM is unreachable,” “the application cannot connect,” “ping fails,” “internet access is down,” or “VMs on different hosts cannot talk.” The underlying cause might be inside the guest OS, on the VM’s virtual NIC, in the port group, on the VLAN trunk, in the distributed switch, on a bad uplink, at the physical switch, in routing, or at a firewall boundary.

That is why a useful VMware troubleshooting process needs to be layered.

Broadcom’s VMware KB 324542 frames VM network troubleshooting as a sequence of checks that should not be skipped, covering port group names, VM adapter connection state, guest OS networking, TCP/IP stack behavior, P2V hidden adapters, uplink isolation, VLAN configuration, jumbo frames, and packet capture. This article turns that KB into an operational ladder that an engineer can use during a real incident.

The goal is not to prove that the network, virtualization layer, firewall, or guest OS is “the problem.” The goal is to narrow the failure domain without creating a second outage.

Scenario

A virtual machine running on VMware vSphere has lost network connectivity.

The symptom may be isolated to one VM, several VMs on the same port group, VMs on one ESXi host, VMs after vMotion, VMs on different VLANs, or traffic to a specific destination. Broadcom’s KB lists common symptoms such as unreachable VMs, failed VM-to-VM communication across hosts, high latency, failed inbound or outbound traffic, unavailable internet access, and TCP/IP connection failures.

The runbook starts inside the guest OS and works outward to the physical and policy boundaries.

Why This Matters Operationally

The fastest way to waste time on a VM network issue is to start in the middle.

Changing a VLAN before checking the guest IP configuration can hide a simple OS issue. Rebuilding a port group before checking an uplink can create a broader outage. Blaming routing before testing the default gateway can pull the wrong team into the incident.

That matters in VCF and vSphere operations because VM networking crosses ownership boundaries. The same packet can touch the guest OS, vNIC, port group, vDS, host uplink, top-of-rack switch, default gateway, firewall, and routing domain before the application ever sees a response.

Symptoms and Risk

Use this runbook when you see symptoms like:

The operational risk is not just downtime. It is accidental blast radius.

Do not change VLANs, uplink teaming, LACP, distributed switch policies, firewall rules, or physical switch trunks until you have captured the current state and identified the smallest safe test.

Troubleshooting Ladder at a Glance

The diagram below is the troubleshooting path. The important thing to notice is that the checks move from the VM outward. Each layer should either prove connectivity, identify the break, or provide the evidence needed to hand off to the next owner.

This should be treated as a ladder, not a checklist of random ideas. If the VM cannot reach its default gateway, focus on Layer 2, VLAN, port group, uplink, and physical switch evidence first. If the VM can reach the gateway but cannot reach another subnet, Broadcom’s default-gateway troubleshooting guidance points toward Layer 3 routing rather than the local virtual switch path.

Prerequisites and Safety Checks

Before changing anything, collect the basics.

You need:

There is one important exception: if the unreachable VM is vCenter Server, be careful. Broadcom’s KB specifically calls out vCenter reachability as a scenario where opening a networking support case may be the best path, especially when vCenter networking is delivered through a vSphere Distributed Switch.

That warning exists for a reason. A vDS-backed vCenter outage can turn normal remediation into a control-plane recovery problem.

Runbook Stages

Stage 1: Define the Failure Domain

Start by proving the scope.

Ask four questions:

  1. Is this one VM or multiple VMs?
  2. Is it one port group or multiple port groups?
  3. Is it one ESXi host or every host in the cluster?
  4. Is the failure limited to one destination, one subnet, or all traffic?

This first step decides where the runbook branches.

A single VM problem usually starts with the guest OS, VM vNIC, or VM-specific policy. A port group-wide issue points toward VLAN, port group policy, or upstream trunking. A host-specific issue points toward that ESXi host’s uplinks, physical switch ports, or LACP/team configuration. A cross-subnet-only issue points toward routing or firewall policy.
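
The branching described above can be sketched as a small shell helper. This is only an illustration: the function name `start_layer` and the scope keywords are hypothetical labels chosen for this example, not part of any VMware tooling.

```shell
#!/bin/sh
# Hypothetical helper: map the observed failure scope to the layers
# worth checking first, following the branching described in Stage 1.
start_layer() {
  case "$1" in
    single-vm)    echo "guest OS, VM vNIC, VM-specific policy" ;;
    port-group)   echo "VLAN, port group policy, upstream trunking" ;;
    host)         echo "ESXi host uplinks, physical switch ports, LACP/team configuration" ;;
    cross-subnet) echo "routing, firewall policy" ;;
    *)            echo "unknown scope: $1" >&2; return 1 ;;
  esac
}
```

Calling `start_layer single-vm` prints the starting layers for a single-VM incident. Recording the chosen scope next to the failure record keeps the first branch decision visible to everyone on the incident.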

Document the failure in plain terms:

Source VM:      APP01
Source IP:      10.20.30.41
Source Host:    esxi07
Port Group:     PG-App-Prod
VLAN:           230
Destination:    10.20.30.1 default gateway
Result:         Ping fails from APP01, succeeds from APP02 on same port group
Scope:          Single VM

That simple record prevents the incident from drifting.

Stage 2: Check the Guest OS First

A VM can be perfectly connected to the right port group and still fail because the guest OS is misconfigured.

From inside the guest, verify:

For Windows:

ipconfig /all
route print
ping 127.0.0.1
ping <vm-ip>
ping <default-gateway-ip>
tracert <destination-ip>
Test-NetConnection <destination-ip> -Port <tcp-port>

For Linux:

ip addr
ip route
ping -c 4 127.0.0.1
ping -c 4 <vm-ip>
ping -c 4 <default-gateway-ip>
traceroute <destination-ip>
nc -vz <destination-ip> <tcp-port>

Interpret the results carefully.

If loopback fails, the problem is inside the OS TCP/IP stack. If the VM cannot ping its own IP, the guest stack or interface configuration is suspect. If the VM can ping itself but not the gateway, move outward to the vNIC, port group, VLAN, and uplink path. If the VM can ping the gateway but not a remote subnet, shift toward routing or firewall boundaries.
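
For a Linux guest, the interpretation above could be scripted roughly as follows. This is a minimal sketch that assumes the iputils `ping` (where `-W` is a reply timeout in seconds); the function names `probe` and `guest_ladder` are hypothetical. A failed ping can also mean ICMP is filtered, so treat the output as a pointer, not a verdict.

```shell
#!/bin/sh
# probe <ip>: one ICMP echo request, 2-second reply timeout (iputils ping).
probe() {
  ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# guest_ladder <vm-ip> <gateway-ip> <remote-ip>
# Walks the checks in order, prints the first layer that fails, and
# returns non-zero. Variables are global because POSIX sh has no 'local'.
guest_ladder() {
  vm_ip=$1
  gw_ip=$2
  remote_ip=$3
  probe 127.0.0.1    || { echo "loopback failed: check the guest TCP/IP stack"; return 1; }
  probe "$vm_ip"     || { echo "own IP failed: check guest stack or interface configuration"; return 1; }
  probe "$gw_ip"     || { echo "gateway failed: move outward to vNIC, port group, VLAN, uplink"; return 1; }
  probe "$remote_ip" || { echo "remote subnet failed: check routing or firewall boundaries"; return 1; }
  echo "ICMP path OK: continue with TCP and application checks"
}
```

With the addresses from the Stage 1 record, a run would look like `guest_ladder 10.20.30.41 10.20.30.1 <remote-ip>`.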

Broadcom’s KB explicitly includes guest OS networking and TCP/IP stack validation as part of the VM network troubleshooting sequence.

Stage 3: Verify the VM vNIC and Port Group Assignment

Next, confirm the virtual NIC exists, is connected, and is attached to the intended network.

In vSphere Client, check:

Broadcom’s KB starts the vSphere-side troubleshooting sequence by ensuring the VM’s port group exists on the vSwitch or vDS, is spelled correctly, and that the VM’s adapter is connected. It also notes that standard switches require VMkernel adapters to use their own port groups, so a VM should not be placed on a VMkernel port group.

This stage catches common mistakes:

Finding | Likely Cause | Action
Adapter disconnected | Manual change, automation issue, migration artifact | Reconnect only after confirming correct port group
Wrong port group | Template, clone, restore, or migration mistake | Move to correct port group
Port group missing on target host | Host not attached to vDS, standard switch inconsistency | Fix host/vDS membership or port group placement
Duplicate or stale guest NIC | P2V or OS-level hidden adapter | Clean up hidden adapter/IP conflict

If the VM was converted from physical to virtual, pay attention to hidden adapters. Broadcom’s KB calls out P2V hidden network adapters as a specific condition to check when troubleshooting VM networking.

Stage 4: Validate VLAN and Subnet Alignment

Many “VM network” incidents turn out to be VLAN consistency problems.

Confirm:

Broadcom’s VLAN configuration article describes three ESXi VLAN tagging methods: External Switch Tagging, Virtual Switch Tagging, and Virtual Guest Tagging. In EST, tagging is done on the physical switch and the ESXi port group VLAN ID is set to 0. In VST, tagging is done by the virtual switch and the ESXi uplinks connect to physical trunk ports with the appropriate VLAN configured on the port group. In VGT, tagging is done inside the guest OS and VLAN tags are preserved through the virtual switch.

Most enterprise VM port groups use VST. That means the usual check is:

VM subnet  -> expected VLAN
Port group -> same VLAN ID
ESXi uplink -> physical trunk
Switchport -> VLAN allowed on trunk
Gateway -> SVI/router for that VLAN reachable

Do not assume the VLAN is correct because the port group name looks right. Validate the actual VLAN ID.
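
For port groups on a standard vSwitch, the actual VLAN ID can be read on the host with `esxcli network vswitch standard portgroup list`. The sketch below parses that output from stdin; it assumes the default tabular layout (columns separated by at least two spaces, VLAN ID in the last column), and the function names `pg_vlan` and `check_pg_vlan` are hypothetical. Distributed port groups are not listed by this command, so their VLAN ID still has to be checked in vCenter.

```shell
#!/bin/sh
# pg_vlan <port-group-name>
# Reads the port group table on stdin and prints the last column (VLAN ID)
# of the row whose name matches exactly. Matching on "name + two spaces"
# tolerates port group names that contain single spaces.
pg_vlan() {
  awk -v pg="$1" 'index($0, pg "  ") == 1 { print $NF }'
}

# check_pg_vlan <port-group-name> <expected-vlan-id>
# Returns 0 on match, 1 on mismatch, 2 if the port group is not listed.
check_pg_vlan() {
  actual=$(pg_vlan "$1")
  if [ -z "$actual" ]; then
    echo "port group '$1' not found on this host's standard switches"
    return 2
  elif [ "$actual" = "$2" ]; then
    echo "OK: '$1' uses VLAN $actual"
  else
    echo "MISMATCH: '$1' uses VLAN $actual, expected $2"
    return 1
  fi
}
```

On an ESXi host this would be used as `esxcli network vswitch standard portgroup list | check_pg_vlan "<port-group-name>" <vlan-id>`.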

Stage 5: Check the vSwitch or Distributed Switch Path

Now move from the VM object to the switching layer.

For a standard vSwitch, confirm:

For a vSphere Distributed Switch, confirm:

This is where a lot of post-vMotion issues show up. The VM may land on a host where the distributed port group exists, but the physical uplink path does not actually carry the VLAN.

A clean test is to compare a working VM and a failing VM:

Comparison Point | Working VM | Failing VM
Same port group? | Yes/No | Yes/No
Same VLAN ID? | Yes/No | Yes/No
Same ESXi host? | Yes/No | Yes/No
Same active vmnic? | Yes/No | Yes/No
Same default gateway result? | Yes/No | Yes/No
Same firewall policy? | Yes/No | Yes/No

Broadcom’s default gateway troubleshooting guidance recommends comparing affected VMs against other VMs in the same port group/subnet, and using esxtop networking view when only some VMs have gateway connectivity issues.

Stage 6: Isolate the Uplink Path

If the problem appears host-specific or intermittent, check the uplink path.

On the ESXi host, use esxtop and press n for networking. Broadcom’s KB recommends using esxtop networking output to see which physical NIC a VM is using, then isolating physical switch ports one at a time to determine where connectivity is lost.

Useful ESXi checks:

esxtop
# Press n for networking view

net-stats -l

esxcli network nic list

esxcli network nic stats get -n vmnicX

Look for:

If the port group uses Route Based on Originating Virtual Port ID, a VM may consistently use one uplink until it moves or reconnects. If one uplink path is misconfigured, only a subset of VMs may fail. That symptom often looks random until you map VM traffic to the active pNIC.
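
Besides the interactive esxtop view, the VM-to-pNIC mapping can be pulled from `esxcli network vm port list`, which reports a "Team Uplink" value for each VM port. The helper below is a small sketch that extracts that value from the command's output on stdin; it assumes the usual "Key: value" line layout, and the function name `team_uplink` is hypothetical.

```shell
#!/bin/sh
# team_uplink: reads `esxcli network vm port list -w <world-id>` output on
# stdin and prints the "Team Uplink" value for each of the VM's ports.
team_uplink() {
  awk -F': *' '/Team Uplink:/ { print $2 }'
}

# Typical use on the ESXi host (not executed here):
#   esxcli network vm list                               # find the VM's World ID
#   esxcli network vm port list -w <world-id> | team_uplink
```

Running this for a working VM and a failing VM on the same host shows quickly whether the failures line up with one vmnic.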

If LACP or EtherChannel is in use, validate both sides. Broadcom’s VM network troubleshooting KB calls out port-channel techniques and recommends verifying that the physical switch ports are configured correctly for the channel.

Stage 7: Validate the Physical Switch Edge

At this stage, the virtualization team should have enough evidence to engage the network team with specifics.

Provide:

ESXi host:        esxi07
VM:               APP01
Port group:       PG-App-Prod
VLAN:             230
Active vmnic:     vmnic2
Switchport:       ToR-A Eth1/17
Test:             APP01 cannot ping 10.20.30.1 gateway
Working path:     APP02 on esxi08 via vmnic3 can ping gateway
Request:          Confirm switchport trunk allows VLAN 230 and MTU matches

Ask the network team to validate:

This is also the right stage to check jumbo frames. Broadcom’s KB notes that if VMs require MTU 9000 and the VM network is configured for jumbo frames, the physical switch ports must also be configured for jumbo frames.
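
A quick way to test an MTU end to end is a ping with the don't-fragment bit set and the largest payload that fits in one packet. The sketch below computes that payload size for IPv4; the function name `icmp_payload_for_mtu` is hypothetical, and the calculation assumes a 20-byte IPv4 header with no options.

```shell
#!/bin/sh
# icmp_payload_for_mtu <mtu>
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 20 - 8 ))
}

# Don't-fragment tests from the guest for an MTU 9000 path (payload 8972):
#   Linux (iputils): ping -M do -c 4 -s 8972 <gateway-ip>
#   Windows:         ping -f -l 8972 <gateway-ip>
```

If the large ping fails while a ping sized for MTU 1500 (payload 1472) succeeds, some hop in the path is not passing jumbo frames.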

Stage 8: Test Default Gateway, Routing, and Remote Subnets

Separate Layer 2 reachability from Layer 3 reachability.

Use this logic:

Can VM ping itself?
  No -> guest OS / TCP/IP stack

Can VM ping another VM on same subnet?
  No -> port group / VLAN / uplink / local firewall

Can VM ping default gateway?
  No -> VLAN / uplink / physical switch / gateway SVI

Can VM ping remote subnet?
  No -> routing / firewall / ACL / asymmetric path

Can VM ping remote host but TCP fails?
  Yes -> service listener / firewall / security policy / application path

Broadcom’s default gateway article states that if VMs on the same subnet and host cannot reach the gateway, check VLAN configuration on the port group and physical switch. It also states that if gateway connectivity succeeds but other subnets fail, the issue is likely routing/Layer 3 and the network team should investigate.

For TCP checks from ESXi or supporting hosts, nc is useful when you need to test whether a TCP port is reachable. Broadcom’s host network troubleshooting KB lists ping/vmkping, nc, openssl, tcpdump-uw, and esxcli network as ESXi troubleshooting tools, and notes that nc helps determine whether a TCP port is online or possibly blocked by a firewall.

Example:

nc -z <destination-ip> <tcp-port>

For guest-level testing, use tools appropriate to the OS:

Test-NetConnection <destination-ip> -Port 443
nc -vz <destination-ip> 443

A successful ping does not prove the application path is open. It only proves ICMP reachability.
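
The difference between ICMP and TCP reachability can be made explicit by running both probes and reading the combination. The following is a rough sketch for a Linux guest, assuming iputils `ping` and an `nc` that supports `-z` and `-w`; the function names are hypothetical.

```shell
#!/bin/sh
# Probes live in small functions so they can be swapped for whatever
# tools are available on a given system.
icmp_probe() { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }
tcp_probe()  { nc -z -w 3 "$1" "$2" >/dev/null 2>&1; }

# icmp_vs_tcp <destination-ip> <tcp-port>
# Runs both probes and prints what the combination suggests.
icmp_vs_tcp() {
  if icmp_probe "$1"; then icmp=ok; else icmp=fail; fi
  if tcp_probe "$1" "$2"; then tcp=ok; else tcp=fail; fi
  case "$icmp/$tcp" in
    ok/ok)     echo "ICMP and TCP $2 reachable: look at the application itself" ;;
    ok/fail)   echo "host reachable, TCP $2 not: check service listener, firewall, security policy" ;;
    fail/ok)   echo "TCP $2 reachable, ICMP not: ICMP is probably filtered on the path" ;;
    fail/fail) echo "neither ICMP nor TCP $2: go back to gateway, routing, and firewall checks" ;;
  esac
}
```

For example, `icmp_vs_tcp <destination-ip> 443` separates "host is down" from "port is blocked" in one step.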

Stage 9: Check Firewall and Security Policy Boundaries

Firewall troubleshooting belongs near the end of the ladder, but it should not be ignored.

There may be multiple enforcement points:

Boundary | What to Check
Guest OS firewall | Windows Defender Firewall, Linux firewalld/iptables/nftables
NSX/vDefend Distributed Firewall | Rule match, applied-to scope, rule order, realization, exclusion list
Upstream firewall | Source/destination zones, service object, NAT, route symmetry
Physical ACL | SVI ACL, switchport ACL, routed interface ACL
Application listener | Service bound to correct IP and port

For NSX/vDefend DFW, Broadcom’s DFW troubleshooting guidance recommends checking rule source, destination, services, profiles, actions, applied-to scope, rule order, whether the rule is enabled, Traceflow, packet logs, and realized rules on ESXi hosts.

Do not “test” a firewall theory by broadly disabling security controls in production.

Safer tests include:

If adding the VM to an exclusion list appears to remediate the problem, treat that as a diagnostic result, not the final fix. Broadcom’s DFW troubleshooting article includes the exclusion list as one troubleshooting step, but the durable fix should be a corrected policy, group membership, service definition, or rule order.

Stage 10: Use Packet Capture When the Evidence Is Still Ambiguous

Packet captures are the escalation tool that turns “it should work” into evidence.

Use them when:

Broadcom documents pktcap-uw as an ESXi packet capture tool included in ESXi 5.5 and later, capable of capturing traffic at multiple points in the hypervisor. The same Broadcom article warns not to store packet captures in /tmp; use an appropriate datastore path instead.

A practical pattern is to capture near the VM and near the uplink at the same time.

First identify the VM’s switchport and active uplink:

net-stats -l
esxtop
# Press n for networking view

Then capture at the VM vNIC side and uplink side:

mkdir /vmfs/volumes/<datastore>/Packet_Captures

pktcap-uw --switchport <switchport-id> \
  --capture VnicTx,VnicRx \
  -s 256 \
  --ip <gateway-or-destination-ip> \
  -o /vmfs/volumes/<datastore>/Packet_Captures/<host>.<vm>.switchport.pcapng &

pktcap-uw --uplink vmnicX \
  --capture UplinkSndKernel,UplinkRcvKernel \
  -s 256 \
  --ip <gateway-or-destination-ip> \
  -o /vmfs/volumes/<datastore>/Packet_Captures/<host>.vmnicX.uplink.pcapng &

Stop captures cleanly:

kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)

Broadcom’s pktcap-uw guidance describes --switchport as the capture point closest to the VM vNIC and --uplink as the capture point closest to the physical infrastructure.

Interpretation is straightforward:

Capture Result | Likely Meaning
Packet leaves VM vNIC but not uplink | vSwitch/vDS policy, port state, security filter, teaming path
Packet leaves uplink but no reply returns | Physical switch, VLAN, gateway, firewall, routing
Request and reply seen on uplink but not VM vNIC | Host switching, DFW/security filter, port state
Nothing leaves VM vNIC | Guest OS, application, local firewall, vNIC disconnected
ARP request leaves but no ARP reply | VLAN, gateway, physical switch, duplicate IP, upstream filtering
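
The first four rows of the table can be expressed as a simple decision helper, which is handy when several people read captures during an incident. This is a sketch only: the function name `interpret_capture` and its yes/no arguments are hypothetical, and the final branch (everything seen at both points) is an inference not stated in the table.

```shell
#!/bin/sh
# interpret_capture <req_at_vnic> <req_at_uplink> <reply_at_uplink> <reply_at_vnic>
# Each argument is "yes" or "no": whether the request or the reply was seen
# at that capture point. Prints the likely meaning for that combination.
interpret_capture() {
  if   [ "$1" != yes ]; then echo "guest OS, application, local firewall, vNIC disconnected"
  elif [ "$2" != yes ]; then echo "vSwitch/vDS policy, port state, security filter, teaming path"
  elif [ "$3" != yes ]; then echo "physical switch, VLAN, gateway, firewall, routing"
  elif [ "$4" != yes ]; then echo "host switching, DFW/security filter, port state"
  else echo "hypervisor path looks clean: recheck the guest and application"
  fi
}
```

For example, `interpret_capture yes yes no no` points at the physical network, because the request left the uplink and no reply came back.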

Packet capture should be short, scoped, and tied to an active test. Long, unfiltered captures create noise and operational risk.

Command Reference

Task | Command / Tool | Where
Show Windows IP configuration | ipconfig /all | Guest OS
Show Windows routes | route print | Guest OS
Test Windows TCP port | Test-NetConnection <ip> -Port <port> | Guest OS
Show Linux IP configuration | ip addr | Guest OS
Show Linux routes | ip route | Guest OS
Test Linux TCP port | nc -vz <ip> <port> | Guest OS
Test gateway | ping <gateway-ip> | Guest OS
Trace routed path | tracert / traceroute | Guest OS
Show ESXi networking view | esxtop, then n | ESXi
List VM switchports | net-stats -l | ESXi
Show physical NIC stats | esxcli network nic stats get -n vmnicX | ESXi
Capture VM-side traffic | pktcap-uw --switchport <id> | ESXi
Capture uplink traffic | pktcap-uw --uplink vmnicX | ESXi
Test ESXi TCP connectivity | nc -z <ip> <port> | ESXi

Validation Steps

Do not close the incident after the first successful ping.

For vMotion-sensitive issues, validate on more than one host. A VM that works only on one ESXi host is not fixed; it is pinned to a working path.

Rollback and Fallback Guidance

Troubleshooting should not leave the environment in a more fragile state.

Before changing a network setting, capture:

Object changed:
Original value:
New value:
Reason:
Approver:
Validation test:
Rollback step:
Rollback owner:

Safe fallback options include:

Avoid fallback actions that hide the root cause. For example, pinning a VM to one host might restore service, but it should be documented as a containment action, not the final resolution.

Practical Troubleshooting Patterns

Conclusion

VM network troubleshooting works best when it is boring.

Start in the guest. Validate the vNIC. Confirm the port group. Prove the VLAN. Check the distributed switch and uplink path. Validate the physical switch. Separate gateway reachability from routing. Then test firewall and application boundaries with specific source, destination, protocol, and port evidence.

The operational mistake is jumping layers too quickly. The operational discipline is proving where the packet stops.

Broadcom KB 324542 provides the vendor-backed troubleshooting sequence. The runbook above turns that sequence into a practical ladder for vSphere and VCF operations: guest OS to vNIC, port group to VLAN, distributed switch to uplink, physical network to routing, and firewall policy to final application validation.

External Sources
