ESXi PSOD Triage: Turning a Purple Screen into an Evidence-Driven Escalation

Paul Bryant

3 weeks ago

A purple screen on an ESXi host creates an immediate operational problem, but the bigger risk is what happens next.

The first reaction is usually to get the host back online. That is understandable, especially when workloads are down, HA is recovering virtual machines, or a cluster is running hot after losing capacity. But if the host is power-cycled too quickly, patched based on a partial error string, or returned to production without preserving evidence, the team may lose the only data that can explain what actually happened.

Broadcom KB 316522 is a useful example. The signature mentions NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307, and the KB ties a specific pattern to HPE Gen10, Gen10 Plus, and Gen11 platforms, the HPE iLO native driver, and ESXi 7.x / 8.x remediation guidance. But the important lesson is broader than one crash signature: a PSOD should be treated as an evidence workflow before it becomes a remediation workflow.

Broadcom’s KB identifies the affected hardware families, the vmkernel.log heap alert, and an example NOT_IMPLEMENTED backtrace, while also noting that the example values can vary by environment.

This runbook focuses on how to triage an ESXi PSOD in a VCF or vSphere environment: capture the screen, preserve the core dump, collect support bundles, correlate driver and firmware state, and decide whether the next move is rollback, targeted remediation, hardware isolation, or vendor escalation.

Scenario

You have an ESXi host that has stopped responding. vCenter shows the host as disconnected or not responding. Some virtual machines may have restarted through HA, some may still be unavailable, and the out-of-band console shows a purple diagnostic screen.

The visible error may include a recognizable string such as:

NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307

or a similar ASSERT, Exception, Spin count exceeded, Machine Check Exception, or device-driver-related failure.

The goal is not to diagnose the entire VMkernel from the console. The goal is to preserve enough evidence that the next decision is defensible.

Why This Matters Operationally

A PSOD is not a normal service restart. Broadcom describes purple screen errors as severe hardware or software errors that halt the server and prevent it from continuing. Common symptoms include the host being listed as not responding in vCenter, VMs becoming unresponsive, loss of ICMP or SSH access, and the host displaying a purple diagnostic screen on the console.

That distinction matters because the reboot is only one part of recovery. The incident also creates several operational questions:

Question	Why it matters
Did the host successfully write a core dump?	Without it, support may only have logs and a screenshot.
Is this a one-off hardware event or a repeatable software signature?	The remediation path is different.
Did the crash follow a driver, firmware, ESXi, or vendor add-on change?	Recent lifecycle activity becomes evidence.
Is the issue isolated to one host model, one cluster, or one image baseline?	Scope determines whether to isolate, roll forward, roll back, or escalate.
Can the host safely return to production?	Returning a repeatedly crashing host can create secondary outages.

A PSOD is a failure event, but it is also a diagnostic opportunity. Treat it that way.

Symptoms and Risk Signals

During the first pass, do not over-index on a single line of the purple screen. The line number in a signature can vary between ESXi builds, patches, and compiled code paths. Instead, capture a small set of repeatable evidence points.

Look for:

Evidence	Example	Why it matters
ESXi version and build	ESXi 7.x / 8.x build number	Determines known issue applicability and patch path.
Panic or exception string	`NOT_IMPLEMENTED`, `#PF Exception`, `MCE`, `ASSERT`	Helps route the issue class.
Stack trace pattern	`World_DestroyHeap`, `CpuSched_StartWorld`, driver modules	Helps compare recurrence across hosts.
Physical CPU / world	Same CPU or same world across crashes	Can suggest hardware affinity or userworld/module pattern.
VMK uptime	Host uptime before crash	Useful for recurrence timing.
Core dump status	Disk dump successful or failed	Determines whether deep analysis is possible.
Recent lifecycle changes	ESXi patch, firmware, driver, vendor add-on	Often the difference between rollback and escalation.

Broadcom’s purple screen interpretation guidance explains that the stack trace represents what the VMkernel was doing at the time of the error and that the core dump section indicates VMkernel memory being copied to the configured dump location. It also recommends using repeated patterns across error message, stack trace, physical CPU, and world to distinguish likely software patterns from possible hardware patterns.
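
One way to make that pattern comparison repeatable is to keep a small crash ledger and tally it. The sketch below assumes a hand-maintained CSV with one line per crash in the form host,panic_string,pcpu. That format and the psod_tally name are my own illustration, not something ESXi produces.

```shell
# Hypothetical input: one line per crash, "host,panic_string,pcpu",
# transcribed by hand from each PSOD screenshot (no commas inside fields).
# Prints, per host: crash count, distinct panic strings, distinct pCPUs.
psod_tally() {
  awk -F',' '
    NF >= 3 {
      crashes[$1]++
      if (!(($1, $2) in seen_panic)) { seen_panic[$1, $2] = 1; panics[$1]++ }
      if (!(($1, $3) in seen_cpu))   { seen_cpu[$1, $3] = 1;   cpus[$1]++ }
    }
    END {
      for (h in crashes)
        printf "%s crashes=%d distinct_panics=%d distinct_pcpus=%d\n",
          h, crashes[h], panics[h], cpus[h]
    }
  ' "$@" | sort
}

# Example: psod_tally crash-ledger.csv
```

A host with repeated crashes, several distinct panic strings, and a single pCPU leans toward a hardware conversation. The same panic string across many hosts leans toward software. Neither is proof; it only tells you where to look next.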

PSOD Triage Workflow

The workflow in this runbook is the operational path I would want a team to follow before making a rollback or patching decision. Notice that remediation does not happen first. Evidence capture does.

The key point is sequencing. A PSOD response should move from evidence preservation to controlled recovery to correlation to decision. Skipping straight to remediation creates a fragile root-cause story.

Prerequisites and Safety Checks

Before touching the host, confirm who owns each part of the response:

Area	Owner	Safety check
Workload recovery	vSphere / application operations	Confirm HA restart state, failed VMs, and application impact.
Host access	Platform operations	Confirm iLO/iDRAC/OOB console access and SSH policy.
Evidence handling	Platform operations / security	Confirm whether core dumps and support bundles can be shared externally.
Lifecycle data	VMware / hardware platform team	Confirm ESXi image, vendor add-on, driver, firmware, and BIOS state.
Escalation	VMware/Broadcom and hardware vendor support	Confirm SR ownership and upload path.

Core dumps and support bundles deserve special handling. Broadcom notes that host support bundles can include host logs, VM descriptions, system state, and core dumps. Core dumps can include data from memory at the time of failure, and transmitting a support bundle grants VMware permission to examine the included data. Environments using vSphere Virtual Machine Encryption can also affect core dump handling and access.

That does not mean “do not collect evidence.” It means evidence collection should follow your security policy.

Runbook Stage 1: Capture the PSOD Before Reboot

When the purple screen is still visible, capture it.

Do not reset the host immediately. Broadcom explicitly warns not to reset an ESX/ESXi host while the purple screen is displayed and recommends taking a picture or screenshot that captures all visible technical data. The same guidance says to verify whether “Disk Dump Successful” appears and to allow more time if the dump has not completed; in some cases, dump completion may take up to an hour.

Capture:

Item	How
Full console screenshot	OOB console screenshot or phone photo if necessary
Hostname and asset tag	vCenter inventory, hardware management console, CMDB
Time of failure	Include timezone and whether this is host, vCenter, or monitoring time
ESXi build	From console if visible, otherwise collect after reboot
Panic string	Exact first-line message and any file/line reference
Stack trace	Full visible backtrace, not just the first line
Dump status	Whether the screen reports dump progress, success, or failure

A partial screenshot of only the first error line is not enough. The stack trace, CPU/world information, and dump status are part of the diagnostic record.

Runbook Stage 2: Reboot Without Destroying the Investigation

After the dump completes, reboot the host through the cleanest available method. If the host is fully halted, the out-of-band power control may be the only practical option.

After boot:

Do not immediately return the host to normal workload placement.
Keep the host in maintenance mode or otherwise prevent automated workload return if recurrence risk is unknown.
Confirm whether vCenter reports an unread host kernel core dump.
Collect logs and support bundles before applying patches, removing drivers, or changing firmware.

During the next boot, ESXi can process the configured core dump slots and write out a core dump file from the PSOD, which can then be reviewed for corrective action and root-cause work.

Runbook Stage 3: Collect the Support Bundle

Broadcom’s vm-support guidance states that VMware Technical Support routinely requests diagnostic information for support requests and that the vm-support utility is present on all ESXi versions, though available options vary by release. The traditional command creates a compressed .tgz bundle locally on the host, and -w can write it to a specific VMFS datastore.

Use the datastore method when the host has enough accessible storage and your security policy allows it:

# Create a support bundle on a VMFS datastore
vm-support -w /vmfs/volumes/DATASTORE_NAME

For environments where saving locally is not preferred, Broadcom documents streaming vm-support over SSH to a client system:

# Stream vm-support to a local file from a management workstation
ssh root@ESXHostnameOrIPAddress vm-support -s > vm-support-ESXHostname.tgz

This method requires root authentication and cannot be used while the host is in lockdown mode.

Collect the vCenter support bundle as well if the incident involved HA behavior, host disconnect events, lifecycle remediation, DRS activity, or cluster-level alarms.
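
Once a bundle leaves the host, it helps to be able to show it has not changed between collection and upload. The following is a minimal sketch, assuming a system where sha256sum is available. The function names are mine, and this step is not part of Broadcom's documented procedure.

```shell
# Record a SHA-256 checksum next to a collected bundle, then verify it later.
# Assumes sha256sum is available on the system holding the bundle.
bundle_seal() {
  # $1 = path to the bundle; writes <bundle>.sha256 alongside it
  [ -f "$1" ] || { echo "missing: $1" >&2; return 1; }
  ( cd "$(dirname "$1")" &&
    sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

bundle_verify() {
  # Returns 0 only if the bundle still matches its recorded checksum
  ( cd "$(dirname "$1")" &&
    sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}

# Example:
#   bundle_seal   vm-support-ESXHostname.tgz
#   bundle_verify vm-support-ESXHostname.tgz && echo "bundle unchanged"
```

Store the .sha256 file with the incident record so the same check can be repeated before the bundle is attached to a support request.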

Runbook Stage 4: Preserve and Verify the Core Dump

Do not assume the dump exists just because the host crashed.

Check the configured dump targets:

# Check VMFS coredump files
esxcli system coredump file list

# Check coredump partition configuration
esxcli system coredump partition list

# Check network coredump configuration
esxcli system coredump network get

The ESXCLI command reference includes commands to create, list, set, and remove VMkernel dump files; it also includes commands to check file, partition, and network dump configuration.
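
If you want that check to be scriptable across many hosts, esxcli can emit CSV through its --formatter option. The sketch below reads that CSV and succeeds only when at least one dump file is active. It assumes the CSV header uses the same column name as the table output (Active), so confirm that on your build before relying on it.

```shell
# Reads CSV on stdin, assumed to come from:
#   esxcli --formatter=csv system coredump file list
# Succeeds only if some row has Active=true. The column is located by
# header name, so column order does not matter.
coredump_file_active() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Active") col = i; next }
    col && $col == "true" { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Example:
#   esxcli --formatter=csv system coredump file list | coredump_file_active \
#     || echo "no active coredump file on this host"
```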

If the host uses a diagnostic partition, Broadcom documents extracting a VMkernel core dump by identifying the diagnostic partition with esxcli system coredump partition list or esxcfg-dumppart -t, changing to a datastore with enough space, and using esxcfg-dumppart --copy to produce a zdump file.

# Identify diagnostic partition
esxcli system coredump partition list

# Example extraction pattern after identifying the device path
cd /vmfs/volumes/DatastoreName/

esxcfg-dumppart --copy \
  --devname "/vmfs/devices/disks/identifier" \
  --zdumpname /vmfs/volumes/DatastoreName/hostname-date-zdump

If no coredump target exists, fix that as a preventive control after the incident. Broadcom’s coredump-to-file guidance notes the warning “No coredump target has been configured. Host core dumps cannot be saved,” and documents creating a VMFS dump file with esxcli system coredump file add, then enabling it with esxcli system coredump file set --smart --enable true. It also notes that Software iSCSI and Software FCoE are not supported for coredump locations.

# Create a VMFS coredump file
esxcli system coredump file add -d <datastore_UUID> -f <hostname>.dumpfile

# Enable smart selection for the dump file
esxcli system coredump file set --smart --enable true

# Verify Active and Configured are true
esxcli system coredump file list

For larger environments, configure network dump collection as a standard build item. Broadcom states that ESXi network coredump functionality helps capture diagnostic data through the network during a purple diagnostic screen, and documents configuring it with a VMkernel interface, destination server IP, and UDP port, then validating with esxcli system coredump network get and vmkping.

# Configure network coredump collector
esxcli system coredump network set \
  --interface-name vmk0 \
  --server-ipv4 <collector-or-vcenter-ip> \
  --server-port 6500

# Enable network coredump
esxcli system coredump network set --enable true

# Verify configuration
esxcli system coredump network get

# Confirm VMkernel network path
vmkping -I vmk0 <collector-or-vcenter-ip>
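
For a fleet audit, the output of the verification command can be reduced to one line per host. This is a sketch under the assumption that esxcli system coredump network get prints "Key: value" lines with the keys Enabled, Host VNic, Network Server IP, and Network Server Port. Check the key names against your own hosts, since I am working from typical output rather than a specification.

```shell
# Reads "Key: value" lines on stdin (assumed output of
# `esxcli system coredump network get`), prints a one-line summary, and
# succeeds only if network dump is enabled and a server IP is set.
netdump_summary() {
  awk '
    {
      line = $0
      sub(/^[ \t]+/, "", line); sub(/[ \t\r]+$/, "", line)
      p = index(line, ": ")
      if (p == 0) next
      kv[substr(line, 1, p - 1)] = substr(line, p + 2)
    }
    END {
      printf "enabled=%s vnic=%s server=%s port=%s\n",
        kv["Enabled"], kv["Host VNic"],
        kv["Network Server IP"], kv["Network Server Port"]
      exit((kv["Enabled"] == "true" && kv["Network Server IP"] != "") ? 0 : 1)
    }
  '
}

# Example:
#   esxcli system coredump network get | netdump_summary
```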

Runbook Stage 5: Build the Evidence Matrix

Once the host is booted and evidence is preserved, build a simple matrix. This gives support, hardware vendors, and internal change approvers the same view of the event.

Evidence	Command or source	Notes
ESXi version and build	`vmware -vl`	Match against KBs and release notes.
Installed VIBs/components	`esxcli software vib list`	Look for hardware vendor drivers and async drivers.
Loaded modules	`esxcli system module list`	Useful when a stack trace references a module or device path.
Coredump config	`esxcli system coredump file list` / `partition list` / `network get`	Confirms whether future crashes will be captured.
VMkernel logs	`/var/log/vmkernel.log`	Search for panic, heap, driver, storage, network, MCE, or NMI messages.
Hardware model	`esxcli hardware platform get`	Required for vendor advisories and compatibility checks.
Firmware / BIOS / iLO	Vendor tooling, OneView, iLO, iDRAC, OME, vLCM/HSM	Needed for hardware correlation.
Recent changes	vLCM, SDDC Manager, change record	Determines rollback versus roll-forward options.

Useful first-pass commands:

# Version and build
vmware -vl

# Hardware platform
esxcli hardware platform get

# Coredump targets
esxcli system coredump file list
esxcli system coredump partition list
esxcli system coredump network get

# Installed packages / drivers
esxcli software vib list | grep -Ei "hpe|ilo|ams|smad|bnxt|lpfc|nfnic|nenic|qfle|nvme|scsi|fc|nic"

# Loaded modules
esxcli system module list | grep -Ei "hpe|ilo|bnxt|lpfc|nfnic|nenic|qfle|nvme|scsi|fc|nic"

# Search vmkernel log for crash-adjacent signals
grep -Ei "NOT_IMPLEMENTED|ASSERT|Exception|MCE|NMI|world.c|heap|panic|backtrace|coredump" /var/log/vmkernel.log

Treat this as a triage set, not a final RCA. The goal is to avoid empty escalation: “Host crashed, please advise.”
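
The first-pass commands above are easier to hand over when each one lands in its own file with a timestamp and exit code. Here is a minimal sketch of a wrapper that does that. The collect name and the output layout are my own choices, and the evidence directory path in the example is a placeholder.

```shell
# Run one command and keep its output and exit code as evidence.
# Usage: collect <outdir> <label> <command> [args...]
# A failing command is recorded, not fatal, so the rest of the set still runs.
collect() {
  outdir=$1; label=$2; shift 2
  mkdir -p "$outdir" || return 1
  {
    echo "# command: $*"
    echo "# collected_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    rc=0
    "$@" 2>&1 || rc=$?
    echo "# exit_code: $rc"
  } > "$outdir/$label.txt"
}

# Example (placeholder path):
#   EV=/vmfs/volumes/DATASTORE_NAME/psod-evidence
#   collect "$EV" version  vmware -vl
#   collect "$EV" platform esxcli hardware platform get
#   collect "$EV" vibs     esxcli software vib list
#   collect "$EV" modules  esxcli system module list
```

Recording the exit code matters more than it looks. An empty file from a command that failed reads very differently from an empty file from a command that found nothing.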

Runbook Stage 6: Compare the Signature Without Anchoring on It

This is where KB 316522 becomes useful.

Broadcom’s KB identifies a specific issue where ESXi hosts on HPE Gen10, Gen10 Plus, or Gen11 hardware can experience a PSOD. The KB lists a vmkernel.log alert similar to Unable to complete wait for non-empty heap, and an example backtrace containing NOT_IMPLEMENTED and World_DestroyHeap.

The KB’s stated cause is specific: when a kernel module exposing a character device does not behave as expected, a vmkpollcontext object can leak after a userspace poll() syscall; later, when the userspace process terminates, the VMkernel can PSOD with a NOT_IMPLEMENTED assert. The KB also says the HPE ilo kernel module used by HPE SMAD is known to cause this issue.

For remediation, Broadcom states:

Environment	KB 316522 remediation guidance
ESXi 7.0 or later	Update the HPE iLO Native Driver component to v10.8.2 or later.
ESXi 8.0 or later	Update the HPE iLO Native Driver component to v10.8.2 or later and update ESXi to 8.0 Update 2b or later.

The operational caution is this: do not assume every NOT_IMPLEMENTED purple screen is KB 316522. Match the platform, ESXi version, vendor module state, log alert, stack trace shape, and recent lifecycle history. A signature is evidence. It is not the entire case.
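
To keep that comparison honest, count the individual signals instead of eyeballing one string. The sketch below reports how often each of the three strings quoted from the KB appears in a log file or in a text transcription of the screenshot. The function name is hypothetical, and a full set of matches is still only one input alongside platform, driver, and lifecycle history.

```shell
# Triage hint only: reports how many lines contain each KB 316522 signal.
# The three strings are the ones quoted from the KB in this article.
# $1 = vmkernel.log (or a text transcription of the PSOD screen)
kb316522_signals() {
  log=$1
  [ -r "$log" ] || { echo "cannot read: $log" >&2; return 1; }
  for pattern in \
    "Unable to complete wait for non-empty heap" \
    "NOT_IMPLEMENTED" \
    "World_DestroyHeap"
  do
    # grep -c exits non-zero on zero matches but still prints 0
    count=$(grep -c -F -- "$pattern" "$log" || true)
    echo "$pattern: $count"
  done
}

# Example:
#   kb316522_signals /var/log/vmkernel.log
```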

Runbook Stage 7: Correlate Driver, Firmware, Build, and Vendor Image

A PSOD investigation usually becomes a lifecycle investigation.

For HPE environments, confirm whether the host is running a supported HPE custom ESXi image, a vendor add-on, or a manually assembled image. HPE’s VMware ESXi support page states that HPE servers require the HPE custom ESXi image or an ESXi image built with ImageBuilder that includes appropriate drivers for the boot controller and at least one network device. It also notes that drivers for newer network and storage controllers are integrated in the HPE custom ESXi image and are not part of VMware’s base ESXi image.

For clusters managed by vSphere Lifecycle Manager, use the image, vendor add-on, firmware and drivers add-on, and hardware support manager data as part of the evidence trail. VMware’s Cloud Foundation blog notes that firmware, driver, and BIOS/EFI versions can be inspected and monitored for compliance with the Broadcom Compatibility Guide and vSAN Compatibility Guide, and that vSphere Lifecycle Manager interfaces with a registered Hardware Support Manager to orchestrate firmware lifecycle operations.

Capture:

Layer	Evidence to collect
ESXi base image	Version, build, patch level
Vendor add-on	HPE, Dell, Lenovo, Cisco, or other vendor package version
Device drivers	NIC, storage, NVMe, FC, iLO/iDRAC/platform agents
Firmware	BIOS/UEFI, BMC/iLO/iDRAC, NIC, HBA, RAID, disk firmware
Management agents	AMS, SMAD, CIM providers, vendor tools
Cluster lifecycle state	Desired image, compliance drift, recent remediation tasks

The strongest escalation packet includes both the crash evidence and the lifecycle state. The support engineer should not have to ask which driver was installed, which firmware was active, or whether the host was recently remediated.

Runbook Stage 8: Decide Rollback, Roll Forward, or Escalate

The wrong move is to pick one answer for every PSOD. Use the evidence pattern.

Condition	Preferred action	Why
Known KB match, supported fix exists, and issue matches platform/build/driver pattern	Roll forward to the documented driver/ESXi fix during a controlled maintenance window	You have a supported remediation path.
Crash started immediately after a driver, firmware, or ESXi update and repeats on the same image	Consider rollback to the last known-good validated image while preserving evidence and opening support	The change is temporally tied to the incident.
Same host repeatedly crashes with different stack traces or same physical CPU indicators	Isolate host and engage hardware vendor diagnostics	Pattern may indicate hardware or platform fault.
Multiple hosts on the same model/image show the same signature	Treat as cluster image or vendor component issue; stop broad remediation until scoped	Prevents spreading a bad image or unsupported combination.
No core dump, no full screenshot, and no repeatable pattern	Fix evidence capture first, then monitor or escalate with limited confidence	RCA will be weak without dump and logs.
Production cluster is capacity constrained after host loss	Keep stability first; defer nonessential remediation until workload capacity is safe	Avoids creating a second outage during investigation.

A rollback should not be emotional. It should be tied to a recent known change, a repeatable failure pattern, and an approved fallback image. A roll-forward should be tied to a vendor-documented fix, compatibility validation, and staged host remediation. Escalation should include enough artifacts for support to analyze the issue instead of recreating your evidence collection process.

Targeted Remediation Example: KB 316522 Pattern

When the evidence matches KB 316522, the remediation path should still be staged.

Recommended sequence:

Confirm affected hardware model: HPE Gen10, Gen10 Plus, or Gen11.
Confirm ESXi major version and build.
Confirm installed HPE iLO Native Driver component version.
Confirm whether the vmkernel.log heap alert and stack trace pattern match the KB.
Confirm whether HPE SMAD / AMS / iLO-related components are present.
Confirm the target driver and ESXi build are supported for the server model.
Remediate one host first in a maintenance window.
Validate stability before expanding to the cluster.
Document the final image state in vLCM / SDDC Manager / change records.

For ESXi 8.x hosts matching this KB, Broadcom’s resolution calls for both the HPE iLO Native Driver component v10.8.2 or later and ESXi 8.0 Update 2b or later.

That “and” matters. Updating only one layer may leave the environment in a partially remediated state.
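
Checking "v10.8.2 or later" by eye is error-prone once versions reach two digits per field, because 10.10.0 sorts before 10.8.2 as text. A small numeric comparison avoids that. This sketch compares plain dotted versions only. How the component version appears inside the installed VIB's version string is not covered here, so extract the dotted part yourself before comparing.

```shell
# Returns 0 if dotted numeric version $1 is >= $2.
# Fields are compared as numbers; missing fields count as 0,
# so 10.9 >= 10.8.2 but 10.8 < 10.8.2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "[.]"); m = split(b, y, "[.]")
    len = (n > m) ? n : m
    for (i = 1; i <= len; i++) {
      xi = (i <= n) ? x[i] + 0 : 0
      yi = (i <= m) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Example (installed version supplied by the operator):
#   version_ge "$installed_ilo_driver" 10.8.2 || echo "iLO driver below KB minimum"
```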

Validation Steps After Recovery

After the host is back online and before it returns to full production placement, validate the following:

Validation	Pass condition
Host boots cleanly	No immediate PSOD or management agent failure.
vCenter connectivity restored	Host reconnects without repeated disconnects.
Core dump target configured	File, partition, or network dump target is active and configured.
Support bundle collected	Bundle is stored securely and associated with the incident/SR.
Driver and firmware state captured	Evidence matrix includes current and previous versions.
Cluster health stable	HA, DRS, vSAN, NSX, and workload alarms reviewed as applicable.
Lifecycle compliance known	Host is compliant with intended image or intentionally held back.
Recurrence monitoring active	Logs and monitoring are watching for repeated stack or heap alerts.

For VCF environments, also confirm whether SDDC Manager, vCenter, NSX, vSAN, and lifecycle tasks recorded relevant events around the incident window. A host PSOD may be local, but the recovery story is cluster-wide.

Rollback and Fallback Guidance

Rollback is appropriate when the evidence points to a recent change and a known-good target exists. It is not appropriate when the team is guessing.

Before rollback, confirm:

The previous ESXi image, vendor add-on, driver, and firmware combination is documented.
The previous state is still supported by the hardware vendor and VMware/Broadcom.
The rollback process has been tested or is operationally understood.
Workloads can tolerate the maintenance sequence.
Evidence from the failure state has already been collected.

Fallback options include:

Fallback	Use when
Keep host in maintenance mode	Recurrence risk is unknown or evidence points to hardware.
Evacuate and isolate host	Cluster has enough capacity and host stability is suspect.
Revert to previous image	Recent lifecycle change is strongly correlated and rollback is supported.
Apply vendor-documented fix	KB match is strong and remediation is validated.
Open Broadcom and hardware vendor cases	Core dump analysis or hardware diagnosis is required.

Do not remove vendor agents, disable platform modules, or downgrade drivers as an unsupported workaround unless directed by the vendor or support. Those changes may reduce observability, create supportability issues, or make later analysis harder.

What to Hand to Support

A good escalation packet should include:

Artifact	Notes
Full PSOD screenshot	Include the entire visible stack, not just the first line.
`vm-support` bundle	Collected before remediation where possible.
Core dump / zdump	Preserve securely; follow data handling policy.
ESXi version/build	`vmware -vl` output.
Installed VIB/component list	Include vendor drivers and add-ons.
Hardware model and serial	Include host generation and platform details.
Firmware versions	BIOS/UEFI, BMC/iLO/iDRAC, NIC, HBA, RAID, disks.
vLCM / SDDC Manager image state	Desired image, compliance state, recent remediation tasks.
Incident timeline	Failure time, last lifecycle change, reboot time, validation steps.
Scope statement	One host, one cluster, one hardware model, or fleet-wide.

This is the difference between “we had a PSOD” and “we have a reproducible evidence package.”
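
A quick completeness check before opening the case can catch a missing artifact while the evidence is still easy to collect. The sketch below assumes each artifact is saved under a fixed file name in one directory. Those file names are hypothetical and only illustrate the idea, so adapt them to your own naming standard.

```shell
# Lists which expected artifacts are missing or empty in an escalation
# directory. File names are hypothetical placeholders.
# Returns 0 only when nothing is missing.
packet_missing() {
  dir=$1
  missing=0
  for f in psod-screenshot.png vm-support.tgz zdump esxi-version.txt \
           vib-list.txt firmware.txt timeline.txt scope.txt
  do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example:
#   packet_missing ./incident-evidence || echo "packet incomplete"
```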

Conclusion

A PSOD is not just a crash screen. It is a time-sensitive evidence source.

The right operational posture is to slow down just enough to capture the facts: screenshot, dump status, support bundle, core dump, ESXi build, driver versions, firmware state, and recent lifecycle changes. Once that evidence is preserved, the team can make a disciplined decision: apply a known fix, roll back a suspect change, isolate a hardware candidate, or escalate with a useful support packet.

KB 316522 is a good reminder of why this matters. The visible signature is useful, but the real answer lives in the correlation between the stack, the platform, the driver, the ESXi build, and the lifecycle history. Treat the purple screen as the start of the investigation, not the end of it.
