A purple screen on an ESXi host creates an immediate operational problem, but the bigger risk is what happens next.
The first reaction is usually to get the host back online. That is understandable, especially when workloads are down, HA is recovering virtual machines, or a cluster is running hot after losing capacity. But if the host is power-cycled too quickly, patched based on a partial error string, or returned to production without preserving evidence, the team may lose the only data that can explain what actually happened.
Broadcom KB 316522 is a useful example. The signature mentions NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307, and the KB ties a specific pattern to HPE Gen10, Gen10 Plus, and Gen11 platforms, the HPE iLO native driver, and ESXi 7.x / 8.x remediation guidance. But the important lesson is broader than one crash signature: a PSOD should be treated as an evidence workflow before it becomes a remediation workflow.
Broadcom’s KB identifies the affected hardware families, the vmkernel.log heap alert, and an example NOT_IMPLEMENTED backtrace, while also noting that the example values can vary by environment.
This runbook focuses on how to triage an ESXi PSOD in a VCF or vSphere environment: capture the screen, preserve the core dump, collect support bundles, correlate driver and firmware state, and decide whether the next move is rollback, targeted remediation, hardware isolation, or vendor escalation.
Scenario
You have an ESXi host that has stopped responding. vCenter shows the host as disconnected or not responding. Some virtual machines may have restarted through HA, some may still be unavailable, and the out-of-band console shows a purple diagnostic screen.
The visible error may include a recognizable string such as:
NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307
or a similar ASSERT, Exception, Spin count exceeded, Machine Check Exception, or device-driver-related failure.
The goal is not to diagnose the entire VMkernel from the console. The goal is to preserve enough evidence that the next decision is defensible.
Why This Matters Operationally
A PSOD is not a normal service restart. Broadcom describes purple screen errors as severe hardware or software errors that halt the server and prevent it from continuing. Common symptoms include the host being listed as not responding in vCenter, VMs becoming unresponsive, loss of ICMP or SSH access, and the host displaying a purple diagnostic screen on the console.
That distinction matters because the reboot is only one part of recovery. The incident also creates several operational questions:
| Question | Why it matters |
|---|---|
| Did the host successfully write a core dump? | Without it, support may only have logs and a screenshot. |
| Is this a one-off hardware event or a repeatable software signature? | The remediation path is different. |
| Did the crash follow a driver, firmware, ESXi, or vendor add-on change? | Recent lifecycle activity becomes evidence. |
| Is the issue isolated to one host model, one cluster, or one image baseline? | Scope determines whether to isolate, roll forward, roll back, or escalate. |
| Can the host safely return to production? | Returning a repeatedly crashing host can create secondary outages. |
A PSOD is a failure event, but it is also a diagnostic opportunity. Treat it that way.
Symptoms and Risk Signals
During the first pass, do not over-index on a single line of the purple screen. The line number in a signature can vary between ESXi builds, patches, and compiled code paths. Instead, capture a small set of repeatable evidence points.
Look for:
| Evidence | Example | Why it matters |
|---|---|---|
| ESXi version and build | ESXi 7.x / 8.x build number | Determines known issue applicability and patch path. |
| Panic or exception string | NOT_IMPLEMENTED, #PF Exception, MCE, ASSERT | Helps route the issue class. |
| Stack trace pattern | World_DestroyHeap, CpuSched_StartWorld, driver modules | Helps compare recurrence across hosts. |
| Physical CPU / world | Same CPU or same world across crashes | Can suggest hardware affinity or userworld/module pattern. |
| VMK uptime | Host uptime before crash | Useful for recurrence timing. |
| Core dump status | Disk dump successful or failed | Determines whether deep analysis is possible. |
| Recent lifecycle changes | ESXi patch, firmware, driver, vendor add-on | Often the difference between rollback and escalation. |
Broadcom’s purple screen interpretation guidance explains that the stack trace represents what the VMkernel was doing at the time of the error and that the core dump section indicates VMkernel memory being copied to the configured dump location. It also recommends using repeated patterns across error message, stack trace, physical CPU, and world to distinguish likely software patterns from possible hardware patterns.
PSOD Triage Workflow
The workflow below is the operational path I would want a team to follow before making a rollback or patching decision. Notice that remediation does not happen first. Evidence capture does.
The key point is sequencing. A PSOD response should move from evidence preservation to controlled recovery to correlation to decision. Skipping straight to remediation creates a fragile root-cause story.
Prerequisites and Safety Checks
Before touching the host, confirm who owns each part of the response:
| Area | Owner | Safety check |
|---|---|---|
| Workload recovery | vSphere / application operations | Confirm HA restart state, failed VMs, and application impact. |
| Host access | Platform operations | Confirm iLO/iDRAC/OOB console access and SSH policy. |
| Evidence handling | Platform operations / security | Confirm whether core dumps and support bundles can be shared externally. |
| Lifecycle data | VMware / hardware platform team | Confirm ESXi image, vendor add-on, driver, firmware, and BIOS state. |
| Escalation | VMware/Broadcom and hardware vendor support | Confirm SR ownership and upload path. |
Core dumps and support bundles deserve special handling. Broadcom notes that host support bundles can include host logs, VM descriptions, system state, and core dumps. Core dumps can include data from memory at the time of failure, and transmitting a support bundle grants VMware permission to examine the included data. Environments using vSphere Virtual Machine Encryption can also affect core dump handling and access.
That does not mean “do not collect evidence.” It means evidence collection should follow your security policy.
Runbook Stage 1: Capture the PSOD Before Reboot
When the purple screen is still visible, capture it.
Do not reset the host immediately. Broadcom explicitly warns not to reset an ESX/ESXi host while the purple screen is displayed and recommends taking a picture or screenshot that captures all visible technical data. The same guidance says to verify whether “Disk Dump Successful” appears and to allow more time if the dump has not completed; in some cases, dump completion may take up to an hour.
Capture:
| Item | How |
|---|---|
| Full console screenshot | OOB console screenshot or phone photo if necessary |
| Hostname and asset tag | vCenter inventory, hardware management console, CMDB |
| Time of failure | Include timezone and whether this is host, vCenter, or monitoring time |
| ESXi build | From console if visible, otherwise collect after reboot |
| Panic string | Exact first-line message and any file/line reference |
| Stack trace | Full visible backtrace, not just the first line |
| Dump status | Whether the screen reports dump progress, success, or failure |
A partial screenshot of only the first error line is not enough. The stack trace, CPU/world information, and dump status are part of the diagnostic record.
Runbook Stage 2: Reboot Without Destroying the Investigation
After the dump completes, reboot the host through the cleanest available method. If the host is fully halted, the out-of-band power control may be the only practical option.
After boot:
- Do not immediately return the host to normal workload placement.
- Keep the host in maintenance mode or otherwise prevent automated workload return if recurrence risk is unknown.
- Confirm whether vCenter reports an unread host kernel core dump.
- Collect logs and support bundles before applying patches, removing drivers, or changing firmware.
The startup sequence can process configured core dump slots and create a core dump file after a PSOD, which can then be reviewed for corrective action and root-cause work.
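The "do not return the host to workload placement" step can be sketched as a small guard run over SSH after the reboot. This is a minimal sketch, not a definitive procedure: the helper name `ensure_maintenance_mode` is hypothetical, and it assumes `esxcli system maintenanceMode get` prints `Enabled` or `Disabled`. In a vCenter-managed cluster you may prefer to enter maintenance mode from vCenter instead.

```shell
# Hypothetical helper: keep a freshly rebooted host in maintenance mode
# until evidence collection is finished.
ensure_maintenance_mode() {
    state="$(esxcli system maintenanceMode get)"
    if [ "$state" = "Enabled" ]; then
        echo "maintenance mode already enabled"
    else
        # Entering maintenance mode can block while VMs are still running.
        esxcli system maintenanceMode set --enable true &&
            echo "maintenance mode enabled"
    fi
}
```

Run `ensure_maintenance_mode` before collecting bundles so that automated placement does not move workloads back while recurrence risk is still unknown.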
Runbook Stage 3: Collect the Support Bundle
Broadcom’s vm-support guidance states that VMware Technical Support routinely requests diagnostic information for support requests and that the vm-support utility is present on all ESXi versions, though available options vary by release. The traditional command creates a compressed .tgz bundle locally on the host, and -w can write it to a specific VMFS datastore.
Use the datastore method when the host has enough accessible storage and your security policy allows it:
    # Create a support bundle on a VMFS datastore
    vm-support -w /vmfs/volumes/DATASTORE_NAME
For environments where saving locally is not preferred, Broadcom documents streaming vm-support over SSH to a client system:
    # Stream vm-support to a local file from a management workstation
    ssh root@ESXHostnameOrIPAddress vm-support -s > vm-support-ESXHostname.tgz
This method requires root authentication and is not usable with lockdown mode.
Collect the vCenter support bundle as well if the incident involved HA behavior, host disconnect events, lifecycle remediation, DRS activity, or cluster-level alarms.
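Because bundles are copied between hosts, jump boxes, and support upload portals, it can help to record a checksum at collection time. The sketch below is one way to do that, assuming `sha256sum` is available wherever the bundle lands; the helper name `record_bundle_hash` is hypothetical.

```shell
# Hypothetical helper: record a SHA-256 checksum next to a collected bundle
# so a later copy can be verified against the original.
record_bundle_hash() {
    bundle="$1"
    [ -f "$bundle" ] || { echo "missing bundle: $bundle" >&2; return 1; }
    # Hash from inside the directory so the .sha256 file stores a bare
    # file name and still verifies after both files are copied elsewhere.
    ( cd "$(dirname "$bundle")" &&
      sha256sum "$(basename "$bundle")" > "$(basename "$bundle").sha256" )
}

# Verify later from the directory that holds both files, for example:
#   sha256sum -c vm-support-ESXHostname.tgz.sha256
```

Storing the checksum with the incident record gives you a simple way to show that the bundle handed to support is the one collected before remediation.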
Runbook Stage 4: Preserve and Verify the Core Dump
Do not assume the dump exists just because the host crashed.
Check the configured dump targets:
    # Check VMFS coredump files
    esxcli system coredump file list
    # Check coredump partition configuration
    esxcli system coredump partition list
    # Check network coredump configuration
    esxcli system coredump network get
The ESXCLI command reference includes commands to create, list, set, and remove VMkernel dump files; it also includes commands to check file, partition, and network dump configuration.
If the host uses a diagnostic partition, Broadcom documents extracting a VMkernel core dump by identifying the diagnostic partition with esxcli system coredump partition list or esxcfg-dumppart -t, changing to a datastore with enough space, and using esxcfg-dumppart --copy to produce a zdump file.
    # Identify diagnostic partition
    esxcli system coredump partition list
    # Example extraction pattern after identifying the device path
    cd /vmfs/volumes/DatastoreName/
    esxcfg-dumppart --copy \
      --devname "/vmfs/devices/disks/identifier" \
      --zdumpname /vmfs/volumes/DatastoreName/hostname-date-zdump
If no coredump target exists, fix that as a preventive control after the incident. Broadcom’s coredump-to-file guidance notes the warning “No coredump target has been configured. Host core dumps cannot be saved,” and documents creating a VMFS dump file with esxcli system coredump file add, then enabling it with esxcli system coredump file set --smart --enable true. It also notes that Software iSCSI and Software FCoE are not supported for coredump locations.
    # Create a VMFS coredump file
    esxcli system coredump file add -d <datastore_UUID> -f <hostname>.dumpfile
    # Enable smart selection for the dump file
    esxcli system coredump file set --smart --enable true
    # Verify Active and Configured are true
    esxcli system coredump file list
For larger environments, configure network dump collection as a standard build item. Broadcom states that ESXi network coredump functionality helps capture diagnostic data through the network during a purple diagnostic screen, and documents configuring it with a VMkernel interface, destination server IP, and UDP port, then validating with esxcli system coredump network get and vmkping.
    # Configure network coredump collector
    esxcli system coredump network set \
      --interface-name vmk0 \
      --server-ipv4 <collector-or-vcenter-ip> \
      --server-port 6500
    # Enable network coredump
    esxcli system coredump network set --enable true
    # Verify configuration
    esxcli system coredump network get
    # Confirm VMkernel network path
    vmkping -I vmk0 <collector-or-vcenter-ip>
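If network dump collection is a standard build item, a scripted check can flag hosts where it has drifted. The following is a rough sketch: the helper name `netdump_enabled` is hypothetical, and it assumes `esxcli system coredump network get` prints `Key: value` lines that include an `Enabled:` entry. Confirm the layout on your build before relying on it.

```shell
# Hypothetical helper: read `esxcli system coredump network get` output on
# stdin and succeed only when an "Enabled: true" line is present.
netdump_enabled() {
    awk -F': *' '
        $1 ~ /^[[:space:]]*Enabled$/ { found = ($2 ~ /^true[[:space:]]*$/) }
        END { exit (found ? 0 : 1) }
    '
}

# Example use on a host:
#   esxcli system coredump network get | netdump_enabled ||
#       echo "WARNING: network coredump is not enabled"
```

The check only covers configuration state. The `vmkping` test is still needed to confirm that the collector is reachable from the VMkernel interface.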
Runbook Stage 5: Build the Evidence Matrix
Once the host is booted and evidence is preserved, build a simple matrix. This gives support, hardware vendors, and internal change approvers the same view of the event.
| Evidence | Command or source | Notes |
|---|---|---|
| ESXi version and build | vmware -vl | Match against KBs and release notes. |
| Installed VIBs/components | esxcli software vib list | Look for hardware vendor drivers and async drivers. |
| Loaded modules | esxcli system module list | Useful when a stack trace references a module or device path. |
| Coredump config | esxcli system coredump file list / partition list / network get | Confirms whether future crashes will be captured. |
| VMkernel logs | /var/log/vmkernel.log | Search for panic, heap, driver, storage, network, MCE, or NMI messages. |
| Hardware model | esxcli hardware platform get | Required for vendor advisories and compatibility checks. |
| Firmware / BIOS / iLO | Vendor tooling, OneView, iLO, iDRAC, OME, vLCM/HSM | Needed for hardware correlation. |
| Recent changes | vLCM, SDDC Manager, change record | Determines rollback versus roll-forward options. |
Useful first-pass commands:
    # Version and build
    vmware -vl
    # Hardware platform
    esxcli hardware platform get
    # Coredump targets
    esxcli system coredump file list
    esxcli system coredump partition list
    esxcli system coredump network get
    # Installed packages / drivers
    esxcli software vib list | grep -Ei "hpe|ilo|ams|smad|bnxt|lpfc|nfnic|nenic|qfle|nvme|scsi|fc|nic"
    # Loaded modules
    esxcli system module list | grep -Ei "hpe|ilo|bnxt|lpfc|nfnic|nenic|qfle|nvme|scsi|fc|nic"
    # Search vmkernel log for crash-adjacent signals
    grep -Ei "NOT_IMPLEMENTED|ASSERT|Exception|MCE|NMI|world.c|heap|panic|backtrace|coredump" /var/log/vmkernel.log
Treat this as a triage set, not a final RCA. The goal is to avoid empty escalation: “Host crashed, please advise.”
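To attach the triage set to a case rather than paste it from a terminal, the same commands can write into one directory per incident. This is a minimal sketch under stated assumptions: `capture` and `collect_evidence` are hypothetical helper names, the output file names are illustrative, and the target should be a datastore path with enough free space.

```shell
# Run one command and save its output; note a failure instead of aborting.
capture() {
    out="$1"; shift
    "$@" > "$out" 2>&1 || echo "command failed: $*" >> "$out"
}

# Hypothetical helper: save each first-pass command's output to its own file.
collect_evidence() {
    dir="$1"
    mkdir -p "$dir" || return 1
    capture "$dir/version.txt"    vmware -vl
    capture "$dir/platform.txt"   esxcli hardware platform get
    capture "$dir/vibs.txt"       esxcli software vib list
    capture "$dir/modules.txt"    esxcli system module list
    capture "$dir/dump-file.txt"  esxcli system coredump file list
    capture "$dir/dump-part.txt"  esxcli system coredump partition list
    capture "$dir/dump-net.txt"   esxcli system coredump network get
    capture "$dir/vmkernel.log"   cat /var/log/vmkernel.log
}

# Example use on a host:
#   collect_evidence "/vmfs/volumes/DATASTORE_NAME/psod-evidence-$(hostname)"
```

Each command is best-effort, so one failing query does not stop the rest, and the failure itself is recorded in the output file.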
Runbook Stage 6: Compare the Signature Without Anchoring on It
This is where KB 316522 becomes useful.
Broadcom’s KB identifies a specific issue where ESXi hosts on HPE Gen10, Gen10 Plus, or Gen11 hardware can experience a PSOD. The KB lists a vmkernel.log alert similar to Unable to complete wait for non-empty heap, and an example backtrace containing NOT_IMPLEMENTED and World_DestroyHeap.
The KB’s stated cause is specific: when a kernel module exposing a character device does not behave as expected, a vmkpollcontext object can leak after a userspace poll() syscall; later, when the userspace process terminates, the VMkernel can PSOD with a NOT_IMPLEMENTED assert. The KB also says the HPE ilo kernel module used by HPE SMAD is known to cause this issue.
For remediation, Broadcom states:
| Environment | KB 316522 remediation guidance |
|---|---|
| ESXi 7.0 or later | Update the HPE iLO Native Driver component to v10.8.2 or later. |
| ESXi 8.0 or later | Update the HPE iLO Native Driver component to v10.8.2 or later and update ESXi to 8.0 Update 2b or later. |
The operational caution is this: do not assume every NOT_IMPLEMENTED purple screen is KB 316522. Match the platform, ESXi version, vendor module state, log alert, stack trace shape, and recent lifecycle history. A signature is evidence. It is not the entire case.
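As a first filter, you can check whether the two log indicators named in the KB appear together. This is a sketch only: the helper name `kb316522_indicators` is hypothetical, the input can be `vmkernel.log` or a text transcript of the purple screen, and the exact alert wording on a given build may differ from the strings searched for here.

```shell
# Hypothetical helper: report which KB 316522 indicators appear in a log file.
# A hit on both is a reason to compare further, not a confirmed match.
kb316522_indicators() {
    log="$1"
    hits=0
    if grep -q "Unable to complete wait for non-empty heap" "$log"; then
        echo "heap alert: present"; hits=$((hits + 1))
    else
        echo "heap alert: absent"
    fi
    if grep -q "World_DestroyHeap" "$log"; then
        echo "World_DestroyHeap frame: present"; hits=$((hits + 1))
    else
        echo "World_DestroyHeap frame: absent"
    fi
    # Succeed only when both indicators were found.
    [ "$hits" -eq 2 ]
}
```

A successful result still leaves the platform, ESXi version, driver state, and lifecycle history to be matched by hand.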
Runbook Stage 7: Correlate Driver, Firmware, Build, and Vendor Image
A PSOD investigation usually becomes a lifecycle investigation.
For HPE environments, confirm whether the host is running a supported HPE custom ESXi image, a vendor add-on, or a manually assembled image. HPE’s VMware ESXi support page states that HPE servers require the HPE custom ESXi image or an ESXi image built with ImageBuilder that includes appropriate drivers for the boot controller and at least one network device. It also notes that drivers for newer network and storage controllers are integrated in the HPE custom ESXi image and are not part of VMware’s base ESXi image.
For clusters managed by vSphere Lifecycle Manager, use the image, vendor add-on, firmware and drivers add-on, and hardware support manager data as part of the evidence trail. VMware’s Cloud Foundation blog notes that firmware, driver, and BIOS/EFI versions can be inspected and monitored for compliance with the Broadcom Compatibility Guide and vSAN Compatibility Guide, and that vSphere Lifecycle Manager interfaces with a registered Hardware Support Manager to orchestrate firmware lifecycle operations.
Capture:
| Layer | Evidence to collect |
|---|---|
| ESXi base image | Version, build, patch level |
| Vendor add-on | HPE, Dell, Lenovo, Cisco, or other vendor package version |
| Device drivers | NIC, storage, NVMe, FC, iLO/iDRAC/platform agents |
| Firmware | BIOS/UEFI, BMC/iLO/iDRAC, NIC, HBA, RAID, disk firmware |
| Management agents | AMS, SMAD, CIM providers, vendor tools |
| Cluster lifecycle state | Desired image, compliance drift, recent remediation tasks |
The strongest escalation packet includes both the crash evidence and the lifecycle state. The support engineer should not have to ask which driver was installed, which firmware was active, or whether the host was recently remediated.
Runbook Stage 8: Decide Rollback, Roll Forward, or Escalate
The wrong move is to pick one answer for every PSOD. Use the evidence pattern.
| Condition | Preferred action | Why |
|---|---|---|
| Known KB match, supported fix exists, and issue matches platform/build/driver pattern | Roll forward to the documented driver/ESXi fix during a controlled maintenance window | You have a supported remediation path. |
| Crash started immediately after a driver, firmware, or ESXi update and repeats on the same image | Consider rollback to the last known-good validated image while preserving evidence and opening support | The change is temporally tied to the incident. |
| Same host repeatedly crashes with different stack traces or same physical CPU indicators | Isolate host and engage hardware vendor diagnostics | Pattern may indicate hardware or platform fault. |
| Multiple hosts on the same model/image show the same signature | Treat as cluster image or vendor component issue; stop broad remediation until scoped | Prevents spreading a bad image or unsupported combination. |
| No core dump, no full screenshot, and no repeatable pattern | Fix evidence capture first, then monitor or escalate with limited confidence | RCA will be weak without dump and logs. |
| Production cluster is capacity constrained after host loss | Keep stability first; defer nonessential remediation until workload capacity is safe | Avoids creating a second outage during investigation. |
A rollback should not be emotional. It should be tied to a recent known change, a repeatable failure pattern, and an approved fallback image. A roll-forward should be tied to a vendor-documented fix, compatibility validation, and staged host remediation. Escalation should include enough artifacts for support to analyze the issue instead of recreating your evidence collection process.
Targeted Remediation Example: KB 316522 Pattern
When the evidence matches KB 316522, the remediation path should still be staged.
Recommended sequence:
- Confirm affected hardware model: HPE Gen10, Gen10 Plus, or Gen11.
- Confirm ESXi major version and build.
- Confirm installed HPE iLO Native Driver component version.
- Confirm whether the vmkernel.log heap alert and stack trace pattern match the KB.
- Confirm whether HPE SMAD / AMS / iLO-related components are present.
- Confirm the target driver and ESXi build are supported for the server model.
- Remediate one host first in a maintenance window.
- Validate stability before expanding to the cluster.
- Document the final image state in vLCM / SDDC Manager / change records.
For ESXi 8.x hosts matching this KB, Broadcom’s resolution calls for both the HPE iLO Native Driver component v10.8.2 or later and ESXi 8.0 Update 2b or later.
That “and” matters. Updating only one layer may leave the environment in a partially remediated state.
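Checking the driver layer against the documented minimum is a dotted-version comparison, which is easy to get wrong by eye (10.10 is newer than 10.8). A minimal sketch follows, assuming you have already reduced the installed version to plain numeric fields; the helper name `version_ge` is hypothetical, and vendor build suffixes must be stripped before calling it.

```shell
# Hypothetical helper: succeed when dotted version $1 is >= dotted version $2.
# Only plain numeric fields are compared.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            # Missing fields count as 0, so 10.8 compares like 10.8.0.
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Example: installed_version is whatever you read from the VIB/component list.
#   version_ge "$installed_version" 10.8.2 || echo "iLO driver below 10.8.2"
```

On ESXi 8.x this covers only the driver half of the fix. The ESXi build still has to be checked separately against 8.0 Update 2b or later.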
Validation Steps After Recovery
After the host is back online and before it returns to full production placement, validate the following:
| Validation | Pass condition |
|---|---|
| Host boots cleanly | No immediate PSOD or management agent failure. |
| vCenter connectivity restored | Host reconnects without repeated disconnects. |
| Core dump target configured | File, partition, or network dump target is active and configured. |
| Support bundle collected | Bundle is stored securely and associated with the incident/SR. |
| Driver and firmware state captured | Evidence matrix includes current and previous versions. |
| Cluster health stable | HA, DRS, vSAN, NSX, and workload alarms reviewed as applicable. |
| Lifecycle compliance known | Host is compliant with intended image or intentionally held back. |
| Recurrence monitoring active | Logs and monitoring are watching for repeated stack or heap alerts. |
For VCF environments, also confirm whether SDDC Manager, vCenter, NSX, vSAN, and lifecycle tasks recorded relevant events around the incident window. A host PSOD may be local, but the recovery story is cluster-wide.
Rollback and Fallback Guidance
Rollback is appropriate when the evidence points to a recent change and a known-good target exists. It is not appropriate when the team is guessing.
Before rollback, confirm:
- The previous ESXi image, vendor add-on, driver, and firmware combination is documented.
- The previous state is still supported by the hardware vendor and VMware/Broadcom.
- The rollback process has been tested or is operationally understood.
- Workloads can tolerate the maintenance sequence.
- Evidence from the failure state has already been collected.
Fallback options include:
| Fallback | Use when |
|---|---|
| Keep host in maintenance mode | Recurrence risk is unknown or evidence points to hardware. |
| Evacuate and isolate host | Cluster has enough capacity and host stability is suspect. |
| Revert to previous image | Recent lifecycle change is strongly correlated and rollback is supported. |
| Apply vendor-documented fix | KB match is strong and remediation is validated. |
| Open Broadcom and hardware vendor cases | Core dump analysis or hardware diagnosis is required. |
Do not remove vendor agents, disable platform modules, or downgrade drivers as an unsupported workaround unless directed by the vendor or support. Those changes may reduce observability, create supportability issues, or make later analysis harder.
What to Hand to Support
A good escalation packet should include:
| Artifact | Notes |
|---|---|
| Full PSOD screenshot | Include the entire visible stack, not just the first line. |
| vm-support bundle | Collected before remediation where possible. |
| Core dump / zdump | Preserve securely; follow data handling policy. |
| ESXi version/build | vmware -vl output. |
| Installed VIB/component list | Include vendor drivers and add-ons. |
| Hardware model and serial | Include host generation and platform details. |
| Firmware versions | BIOS/UEFI, BMC/iLO/iDRAC, NIC, HBA, RAID, disks. |
| vLCM / SDDC Manager image state | Desired image, compliance state, recent remediation tasks. |
| Incident timeline | Failure time, last lifecycle change, reboot time, validation steps. |
| Scope statement | One host, one cluster, one hardware model, or fleet-wide. |
This is the difference between “we had a PSOD” and “we have a reproducible evidence package.”
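A quick completeness check before opening the case can catch a missing artifact while the evidence is still easy to collect. The sketch below assumes one directory per incident; the helper name `check_packet` and every file name in it are illustrative choices, not a support requirement.

```shell
# Hypothetical helper: list which expected artifacts are missing from an
# escalation directory and fail if any are absent.
check_packet() {
    dir="$1"
    missing=0
    for f in psod-screenshot.png vm-support.tgz zdump version.txt \
             vibs.txt firmware.txt timeline.txt scope.txt; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Adjust the file list to match how your team names artifacts. The value is in having the list enforced, not in these particular names.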
Conclusion
A PSOD is not just a crash screen. It is a time-sensitive evidence source.
The right operational posture is to slow down just enough to capture the facts: screenshot, dump status, support bundle, core dump, ESXi build, driver versions, firmware state, and recent lifecycle changes. Once that evidence is preserved, the team can make a disciplined decision: apply a known fix, roll back a suspect change, isolate a hardware candidate, or escalate with a useful support packet.
KB 316522 is a good reminder of why this matters. The visible signature is useful, but the real answer lives in the correlation between the stack, the platform, the driver, the ESXi build, and the lifecycle history. Treat the purple screen as the start of the investigation, not the end of it.
External Sources
- Broadcom KB 316522: ESXi host may crash with PSOD with the message NOT_IMPLEMENTED bora/vmkernel/main/world.c:2307
- Broadcom KB 337182: ESX/ESXi host stops responding and displays a purple diagnostic screen
- Broadcom KB 343033: Interpreting an ESX/ESXi host purple diagnostic screen
- Broadcom KB 313542: Collecting diagnostic information for VMware ESX/ESXi using the vm-support command
- Broadcom ESXCLI Command Reference: esxcli system namespace
- Broadcom KB 343591: Extracting a core dump file from the diagnostic partition
- Broadcom KB 314320: Configuring ESXi coredump to file instead of partition
- Broadcom KB 344063: Configuring the network dump collector service in ESXi
- Broadcom KB 327899: Data collected when gathering diagnostic information for VMware ESX/ESXi
- HPE VMware ESXi support and certification matrix
- VMware Cloud Foundation Blog: Firmware Lifecycle Made Simple with vSphere Lifecycle Manager
