Finding VM File Locks on ESXi: A Production-Safe Runbook Before You Kill Processes

A virtual machine file lock issue rarely shows up at a convenient time. It usually appears when a VM refuses to power on, a consolidation task is stuck, a backup window has overrun, or vCenter and an ESXi host disagree about the VM’s state.

That is exactly when bad habits get expensive.

The temptation is to find a process, kill it, retry the power-on, and move on. Sometimes that works. Sometimes it turns a lock investigation into a corrupted disk chain, a broken snapshot tree, or an application outage that is harder to explain than the original alert.

Broadcom KB 314365 gives the core mechanics for investigating virtual machine file locks on ESXi hosts. This runbook takes that workflow and wraps it in a production-safe operating model: identify the locked file, determine the real lock owner, correlate host ownership, validate datastore health, avoid destructive fixes, and only then decide whether process termination is appropriate.

The goal is not to “break the lock.”

The goal is to understand why the lock exists before taking an action that cannot be rolled back.

Scenario

You are operating a vSphere or VMware Cloud Foundation environment and a VM has one or more of the following symptoms:

  • Power-on remains stuck near completion or fails.
  • The VM appears as Invalid.
  • vCenter and the ESXi Host Client show conflicting VM power states.
  • A disk add, snapshot consolidation, clone, or backup-related operation fails.
  • The task reports an error such as Unable to access a file since it is locked.
  • The VMX file or a VMDK cannot be opened from the host shell.

Broadcom KB 314365 lists these types of symptoms, including power-on tasks stuck around 95%, invalid VM state, conflicting power states, locked-file errors, swap-file errors, VMX access errors, and vmkernel log messages indicating that a swap file could not be opened because the lock was not free.

In production, those symptoms are not enough to justify killing a VMX process. They are only enough to justify starting a lock investigation.

Why File Locks Matter Operationally

ESXi file locking is a protection mechanism, not just an obstacle. A powered-on VM legitimately holds locks on files it is actively using so that multiple hosts or processes do not make unsafe concurrent changes. Broadcom describes this purpose directly: ESXi hosts establish locks on critical VM files and file systems to prevent concurrent changes.

Runtime locks can involve files such as:

  • VMNAME.vswp
  • DISKNAME-flat.vmdk
  • DISKNAME-00000x-delta.vmdk
  • VMNAME.vmx
  • VMNAME.vmxf
  • vmware.log

Those file types are called out in KB 314365 as examples of VM files locked during runtime.

That matters because a lock does not automatically mean something is wrong. A lock can be legitimate when the VM is powered on, when a snapshot-based backup appliance has hot-added a disk, when a VM is using snapshots, when multi-writer disks are configured, or when a management task is still running.

The unsafe move is assuming every lock is stale.

Symptoms and Risk

The most common operational mistake is treating all lock symptoms as the same issue. A locked VMDK during backup is different from a stale NFS .lck file. A vSAN object lock is different from a VMFS metadata lock. A datastore in APD or PDL is different from a VMX process that did not exit cleanly.

Use this table as the first triage filter.

| Symptom | Likely Investigation Path | Production Risk |
| --- | --- | --- |
| VM will not power on and task shows locked file | Identify locked file and lock owner | Repeated retries may create noise without fixing ownership |
| Snapshot consolidation fails | Check backup proxy, hot-add, and active tasks | Killing the wrong process can damage backup or consolidation state |
| VM is Invalid or inaccessible | Validate datastore access before lock remediation | Storage outage may be the root cause, not a stale VM process |
| Lock owner is another ESXi host | Correlate MAC/IP/host before action | Killing locally will not clear a remote lock |
| NFS .lck file exists after VM is powered off | Investigate NFS lock owner and stale lock file | Deleting the wrong .lck while a VM is active can be destructive |
| vSAN object lock exists | Investigate vSAN object UUID lock | File-path assumptions from VMFS may not apply |

Broadcom’s locked virtual disk guidance also notes that locked disks can result from powered-on VM ownership, snapshot-based backup appliances, unsupported disk formats, or existing locks.

Runbook Workflow at a Glance

The important thing to notice in this workflow is that process termination is intentionally late. The runbook moves from evidence collection to ownership correlation to datastore validation before remediation.

Prerequisites and Safety Checks

Before you run commands, establish control of the situation.

You need:

  • vCenter access with permission to view tasks, events, hosts, datastores, and VM configuration.
  • SSH access to the relevant ESXi hosts.
  • The VM name, datastore name, VM folder path, and expected power state.
  • The current backup or replication job status for the VM.
  • Awareness of whether the datastore is VMFS, NFS, or vSAN.
  • A maintenance or change record if remediation may affect a running workload.
  • Application owner awareness if a hard VM process kill becomes a possibility.

Do not start with kill.

Do not delete VMDKs.

Do not look for .lck files to delete on VMFS. Broadcom explicitly notes that VMFS volumes do not have .lck files because VMFS locking is handled in VMFS metadata.

Do not use a broad services.sh restart as a casual fix. Broadcom’s ESXi management-agent guidance warns not to use services.sh restart or DCUI Restart Management Agents when vSAN, LACP, NSX, or shared graphics are in use; it recommends restarting single services instead.

That warning matters in VCF environments because vSAN and NSX are common parts of the platform design. A lock incident should not become a management-plane incident.

Stage 1: Capture the Error Before You Change Anything

Start with the failed task and the VM’s current state.

In newer environments, this may be easier than it used to be. Broadcom notes that starting with ESXi 8.0 U2, the file-lock owner can be shown in the vSphere Client by navigating to the VM’s Monitor tab, then Tasks and Events, then Tasks. The locked virtual disk KB similarly states that vSphere 8.0 U2 can show lock details in the failed Power On virtual machine task status, including file path, host, MAC, world name, and lock mode.

Capture:

  • Failed task name
  • Exact error text
  • Locked file path
  • Host reported in the task
  • MAC address reported in the task
  • Lock mode
  • World name, if shown
  • VM power state in vCenter
  • VM power state in the ESXi Host Client
  • Recent backup, snapshot, clone, storage, or vMotion activity

If the UI does not expose the lock owner, perform one controlled reproduction only if appropriate. KB 314365 describes powering on the VM so the operation fails and the error can be noted, then connecting to the ESXi host by SSH. In production, do this only when the VM is expected to be powered on and no other recovery workflow is already in progress.

Stage 2: Identify the Datastore Type

The datastore type determines the investigation path.

| Datastore Type | Lock Investigation Pattern | Important Caveat |
| --- | --- | --- |
| VMFS | Use vmfsfilelockinfo against the relevant flat, delta, or sesparse file | Do not look for .lck files on VMFS |
| NFS | Inspect .lck-#### files and decode owner information | Only remove stale .lck files when the VM is powered off |
| vSAN | Resolve the vSAN object UUID and inspect .UUID.lck | vSAN virtual disk objects are not the same as VMFS flat files |

Broadcom KB 314365 states that vmfsfilelockinfo can be run against the VMDK flat, delta, or sesparse file for VMFS, or the .UUID.lck file for vSAN.

Stage 3: Investigate VMFS Locks

For VMFS, identify the active disk file from the VM configuration or from the failure message. If snapshots are present, the locked file may be a delta disk rather than the base descriptor.

From an ESXi host with access to the datastore:

cd /vmfs/volumes/<datastore>/<vm-folder>

ls -lah

vmfsfilelockinfo -p /vmfs/volumes/<datastore>/<vm-folder>/<locked-file>.vmdk \
  -v <vcenter-fqdn-or-ip> \
  -u <sso-user>

Example pattern:

vmfsfilelockinfo -p /vmfs/volumes/prod-ds01/App01/App01-000003-delta.vmdk \
  -v vcsa01.example.com \
  -u administrator@vsphere.local

vmfsfilelockinfo can use vCenter credentials to trace the MAC address back to an ESXi host. KB 314365 shows the command returning the MAC address of the host holding the lock and, when vCenter lookup succeeds, the host owning the lock and the lock mode.

Interpret the lock mode carefully:

| Lock Mode | Meaning |
| --- | --- |
| mode 0 | No lock |
| mode 1 | Exclusive lock |
| mode 2 | Read-only lock |
| mode 3 | Multi-writer lock |

Broadcom lists these lock modes and gives examples such as exclusive locks for powered-on VM files, read-only locks for certain snapshot cases, and multi-writer locks for MSCS or FT-style scenarios.

The practical point is simple: mode 1 is not automatically stale, and mode 3 may be intentional.
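To keep that interpretation next to the command output during an incident, the table can be folded into a small shell helper. This is a minimal sketch: lock_mode_hint is a hypothetical name, and the hint text only restates the guidance in this section rather than anything vmfsfilelockinfo prints.

```shell
# Hypothetical helper: map a numeric lock mode to its meaning plus a triage hint.
# The hints restate this runbook's guidance; they are not tool output.
lock_mode_hint() {
  case "$1" in
    0) echo "No lock" ;;
    1) echo "Exclusive lock: confirm the VM is not running before calling it stale" ;;
    2) echo "Read-only lock: check for snapshot-related readers" ;;
    3) echo "Multi-writer lock: may be intentional, check the cluster design" ;;
    *) echo "Unknown lock mode: $1" >&2; return 1 ;;
  esac
}
```

Calling lock_mode_hint 1 prints the exclusive-lock reminder; any value outside 0 to 3 returns a non-zero status so a wrapper script can stop.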

Stage 4: Investigate NFS Locks

NFS locking behaves differently from VMFS. Broadcom’s locked virtual disk KB explains that NFS does not provide block-level access for SCSI locks and that NFS locks are implemented by creating .lck-#### files on the NFS server; it also notes that the same command-line tools used for VMFS lock holders cannot be used the same way for NFS.

From the ESXi host where the affected VM is registered:

cd /vmfs/volumes/<nfs-datastore>/<vm-folder>

ls -lha

If you find a .lck-#### file, inspect it:

hexdump -C .lck-####

Broadcom documents this approach and notes that the output can provide the hostname of the lock owner.

Only consider moving or removing the lock file after you have confirmed:

  • The affected VM is powered off.
  • Backup proxies are not using the disk.
  • No other VM has the disk mounted.
  • No ISO, RDM, or other configuration is referencing the file.
  • The application owner understands that power-on or consolidation will be retried after the lock is moved.

KB 314365 gives a conservative NFS remediation pattern: power down the VM, create a bkup directory, move the .lck-#### files into it, confirm they moved, and only remove lock files for a powered-off VM.

Use move, not delete, unless your support process says otherwise:

mkdir bkup
mv .lck-#### bkup/
ls -lah bkup

If the .lck file is immediately recreated, stop. That means something is still touching the disk.
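The move-then-watch pattern above can be sketched as one guarded function. Treat this as an illustration, not a supported tool: move_nfs_locks is a hypothetical name, the five-second default wait is an arbitrary assumption, and the function cannot verify that the VM is powered off, so that check remains yours to make first.

```shell
# Hypothetical helper: move NFS .lck-* files into bkup/, wait, then report
# whether any lock file was recreated. Does NOT check VM power state.
move_nfs_locks() {
  dir="$1"                      # VM folder on the NFS datastore
  mkdir -p "$dir/bkup" || return 2
  for f in "$dir"/.lck-*; do
    [ -e "$f" ] || continue     # glob matched nothing
    mv "$f" "$dir/bkup/" || return 2
  done
  sleep "${2:-5}"               # assumed wait before re-checking
  for f in "$dir"/.lck-*; do
    if [ -e "$f" ]; then
      echo "Lock recreated: $f"
      return 1                  # something is still touching the disk
    fi
  done
  echo "No lock files recreated"
}
```

A non-zero return is the signal to stop and go back to ownership investigation rather than retrying the move.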

Stage 5: Investigate vSAN Locks

vSAN requires a different mental model. Broadcom’s vSAN lock KB explains that vSAN uses a specific object type for virtual disks and that virtual disks are not stored with the VM configuration files in the namespace directory in the same way VMFS files are.

Start in the VM namespace:

cd /vmfs/volumes/vsanDatastore/<VM_Namespace>

Find the vSAN object UUID from the VMDK descriptor:

grep RW <VMDiskName>.vmdk

You are looking for an extent similar to:

RW 209715200 VMFS "vsan://########-####-####-####-########31f0"

Broadcom documents this pattern and identifies the UUID in the vsan:// path as the vSAN object representing the virtual disk.
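If you would rather not copy the UUID by hand, the extent line can be turned into the lock file name with sed. This is a minimal sketch that assumes the descriptor contains an RW extent line in the form shown above; vsan_lck_name is a hypothetical helper name.

```shell
# Hypothetical helper: read a VMDK descriptor, pull the UUID out of the
# vsan:// extent, and print the hidden lock file name (.<uuid>.lck).
vsan_lck_name() {
  # $1: path to the VMDK descriptor file
  uuid=$(sed -n 's|^RW .*"vsan://\([^"]*\)".*|\1|p' "$1" | head -n 1)
  [ -n "$uuid" ] || return 1    # no vsan:// extent found
  echo ".${uuid}.lck"
}
```

The printed name is what you would pass to vmfsfilelockinfo -p from inside the VM namespace directory. A non-zero return means the descriptor did not contain a vsan:// extent, which is worth investigating before going further.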

Then look for lock files:

ls -lah .*.lck
ls -lah *.lck

Check the object lock:

vmfsfilelockinfo -p .########-####-####-####-########31f0.lck

Broadcom’s vSAN guidance shows vmfsfilelockinfo -p .<uuid>.lck returning the host owning the lock and lock mode, or showing that the file is not locked and is free.

For a broader check in the VM directory, use a quoted loop so VM file names with spaces do not break the command:

for file in *; do
  echo "== ${file} =="
  vmfsfilelockinfo -p "${file}" 2>/dev/null | grep -iE "locked|mode|owner|free"
done

For hidden vSAN lock files:

for file in .*lck; do
  echo "== ${file} =="
  vmfsfilelockinfo -p "${file}" 2>/dev/null | grep -iE "locked|mode|owner|free"
done

Broadcom provides similar vSAN loops for checking all VM files and hidden .lck files.

Stage 6: Correlate the Lock Owner to a Host

Once you have a host, MAC address, or world name, map it back to the real ESXi owner before taking action.

If vmfsfilelockinfo returns the host directly, record it.

If you only have the MAC address, map VMkernel adapters across the cluster. This PowerCLI snippet is read-only and is useful when you need to correlate a lock-owner MAC address to a host without clicking through every host in the UI.

# Map ESXi VMkernel MAC addresses to hosts.
# Change the vCenter name before running.
# This does not modify the environment.

Connect-VIServer vcsa01.example.com

Get-VMHost | Sort-Object Name | ForEach-Object {
    $vmhost = $_

    Get-VMHostNetworkAdapter -VMHost $vmhost -VMKernel |
        Select-Object `
            @{Name = 'VMHost'; Expression = { $vmhost.Name } },
            Name,
            Mac,
            IP,
            PortGroupName
} | Sort-Object Mac | Format-Table -AutoSize

You are looking for the MAC address reported by the failed task or vmfsfilelockinfo.
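When the MAC is pasted from different places (task details, vmfsfilelockinfo, the PowerCLI table), case and separators may not match, and a visual comparison is easy to get wrong. The sketch below normalizes both values before comparing them. normalize_mac and mac_matches are hypothetical helper names, and the input formats handled are an assumption about what you might encounter.

```shell
# Hypothetical helpers: normalize a MAC address to lowercase colon-separated
# form so values copied from different tools can be compared reliably.
normalize_mac() {
  # Strip ':', '.', and '-' separators, lowercase hex digits,
  # then reinsert a colon after every two characters.
  printf '%s' "$1" | tr -d ':.-' | tr 'A-F' 'a-f' | sed 's/../&:/g; s/:$//'
}

mac_matches() {
  [ "$(normalize_mac "$1")" = "$(normalize_mac "$2")" ]
}
```

For example, mac_matches '00-50-56-AB-CD-EF' '00:50:56:ab:cd:ef' succeeds, while two different addresses return a non-zero status.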

If the MAC does not appear, verify that:

  • You are connected to the correct vCenter.
  • The host is still in inventory.
  • The VM is not on a datastore shared outside the expected cluster.
  • The management VMkernel adapter was not recently changed.
  • The host has not been reinstalled or replaced.
  • You are not looking at a backup proxy or another VM that has hot-added the disk.

Stage 7: Decide Whether the Lock Is Expected

Before remediation, classify the lock.

| Finding | Interpretation | Next Action |
| --- | --- | --- |
| VM is powered on on the owning host | Likely expected lock | Do not kill; validate VM state and inventory |
| Backup proxy has the disk hot-added | Expected during backup, stale if backup failed | Remove disk from proxy safely; do not delete from disk |
| Snapshot consolidation is running | Task may legitimately hold a lock | Validate task progress before interruption |
| VM is powered off, no backup, no other VM reference | Lock may be stale | Continue to process/task investigation |
| Lock owner is unknown and datastore has APD/PDL signs | Storage issue may be primary | Fix storage path before VM remediation |
| Multi-writer lock is present | May be intentional | Validate cluster/app design before touching it |

For vSAN specifically, Broadcom recommends checking backup proxy servers and removing the affected disk from the proxy if it is still attached, making sure Delete from disk is not selected.

That same principle applies operationally outside vSAN as well: when a backup proxy has hot-added a production disk, the fix is not to delete the disk. The fix is to detach it safely from the proxy and then retry consolidation or power-on.

Stage 8: Find the Process, Cartel ID, or Task Holding the Lock

On the ESXi host that owns the lock, use lsof and esxcli to connect the file lock to a process.

For a known locked file:

lsof | egrep 'Cartel|<locked-file-name>'

Example:

lsof | egrep 'Cartel|App01-000003-delta.vmdk'

KB 314365 shows this pattern and explains that the output can reveal a VMX Cartel ID for the VM holding the file lock.

Then list active VM processes:

esxcli vm process list

Compare the VMX Cartel ID from lsof to the output of esxcli vm process list. KB 314365 documents using esxcli vm process list to map the Cartel ID back to the VM display name and configuration file.
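On a host with many running VMs, that comparison can be done with awk instead of by eye. This is a sketch under one assumption: each VM block in the esxcli vm process list output shows the VMX Cartel ID line before the Display Name and Config File lines, with blocks separated by a blank line. vm_for_cartel is a hypothetical helper name.

```shell
# Hypothetical helper: read `esxcli vm process list` output on stdin and print
# the Display Name and Config File for the block with the given VMX Cartel ID.
# Assumes the Cartel ID line precedes those fields within each VM block.
vm_for_cartel() {
  awk -v id="$1" '
    /^[ \t]*VMX Cartel ID:/ { in_block = ($NF == id) }
    in_block && /^[ \t]*(Display Name|Config File):/ {
      sub(/^[ \t]+/, "")        # drop the leading indentation
      print
    }
    /^[ \t]*$/ { in_block = 0 } # blank line ends the current VM block
  '
}
```

Usage would look like esxcli vm process list | vm_for_cartel <cartel-id>. Empty output means no running VM on this host matches the Cartel ID, which points you back to host correlation rather than forward to remediation.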

If you suspect a running task rather than a normal VMX owner, inspect VM tasks:

vim-cmd vmsvc/getallvms | grep -i <vm-name>
vim-cmd vmsvc/get.tasklist <vmid>
vim-cmd vimsvc/task_info <task-id>

Broadcom’s locked virtual disk KB uses this flow to identify running VM tasks such as snapshot removal and inspect task state, progress, start time, and cancelability.

If no obvious process is returned, search for another VM that has the VMDK referenced in its VMX file. Keep this scoped; do not run broad datastore searches during a storage incident unless you understand the load.

cd /vmfs/volumes/<datastore>

find . -name "*.vmx" -print0 | \
  xargs -0 grep -H "<disk-name-or-vmdk-fragment>"

This is read-only, but it can still be expensive on large datastores.

Stage 9: Validate Datastore Access Before Remediation

A file lock can be a symptom of a deeper storage issue. Before restarting agents, killing processes, or moving lock files, validate that the datastore is healthy from the relevant hosts.

Check for APD or PDL indicators if hosts are disconnected, datastores are inaccessible, migrations are stuck, or VMs are invalid. Broadcom’s APD/PDL KB lists symptoms such as unavailable datastores, all paths marked dead, hosts becoming unresponsive, stuck migrations, and VMs becoming inaccessible.

On the affected host:

cd /var/run/log

grep -i "esx.problem.storage.apd.start" vobd.log
grep -i "esx.clear.storage.apd.exit" vobd.log
grep -i "permanently inaccessible\|perm loss\|APD\|PDL" vmkernel.log

Broadcom documents checking vobd.log for esx.problem.storage.apd.start and esx.clear.storage.apd.exit events when validating whether a host has LUNs or datastores in APD state.
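A quick way to read those two event types together is to count them. The sketch below is a heuristic only: more start events than exit events in the current log suggests an APD condition may still be open, but log rotation can hide earlier events, so treat the result as a prompt to look closer. apd_event_balance is a hypothetical helper name.

```shell
# Hypothetical helper: compare APD start and exit event counts in vobd.log.
# Returns 0 when every start has a matching exit in this log, 1 when starts
# outnumber exits, 2 when the log cannot be read.
apd_event_balance() {
  log="${1:-/var/run/log/vobd.log}"
  [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
  # grep -c exits non-zero on zero matches, so tolerate that case
  starts=$(grep -cF "esx.problem.storage.apd.start" "$log" || true)
  exits=$(grep -cF "esx.clear.storage.apd.exit" "$log" || true)
  echo "apd.start=${starts} apd.exit=${exits}"
  [ "$starts" -le "$exits" ]
}
```

If the function returns 1, pause VM-level remediation and review the storage path before doing anything else.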

Also check whether the datastore is accessible from all expected hosts:

esxcli storage filesystem list
esxcli storage core path list | less

If the datastore is in APD or PDL, do not treat the VM lock as the root problem. Broadcom states that APD must be resolved at the storage array or fabric layer to restore connectivity, and that affected hosts may require reboots to remove residual references after the condition is resolved.

Stage 10: Remediate Using the Least Destructive Option

Remediation should follow the evidence.

If the Lock Belongs to a Running VM

Do not kill the process just because you found it.

Confirm whether the VM is actually running on the owning host. If vCenter is wrong but the ESXi host shows the VM running, update your incident assessment. You may be dealing with inventory inconsistency, host management-agent issues, or a stale vCenter view.

If the Lock Belongs to a Backup Proxy

Coordinate with the backup owner.

Check whether the disk is still hot-added to a proxy. Remove the disk from the proxy without deleting it from the datastore. In Broadcom’s vSAN guidance, the instruction is explicit: if the affected disk is still attached to the proxy, remove it while ensuring Delete from disk is not selected.

If locks persist after host-side cleanup, Broadcom notes that restarting hostd may not be enough when a backup proxy VM is actively retaining locks; the backup service may need to be restarted on the proxy VM, or the proxy VM may need to be rebooted.

If the Lock Appears to Be a Management-Agent Task

Restart only the specific management agents required, and only after checking the platform context.

/etc/init.d/hostd restart
/etc/init.d/vpxa restart

Broadcom documents restarting hostd and vpxa individually from ESXi Shell or SSH.

In HA-enabled clusters, be deliberate. Broadcom’s locked virtual disk KB warns to deactivate HA Host Monitoring before restarting management agents to prevent unwanted VM failover.

If the Lock Is an NFS .lck File and the VM Is Powered Off

Move the stale lock file into a backup folder.

cd /vmfs/volumes/<nfs-datastore>/<vm-folder>

mkdir bkup
mv .lck-#### bkup/
ls -lah bkup

Only do this when the VM is powered off. KB 314365 cautions to only remove .lck files for a powered-off VM, and it scopes .lck file movement to NFS.

If a VMX Process Must Be Killed

This is the last-resort path, not the normal path.

Broadcom’s locked virtual disk KB describes esxcli vm process kill -t force -w <World ID> as a hard shutdown and also notes that, if the guest is responsive, you should try shutting down from inside the guest OS instead.

Use this decision gate before killing a process:

| Question | Required Answer |
| --- | --- |
| Have you identified the correct host? | Yes |
| Have you identified the correct VM World ID? | Yes |
| Is the VM already unresponsive or confirmed safe to hard power off? | Yes |
| Has the application owner accepted outage/corruption risk? | Yes |
| Have you ruled out backup proxy hot-add ownership? | Yes |
| Have you ruled out active consolidation or storage APD/PDL? | Yes |
| Do you have rollback or recovery steps documented? | As much as possible |

Command pattern:

esxcli vm process list
esxcli vm process kill -t force -w <World-ID>

There is no clean rollback for a forced VMX kill. Treat it like pulling power from a physical server.
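One way to make the decision gate hard to skip is to wrap it in a function that refuses to produce the kill command unless every strict question was answered yes. This is a sketch, not a safety system: kill_gate is a hypothetical name, the answers are self-reported, and the function deliberately prints the command instead of running it so a human still makes the final call.

```shell
# Hypothetical pre-kill gate. First argument: World ID. Next six arguments:
# yes/no answers to the six strict gate questions, in table order.
# Prints the kill command only if all six answers are "yes"; never runs it.
kill_gate() {
  world_id="$1"
  shift
  [ "$#" -eq 6 ] || { echo "expected 6 gate answers" >&2; return 2; }
  for answer in "$@"; do
    if [ "$answer" != "yes" ]; then
      echo "Gate failed: do not kill the VMX process."
      return 1
    fi
  done
  echo "esxcli vm process kill -t force -w ${world_id}"
}
```

The seventh question (documented recovery steps) is left out of the strict check because its required answer is "as much as possible" rather than a firm yes.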

Validation Steps

After remediation, validate in layers.

Validate Lock State

Re-run the relevant lock command.

For VMFS:

vmfsfilelockinfo -p /vmfs/volumes/<datastore>/<vm-folder>/<locked-file>

For vSAN:

vmfsfilelockinfo -p .<object-uuid>.lck

For NFS:

ls -lha

If an NFS .lck file immediately reappears, something still owns the file.

Validate VM Process State

On the expected owner host:

esxcli vm process list | grep -i <vm-name> -B5 -A10
lsof | grep -i <vm-name>

There should not be duplicate VMX processes across hosts. There should not be a backup proxy holding a production disk unexpectedly.

Validate Datastore Health

Check host visibility and APD/PDL indicators again:

esxcli storage filesystem list
grep -i "apd\|pdl\|permanently inaccessible\|perm loss" /var/run/log/vmkernel.log
grep -i "esx.problem.storage.apd.start\|esx.clear.storage.apd.exit" /var/run/log/vobd.log

If storage is still unstable, stop VM-level remediation and escalate the storage incident.

Validate VM Power-On or Consolidation

Retry the original operation once:

  • Power on the VM, if the VM was down.
  • Consolidate snapshots, if the issue was consolidation.
  • Re-run the failed backup cleanup only after the proxy state is clean.
  • Watch the task in vCenter.
  • Tail vmware.log if needed.

tail -f /vmfs/volumes/<datastore>/<vm-folder>/vmware.log

Validate Guest and Application Health

For production workloads, infrastructure validation is not enough.

Confirm:

  • Guest OS booted cleanly.
  • VMware Tools status is normal.
  • Application services started.
  • Filesystem or database checks are complete where appropriate.
  • Monitoring has cleared.
  • Backup jobs are rescheduled or restarted safely.
  • Snapshot/consolidation warnings are gone.

Rollback and Fallback Guidance

Some actions have rollback. Some do not.

| Action | Rollback Option | Notes |
| --- | --- | --- |
| Moving NFS .lck file to bkup | Move it back if needed | Only for powered-off VM and only after ownership validation |
| Restarting hostd / vpxa | No true rollback; wait or restart again if needed | Avoid broad service restart in vSAN, NSX, LACP, or shared graphics environments |
| Removing disk from backup proxy | Reattach only through approved backup/proxy workflow | Never choose Delete from disk for a production VMDK |
| Killing VMX process | No rollback | Hard shutdown; use only after approval |
| Rebooting ESXi host | No rollback; evacuation or outage required | May be required for residual APD/lock conditions |
| Retrying consolidation | Cancel only if safe and supported | Snapshot chain risk increases if interrupted blindly |

Open a support case when:

  • Lock ownership is inconsistent across tools.
  • vSAN object lock output does not match inventory reality.
  • VMFS lock metadata appears abnormal.
  • APD/PDL is present or recently occurred.
  • Snapshot chains are unclear.
  • Consolidation repeatedly fails after lock cleanup.
  • The VM is business-critical and the next step is a forced process kill.

KB 314365 includes opening a Broadcom support request as the next step if the problem persists after completing the investigation and remediation steps.

Command Reference

  • Check VMFS file owner: vmfsfilelockinfo -p /vmfs/volumes/<ds>/<vm>/<file> -v <vcenter> -u <user>
  • List active VM processes: esxcli vm process list
  • Map file to VMX Cartel ID: lsof | egrep 'Cartel|<locked-file>'
  • Find VM task list: vim-cmd vmsvc/get.tasklist <vmid>
  • Inspect task details: vim-cmd vimsvc/task_info <task-id>
  • Inspect NFS lock file: hexdump -C .lck-####
  • Check vSAN lock file: vmfsfilelockinfo -p .<object-uuid>.lck
  • Restart host agent only: /etc/init.d/hostd restart
  • Restart vCenter agent only: /etc/init.d/vpxa restart
  • Last-resort VMX kill: esxcli vm process kill -t force -w <World-ID>
  • Check APD events: grep -i "esx.problem.storage.apd.start" /var/run/log/vobd.log
  • Check APD recovery events: grep -i "esx.clear.storage.apd.exit" /var/run/log/vobd.log

Conclusion

A VM file lock incident is not just a command-line problem. It is an ownership problem.

The safest recovery path is to slow down long enough to answer four questions:

  1. Which file is locked?
  2. Which host, process, VM, task, or proxy owns the lock?
  3. Is that ownership expected?
  4. Is the datastore healthy enough for remediation?

Once you know those answers, the fix is usually straightforward: stop the conflicting backup workflow, detach a hot-added disk, restart a specific management agent, move a stale NFS lock file, reboot an owning host, or, only as a last resort, terminate a VMX process.

The mistake is doing those steps in reverse.

In a VCF or enterprise vSphere environment, the better runbook is not “kill the process and retry.” It is “prove the owner, validate the datastore, remediate the least destructive layer, then prove the VM and application are healthy.”

That is how you recover the workload without making the corruption story worse.
