Update: A major “All Paths Dead (ADP)” storage related issue with vSphere 4 and how to workaround it

December 29, 2009

Great writeup on the ADP bug and the workarounds from the Virtual Geek blog…

http://virtualgeek.typepad.com/virtual_geek/2009/12/an-important-vsphere-4-storage-bug-and-workaround.html

Common behavior:

  1. They want to remove a LUN from a vSphere 4 cluster
  2. They move or Storage vMotion the VMs off the datastore who is being removed (otherwise, the VMs would hard crash if you just yank out the datastore)
  3. After removing the LUN, VMs on OTHER datastores would become unavailable (not crashing, but becoming periodically unavailable on the network)
  4. the ESX logs would show a series of errors starting with “NMP”

Examples of the error messages include:

    “NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa._______________” – failed to issue command due to Not found (APD)”“NMP: nmp_DeviceUpdatePathStates: Activated path “NULL” for NMP device “naa.__________________”.

This is affecting multiple storage vendors (suggesting an ESX-side issue).  You can see the VMTN thread on this here

Here’s what’s happening, and the workaround options:

When a LUN supporting a datastore becomes unavailable, the NMP stack in vSphere 4 attempts failover paths, and if no paths are available, an APD (All Paths Dead) state is assumed for that device (starts a different path state detection routine).   If after that you do a rescan, periodically VMs on that ESX host will lose network connectivity and become non-responsive.  

This is a bug, and a known bug.   

What was commonly happening in these cases was that the customer was changing LUN masking or zoning in the array or in the fabric, removing it from all the ESX hosts before removing the datastore and the LUN in the VI client.   It is notable that this could also be triggered by anything making the LUN inaccessible to the ESX host – intentional, outage, or accidental.

Workaround 1

This workaround falls under “operational excellence”.   The sequence of operations here is important – the issue only occurs if the LUN is removed while the datastore and disk device are expected by the ESX host.   The correct sequence for removing a LUN backing a datastore.

  1. In the vSphere client, vacate the VMs from the datastore being removed (migrate or Storage vMotion)
  2. In the vSphere client, remove the Datastore
  3. In the vSphere client, remove the storage device
  4. Only then, in your array management tool remove the LUN from the host.
  5. In the vSphere client, rescan the bus.

Workaround 2 (only available in ESX/ESXi 4 u1)

This workaround is available only in update 1, and changes what the vmkernel does when it detects this APD state for a storage device, basically just immediately failing to open a datastore volume if the device’s state is APD.  Since it’s an advanced parameter change – I wouldn’t make this change unless instructed by VMware support.

esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD

Again all credit for documenting this go to the Virtual Geek blog….  http://virtualgeek.typepad.com/virtual_geek/2009/12/an-important-vsphere-4-storage-bug-and-workaround.html

—————————————————————————————————–

Update 1:

Has VMware fixed the ADP issue?  Maybe, only time will tell.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016291


Follow

Get every new post delivered to your Inbox.