Event 18510 upon VM import, VMMS system call to update VHD ACL fails


we experiencing intermittent issue appears triggered race condition during vm import.  first, background. have 2 servers set in failover cluster sharing cluster shared volume (csv), importing vms.  each of these vms references unique vhd (xyz.vhd).  for example, may have 12 or 20 vms each reference own, unique vhd.  each of these unique vhds reference same, common parent (call a.vhd).  so there 12 or 20 diff disks point same parent.  each server assumes responsibility half of vms.

when triggering import of 12 or 20 vms simultaneously, observe following in our vmms error logs:

 

event id: 18250

level: error

user: system

opcode: info

 

import completed warnings. please check admin events in hyper-v-vmms event log more information.

-----------------------

event id: 18510

level: error

user: system

opcode: info

 

import task failed fix acl file 'c:\clusterstorage\volume1\base-images\winxp_01-1.vhd' error 'the data

area passed system call small.' (0x8007007a)

 

-----------------------

in above error messages, "c:\clusterstorage\volume1\base-images\winxp_01-1.vhd" common parent image shared vms being imported.

triggering condition while monitoring servers processmonitor (from sysinternals) yields following observation:

 

during 1 part of import, there pattern of calls single thread appears make under normal conditions:

 

1,t1: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> buffer_overflow

2,t1: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> success

3,t1: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> buffer_overflow

4,t1: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> success

5,t1: setsecurityfile (winxp_01-1.vhd, dacl) --> success

 

t1 short "thread 1".  the first "parameter" vhd being accessed, the second type of information being queried/set. "x," prefix indicates relative sequence (time).

the buffer_overflow return codes here happen because system call fetch security information didn't provide big enough buffer system dump security stuff into.  as result, call returns error code along with the *correct* size of buffer caller should provide.  this is typical of ms api calls expect caller provide buffer variable length output data.

this can break down when there multiple threads executing import operations at same time because, apparently, required buffer size can change between calls on same thread.  here's example of failure scenario:

 

1,t2: querysecurityfile (winxp_01-1.vhd, dacl) --> buffer_overflow

2,t2: querysecurityfile (winxp_01-1.vhd, dacl) --> success

3,t3: querysecurityfile (winxp_01-1.vhd, dacl) --> buffer_overflow

4,t3: querysecurityfile (winxp_01-1.vhd, dacl) --> success

5,t3: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> buffer_overflow

6,t3: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> success

7,t2: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> buffer_overflow

8,t3: setsecurityfile (winxp_01-1.vhd, dacl) --> success

9,t2: querysecurityfile (winxp_01-1.vhd, owner|group|dacl) --> buffer_overflow

^ triggers error see in event logs , causes failed import

 

notice in particular instance things humming along nicely until step 8.  setsecurityfile changes security meta-data on vhd.  the working hypothesis that, result, size required store the security information requested in step 7 *changes* unbeknownst thread 2. 

step 9 fails because provides buffer of size requested in step 7, which small accommodate changes made security meta-data step 8 (thread 3).

the event log reports other failures same error code, although this particular pair of operations not shed light on nature of those failures.  it suspected there other combinations of system calls that are caught in same race.

what follows stack trace failed call of step 9 above.

 

frame module location address path

0 fltmgr.sys fltacquirepushlockshared + 0x907 0xfffff880014ec067 c:\windows\system32\drivers\fltmgr.sys

1 fltmgr.sys fltiscallbackdatadirty + 0xa39 0xfffff880014ed329 c:\windows\system32\drivers\fltmgr.sys

2 fltmgr.sys fltmgr.sys + 0x16c7 0xfffff880014eb6c7 c:\windows\system32\drivers\fltmgr.sys

3 ntoskrnl.exe ntsetinformationfile + 0x1093 0xfffff80001bb17d3 c:\windows\system32\ntoskrnl.exe

4 ntoskrnl.exe ntquerysecurityobject + 0xe6 0xfffff80001b64736 c:\windows\system32\ntoskrnl.exe

5 ntoskrnl.exe kesynchronizeexecution + 0x3a43 0xfffff800018d0ed3 c:\windows\system32\ntoskrnl.exe

6 ntdll.dll zwquerysecurityobject + 0xa 0x777925ba c:\windows\system32\ntdll.dll

7 kernelbase.dll getkernelobjectsecurity + 0x14 0x7fefda76de4 c:\windows\system32\kernelbase.dll

8 advapi32.dll getkernelobjectsecurity + 0x13 0x7fefe9e1e53 c:\windows\system32\advapi32.dll

9 ntmarta.dll accrewritesetentriesinacl + 0x683 0x7fefc8722e3 c:\windows\system32\ntmarta.dll

10 ntmarta.dll accrewritesetnamedrights + 0xb01 0x7fefc874235 c:\windows\system32\ntmarta.dll

11 ntmarta.dll accrewritesetnamedrights + 0x14f 0x7fefc873883 c:\windows\system32\ntmarta.dll

12 advapi32.dll setnamedsecurityinfow + 0x83 0x7fefe9d8a23 c:\windows\system32\advapi32.dll

13 vmms.exe vmms.exe + 0x2e0f60 0x100140f60 c:\windows\system32\vmms.exe

14 vmms.exe vmms.exe + 0x2df862 0x10013f862 c:\windows\system32\vmms.exe

15 vmms.exe vmms.exe + 0x2df6f0 0x10013f6f0 c:\windows\system32\vmms.exe

16 vmms.exe vmms.exe + 0x2dbddf 0x10013bddf c:\windows\system32\vmms.exe

17 vmms.exe vmms.exe + 0x2cecfd 0x10012ecfd c:\windows\system32\vmms.exe

18 vmms.exe vmms.exe + 0x28abb7 0x1000eabb7 c:\windows\system32\vmms.exe

19 vmms.exe vmms.exe + 0x2b1463 0x100111463 c:\windows\system32\vmms.exe

20 ntdll.dll tppostwork + 0x154 0x77752484 c:\windows\system32\ntdll.dll

21 ntdll.dll rtlrealsuccessor + 0x136 0x77760c26 c:\windows\system32\ntdll.dll

22 kernel32.dll basethreadinitthunk + 0xd 0x7753652d c:\windows\system32\kernel32.dll

23 ntdll.dll rtluserthreadstart + 0x21 0x7776c521 c:\windows\system32\ntdll.dll

 

 

also, worth noting these errors seem happen on cluster node not *owner* of csv resource (and meta-data updates being sent on network, opposed directly disk).  the owner node, (more directly) updates meta-data volume underlying csv, has never exhibited error condition.  the cluster status appears normal , there no problems redirected i/o.

os: windows hyper v server 2008 r2; csv using iscsi w/ mpio.

i've been unable locate hotfix resolve behavior.

please assist.

-james

 

 


jdefelice



Windows Server  >  Hyper-V



Comments

Popular posts from this blog

CRL Revocation always failed

Failed to query the results of bpa xpath

0x300000d errors in Microsoft Remote Desktop client