Hyper-V 2012 R2 2-Node Cluster Completely Fails When One Node Is Shut Down


We have a 2-node 2012 R2 cluster running on a Dell VRTX chassis with M620 blades. Cluster storage is 10 TB of shared chassis storage plus a 21 TB iSCSI Synology NAS. Cluster, live migration, management, and iSCSI traffic are on separate subnets. We have 30 VMs on the cluster. The quorum disk and the chassis storage CSVs are owned by node #1; the iSCSI CSV is owned by node #2.
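For reference, the ownership described above can be confirmed from PowerShell on either node. A minimal sketch (the FailoverClusters module is installed with the Failover Clustering feature):

    Import-Module FailoverClusters

    # Which node currently owns each Cluster Shared Volume
    Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State

    # The quorum configuration and the witness disk resource
    Get-ClusterQuorum | Format-List Cluster, QuorumResource, QuorumType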

The cluster is functional and live migration works fine as long as both nodes are running.

Here's the problem we have discovered: we need to shut down node #1 for Dell-recommended troubleshooting (one port on a dual-port Intel 10 Gig PCI card is receiving but not sending packets, but that's another story).

  • When I tried to drain the roles from node #1, I got the error "Move of cluster role 'Cluster Group' for drain could not be completed. The operation failed with error code 0x138d."
  • I attempted to move the CSV disks from node #1 to node #2, and that fails with the error "Clustered storage is not connected to the node." This seems like a clue to the problem, but I'm not sure why I'm getting the error.
  • So I go ahead and manually live migrate the roles to node #2 without a problem (the PowerShell equivalents of these steps are sketched after this list).
  • I shut down node #1.
  • As node #1 shuts down, the quorum disk and 2 other CSV disks (which happen to be owned by node #1) go offline. This shouldn't happen!
  • Since the cluster can't talk to the quorum disk, the whole cluster goes down and, since 2 out of 3 CSVs are not available to node #2, many of the VMs go down.
  • When node #1 comes back up, I'm able to reconnect to the cluster, and the quorum disk comes online, but the CSV disks are still offline.
  • In Failover Cluster Manager, I have to "Resume" node #1 with the "Fail Roles Back" option (even though it had no roles).
  • Then I am able to bring the CSV disks online and the cluster is back to "normal."
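For reference, the same drain / move / live-migrate steps can also be attempted from PowerShell, which sometimes gives more detail than Failover Cluster Manager. This is only a rough sketch; the node, disk, and VM names are placeholders for whatever the cluster actually uses:

    Import-Module FailoverClusters

    # Pause node #1 and drain its roles (the step that failed with 0x138d in the GUI)
    Suspend-ClusterNode -Name "Node1" -Drain

    # Try to move a CSV currently owned by node #1 over to node #2 before the shutdown
    Move-ClusterSharedVolume -Name "Cluster Disk 2" -Node "Node2"

    # Live migrate one of the clustered VM roles to node #2
    Move-ClusterVirtualMachineRole -Name "SomeVM" -Node "Node2" -MigrationType Live

    # Move the core cluster group (witness disk, cluster name and IP) to node #2
    Move-ClusterGroup -Name "Cluster Group" -Node "Node2"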

So it seems node #2 has problems talking to the quorum disk and 2 out of 3 CSV disks when node #1 is missing. Definitely not redundant!

When we built the cluster a year and a half ago, it validated and was working flawlessly. The problems seemed to begin after a long-lasting blue screen issue on node #1 that was traced to a bad fan on one of the 10 Gig NICs. I suspect a networking issue, but when I run cluster validation, the only issue that pops up is a connection issue with our iSCSI drive (because we have a bad port on one of our NICs, which we are working on with Dell now). The iSCSI CSV is owned by node #2 and doesn't go offline when node #1 is rebooted.
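In case it's useful, validation can also be re-run from PowerShell and limited to the storage and network tests so the report focuses on the suspect areas. A sketch, assuming the default 2012 R2 test category names:

    # Re-run only the storage and network validation tests against both nodes
    Test-Cluster -Node "Node1","Node2" -Include "Storage","Network"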

Can anyone offer some insight?

Thanks!


George Moore

The network test in cluster validation is not much of a test; it is a simple ping between the nodes. As has been mentioned, there is something going on with storage, and you are in redirected mode on the CSVs. Run the PowerShell command Get-ClusterSharedVolumeState on one of the nodes and it will tell you each CSV drive's connectivity from each node.
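A minimal example of running it (from an elevated PowerShell prompt on either node):

    Import-Module FailoverClusters

    # One row per CSV per node, showing how each node is reaching each volume
    Get-ClusterSharedVolumeState |
        Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason -AutoSize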

There are two things you need to look at. The first is StateInfo. If it says Direct, that machine has direct access and is good. However, if it says FileSystemRedirected or BlockRedirected on a node, that node has no direct connectivity to the volume. For the reason, look at the BlockRedirectedIOReason parameter. If it says NoDiskConnectivity, the node is not seeing the disks at all.
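Building on that, a quick filter pulls out only the problem rows, i.e. any node that is not accessing a CSV directly:

    # List only CSV/node combinations that are redirected, and show why
    Get-ClusterSharedVolumeState |
        Where-Object { $_.StateInfo -ne 'Direct' } |
        Select-Object Name, Node, StateInfo, BlockRedirectedIOReason

A BlockRedirectedIOReason of NoDiskConnectivity on node #2 for the chassis-storage CSVs would line up with the behavior described above.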

You can also go to the System event log and see if you are having errors relating to storage or iSCSI. From what we are seeing here, you may need to contact your storage vendor for assistance.
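For the event log check, something like the following pulls recent storage- and iSCSI-related errors and warnings from the System log. The provider names here (iScsiPrt for the Microsoft iSCSI initiator, disk, and the failover clustering provider) are just typical ones to start with, not a definitive list for this system:

    # Recent errors/warnings from common storage-related providers in the System log
    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'iScsiPrt', 'disk', 'Microsoft-Windows-FailoverClustering'
        Level        = 2, 3                      # 2 = Error, 3 = Warning
        StartTime    = (Get-Date).AddDays(-7)
    } -ErrorAction SilentlyContinue |
        Format-Table TimeCreated, ProviderName, Id, Message -AutoSize -Wrap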


Thanks, John Marlin, Microsoft Server Beta Team





