We had a scheduled maintenance window this afternoon that offlined both network switches connected to the management interfaces of our UCS 6248 fabric interconnects (version 2.2(1c)). There was no impact to the production uplinks/portchannels. When connectivity was restored a few minutes later, we logged into the UCS fabric interconnects, and observed that they were in "switchover in progress" mode:
amsp01fi01-A(local-mgmt)# show cluster extended-state
Cluster Id: 0x7c837140c47711e3-0xb265002a6a99d481
Start time: Mon Apr 28 19:45:32 2014
Last election time: Wed May 28 17:17:37 2014
A: UP, PRIMARY, (Management services: SWITCHOVER IN PROGRESS)
B: UP, SUBORDINATE, (Management services: SWITCHOVER IN PROGRESS)
A: memb state UP, lead state PRIMARY, mgmt services state: INVALID
B: memb state UP, lead state SUBORDINATE, mgmt services state: INVALID
heartbeat state PRIMARY_OK
INTERNAL NETWORK INTERFACES:
HA NOT READY
Management services: switchover in progress on local Fabric Interconnect
Detailed state of the device selected for HA storage:
Chassis 2, serial: FOX1749GX66, state: active
Chassis 10, serial: FOX1751GX3C, state: active
Chassis 12, serial: FOX1750RGLJ, state: active
Unfortunately, they've been in this mode for more than six hours now, with no apparent change in state. I've tried to force the issue via the "cluster lead a" and "cluster primary a" commands, but it appears that the cluster command is disabled:
amsp01fi01-A# connect local-mgmt
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2009, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
amsp01fi01-A(local-mgmt)# cluster lead a
% Invalid Command at '^' marker
Fortunately, production services are not impacted; however, we cannot log in to UCSM via the web GUI, nor can we make any configuration changes. I'm hoping someone can suggest a minimally disruptive resolution.
Hi. Unfortunately, you seem to be hitting defect CSCuh92027 (FI Switchover stuck after pulling out management cable from primary FI). The good news is that this issue is already resolved in the latest release, 2.2(2c).
As for a workaround: according to the Release Note Enclosure, restarting the Data Management Engine (DME) on the primary Fabric Interconnect will correct the issue. That being said, this is not something to take lightly. The DME is the "brains" of UCSM. I would not recommend restarting it in the middle of the day. We don't expect anything to happen, but I would rather be safe than sorry.
My suggestion is for you to open a case with TAC and do that with them. Ideally you'll have a backup of your UCS in case something goes badly wrong (not that I think it will, I'm just being extremely cautious).
We had the same issue and we opened a TAC case. All we had to do was stop the service and restart it on both UCS interconnects. Here is how we did it.
Just encountered this issue - pmon stop/start fixed it and saved me from opening a TAC case
Hi George - Did the pmon stop command cause an outage or disruption?
This does not appear to disrupt production services, and takes only a few seconds to stop and start. In system scope.
It will only interrupt the management plane and not the data plane.
I ran the pmon commands and they did not cause an outage.
Just run them first on the subordinate FI, then on the primary.
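For anyone reading later, the sequence described in the replies above can be sketched roughly as follows. This is only an illustration, not a verified procedure: it assumes A is the primary and B the subordinate (as in the original post), the amsp01fi01-B prompt is assumed from the original poster's naming, and the lines starting with "!" are annotations rather than commands to type. Treat it as something to walk through with TAC, not to paste blindly.

```text
! Subordinate FI first (B in the original post)
amsp01fi01-A# connect local-mgmt b
amsp01fi01-B(local-mgmt)# pmon stop
amsp01fi01-B(local-mgmt)# pmon start
amsp01fi01-B(local-mgmt)# show pmon state
amsp01fi01-B(local-mgmt)# exit

! Then the primary FI (A in the original post)
amsp01fi01-A# connect local-mgmt a
amsp01fi01-A(local-mgmt)# pmon stop
amsp01fi01-A(local-mgmt)# pmon start
amsp01fi01-A(local-mgmt)# show pmon state

! Finally, re-check the cluster with the same command the original poster used
amsp01fi01-A(local-mgmt)# show cluster extended-state
```

Checking show pmon state after each restart, before moving on to the other FI, gives you a chance to stop and call TAC if a process has not come back.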
TAC Server-Virtualization lead here. I really want to add some extra info on this topic.
While performing a pmon stop/start on the FI is at times needed as a workaround to a defect, it should only be done under direction from TAC. The expected behavior is that this should only impact the management plane and not the data plane. The reality is that if you have to run these commands, you're already in an unexpected and most likely unknown situation. It is very possible that whatever caused this situation to begin with could be made worse by running these commands. It is also possible that one of the many processes does not respawn properly, and you could then be facing a full outage. This is why TAC needs to first investigate what's causing this, so we can avoid any potential outages.
Ultimately, do not run pmon stop/start unless TAC specifically tells you to, or that is the specific workaround to a known defect. If you think you need to run this command, call TAC first. Let us figure out the root cause of the process issue so we can fix it rather than just work around it.