Got 2 devices stuck in unknown state. Tried to re-discover them, but there was no change in their status. Tried to delete them and got the following error message:
Restarted the APIC-EM server, but still getting the same error messages. The discovery doesn't show any errors. It's a single server setup.
Take the instance logs of apic-em-inventory-manager-service and share the apic-em-inventory-manager-service.log* logs (assuming it is a 2.0.X build).
If it is an earlier build (1.4.X), please share apic-em-inventory-manager-service.log, apic-inventory.log, xde.log, existenceInventory.log and inventory.log.
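As a quick reference, the mapping from build to log files described above could be sketched as a small shell helper. The function name logs_for_build is hypothetical and only for illustration; it is not part of APIC-EM, and it only recognises the two build families mentioned here.

```shell
# Sketch only: logs_for_build is a made-up helper, not an APIC-EM command.
# It prints the log file names requested above for a given build string.
logs_for_build() {
    case "$1" in
        2.0.*)
            # 2.0.X builds: a single log family (the pattern is printed literally)
            echo "apic-em-inventory-manager-service.log*"
            ;;
        1.4.*)
            # 1.4.X builds: five separate log files
            printf '%s\n' \
                apic-em-inventory-manager-service.log \
                apic-inventory.log \
                xde.log \
                existenceInventory.log \
                inventory.log
            ;;
        *)
            echo "unknown build: $1" >&2
            return 1
            ;;
    esac
}
```

For example, logs_for_build 1.4.2 prints the five file names, one per line.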
Follow the steps mentioned below for taking the logs:
Log in as grapevine.
execute: cd bin
execute: grape instance status
This will display all of the services. Look for apic-em-inventory-manager-service and note its IP address.
execute: ssh <ip of inventory service>
execute: tar czvf logs4tac.tgz /var/log/grapevine/services/apic-em-inventory-manager-service/188.8.131.529/
NOTE: the version at the end of the path differs depending on what you are running
execute: scp <ip of inventory service>:logs4tac.tgz .
Now the file can be retrieved with FileZilla or a similar client:
sftp://<ip of inventory service>:logs4tac.tgz.
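The tar step above needs the version folder typed by hand, and that folder differs per install. A minimal sketch that archives the whole service log directory instead, so the version does not have to be known in advance, might look like this. The function name collect_service_logs and the LOG_ROOT variable are assumptions for illustration; the default path is the one used in the steps above.

```shell
# Sketch only: collect_service_logs and LOG_ROOT are illustrative names.
# Run on the VM that hosts the service, as in the steps above.
LOG_ROOT="${LOG_ROOT:-/var/log/grapevine/services}"

collect_service_logs() {
    service="$1"                  # e.g. apic-em-inventory-manager-service
    out="${2:-logs4tac.tgz}"      # archive name used in the steps above
    dir="$LOG_ROOT/$service"

    if [ ! -d "$dir" ]; then
        echo "no log directory at $dir" >&2
        return 1
    fi

    # Archive the whole service directory, including every version
    # subfolder, so the version does not have to be typed by hand.
    tar czf "$out" "$dir"
}
```

Usage would be collect_service_logs apic-em-inventory-manager-service, after which logs4tac.tgz can be copied off with scp as described. tar may print a note about removing the leading "/" from member names; that is expected.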
Took a backup of APIC-EM and a snapshot in VMware, deleted all of the devices, cleared the logs, and then performed discovery, resync and delete operations on the remaining devices that were stuck.
There are some exceptions seen with respect to RawCliInfo table entries, and there is a defect open for this issue of devices being stuck in the 'Unknown' state.
For the delete device issue, there are no relevant logs available. Enabling DEBUG logs and sharing the same set of logs again would help to debug the issue with delete.
Command to enable DEBUG logs:
sudo /opt/CSCOlumos/bin/setLogLevel.sh inventory DEBUG
sudo /opt/CSCOlumos/bin/setLogLevel.sh discovery DEBUG
This has to be run on the VM where apic-em-inventory-manager-service is running.
Looking at the logs, the root cause is the same for both issues: the device going to the 'Unknown' state and the device failing to delete. There is an open defect for this issue.
Work on the bug related to these issues is still in progress.
The workaround is to clean up the corrupted entries from the DB manually.
Ended up doing a reset grapevine to get the entries cleared, and that worked.
Got 50 sites and 300+ network devices, so resetting grapevine and configuring everything again is not a preferable option.