A few of the core services won't stay running, so localhost:14141 doesn't work either locally or remotely.
router=184.108.40.20644 goes into FATAL state and some of the others just exit and backoff.
Since the services are not coming up, I'm unable to evacuate any host or work with the grape hosts, services, etc.
$ sudo service grapevine status
grapevine is running
grapevine_capacity_manager RUNNING pid 4372, uptime 0:13:25
grapevine_capacity_manager_lxc_plugin RUNNING pid 9794, uptime 0:00:31
grapevine_cassandra RUNNING pid 3799, uptime 0:13:42
grapevine_client BACKOFF Exited too quickly (process log may have details)
grapevine_coordinator_service RUNNING pid 3808, uptime 0:13:42
grapevine_dlx_service BACKOFF Exited too quickly (process log may have details)
grapevine_log_collector RUNNING pid 3811, uptime 0:13:42
grapevine_root RUNNING pid 5869, uptime 0:08:09
grapevine_ui RUNNING pid 3797, uptime 0:13:42
reverse-proxy=220.127.116.1144 RUNNING pid 3802, uptime 0:13:42
router=18.104.22.16844 FATAL Exited too quickly (process log may have details)
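With this many services it can help to filter the status output down to the ones that are not RUNNING. Here is a minimal Python sketch that does this. It assumes each line has the shape "<name> <STATE> <details>" as in the output above, and the list of state names is my assumption based on the supervisor-style states shown there. The function name and the SAMPLE text are only for illustration.

```python
import re

# Assumed line shape: "<name> <STATE> <details>", as in the status output above.
# The set of state names is an assumption (supervisor-style process states).
STATUS_LINE = re.compile(
    r"^(\S+)\s+(RUNNING|BACKOFF|FATAL|STARTING|STOPPING|STOPPED|EXITED|UNKNOWN)\b\s*(.*)$"
)


def unhealthy_services(status_text):
    """Return (name, state, detail) for every service that is not RUNNING."""
    problems = []
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        # Lines such as "grapevine is running" do not match and are skipped.
        if match and match.group(2) != "RUNNING":
            problems.append(match.groups())
    return problems


# Shortened copy of the output above, used only as sample input.
SAMPLE = """\
grapevine is running
grapevine_cassandra RUNNING pid 3799, uptime 0:13:42
grapevine_client BACKOFF Exited too quickly (process log may have details)
grapevine_dlx_service BACKOFF Exited too quickly (process log may have details)
grapevine_root RUNNING pid 5869, uptime 0:08:09
"""

if __name__ == "__main__":
    for name, state, detail in unhealthy_services(SAMPLE):
        print(f"{name}: {state} ({detail})")
```

In practice the text would come from the output of the status command rather than from a hard-coded sample.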
we have changed port 14141. It should redirect to
A common startup issue is NTP. Can you verify that the NTP server is reachable?
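If you want to check NTP reachability from a script rather than with the usual tools, a rough Python sketch is below. It sends a basic SNTP client request over UDP port 123 and reads the transmit timestamp from the reply. The function names are made up for this example, and the server name in the usage note is a placeholder, not something from this thread.

```python
import socket
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request():
    """Build a 48-byte SNTP client request (LI=0, version=3, mode=3)."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet):
    """Return the server transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP response")
    # The transmit timestamp occupies bytes 40-47: seconds, then fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_ntp(server, timeout=3.0):
    """Return the server time as Unix seconds, or None if there was no answer."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (server, 123))
            packet, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts, DNS failures and unreachable hosts.
            return None
    return parse_transmit_time(packet)
```

Calling query_ntp("ntp.example.com") with your own server name in place of the placeholder returns None when the server did not answer within the timeout.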
Sure, the reverse-proxy/router understands that redirect, but when I ssh to the box, it's trying to connect to localhost:14141 to run service commands. All 'grape' commands return localhost:14141 unavailable.
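To separate "nothing is listening on the port" from a problem in the CLI itself, a plain TCP connect test from the box can help. This is a small Python sketch using only the standard library. The function name is made up, and it only shows whether the port accepts connections, not whether the service behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False


if __name__ == "__main__":
    state = "accepting connections" if port_is_open("localhost", 14141) else "not reachable"
    print(f"localhost:14141 is {state}")
```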
ntp is good
The core services can't connect to the message broker, which is RabbitMQ, right?
When you are redirected to controller Development, are you able to log in, or do you still get a login error? (Assuming you are not already logged in through the cluster and redirected, but are making a fresh login attempt by typing the direct link to controller Development.)
A couple of grapevine core services in your previous output appear to be in BACKOFF state. Did they recover at all?
grape instance status might not be working. Is it the same with the "grape application status" command?
Worst case, I would suggest resetting grapevine so that ALL the services come up and run in their own sweet time; but of course, that's the last resort.
Fixed without running a reset, but here is what TAC recommended after escalation:
Please perform the following steps to bring the cluster back to a clean/running state:
1. Ensure both VMs are powered "on"
2. SSH into one of the VMs and run the following command:
3. A series of prompts will be presented to the user, asking whether they want to delete specific data/configuration. Since the customer wants to save their cluster data, answer "no" to each prompt.
After answering all the prompts, the command will proceed to reset the cluster back to a clean/running state with their data.
Depending on the speed of their hardware, this operation will take around 30-60 minutes to complete.
Here are the recommended specs for UCS hardware for the cluster to deploy and run smoothly. Is your UCS hardware compliant with the following specs?
Requirement: Specification
Server Image Format: ISO
VMware ESXi Version: 5.1/5.5/6.0
Virtual CPU (vCPU): Minimum Required: 6, Recommended: 12
CPU (speed): 2.4 GHz
Memory: 64 GB [For a multi-host deployment (2 or 3 hosts) only 32 GB of RAM is required for each host.]
Disk Capacity: 500 GB
Disk I/O Speed: 200 MBps
Network Adapter: 1
Web Access Required: Browser. Google Chrome version 50.0 or later, and Firefox version 46.0 or later
Network Timing: To avoid conflicting time settings, it is recommended to disable time synchronization between the guest VM running Cisco APIC-EM and the ESXi host. Please use NTP instead.
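If you have several hosts to compare against these numbers, a small Python sketch like the one below can flag what falls short. The minimums are copied from the requirements above. The dictionary keys, the function name, and the example host values are made up for illustration.

```python
# Minimums copied from the requirements above; the key names are made up.
MINIMUMS = {
    "vcpus": 6,
    "cpu_ghz": 2.4,
    "memory_gb": 64,
    "disk_gb": 500,
    "disk_io_mbps": 200,
    "network_adapters": 1,
}


def shortfalls(host, multi_host=False):
    """Return {key: (actual, required)} for every value below the minimum."""
    required = dict(MINIMUMS)
    if multi_host:
        # Per the requirements, 32 GB per host is enough with 2 or 3 hosts.
        required["memory_gb"] = 32
    return {
        key: (host.get(key, 0), need)
        for key, need in required.items()
        if host.get(key, 0) < need
    }


if __name__ == "__main__":
    # Hypothetical host values, not taken from this thread.
    example_host = {"vcpus": 8, "cpu_ghz": 2.4, "memory_gb": 32,
                    "disk_gb": 500, "disk_io_mbps": 200, "network_adapters": 1}
    print(shortfalls(example_host))
    print(shortfalls(example_host, multi_host=True))
```

A missing key is treated as 0, so a value you forgot to fill in shows up as a shortfall instead of being passed silently.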
What worked was a process of shutdown, boot, evacuate, shutdown, boot, enable for each node. The evacuate/enable seems to be the trick, done during the brief period when the core services are working.
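For anyone repeating this on more than one node, the order matters, so here is the sequence above written out as a checklist generator in Python. It only prints the steps and does not run any commands. The node names are placeholders and the function name is made up.

```python
# Step order as described above, applied to one node at a time.
STEPS = ("shutdown", "boot", "evacuate", "shutdown", "boot", "enable")


def recovery_plan(nodes):
    """Return (node, step) pairs: every step for a node before the next node."""
    return [(node, step) for node in nodes for step in STEPS]


if __name__ == "__main__":
    # Placeholder node names.
    for number, (node, step) in enumerate(recovery_plan(["node-1", "node-2"]), start=1):
        print(f"{number:2d}. {node}: {step}")
```

The evacuate and enable steps still have to be done by hand in the short window while the core services are up.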