A few months ago we spun up a fresh install of vRA 8.0 alongside our 7.6 environment. Since then the service has been lightly used as we explored and researched our migration path (coming in a future release).
When I tried to log in to the web interface this week, I found it completely unresponsive. I did the basic user troubleshooting (ping, etc.) and the server was up. The next step was to SSH into the appliance and start poking around. Being a fairly new install, and vastly different from 7.6, this was the first time really popping the hood.
The main command for interacting with vRA is vracli, the same utility used later in this post to generate the log bundle. Using this, I ran a status check.
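The exact invocation didn't survive in this copy of the post; based on the "status command" referenced below and the vracli utility used throughout, it was presumably something like:

```shell
# Check the health of the vRA services (assumed invocation)
vracli status
```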
This was quite the string of text, but in there we received: [ERROR] Exception while getting the DB nodes. Since this appliance is not part of a cluster, all services reside on this host. My first thought was to reboot it. I hoped we might get lucky and let this thing self-heal, but unfortunately the issue persisted through reboots.
Next up was to find an admin guide and see what else I could do to get the status of services when the status command itself is not working. I was able to find the Administering vRealize Automation 8.0 document, which notes that to properly shut down, reboot, and start the vRA appliance, you need to run a specific set of commands. I certainly hadn't done that before, so I tried it.
Before you shut down or reboot, you should run the following commands:
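The command listing was lost here; per the Administering vRealize Automation 8.0 guide, the graceful shutdown sequence is roughly the following (verify against the documentation for your build, as the script paths may differ):

```shell
# Gracefully stop all vRA services before shutdown/reboot
/opt/scripts/svc-stop.sh
sleep 120
# Clean up the deployment (this is the deploy.sh referenced below)
/opt/scripts/deploy.sh --onlyClean
```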
Once the reboot has completed, run the following commands to start vRA services:
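Again, the original listing is missing; per the same admin guide, startup is roughly:

```shell
# Redeploy and start all vRA services
/opt/scripts/deploy.sh
```

You can then watch the pods come up with the kubectl command below.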
kubectl get pods --all-namespaces
What I noticed after running the shutdown command is that nothing would happen, other than a lot of waiting. The deploy.sh command would time out at 5 minutes, as specified by a 300-second wait flag in the script, and then terminate. I rebooted and ran the startup commands. Again, deploy.sh timed out after 5 minutes. I ran the kubectl command and found a few interesting nuggets: almost all pods were in a Pending state, and a few were in an Evicted state.
To back up just a bit, it should be noted that vRA runs its services in containers and leverages Kubernetes under the hood. The vRA services run in the prelude namespace, so when we check the status of pods, we're really getting the state of the various vRA services. Seeing a few pods in the Evicted state is a tell. There are a few reasons why a pod can get evicted; most commonly it has exceeded a threshold for resources, or it has been manually evicted. I won't pretend to be an expert on Kubernetes, but resources could be a valid theory.
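To see just the vRA pods and their states, something like this works (the grep assumes kubectl's standard tabular output):

```shell
# List vRA service pods and their current state
kubectl get pods -n prelude
# Filter down to evicted pods only
kubectl get pods -n prelude | grep Evicted
```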
At this point I opened a ticket with VMware. I created a log bundle using the vracli log-bundle command and fired it off. Support noted that the log bundle was incomplete, but found a note in the environment log file. You can find this log by untarring the archive and opening the environment file. There are several references in the log that state:
Message: Pod The node was low on resource: [DiskPressure]
Alright, we are on the right trail: the pod was low on resources, and for that reason it was evicted. Let's take a look at the disk usage on the appliance using df -h.
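For illustration, the output in this environment looked roughly like the following (only the root filesystem line is shown, with sizes reconstructed from the figures below; yours will differ):

```shell
df -h
# Filesystem   Size  Used Avail Use% Mounted on
# /dev/sda4     11G   10G  768M  93% /
```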
Right away we see the root disk is 11 GB in total and 93% used. It seems like this could be our issue: if all of our containers are trying to reserve or ensure a specific amount of free space, we are going to run into contention. VMware is aware of this issue and has defined a process to expand the disks before applying the 8.0.1 hotfix.
Let's take a look at expanding the disks.
- Expanding the root partition
- Log into vCenter and find your vRA 8 appliance.
- Make sure there are no active snapshots; otherwise you will not be able to extend the disks.
Note: Confirm the snapshots are not needed before deleting!
- Identify the disk(s) that are associated with the full partition(s).
/ (root) is associated with disk 1 (vRealize Automation VM Disk 1 (/dev/sda4))
/data is associated with disk 2. (vRealize Automation VM Disk 2 (/dev/mapper/data_vg-data))
This can be confirmed by running df -h from the vRA command line and comparing the sizes to the disks shown in vCenter.
VMware recommends between 20 and 48 GB free on these partitions. Expand the disks appropriately.
- Once disk space has been added in vCenter, SSH into the appliance and run the following command:
vracli disk-mgr resize
This will automatically resize the disks. Once you see the new size reflected in the df -h output, you should be good to go.
- Proceed with the service shutdown/startup procedure mentioned above.
- Confirm services have been restored.
- If services have been restored and there are duplicate pods in an Evicted state, you can delete them using the following command:
kubectl delete pod -n prelude <pod name>
- Example: kubectl delete pod -n prelude symphony-logging
Note: You can also delete the evicted pods before expanding the disks.
Reference: VMware 8.0.1 Release Notes
Post reboot, we can run the status check again.
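The specific command was lost from this copy of the post; since the output described below includes a DiskPressure condition, a reasonable equivalent is to run the status check alongside a look at the node conditions:

```shell
# The status check that previously failed (assumed invocation)
vracli status
# DiskPressure is a Kubernetes node condition; it should now report False
kubectl describe nodes | grep DiskPressure
```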
We don't have anything to compare it to since this command failed before, but at least it's running now. It also shows the DiskPressure status as False. We can also take a look at the pod status to make sure all pods are now healthy.
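This is the same pod check from earlier; everything should now show Running or Completed:

```shell
kubectl get pods --all-namespaces
```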
And that looks good as well. All that's left is to confirm we can reach the vRA web interface, and we can!
I'd like to thank you for following along. I hope this article helps others resolve this issue. I'll continue to post issues and solutions as I stumble across them. Feel free to comment if I missed anything.