All credit for this post belongs to one of my colleagues called Imran Mughal, a very talented vSphere engineer!
We recently came across an issue during some site migration work on our PSCs. What drove the migration was that all of our vCenter Servers were authenticating users via a single load-balanced pair of PSCs, regardless of physical location, instead of via the PSCs in the datacentre local to each vCenter. This was because we had used a single site name covering 4 separate datacentre locations, so all of our PSCs, wherever they physically were, formed part of that one site.
As we already had live service running on our vCenters, we decided to redirect the vCenters to other PSCs and rebuild the PSCs in pairs (there is currently no way to manually change the site name or create new sites over an existing deployment).
This was done using a combination of these two KB articles:
The initial repoint of the vCenters was successful and no issues were seen, as we didn’t need to move any service registrations. Only after we had re-installed pairs of PSCs in their new sites and done the second vCenter repoint (which was also a site repoint) did we start to see issues. This repoint was completed using the updated cmsso-util from VMware (listed in KB 2131191). This version of cmsso-util has an extra command line option, “move-services”, which doesn’t exist out of the box. The services in question are service registrations from the vCenter install, a bit like what you see in the extension manager of the managed object browser when you connect to the VC (https://vc-name/mob/).
When you connect to the PSC via an LDAP browser like JXplorer (unsupported by VMware!!!), you can see a similar set of registrations under the site->service registrations. William Lam’s blog post on JXplorer and connecting to the PSC was very useful.
So during a vCenter site migration using the cmsso-util tool we found that some registrations failed to migrate and, rather than being left behind on our old site, they had actually disappeared.
The vCenters were repointed and services migrated using these two commands.
"%VMWARE_PYTHON_BIN%" cmsso-util repoint --repoint-psc FQDN_of_PSC_New_Site "%VMWARE_PYTHON_BIN%" cmsso-util move-services
It was the second command which failed, leaving us with the error “unable to move services across sites”.
At first we didn’t see any problems on the vCenter that was moved, but then we noticed a couple of issues with vMotion and storage policies (PBM), as below:
From this we found that the Profile-Driven Storage service was running but in an unknown state:
This is where we raised a case with VMware GSS whilst we internally investigated.
VMware found the relevant logs:
2016-08-23T14:40:32.212+01:00 [WrapperSimpleAppMain] ERROR opId= com.vmware.vim.storage.common.kv.KvClientManager - Failed to lookup KV store from Component Manager.
java.lang.NullPointerException
    at com.vmware.vim.storage.common.util.ComponentManagerService.getServiceEndpointData(ComponentManagerService.java:455)
    at com.vmware.vim.storage.common.util.ComponentManagerService.lookupKvStore(ComponentManagerService.java:320)
This basically translates to “the KV Store endpoint is missing”.
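If you want to check other vCenters for the same symptom, a few lines of Python can scan a log for that message. This is only a minimal sketch: the search string is copied from the log line above, but the function name is made up for illustration and the log file location is not given here, so you would pass in whichever file GSS points you at.

```python
# Message text copied verbatim from the log excerpt above.
KV_ERROR = "Failed to lookup KV store from Component Manager"


def find_kv_errors(lines):
    """Return (line_number, text) pairs for every line containing the KV store error.

    `lines` can be any iterable of strings, for example an open log file.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if KV_ERROR in line
    ]
```

To use it, open the log with open(path, errors="replace") and pass the file object to find_kv_errors; an empty list means the message was not found in that file.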
After thorough searching of the PSC via Jxplorer and comparisons with other sites we found the missing service registration.
Site 4 (The site with the issue!)
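We did this comparison by eye in JXplorer, but the idea is just a set difference. Here is a rough sketch of it, resting on a loud assumption: that each registration ID looks like <uuid>_<suffix>, as the _kv ID used later in this post does. Only the _kv suffix is confirmed here; the function names and any other suffix are hypothetical, and how you get the list of IDs out of the PSC is left to you.

```python
def service_suffix(service_id):
    """Return the text after the last underscore ('kv' for '<uuid>_kv').

    Returns None when the ID has no underscore. The '<uuid>_<suffix>' shape
    is an assumption based on the single _kv ID shown in this post.
    """
    _head, separator, tail = service_id.rpartition("_")
    return tail if separator else None


def missing_suffixes(reference_ids, suspect_ids):
    """List suffixes present on a healthy site but absent from the suspect site."""
    reference = {service_suffix(i) for i in reference_ids} - {None}
    suspect = {service_suffix(i) for i in suspect_ids} - {None}
    return sorted(reference - suspect)
```

Feeding it the IDs from a working site and from the broken one gives a short list of registration types to go and look for in the LDAP tree.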
VMware agreed that this was the issue and worked on a fix with their Engineering team. We did some testing on a test vCenter stack and found we could recreate the entries manually by exporting branches of the LDAP tree, modifying them and re-importing them, but that was completely unsupported by VMware and had never been tested. Testing this method could have taken weeks, and we didn’t have that long!
Thankfully VMware came back with a solution from Engineering that used the tools that already come with vCenter 6.0. The only problem was that it depended on some temp files being left over in my %temp% directory from when I ran the move-services command on the VC.
Luckily we had them (they are always named cmsso_svcspec_…….). I’m not sure how they make up the last few characters of the file name!
These files can be used to manually register the service on the Platform Services Controller. (Snapshot everything beforehand… seriously!)
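Since the fix relied on those temp files still being around, it may be worth copying them somewhere safe straight after running move-services. A minimal sketch of that idea follows. The function name and the backup folder are my own invention, and the cmsso_svcspec_* pattern is taken from the file names mentioned in this post.

```python
import glob
import os
import shutil


def backup_spec_files(temp_dir, backup_dir, pattern="cmsso_svcspec_*"):
    """Copy every file in temp_dir matching pattern into backup_dir.

    Returns the sorted list of file names that were copied.
    """
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(temp_dir, pattern))):
        if os.path.isfile(path):
            shutil.copy2(path, backup_dir)  # copy2 also keeps timestamps
            copied.append(os.path.basename(path))
    return copied
```

On the VC you would call it with os.environ["TEMP"] as temp_dir and any folder you like as backup_dir.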
To manually register the service, you must log on to the OS of the Platform Services Controller, then copy the cmsso_svcspec file/files to the folder C:\Program Files\VMware\vCenter Server\VMware Identity Services\lstool\scripts
You can identify which file belongs to which service by searching for “serviceId” within the file (open it in Notepad). For example, in the file cmsso_svcspec_3v2mps we find the entry:
So to register this service we run the following command from the folder C:\Program Files\VMware\vCenter Server\VMware Identity Services\lstool\scripts (On the VC):
"%VMWARE_PYTHON_BIN%" lstool.py register --url https://localhost/lookupservice/sdk --user email@example.com --password "password" --spec cmsso_svcspec_3v2mps --id 88d941cd-ae85-46f2-aeaa-9c30d2897137_kv --no-check-cert
--id is the serviceId entry in the cmsso_svcspec file.
After this, we must restart the vCenter (or all vCenter services). A PSC restart is not required.
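If you have several spec files to get through, the lookup and the command line can be scripted. The sketch below only does what is described above: it lists the lines mentioning serviceId in each cmsso_svcspec file (it does not try to parse them, so the file format is not assumed), and it builds the lstool.py register argument list using only the flags from the command above. Both function names are hypothetical.

```python
import glob
import os


def find_service_id_lines(directory, pattern="cmsso_svcspec_*"):
    """Map each spec file name to the lines in it that mention serviceId."""
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, encoding="utf-8", errors="replace") as handle:
            results[os.path.basename(path)] = [
                line.strip() for line in handle if "serviceId" in line
            ]
    return results


def build_register_command(python_bin, spec_file, service_id, user, password,
                           url="https://localhost/lookupservice/sdk"):
    """Build the lstool.py register command as an argument list.

    Only flags that appear in the command shown in this post are used.
    """
    return [
        python_bin, "lstool.py", "register",
        "--url", url,
        "--user", user,
        "--password", password,
        "--spec", spec_file,
        "--id", service_id,
        "--no-check-cert",
    ]
```

The resulting list can be handed to subprocess.run with cwd set to the lstool\scripts folder and python_bin taken from os.environ["VMWARE_PYTHON_BIN"]. Check the ID by eye before running anything, and take those snapshots first.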
In the LDAP directory on the PSC (viewed via JXplorer) you should now see the missing registration.
Once the VC has been restarted we can check the health via the GUI again:
vMotion was working again and we could check the storage policies as normal!