2026-04-23 · 14 min read

Expired Certificates in VMware vSphere with Tanzu (TKG/VKS): Diagnosis and Remediation

Tags: vmware, tanzu, tkg, kubernetes, k8s, certificates, infrastructure, troubleshooting, high-availability, vsphere

1. Executive Summary

This note details the diagnosis, root cause, and remediation of expired SSL/TLS certificates in a VMware vSphere with Tanzu (VKS/TKG) infrastructure. The problem affects three independent certificate layers: the Supervisor (Spherelets), kapp-controller (packaging API), and Workload Clusters (TKC). Each layer has its own certificate lifecycle and requires a different renewal procedure.

2. Problem Description

What are these certificates?

In a VMware vSphere with Tanzu environment, multiple TLS certificate layers are automatically generated during deployment with a default validity of 1 year. Kubernetes uses these certificates for mutual authentication (mTLS) between all its components.

Why do they expire?

VMware/Broadcom follows the upstream Kubernetes (kubeadm) default, under which component certificates are issued with a 1-year validity. No single auto-renewal mechanism covers all layers: some certificates auto-renew (such as the Supervisor API server certificates), while others require manual intervention or specific tooling.

When does this occur?

The problem occurs when the infrastructure goes more than 12 months without version upgrades or maintenance interventions. Version upgrades automatically renew certificates, so environments that are kept up to date typically do not experience this issue.

Affected certificate layers

Layer	Components	Default Validity	Auto-renewal	Renewal Tool
Layer 1: Supervisor	Spherelets (ESXi agents), Control Plane VMs, etcd, API server	1 year	Partial (API server yes, Spherelets NO)	`certmgr certificates rotate`
Layer 2: Supervisor Services	kapp-controller, packaging-api (APIService), cert-manager	1 year	Only on pod restart	Restart kapp-controller deployment
Layer 3: Workload Clusters (TKC)	API server, etcd, kubelet, front-proxy, controller-manager, scheduler	1 year	NO	`certmgr tkc certificates rotate` or `kubeadm certs renew all`

3. Symptoms and How to Identify Them

3.1 Layer 1 — Supervisor Spherelets (ESXi Agents)

Symptoms visible in vCenter UI:

Workload Management → Supervisor: Kubernetes Status: Error (3)
Node Health: Unhealthy
Config Status may show Configuring or Running (doesn't always reflect the problem)

Symptoms in kubectl (from the Supervisor):

kubectl get nodes -o wide

Expected output when there is a problem:

NAME                            STATUS     ROLES                  AGE
42252dea...                     Ready      control-plane,master   536d    # CP VM 1 - OK
42259c72...                     Ready      control-plane,master   536d    # CP VM 2 - OK
4225e502...                     Ready      control-plane,master   536d    # CP VM 3 - OK
esxi-node-01.domain.local       NotReady   agent                  536d    # ESXi Agent - PROBLEM
esxi-node-02.domain.local       NotReady   agent                  536d    # ESXi Agent - PROBLEM

ESXi agent nodes appear as NotReady with the message:

Kubelet stopped posting node status.

And leases are not found:

Failed to get lease: leases.coordination.k8s.io "hostname" not found

Diagnostic command from vCenter SSH:

# Check certificate expiration date on the Spherelet
openssl s_client -connect <ESXi_IP>:10250 </dev/null 2>/dev/null | openssl x509 -noout -dates

If notAfter is a date in the past, the certificate is expired.
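The date comparison can also be scripted rather than read by eye. The following is a minimal sketch, assuming GNU date is available where it runs; `days_left` is a hypothetical helper name and is not part of certmgr or openssl.

```shell
# days_left: hypothetical helper (an assumption of this sketch, not a VMware tool).
# Takes the "notAfter=..." line printed by `openssl x509 -noout -dates` or
# `-enddate` and prints the whole days remaining. A negative value means the
# certificate has been expired for at least one full day.
# Assumes GNU date, which can parse the openssl date format.
days_left() {
  local not_after="${1#notAfter=}"   # strip the "notAfter=" prefix if present
  local now="${2:-$(date +%s)}"      # optional second argument: "now" in epoch seconds
  local end
  end=$(date -u -d "$not_after" +%s) || return 2
  echo $(( (end - now) / 86400 ))    # integer division truncates toward zero
}

# Usage, with the same placeholder as the command above:
#   line=$(openssl s_client -connect <ESXi_IP>:10250 </dev/null 2>/dev/null \
#            | openssl x509 -noout -enddate)
#   days_left "$line"
```

Because the division truncates toward zero, a certificate that expired less than 24 hours ago prints 0, so treat 0 as "check now" rather than "still valid".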

Important note: The TLS certificate responding on port 10250 may be a different self-signed certificate from the Spherelet's internal communication certificate. It's possible that after rotation with certmgr, port 10250 still shows the old certificate, but the Spherelet is already communicating correctly with the Supervisor. The correct validation is to verify that the node transitions to Ready and that lastHeartbeatTime is recent.

Connectivity diagnostic command from vCenter SSH:

# If it gives a certificate error, the Spherelet has an expired certificate
curl -k https://<ESXi_IP>:10250/healthz 2>&1

Problematic response:

OpenSSL: error:0A000412:SSL routines::sslv3 alert bad certificate

3.2 Layer 2 — kapp-controller / packaging-api

Symptoms in kubectl:

kubectl api-resources 2>&1 | grep error

Expected output when there is a problem:

error: unable to retrieve the complete list of server APIs: data.packaging.carvel.dev/v1alpha1: the server is currently unable to handle the request
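A complementary check is to list every APIService whose AVAILABLE column is not True, which also surfaces aggregated APIs other than the packaging one. This is a small sketch: `unavailable_apiservices` is a hypothetical helper that only filters the table printed by `kubectl get apiservice`, assuming the usual NAME / SERVICE / AVAILABLE / AGE column order.

```shell
# unavailable_apiservices: hypothetical filter (an assumption of this sketch).
# Reads the table printed by `kubectl get apiservice` on stdin and prints the
# NAME and AVAILABLE value of every row whose AVAILABLE column is not "True".
# Assumes the column order NAME, SERVICE, AVAILABLE, AGE.
unavailable_apiservices() {
  awk 'NR > 1 && $3 != "True" { print $1, $3 }'
}

# Usage (from the Supervisor):
#   kubectl get apiservice | unavailable_apiservices
```

An empty result means all APIServices report as available; a line for v1alpha1.data.packaging.carvel.dev points at Layer 2.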

Cascade effect:

tanzu-framework app in Reconcile failed state with thousands of consecutiveReconcileFailures
TKC clusters show Ready: False with reason ClusterBootstrapNotCompleted
TKG plugin in vCenter shows Error 502 Bad Gateway
WCP service logs in vCenter show continuous warnings against data.packaging.carvel.dev

Diagnostic command (verify APIService caBundle):

kubectl get apiservice v1alpha1.data.packaging.carvel.dev -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -noout -dates -subject

If notAfter is in the past, the kapp-controller certificate is expired.

Check pod status:

kubectl get pods -n vmware-system-appplatform-operator-system

The kapp-controller may show hundreds of accumulated RESTARTS.

Verify the specific tanzu-framework error:

kubectl get app tanzu-framework -n vmware-system-pkgs -o jsonpath='{.status.usefulErrorMessage}'
kubectl get app tanzu-framework -n vmware-system-pkgs -o json | jq '.status.consecutiveReconcileFailures'

3.3 Layer 3 — Workload Clusters (TKC)

Symptoms reported by users:

x509: certificate has expired or is not yet valid: current time 2026-04-17T09:12:30-06:00 is after 2025-10-27T07:47:49Z

Characteristic behavior: INTERMITTENT

A TKC cluster typically has 3 Control Plane nodes. If only one has expired certificates, behavior is intermittent — sometimes it works (when the load balancer routes to a healthy node) and sometimes it fails (when it routes to the node with expired certificates). This confuses users because it appears to be a random problem.

Effects:

kubectl get and kubectl describe fail intermittently with an x509 error
PersistentVolumeClaims remain in Pending state (the CSI provisioner cannot communicate with the API server)
cert-manager cannot renew application certificates because it depends on the API server
kubectl-vsphere login sessions expire quickly

Diagnostic command from vCenter SSH (requires Supervisor KUBECONFIG):

# Configure access to the Supervisor
/usr/lib/vmware-wcp/decryptK8Pwd.py
scp root@<SUPERVISOR_CP_IP>:/etc/kubernetes/admin.conf /root/supervisor-kubeconfig
export KUBECONFIG=/root/supervisor-kubeconfig
 
# List certificates for a specific TKC cluster
./certmgr tkc certificates list <CLUSTER_NAME> -n <NAMESPACE>

The output shows a table with each certificate, its expiration date, and whether it's expired (ISEXPIRED: true).

Direct verification of each cluster node's API server:

# Verify TLS certificate of each CP node's API server
openssl s_client -connect <NODE_IP>:6443 </dev/null 2>/dev/null | openssl x509 -noout -dates -subject
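Since the intermittent failures come from individual Control Plane nodes, it helps to run the check above against every node in one pass. A rough sketch, assuming GNU date and openssl on the machine running it; `probe_not_after` and `report_cp_nodes` are hypothetical helper names, and the IPs in the usage comment are placeholders.

```shell
# probe_not_after: prints the notAfter date of the certificate served on <ip>:6443.
# Same openssl pipeline as above, reduced to the date string.
probe_not_after() {
  openssl s_client -connect "$1:6443" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2
}

# report_cp_nodes: hypothetical helper (an assumption of this sketch).
# Prints one status line per Control Plane node IP passed as an argument.
report_cp_nodes() {
  local now ip end_str end
  now=$(date +%s)
  for ip in "$@"; do
    end_str=$(probe_not_after "$ip")
    if [ -z "$end_str" ]; then
      echo "$ip UNREACHABLE"
      continue
    fi
    # Assumes GNU date, which can parse the openssl date format
    end=$(date -u -d "$end_str" +%s 2>/dev/null) || { echo "$ip UNPARSEABLE ($end_str)"; continue; }
    if [ "$end" -gt "$now" ]; then
      echo "$ip OK ($end_str)"
    else
      echo "$ip EXPIRED ($end_str)"
    fi
  done
}

# Usage:
#   report_cp_nodes <NODE_IP_1> <NODE_IP_2> <NODE_IP_3>
```

A mix of OK and EXPIRED lines for the same cluster matches the intermittent pattern described above.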

4. Root Cause

Typical production scenario

This problem typically occurs in infrastructure that was deployed in a given year and has not received version upgrades since. Kubernetes certificates have a 1-year default validity, so they begin expiring 12 months after the initial deployment or last upgrade.

Certificate expiration timeline

Date	Expired Certificate	Effect
Month 12	ESXi Spherelets (Layer 1)	ESXi agents go NotReady, Supervisor shows Error
Month 12–13	Control Plane node of a workload cluster (Layer 3)	kubectl commands intermittently fail
Month 13–14	Kubelet of another CP node (Layer 3)	Worsens cluster intermittency
Month 15–16	kapp-controller / packaging-api (Layer 2)	Packaging API goes down, tanzu-framework fails, ClusterBootstrapNotCompleted

Why don't they auto-renew?

Spherelets: VMware does not implement auto-renewal for Spherelet certificates. Manual intervention with certmgr is required.
kapp-controller: The certificate is regenerated only on pod restart, but an expired certificate does not by itself cause the pod to restart.
TKC Clusters: kubeadm does not auto-renew control plane certificates. They are only renewed during cluster version upgrades.

5. Remediation Procedure

5.1 Prerequisites

certmgr tool:

Download from Broadcom KB 322994: "Replace vSphere with Tanzu Supervisor Certificates"
URL: https://knowledge.broadcom.com/external/article/322994
File: wcp_cert_manager.zip
Extract to /root/ on the vCenter

Required access:

SSH to vCenter as root
SSH to Supervisor Control Plane VM (for kubectl commands and access to TKC nodes)
Supervisor password: /usr/lib/vmware-wcp/decryptK8Pwd.py

Mandatory backup before any rotation:

mkdir -p /tmp/cert-backup-$(date +%Y%m%d)
./certmgr certificates backup --backup-dir /tmp/cert-backup-$(date +%Y%m%d)

Note: The backup directory must exist before running the command. If it doesn't exist, the tool fails with "no such file or directory" even though the Supervisor VM backup completes successfully.
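One way to make the directory requirement hard to forget is to create the path once and pass the same value to certmgr. A minimal sketch; `make_backup_dir` is a hypothetical helper, and the default base of /tmp simply mirrors the commands above.

```shell
# make_backup_dir: hypothetical helper (an assumption of this sketch).
# Creates a dated backup directory under the given base (default: /tmp),
# checks that it exists and is writable, and prints its path. Computing the
# path once also avoids evaluating `date` twice across midnight.
make_backup_dir() {
  local base="${1:-/tmp}"
  local dir
  dir="$base/cert-backup-$(date +%Y%m%d)"
  mkdir -p "$dir" && [ -d "$dir" ] && [ -w "$dir" ] || return 1
  echo "$dir"
}

# Usage (from /root on the vCenter, where certmgr was extracted):
#   backup_dir=$(make_backup_dir) \
#     && ./certmgr certificates backup --backup-dir "$backup_dir"
```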

5.2 Layer 1 Remediation — Spherelets

From vCenter SSH:

cd /root/
./certmgr certificates rotate --spherelet-only

What it does: Rotates the communication certificates between the Spherelets (Kubernetes agents on ESXi hosts) and the Supervisor. Does not touch Supervisor Control Plane certificates.

Duration: ~20 seconds for 2 hosts.

Verification:

# From the Supervisor
kubectl get nodes -o wide
# ESXi agents should transition to Ready
 
# Verify recent heartbeat
kubectl get node <ESXi_HOSTNAME> -o json | jq '.status.conditions[] | select(.type=="Ready")'
# lastHeartbeatTime should be recent (within the last few minutes)
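To turn "recent" into a number, the heartbeat timestamp can be converted to an age in seconds. A small sketch assuming GNU date; `heartbeat_age` is a hypothetical helper, and the 300-second threshold in the usage comment is an arbitrary illustration, not a documented limit.

```shell
# heartbeat_age: hypothetical helper (an assumption of this sketch).
# Prints the seconds elapsed since an RFC 3339 timestamp such as the
# lastHeartbeatTime of a node condition. Assumes GNU date.
heartbeat_age() {
  local ts="$1"
  local now="${2:-$(date +%s)}"   # optional second argument: "now" in epoch seconds
  local then_s
  then_s=$(date -u -d "$ts" +%s) || return 2
  echo $(( now - then_s ))
}

# Usage (from the Supervisor):
#   ts=$(kubectl get node <ESXi_HOSTNAME> \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}')
#   [ "$(heartbeat_age "$ts")" -lt 300 ] && echo "heartbeat is fresh"
```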

If a host doesn't transition to Ready after 15 minutes:

Check NTP synchronization (Broadcom KB 387476)
Try the manual script from KB 305320 using the host and cluster MOIDs

5.3 Layer 2 Remediation — kapp-controller

From the Supervisor:

# Restart kapp-controller (auto-regenerates certificates)
kubectl rollout restart deployment kapp-controller -n vmware-system-appplatform-operator-system
kubectl rollout status deployment kapp-controller -n vmware-system-appplatform-operator-system --timeout=5m

What it does: On restart, kapp-controller generates a new self-signed certificate pair and automatically updates the caBundle of the v1alpha1.data.packaging.carvel.dev APIService.

Duration: ~2–3 minutes.

Verification:

# Verify new caBundle
kubectl get apiservice v1alpha1.data.packaging.carvel.dev -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -noout -dates -subject
# notAfter should be ~1 year in the future
 
# Verify the API responds
kubectl api-resources 2>&1 | grep -iE "packaging|error"
# The "unable to handle the request" error should not appear
 
# Verify the pod is stable
kubectl get pods -n vmware-system-appplatform-operator-system
# RESTARTS should be 0, AGE recent

Post-remediation — Force tanzu-framework reconciliation:

After renewing the kapp-controller certificates, the tanzu-framework app may remain in a "stale" state with the previous error cached. To force a new reconciliation:

kubectl patch app tanzu-framework -n vmware-system-pkgs --type merge -p '{"spec":{"paused":true}}'
sleep 5
kubectl patch app tanzu-framework -n vmware-system-pkgs --type merge -p '{"spec":{"paused":false}}'

Verification:

kubectl get app tanzu-framework -n vmware-system-pkgs
# Should show "Reconcile succeeded"
 
kubectl get app tanzu-framework -n vmware-system-pkgs -o json | jq '.status.consecutiveReconcileFailures'
# Should be null
 
# TKC clusters should transition to Ready: True
kubectl get tkc -A
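Reconciliation is not instantaneous, so the verification above may need to be repeated for a while. The following sketch polls a condition until it holds or a deadline passes; `wait_until` and `app_reconciled` are hypothetical helpers, and the 300-second deadline in the usage comment is an assumption.

```shell
# wait_until <timeout_s> <interval_s> <command...>
# Hypothetical polling helper (an assumption of this sketch). Re-runs the
# command until it succeeds, sleeping <interval_s> between attempts.
# Returns 1 once at least <timeout_s> seconds of sleeping have accumulated.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  while ! "$@"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  return 0
}

# app_reconciled: succeeds when the tanzu-framework app reports
# "Reconcile succeeded", using the same kubectl command as above.
app_reconciled() {
  kubectl get app tanzu-framework -n vmware-system-pkgs 2>/dev/null \
    | grep -q "Reconcile succeeded"
}

# Usage (from the Supervisor):
#   wait_until 300 10 app_reconciled && echo "tanzu-framework reconciled"
```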

5.4 Layer 3 Remediation — Workload Clusters (TKC)

Two methods exist: the automatic certmgr method and the manual kubeadm method.

Method A — certmgr (recommended when etcd is healthy)

From vCenter SSH with Supervisor KUBECONFIG:

export KUBECONFIG=/root/supervisor-kubeconfig
./certmgr tkc certificates rotate <CLUSTER_NAME> -n <NAMESPACE>

Known limitation: certmgr verifies etcd cluster health before rotating. If one of the Control Plane nodes has expired etcd certificates, the etcd cluster appears "unhealthy" and certmgr enters a retry loop without being able to complete the rotation. In this case, use Method B.

Method B — Manual kubeadm (when certmgr fails due to unhealthy etcd)

This method is used when certmgr cannot complete the rotation because an etcd member has expired certificates, creating a deadlock: it cannot rotate because etcd is not healthy, and etcd is not healthy because the certificates are expired.
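Because the failure mode is a retry loop rather than a clean error, it can be useful to put a deadline on Method A and treat a timeout as the signal to switch to Method B. This is only a sketch: `rotate_with_deadline` is a hypothetical wrapper around coreutils `timeout`, the 900-second value is arbitrary, and it assumes that interrupting certmgr while it is still waiting on etcd health is safe, which the KB does not confirm.

```shell
# rotate_with_deadline <seconds> <command...>
# Hypothetical wrapper (an assumption of this sketch). Runs the command under
# coreutils `timeout` and returns its exit status. `timeout` exits with 124
# when the deadline is reached, which is reported as a hint to use Method B.
rotate_with_deadline() {
  local deadline="$1"
  shift
  local rc=0
  timeout "$deadline" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "rotation did not finish within ${deadline}s; etcd may be unhealthy, consider Method B" >&2
  fi
  return "$rc"
}

# Usage (from vCenter SSH with the Supervisor KUBECONFIG exported):
#   rotate_with_deadline 900 ./certmgr tkc certificates rotate <CLUSTER_NAME> -n <NAMESPACE>
```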

Step 1: Get TKC cluster SSH credentials

# From the Supervisor
kubectl get secret -n <NAMESPACE> <CLUSTER_NAME>-ssh-password -o jsonpath='{.data.ssh-passwordkey}' | base64 -d; echo

The user is always vmware-system-user.

Step 2: Connect to the problematic node

# From the Supervisor
ssh vmware-system-user@<NODE_IP>
# Use the password obtained in step 1

Step 3: Renew certificates and restart components

# Renew all kubeadm certificates
sudo kubeadm certs renew all
 
# Restart kubelet
sudo systemctl restart kubelet
 
# Restart static pods (API server, etcd, controller-manager, scheduler)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
 
sleep 10
 
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/
 
sleep 30
 
# Verify static pods started
sudo crictl ps | grep -E "kube-apiserver|etcd|kube-controller|kube-scheduler"
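The manifest shuffle above is easy to get wrong by hand, for example by leaving one file behind in /tmp. The sketch below wraps the same move-out, wait, move-back sequence in a function that first checks all four manifests are present; `bounce_static_pods` is a hypothetical helper and its defaults mirror the paths used above.

```shell
# bounce_static_pods [manifest_dir] [staging_dir] [pause_s]
# Hypothetical helper (an assumption of this sketch). Moves the four control
# plane manifests out of the kubelet manifest directory, waits, then moves
# them back, which makes the kubelet recreate the static pods.
bounce_static_pods() {
  local manifests="${1:-/etc/kubernetes/manifests}"
  local staging="${2:-/tmp}"
  local pause="${3:-10}"
  local f
  # Refuse to start unless all four manifests are present
  for f in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    [ -f "$manifests/$f.yaml" ] || { echo "missing $manifests/$f.yaml" >&2; return 1; }
  done
  # Move the manifests out so the kubelet stops the static pods
  for f in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    mv "$manifests/$f.yaml" "$staging/" || return 1
  done
  sleep "$pause"
  # Move them back so the kubelet starts the pods with the renewed certificates
  for f in kube-apiserver kube-controller-manager etcd kube-scheduler; do
    mv "$staging/$f.yaml" "$manifests/" || return 1
  done
}

# Usage: run from a root shell on the Control Plane node (e.g. after `sudo -i`):
#   bounce_static_pods
```

The function only replaces the mv and sleep 10 lines; the kubeadm renewal, kubelet restart, and crictl verification stay as shown above.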

Step 4: Verify renewal

sudo kubeadm certs check-expiration
# Renewed leaf certificates should show ~364d of RESIDUAL TIME
# (CA certificates are listed in a separate table and are not renewed by this command)
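With several nodes to check, scanning the table by eye gets tedious. The sketch below filters the check-expiration output for entries below a threshold; `flag_short_lived` is a hypothetical helper, and it assumes the RESIDUAL TIME column uses values such as 364d or 23h and shows <invalid> for certificates that are already expired.

```shell
# flag_short_lived [min_days]
# Hypothetical filter (an assumption of this sketch). Reads the output of
# `kubeadm certs check-expiration` on stdin and prints the name of every
# entry that is expired or has fewer than min_days (default 30) remaining.
flag_short_lived() {
  awk -v min="${1:-30}" '
    /<invalid>/ { print $1, "expired"; next }
    {
      for (i = 2; i <= NF; i++) {
        # Residual time expressed in days, below the threshold
        if ($i ~ /^[0-9]+d$/ && $i + 0 < min) { print $1, $i; break }
        # Residual time expressed in hours, minutes or seconds: under a day
        if ($i ~ /^[0-9]+[hms]$/) { print $1, $i; break }
      }
    }'
}

# Usage (on the Control Plane node):
#   sudo kubeadm certs check-expiration | flag_short_lived 30
```

No output means every listed certificate has at least the requested number of days left.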

Step 5: Repeat for each CP node with expired certificates

Not all Control Plane nodes necessarily have the same problem. Use certmgr tkc certificates list to identify which nodes have ISEXPIRED: true and apply Method B only to those.

6. Post-Remediation Validation

6.1 Layer 1 Validation (Supervisor)

# All nodes should be Ready (3 CP + N ESXi agents)
kubectl get nodes -o wide
 
# In vCenter UI: Kubernetes Status = Running, Node Health = Healthy

6.2 Layer 2 Validation (kapp-controller)

# Packaging API responds without errors
kubectl api-resources 2>&1 | grep error
# (should return nothing)
 
# All apps reconcile
kubectl get app -A
# All should show "Reconcile succeeded"
 
# kapp-controller with no restarts
kubectl get pods -n vmware-system-appplatform-operator-system
# RESTARTS = 0

6.3 Layer 3 Validation (TKC Clusters)

# Clusters should be Ready
kubectl get tkc -A
# READY = True for all
 
# Verify certificates for each node
./certmgr tkc certificates list <CLUSTER_NAME> -n <NAMESPACE>
# ISEXPIRED = false for all
 
# Verify from the user side
kubectl-vsphere login --insecure-skip-tls-verify --server=<SUPERVISOR_VIP> --vsphere-username <USER> --tanzu-kubernetes-cluster-name <CLUSTER_NAME>
kubectl get nodes
kubectl get pvc -A
# No x509 error should appear

6.4 Dependent Services Validation

# Verify that cert-manager can renew application certificates
kubectl get certificates -A
# Application certificates should transition to Ready: True
# This may take a few minutes/hours after remediation
 
# Verify PVCs can be provisioned
kubectl get pvc -A
# Pending PVCs should start provisioning

7. Prevention

Proactive certificate monitoring

It is recommended to implement a cronjob or monitoring script that periodically checks certificate expiration across all 3 layers:

Layer 1 — Spherelets (run from vCenter monthly):

for HOST_IP in <ESXi_IP_1> <ESXi_IP_2>; do
  echo "=== $HOST_IP ==="
  openssl s_client -connect "$HOST_IP:10250" </dev/null 2>/dev/null | openssl x509 -noout -dates
done

Layer 2 — kapp-controller (run from Supervisor monthly):

kubectl get apiservice v1alpha1.data.packaging.carvel.dev -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -noout -enddate

Layer 3 — TKC Clusters (run from vCenter with KUBECONFIG monthly):

export KUBECONFIG=/root/supervisor-kubeconfig
for CLUSTER in tkc-cluster-dev tkc-cluster-qa tkc-cluster-prd; do
  echo "=== $CLUSTER ==="
  ./certmgr tkc certificates list "$CLUSTER" -n <NAMESPACE> 2>&1 | grep -E "ISEXPIRED|true"
done
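The three checks above print dates but do not raise an alarm. For a cronjob, something that exits non-zero when any certificate is close to expiry is easier to wire into alerting. A rough sketch assuming GNU date; `expiry_alert` is a hypothetical helper, its tab-separated input format is a convention invented for this example, and the 30-day threshold is an assumption.

```shell
# expiry_alert [min_days] [now_epoch]
# Hypothetical helper (an assumption of this sketch). Reads lines of the form
# "<label><TAB><notAfter date>" on stdin, prints an ALERT line for every entry
# with fewer than min_days (default 30) left, and returns 1 if any alert or
# unparseable date was seen, so cron or a monitoring agent can react.
expiry_alert() {
  local min_days="${1:-30}" now="${2:-$(date +%s)}"
  local rc=0 label when end days
  while IFS="$(printf '\t')" read -r label when; do
    [ -n "$label" ] || continue
    end=$(date -u -d "$when" +%s 2>/dev/null) || { echo "UNKNOWN $label ($when)"; rc=1; continue; }
    days=$(( (end - now) / 86400 ))
    if [ "$days" -lt "$min_days" ]; then
      echo "ALERT $label: $days days left"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, feeding it the Layer 2 check from above:
#   end=$(kubectl get apiservice v1alpha1.data.packaging.carvel.dev \
#           -o jsonpath='{.spec.caBundle}' | base64 -d \
#           | openssl x509 -noout -enddate | cut -d= -f2)
#   printf 'kapp-controller\t%s\n' "$end" | expiry_alert 30
```

For a single certificate file, `openssl x509 -checkend <seconds> -noout` gives a similar yes/no answer without any date parsing.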

Upgrade policy

The best prevention is keeping the infrastructure updated, because version upgrades automatically renew certificates. An N-1 policy (running one release behind the latest) balances stability against falling into obsolescence.

8. Official Broadcom/VMware References

KB	Title	Use
322994	Replace vSphere with Tanzu Supervisor Certificates	Main KB. Contains the certmgr tool and complete procedure for Supervisor and TKC
305320	vSphere Supervisor Manual Spherelet Certificate Renewal	Manual Spherelet renewal script (fallback when certmgr fails)
387476	ESXi nodes become NotReady after rotating Supervisor Certificates	NTP troubleshooting. Applies when hosts don't transition to Ready after rotation
319377	certmgr reports "Process exited with status 1"	Troubleshooting when certmgr fails during execution
323421	Master vSphere Supervisor Certificate Guide	Master guide of all Supervisor certificates and their lifecycles
389860	vSphere Supervisor Certificates - authproxy-client, pinniped, vip, wcp_node_bootstrapper	Certificates NOT renewed by certmgr (pinniped, authproxy, etc.)

9. Key Takeaways

Each layer's certificates are independent. Renewing Spherelets (Layer 1) does not renew kapp-controller (Layer 2) or TKC clusters (Layer 3). Each layer requires its own procedure.
certmgr has a known etcd limitation. When an etcd member has expired certificates, certmgr cannot complete rotation because it verifies the entire cluster's health before acting. The solution is to manually renew with kubeadm certs renew all on the problematic node first.
Intermittent behavior is the key signal. When a TKC cluster has only 1 of 3 CP nodes with expired certificates, kubectl commands fail intermittently. This is because the cluster's load balancer distributes connections among the 3 nodes — sometimes hitting the healthy node (works), sometimes hitting the expired one (fails).
The APIService caBundle only updates when kapp-controller restarts. There is no auto-renewal for the packaging-api certificate. A simple kubectl rollout restart resolves the issue.
The certmgr backup directory must exist beforehand. The tool does not create it automatically. Always use mkdir -p before running the backup.
vCenter snapshots with memory cause a prolonged stun. On VMs with 24+ GB RAM, taking a memory snapshot can leave the vCenter UI inaccessible for several minutes. Always take the snapshot without memory and with the Quiesce filesystem option enabled.
The Spherelet port 10250 TLS certificate may still show the old date after rotation. This is a different self-signed certificate from the internal communication one. Rotation was successful if the node transitions to Ready — the port 10250 certificate is cosmetic.
"Unknown field" warnings during kubeadm certs renew are normal. Messages like unknown field "dns.type" or unknown field "udpIdleTimeout" are minor schema incompatibilities between versions and do not affect the renewal.
