Deep Review: 20260521-205512-pr-361

Date2026-05-21 20:55
Reporancher-sandbox/rancher-desktop-daemon
Round2
Author@jandubois
PR#361 — kubernetes-reconciler: add V(1) tracing and merge-failure handling
Commits8543b4f Document KubernetesReady reasons in api_app.md
6062ce1 Set KubernetesReady=False when kubeconfig merge fails
805b48e Add V(1) tracing and log silent probe errors in kubernetes-reconciler
f0ff640 Extract WatchEventLogger to shared predicates package
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
VerdictMerge with optional fixes — round 1 blockers are addressed (docs, MergeFailed test, message wording); remaining items are doc-drift polish, additional Reconcile-path test coverage, and a daemon-side autostart change that persists -v into args.json.
Wall-clock time37 min 45 s


Executive Summary

Round 2. The PR closes the round-1 Suggestions on docs, MergeFailed coverage, and message wording, and adds two new commits: a daemon-side wiring that propagates klog verbosity through autostart, and the merge-failure condition itself. Reviewers found no Critical or Important regressions. The residual items are stale Go-side Settled docstring/prose that lag the new KubernetesReady gate in api_app.md, three uncovered Reconcile branches (NotApplicable, NotRunning, Probing), and an autostart change that bakes -v <N> into the persisted args.json.


Critical Issues

None.


Important Issues

None.


Suggestions

S1. Settled docstring and prose still omit KubernetesReady Codex
	// rancher-desktop-{instance} context into ~/.kube/config. It is only
	// meaningful when spec.kubernetes.enabled is true; when Kubernetes is
	// disabled the condition is absent or False with reason NotApplicable.
	AppConditionKubernetesReady = "KubernetesReady"

	// AppConditionSettled reports whether the reconcile chain has
	// fully caught up with the current spec: observed generations on
	// the feeding conditions match the App's generation, and the VM
	// and engine have reached a stable state for the desired config.
	// A spec change forces Settled to False; once the chain quiesces,
	// the App reconciler flips it back to True. `rdd set` waits on
	// this condition.
	AppConditionSettled = "Settled"
)

The PR adds WaitingForKubernetes and KubernetesStale reasons to the conditions table and extends the (forwarded) row to mention KubernetesReady (api_app.md:117-128), but three prose surfaces still describe Settled as if KubernetesReady were not a gating condition: the AppConditionSettled docstring above; api_app.md:100 lists only "App controller" and "engine controller" as condition writers; api_app.md:132 says Settled is "computed from the mirrored Running and the engine-written ContainerEngineReady." A reader who hits these first will not realise KubernetesReady blocks Settled too.


- **spec.kubernetes.version**: The Kubernetes version to use (e.g. `"1.30.2"`). Defaults to `"1.30.2"`. Propagated to the `KUBERNETES_VERSION` Lima template param.

- **status.kubernetesPort**: The host TCP port allocated for the k3s API server (`7441 + instance.Index()` by default). Set by the App reconciler on the first reconcile after `spec.kubernetes.enabled` becomes `true`, and cleared when `spec.kubernetes.enabled` is set back to `false` so that a fresh port is resolved on the next enable. The `KUBERNETES_PORT` Lima template param is set to this value; Lima's identity port-forward rule binds the same port on the host and forwards it to the guest.

- **status.conditions**: Multiple controllers write here. The App controller mirrors `Created` and `Running` from the owned `LimaVM`, computes `Settled`, and the engine controller writes `ContainerEngineReady`. All writers use `retry.RetryOnConflict` with a re-Get so concurrent status updates do not 409.

  | Type                   | Status    | Reason           | Description                                                       |
  |------------------------|-----------|------------------|-------------------------------------------------------------------|
  | `Created`              | `Unknown` | `Pending`        | LimaVM controller has started reconciliation                      |
  | `Created`              | `True`    | `Created`        | Lima instance created on disk and ready                           |
  | `Created`              | `False`   | `CreateFailed`   | Lima instance creation failed (see `message` for details)         |
  | `Running`              | `Unknown` | `Reconciling`    | Verifying instance state (e.g. after controller restart)          |
  | `Running`              | `True`    | `Started`        | Lima instance is running                                          |
  | `Running`              | `False`   | `Stopped`        | Lima instance is stopped                                          |
  | `Running`              | `False`   | `Starting`       | Lima instance is starting up                                      |
  | `Running`              | `False`   | `StartFailed`    | Lima instance failed to start                                     |
  | `Running`              | `False`   | `StopFailed`     | Lima instance failed to stop cleanly                              |
  | `ContainerEngineReady` | `True`    | `Connected`      | Engine controller has connected to Docker and completed full sync |
  | `ContainerEngineReady` | `True`    | `NotApplicable`  | Mirroring is not implemented for the current backend (e.g. `containerd`); forced `True` so `rdd set` can finish waiting |
  | `ContainerEngineReady` | `False`   | `ConnectFailed`  | Engine controller failed to connect to Docker                     |
  | `ContainerEngineReady` | `False`   | `Stopped`        | The VM is stopped; the engine watcher is not running              |
  | `KubernetesReady`      | `True`    | `Ready`          | k3s API server is reachable; instance context merged into `~/.kube/config` |
  | `KubernetesReady`      | `False`   | `NotApplicable`  | `spec.kubernetes.enabled` is false                                |
  | `KubernetesReady`      | `False`   | `NotRunning`     | VM is not running, so k3s cannot be healthy                       |
  | `KubernetesReady`      | `False`   | `Probing`        | Waiting for k3s API server to respond                             |
  | `KubernetesReady`      | `False`   | `MergeFailed`    | k3s is reachable but merging the instance kubeconfig failed (see `message`) |
  | `Settled`              | `True`    | `Settled`        | Reconcile chain has caught up with the current spec               |
  | `Settled`              | `False`   | `WaitingForLimaVM` | The App has no `Running` condition yet (nothing mirrored from LimaVM) |
  | `Settled`              | `False`   | `WaitingForEngine` | Engine controller has not yet written `ContainerEngineReady`    |
  | `Settled`              | `False`   | `EngineStale`    | Engine controller has not yet observed the current generation     |
  | `Settled`              | `False`   | `WaitingForKubernetes` | Kubernetes controller has not yet written `KubernetesReady`  |
  | `Settled`              | `False`   | `KubernetesStale` | Kubernetes controller has not yet observed the current generation |
  | `Settled`              | `False`   | *(forwarded)*    | Forwarded from the blocking `Running`, `ContainerEngineReady`, or `KubernetesReady` reason |

  `Running=True` means the Lima guest has finished booting and SSH is reachable. It says nothing about the container engine socket; consumers that depend on the engine (container/image/volume mirrors, `docker` against the forwarded socket) must also check `ContainerEngineReady`, which flips to `True` only after the engine controller has connected to the socket and completed its initial full sync.

  The `Created` and `Running` conditions are mirrored from LimaVM, so their `lastTransitionTime` reflects the LimaVM transition rather than the copy — the timestamp is meaningful for staleness checks. The engine reconciler stamps `ContainerEngineReady.observedGeneration` with the App's `metadata.generation` at the time it writes the condition; if the App's generation advances between the reconciler's read and write, the stamp reflects the write-time generation rather than the observed one. The App reconciler computes `Settled` from the mirrored `Running` and the engine-written `ContainerEngineReady`, and stamps its own `observedGeneration` with the `metadata.generation` observed when it computed the condition, not the generation at write time. The retry-on-conflict loop re-reads on 409, so a successful write always carries the current generation. `rdd set` waits for `Settled.status=True` with `observedGeneration >= post-patch metadata.generation`, so stale snapshots cannot prematurely satisfy the wait.

Deleting the `App` resource triggers the finalizer to stop and delete the owned LimaVM (and wait for the LimaVM controller to complete its own cleanup before removing the App finalizer).

## GUI

Fix: extend the docstring at app_types.go:48-49 to mention Kubernetes when enabled, add the Kubernetes controller to the writers list at api_app.md:100, and update the Settled derivation paragraph at api_app.md:132.

S2. ensureServiceRunning bakes -v <N> into the persisted args.json on first autostart Claude
		RunE:  serviceConfigAction,
	}
	return command
}

func ensureServiceRunning(ctx context.Context) error {
	// Pass klog verbosity the same way `rdd svc create` / `rdd svc start` do,
	// so V(1) tracing fires when autostart launches the daemon (e.g. from
	// `rdd set`).
	klogArgs := []string{"-v", logrusLevelToKlog()}
	if !service.Exists() {
		if err := service.Create(klogArgs); err != nil {
			return err
		}
	}
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, 90*time.Second)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return err
		}
		return waitForDiscoveryConfigMap(ctx)
	}
	return startAndWaitForReady(ctx, klogArgs)
}

// startAndWaitForReady starts the service, waits for the API server and the
// discovery ConfigMap to be ready. After an unclean shutdown the ConfigMap may
// survive with stale data, so the freshness check waits for a ConfigMap whose
// creationTimestamp is at or after the current startup.

service.Create persists its arg slice to instance.ArgsFile(). With the new klogArgs, the first rdd set (or any other autostart trigger) on a fresh checkout now writes a verbosity tag derived from the calling shell's logrus level into the on-disk args. serviceStartAction already appends -v <N> to each start invocation, so the persisted value never takes effect — but it does drift from what an explicit rdd svc create would have stored, and anyone inspecting args.json will see a verbosity that depends on which command happened to autostart the daemon first.

Fix: keep service.Create(nil) for the autostart-create branch and pass klogArgs only on the transient startAndWaitForReady path.

-    klogArgs := []string{"-v", logrusLevelToKlog()}
     if !service.Exists() {
-        if err := service.Create(klogArgs); err != nil {
+        if err := service.Create(nil); err != nil {
             return err
         }
     }
     ...
-    return startAndWaitForReady(ctx, klogArgs)
+    return startAndWaitForReady(ctx, []string{"-v", logrusLevelToKlog()})
S3. Four of six Reconcile branches remain uncovered Claude

The two new tests exercise the Ready and MergeFailed terminals. The other four Reconcile branches have no Reconcile-level coverage: NotApplicable (kube_reconciler.go:82-89), NotRunning (kube_reconciler.go:92-99), Probing from a probe error (kube_reconciler.go:103-114), and Probing from a non-200 healthz (kube_reconciler.go:115-125). The NotApplicable and NotRunning paths also call removeKubeContext, which has no Reconcile-level test at all — declined Round 1's Testing Gap #3 covered only probeCurrentKubeContext internals, not the dispatch surface.

		"generation", app.Generation,
		"resourceVersion", app.ResourceVersion,
	)

	// When Kubernetes is disabled, stamp the condition and clean up.
	if !app.Spec.Kubernetes.Enabled {
		r.removeKubeContext(ctx)
		return ctrl.Result{}, r.setKubeCondition(ctx, &app,
			metav1.ConditionFalse,
			appv1alpha1.AppKubernetesReasonNotApplicable,
			"Kubernetes is not enabled",
		)
	}

	// Kubernetes is enabled but the VM is not (yet) running.
	if !running {
		r.removeKubeContext(ctx)
		return ctrl.Result{}, r.setKubeCondition(ctx, &app,
			metav1.ConditionFalse,
			appv1alpha1.AppKubernetesReasonNotRunning,
			"VM is not running",
		)
	}

	// Probe the k3s API server from the instance kubeconfig.
	healthy, err := r.probeK3sAPI(ctx)
	if err != nil {
		// kubeconfig missing or unreadable — k3s has not started yet.
		log.V(1).Info("Cannot probe k3s API server", "err", err)
		if condErr := r.setKubeCondition(ctx, &app,
			metav1.ConditionFalse,
			appv1alpha1.AppKubernetesReasonProbing,
			"Waiting for k3s API server",
		); condErr != nil {
			return ctrl.Result{}, condErr
		}
		return ctrl.Result{RequeueAfter: kubeProbeRequeue}, nil
	}
	if !healthy {
		log.V(1).Info("k3s API server not yet healthy")
		if condErr := r.setKubeCondition(ctx, &app,
			metav1.ConditionFalse,
			appv1alpha1.AppKubernetesReasonProbing,
			"Waiting for k3s API server",
		); condErr != nil {
			return ctrl.Result{}, condErr
		}
		return ctrl.Result{RequeueAfter: kubeProbeRequeue}, nil
	}

	// k3s is healthy — merge the context into ~/.kube/config.
	if err := r.manageKubeContext(ctx); err != nil {
		if condErr := r.setKubeCondition(ctx, &app,
			metav1.ConditionFalse,

Fix: add Test_Reconcile_KubernetesReady_NotApplicable, _NotRunning, and two _Probing cases (probe error + non-200) using the existing newAppRunning / newKubeScheme scaffolding.

S4. Tests do not isolate HOME Claude
}

func Test_Reconcile_KubernetesReady_Ready(t *testing.T) {
	dir := t.TempDir()
	srcPath := fakeK3sServer(t, filepath.Join(dir, "k3s.yaml"))
	t.Setenv("KUBECONFIG", filepath.Join(dir, ".kube", "config"))

	scheme := newKubeScheme(t)
	app := newAppRunning()
	c := fake.NewClientBuilder().
		WithScheme(scheme).

Both tests set KUBECONFIG but not HOME. kubeConfigPath() (kube_context.go:19-28) reads KUBECONFIG first and only falls back to $HOME/.kube/config, so the tests pass today — but a future refactor that adds an "ignore empty KUBECONFIG" path or threads kubeconfig resolution through clientcmd.NewDefaultClientConfigLoadingRules would silently start operating on the developer's real ~/.kube/config. Defense-in-depth costs one line per test.

	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// kubeConfigPath returns ~/.kube/config, or $KUBECONFIG if set.
// Computed dynamically so t.Setenv("HOME", …) works in tests.
func kubeConfigPath() (string, error) {
	if kc := os.Getenv("KUBECONFIG"); kc != "" {
		return kc, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".kube", "config"), nil
}

// loadKubeConfig loads the kubeconfig at path; returns an empty config if absent.
func loadKubeConfig(path string) (*clientcmdapi.Config, error) {
	cfg, err := clientcmd.LoadFromFile(path)
	if os.IsNotExist(err) {
+    t.Setenv("HOME", dir)
     t.Setenv("KUBECONFIG", filepath.Join(dir, ".kube", "config"))
S5. Float-based duration comparison breaks parallel with the sibling test Claude
	t.Cleanup(func() { r.removeKubeContext(context.Background()) })

	req := ctrl.Request{NamespacedName: client.ObjectKey{Name: appName}}
	result, err := r.Reconcile(context.Background(), req)
	assert.NilError(t, err)
	assert.Equal(t, result.RequeueAfter.Seconds(), float64(0),
		"Ready path should not request a requeue")

	got := &appv1alpha1.App{}
	assert.NilError(t, c.Get(context.Background(), client.ObjectKey{Name: appName}, got))

	cond := findKubeReady(got)

Test_Reconcile_KubernetesReady_MergeFailed (line 175) compares the duration directly: assert.Equal(t, result.RequeueAfter, kubeProbeRequeue, ...). The Ready test takes the .Seconds() detour against a float64(0) literal, which is harder to read and invites float comparisons on a future tick.

	}

	req := ctrl.Request{NamespacedName: client.ObjectKey{Name: appName}}
	result, err := r.Reconcile(context.Background(), req)
	assert.NilError(t, err)
	assert.Equal(t, result.RequeueAfter, kubeProbeRequeue,
		"MergeFailed path should requeue after kubeProbeRequeue")

	got := &appv1alpha1.App{}
	assert.NilError(t, c.Get(context.Background(), client.ObjectKey{Name: appName}, got))
-    assert.Equal(t, result.RequeueAfter.Seconds(), float64(0),
-        "Ready path should not request a requeue")
+    assert.Equal(t, result.RequeueAfter, time.Duration(0),
+        "Ready path should not request a requeue")
S6. predicates package docstring frames a one-member package as a collection Claude
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: SUSE LLC
// SPDX-FileCopyrightText: The Rancher Desktop Authors

// Package predicates provides controller-runtime predicates for the App
// controllers.
package predicates

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

The package contains exactly one predicate today. If a second member is in flight, the docstring is fine; if not, keeping the helper next to one of its callers and re-extracting on the second consumer keeps the package count honest. No fix required — call the question while the file is fresh.


Design Observations

Concerns

Strengths


Testing Assessment

Unit tests for the changed packages pass (go test ./pkg/controllers/app/...). go test ./cmd/rdd could not compile under the agent worktree because pkg/guestagent/guestagent.go embeds lima-guestagent.gz, which is generated by make and absent from a fresh checkout — not a PR-introduced regression.

Coverage gaps, ranked:

  1. Probe-error and probe-unhealthy Reconcile paths are uncovered — production startup hits both before Ready. (S3)
  2. VM-not-running and Kubernetes-disabled paths are uncovered — both invoke removeKubeContext, which still has no Reconcile-level test. (S3)
  3. App reconciler Settled forwarding of MergeFailed has no dedicated test — the generic Kubernetes-false propagation reduces the risk but does not pin the new reason. Codex
  4. predicates.WatchEventLogger has no test — declined Round 1; the in-production usage is the de-facto smoke.

Documentation Assessment

docs/design/api_app.md gains the KubernetesReady reason rows and WaitingForKubernetes/KubernetesStale forwarded reasons. Internal docstrings on the new K3sConfigPath field, manageKubeContext's sync/async split, and probeCurrentKubeContext's "return true on internal error" contract are accurate and load-bearing. The remaining drift sits in the AppConditionSettled Go docstring and the surrounding prose in api_app.md — see S1.


Commit Structure

Four clean commits, one concept each: f0ff640 extract, 805b48e tracing, 6062ce1 condition propagation, 8543b4f docs. Each is independently revertable and each message matches its diff. No fold-in candidates.


Acknowledged Limitations

Declined in prior rounds


Agent Performance Retro

Claude

Claude produced the largest set this round — six suggestions, one Design concern, and three Strengths — and the only finding rooted in the autostart wiring (S2). That finding required tracing into pkg/service/cmd to verify that service.Create's slice arg lands in args.json, which the diff itself does not reveal. Claude also caught the standalone cmd/app-controller binary missing the kubernetes and engine imports while reading the new controller wiring, surfacing a pre-existing gap as a Design Observation. One Claude finding — "manageKubeContext logs the merge error twice" — restated the noise concern at the same code surface as Round 1's declined S5; moved to Declined in prior rounds. No false positives.

Codex

Codex's single finding (S1) was the one piece any agent caught that ties directly to the Round 1 fix's narrower-than-original scope: the conditions table was extended for KubernetesReady/WaitingForKubernetes/KubernetesStale, but the surrounding AppConditionSettled docstring and two api_app.md prose paragraphs still describe Settled as if KubernetesReady were not part of the gate. Codex was the only agent to read across both Go docstrings and design prose to spot the drift. No false positives.

Gemini

Gemini's two Important findings and one Suggestion all named real bugs in probeCurrentKubeContext (incomplete auth coverage), removeKubeContext (swallowed cleanup errors), and manageKubeContext (unconditional disk I/O). Verified at the merge-base SHA e8033bd: every cited construction predates this PR, so all three moved to Design Observations as pre-existing concerns. Severity calibration ran one notch high on the same I-vs-S axis as Round 1. Gemini also hit a 429 quota error mid-run but produced a complete structured review before that point.

Summary

Claude Opus 4.7Codex GPT 5.5Gemini 3.1 Pro
Duration9m 51s6m 20s6m 46s
Findings5S1Snone
Tool calls52 (Bash 32, Read 20)47 (shell 47)
Design observations422
False positives000
Unique insights514
Files reviewed888
Coverage misses000
Totals5S1Snone
Downgraded003 (I→S)
Dropped100

Reconciliation: Gemini I1 (probeCurrentKubeContext incomplete auth) → Design Observation concern; Gemini I2 (removeKubeContext silent cleanup) → Design Observation concern; Gemini S1 (manageKubeContext unconditional I/O) → Design Observation concern — all three demoted after merge-base verification confirmed the cited code predates this PR. Claude's "manageKubeContext logs the merge error twice" suggestion → dropped under Round 1's declined S5 (log file noise declined as acceptable).

The value distribution this round: Claude on test/coverage polish and the autostart wiring, Codex on the doc fix's residual scope, Gemini on pre-existing design concerns the diff exposed. No agent dominated; the three slices remained complementary.


Review Process Notes

Skill improvements

Repo context updates