# Deep Review: 20260521-205512-pr-361

| | |
|---|---|
| **Date** | 2026-05-21 20:55 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#361](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/361) — kubernetes-reconciler: add V(1) tracing and merge-failure handling |
| **Commits** | `8543b4f` Document KubernetesReady reasons in api_app.md<br>`6062ce1` Set KubernetesReady=False when kubeconfig merge fails<br>`805b48e` Add V(1) tracing and log silent probe errors in kubernetes-reconciler<br>`f0ff640` Extract WatchEventLogger to shared predicates package |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with optional fixes** — round 1 blockers are addressed (docs, MergeFailed test, message wording); remaining items are doc-drift polish, additional Reconcile-path test coverage, and a daemon-side autostart change that persists `-v` into `args.json`. |
| **Wall-clock time** | `37 min 45 s` |

## Executive Summary

Round 2. The PR closes the round-1 Suggestions on docs, MergeFailed coverage, and message wording, and adds two new commits: a daemon-side wiring that propagates klog verbosity through autostart, and the merge-failure condition itself. Reviewers found no Critical or Important regressions. The residual items are stale Go-side `Settled` docstring/prose that lag the new `KubernetesReady` gate in `api_app.md`, three uncovered Reconcile branches (`NotApplicable`, `NotRunning`, `Probing`), and an autostart change that bakes `-v <N>` into the persisted `args.json`.

---

## Critical Issues

None.

## Important Issues

None.

## Suggestions

S1. **`Settled` docstring and prose still omit `KubernetesReady`** — `pkg/apis/app/v1alpha1/app_types.go:46-49`, `docs/design/api_app.md:100`, `docs/design/api_app.md:132` [Codex GPT 5.5] (suggestion, gap)

```go
// AppConditionSettled reports whether the reconcile chain has
// fully caught up with the current spec: observed generations on
// the feeding conditions match the App's generation, and the VM
// and engine have reached a stable state for the desired config.
```

The PR adds `WaitingForKubernetes` and `KubernetesStale` reasons to the conditions table and extends the *(forwarded)* row to mention `KubernetesReady` (`api_app.md:117-128`), but three prose surfaces still describe `Settled` as if `KubernetesReady` were not a gating condition: the `AppConditionSettled` docstring above; `api_app.md:100` lists only "App controller" and "engine controller" as condition writers; `api_app.md:132` says `Settled` is "computed from the mirrored `Running` and the engine-written `ContainerEngineReady`." A reader who hits these first will not realise `KubernetesReady` blocks `Settled` too.

Fix: extend the docstring at `app_types.go:48-49` to mention Kubernetes when enabled, add the Kubernetes controller to the writers list at `api_app.md:100`, and update the `Settled` derivation paragraph at `api_app.md:132`.

S2. **`ensureServiceRunning` bakes `-v <N>` into the persisted `args.json` on first autostart** — `cmd/rdd/service.go:73-92` [Claude Opus 4.7] (suggestion, regression)

```go
klogArgs := []string{"-v", logrusLevelToKlog()}
if !service.Exists() {
    if err := service.Create(klogArgs); err != nil {
        return err
    }
}
...
return startAndWaitForReady(ctx, klogArgs)
```

`service.Create` persists its arg slice to `instance.ArgsFile()`. With the new `klogArgs`, the first `rdd set` (or any other autostart trigger) on a fresh checkout now writes a verbosity tag derived from the calling shell's logrus level into the on-disk args. `serviceStartAction` already appends `-v <N>` to each start invocation, so the persisted value never takes effect — but it does drift from what an explicit `rdd svc create` would have stored, and anyone inspecting `args.json` will see a verbosity that depends on which command happened to autostart the daemon first.

Fix: keep `service.Create(nil)` for the autostart-create branch and pass `klogArgs` only on the transient `startAndWaitForReady` path.

```diff
-    klogArgs := []string{"-v", logrusLevelToKlog()}
     if !service.Exists() {
-        if err := service.Create(klogArgs); err != nil {
+        if err := service.Create(nil); err != nil {
             return err
         }
     }
     ...
-    return startAndWaitForReady(ctx, klogArgs)
+    return startAndWaitForReady(ctx, []string{"-v", logrusLevelToKlog()})
```

S3. **Four of six Reconcile branches remain uncovered** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go` [Claude Opus 4.7] (suggestion, gap)

The two new tests exercise the `Ready` and `MergeFailed` terminals. The other four `Reconcile` branches have no Reconcile-level coverage: `NotApplicable` (`kube_reconciler.go:82-89`), `NotRunning` (`kube_reconciler.go:92-99`), `Probing` from a probe error (`kube_reconciler.go:103-114`), and `Probing` from a non-200 healthz (`kube_reconciler.go:115-125`). The `NotApplicable` and `NotRunning` paths also call `removeKubeContext`, which has no Reconcile-level test at all — declined Round 1's Testing Gap #3 covered only `probeCurrentKubeContext` internals, not the dispatch surface.

Fix: add `Test_Reconcile_KubernetesReady_NotApplicable`, `_NotRunning`, and two `_Probing` cases (probe error + non-200) using the existing `newAppRunning` / `newKubeScheme` scaffolding.

S4. **Tests do not isolate `HOME`** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go:114, 157` [Claude Opus 4.7] (suggestion, gap)

```go
t.Setenv("KUBECONFIG", filepath.Join(dir, ".kube", "config"))
```

Both tests set `KUBECONFIG` but not `HOME`. `kubeConfigPath()` (`kube_context.go:19-28`) reads `KUBECONFIG` first and only falls back to `$HOME/.kube/config`, so the tests pass today — but a future refactor that adds an "ignore empty `KUBECONFIG`" path or threads kubeconfig resolution through `clientcmd.NewDefaultClientConfigLoadingRules` would silently start operating on the developer's real `~/.kube/config`. Defense-in-depth costs one line per test.

```diff
+    t.Setenv("HOME", dir)
     t.Setenv("KUBECONFIG", filepath.Join(dir, ".kube", "config"))
```

S5. **Float-based duration comparison breaks parallel with the sibling test** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go:136-137` [Claude Opus 4.7] (suggestion, gap)

```go
assert.Equal(t, result.RequeueAfter.Seconds(), float64(0),
    "Ready path should not request a requeue")
```

`Test_Reconcile_KubernetesReady_MergeFailed` (line 175) compares the duration directly: `assert.Equal(t, result.RequeueAfter, kubeProbeRequeue, ...)`. The `Ready` test takes the `.Seconds()` detour against a `float64(0)` literal, which is harder to read and invites float comparisons on a future tick.

```diff
-    assert.Equal(t, result.RequeueAfter.Seconds(), float64(0),
-        "Ready path should not request a requeue")
+    assert.Equal(t, result.RequeueAfter, time.Duration(0),
+        "Ready path should not request a requeue")
```

S6. **`predicates` package docstring frames a one-member package as a collection** — `pkg/controllers/app/predicates/predicates.go:5-7` [Claude Opus 4.7] (suggestion, enhancement)

```go
// Package predicates provides controller-runtime predicates for the App
// controllers.
package predicates
```

The package contains exactly one predicate today. If a second member is in flight, the docstring is fine; if not, keeping the helper next to one of its callers and re-extracting on the second consumer keeps the package count honest. No fix required — call the question while the file is fresh.

---

## Design Observations

### Concerns

- **`probeCurrentKubeContext` ignores `ExecProvider`, `AuthProvider`, `CAFile`, and `CertFile`** *(future)* [Gemini 3.1 Pro]. The hand-rolled HTTP client at `kube_reconciler.go:312-332` only honours inline `CAData` / `CertData` / `KeyData` and a bearer token. A developer whose `current-context` uses an EKS exec-credential plugin, a GKE auth-provider, or any file-based cert path will see the probe fail with TLS or 401, and the background goroutine at `kube_reconciler.go:259-263` interprets the failure as "current-context unhealthy" and silently overwrites it with `rancher-desktop`. Verified at `e8033bd` (merge-base) that the construction predates this PR; the new V(1) log lines make the failure mode diagnosable but do not address the auth coverage. A `rest.HTTPClientFor(restCfg)` call would pick up every kubeconfig auth path client-go supports.
- **`removeKubeContext` swallows cleanup errors** *(future)* [Gemini 3.1 Pro]. `kube_reconciler.go:355-374` logs but does not return errors from `clearCurrentKubeContext` and `deleteKubeContext`. The caller at `kube_reconciler.go:83` discards the call and stamps `KubernetesReady=False/NotApplicable`. A Windows file lock or permission error on `~/.kube/config` therefore leaks the `rancher-desktop` entry indefinitely with no retry or backoff. Pre-existing (unchanged from `e8033bd`). Aggregating the errors with `errors.Join` and returning them from `removeKubeContext` would let controller-runtime apply normal backoff.
- **`manageKubeContext` rewrites `~/.kube/config` unconditionally on every reconcile** *(future)* [Gemini 3.1 Pro]. The call at `kube_reconciler.go:128` fires on every reconcile that reaches the healthy branch (17 per the PR's own smoke test). `createReplaceKubeContext` (`kube_context.go:42-88`) re-reads, re-merges, and re-writes the file each time, then spawns a background `probeCurrentKubeContext` goroutine. Pre-existing, but it deserves a deep-equal short-circuit in `createReplaceKubeContext`: skip the write (and the goroutine) when the target cluster, user, and context already match the on-disk values.
- **`cmd/app-controller/main.go` does not blank-import `kubernetes` or `engine`** *(future)* [Claude Opus 4.7]. The standalone development binary at `cmd/app-controller/main.go:9-16` imports only `app` and `demo`, so a developer running `bin/app-controller` exercises neither the engine reconciler nor the kubernetes reconciler this PR diagnoses. Pre-existing (introduced by #356 for `kubernetes` and earlier for `engine`); flagging it here while the file is adjacent.

### Strengths

- **`K3sConfigPath` injection** [Claude Opus 4.7]. Replacing the process-global `instance.K3sConfig()` call with a `KubernetesReconciler` field threaded through `controller.go:46-49` is the minimum change that breaks the test's dependency on process state — and it enables both new tests without touching any sibling controller.
- **`probeCurrentKubeContext` error-path logging preserves the "leave current-context alone" contract** [Claude Opus 4.7]. Every previously silent fall-through now logs at Error level while still returning `true`, and the new docstring at `kube_reconciler.go:271-274` documents *why* — the kind of comment that would have shortened the #358 investigation.
- **`MergeFailed` propagates cleanly to `Settled`** [Claude Opus 4.7, Codex GPT 5.5]. The new `AppKubernetesReasonMergeFailed` is an exported constant, documented inline, surfaced in `api_app.md`, and forwarded by `computeSettledCondition` via the existing `kubeCond.Reason` path — docs, code, and the Settled forwarding stay in sync.
- **Predicate extraction prevents drift** [Claude Opus 4.7, Codex GPT 5.5, Gemini 3.1 Pro]. The shared `predicates.WatchEventLogger` lets both reconcilers adopt watch-event logging without two copies of the predicate; the move commit is a pure refactor.
- **`manageKubeContext` separates synchronous and asynchronous error handling** [Codex GPT 5.5]. Returning only the synchronous merge error keeps the readiness contract focused on what the user can act on while preserving the deliberate best-effort current-context probe.

## Testing Assessment

Unit tests for the changed packages pass (`go test ./pkg/controllers/app/...`). `go test ./cmd/rdd` could not compile under the agent worktree because `pkg/guestagent/guestagent.go` embeds `lima-guestagent.gz`, which is generated by `make` and absent from a fresh checkout — not a PR-introduced regression.

Coverage gaps, ranked:

1. **Probe-error and probe-unhealthy Reconcile paths** are uncovered — production startup hits both before `Ready`. (S3)
2. **VM-not-running and Kubernetes-disabled paths** are uncovered — both invoke `removeKubeContext`, which still has no Reconcile-level test. (S3)
3. **App reconciler `Settled` forwarding of `MergeFailed`** has no dedicated test — the generic Kubernetes-false propagation reduces the risk but does not pin the new reason. [Codex GPT 5.5]
4. **`predicates.WatchEventLogger` has no test** — declined Round 1; the in-production usage is the de-facto smoke.

## Documentation Assessment

`docs/design/api_app.md` gains the `KubernetesReady` reason rows and `WaitingForKubernetes`/`KubernetesStale` forwarded reasons. Internal docstrings on the new `K3sConfigPath` field, `manageKubeContext`'s sync/async split, and `probeCurrentKubeContext`'s "return true on internal error" contract are accurate and load-bearing. The remaining drift sits in the `AppConditionSettled` Go docstring and the surrounding prose in `api_app.md` — see S1.

## Commit Structure

Four clean commits, one concept each: `f0ff640` extract, `805b48e` tracing, `6062ce1` condition propagation, `8543b4f` docs. Each is independently revertable and each message matches its diff. No fold-in candidates.

## Acknowledged Limitations

- **Code comment** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:204-208`: *"Returns an error if the synchronous merge fails; failures from the async current-context probe are logged but not returned."* The PR's behavior matches the docstring; see the first Design Concern for the residual gap.
- **Code comment** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:271-274`: *"Returns true on unexpected internal errors (path lookup, kubeconfig parse, TLS setup) to leave the user's current-context alone; each such path logs the underlying error."* Implemented by this PR.

### Declined in prior rounds

- **WatchEventLogger nil-object guards** — declined Round 1: cache-driven watch source never delivers nil events; defensive code for an unreachable state contradicts YAGNI. No material change since the decline.
- **MergeFailed re-logs on every requeue with no backoff** — declined Round 1: matches the existing probe-failure pattern; log-file noise only, not user-facing terminal; `setKubeCondition` short-circuits identical reason/message. Claude's pass-1 finding "manageKubeContext logs the merge error twice" proposes a different fix (demote the in-function `log.Error` to V(1)) for the same noise concern at the same code surface (`kube_reconciler.go:213-216`); dropped here under this decline.
- **`probeCurrentKubeContext` error-path tests** — declined Round 1: integration-style mocking cost outweighs value for observability-only paths.
- **`predicates.WatchEventLogger` smoke test** — declined Round 1: pure observability with a single-line return; the in-app reconciler usage is the de-facto smoke test.

---

## Agent Performance Retro

### [Claude]

Claude produced the largest set this round — six suggestions, one Design concern, and three Strengths — and the only finding rooted in the autostart wiring (S2). That finding required tracing into `pkg/service/cmd` to verify that `service.Create`'s slice arg lands in `args.json`, which the diff itself does not reveal. Claude also caught the standalone `cmd/app-controller` binary missing the `kubernetes` and `engine` imports while reading the new controller wiring, surfacing a pre-existing gap as a Design Observation. One Claude finding — "manageKubeContext logs the merge error twice" — restated the noise concern at the same code surface as Round 1's declined S5; moved to Declined in prior rounds. No false positives.

### [Codex]

Codex's single finding (S1) was the one piece any agent caught that ties directly to the Round 1 fix's narrower-than-original scope: the conditions table was extended for `KubernetesReady`/`WaitingForKubernetes`/`KubernetesStale`, but the surrounding `AppConditionSettled` docstring and two `api_app.md` prose paragraphs still describe `Settled` as if `KubernetesReady` were not part of the gate. Codex was the only agent to read across both Go docstrings and design prose to spot the drift. No false positives.

### [Gemini]

Gemini's two Important findings and one Suggestion all named real bugs in `probeCurrentKubeContext` (incomplete auth coverage), `removeKubeContext` (swallowed cleanup errors), and `manageKubeContext` (unconditional disk I/O). Verified at the merge-base SHA `e8033bd`: every cited construction predates this PR, so all three moved to Design Observations as pre-existing concerns. Severity calibration ran one notch high on the same I-vs-S axis as Round 1. Gemini also hit a 429 quota error mid-run but produced a complete structured review before that point.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 9m 51s | 6m 20s | 6m 46s |
| Findings | 5S | 1S | none |
| Tool calls | 52 (Bash 32, Read 20) | 47 (shell 47) | — |
| Design observations | 4 | 2 | 2 |
| False positives | 0 | 0 | 0 |
| Unique insights | 5 | 1 | 4 |
| Files reviewed | 8 | 8 | 8 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **5S** | **1S** | **none** |
| Downgraded | 0 | 0 | 3 (I→S) |
| Dropped | 1 | 0 | 0 |


**Reconciliation:** Gemini I1 (probeCurrentKubeContext incomplete auth) → Design Observation concern; Gemini I2 (removeKubeContext silent cleanup) → Design Observation concern; Gemini S1 (manageKubeContext unconditional I/O) → Design Observation concern — all three demoted after merge-base verification confirmed the cited code predates this PR. Claude's "manageKubeContext logs the merge error twice" suggestion → dropped under Round 1's declined S5 (`log file noise` declined as acceptable).

The value distribution this round: Claude on test/coverage polish and the autostart wiring, Codex on the doc fix's residual scope, Gemini on pre-existing design concerns the diff exposed. No agent dominated; the three slices remained complementary.

## Review Process Notes

### Skill improvements

- None — the prior-round gates worked as written. The decline-table walk caught Claude's duplicate-logging restatement, and the merge-base verification gate caught all three Gemini pre-existing findings before they reached the consolidated Important section.

### Repo context updates

- None — `deep-review-context.md` already covers the relevant patterns (cross-controller condition propagation, kubeconfig handling, controller imports in `cmd/*-controller/main.go`). The PR's behavior matches every applicable entry.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have a complete picture. Let me write the review.

---

# Deep Review — PR #361 (kubernetes-reconciler diagnostics)

**Verdict:** Approve with minor cleanups.

## Executive Summary

This three-commit follow-up to #358 promotes the `watchEventLogger` predicate into a shared `pkg/controllers/app/predicates` package, threads V(1) tracing through the kubernetes-reconciler and the autostart path, surfaces the six previously-silent error paths in `probeCurrentKubeContext`, and flips `KubernetesReady=False/MergeFailed` on a synchronous merge failure (previously the controller mis-reported `True`). The diff is small, the new tests cover the introduced behavior, and the unit suite passes. The main residual concern is the absence of `kubernetes` and `engine` from `cmd/app-controller/main.go` — pre-existing but adjacent.

## Findings

### Critical Issues

None.

### Important Issues

None.

### Suggestions

**S1. `ensureServiceRunning` always persists `-v <N>` into `args.json`** — `cmd/rdd/service.go:73-82` (suggestion, regression)

```go
klogArgs := []string{"-v", logrusLevelToKlog()}
if !service.Exists() {
    if err := service.Create(klogArgs); err != nil {
        return err
    }
}
```

`service.Create` writes its arg slice to `instance.ArgsFile()` (see `pkg/service/cmd/service.go:171-187`). With the new `klogArgs`, the autostart code path now bakes the CLI-side verbosity into `args.json` whenever it implicitly creates the control plane (e.g., the first `rdd set` after a fresh checkout). `serviceStartAction` already appends `-v <N>` to the per-start args, so the stored value is overridden on every start — but the stored value also drifts from what an explicit `rdd svc create` would have written. Subsequent `rdd svc log`, `rdd svc status`, or anyone reading `args.json` for diagnostics sees a verbosity tag that reflects a one-time autostart's CLI environment.

Fix: only pass `klogArgs` on the `startAndWaitForReady` path (where they are transient), and keep `service.Create(nil)` for the autostart-create case so the on-disk args remain "just the secure port" until the user explicitly chooses a verbosity.

```diff
 func ensureServiceRunning(ctx context.Context) error {
-    // Pass klog verbosity the same way `rdd svc create` / `rdd svc start` do,
-    // so V(1) tracing fires when autostart launches the daemon (e.g. from
-    // `rdd set`).
-    klogArgs := []string{"-v", logrusLevelToKlog()}
     if !service.Exists() {
-        if err := service.Create(klogArgs); err != nil {
+        if err := service.Create(nil); err != nil {
             return err
         }
     }
     ...
-    return startAndWaitForReady(ctx, klogArgs)
+    return startAndWaitForReady(ctx, []string{"-v", logrusLevelToKlog()})
 }
```

---

**S2. `manageKubeContext` logs the merge error twice (effectively)** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:213-216` (suggestion, regression)

```go
if err := createReplaceKubeContext(contextName, r.K3sConfigPath); err != nil {
    log.Error(err, "Failed to create Kubernetes context", "context", contextName)
    return fmt.Errorf("create Kubernetes context %q: %w", contextName, err)
}
```

The function logs at `Error` (line 214) and then returns a wrapped error. The caller (line 128-137) stamps the same wrapped error string into the `KubernetesReady` condition's `Message`. With `kubeProbeRequeue = 5 * time.Second`, a persistent merge failure (e.g., `~/.kube` owned by root, or the ENOTDIR case the test exercises) produces an `Error`-level log line every five seconds for as long as the failure persists. Because there is no transition-suppression on the log, this is steady-state noise at the highest severity.

Fix: demote the in-function log to `V(1)` (or drop it entirely now that the error reaches the caller and the condition). One of the two log/condition surfaces is enough.

```diff
     if err := createReplaceKubeContext(contextName, r.K3sConfigPath); err != nil {
-        log.Error(err, "Failed to create Kubernetes context", "context", contextName)
         return fmt.Errorf("create Kubernetes context %q: %w", contextName, err)
     }
```

---

**S3. Reconcile path coverage gap in the new tests** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go` (suggestion, gap)

The two added tests cover the Ready and MergeFailed terminals, but four of the six Reconcile branches remain untested: `NotApplicable` (Kubernetes disabled), `NotRunning` (VM not running), `Probing` from a kubeconfig-unreadable error, and `Probing` from a non-200 healthz. Each of these calls `setKubeCondition` with a distinct reason, and the test file already has the scaffolding (`newAppRunning`, `newKubeScheme`, `findKubeReady`) to add them cheaply. Most importantly, the disabled and not-running paths call `removeKubeContext` — exercising them would also cover the cleanup-side state machine that has no Reconcile-level test today.

Fix: add `Test_Reconcile_KubernetesReady_NotApplicable`, `_NotRunning`, and `_Probing` (probe error + non-200) following the same pattern as the existing tests.

---

**S4. `Test_Reconcile_KubernetesReady_MergeFailed` does not isolate `HOME`** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go:148-187` (suggestion, gap)

```go
kubeDir := filepath.Join(dir, ".kube")
assert.NilError(t, os.WriteFile(kubeDir, []byte("not a directory"), 0o600))
t.Setenv("KUBECONFIG", filepath.Join(kubeDir, "config"))
```

`kubeConfigPath()` (`kube_context.go:19-28`) reads `KUBECONFIG` first and falls back to `$HOME/.kube/config`. The test sets `KUBECONFIG` but does not also `t.Setenv("HOME", dir)`. If a future refactor makes `KUBECONFIG`-precedence conditional (e.g., skip empty paths), the test would silently start operating on the developer's real `~/.kube/config`. Defense-in-depth: also set `HOME` to the tempdir. The `_Ready` test has the same exposure.

```diff
+    t.Setenv("HOME", dir)
     t.Setenv("KUBECONFIG", filepath.Join(kubeDir, "config"))
```

---

**S5. `assert.Equal(t, result.RequeueAfter.Seconds(), float64(0), ...)` is harder to read than the natural form** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go:136-137` (suggestion, gap)

```go
assert.Equal(t, result.RequeueAfter.Seconds(), float64(0),
    "Ready path should not request a requeue")
```

Comparing a `time.Duration` via `.Seconds()` against a `float64` literal works but is unusual; on a future tick where the constant changed, the float comparison would invite floating-point surprises that the duration comparison would not. The companion test on line 175 uses `assert.Equal(t, result.RequeueAfter, kubeProbeRequeue, ...)` — consistent form is preferable.

```diff
-    assert.Equal(t, result.RequeueAfter.Seconds(), float64(0),
-        "Ready path should not request a requeue")
+    assert.Equal(t, result.RequeueAfter, time.Duration(0),
+        "Ready path should not request a requeue")
```

---

**S6. Re-evaluate the package docstring once `predicates` has a single member** — `pkg/controllers/app/predicates/predicates.go:5-7` (suggestion, enhancement)

```go
// Package predicates provides controller-runtime predicates for the App
// controllers.
package predicates
```

The package contains exactly one predicate today. The docstring frames it as a collection but there is no second predicate planned in this PR. If the intent is "this is a kitchen-sink for app-controller-tree predicates," that's fine; if the only future addition is unknown, consider keeping the helper next to one of its callers and re-extracting when the second consumer arrives. No fix needed if a second predicate is in flight.

## Design Observations

**Concerns**

- **`cmd/app-controller/main.go` still does not blank-import `kubernetes` or `engine`** (future). The standalone development binary at `cmd/app-controller/main.go:13-14` only imports `app` and `demo`. Per the repo overview rule ("both `cmd/*/main.go` and the embedded service package blank-import the controller (a missing import in either is a silent drop with no startup error)"), the standalone binary loses both the engine controller and the kubernetes controller this PR diagnoses. Pre-existing (introduced by #356 for `kubernetes` and earlier for `engine`), but the same drop also means a developer running `bin/app-controller` will never exercise the V(1) tracing this PR adds. Fix the import list while the file is fresh in memory.

  ```diff
       _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
       _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
  +    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
  +    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/kubernetes"
  ```

**Strengths**

- The `K3sConfigPath` injection on `KubernetesReconciler` is a clean refactor that replaces a process-global `instance.K3sConfig()` call with a field set at construction. It enables the new `MergeFailed` test without touching any process state and is the minimum change to break the dependency.
- `probeCurrentKubeContext` retains the "leave the user's current-context alone on internal errors" contract while surfacing each path's error to the log. The new docstring (lines 271-274) documents the intentional `return true` on errors, which is the kind of comment that would have prevented the original silent-fall-through during the #358 investigation.
- The MergeFailed reason is added as an exported constant (`AppKubernetesReasonMergeFailed`) and immediately documented in `docs/design/api_app.md:121`. The reason constants are also referenced by `computeSettledCondition` via the `kubeCond.Reason` forward (`app_controller.go:572-575`), so the docs/code/Settled-forwarding triple stays in sync.

## Testing Assessment

Unit tests for the changed packages pass (`go test ./pkg/controllers/app/...` is green). Coverage gaps, ranked:

1. **Probe-error and probe-unhealthy Reconcile paths** are uncovered — these are the most-hit branches in production startup. (S3)
2. **VM-not-running and Kubernetes-disabled Reconcile paths** are uncovered — both invoke `removeKubeContext`, which has no Reconcile-level test. (S3)
3. **`WatchEventLogger` has no test** — minor; the predicate always returns true, so the only behavioral surface is the log line. A `testing/zaptest`-style observer test would be straightforward but is not load-bearing.
4. **`Test_Reconcile_KubernetesReady_Ready` does not assert on-disk state** — relies on the merge happening as a side effect of the K8s condition flip. The companion `kube_context_test.go` covers `createReplaceKubeContext` directly, so the gap is partial.

## Documentation Assessment

`docs/design/api_app.md` is updated for `MergeFailed`, `WaitingForKubernetes`, and `KubernetesStale`. Internal docstrings on the new `K3sConfigPath` field and on `probeCurrentKubeContext` are accurate and explain the "why" rather than restating the signature. The `manageKubeContext` docstring at line 204-208 correctly distinguishes the synchronous merge (returned) from the async probe (logged).

## Commit Structure

The three commits divide cleanly: extraction (f0ff640), tracing + silent-error logging (805b48e), and merge-failure-as-condition (6062ce1). Each is independently revertable and the messages match the diffs. No fold-in candidates.

## Acknowledged Limitations

- **`probeCurrentKubeContext` returns `true` on internal errors to preserve the user's current-context.** Quoted from `kube_reconciler.go:271-274`: *"Returns true on unexpected internal errors (path lookup, kubeconfig parse, TLS setup) to leave the user's current-context alone; each such path logs the underlying error."* The author has explicitly chosen "leave the on-disk current-context alone" over "force-switch on internal error." This PR makes the limitation easier to diagnose by adding the missing log calls.
- **The async current-context probe's failures are intentionally not reflected on the `KubernetesReady` condition.** Quoted from `kube_reconciler.go:204-208`: *"Returns an error if the synchronous merge fails; failures from the async current-context probe are logged but not returned."* The condition only reflects the synchronous step; the async best-effort step stays best-effort.

## Coverage Summary

- `cmd/rdd/service.go` — **S1**
- `docs/design/api_app.md` — Reviewed, no issues
- `pkg/apis/app/v1alpha1/app_types.go` — Reviewed, no issues
- `pkg/controllers/app/app/controllers/app_controller.go` — Reviewed, no issues (mechanical predicate swap)
- `pkg/controllers/app/kubernetes/controller.go` — Reviewed, no issues
- `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go` — **S2**
- `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go` — **S3, S4, S5**
- `pkg/controllers/app/predicates/predicates.go` — **S6**

### [Codex] — Pass 1

### Executive Summary

The PR correctly surfaces kubeconfig merge failures as `KubernetesReady=False/MergeFailed` and adds shared V(1) watch/reconcile tracing. I found no correctness blockers. The only issue is documentation drift: several App status comments still describe `Settled` and status writers as if `KubernetesReady` were not part of the readiness gate.

### Findings

#### Critical Issues

None.

#### Important Issues

None.

#### Suggestions

S1. **Settled documentation still omits the KubernetesReady gate** — `pkg/apis/app/v1alpha1/app_types.go:49` (suggestion, gap)

```go
// the feeding conditions match the App's generation, and the VM
// and engine have reached a stable state for the desired config.
```

`AppConditionSettled` still says only the VM and engine must reach a stable state at line 49, while this PR documents `KubernetesReady` as a blocking gate in `docs/design/api_app.md:126`. The same drift appears in `docs/design/api_app.md:100`, which lists only the App and engine status writers, and `docs/design/api_app.md:132`, which says `Settled` is computed from `Running` and `ContainerEngineReady` only.

Fix: update the API comments and design paragraph to include Kubernetes when enabled.

```diff
- // and engine have reached a stable state for the desired config.
+ // engine, and Kubernetes when enabled have reached a stable state for the desired config.
```

### Design Observations

**Strengths** — Promoting `WatchEventLogger` avoids two controller-local predicate implementations drifting.

**Strengths** — Returning only synchronous merge errors from `manageKubeContext` keeps the readiness contract focused on publishing the context while preserving the deliberate async current-context behavior.

### Testing Assessment

1. No dedicated App reconciler test covers `Settled` forwarding the new `MergeFailed` reason. Existing generic Kubernetes false-condition coverage reduces the risk.
2. No command-side test covers the new autostart `-v` argument flow in `ensureServiceRunning`.

Verification run: `go test ./pkg/controllers/app/...` passed. `go test ./cmd/rdd ./pkg/service/cmd` could not compile `cmd/rdd` because `pkg/guestagent/guestagent.go` references missing embedded file `lima-guestagent.gz`; `pkg/service/cmd` had no test files.

### Documentation Assessment

S1 covers the remaining documentation gap. The new reason table itself documents `MergeFailed`.

### Acknowledged Limitations

- **Code comment**: `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:206` documents that async current-context probe failures are logged but not returned. This keeps the user’s current context policy separate from `KubernetesReady`.
- **Code comment**: `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:272` documents that unexpected current-context probe errors return true to leave the user’s current context alone. The new logging makes this limitation more diagnosable.

### Coverage Summary

- `cmd/rdd/service.go` — Reviewed, no issues
- `docs/design/api_app.md` — S1
- `pkg/apis/app/v1alpha1/app_types.go` — S1
- `pkg/controllers/app/app/controllers/app_controller.go` — Reviewed, no issues
- `pkg/controllers/app/kubernetes/controller.go` — Reviewed, no issues
- `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go` — Reviewed, no issues
- `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go` — Reviewed, no issues
- `pkg/controllers/app/predicates/predicates.go` — Reviewed, no issues

### [Gemini] — Pass 1

### Executive Summary

The PR extracts `watchEventLogger` to a shared predicates package, adds V(1) trace logging to the kubernetes reconciler, and correctly surfaces a `MergeFailed` condition when merging the instance kubeconfig fails. The core logic improvements successfully halt the `KubernetesReady` status on file errors, but the surrounding kubeconfig probe mechanism has pre-existing gaps handling complex user authentication and silent teardown failures that warrant attention.

### Findings

#### Critical Issues
None.

#### Important Issues

I1. **`probeCurrentKubeContext` builds an incomplete HTTP client, falsely rejecting complex auth** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:322-351` (important, gap)

```go
	tlsCfg := &tls.Config{ ServerName: restCfg.ServerName }
	if len(restCfg.CAData) > 0 { /* ... */ }
	if len(restCfg.CertData) > 0 && len(restCfg.KeyData) > 0 { /* ... */ }
	httpClient := &http.Client{ Transport: &http.Transport{TLSClientConfig: tlsCfg} }
```

`probeCurrentKubeContext` manually constructs an `http.Client` at lines 322-332 using only inline certificate data (`CAData`, `CertData`, `KeyData`) and a bearer token (line 343). It completely ignores `ExecProvider` (used by AWS CLI/EKS), `AuthProvider` (GCP/GKE), `CAFile`, and `CertFile`. If the user's `current-context` relies on any of these standard mechanisms, the `/healthz` GET at line 345 fails with a TLS error or 401 Unauthorized. Because the probe returns `false` (line 347), the background goroutine in `manageKubeContext` incorrectly assumes the context is unhealthy and silently overwrites the user's active context with `rancher-desktop` (line 259).

Fix: Replace the manual HTTP client construction with upstream client-go helpers. Use `clientcmd.NewNonInteractiveDeferredLoadingClientConfig` with `ExplicitPath` so relative paths resolve correctly against the kubeconfig directory, then use `rest.HTTPClientFor(restCfg)` to get a client that supports all Kubernetes auth providers.

I2. **Silent failure during `removeKubeContext` cleanup leaves stale kubeconfig entries** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:76-80`, `355-374` (important, gap)

```go
	if !app.Spec.Kubernetes.Enabled {
		r.removeKubeContext(ctx)
		return ctrl.Result{}, r.setKubeCondition(...)
	}
```

When Kubernetes is disabled, `Reconcile` triggers cleanup via `r.removeKubeContext(ctx)` at line 76. If `clearCurrentKubeContext` or `deleteKubeContext` fails (e.g., due to a Windows file lock or permission error), `removeKubeContext` only logs the errors at lines 369 and 372 before returning. `Reconcile` then completes with `ctrl.Result{}, nil`. Because the error is swallowed and not returned to the controller-runtime, the reconciler never applies exponential backoff to retry the cleanup. The `rancher-desktop` context permanently leaks in the user's `~/.kube/config`.

Fix: Modify `removeKubeContext` to return an aggregated error (using `errors.Join`) if clearing or deleting the context fails. In `Reconcile` at line 76, check for this error and return it so the framework safely retries the teardown.

#### Suggestions

S1. **Unconditional I/O and goroutine churn in `manageKubeContext` on every reconcile** — `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go:128`, `208-265` (suggestion, enhancement)

```go
	if err := r.manageKubeContext(ctx); err != nil {
```

`Reconcile` unconditionally calls `manageKubeContext` at line 128 whenever Kubernetes is enabled and running. `manageKubeContext` calls `createReplaceKubeContext` (line 211) which unconditionally locks and rewrites `~/.kube/config` to disk. It then spawns a background `probeCurrentKubeContext` goroutine (line 229). Because the `App` resource receives frequent status updates from the LimaVM and Engine controllers, `kubernetes-reconciler` triggers often (resulting in the 17 reconcile loops noted in the PR description). This causes continuous, unnecessary disk writes and goroutine churn even when the kubeconfig is already correct.

Fix: In `createReplaceKubeContext`, deep-equal the target cluster, user, and context entries against the existing `cfg`, returning early without writing to disk if they match. Similarly, skip launching the background probe in `manageKubeContext` if `rancher-desktop` is already the active, healthy context.

### Design Observations

**Concerns**
- **Aggressive current-context hijacking (in-scope)**: `manageKubeContext` launches a background probe against the user's `current-context`. If the probe fails, it aggressively overwrites their active context with `rancher-desktop`. A developer whose external cluster is temporarily unreachable (e.g., VPN disconnected, sleeping VM) will have their context silently swapped, leading to accidental deployments to the local RDD cluster once they resume work. RDD should only take over the active context if it was already pointing to RDD or if explicitly requested, rather than hijacking it based on reachability.

**Strengths**
- The addition of `AppKubernetesReasonMergeFailed` correctly aligns the controller's behavior with the documented API contract, ensuring that users waiting via `rdd set` are properly blocked on a failed file merge instead of silently hanging.

### Testing Assessment

1. `removeKubeContext` teardown: The `!app.Spec.Kubernetes.Enabled` path that triggers `removeKubeContext` is completely untested. There is no assertion that the `rancher-desktop` entries are actually removed from `~/.kube/config` when the feature is disabled.
2. `probeCurrentKubeContext` edge cases: The probe logic is unexercised by unit tests. The incomplete auth handling (I1) ships green because there are no tests simulating a user's pre-existing complex kubeconfig (like an `ExecProvider` AWS EKS config) during a merge.
3. `probeK3sAPI` transient failures: The "Waiting for k3s API server" (`Probing`) states are not explicitly tested, though they are common in practice during boot.

### Documentation Assessment

The updates to `docs/design/api_app.md` clearly describe the new condition reasons. No gaps found.

### Acknowledged Limitations

None.

### Coverage Summary

- `cmd/rdd/service.go`: Reviewed, no issues
- `docs/design/api_app.md`: Reviewed, no issues
- `pkg/apis/app/v1alpha1/app_types.go`: Reviewed, no issues
- `pkg/controllers/app/app/controllers/app_controller.go`: Reviewed, no issues
- `pkg/controllers/app/kubernetes/controller.go`: Reviewed, no issues
- `pkg/controllers/app/kubernetes/controllers/kube_reconciler.go`: I1, I2, S1
- `pkg/controllers/app/kubernetes/controllers/kube_reconciler_test.go`: Reviewed, no issues
- `pkg/controllers/app/predicates/predicates.go`: Reviewed, no issues

---

## Resolution

| | |
|---|---|
| **Addressed** | 2026-05-21 |
| **Commit** | `2f88f63 Document KubernetesReady reasons in api_app.md` + `6076ef8 Add V(1) tracing and log silent probe errors in kubernetes-reconciler` + `f7123c5 Set KubernetesReady=False when kubeconfig merge fails` (S1, S2, and S3+S4+S5 folded into the existing PR commits via `git commit --fixup` + `git rebase --autosquash`) |

| # | Finding | Action | Reason |
|---|---------|--------|--------|
| 1 | Suggestion #1: `Settled` docstring and prose still omit `KubernetesReady` | Fixed | |
| 2 | Suggestion #2: `ensureServiceRunning` bakes `-v <N>` into the persisted `args.json` on first autostart | Fixed | |
| 3 | Suggestion #3: Four of six `Reconcile` branches remain uncovered | Test written | |
| 4 | Suggestion #4: Tests do not isolate `HOME` | Fixed | |
| 5 | Suggestion #5: Float-based duration comparison breaks parallel with the sibling test | Fixed | |
| 6 | Suggestion #6: `predicates` package docstring frames a one-member package as a collection | Skipped | Package serves as the shared home for app-controller-tree predicates; the docstring frames intended future shape, not a bug |
| 7 | Testing Gap #3: App reconciler has no dedicated test that `Settled` forwards `MergeFailed` | Skipped | Generic KubernetesReady-false propagation already exercised; pinning the specific reason string is incremental coverage without a regression vector |
