Deep Review: 20260325-180207-pr-234

Date2026-03-25 18:02
Reporancher-sandbox/rancher-desktop-daemon
Round2
Author@jandubois
PR#234 — Preserve Lima serial logs across VM restarts
Branchrotate-serial-logs
Commits2ce15bc Preserve Lima serial logs across VM restarts
df1d963 Preserve instance logs on LimaVM deletion when RDD_KEEP_LOGS is set
480661b Preserve instance logs in svc delete when RDD_KEEP_LOGS is set
ReviewersClaude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
VerdictMerge with fixes — Windows file-lock race undermines log preservation in CI
Wall-clock time14 min 40 s

Consolidated Review

Executive Summary

This PR adds three log preservation mechanisms: serial log rotation before each VM start, instance log preservation during LimaVM deletion, and the same during rdd svc delete — all gated on RDD_KEEP_LOGS. The implementation extracts a clean Rotate helper from Create and introduces a shared PreserveLogs function with atomic directory naming. The main concern is that both preservation paths treat failures as best-effort warnings, then proceed to delete the source directories — on Windows, where killed processes retain file locks briefly, this race will cause the preservation to fail and the logs to be deleted, defeating the PR’s purpose for Windows CI.

Structure: 0 critical, 1 important, 4 suggestions.


Critical Issues

None.


Important Issues

1. Log preservation failures do not prevent source deletion (important, gap) CodexGemini

Both preservation paths — preserveInstanceLogs in the controller and preserveAllInstanceLogs in the service — log failures as warnings, then allow the caller to delete the source directory. In the controller, preserveInstanceLogs returns void at line 714, and handleDeletion calls limainstance.Delete at line 82 regardless. In the service, preserveAllInstanceLogs continues to the next instance at line 454, and Delete() calls os.RemoveAll(instance.ShortDir()) at line 428.

// preserveInstanceLogs moves log files from the Lima instance directory to
// a subdirectory of the service log directory before the instance is deleted.
// This is a no-op unless RDD_KEEP_LOGS is set.
func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) {
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		return
	}

	logger := log.FromContext(ctx)
	count, err := instance.PreserveLogs(inst.Dir, inst.Name)
	if err != nil {
		logger.Error(err, "Failed to preserve instance logs")
		return
	}
	if count > 0 {
		logger.Info("Preserved instance logs", "instance", inst.Name, "count", count)
	}
}

The handleDeletion flow calls preserveInstanceLogs at line 77, then unconditionally proceeds to limainstance.Delete at line 82:

	if existingInst != nil {
		// Only use PID-based force-stop for Running instances. Broken
		// instances may have stale PID files pointing to recycled processes
		// on Windows (Lima's ReadPIDFile treats any live PID as valid).
		// Not tested: simulating stale PID files requires Windows-specific
		// PID file manipulation that BATS cannot easily reproduce.
		if existingInst.Status == limatype.StatusRunning {
			stopInstanceForcibly(ctx, logger, existingInst)
		} else if existingInst.VMType == limatype.WSL2 {
			// A "stopped" WSL2 distro can retain kernel state that deadlocks
			// wsl.exe --unregister. Terminate it without PID-based killing,
			// since the PIDs may have been recycled on Windows.
			terminateWSL2Distro(ctx, logger, existingInst.Name)
		}
		if runtime.GOOS == "windows" {
			// Clear PIDs so Lima's Delete → StopForcibly does not kill
			// unrelated processes if the PIDs were recycled. Windows recycles
			// PIDs aggressively, and Lima's ReadPIDFile treats any live PID
			// as valid. On Unix, PID recycling is rare (wraps around 32768+),
			// so we let Lima's Delete clean up any surviving driver processes.
			//
			// This disables Lima's internal kill retry even if stopInstanceForcibly
			// failed above. That is intentional: a failed kill means KillTree
			// could not reach the process (access denied, already reaped), and
			// the PID may already be recycled. Retrying with a stale PID is
			// worse than letting Delete proceed without a kill.
			existingInst.DriverPID = 0
			existingInst.HostAgentPID = 0
		}
		preserveInstanceLogs(ctx, existingInst)
		logger.Info("Deleting Lima instance", "instance", limaVM.Name)
		// Use a timeout because Lima's WSL2 driver calls wsl.exe --unregister
		// which can hang indefinitely if the WSL subsystem is degraded.
		deleteCtx, deleteCancel := context.WithTimeout(ctx, time.Minute)
		err = limainstance.Delete(deleteCtx, existingInst, true)
		deleteCancel()
		if err != nil {
			logger.Error(err, "Failed to delete Lima instance")
			return ctrl.Result{}, err
		}
		logger.Info("Deleted Lima instance", "instance", limaVM.Name)

In service.go, Delete() calls Stop(), then preserveAllInstanceLogs(), then removes the directories unconditionally:

func Delete() error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	_ = Stop()
	preserveAllInstanceLogs()
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		_ = os.RemoveAll(instance.LogDir())
	}
	_ = os.RemoveAll(instance.ShortDir())
	return os.RemoveAll(instance.Dir())
}

// preserveAllInstanceLogs moves .log files from each Lima instance directory
// to the service log directory before the instance directories are deleted.
// This is a no-op unless RDD_KEEP_LOGS is set.
func preserveAllInstanceLogs() {
	if os.Getenv("RDD_KEEP_LOGS") == "" {
		return
	}
	entries, err := os.ReadDir(instance.LimaHome())
	if err != nil {
		if !os.IsNotExist(err) {
			logrus.WithError(err).Warn("Failed to read Lima instance directory")
		}
		return
	}
	for _, entry := range entries {
		if !entry.IsDir() {
			continue
		}
		instDir := filepath.Join(instance.LimaHome(), entry.Name())
		count, err := instance.PreserveLogs(instDir, entry.Name())
		if err != nil {
			logrus.WithError(err).WithField("instance", entry.Name()).Warn("Failed to preserve instance logs")
			continue
		}
		if count > 0 {
			logrus.WithFields(logrus.Fields{"instance": entry.Name(), "count": count}).Info("Preserved instance logs")
		}
	}
}

On Windows, stopInstanceForcibly (line 55) and Stop() (line 423) terminate processes, but Windows does not release file locks until the process handle is fully closed. The subsequent os.Rename calls in PreserveLogs (logs.go:54) fail with ERROR_SHARING_VIOLATION, the error is logged, and the source logs are then deleted — the exact outcome this PR aims to prevent.

On Unix this race is unlikely because killed processes release file descriptors immediately, and os.Rename does not require exclusive access. The practical impact is limited to Windows CI, which is the PR author’s primary concern (issue #247).

Fix: In preserveInstanceLogs, return an error and let handleDeletion decide whether to retry or proceed. In Delete(), add a brief time.Sleep or poll for file accessibility after Stop() on Windows before calling preserveAllInstanceLogs. Alternatively, accept the best-effort design but document the Windows limitation with a TODO.

-func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) {
+func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) error {
     if os.Getenv("RDD_KEEP_LOGS") == "" {
-        return
+        return nil
     }
     logger := log.FromContext(ctx)
     count, err := instance.PreserveLogs(inst.Dir, inst.Name)
     if err != nil {
-        logger.Error(err, "Failed to preserve instance logs")
-        return
+        return fmt.Errorf("preserve instance logs: %w", err)
     }
     if count > 0 {
         logger.Info("Preserved instance logs", "instance", inst.Name, "count", count)
     }
+    return nil
 }

Suggestions

1. Rotate could handle nonexistent directory gracefully (suggestion, enhancement) Gemini

Rotate propagates os.ReadDir errors, including fs.ErrNotExist. In startInstance (line 410 of limavm_lifecycle.go), inst.Dir always exists because store.Inspect has already succeeded, so this does not trigger in practice. However, as a public API, Rotate would be more robust if it treated a missing directory as a no-op — there are no logs to rotate.

func Rotate(dir, name string, keepAll bool) error {
	pattern := regexp.MustCompile(`^` + regexp.QuoteMeta(name) + `\.(\d+)\.log$`)

	entries, err := os.ReadDir(dir)
	if err != nil {
		return fmt.Errorf("read log directory %s: %w", dir, err)
	}

	// Find the highest existing sequence number.
	maxN := 0
	var numberedFiles []numberedFile
	for _, entry := range entries {
		matches := pattern.FindStringSubmatch(entry.Name())
		if matches == nil {
			continue

Fix: Return nil for fs.ErrNotExist.

 entries, err := os.ReadDir(dir)
 if err != nil {
+    if errors.Is(err, fs.ErrNotExist) {
+        return nil
+    }
     return fmt.Errorf("read log directory %s: %w", dir, err)
 }
2. os.Rename fails across filesystems (suggestion, gap) ClaudeGemini

os.Rename returns EXDEV when source and destination are on different filesystems. In practice, LimaHome() (under ~/.rd{suffix}/lima/) and LogDir() (under ~/Library/Application Support/rancher-desktop-{instance}/log/) are on the same filesystem. On Windows, both live under %LOCALAPPDATA%. A cross-device failure would require deliberate filesystem layout changes.

	count := 0
	for _, entry := range logFiles {
		src := filepath.Join(srcDir, entry.Name())
		dst := filepath.Join(destDir, entry.Name())
		if err := os.Rename(src, dst); err != nil {
			return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
		}
		count++
	}

Fix: If this ever becomes an issue, add a copy-then-delete fallback. For now, a comment noting the assumption suffices.

3. Empty directory left behind on partial PreserveLogs failure (suggestion, gap) Gemini

nextAvailableDir creates the destination directory before the rename loop. If the first os.Rename fails, the directory is left empty. Successive failures accumulate empty numbered directories (opensuse, opensuse.2, opensuse.3, ...).

	logDir := LogDir()
	if err := os.MkdirAll(logDir, 0o700); err != nil {
		return 0, fmt.Errorf("create log directory %s: %w", logDir, err)
	}

	destDir, err := nextAvailableDir(logDir, instanceName)
	if err != nil {
		return 0, err
	}

	count := 0
	for _, entry := range logFiles {
		src := filepath.Join(srcDir, entry.Name())
		dst := filepath.Join(destDir, entry.Name())
		if err := os.Rename(src, dst); err != nil {
			return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
		}
		count++
	}

Fix: If count == 0 on error return, remove the empty directory.

+    if count == 0 {
+        os.Remove(destDir)
+    }
     return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
4. Test files use different os.IsNotExist conventions (suggestion, regression) Claude

logs_test.go uses the older os.IsNotExist(err) pattern while logfile_test.go (in the same PR) uses the modern errors.Is(err, fs.ErrNotExist). Both work, but the inconsistency within one PR is avoidable.

Fix: Align logs_test.go to use errors.Is(err, fs.ErrNotExist) to match logfile_test.go.


Design Observations

Strengths


Testing Assessment

Test coverage is good. New unit tests cover:

Untested scenarios ranked by risk:

  1. Windows file-lock race in preservation — No test covers RDD_KEEP_LOGS=1 with a LimaVM deletion while processes still hold file locks. This is the path most likely to fail on Windows CI.
  2. PreserveLogs partial failure — If a rename fails midway, the function returns a partial count and an error. No test verifies the destination directory state (some files moved, empty directory left behind).
  3. End-to-end svc delete with log preservation — No integration test covers rdd svc delete preserving instance logs and removing LIMA_HOME afterward.

Documentation Assessment

Documentation updates are thorough:

One minor gap: the header comment in scripts/collect-bats-logs.sh still describes the script as collecting service and hostagent logs, but the script now also collects preserved instance-log directories. Codex


Commit Structure

The three commits represent a clean progression:

  1. 2ce15bc — Extract Rotate, add serial log rotation. Self-contained.
  2. df1d963 — Add instance log preservation during LimaVM deletion and ShortDir cleanup.
  3. 480661b — Extract PreserveLogs helper, add svc delete preservation. Refactors the inline implementation from commit 2 into a shared helper.

Commit messages accurately describe the changes. No fixup commits.


Agent Performance Retro

Claude Opus 4.6

Codex GPT 5.4

Gemini 3.1 Pro

Summary

MetricClaude Opus 4.6Codex GPT 5.4Gemini 3.1 Pro
Duration6:103:405:39
Critical000
Important022
Suggestion202
Design observations521
False positives00 (1 overscoped)1 (severity inflation)
Unique insights222
Files reviewed101010
Coverage misses1 (Windows race)01 (controller preserve gap)

Codex delivered the most actionable finding (preserve-then-delete race) in the shortest time. Gemini independently confirmed the Windows angle and added the empty-directory cleanup suggestion. Claude provided the best design analysis and highest signal-to-noise ratio but missed the review’s central finding. The multi-agent approach proved its value: no single agent identified all the important issues.