Deep Review: PR #238
| Date | 2026-03-16 |
| PR | #238 — Split RDD finalizer into /cleanup and /owned |
| Branch | split-finalizers |
| Commits | 9de9269 Split RDD finalizer into /cleanup and /owned |
| Reviewers | Claude Codex Gemini |
| Verdict | Merge with fixes — upgrade path for legacy finalizers must be addressed |
| Wall-clock time | 12 min 27 s |
Executive Summary
This PR splits the single shared rdd.rancherdesktop.io/cleanup finalizer into two purpose-specific finalizers: /cleanup for self-protection (a resource's own controller runs cleanup before deletion) and /owned for cascade blocking (an owner prevents child deletion until explicit release via DeleteOwnedResources). The split fixes a real bug where DeleteOwnedResources stripped the only finalizer on a child resource, bypassing the child controller's cleanup logic.
The implementation is correct for new resources but leaves existing resources stranded on upgrade: legacy template ConfigMaps still carry /cleanup, which DeleteOwnedResources no longer strips.
Critical Issues
None.
Important Issues
Before this PR, the LimaVM webhook created template ConfigMaps with rdd.rancherdesktop.io/cleanup (the old FinalizerName). After this PR, DeleteOwnedResources strips only /owned. When a user deletes a LimaVM created before the upgrade:
DeleteOwnedResourcestries to strip/ownedfrom the template ConfigMap at line 162 — not present, so no-op.c.Deleteat line 176 marks the ConfigMap for deletion.- The
ValidateDeletewebhook at controller.go:276 checksHasOwnedFinalizer— not present, so allows deletion. - But the old
/cleanupfinalizer still blocks actual deletion. No controller removes/cleanupfrom child resources. - The template ConfigMap is stuck in Terminating permanently.
for _, item := range list.Items {
if !IsOwnedByUID(&item, owner.GetUID()) {
continue
}
// Strip the owned finalizer to unblock deletion. The cleanup finalizer (if any)
// remains so the child's own controller can still run its cleanup logic.
patch := client.MergeFrom(item.DeepCopy())
if controllerutil.RemoveFinalizer(&item, OwnedFinalizerName) {
// Use Patch instead of Update for PartialObjectMetadata
if err := c.Patch(ctx, &item, patch); err != nil {
itemLogger := logger.V(1).WithValues("namespace", item.GetNamespace(), "name", item.GetName(), "gvk", gvk.String())
if apierrors.IsNotFound(err) {
itemLogger.Info("Resource already deleted during finalizer removal")
continue
}
// For other errors, log and collect error but proceed with deletion attempt
// The deletion might still succeed if there are no other blocking finalizers
itemLogger.Info("Failed to remove finalizer, will attempt deletion anyway", "error", err)
errs = append(errs, fmt.Errorf("failed to remove finalizers from %s %s/%s: %w", gvk, item.GetNamespace(), item.GetName(), err))
}
}
if err := c.Delete(ctx, &item); err != nil && client.IgnoreNotFound(err) != nil {
errs = append(errs, fmt.Errorf("failed to delete %s %s/%s during cleanup: %w", gvk, item.GetNamespace(), item.GetName(), err))
}
}
The webhook at controller.go:210-211 now creates new template ConfigMaps with /owned, but existing ones carry /cleanup:
// Create template ConfigMap with label
templateConfigMap := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: limavm.GetTemplateConfigMapName(),
Namespace: limavm.Namespace,
Labels: map[string]string{
controllers.TemplateConfigMapLabel: "true",
},
Finalizers: []string{
base.OwnedFinalizerName,
},
},
Data: map[string]string{
v1alpha1.TemplateConfigMapKey: templateData,
},
}
All three agents identified this issue. Claude rated it important/gap; Codex rated it important/regression; Gemini rated it critical/regression. The correct classification is important, regression: the PR introduces the incompatibility, but recovery is straightforward in the single-user embedded context (delete the kine database or manually patch the finalizer).
Fix: Migrate legacy children. In the LimaVM reconciler, when fetching the template ConfigMap during normal reconciliation, check whether it carries /cleanup without /owned. If so, patch to replace /cleanup with /owned. This keeps the migration close to the only controller that creates these ConfigMaps and avoids polluting the generic DeleteOwnedResources helper with knowledge of the legacy scheme.
Suggestions
After this PR, template ConfigMaps carry the /owned finalizer. The practical conclusion (delete in one pass) remains correct because DeleteOwnedResources strips /owned before issuing the delete, but the stated reasoning at lines 60-64 is wrong. Blame confirms these lines predate this PR (de1f2298), but the PR made the comment inaccurate.
// Delete owned resources and remove the finalizer in one pass. This is safe
// because owned resources (ConfigMaps) delete instantly and have no finalizers.
// If we later own resources with complex teardown, split this into two reconcile
// cycles: delete owned resources, then verify they are gone before removing
// the finalizer.
if err := base.DeleteOwnedResources(ctx, r.Client, limaVM, r.Manager); err != nil {
logger.Error(err, "Failed to delete owned resources")
return ctrl.Result{}, err
}
// Remove finalizer
if err := base.RemoveCleanupFinalizer(ctx, r.Client, limaVM); err != nil {
logger.Error(err, "Failed to remove finalizer")
return ctrl.Result{}, err
}
Fix: Rewrite the comment to explain that DeleteOwnedResources strips /owned before deletion, and the two-cycle caveat applies if a child ever needs /cleanup alongside /owned.
/cleanup finalizer
Claude
Codex
The test covers no finalizers, owned-only, and both-finalizers. The missing case is a child with only /cleanup — the upgrade scenario from Important Issue #1. RemoveFinalizer at line 162 returns false (different code path), and the resource gets stuck in Terminating.
testCases := []struct {
name string
finalizers []string
// expectDeleted is true when the owned resource should be fully deleted
// (NotFound) after DeleteOwnedResources. When false, the resource still
// exists because a remaining finalizer blocks deletion.
expectDeleted bool
}{
{
name: "config map without finalizers",
expectDeleted: true,
},
{
name: "config map with owned finalizer only",
finalizers: []string{OwnedFinalizerName},
expectDeleted: true,
},
{
name: "config map with both finalizers strips only owned",
finalizers: []string{CleanupFinalizerName, OwnedFinalizerName},
expectDeleted: false,
},
}
Fix: Add a test case:
{
name: "config map with cleanup finalizer only (legacy)",
finalizers: []string{CleanupFinalizerName},
expectDeleted: false,
},
After implementing the migration fix, add a corresponding integration test that creates a legacy-finalizered ConfigMap and verifies LimaVM deletion cleans it up.
EnsureOwnedFinalizer and RemoveOwnedFinalizer are unused
Gemini
The LimaVM webhook applies OwnedFinalizerName directly in the struct literal (controller.go:211), and DeleteOwnedResources removes it via controllerutil.RemoveFinalizer with a direct patch (finalizer.go:162). Neither caller uses the new helpers.
// EnsureOwnedFinalizer adds the owned finalizer to a child object if not already present.
// Returns true if the finalizer was added and the object has been updated.
func EnsureOwnedFinalizer(ctx context.Context, c client.Client, obj client.Object) (bool, error) {
if !controllerutil.AddFinalizer(obj, OwnedFinalizerName) {
return false, nil
}
if err := c.Update(ctx, obj); err != nil {
return false, fmt.Errorf("failed to add owned finalizer: %w", err)
}
return true, nil
}
// RemoveOwnedFinalizer removes the owned finalizer from the object and updates it.
func RemoveOwnedFinalizer(ctx context.Context, c client.Client, obj client.Object) error {
if controllerutil.RemoveFinalizer(obj, OwnedFinalizerName) {
if err := c.Update(ctx, obj); err != nil {
return fmt.Errorf("failed to remove owned finalizer: %w", err)
}
}
return nil
}
Fix: Either remove these functions or refactor existing code to use them. They provide a consistent API surface for future controllers, so keeping them is reasonable — but document that intent.
Design Observations
DeleteOwnedResources is now schema-version-sensitive (in-scope)
Codex
The helper assumes every child that should be unblocked already uses /owned. Encoding a short compatibility window (stripping the legacy finalizer from owned resources) in the helper would eliminate the entire class of upgrade bugs across all controllers, rather than requiring each controller to handle its own migration. The tradeoff: this would contradict the design principle that DeleteOwnedResources does not strip /cleanup. A targeted migration in the owning controller is safer and keeps the helper's contract clean.
/owned finalizer doubles as a user-facing deletion guard (in-scope)
Gemini
Child resources like template ConfigMaps do not need /owned for the garbage collection loop to work — DeleteOwnedResources can delete resources without a finalizer. The /owned finalizer's primary value is blocking accidental kubectl delete attempts via the ValidateDelete webhook. This layered approach (webhook + finalizer) is sound but worth documenting explicitly so future contributors understand the finalizer's dual purpose.
Testing Assessment
The unit tests cover the three key scenarios for the new design: no finalizer, owned-only (deleted), and both finalizers (owned stripped, cleanup remains). The BATS integration test verifies that LimaVM gets /cleanup and the template ConfigMap gets /owned.
Untested scenarios ranked by risk:
- Upgrade path: Child resource with only the legacy
/cleanupfinalizer processed byDeleteOwnedResources. Highest risk — this is the scenario from Important Issue #1. - Webhook validation during cascade deletion: Verifying that
ValidateDeleteallows deletion of a template ConfigMap immediately afterDeleteOwnedResourcesstrips/owned. - Concurrent owner deletion: Two owners deleting simultaneously, both calling
DeleteOwnedResourceson a shared child. The patch/delete is naturally idempotent (NotFound handled at lines 166-168), so this is likely safe but untested.
Documentation Assessment
The new docs/design/controllers.md documents the finalizer design clearly, including the "Why Two Finalizers?" rationale. Cross-links from README.md and discovery.md are appropriate.
Gap: The design doc should mention the upgrade path for existing resources, or state explicitly that migration from the single-finalizer scheme is required.
Agent Performance Retro
Claude Opus 4.6
- Unique contributions: Identified the stale comment in
limavm_lifecycle.go:60-64. - Accuracy: All findings verified. Correctly classified the upgrade issue as "important, gap" (though "regression" is more precise since this PR introduced the incompatibility).
- Depth: Traced through
DeleteOwnedResources, the webhook, and the reconciler. Used git blame to confirm origins. Explored the design tradeoff between helper-level and controller-level migration. - Signal-to-noise: Excellent. Every finding is actionable. No false positives.
Codex GPT 5.4
- Unique contributions: None beyond the shared upgrade finding.
- Accuracy: All findings verified. Correctly classified the upgrade issue as "important, regression."
- Depth: Traced the full deletion path from
DeleteOwnedResourcesthrough to the parent finalizer removal. Concise analysis. - Signal-to-noise: High signal, very concise review. No false positives.
Gemini 3.1 Pro
- Unique contributions: Identified unused helper functions (
EnsureOwnedFinalizer,RemoveOwnedFinalizer). Flagged the ConfigMap orphan issue (pre-existing gap, not a regression). - Accuracy: Rated the upgrade issue as "critical, regression" — overcalibrated for an embedded single-user context. The ConfigMap orphan finding is legitimate but pre-existing.
- Depth: Read the webhook code, traced the admission flow, and identified the orphan scenario. Good cross-component analysis.
- Signal-to-noise: Good, but the orphan finding predates this PR and the severity calibration was off.
Summary
| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 4:16 | 3:20 | 4:26 |
| Critical | 0 | 0 | 1 (overcalibrated) |
| Important | 1 | 1 | 1 |
| Suggestion | 2 | 1 | 1 |
| Design observations | 2 | 1 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 1 | 0 | 2 |
Claude provided the most thorough analysis with the best severity calibration. Gemini contributed the most unique findings but overcalibrated severity. Codex was accurate and concise but offered no unique insights beyond the shared finding.
Skill Improvement Recommendations
- The prompt's instruction to "calibrate severity for an embedded single-user context" worked for Claude and Codex but not Gemini. Consider adding a concrete example to the repo context (e.g., "an issue that requires deleting the kine database to recover is IMPORTANT, not CRITICAL").
- All three agents found the same upgrade issue, confirming it as the PR's central risk. The prompt dimensions are comprehensive for this type of change.
- No aspects were missed by all agents.
- The clarification round was unnecessary — all findings were clear and consistent.