◐ Shell
clean mode source ↗

fix: shutdown hook deadlock under leader election and deprecate Operator#installShutdownHook(Duration) by Dennis-Mircea · Pull Request #3383 · operator-framework/java-operator-sdk

Summary

Resolves all three concerns raised in #3376:

  1. The JVM shutdown hook installed via Operator#installShutdownHook(...) was previously skipped whenever leader election was enabled (introduced in fix: leader election stop deadlock #1618 to work around the deadlock reported in LeaderElectionManager#stopLeading maybe deadlock with Operator#installShutdownHook #1614). As a result, a leader pod receiving SIGTERM did not release its lease, forcing standby replicas to wait for lease expiry before they could take over.
  2. The gracefulShutdownTimeout argument on Operator#installShutdownHook(Duration) has been ignored since PR feat: support for graceful shutdown based on configuration #2479. The reconciliation termination timeout is now read from ConfigurationService#reconciliationTerminationTimeout().
  3. The actual rationale for the leader-election skip and the dead Duration parameter was not documented anywhere; neither were the Operator#stop() shutdown sequence and the LeaderElectionManager lifecycle.

What changed

Fix the underlying deadlock (LeaderElectionManager)

The deadlock from #1614 was very specific: Operator#stop() called from inside a JVM shutdown hook would cancel the leader-election future, which fired the onStopLeading callback, which called System.exit(1). That System.exit then blocked indefinitely on the java.lang.Shutdown class lock that the shutdown hook thread itself was already holding.

This PR breaks the recursion. LeaderElectionManager now carries a stoppingGracefully AtomicBoolean, set in stop() before the future is cancelled and checked at the top of stopLeading(). When the flag is set, stopLeading() returns immediately instead of invoking System.exit. The "restart on lost lead" behavior (when the leader-election library detects a real lost lease without a prior stop()) is preserved because that path runs without the flag ever being set.

Re-enable the shutdown hook unconditionally (Operator)

With the deadlock fixed at its source, the conditional in Operator#installShutdownHook() is no longer required. The hook is now registered regardless of leader-election state, and the "Leader election is on, shutdown hook will not be installed." warn log line is removed. A leader pod receiving SIGTERM will run the hook, which calls Operator#stop(), which now cleanly cancels the leader-election future and releases the lease.

Deprecate installShutdownHook(Duration) (Operator)

The Duration argument has been dead since #2479. This PR adds a new no-arg installShutdownHook() overload (the recommended replacement) whose JavaDoc points users at ConfigurationServiceOverrider#withReconciliationTerminationTimeout(Duration) as the real configuration knob. The existing installShutdownHook(Duration) is marked @Deprecated(forRemoval = true) and delegates to the no-arg overload. The Duration parameter is kept only for source and binary compatibility.

Documentation additions

  • Class-level JavaDoc on LeaderElectionManager explaining its role, configuration entry points, the Lease-based coordination, the three-way behavior of stopLeading(), and the lifecycle ownership by Operator.
  • JavaDoc on Operator#stop() documenting the four-step shutdown sequence (controller manager, executor service manager with timeout, leader-election manager, optional client close), an explicit "safe to call from a JVM shutdown hook" guarantee, the not-started edge-case behavior, and the closeClientOnStop() opt-out (default true).
  • JavaDoc on Operator#installShutdownHook() explaining the timeout configuration knob and adding a terminationGracePeriodSeconds note that recommends sizing the pod's grace period to fit the configured reconciliationTerminationTimeout plus a small buffer.

Regression test

LeaderElectionManagerTest#stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst calls stop() and then stopLeading() directly. If the graceful-shutdown short-circuit is ever reintroduced as System.exit(1), this test method would terminate the JUnit JVM rather than failing cleanly, making the regression impossible to miss in CI. LeaderElectionManager#stopLeading was lowered from private to protected to enable the test (and to allow subclasses to extend the behavior).

Behavior summary by scenario

Scenario Before this PR After this PR
No leader election, SIGTERM Hook installed, stop() runs, graceful shutdown Same
Leader election, leader receives SIGTERM Hook not installed, JVM exits, lease leaks Hook installed, stop() runs, lease released cleanly
Leader election, non-leader receives SIGTERM Hook not installed Hook installed, stop() is effectively a no-op for leader-election state (no lease to release)
Leader loses lead (onStopLeading fires without prior stop()) System.exit(1) triggered, JVM restarts Same
installShutdownHook(Duration) called by user Duration silently ignored Duration silently ignored, plus deprecation warning at compile time

Test plan

  • mvn -pl operator-framework-core test passes, including the new stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst regression.
  • Spotless / Google Java Format clean.
  • Compiler emits a deprecation warning at every existing call site of installShutdownHook(Duration).