fix: shutdown hook deadlock under leader election and deprecate Operator#installShutdownHook(Duration) by Dennis-Mircea · Pull Request #3383 · operator-framework/java-operator-sdk
Summary
Resolves all three concerns raised in #3376:
- The JVM shutdown hook installed via
Operator#installShutdownHook(...)was previously skipped whenever leader election was enabled (introduced in fix: leader election stop deadlock #1618 to work around the deadlock reported in LeaderElectionManager#stopLeading maybe deadlock with Operator#installShutdownHook #1614). As a result, a leader pod receivingSIGTERMdid not release its lease, forcing standby replicas to wait for lease expiry before they could take over. - The
gracefulShutdownTimeoutargument onOperator#installShutdownHook(Duration)has been ignored since PR feat: support for graceful shutdown based on configuration #2479. The reconciliation termination timeout is now read fromConfigurationService#reconciliationTerminationTimeout(). - The actual rationale for the leader-election skip and the dead
Durationparameter was not documented anywhere; neither were theOperator#stop()shutdown sequence and theLeaderElectionManagerlifecycle.
What changed
Fix the underlying deadlock (LeaderElectionManager)
The deadlock from #1614 was very specific: Operator#stop() called from inside a JVM shutdown hook would cancel the leader-election future, which fired the onStopLeading callback, which called System.exit(1). That System.exit then blocked indefinitely on the java.lang.Shutdown class lock that the shutdown hook thread itself was already holding.
This PR breaks the recursion. LeaderElectionManager now carries a stoppingGracefully AtomicBoolean, set in stop() before the future is cancelled and checked at the top of stopLeading(). When the flag is set, stopLeading() returns immediately instead of invoking System.exit. The "restart on lost lead" behavior (when the leader-election library detects a real lost lease without a prior stop()) is preserved because that path runs without the flag ever being set.
Re-enable the shutdown hook unconditionally (Operator)
With the deadlock fixed at its source, the conditional in Operator#installShutdownHook() is no longer required. The hook is now registered regardless of leader-election state, and the "Leader election is on, shutdown hook will not be installed." warn log line is removed. A leader pod receiving SIGTERM will run the hook, which calls Operator#stop(), which now cleanly cancels the leader-election future and releases the lease.
Deprecate installShutdownHook(Duration) (Operator)
The Duration argument has been dead since #2479. This PR adds a new no-arg installShutdownHook() overload (the recommended replacement) whose JavaDoc points users at ConfigurationServiceOverrider#withReconciliationTerminationTimeout(Duration) as the real configuration knob. The existing installShutdownHook(Duration) is marked @Deprecated(forRemoval = true) and delegates to the no-arg overload. The Duration parameter is kept only for source and binary compatibility.
Documentation additions
- Class-level JavaDoc on
LeaderElectionManagerexplaining its role, configuration entry points, theLease-based coordination, the three-way behavior ofstopLeading(), and the lifecycle ownership byOperator. - JavaDoc on
Operator#stop()documenting the four-step shutdown sequence (controller manager, executor service manager with timeout, leader-election manager, optional client close), an explicit "safe to call from a JVM shutdown hook" guarantee, the not-started edge-case behavior, and thecloseClientOnStop()opt-out (defaulttrue). - JavaDoc on
Operator#installShutdownHook()explaining the timeout configuration knob and adding aterminationGracePeriodSecondsnote that recommends sizing the pod's grace period to fit the configuredreconciliationTerminationTimeoutplus a small buffer.
Regression test
LeaderElectionManagerTest#stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst calls stop() and then stopLeading() directly. If the graceful-shutdown short-circuit is ever reintroduced as System.exit(1), this test method would terminate the JUnit JVM rather than failing cleanly, making the regression impossible to miss in CI. LeaderElectionManager#stopLeading was lowered from private to protected to enable the test (and to allow subclasses to extend the behavior).
Behavior summary by scenario
| Scenario | Before this PR | After this PR |
|---|---|---|
No leader election, SIGTERM |
Hook installed, stop() runs, graceful shutdown |
Same |
Leader election, leader receives SIGTERM |
Hook not installed, JVM exits, lease leaks | Hook installed, stop() runs, lease released cleanly |
Leader election, non-leader receives SIGTERM |
Hook not installed | Hook installed, stop() is effectively a no-op for leader-election state (no lease to release) |
Leader loses lead (onStopLeading fires without prior stop()) |
System.exit(1) triggered, JVM restarts |
Same |
installShutdownHook(Duration) called by user |
Duration silently ignored |
Duration silently ignored, plus deprecation warning at compile time |
Test plan
mvn -pl operator-framework-core testpasses, including the newstopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirstregression.- Spotless / Google Java Format clean.
- Compiler emits a deprecation warning at every existing call site of
installShutdownHook(Duration).