K8SPS-357: Improve full cluster crash recovery #928
Merged
Our full cluster crash recovery procedure requires at least 1 restart on the primary and 3 restarts on each secondary:

1. Cluster is started after the crash
2. Pods are started
3. Full cluster crash is detected (1st restart)
4. Operator reboots the cluster
5. Secondary pods are restarted to join the cluster (2nd restart)
6. Secondary pods receive data with Clone (3rd restart)

Even though these restarts are by design, they give the impression that something is wrong with the cluster.

These changes attempt to reduce restarts to 1. After a successful crash recovery, the operator deletes all secondary pods so they can join the cluster. The only remaining restart is the one required after clone (the 3rd restart above). Secondary pods are deleted on a **best-effort** basis: if they cannot be deleted, the operator takes no further action. In that case the secondary pods should be ready to serve traffic after 3-4 restarts.

---

To recover a cluster from a full cluster crash, we use `dba.rebootClusterFromCompleteOutage` in mysql-shell. This command connects to each MySQL pod to find the node with the latest transaction and reboots the cluster from it. This means mysqld needs to be up and running during crash recovery.

After these changes, pods will be marked ready only if the MySQL state is ready in `$MYSQL_STATE_FILE`.

---

This commit also introduces more events in PerconaServerMySQL:

```
Events:
  Type     Reason                     Age                     From           Message
  ----     ------                     ----                    ----           -------
  Warning  ClusterStateChanged        6m33s                   ps-controller  -> Initializing
  Warning  ClusterStateChanged        5m10s                   ps-controller  Initializing -> Error
  Warning  FullClusterCrashDetected   3m32s (x23 over 5m10s)  ps-controller  Full cluster crash detected
  Normal   FullClusterCrashRecovered  2m40s                   ps-controller  Cluster recovered from full cluster crash
  Warning  ClusterStateChanged        2s                      ps-controller  Initializing -> Ready
```
I think we have conflicts with the MySQL 8.4 support PR.
commit: 7366156
hors approved these changes on Aug 6, 2025
pooknull approved these changes on Aug 6, 2025
nmarukovich approved these changes on Aug 6, 2025