fix: handle corner cases in the async preemption #133213
base: master
Conversation
This reverts commit 006d762.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sanposhiho

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment.
/hold
Need approver's approval + I'd like to make sure the tests pass several times.
force-pushed from 1f29ab1 to f6cc41c (Compare)
- errCh.SendErrorWithCancel(err, cancel)
+ errCh.SendError(err)
We shouldn't cancel the context if the error is NotFound. So I changed this to SendError and call cancel() at the receiver side instead.
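For reference, a minimal annotated sketch of the two calls being swapped; the method names are from the scheduler's parallelize.ErrorChannel, while err and cancel stand in for the surrounding PR code:

```go
// Before: record the error and cancel the shared context in one step, which
// also asks the remaining Parallelizer().Until workers to stop early.
errCh.SendErrorWithCancel(err, cancel)

// After (this change): only record the error here; the code that later calls
// errCh.ReceiveError() decides whether calling cancel() is appropriate.
errCh.SendError(err)
```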
Canceling the context has another purpose as well: it terminates the Parallelizer().Until, i.e., if there is an error, we stop preempting the rest of the pods.
So, I think we need to check apierrors.IsNotFound(err) here and handle it appropriately.
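A rough sketch of the check being suggested, assuming it sits in the per-victim goroutine and that apierrors is the usual alias for k8s.io/apimachinery/pkg/api/errors; preemptPod, victim, errCh, and cancel stand in for the surrounding PR code and are illustrative:

```go
if err := preemptPod(ctx, victim); err != nil {
	if apierrors.IsNotFound(err) {
		// The victim is already gone, so there is nothing left to preempt
		// for this pod; keep going with the remaining victims.
		return
	}
	// Any other error is a real failure: record it and cancel the context
	// so the remaining workers stop early.
	errCh.SendErrorWithCancel(err, cancel)
}
```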
Yes, so calling cancel() at the receiver side is equivalent, right?
Checking the NotFound error here would be a bit tricky, and I prefer the current implementation if you don't have an objection.
> Yes, so calling cancel() at the receiver side is equivalent, right?

No, Parallelizer().Until() blocks until all goroutines are finished. Then the error is checked. So, in the current implementation, you would cancel the context only after preemption had been attempted for all pods.
Also, the error channel should be written to only once (as the buffer length is 1). Currently, if one preemptPod returned a NotFound error and a second returned another error, we would only handle the NotFound one.
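To make this concrete, here is a small standalone sketch that mimics the buffer-of-one behavior described above (a simplification for illustration, not the real ErrorChannel code):

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// A buffer of one plus a non-blocking send: only the first error survives.
	errCh := make(chan error, 1)
	send := func(err error) {
		select {
		case errCh <- err:
		default: // the buffer already holds an error; this one is dropped
		}
	}

	send(errors.New(`pods "victim-1" not found`)) // stored
	send(errors.New("some other failure"))        // silently dropped

	fmt.Println(<-errCh) // only the first (NotFound-style) error gets handled
}
```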
I think the simplest would be just to set an allPodsAlreadyDeleted atomic boolean in preemptPod.
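A minimal sketch of that suggestion, using the names proposed in this thread; it assumes the usual imports (sync/atomic for atomic.Bool and k8s.io/apimachinery/pkg/api/errors as apierrors), and the surrounding wiring (fh, victims, preemptPod, errCh, cancel) is illustrative rather than the exact PR code:

```go
var allPodsAlreadyDeleted atomic.Bool // true only if every victim was already gone
allPodsAlreadyDeleted.Store(true)

fh.Parallelizer().Until(ctx, len(victims), func(i int) {
	err := preemptPod(ctx, victims[i])
	switch {
	case apierrors.IsNotFound(err):
		// Already deleted by someone else; the flag stays true so far.
	case err != nil:
		// A real failure: record it and stop the remaining workers.
		allPodsAlreadyDeleted.Store(false)
		errCh.SendErrorWithCancel(err, cancel)
	default:
		// We deleted this victim ourselves, so a pod/delete event will follow.
		allPodsAlreadyDeleted.Store(false)
	}
}, "preemptPods")
```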
> No, Parallelizer().Until() blocks until all goroutines are finished.

It's a good point, fixed.
st.MakePod().Name("medium").Priority(mediumPriority).Req(map[v1.ResourceName]string{v1.ResourceCPU: "3"}).Obj(), | ||
}, | ||
{ | ||
st.MakePod().Name("high").Priority(highPriority).Req(map[v1.ResourceName]string{v1.ResourceCPU: "3"}).Obj(), | ||
st.MakePod().Name("high").Priority(highPriority).Req(map[v1.ResourceName]string{v1.ResourceCPU: "4"}).Obj(), |
This doesn't matter; it's just that the test name didn't match what it tests.
	cancel()
	result = metrics.GoroutineResultError
default:
	allPodsAlreadyDeleted = false
I think we shouldn't "disable activate on finish" here, because we don't rely on notifications sent for pods that were removed asynchronously. We explicitly call removing the last pod synchronously to have a guarantee we don't miss the notification.

For that reason I'd rename allPodsAlreadyDeleted to something like noGuaranteedNotification, which would be false at the beginning and turn to true after we send the last removal synchronously.
> We explicitly call removing the last pod synchronously to have a guarantee we don't miss the notification.

Deleting the last pod synchronously is for a completely different scenario: if all pods are deleted asynchronously and all pod/delete events arrive at the queue immediately, before the preemptor pod is un-gated, the preemptor pod could miss all of the pod/delete events.

What we need to do here is: when all pods turn out to be already deleted, activate the preemptor pod, because if all pods are already deleted, the preemptor pod might have missed the pod/delete events while it was gated.

So I don't think your proposed change would cover the corner case we're solving here.
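A rough sketch of the corner-case handling described here; the Activate call and all surrounding names (fh, logger, preemptor, allPodsAlreadyDeleted) are assumptions for illustration rather than the exact PR code:

```go
// After all victim deletions have been attempted: if every victim turned out
// to be deleted already, their pod/delete events may have fired while the
// preemptor was still gated, so nothing would move it back to the active
// queue. Re-activate it explicitly.
if allPodsAlreadyDeleted {
	fh.Activate(logger, map[string]*v1.Pod{preemptor.Name: preemptor})
}
```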
force-pushed from f6cc41c to 1d3ee06 (Compare)
force-pushed from 1d3ee06 to 6d11c36 (Compare)
What type of PR is this?
/kind bug
What this PR does / why we need it:
See #133167, which was reverted due to a failing test. This PR contains an additional fix to that test as well.
Which issue(s) this PR is related to:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: