-
Notifications
You must be signed in to change notification settings - Fork 985
lxd: Make the LXD shutdown sequence more concurrent to avoid long-running operations blocking unrelated instances shutdown #15016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lxd: Make the LXD shutdown sequence more concurrent to avoid long-running operations blocking unrelated instances shutdown #15016
Conversation
|
@tomponline we can start discussing the approach. I mark it as draft as I'm stillin the process to test it but just wanted to let you know about my idea. @simondeziel what was the test setup you used so that I can reproduce a bunch of long running operations and start testing this appraoch. |
It's been a while and it seems it was an image refresh taking longer than usual. I think mimic something akin with: Then |
cc831d8 to
34c2e89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShellCheck found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
34c2e89 to
258618c
Compare
2542323 to
42ae29a
Compare
42ae29a to
8cc26b7
Compare
|
If one wants to reproduce the tests: this should be working https://paste.ubuntu.com/p/MGxxJz35N7/ (still fuguring out why this is different in the CI) |
8cc26b7 to
a428ad5
Compare
45e8db0 to
1569f90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the testing side. The standalone+LVM is the most impacted in terms of test duration with ~6m10s but I don't think much can be done about it. For standalone+ceph, it takes ~3m45s so it's not that bad.
12b0719 to
6e567f0
Compare
|
@tomponline fixed the comments. Tests are green. |
tomponline
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few minor changes and a question, thanks!
`pendingInstanceOperations` is only used in the LXD shutdown sequence Stop() function at it is a blocking function. We replace it by a non-blocking function that simply populate a map of instance URL to their operation. It is also worth mentionning that once the LXD shutdown sequence is triggered, no more operation can be added, so I don't see the use of "waiting". Signed-off-by: Gabriel Mougard <[email protected]>
…(staticcheck)`) Signed-off-by: Gabriel Mougard <[email protected]>
We will no longer call `daemonStorageVolumesUnmount` in a separate goroutine but after the instancesShutdown function. However, we still need to be able to cancel the volume unmount after a timeout, hence the introduction of the 'select' statement with a context handling. This change also enable a context to cancel an ongoing unmount procedure. Signed-off-by: Gabriel Mougard <[email protected]>
…sShutdown` function * `isInstanceBusy`: check if an instance has a running op tied to it. Signed-off-by: Gabriel Mougard <[email protected]>
This enables `instancesShutdown` to react upon operation changes (from busy to idle) and to context cancellation while keeping the shutdown order correct if instances have `boot.stop.priority` set. Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
For the testing phase, we need a convenient way to create dummy operations in LXD with a certain duration. Signed-off-by: Gabriel Mougard <[email protected]>
… tests Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
6e567f0 to
e08ef31
Compare
tomponline
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks!
| for i := 0; i < maxConcurrent; i++ { | ||
| go func(instShutdownCh <-chan instance.Instance) { | ||
| for inst := range instShutdownCh { | ||
| logger.Debug("Instance received for shutdown", logger.Ctx{"project": inst.Project().Name, "instance": inst.Name(), "timestamp": time.Now().UnixNano()}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gabrielmougard are the new timestamp nano second fields required, they are rather noisy.
closes: #13460
JIRA ticket: https://warthogs.atlassian.net/browse/LXD-2600