-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kill all children on SIGTERM #106
base: master
Are you sure you want to change the base?
Conversation
any news? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you confirm whether or not ninja has a similar behavior when it receives SIGTERM?
Extract it from jobwork() so that build() can call it on a signal. Signed-off-by: Paolo Bonzini <[email protected]>
Keep the system clean by propagating SIGTERM to all children, and by not starting new jobs on both SIGTERM and SIGINT. The only tricky bit is that previously fd[i].revents was used to skip both jobs that are not in use and jobs that did not have output; that's because negative file descriptors do not cause POLLNVAL and therefore fd[i].revents is zero for inactive jobs as well. But because all jobs must be killed, build() now has to check fd[i].fd == -1 explicitly. While at it, also clean up jobdone() by clearing job[i].edge; it's not nice to leave a dangling pointer in the jobs array, even if it's harmless. Signed-off-by: Paolo Bonzini <[email protected]>
Thinking about this some more, I'm worried a race condition where the signal arrives outside of the poll. If this happens, then we won't forward the signal to the subprocesses until the next one produces output or finishes. I think writing to a self-pipe in the handler and adding that to the pollfd array is probably the simplest way to solve this. I also did some digging into ninja to see how it deals with these signals and found a few things:
This leaves me with a few questions. Since ninja doesn't forward the signal to any foreground job, doesn't it have the same issue you're fixing here? If SIGTERM is sent to ninja only, what happens? Does the subprocess remain running? I also wonder if whoever sent the SIGTERM ought to have sent it to samurai's process group. Currently, samurai doesn't make new process groups for jobs. In this PR, we're making the assumption that SIGINT is usually sent to the whole process group due to a Ctrl-C. However, if SIGINT is sent to only samurai, then I believe it will just stop starting new jobs and wait until any active jobs finish naturally. Similarly, if SIGTERM is sent to samurai's process group, then I think the subprocesses will end up seeing SIGTERM twice (once from the initial signal, once from samurai). |
That is caught by:
but indeed there is a microscopic window between this line and the poll() right below. I can fix it once we agree on what to do.
ninja forwards the signal to the non-console processes (using process groups):
But I think for SIGTERM it should send it to console processes as well. Unlike SIGINT and SIGHUP, which are sent by the OS, SIGTERM is usually sent by the user with kill, and you cannot assume that the user sent it to the process group. In fact, I'd argue that because the idea of SIGTERM is to let the process clean up after itself, 1) it should not be sent to a process group, 2) it is a bug to not trap it if you spawn processes. The behavior of moving the children in their process group was implemented for ninja-build/ninja#110 with no particular explanation; then it was changed to move the process into a session (ninja-build/ninja#909) and reverted (ninja-build/ninja#1097, but see also ninja-build/ninja#1001). Frankly I wouldn't take it as a good example even if samurai is a "ninja clone". Make instead does the same as my implementation: it does not place processes in separate process groups, and forwards SIGTERM.
That's a problem of the user that sent it to the process group; it's not samurai's problem. Generally I don't think that it would be an issue, because SIGTERM will either exit on the first or trigger orderly cleanup in the child. In the latter case it would be triggered twice but, because signal handlers in general don't do much work themselves, it should be safe to consider SIGTERM idempotent; and if they're not, that should be considered a bug in the program. |
Handle SIGTERM by forwarding it to all children and waiting for them to stop. This is a better behavior than letting the children continue in the background.
The price to pay is that if a program does not respond to SIGTERM, samurai will have to be killed with SIGKILL. This however is consistent with many other programs that invoke and manage child processes, and the reason will be apparent from e.g.
ps
ortop
output, so overall I think this is in improvement.