kill all children on SIGTERM #106

Open
wants to merge 2 commits into
base: master


bonzini
Contributor

@bonzini bonzini commented May 3, 2024

Handle SIGTERM by forwarding it to all children and waiting for them to stop. This is a better behavior than letting the children continue in the background.

The price to pay is that if a program does not respond to SIGTERM, samurai will have to be killed with SIGKILL. This, however, is consistent with many other programs that invoke and manage child processes, and the reason will be apparent from e.g. ps or top output, so overall I think this is an improvement.
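
As a rough illustration of the behavior described above (not the code in this PR), forwarding a signal to every running job and then waiting for them could look like the sketch below. `struct job` and `killall_and_wait()` are hypothetical names; samurai's real job table is different.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical job table entry; pid is -1 when the slot is unused. */
struct job {
	pid_t pid;
};

/* Send sig to every running job, then reap each of them.
 * Returns the number of children that were reaped. */
static int
killall_and_wait(struct job *jobs, size_t n, int sig)
{
	size_t i;
	int reaped = 0;

	for (i = 0; i < n; i++) {
		if (jobs[i].pid > 0)
			kill(jobs[i].pid, sig);
	}
	for (i = 0; i < n; i++) {
		if (jobs[i].pid <= 0)
			continue;
		/* Blocks for as long as the child keeps running. If the child
		 * ignores sig, only SIGKILL on the parent gets us out. */
		while (waitpid(jobs[i].pid, NULL, 0) < 0 && errno == EINTR)
			;
		jobs[i].pid = -1;
		reaped++;
	}
	return reaped;
}
```

The blocking `waitpid()` loop is where the trade-off shows up: a child that ignores SIGTERM keeps the parent waiting.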

@bonzini
Contributor Author

bonzini commented May 21, 2024

any news?

Owner

@michaelforney michaelforney left a comment


Could you confirm whether or not ninja has a similar behavior when it receives SIGTERM?

Extract it from jobwork() so that build() can call it on a signal.

Signed-off-by: Paolo Bonzini <[email protected]>
Keep the system clean by propagating SIGTERM to all children,
and by not starting new jobs on both SIGTERM and SIGINT.

The only tricky bit is that previously fd[i].revents was used
to skip both jobs that are not in use and jobs that did not
have output; that's because negative file descriptors
do not cause POLLNVAL and therefore fd[i].revents is zero for
inactive jobs as well.  But because all jobs must be killed,
build() now has to check fd[i].fd == -1 explicitly.

While at it, also clean up jobdone() by clearing job[i].edge;
it's not nice to leave a dangling pointer in the jobs array,
even if it's harmless.

Signed-off-by: Paolo Bonzini <[email protected]>
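
The point about negative file descriptors can be checked in isolation. The sketch below is not taken from build.c; `classify()` is a hypothetical helper. It polls once without blocking and shows that `revents == 0` alone cannot tell an unused slot (`fd == -1`) from a live job with no output, which is why an explicit `fd == -1` test is needed.

```c
#include <poll.h>

/* Poll once without blocking and classify each slot:
 *   *inactive  slots with fd == -1 (poll() sets revents to 0 for these)
 *   *idle      live slots with nothing to report (revents is also 0)
 *   *ready     live slots with revents != 0
 * Returns 0 on success, -1 if poll() fails. */
static int
classify(struct pollfd *fds, nfds_t n, int *inactive, int *idle, int *ready)
{
	nfds_t i;

	*inactive = *idle = *ready = 0;
	if (poll(fds, n, 0) < 0)
		return -1;
	for (i = 0; i < n; i++) {
		if (fds[i].fd == -1)
			++*inactive;   /* unused slot, must be checked explicitly */
		else if (fds[i].revents == 0)
			++*idle;       /* running job, no output yet */
		else
			++*ready;      /* output available or pipe closed */
	}
	return 0;
}
```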
@michaelforney
Owner

Thinking about this some more, I'm worried about a race condition where the signal arrives outside of the poll. If this happens, then we won't forward the signal to the subprocesses until the next one produces output or finishes. I think writing to a self-pipe in the handler and adding that to the pollfd array is probably the simplest way to solve this.

I also did some digging into ninja to see how it deals with these signals and found a few things:

This leaves me with a few questions. Since ninja doesn't forward the signal to any foreground job, doesn't it have the same issue you're fixing here? If SIGTERM is sent to ninja only, what happens? Does the subprocess remain running? I also wonder if whoever sent the SIGTERM ought to have sent it to samurai's process group.

Currently, samurai doesn't make new process groups for jobs. In this PR, we're making the assumption that SIGINT is usually sent to the whole process group due to a Ctrl-C. However, if SIGINT is sent to only samurai, then I believe it will just stop starting new jobs and wait until any active jobs finish naturally. Similarly, if SIGTERM is sent to samurai's process group, then I think the subprocesses will end up seeing SIGTERM twice (once from the initial signal, once from samurai).

@bonzini
Contributor Author

bonzini commented Dec 17, 2024

> I'm worried about a race condition where the signal arrives outside of the poll. If this happens, then we won't forward the signal to the subprocesses until the next one produces output or finishes. I think writing to a self-pipe in the handler and adding that to the pollfd array is probably the simplest way to solve this.

That is caught by:

                if (!have_work() && !numjobs)
                        break;

but indeed there is a microscopic window between this line and the poll() right below. I can fix it once we agree on what to do.

> If SIGTERM is sent to ninja only, what happens? Does the subprocess remain running? I also wonder if whoever sent the SIGTERM ought to have sent it to samurai's process group.

ninja forwards the signal to the non-console processes (using process groups):

    if (!(*i)->use_console_)
      kill(-(*i)->pid_, interrupted_);

But I think for SIGTERM it should send it to console processes as well. Unlike SIGINT and SIGHUP, which are sent by the OS, SIGTERM is usually sent by the user with kill, and you cannot assume that the user sent it to the process group. In fact, I'd argue that because the idea of SIGTERM is to let the process clean up after itself, 1) it should not be sent to a process group, 2) it is a bug to not trap it if you spawn processes.

The behavior of moving the children into their own process group was implemented for ninja-build/ninja#110 with no particular explanation; then it was changed to move the process into a session (ninja-build/ninja#909) and reverted (ninja-build/ninja#1097, but see also ninja-build/ninja#1001). Frankly I wouldn't take it as a good example even if samurai is a "ninja clone".

Make, on the other hand, does the same as my implementation: it does not place processes in separate process groups, and it forwards SIGTERM.

> Similarly, if SIGTERM is sent to samurai's process group, then I think the subprocesses will end up seeing SIGTERM twice (once from the initial signal, once from samurai).

That's a problem for the user who sent it to the process group; it's not samurai's problem. Generally I don't think it would be an issue, because the child will either exit on the first SIGTERM or start an orderly cleanup. In the latter case the handler would be triggered twice but, because signal handlers in general don't do much work themselves, it should be safe to consider SIGTERM idempotent; and if it is not, that should be considered a bug in the program.
