5.9 Handling SIGCHLD Signals
The purpose of the zombie state is to maintain
information about the child for the parent to fetch at some later
time. This information includes the process ID of the child, its
termination status, and information on the resource utilization of
the child (CPU time, memory, etc.). If a process terminates, and
that process has children in the zombie state, the parent process
ID of all the zombie children is set to 1 (the init
process), which will inherit the children and clean them up (i.e.,
init will wait for them, which removes the
zombie). Some Unix systems show the COMMAND column for a zombie
process as <defunct>.
Handling Zombies
Obviously we do not want to leave zombies
around. They take up space in the kernel and eventually we can run
out of processes. Whenever we fork children, we must
wait for them to prevent them from becoming zombies. To do
this, we establish a signal handler to catch SIGCHLD, and
within the handler, we call wait. (We will describe the
wait and waitpid functions in Section
5.10.) We establish the signal handler by adding the function
call
Signal (SIGCHLD, sig_chld);
in Figure 5.2, after the
call to listen. (It must be done sometime before we
fork the first child and needs to be done only once.) We
then define the signal handler, the function sig_chld,
which we show in Figure
5.7.
Figure 5.7
Version of SIGCHLD signal handler that calls wait
(improved in Figure 5.11).
tcpcliserv/sigchldwait.c
#include "unp.h"

void
sig_chld(int signo)
{
    pid_t   pid;
    int     stat;

    pid = wait(&stat);      /* reap one terminated child */
    printf("child %d terminated\n", pid);
    return;
}
Warning:
Calling standard I/O functions such as printf in a signal
handler is not recommended, for reasons that we will discuss in
Section 11.18. We
call printf here as a diagnostic tool to see when the
child terminates.
Under System V and Unix 98, the child of a
process does not become a zombie if the process sets the
disposition of SIGCHLD to SIG_IGN. Unfortunately,
this works only under System V and Unix 98. POSIX explicitly states
that this behavior is unspecified. The portable way to handle
zombies is to catch SIGCHLD and call wait or
waitpid.
If we compile this program (Figure 5.2, with the call to Signal
and our sig_chld handler) under Solaris 9 and use the signal
function from the system library (not our version from
Figure 5.6), we have the following:
solaris % tcpserv02 &                    start server in background
[2] 16939
solaris % tcpcli01 127.0.0.1             then start client in foreground
hi there                                 we type this
hi there                                 and this is echoed
^D                                       we type our EOF character
child 16942 terminated                   output by printf in signal handler
accept error: Interrupted system call    main function aborts
The sequence of steps is as follows:
- We terminate the client by typing our EOF character. The client
  TCP sends a FIN to the server and the server responds with an ACK.
- The receipt of the FIN delivers an EOF to the child's pending
  readline. The child terminates.
- The parent is blocked in its call to accept when the SIGCHLD
  signal is delivered. The sig_chld function (our signal handler)
  executes, wait fetches the child's PID and termination status,
  and printf is called from the signal handler. The signal handler
  returns.
- Since the signal was caught by the parent while the parent was
  blocked in a slow system call (accept), the kernel causes the
  accept to return an error of EINTR (interrupted system call). The
  parent does not handle this error (Figure 5.2), so it aborts.
The purpose of this example is to show that when
writing network programs that catch signals, we must be cognizant
of interrupted system calls, and we must handle them. In this
specific example, running under Solaris 9, the signal
function provided in the standard C library does not cause an
interrupted system call to be automatically restarted by the
kernel. That is, the SA_RESTART flag that we set in
Figure 5.6 is not set
by the signal function in the system library. Some other
systems automatically restart the interrupted system call. If we
run the same example under 4.4BSD, using its library version of the
signal function, the kernel restarts the interrupted
system call and accept does not return an error. Handling this
difference between operating systems is one reason we define our
own version of the signal function that we use throughout the
text (Figure 5.6).
As part of the coding conventions used in this
text, we always code an explicit return in our signal
handlers (Figure 5.7),
even though falling off the end of the function does the same thing
for a function returning void. When reading the code, the
unnecessary return statement acts as a reminder that the return may
interrupt a system call.
Handling Interrupted System Calls
We used the term "slow system call" to describe
accept, and we use this term for any system call that can
block forever. That is, the system call need never return. Most
networking functions fall into this category. For example, there is
no guarantee that a server's call to accept will ever
return, if there are no clients that will connect to the server.
Similarly, our server's call to read in Figure 5.3 will
never return if the client never sends a line for the server to
echo. Other examples of slow system calls are reads and writes of
pipes and terminal devices. A notable exception is disk I/O, which
usually returns to the caller (assuming no catastrophic hardware
failure).
The basic rule that applies here is that when a process is
blocked in a slow system call and the process catches a signal and
the signal handler returns, the system call can return an error of
EINTR. Some kernels automatically restart some
interrupted system calls. For portability, when we write a program
that catches signals (most concurrent servers catch
SIGCHLD), we must be prepared for slow system calls to
return EINTR. Portability problems are caused by the
qualifiers "can" and "some," which were used earlier, and the fact
that support for the POSIX SA_RESTART flag is optional.
Even if an implementation supports the SA_RESTART flag,
not all interrupted system calls may automatically be restarted.
Most Berkeley-derived implementations, for example, never
automatically restart select, and some of these
implementations never restart accept or
recvfrom.
To handle an interrupted accept, we
change the call to accept in Figure 5.2, the
beginning of the for loop, to the following:
for ( ; ; ) {
    clilen = sizeof(cliaddr);
    if ( (connfd = accept(listenfd, (SA *) &cliaddr, &clilen)) < 0) {
        if (errno == EINTR)
            continue;       /* back to for () */
        else
            err_sys("accept error");
    }
Notice that we call accept and not our
wrapper function Accept, since we must handle the failure
of the function ourselves.
What we are doing in this piece of code is
restarting the interrupted system call. This is fine for
accept, along with functions such as read,
write, select, and open. But there is
one function that we cannot restart: connect. If this
function returns EINTR, we cannot call it again, as doing
so will return an immediate error. When connect is
interrupted by a caught signal and is not automatically restarted,
we must call select to wait for the connection to
complete, as we will describe in Section 16.3.