7.5 Generic Socket Options
We start with a discussion of the generic socket
options. These options are protocol-independent (that is, they are
handled by the protocol-independent code within the kernel, not by
one particular protocol module such as IPv4), but some of the
options apply to only certain types of sockets. For example, even
though the SO_BROADCAST socket option is called "generic,"
it applies only to datagram sockets.
SO_BROADCAST Socket Option
This option enables or disables the ability of
the process to send broadcast messages. Broadcasting is supported
for only datagram sockets and only on networks that support the
concept of a broadcast message (e.g., Ethernet, token ring, etc.).
You cannot broadcast on a point-to-point link or any
connection-based transport protocol such as SCTP or TCP. We will
talk more about broadcasting in Chapter 20.
Since an application must set this socket option
before sending a broadcast datagram, it prevents a process from
sending a broadcast when the application was never designed to
broadcast. For example, a UDP application might take the
destination IP address as a command-line argument, but the
application never intended for a user to type in a broadcast
address. Rather than forcing the application to try to determine if
a given address is a broadcast address or not, the test is in the
kernel: If the destination address is a broadcast address and this
socket option is not set, EACCES is returned (p. 233 of
TCPv2).
SO_DEBUG Socket Option
This option is supported only by TCP. When
enabled for a TCP socket, the kernel keeps track of detailed
information about all the packets sent or received by TCP for the
socket. These are kept in a circular buffer within the kernel that
can be examined with the trpt program. Pages 916–920 of
TCPv2 provide additional details and an example that uses this
option.
SO_DONTROUTE Socket Option
This option specifies that outgoing packets are
to bypass the normal routing mechanisms of the underlying protocol.
For example, with IPv4, the packet is directed to the appropriate
local interface, as specified by the network and subnet portions of
the destination address. If the local interface cannot be
determined from the destination address (e.g., the destination is
not on the other end of a point-to-point link, or is not on a
shared network), ENETUNREACH is returned.
The equivalent of this option can also be
applied to individual datagrams using the MSG_DONTROUTE
flag with the send, sendto, or sendmsg
functions.
This option is often used by routing daemons
(e.g., routed and gated) to bypass the routing
table and force a packet to be sent out a particular interface.
SO_ERROR Socket Option
When an error occurs on a socket, the protocol
module in a Berkeley-derived kernel sets a variable named
so_error for that socket to one of the standard Unix
Exxx values. This is
called the pending error for the
socket. The process can be immediately notified of the error in one
of two ways:
-
If the process is
blocked in a call to select on the socket (Section
6.3), for either readability or writability, select
returns with either or both conditions set.
-
If the process
is using signal-driven I/O (Chapter 25), the SIGIO
signal is generated for either the process or the process
group.
The process can then obtain the value of
so_error by fetching the SO_ERROR socket option.
The integer value returned by getsockopt is the pending
error for the socket. The value of so_error is then reset
to 0 by the kernel (p. 547 of TCPv2).
If so_error is nonzero when the process
calls read and there is no data to return, read
returns -1 with errno set to the value of so_error
(p. 516 of TCPv2). The value of so_error is then reset to
0. If there is data queued for the socket, that data is returned by
read instead of the error condition. If so_error
is nonzero when the process calls write, -1 is returned
with errno set to the value of so_error (p. 495
of TCPv2) and so_error is reset to 0.
There is a bug in the code shown on p. 495 of
TCPv2 in that so_error is not reset to 0. This has been
fixed in most modern releases. Anytime the pending error for a
socket is returned, it must be reset to 0.
This is the first socket option that we have
encountered that can be fetched but cannot be set.
SO_KEEPALIVE Socket Option
When the keep-alive option is set for a TCP
socket and no data has been exchanged across the socket in either
direction for two hours, TCP automatically sends a keep-alive probe to the peer. This probe is a
TCP segment to which the peer must respond. One of three scenarios
results:
-
The peer responds
with the expected ACK. The application is not notified (since
everything is okay). TCP will send another probe following another
two hours of inactivity.
-
The peer
responds with an RST, which tells the local TCP that the peer host
has crashed and rebooted. The socket's pending error is set to
ECONNRESET and the socket is closed.
-
There is no
response from the peer to the keep-alive probe. Berkeley-derived
TCPs send 8 additional probes, 75 seconds apart, trying to elicit a
response. TCP will give up if there is no response within 11
minutes and 15 seconds after sending the first probe.
HP-UX 11 treats the keep-alive probes in the
same way as it would treat data, sending the second probe after a
retransmission timeout and doubling the timeout for each packet
until the configured maximum interval, with a default of 10
minutes.
If there is no response at all to TCP's
keep-alive probes, the socket's pending error is set to
ETIMEDOUT and the socket is closed. But if the socket
receives an ICMP error in response to one of the keep-alive probes,
the corresponding error (Figures A.15 and
A.16) is returned
instead (and the socket is still closed). A common ICMP error in
this scenario is "host unreachable," indicating that the peer host
is unreachable, in which case, the pending error is set to
EHOSTUNREACH. This can occur either because of a network
failure or because the remote host has crashed and the last-hop
router has detected the crash.
Chapter 23 of TCPv1 and pp. 828–831 of TCPv2
contain additional details on the keep-alive option.
Undoubtedly the most common question regarding
this option is whether the timing parameters can be modified
(usually to reduce the two-hour period of inactivity to some
shorter value). Appendix E of TCPv1 discusses how to change these
timing parameters for various kernels, but be aware that most
kernels maintain these parameters on a per-kernel basis, not on a
per-socket basis, so changing the inactivity period from 2 hours to
15 minutes, for example, will affect all sockets on the host that enable this
option. However, such questions usually result from a
misunderstanding of the purpose of this option.
The purpose of this option is to detect if the
peer host crashes or becomes
unreachable (e.g., dial-up modem connection drops, power fails,
etc.). If the peer process
crashes, its TCP will send a FIN across the connection, which we
can easily detect with select. (This was why we used
select in Section 6.4.) Also
realize that if there is no response to any of the keep-alive
probes (scenario 3), we are not guaranteed that the peer host has
crashed, and TCP may well terminate a valid connection. It could be
that some intermediate router has crashed for 15 minutes, and that
period of time just happens to completely overlap our host's
11-minute and 15-second keep-alive probe period. In fact, this
function might more properly be called "make-dead" rather than
"keep-alive" since it can terminate live connections.
This option is normally used by servers,
although clients can also use the option. Servers use the option
because they spend most of their time blocked waiting for input
across the TCP connection, that is, waiting for a client request.
But if the client host's connection drops, is powered off, or
crashes, the server process will never know about it, and the
server will continually wait for input that can never arrive. This
is called a half-open connection.
The keep-alive option will detect these half-open connections and
terminate them.
Some servers, notably FTP servers, provide an
application timeout, often on the order of minutes. This is done by
the application itself, normally around a call to read,
reading the next client command. This timeout does not involve this
socket option. This is often a better method of eliminating
connections to missing clients, since the application has complete
control if it implements the timeout itself.
SCTP has a heartbeat mechanism that is similar to TCP's
"keep-alive" mechanism. The heartbeat mechanism is controlled
through parameters of the SCTP_SET_PEER_ADDR_PARAMS socket
option discussed later in this chapter, rather than the
SO_KEEPALIVE socket option. The settings made by
SO_KEEPALIVE on a SCTP socket are ignored and do not
affect the SCTP heartbeat mechanism.
Figure
7.6 summarizes the various methods that we have to detect when
something happens on the other end of a TCP connection. When we say
"using select for readability," we mean calling
select to test whether a socket is readable.
SO_LINGER Socket Option
This option specifies how the close
function operates for a connection-oriented protocol (e.g., for TCP
and SCTP, but not for UDP). By default, close returns
immediately, but if there is any data still remaining in the socket
send buffer, the system will try to deliver the data to the
peer.
The SO_LINGER socket option lets us
change this default. This option requires the following structure
to be passed between the user process and the kernel. It is defined
by including <sys/socket.h>.
struct linger {
int l_onoff; /* 0=off, nonzero=on */
int l_linger; /* linger time, POSIX specifies units as seconds */
};
Calling setsockopt leads to one of the
following three scenarios, depending on the values of the two
structure members:
-
If
l_onoff is 0, the option is turned off. The value of
l_linger is ignored and the previously discussed TCP
default applies: close returns immediately.
-
If
l_onoff is nonzero and l_linger is zero, TCP
aborts the connection when it is closed (pp. 1019–1020 of TCPv2).
That is, TCP discards any data still remaining in the socket send
buffer and sends an RST to the peer, not the normal four-packet
connection termination sequence (Section 2.6). We
will show an example of this in Figure 16.21. This
avoids TCP's TIME_WAIT state, but in doing so, leaves open the
possibility of another incarnation of this connection being created
within 2MSL seconds (Section 2.7) and
having old duplicate segments from the just-terminated connection
being incorrectly delivered to the new incarnation.
SCTP will also do an abortive close of the
socket by sending an ABORT chunk to the peer (see Section 9.2 of
[Stewart and Xie 2001]) when l_onoff is nonzero and
l_linger is zero.
Occasional USENET postings advocate the use of
this feature just to avoid the TIME_WAIT state and to be able to
restart a listening server even if connections are still in use
with the server's well-known port. This should NOT be done and
could lead to data corruption, as detailed in RFC 1337 [Braden
1992]. Instead, the SO_REUSEADDR socket option should
always be used in the server before the call to bind, as
we will describe shortly. The TIME_WAIT state is our friend and is
there to help us (i.e., to let old duplicate segments expire in the
network). Instead of trying to avoid the state, we should
understand it (Section 2.7).
There are certain circumstances which warrant
using this feature to send an abortive close. One example is an
RS-232 terminal server, which might hang forever in CLOSE_WAIT
trying to deliver data to a stuck terminal port, but would
properly reset the stuck port if it got an RST to discard the
pending data.
-
If
l_onoff is nonzero and l_linger is nonzero, then
the kernel will linger when the
socket is closed (p. 472 of TCPv2). That is, if there is any data
still remaining in the socket send buffer, the process is put to
sleep until either: (i) all the data is sent and acknowledged by
the peer TCP, or (ii) the linger time expires. If the socket has
been set to nonblocking (Chapter 16), it will not wait for
the close to complete, even if the linger time is nonzero.
When using this feature of the SO_LINGER option, it is
important for the application to check the return value from
close, because if the linger time expires before the
remaining data is sent and acknowledged, close returns
EWOULDBLOCK and any remaining data in the send buffer is
discarded.
We now need to see exactly when a close
on a socket returns given the various scenarios we looked at. We
assume that the client writes data to the socket and then calls
close. Figure 7.7
shows the default situation.
We assume that when the client's data arrives,
the server is temporarily busy, so the data is added to the socket
receive buffer by its TCP. Similarly, the next segment, the
client's FIN, is also added to the socket receive buffer (in
whatever manner the implementation records that a FIN has been
received on the connection). But by default, the client's
close returns immediately. As we show in this scenario,
the client's close can return before the server reads the
remaining data in its socket receive buffer. Therefore, it is
possible for the server host to crash before the server application
reads this remaining data, and the client application will never
know.
The client can set the SO_LINGER socket
option, specifying some positive linger time. When this occurs, the
client's close does not return until all the client's data
and its FIN have been acknowledged by the server TCP. We show this
in Figure 7.8.
But we still have the same problem as in
Figure 7.7: The server
host can crash before the server application reads its remaining
data, and the client application will never know. Worse, Figure 7.9 shows what can happen
when the SO_LINGER option is set to a value that is too
low.
The basic principle here is that a successful
return from close, with the SO_LINGER socket
option set, only tells us that the data we sent (and our FIN) have
been acknowledged by the peer TCP. This does not tell us whether the peer application has
read the data. If we do not set the SO_LINGER socket
option, we do not know whether the peer TCP has acknowledged the
data.
One way for the client to know that the server
has read its data is to call shutdown (with a second
argument of SHUT_WR) instead of close and wait
for the peer to close its end of the connection. We show
this scenario in Figure
7.10.
Comparing this figure to Figures 7.7 and 7.8, we see that when we close our end of the
connection, depending on the function called (close or
shutdown) and whether the SO_LINGER socket option
is set, the return can occur at three different times:
-
close
returns immediately, without waiting at all (the default; Figure 7.7).
-
close
lingers until the ACK of our FIN is received (Figure 7.8).
-
shutdown followed by a read
waits until we receive the peer's FIN (Figure 7.10).
Another way to know that the peer application
has read our data is to use an application-level acknowledgment, or
application ACK. For example, in
the following, the client sends its data to the server and then
calls read for one byte of data:
char ack;
Write(sockfd, data, nbytes); /* data from client to server */
n = Read(sockfd, &ack, 1); /* wait for application-level ACK */
The server reads the data from the client and
then sends back the one-byte application-level ACK:
nbytes = Read(sockfd, buff, sizeof(buff)); /* data from client */
/* server verifies it received correct
amount of data from client */
Write(sockfd, "", 1); /* server's ACK back to client */
We are guaranteed that when the read in
the client returns, the server process has read the data we sent.
(This assumes that either the server knows how much data the client
is sending, or there is some application-defined end-of-record
marker, which we do not show here.) Here, the application-level ACK
is a byte of 0, but the contents of this byte could be used to
signal other conditions from the server to the client. Figure 7.11 shows the possible
packet exchange.
Figure
7.12 summarizes the two possible calls to shutdown and
the three possible calls to close, and the effect on a TCP
socket.
SO_OOBINLINE Socket Option
When this option is set, out-of-band data will
be placed in the normal input queue (i.e., inline). When this
occurs, the MSG_OOB flag to the receive functions cannot
be used to read the out-of-band data. We will discuss out-of-band
data in more detail in Chapter 24.
SO_RCVBUF and SO_SNDBUF Socket Options
Every socket has a send buffer and a receive
buffer. We described the operation of the send buffers with TCP,
UDP, and SCTP in Figures 2.15,
2.16, and 2.17.
The receive buffers are used by TCP, UDP, and
SCTP to hold received data until it is read by the application.
With TCP, the available room in the socket receive buffer limits
the window that TCP can advertise to the other end. The TCP socket
receive buffer cannot overflow because the peer is not allowed to
send data beyond the advertised window. This is TCP's flow control,
and if the peer ignores the advertised window and sends data beyond
the window, the receiving TCP discards it. With UDP, however, when
a datagram arrives that will not fit in the socket receive buffer,
that datagram is discarded. Recall that UDP has no flow control: It
is easy for a fast sender to overwhelm a slower receiver, causing
datagrams to be discarded by the receiver's UDP, as we will show in
Section 8.13. In
fact, a fast sender can overwhelm its own network interface,
causing datagrams to be discarded by the sender itself.
These two socket options let us change the
default sizes. The default values differ widely between
implementations. Older Berkeley-derived implementations would
default the TCP send and receive buffers to 4,096 bytes, but newer
systems use larger values, anywhere from 8,192 to 61,440 bytes. The
UDP send buffer size often defaults to a value around 9,000 bytes
if the host supports NFS, and the UDP receive buffer size often
defaults to a value around 40,000 bytes.
When setting the size of the TCP socket receive
buffer, the ordering of the function calls is important. This is
because of TCP's window scale option (Section 2.6), which
is exchanged with the peer on the SYN segments when the connection
is established. For a client, this means the SO_RCVBUF
socket option must be set before calling connect. For a
server, this means the socket option must be set for the listening
socket before calling listen. Setting this option for the
connected socket will have no effect whatsoever on the possible
window scale option because accept does not return with
the connected socket until TCP's three-way handshake is complete.
That is why this option must be set for the listening socket. (The
sizes of the socket buffers are always inherited from the listening
socket by the newly created connected socket: pp. 462–463 of
TCPv2.)
The TCP socket buffer sizes should be at least
four times the MSS for the connection. If we are dealing with
unidirectional data transfer, such as a file transfer in one
direction, when we say "socket buffer sizes," we mean the socket
send buffer size on the sending host and the socket receive buffer
size on the receiving host. For bidirectional data transfer, we
mean both socket buffer sizes on the sender and both socket buffer
sizes on the receiver. With typical default buffer sizes of 8,192
bytes or larger, and a typical MSS of 512 or 1,460, this
requirement is normally met.
The minimum MSS multiple of four is a result of
the way that TCP's fast recovery algorithm works. The TCP sender
uses three duplicate acknowledgments to detect that a packet was
lost (RFC 2581 [Allman, Paxson, and Stevens 1999]). The receiver
sends a duplicate acknowledgment for each segment it receives after
a lost segment. If the window size is smaller than four segments,
there cannot be three duplicate acknowledgments, so the fast
recovery algorithm cannot be invoked.
To avoid wasting potential buffer space, the TCP
socket buffer sizes should also be an even multiple of the MSS for
the connection. Some implementations handle this detail for the
application, rounding up the socket buffer size after the
connection is established (p. 902 of TCPv2). This is another reason
to set these two socket options before establishing a connection.
For example, using the default 4.4BSD size of 8,192 and assuming an
Ethernet with an MSS of 1,460, both socket buffers are rounded up
to 8,760 (6 x 1,460) when the connection is established. This is
not a crucial requirement; the additional space in the socket
buffer above the multiple of the MSS is simply unused.
Another consideration in setting the socket
buffer sizes deals with performance. Figure 7.13 shows a TCP connection between two
endpoints (which we call a pipe)
with a capacity of eight segments.
We show four data segments on the top and four
ACKs on the bottom. Even though there are only four segments of
data in the pipe, the client must have a send buffer capacity of at
least eight segments, because the client TCP must keep a copy of
each segment until the ACK is received from the server.
We are ignoring some details here. First, TCP's
slow-start algorithm limits the rate at which segments are
initially sent on an idle connection. Next, TCP often acknowledges
every other segment, not every segment as we show. All these
details are covered in Chapters 20 and 24 of TCPv1.
What is important to understand is the concept
of the full-duplex pipe, its capacity, and how that relates to the
socket buffer sizes on both ends of the connection. The capacity of
the pipe is called the bandwidth-delay
product and we calculate this by multiplying the bandwidth
(in bits/sec) times the RTT (in seconds), converting the result
from bits to bytes. The RTT is easily measured with the
ping program.
The bandwidth is the value corresponding to the
slowest link between two endpoints and must somehow be known. For
example, a T1 line (1,536,000 bits/sec) with an RTT of 60 ms gives
a bandwidth-delay product of 11,520 bytes. If the socket buffer
sizes are less than this, the pipe will not stay full, and the
performance will be less than expected. Large socket buffers are
required when the bandwidth gets larger (e.g., T3 lines at 45
Mbits/sec) or when the RTT gets large (e.g., satellite links with
an RTT around 500 ms). When the bandwidth-delay product exceeds
TCP's maximum normal window size (65,535 bytes), both endpoints
also need the TCP long fat pipe
options that we mentioned in Section 2.6.
Most implementations have an upper limit for the
sizes of the socket send and receive buffers, and sometimes this
limit can be modified by the administrator. Older Berkeley-derived
implementations had a hard upper limit of around 52,000 bytes, but
newer implementations have a default limit of 256,000 bytes or
more, and this can usually be increased by the administrator.
Unfortunately, there is no simple way for an application to
determine this limit. POSIX defines the fpathconf
function, which most implementations support, and using the
_PC_SOCK_MAXBUF constant as the second argument, we can
retrieve the maximum size of the socket buffers. Alternately, an
application can try setting the socket buffers to the desired
value, and if that fails, cut the value in half and try again until
it succeeds. Finally, an application should make sure that it's not
actually making the socket buffer smaller when it sets it to a
preconfigured "large" value; calling getsockopt first to
retrieve the system's default and seeing if that's large enough is
often a good start.
SO_RCVLOWAT and SO_SNDLOWAT Socket Options
Every socket also has a receive low-water mark
and a send low-water mark. These are used by the select
function, as we described in Section 6.3. These
two socket options, SO_RCVLOWAT and SO_SNDLOWAT,
let us change these two low-water marks.
The receive low-water mark is the amount of data
that must be in the socket receive buffer for select to
return "readable." It defaults to 1 for TCP, UDP, and SCTP sockets.
The send low-water mark is the amount of available space that must
exist in the socket send buffer for select to return
"writable." This low-water mark normally defaults to 2,048 for TCP
sockets. With UDP, the low-water mark is used, as we described in
Section 6.3, but
since the number of bytes of available space in the send buffer for
a UDP socket never changes (since UDP does not keep a copy of the
datagrams sent by the application), as long as the UDP socket send
buffer size is greater than the socket's low-water mark, the UDP
socket is always writable. Recall from Figure 2.16 that UDP
does not have a send buffer; it has only a send buffer size.
SO_RCVTIMEO and SO_SNDTIMEO Socket Options
These two socket options allow us to place a
timeout on socket receives and sends. Notice that the argument to
the two sockopt functions is a pointer to a
timeval structure, the same one used with select
(Section 6.3). This
lets us specify the timeouts in seconds and microseconds. We
disable a timeout by setting its value to 0 seconds and 0
microseconds. Both timeouts are disabled by default.
The receive timeout affects the five input
functions: read, readv, recv,
recvfrom, and recvmsg. The send timeout affects
the five output functions: write, writev,
send, sendto, and sendmsg. We will talk
more about socket timeouts in Section 14.2.
These two socket options and the concept of
inherent timeouts on socket receives and sends were added with
4.3BSD Reno.
In Berkeley-derived implementations, these two
values really implement an inactivity timer and not an absolute
timer on the read or write system call. Pages 496 and 516 of TCPv2
talk about this in more detail.
SO_REUSEADDR and SO_REUSEPORT Socket Options
The SO_REUSEADDR socket option serves
four different purposes:
-
SO_REUSEADDR allows a listening server
to start and bind its well-known port, even if previously
established connections exist that use this port as their local
port. This condition is typically encountered as follows:
-
A listening
server is started.
-
A connection
request arrives and a child process is spawned to handle that
client.
-
The listening
server terminates, but the child continues to service the client on
the existing connection.
-
The listening
server is restarted.
By default, when the listening server is
restarted in (d) by calling socket, bind, and
listen, the call to bind fails because the
listening server is trying to bind a port that is part of an
existing connection (the one being handled by the previously
spawned child). But if the server sets the SO_REUSEADDR
socket option between the calls to socket and
bind, the latter function will succeed. All TCP servers should specify this socket
option to allow the server to be restarted in this situation.
This scenario is one of the most frequently
asked questions on USENET.
-
SO_REUSEADDR allows a new server to
be started on the same port as an existing server that is bound to
the wildcard address, as long as each instance binds a different
local IP address. This is common for a site hosting multiple HTTP
servers using the IP alias technique (Section A.4).
Assume the local host's primary IP address is 198.69.10.2 but it
has two aliases: 198.69.10.128 and 198.69.10.129. Three HTTP
servers are started. The first HTTP server would call bind
with the wildcard as the local IP address and a local port of 80
(the well-known port for HTTP). The second server would call
bind with a local IP address of 198.69.10.128 and a local
port of 80. But, this second call to bind fails unless
SO_REUSEADDR is set before the call. The third server
would bind 198.69.10.129 and port 80. Again,
SO_REUSEADDR is required for this final call to succeed.
Assuming SO_REUSEADDR is set and the three servers are
started, incoming TCP connection requests with a destination IP
address of 198.69.10.128 and a destination port of 80 are delivered
to the second server, incoming requests with a destination IP
address of 198.69.10.129 and a destination port of 80 are delivered
to the third server, and all other TCP connection requests with a
destination port of 80 are delivered to the first server. This
"default" server handles requests destined for 198.69.10.2 in
addition to any other IP aliases that the host may have configured.
The wildcard means "everything that doesn't have a better (more
specific) match." Note that this scenario of allowing multiple
servers for a given service is handled automatically if the server
always sets the SO_REUSEADDR socket option (as we
recommend).
With TCP, we are never able to start multiple
servers that bind the same IP address and the same port: a
completely duplicate binding. That
is, we cannot start one server that binds 198.69.10.2 port 80 and
start another that also binds 198.69.10.2 port 80, even if we set
the SO_REUSEADDR socket option for the second server.
For security reasons, some operating systems
prevent any "more specific" bind
to a port that is already bound to the wildcard address, that is,
the series of binds described here would not work with or without
SO_REUSEADDR. On such a system, the server that performs
the wildcard bind must be started last. This is to avoid the
problem of a rogue server binding to an IP address and port that
are being served already by a system service and intercepting
legitimate requests. This is a particular problem for NFS, which
generally does not use a privileged port.
-
SO_REUSEADDR allows a single process
to bind the same port to multiple sockets, as long as each bind
specifies a different local IP address. This is common for UDP
servers that need to know the destination IP address of client
requests on systems that do not provide the IP_RECVDSTADDR
socket option. This technique is normally not used with TCP servers
since a TCP server can always determine the destination IP address
by calling getsockname after the connection is
established. However, a TCP server wishing to serve connections to
some, but not all, addresses belonging to a multihomed host should
use this technique.
-
SO_REUSEADDR allows completely duplicate bindings: a bind
of an IP address and port, when that same IP address and port are
already bound to another socket, if the transport protocol supports
it. Normally this feature is supported only for UDP sockets.
This feature is used with multicasting to allow
the same application to be run multiple times on the same host.
When a UDP datagram is received for one of these multiply bound
sockets, the rule is that if the datagram is destined for either a
broadcast address or a multicast address, one copy of the datagram
is delivered to each matching socket. But if the datagram is
destined for a unicast address, the datagram is delivered to only
one socket. If, in the case of a unicast datagram, there are
multiple sockets that match the datagram, the choice of which
socket receives the datagram is implementation-dependent. Pages
777–779 of TCPv2 talk more about
this feature. We will talk more about broadcasting and multicasting
in Chapters
20 and 21.
Exercises 7.5 and
7.6 show some
examples of this socket option.
4.4BSD introduced the SO_REUSEPORT
socket option when support for multicasting was added. Instead of
overloading SO_REUSEADDR with the desired multicast
semantics that allow completely duplicate bindings, this new socket
option was introduced with the following semantics:
-
This option
allows completely duplicate bindings, but only if each socket that
wants to bind the same IP address and port specifies this socket
option.
-
SO_REUSEADDR is considered equivalent
to SO_REUSEPORT if the IP address being bound is a
multicast address (p. 731 of TCPv2).
The problem with this socket option is that not
all systems support it, and on those that do not support the option
but do support multicasting, SO_REUSEADDR is used instead
of SO_REUSEPORT to allow completely duplicate bindings
when it makes sense (i.e., a UDP server that can be run multiple
times on the same host at the same time and that expects to receive
either broadcast or multicast datagrams).
We can summarize our discussion of these socket
options with the following recommendations:
-
Set the
SO_REUSEADDR socket option before calling bind in
all TCP servers.
-
When writing a
multicast application that can be run multiple times on the same
host at the same time, set the SO_REUSEADDR socket option
and bind the group's multicast address as the local IP
address.
Chapter 22 of TCPv2 talks about these two socket
options in more detail.
There is a potential security problem with
SO_REUSEADDR. If a socket exists that is bound to, say,
the wildcard address and port 5555, if we specify
SO_REUSEADDR, we can bind that same port to a different IP
address, say the primary IP address of the host. Any future
datagrams that arrive destined to port 5555 and the IP address that
we bound to our socket are delivered to our socket, not to the
other socket bound to the wildcard address. These could be TCP SYN
segments, SCTP INIT chunks, or UDP datagrams. (Exercise
11.9 shows this feature with UDP.) For most well-known
services, HTTP, FTP, and Telnet, for example, this is not a problem
because these servers all bind a reserved port. Hence, any process
that comes along later and tries to bind a more specific instance
of that port (i.e., steal the port) requires superuser privileges.
NFS, however, can be a problem since its normal port (2049) is not
reserved.
One underlying problem with the sockets API is
that the setting of the socket pair is done with two function calls
(bind and connect) instead of one. [Torek 1994]
proposes a single function that solves this problem.
int bind_connect_listen(int sockfd,
                        const struct sockaddr *laddr, int laddrlen,
                        const struct sockaddr *faddr, int faddrlen,
                        int listen);
laddr specifies
the local IP address and local port, faddr specifies the foreign IP address and
foreign port, and listen specifies
a client (zero) or a server (nonzero; same as the backlog argument
to listen). Then, bind would be a library
function that calls this function with faddr a null pointer and faddrlen 0, and connect would be a
library function that calls this function with laddr a null pointer and laddrlen 0. There are a few applications,
notably TFTP, that need to specify both the local pair and the
foreign pair, and they could call bind_connect_listen
directly. With such a function, the need for SO_REUSEADDR
disappears, other than for multicast UDP servers that explicitly
need to allow completely duplicate bindings of the same IP address
and port. Another benefit of this new function is that a TCP server
could restrict itself to servicing connection requests that arrive
from one specific IP address and port, something which RFC 793
[Postel 1981c] specifies but is impossible to implement with the
existing sockets API.
SO_TYPE Socket Option
This option returns the socket type. The integer
value returned is a value such as SOCK_STREAM or
SOCK_DGRAM. This option is typically used by a process
that inherits a socket when it is started.
SO_USELOOPBACK Socket Option
This option applies only to sockets in the
routing domain (AF_ROUTE). This option defaults to ON for
these sockets (the only one of the SO_xxx socket options that defaults to ON instead
of OFF). When this option is enabled, the socket receives a copy of
everything sent on the socket.
Another way to disable these loopback copies is
to call shutdown with a second argument of
SHUT_RD.