28.3 Raw Socket Output
Output on a raw socket is governed by the
following rules:
-
Normal output is performed by calling
sendto or sendmsg and specifying the destination
IP address. write, writev, or send can
also be called if the socket has been connected.
-
If the IP_HDRINCL option is not set,
the starting address of the data for the kernel to send specifies
the first byte following the IP header because the kernel will
build the IP header and prepend it to the data from the process.
The kernel sets the protocol field of the IPv4 header that it
builds to the third argument from the call to socket.
-
If the IP_HDRINCL option is set, the
starting address of the data for the kernel to send specifies the
first byte of the IP header. The amount of data to write must
include the size of the caller's IP header. The process builds the
entire IP header, except: (i) the IPv4 identification field can be
set to 0, which tells the kernel to set this value; (ii) the kernel
always calculates and stores the IPv4 header checksum; and (iii) IP
options may or may not be included; see Section 27.2.
-
The kernel fragments raw packets that exceed the
outgoing interface MTU.
Raw sockets are documented to provide an
identical interface to the one a protocol would have if it was
resident in the kernel [McKusick et al. 1996]. Unfortunately, this
means that certain pieces of the API are dependent on the OS
kernel, specifically with regard to the byte ordering of the fields
in the IP header. On many Berkeley-derived kernels, all fields are
in network byte order except ip_len and ip_off,
which are in host byte order (pp. 233 and 1057 of TCPv2). On Linux
and OpenBSD, however, all the fields must be in network byte
order.
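The header rules above, together with the byte-order caveat, can be sketched as follows. This is a minimal sketch and not code from the text: fill_ipv4_header, enable_hdrincl, and the RAW_IP_FIELD macro are hypothetical names, the TTL of 64 is arbitrary, and the platform test assumes that only Linux and OpenBSD want ip_len and ip_off in network byte order. Check the target system's documentation before relying on that assumption.

```c
#define _DEFAULT_SOURCE 1   /* glibc: expose BSD struct ip in strict -std modes */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <arpa/inet.h>

/*
 * Assumption: Linux and OpenBSD want ip_len and ip_off in network byte
 * order; every other kernel is assumed to want host byte order.
 */
#if defined(__linux__) || defined(__OpenBSD__)
#define RAW_IP_FIELD(x) htons(x)
#else
#define RAW_IP_FIELD(x) (x)
#endif

/* Hypothetical helper: fill in an IPv4 header (no options) for IP_HDRINCL. */
void
fill_ipv4_header(struct ip *ip, uint8_t proto, uint16_t payload_len,
                 struct in_addr src, struct in_addr dst)
{
    memset(ip, 0, sizeof(*ip));
    ip->ip_v   = 4;                                /* IPv4 */
    ip->ip_hl  = (unsigned int)(sizeof(*ip) >> 2); /* length in 32-bit words */
    ip->ip_len = RAW_IP_FIELD((uint16_t)(sizeof(*ip) + payload_len));
    ip->ip_off = RAW_IP_FIELD(0);                  /* no flags, no offset */
    ip->ip_id  = 0;                                /* 0: kernel picks the ID */
    ip->ip_ttl = 64;                               /* arbitrary for the sketch */
    ip->ip_p   = proto;
    ip->ip_sum = 0;                                /* kernel stores the checksum */
    ip->ip_src = src;
    ip->ip_dst = dst;
}

/* Hypothetical helper: turn on IP_HDRINCL for a raw IPv4 socket. */
int
enable_hdrincl(int sockfd)
{
    const int on = 1;

    return setsockopt(sockfd, IPPROTO_IP, IP_HDRINCL, &on, sizeof(on));
}
```

A caller would place the payload immediately after the header in one buffer and pass sizeof(struct ip) plus the payload length as the amount to write, since with IP_HDRINCL set the length must include the caller's IP header.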
The IP_HDRINCL socket option was
introduced with 4.3BSD Reno. Before this, the only way for an
application to specify its own IP header in packets sent on a raw
IP socket was to apply a kernel patch that was introduced in 1988
by Van Jacobson to support traceroute. This patch required
the application to create a raw IP socket specifying a protocol of IPPROTO_RAW, which has
a value of 255 (and is a reserved value and must never appear as
the protocol field in an IP header).
The functions that perform input and output on
raw sockets are some of the simplest in the kernel. For example, in
TCPv2, each function requires about 40 lines of C code (pp.
1054–1057), compared to TCP input at about 2,000 lines and TCP
output at about 700 lines.
Our description of the IP_HDRINCL
socket option is for 4.4BSD. Earlier versions, such as Net/2,
filled in more fields in the IP header when this option was
set.
With IPv4, it is the responsibility of the user
process to calculate and set any header checksums contained in
whatever follows the IPv4 header. For example, in our ping
program (Figure 28.14), we must
calculate the ICMPv4 checksum and store it in the ICMPv4 header
before calling sendto.
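To make the checksum step concrete, here is a minimal sketch of the Internet checksum (RFC 1071) applied to an ICMPv4 message. It is not the routine from our ping program: inet_checksum and icmp4_set_checksum are hypothetical names, and the code works byte by byte so that it does not depend on host byte order. The 32-bit accumulator is assumed to be large enough, which holds for datagram-sized buffers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Internet checksum over len bytes, returned as a host-order number.
 * Any checksum field inside the buffer must be zero while this runs.
 */
uint16_t
inet_checksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;

    while (len > 1) {                 /* add 16-bit big-endian words */
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                     /* odd trailing byte, zero padded */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)                 /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;            /* one's complement of the sum */
}

/*
 * Hypothetical helper: checksum an ICMPv4 message in place. The
 * checksum occupies bytes 2 and 3 of the ICMPv4 header.
 */
void
icmp4_set_checksum(uint8_t *icmp, size_t len)
{
    uint16_t ck;

    icmp[2] = icmp[3] = 0;            /* zero the field before summing */
    ck = inet_checksum(icmp, len);
    icmp[2] = (uint8_t)(ck >> 8);     /* store in network byte order */
    icmp[3] = (uint8_t)(ck & 0xff);
}
```

The process would call icmp4_set_checksum on the finished ICMPv4 message just before sendto. Running inet_checksum over a message whose checksum is already stored yields 0, which is how a receiver verifies it.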
IPv6 Differences
There are a few differences with raw IPv6
sockets (RFC 3542 [Stevens et al. 2003]):
-
All fields in the protocol headers sent or
received on a raw IPv6 socket are in network byte order.
-
There is nothing similar to the IPv4
IP_HDRINCL socket option with IPv6. Complete IPv6 packets
(including the IPv6 header or extension headers) cannot be read or
written on an IPv6 raw socket. Almost all fields in an IPv6 header
and all extension headers are available to the application through
socket options or ancillary data (see Exercise 28.1).
Should an application need to read or write complete IPv6
datagrams, datalink access (described in Chapter 29) must be used.
-
Checksums on raw IPv6 sockets are handled
differently, as will be described shortly.
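As an illustration of the ancillary-data route mentioned above, the following sketch builds an IPV6_HOPLIMIT object (RFC 3542) that sendmsg could carry on a raw IPv6 socket to set the hop limit of one datagram. The helper name set_hoplimit_cmsg is hypothetical, and the code assumes a system whose headers define IPV6_HOPLIMIT without further feature macros. The caller is assumed to supply a control buffer that is suitably aligned for a struct cmsghdr.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: attach one IPV6_HOPLIMIT ancillary object to msg,
 * using cbuf (cbuflen bytes, aligned for struct cmsghdr) as the control
 * buffer. Returns 0 on success, -1 if cbuf is too small.
 */
int
set_hoplimit_cmsg(struct msghdr *msg, void *cbuf, size_t cbuflen,
                  int hoplimit)
{
    struct cmsghdr *cm;

    if (cbuflen < CMSG_SPACE(sizeof(int)))
        return -1;

    memset(cbuf, 0, cbuflen);
    msg->msg_control = cbuf;
    msg->msg_controllen = CMSG_SPACE(sizeof(int));

    cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = IPPROTO_IPV6;          /* IPv6-level object */
    cm->cmsg_type  = IPV6_HOPLIMIT;         /* hop limit for this datagram */
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &hoplimit, sizeof(hoplimit));
    return 0;
}
```

The rest of the msghdr (destination address and data buffers) is filled in as usual before calling sendmsg. A union of a char array of CMSG_SPACE(sizeof(int)) bytes and a struct cmsghdr is one common way to obtain an aligned control buffer.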
IPV6_CHECKSUM Socket Option
For an ICMPv6 raw socket, the kernel always
calculates and stores the checksum in the ICMPv6 header. This
differs from an ICMPv4 raw socket, where the application must do
this itself (compare Figures 28.14 and
28.16). While ICMPv4
and ICMPv6 both require the sender to calculate the checksum,
ICMPv6 includes a pseudoheader in its checksum (we will discuss the
concept of a pseudoheader when we calculate the UDP checksum in
Figure 29.14). One of
the fields in this pseudoheader is the source IPv6 address, and
normally the application lets the kernel choose this value. To
prevent the application from having to try to choose this address
just to calculate the checksum, it is easier to let the kernel
calculate the checksum.
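To see why the source address matters, here is a minimal sketch of the computation the kernel performs on our behalf. It is for illustration only, since on an ICMPv6 raw socket the application does not calculate this checksum. The names sum_words and icmp6_checksum are hypothetical. The pseudoheader layout assumed here is the source address, the destination address, a 32-bit upper-layer length, three zero bytes, and the next header value 58.

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>

/* Add len bytes to a running sum of 16-bit big-endian words. */
static uint32_t
sum_words(uint32_t sum, const uint8_t *p, size_t len)
{
    for (; len > 1; p += 2, len -= 2)
        sum += ((uint32_t)p[0] << 8) | p[1];
    if (len == 1)
        sum += (uint32_t)p[0] << 8;   /* odd trailing byte, zero padded */
    return sum;
}

/*
 * Hypothetical helper: ICMPv6 checksum as a host-order number. The sum
 * covers the pseudoheader followed by the ICMPv6 message, whose
 * checksum field (bytes 2 and 3) must be zero on entry.
 */
uint16_t
icmp6_checksum(const struct in6_addr *src, const struct in6_addr *dst,
               const uint8_t *icmp6, uint32_t len)
{
    /* Last 8 bytes of the pseudoheader: length, 3 zero bytes, next header. */
    const uint8_t tail[8] = {
        (uint8_t)(len >> 24), (uint8_t)(len >> 16),
        (uint8_t)(len >> 8),  (uint8_t)len,
        0, 0, 0, IPPROTO_ICMPV6
    };
    uint32_t sum = 0;

    sum = sum_words(sum, src->s6_addr, 16);   /* source IPv6 address */
    sum = sum_words(sum, dst->s6_addr, 16);   /* destination IPv6 address */
    sum = sum_words(sum, tail, sizeof(tail));
    sum = sum_words(sum, icmp6, len);         /* the ICMPv6 message itself */

    while (sum >> 16)                         /* fold carries into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Because src is an input to the sum, a process that computed this itself would first have to learn which source address the kernel is going to choose, which is the difficulty described above.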
For other raw IPv6 sockets (i.e., those created
with a third argument to socket other than
IPPROTO_ICMPV6), a socket option tells the kernel whether
to calculate and store a checksum in outgoing packets and verify
the checksum in received packets. By default, this option is
disabled, and it is enabled by setting the option value to a
nonnegative value, as in
int offset = 2;
if (setsockopt(sockfd, IPPROTO_IPV6, IPV6_CHECKSUM,
               &offset, sizeof(offset)) < 0) {
    /* handle the error here */
}
This not only enables checksums on this socket,
it also tells the kernel the byte offset of the 16-bit checksum: 2
bytes from the start of the application data in this example. To
disable the option, it must be set to -1. When enabled, the kernel
will calculate and store the checksum for outgoing packets sent on
the socket and also verify the checksums for packets received on
the socket.