I/O multiplexing: select, poll and epoll in Linux

1.0 I/O multiplexing

I/O multiplexing is the ability to perform I/O operations on multiple file descriptors. Input operations like read, accept and the calls for receiving messages block when there is no incoming data. So, if an input call is made on one file descriptor and it blocks, we may miss data arriving on other file descriptors. To circumvent this, the I/O multiplexing calls, viz., select, poll and the epoll API calls, are provided. A process blocks on an I/O multiplexing call. When the call returns, the process is given the set of file descriptors that are ready for I/O. The process can then do I/O on these file descriptors before it goes for the next iteration of the I/O multiplexing call.

2.0 select

#include <sys/select.h>

int select (int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, struct timeval *timeout);

The select system call monitors three sets of file descriptors, readfds, writefds and exceptfds. The file descriptors in readfds are monitored to see whether data is available for reading on one or more of them. Similarly, the file descriptors in writefds are examined to see whether space is available for write operations on one or more of them. The file descriptors in exceptfds are watched for exceptional conditions. Any of the pointers to the three file descriptor sets may be null, and in that case nothing is done for that set. The first parameter, nfds, is the highest numbered file descriptor in any of the three sets, plus one. The last parameter is a pointer to a timeout. A null value of timeout means "block forever".

select returns when one or more of the file descriptors in the three sets are ready for I/O, or the timeout has expired. On return, the three sets are modified to include only the file descriptors that are ready for I/O. select is discussed with example server and client programs in Socket programming using the select system call.

select can monitor only file descriptors that are less than FD_SETSIZE, which glibc defines as 1024.
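The call sequence described above can be sketched as follows. This is a minimal illustration, not code from the earlier post: the helper names wait_readable and select_demo are hypothetical, and a pipe stands in for a socket so that the example needs no network setup.

```c
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable (int fd, int timeout_ms)
{
    if (fd < 0 || fd >= FD_SETSIZE)   /* select cannot monitor this fd */
        return -1;

    fd_set readfds;
    FD_ZERO (&readfds);
    FD_SET (fd, &readfds);

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* first argument: highest numbered file descriptor plus one */
    int ready = select (fd + 1, &readfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;

    /* on return, readfds holds only the descriptors that are ready */
    return FD_ISSET (fd, &readfds) ? 1 : 0;
}

/* Hypothetical demo: a pipe is not readable until something is written.
   Returns 0 if select behaved as described, -1 otherwise. */
int select_demo (void)
{
    int p [2];
    if (pipe (p) == -1)
        return -1;

    int before = wait_readable (p [0], 50);   /* nothing written yet */
    if (write (p [1], "x", 1) != 1)
        return -1;
    int after = wait_readable (p [0], 50);    /* one byte is pending */

    close (p [0]);
    close (p [1]);
    printf ("before write: %d, after write: %d\n", before, after);
    return (before == 0 && after == 1) ? 0 : -1;
}
```

Calling select_demo () from main is enough to try it. Note that the fd_set and the timeout are rebuilt on every call to wait_readable, because select may modify both.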

3.0 poll

#include <poll.h>

struct pollfd {
    int   fd;         /* file descriptor */
    short events;     /* requested events */
    short revents;    /* returned events */
};         

int poll (struct pollfd *fds, nfds_t nfds, int timeout);

#define _GNU_SOURCE         
#include <signal.h>
#include <poll.h>

int ppoll (struct pollfd *fds, nfds_t nfds,
           const struct timespec *tmo_p, const sigset_t *sigmask);

The poll system call monitors file descriptors for events and returns when one or more events have occurred or the timeout has expired. The first parameter, fds, is an array of struct pollfd. Each element contains fd, the file descriptor to be monitored, and two short integers holding bit masks for the requested and the returned events. The important events are POLLIN, indicating that data can be read from the file descriptor; POLLPRI, indicating an exceptional condition on the file descriptor; POLLOUT, indicating that a write operation is possible; and POLLRDHUP, which means a hang up on a stream socket because the peer closed the connection. The _GNU_SOURCE feature test macro must be defined for using the POLLRDHUP event. These events can be passed in events and, if any of them occur, they are returned in revents. There are also error conditions, which are ignored if passed in events but are returned in revents when they occur. These are POLLERR, for an error condition on the file descriptor; POLLHUP, for hang up; and POLLNVAL, for an invalid request because the file descriptor was not open.

If fd is negative, that element of struct pollfd is ignored, and revents in the structure is returned as zero. This provides a way to quickly invalidate a file descriptor for which we are no longer interested in finding events.

The second parameter, nfds, specifies the number of struct pollfd elements in the array fds. The last parameter, timeout, is in milliseconds. A negative value of timeout means an infinite timeout.
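The parameters described above can be tried out with a short sketch. The function name poll_demo is hypothetical, and pipes are used in place of sockets to keep the example self-contained. It polls two descriptors that have pending data and then disables the second one by making its fd negative.

```c
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if poll behaved as described, -1 otherwise */
int poll_demo (void)
{
    int a [2], b [2];
    if (pipe (a) == -1 || pipe (b) == -1)
        return -1;
    if (write (a [1], "x", 1) != 1 || write (b [1], "y", 1) != 1)
        return -1;

    struct pollfd fds [2] = {
        { .fd = a [0], .events = POLLIN, .revents = 0 },
        { .fd = b [0], .events = POLLIN, .revents = 0 },
    };

    /* both read ends have data, so two entries are reported */
    int first = poll (fds, 2, 100);

    /* a negative fd makes poll ignore the entry and zero its revents */
    fds [1].fd = -fds [1].fd;
    int second = poll (fds, 2, 100);
    short ignored_revents = fds [1].revents;
    int first_entry_ready = (fds [0].revents & POLLIN) != 0;

    printf ("first poll: %d, second poll: %d\n", first, second);

    close (a [0]); close (a [1]);
    close (b [0]); close (b [1]);
    return (first == 2 && second == 1 && first_entry_ready
            && ignored_revents == 0) ? 0 : -1;
}
```

Negating the value, as the server below also does, works only for descriptors greater than zero. Storing -1 is an alternative when the original value does not have to be recovered.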

Just as select has the companion pselect call, poll has ppoll. ppoll is like poll, except that the timeout is a pointer to struct timespec, which has seconds and nanoseconds. The last parameter, a pointer to sigset_t, becomes the signal mask of the thread for the duration of the ppoll call. After the call, the original signal mask is restored. ppoll returns when one or more of the requested events have occurred, the timeout has expired, or a signal that is not blocked by the mask has been caught.
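A small sketch of a ppoll call is given below. The helper wait_input_no_sigint and the demo function are hypothetical names, and the choice of SIGINT as the signal to block is only an assumption for illustration.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait for input on fd for at most ms milliseconds,
   with SIGINT blocked for the duration of the call only.
   Returns what ppoll returns: number of ready entries, 0 on timeout, -1 on error. */
int wait_input_no_sigint (int fd, long ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

    struct timespec tmo;
    tmo.tv_sec = ms / 1000;
    tmo.tv_nsec = (ms % 1000) * 1000000L;

    sigset_t mask;
    if (sigprocmask (SIG_BLOCK, NULL, &mask) == -1)   /* fetch current mask */
        return -1;
    if (sigaddset (&mask, SIGINT) == -1)              /* also block SIGINT */
        return -1;

    return ppoll (&pfd, 1, &tmo, &mask);
}

/* Hypothetical demo: returns 0 if ppoll timed out on an empty pipe and
   reported one ready descriptor after a write, -1 otherwise. */
int ppoll_demo (void)
{
    int p [2];
    if (pipe (p) == -1)
        return -1;

    int idle = wait_input_no_sigint (p [0], 20);
    if (write (p [1], "x", 1) != 1)
        return -1;
    int ready = wait_input_no_sigint (p [0], 20);

    close (p [0]);
    close (p [1]);
    printf ("idle: %d, ready: %d\n", idle, ready);
    return (idle == 0 && ready == 1) ? 0 : -1;
}
```

Passing NULL as the last argument makes ppoll leave the signal mask alone, so the call then differs from poll only in the precision of the timeout.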

3.1 Example program using poll for I/O multiplexing

In a previous post, I had listed an example server using the select call for I/O multiplexing. Now we have another version of that program and this version uses poll for I/O multiplexing. The server code is given below.

3.2 The server

/* 
 *           flight-time-server.c: record and provide time of a
 *                                 flight from the airport
 *
 */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <poll.h>
#include <errno.h>
#include <syslog.h>
#include <unistd.h>
#include <stdbool.h>
#include <ctype.h>
#include <stdint.h>
#include <time.h>

#define FLIGHT_NUM_SIZE            15

#define SERVER_PORT                "4358"
#define STORE_FLIGHT               1
#define FLIGHT_TIME_STORED         2
#define FLIGHT_TIME                3
#define FLIGHT_TIME_RESULT         4
#define FLIGHT_NOT_FOUND           5
#define ERROR_IN_INPUT             9

#define BACKLOG                   10
#define NUM_FDS                    5

void error (char *msg);

struct message {
    int32_t message_id;
    char flight_no [FLIGHT_NUM_SIZE + 1];
    char departure [1 + 1]; // 'D': departure, 'A': arrival
    char date [10 + 1]; // dd/mm/yyyy
    char time [5 + 1];   // hh:mm
};

struct tnode {
    char *flight_no;
    bool departure; // true: departure, false: arrival
    time_t flight_time;
    struct tnode *left;
    struct tnode *right;
};

struct message recv_message, send_message;

struct tnode *add_to_tree (struct tnode *p, char *flight_no, bool departure, time_t flight_time);
struct tnode *find_flight_rec (struct tnode *p, char *flight_no);
void print_tree (struct tnode *p);
void trim (char *dest, char *src); 
void error (char *msg);

int main (int argc, char **argv)
{
    const char * const ident = "flight-time-server";

    openlog (ident, LOG_CONS | LOG_PID | LOG_PERROR, LOG_USER);
    syslog (LOG_USER | LOG_INFO, "%s", "Hello world!");
    
    struct addrinfo hints;
    memset(&hints, 0, sizeof (struct addrinfo));
    hints.ai_family = AF_UNSPEC;    /* allow IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* Stream socket */
    hints.ai_flags = AI_PASSIVE;    /* for wildcard IP address */

    struct addrinfo *result;
    int s; 
    if ((s = getaddrinfo (NULL, SERVER_PORT, &hints, &result)) != 0) {
        fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (s));
        exit (EXIT_FAILURE);
    }

    /* Scan through the list of address structures returned by 
       getaddrinfo. Stop when the socket and bind calls are successful. */

    int listener, optval = 1;
    socklen_t length;
    struct addrinfo *rptr;
    for (rptr = result; rptr != NULL; rptr = rptr -> ai_next) {
        listener = socket (rptr -> ai_family, rptr -> ai_socktype,
                       rptr -> ai_protocol);
        if (listener == -1)
            continue;

        if (setsockopt (listener, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof (int)) == -1)
            error("setsockopt");

        if (bind (listener, rptr -> ai_addr, rptr -> ai_addrlen) == 0)  // Success
            break;

        if (close (listener) == -1)
            error ("close");
    }

    if (rptr == NULL) {               // Not successful with any address
        fprintf(stderr, "Not able to bind\n");
        exit (EXIT_FAILURE);
    }

    freeaddrinfo (result);

    // Mark socket for accepting incoming connections using accept
    if (listen (listener, BACKLOG) == -1)
        error ("listen");

    nfds_t nfds = 0;
    struct pollfd *pollfds;
    int maxfds = 0, numfds = 0;

    if ((pollfds = malloc (NUM_FDS * sizeof (struct pollfd))) == NULL)
        error ("malloc");
    maxfds = NUM_FDS;

    pollfds -> fd = listener;
    pollfds -> events = POLLIN;
    pollfds -> revents = 0;
    numfds = 1;

    socklen_t addrlen;
    struct sockaddr_storage client_saddr;
    char str [INET6_ADDRSTRLEN];
    struct sockaddr_in  *ptr;
    struct sockaddr_in6  *ptr1;
    struct tnode *root = NULL;

    while (1) {
        // monitor pollfds for readiness for reading
        nfds = numfds;
        if (poll (pollfds, nfds, -1) == -1)
            error ("poll");

        // Some sockets are ready. Examine pollfds
        for (int fd = 0; fd < (int) nfds; fd++) {  // only the nfds entries that were passed to poll
            if ((pollfds + fd) -> fd <= 0) // file desc == 0 is not expected, as these are socket fds and not stdin
                continue;

            if (((pollfds + fd) -> revents & POLLIN) == POLLIN) {  // fd is ready for reading 
                if ((pollfds + fd) -> fd == listener) {  // request for new connection
                    addrlen = sizeof (struct sockaddr_storage);
                    int fd_new;
                    if ((fd_new = accept (listener, (struct sockaddr *) &client_saddr, &addrlen)) == -1)
                        error ("accept");
                    // add fd_new to pollfds
                    if (numfds == maxfds) { // create space
                        if ((pollfds = realloc (pollfds, (maxfds + NUM_FDS) * sizeof (struct pollfd))) == NULL)
                            error ("realloc");
                        maxfds += NUM_FDS;
                    }
                    numfds++;
                    (pollfds + numfds - 1) -> fd = fd_new;
                    (pollfds + numfds - 1) -> events = POLLIN;
                    (pollfds + numfds - 1) -> revents = 0;

                    // log IP address of the new client
                    str [0] = '\0';
                    if (client_saddr.ss_family == AF_INET) {
                        ptr = (struct sockaddr_in *) &client_saddr;
                        inet_ntop (AF_INET, &(ptr -> sin_addr), str, sizeof (str));
                    }
                    else if (client_saddr.ss_family == AF_INET6) {
                        ptr1 = (struct sockaddr_in6 *) &client_saddr;
                        inet_ntop (AF_INET6, &(ptr1 -> sin6_addr), str, sizeof (str));
                    }
                    else
                        fprintf (stderr, "Address family is neither AF_INET nor AF_INET6\n");
                    if (str [0])  // an address was formatted
                        syslog (LOG_USER | LOG_INFO, "%s %s", "Connection from client", str);
                
                }
                else  // data from an existing connection, receive it
                {
                    memset (&recv_message, '\0', sizeof (struct message));
                    ssize_t numbytes = recv ((pollfds + fd) -> fd, &recv_message, sizeof (struct message), 0);
   
                    if (numbytes == -1)
                        error ("recv");
                    else if (numbytes == 0) {
                        // connection closed by client
                        fprintf (stderr, "Socket %d closed by client\n", (pollfds + fd) -> fd);
                        if (close ((pollfds + fd) -> fd) == -1)
                            error ("close");
                        (pollfds + fd) -> fd *= -1; // make it negative so that it is ignored in future
                    }
                    else 
                    {
                        // data from client
                        bool valid;
                        char temp_buf [FLIGHT_NUM_SIZE + 1];
                        
                        switch (ntohl (recv_message.message_id)) {
                            case STORE_FLIGHT:
                                   valid = true;
                                   // validate flight number
                                   if (recv_message.flight_no [FLIGHT_NUM_SIZE])
                                       recv_message.flight_no [FLIGHT_NUM_SIZE] = '\0';
                                   if (strlen (recv_message.flight_no) < 3)
                                       valid = false;
                                   trim (temp_buf, recv_message.flight_no);
                                   strcpy (recv_message.flight_no, temp_buf);
                                   bool departure;
                                   if (toupper (recv_message.departure [0]) == 'D')
                                       departure = true;
                                   else if (toupper (recv_message.departure [0]) == 'A')
                                       departure = false; 
                                   else
                                       valid = false;

                                   char delim [] = "/";
                                   char *mday, *month, *year, *saveptr;
                                   mday = month = year = NULL;
                                   mday = strtok_r (recv_message.date, delim, &saveptr);
                                   if (mday)
                                       month = strtok_r (NULL, delim, &saveptr);
                                   else 
                                       valid = false;
                                   if (month)
                                       year = strtok_r (NULL, delim, &saveptr);
                                   else 
                                       valid = false;
                                   if (!year)
                                       valid = false;
                                   char *hrs, *min;
                                   // get time
                                   if (recv_message.time [5])
                                       recv_message.time [5] = '\0';
                                   delim [0] = ':';
                                   hrs = min = NULL;
                                   hrs = strtok_r (recv_message.time, delim, &saveptr);
                                   if (hrs) 
                                       min = strtok_r (NULL, delim, &saveptr);
                                   if (!hrs || !min)
                                       valid = false;

                                   time_t ts;

                                   if (valid) {
                                       struct tm tm = {0};  // zeroed, so a failed sscanf leaves a defined value

                                       tm.tm_sec = 0;
                                       sscanf (min, "%d", &tm.tm_min);
                                       sscanf (hrs, "%d", &tm.tm_hour);
                                       sscanf (mday, "%d", &tm.tm_mday);
                                       sscanf (month, "%d", &tm.tm_mon);
                                       (tm.tm_mon)--;
                                       sscanf (year, "%d", &tm.tm_year);
                                       tm.tm_year -= 1900;
                                       tm.tm_isdst = -1;

                                       if ((ts = mktime (&tm)) == (time_t) -1)
                                           valid = false;
                                  
                                       time_t now;

                                       if ((now = time (NULL)) == (time_t) -1)
                                           error ("time");

                                       if (ts < now)
                                           valid = false;
                                   }

                                   if (!valid) {
                                       // send error message to client
                                       send_message.message_id = htonl (ERROR_IN_INPUT);
                                       size_t msg_len = sizeof (send_message.message_id);  // only the id is meaningful here
                                       if (send ((pollfds + fd) -> fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                   }
                                   else
                                   {
                                       // add flight data to tree
                                       root = add_to_tree (root, recv_message.flight_no, departure, ts);
                                       // send confirmation to client
                                       send_message.message_id = htonl (FLIGHT_TIME_STORED);
                                       strcpy (send_message.flight_no, recv_message.flight_no);
                                       strcpy (send_message.departure, (departure) ? "D" : "A");
                                       struct tm *tms;  
                                       if ((tms = localtime (&ts)) == NULL)
                                            error ("localtime");  // tms must not be dereferenced if NULL
                                       sprintf (send_message.date, "%02d/%02d/%d", tms -> tm_mday, 
                                            tms -> tm_mon + 1, tms -> tm_year + 1900);
                                       sprintf (send_message.time, "%02d:%02d", tms -> tm_hour,
                                            tms -> tm_min);
                                       size_t msg_len = sizeof (struct message);
                                       if (send ((pollfds + fd) -> fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                   }
                                   break;
                            case FLIGHT_TIME:
                                   valid = true;
                                   // validate flight number
                                   if (recv_message.flight_no [FLIGHT_NUM_SIZE])
                                       recv_message.flight_no [FLIGHT_NUM_SIZE] = '\0';
                                   if (strlen (recv_message.flight_no) < 3)
                                       valid = false;
                                   if (!valid) {
                                       // send error message to client
                                       send_message.message_id = htonl (ERROR_IN_INPUT);
                                       size_t msg_len = sizeof (send_message.message_id);  // only the id is meaningful here
                                       if (send ((pollfds + fd) -> fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                       break;
                                   }
                                   char temp_buf [FLIGHT_NUM_SIZE + 1];
                                   trim (temp_buf, recv_message.flight_no);
                                   strcpy (recv_message.flight_no, temp_buf);
                                   struct tnode *ptr;
                                   ptr = find_flight_rec (root, recv_message.flight_no);
                                   if (!ptr) {
                                       memset (&send_message, '\0', sizeof (struct message));
                                       send_message.message_id = htonl (FLIGHT_NOT_FOUND);
                                       strcpy (send_message.flight_no, recv_message.flight_no);
                                       size_t msg_len = sizeof (struct message);
                                       if (send ((pollfds + fd) -> fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                       break;
                                   }
                                   send_message.message_id = htonl (FLIGHT_TIME_RESULT);
                                   strcpy (send_message.flight_no, recv_message.flight_no);
                                   strcpy (send_message.departure, (ptr -> departure) ? "D" : "A");
                                   struct tm *tms;  
                                   if ((tms = localtime (&(ptr -> flight_time))) == NULL)
                                        error ("localtime");  // tms must not be dereferenced if NULL
                                   sprintf (send_message.date, "%02d/%02d/%d", tms -> tm_mday, 
                                            tms -> tm_mon + 1, tms -> tm_year + 1900);
                                   sprintf (send_message.time, "%02d:%02d", tms -> tm_hour,
                                            tms -> tm_min);
                                   size_t msg_len = sizeof (struct message);
                                   if (send ((pollfds + fd) -> fd, &send_message, msg_len, 0) == -1)
                                       error ("send");
                                   break;

                        }

                    }
                }
            } // if (fd == ...
        } // for
    } // while (1)

    exit (EXIT_SUCCESS);
} // main

// record the flight departure / arrival time    
struct tnode *add_to_tree (struct tnode *p, char *flight_no, bool departure, time_t flight_time)
{
    int res;

    if (p == NULL) {  // new entry
        if ((p = (struct tnode *) malloc (sizeof (struct tnode))) == NULL)
            error ("malloc");
        p -> flight_no = strdup (flight_no);
        p -> departure = departure;
        p -> flight_time = flight_time;
        p -> left = p -> right = NULL;
    }
    else if ((res = strcmp (flight_no, p -> flight_no)) == 0) { // entry exists
        p -> departure = departure;
        p -> flight_time = flight_time;
    }
    else if (res < 0) // less than flight_no for this node, put in left subtree
        p -> left = add_to_tree (p -> left, flight_no, departure, flight_time);
    else   // greater than flight_no for this node, put in right subtree
        p -> right = add_to_tree (p -> right, flight_no, departure, flight_time);
    return p;
}

// find node for the flight for which departure or arrival time is queried
struct tnode *find_flight_rec (struct tnode *p, char *flight_no)
{
    int res;

    if (!p) 
        return p;
    res = strcmp (flight_no, p -> flight_no);
    
    if (!res)
        return p;

    if (res < 0)
        return find_flight_rec (p -> left, flight_no);
    else 
        return find_flight_rec (p -> right, flight_no);
}

// print_tree: print the tree (in-order traversal)
void print_tree (struct tnode *p)
{
    if (p != NULL) {
        print_tree (p -> left);
        printf ("%s: %d %s\n\n", p -> flight_no, (int) p -> departure, ctime (&(p -> flight_time)));
        print_tree (p -> right);
    }
}

void error (char *msg)
{
    perror (msg);
    exit (1);
}

// trim: copy src to dest without leading and trailing whitespace
void trim (char *dest, char *src)
{
    if (!src || !dest)
       return;

    int len = strlen (src);

    if (!len) {
        *dest = '\0';
        return;
    }
    char *ptr = src + len - 1;

    // remove trailing whitespace
    while (ptr > src) {
        if (!isspace ((unsigned char) *ptr))
            break;
        ptr--;
    }

    ptr++;

    char *q;
    // remove leading whitespace
    for (q = src; (q < ptr && isspace ((unsigned char) *q)); q++)
        ;

    while (q < ptr)
        *dest++ = *q++;

    *dest = '\0';
}
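The server above marks the entry of a closed connection with a negative fd and always appends new connections at the end of the array, so the array keeps growing over time. One possible refinement, which is not part of the program above, is to reuse the ignored entries. The helper add_pollfd below is a hypothetical sketch of that idea.

```c
#include <poll.h>
#include <stdlib.h>

/* Hypothetical helper: store fd in the first ignored (negative) entry of
   *fds, or append it, growing the array by step entries when it is full.
   Returns the index used, or -1 if realloc fails (*fds remains valid). */
int add_pollfd (struct pollfd **fds, int *numfds, int *maxfds, int step, int fd)
{
    int i;

    for (i = 0; i < *numfds; i++)
        if ((*fds) [i].fd < 0)
            break;                          /* reuse a closed entry */

    if (i == *numfds) {                     /* no free entry, append */
        if (*numfds == *maxfds) {           /* create space */
            struct pollfd *tmp = realloc (*fds,
                                     (*maxfds + step) * sizeof (struct pollfd));
            if (tmp == NULL)
                return -1;
            *fds = tmp;
            *maxfds += step;
        }
        (*numfds)++;
    }

    (*fds) [i].fd = fd;
    (*fds) [i].events = POLLIN;
    (*fds) [i].revents = 0;
    return i;
}
```

In the server, the accept branch could call add_pollfd (&pollfds, &numfds, &maxfds, NUM_FDS, fd_new) in place of the realloc and assignment code. Since the new entry has revents set to zero, it is harmless if the scan loop reaches it in the same iteration.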

3.3 Client

The client code is the same as before.

/* 
 *       flight-time-client.c : get flight time from the server
 *
 */

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <ctype.h>
#include <stdint.h>
#include <time.h>

#define FLIGHT_NUM_SIZE            15

#define SERVER_PORT                "4358"
#define STORE_FLIGHT               1
#define FLIGHT_TIME_STORED         2
#define FLIGHT_TIME                3
#define FLIGHT_TIME_RESULT         4
#define FLIGHT_NOT_FOUND           5
#define ERROR_IN_INPUT             9
#define QUIT                       0

void error (char *msg);

struct message {
    int32_t message_id;
    char flight_no [FLIGHT_NUM_SIZE + 1];
    char departure [1 + 1]; // 'D': departure, 'A': arrival
    char date [10 + 1]; // dd/mm/yyyy
    char time [5 + 1];   // hh:mm
};

struct message message;

int get_input (void);
void error (char *msg);

int main (int argc, char **argv)
{
    if (argc != 2) {
        fprintf (stderr, "Usage: client hostname\n");
        exit (EXIT_FAILURE);
    }

    struct addrinfo hints;
    memset(&hints, 0, sizeof (struct addrinfo));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    struct addrinfo *result;
    int s; 
    if ((s = getaddrinfo (argv [1], SERVER_PORT, &hints, &result)) != 0) {
        fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (s));
        exit (EXIT_FAILURE);
    }

    /* Scan through the list of address structures returned by 
       getaddrinfo. Stop when the socket and connect calls are successful. */

    int sock_fd;
    socklen_t length;
    struct addrinfo *rptr;
    for (rptr = result; rptr != NULL; rptr = rptr -> ai_next) {
        sock_fd = socket (rptr -> ai_family, rptr -> ai_socktype,
                       rptr -> ai_protocol);
        if (sock_fd == -1)
            continue;

        if (connect (sock_fd, rptr -> ai_addr, rptr -> ai_addrlen) == -1) {
            if (close (sock_fd) == -1)
                error ("close");
            continue;
        }
        
        break;
    }

    if (rptr == NULL) {               // Not successful with any address
        fprintf(stderr, "Not able to connect\n");
        exit (EXIT_FAILURE);
    }

    freeaddrinfo (result);

    int option;

    while (1) {
         option = get_input ();
         if (option == QUIT)
             break;

         // send request to server
         if (send (sock_fd, &message, sizeof (struct message), MSG_NOSIGNAL) == -1)
             error ("send");

         // receive response from server
         if (recv (sock_fd, &message, sizeof (struct message), 0) == -1)
             error ("recv");

         // process server response 
         switch (ntohl (message.message_id)) {
             case FLIGHT_TIME_STORED: 
             case FLIGHT_TIME_RESULT: printf ("\nResponse: \n\n");
                    printf ("\t%s: %s %s %s\n\n", message.flight_no, message.departure, 
                                              message.date, message.time);
                     break;
             case FLIGHT_NOT_FOUND: printf ("\nFlight not found\n\n");
                     break;
             case ERROR_IN_INPUT: printf ("\nError in input\n\n");
                     break;
             default: printf ("\nUnrecognized message from server\n\n");
         }
    }

    exit (EXIT_SUCCESS);
}

char inbuf [512];

int get_input (void)
{
    int option;

    while (1) {
        printf ("Flight Info\n\n");
        printf ("\tFlight time query\t1\n");
        printf ("\tStore flight time\t2\n");
        printf ("\tQuit\t\t0\n\n");
        printf ("Your option: ");
        if (fgets (inbuf, sizeof (inbuf),  stdin) == NULL)
            error ("fgets");
        if (sscanf (inbuf, "%d", &option) != 1)
            option = -1;  // not a number, handled by the default case below

        int len;

        switch (option) {

            case 1: message.message_id = htonl (FLIGHT_TIME);
                    printf ("Flight no: ");
                    if (fgets (inbuf, sizeof (inbuf),  stdin) == NULL)
                        error ("fgets");
                    len = strlen (inbuf);
                    if (inbuf [len - 1] == '\n')
                        inbuf [len - 1] = '\0';
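                    strncpy (message.flight_no, inbuf, FLIGHT_NUM_SIZE);  // inbuf may be longer than flight_no
                    message.flight_no [FLIGHT_NUM_SIZE] = '\0';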
                    break;

            case 2: message.message_id = htonl (STORE_FLIGHT);
                    printf ("Flight no: ");
                    if (fgets (inbuf, sizeof (inbuf),  stdin) == NULL)
                        error ("fgets");
                    len = strlen (inbuf);
                    if (inbuf [len - 1] == '\n')
                        inbuf [len - 1] = '\0';
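                    strncpy (message.flight_no, inbuf, FLIGHT_NUM_SIZE);  // inbuf may be longer than flight_no
                    message.flight_no [FLIGHT_NUM_SIZE] = '\0';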

                    while (1) {
                        printf ("A/D: ");
                        if (fgets (inbuf, sizeof (inbuf),  stdin) == NULL)
                            error ("fgets");
                        message.departure [0] = toupper (inbuf [0]);
                        message.departure [1] = '\0';
                        if ((message.departure [0] == 'A') || (message.departure [0] == 'D'))
                            break;
                        printf ("Error in input, valid values are A and D\n");
                    }
                    
                    printf ("date (dd/mm/yyyy): ");
                    if (fgets (inbuf, sizeof (inbuf),  stdin) == NULL)
                        error ("fgets");
                    strncpy (message.date, inbuf, 10);
                    message.date [10] = '\0';
                    printf ("time (hh:mm): ");
                    if (fgets (inbuf, sizeof (inbuf),  stdin) == NULL)
                        error ("fgets");
                    strncpy (message.time, inbuf, 5);
                    message.time [5] = '\0';
                    break;

            case 0:
                    break;

            default: printf ("Illegal option, try again\n\n");
                     continue;

        }

        return option;
    }
}

void error (char *msg)
{
    perror (msg);
    exit (1);
}

We can compile and run the server and client programs.

$ gcc server.c -o server
$ gcc client.c -o client

Fig. 1: Running poll server and clients

3.4 poll vs select

How does poll compare with select? Is there any benefit of using poll as compared with using select? We can summarize the following points.

  1. select can only handle file descriptors smaller than FD_SETSIZE, which is 1024. poll has no such restriction.
  2. select uses a set of file descriptors. One has to check all file descriptors from 0 through fdmax for ready file descriptors. This is quite cumbersome. In poll, one can pass an array of pollfd structures for only file descriptors of interest. That makes poll more compact.
  3. select modifies the file descriptor set parameters. So one has to initialize the file descriptor sets for each iteration. poll has no such problem. The input events field is left untouched. The events that occur are returned in revents.
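The third point can be observed directly. In the sketch below, the function name select_vs_poll_demo is hypothetical, and two pipes are used, of which only one has data. After select, the idle descriptor has been removed from the set. After poll, the events fields are unchanged and the result is in revents.

```c
#include <poll.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if both observations hold, -1 otherwise */
int select_vs_poll_demo (void)
{
    int idle [2], busy [2];
    if (pipe (idle) == -1 || pipe (busy) == -1)
        return -1;
    if (write (busy [1], "x", 1) != 1)
        return -1;

    /* select: both descriptors go in, only the ready one comes back */
    fd_set readfds;
    FD_ZERO (&readfds);
    FD_SET (idle [0], &readfds);
    FD_SET (busy [0], &readfds);
    int maxfd = (idle [0] > busy [0]) ? idle [0] : busy [0];
    struct timeval tv = { 0, 0 };              /* return immediately */
    if (select (maxfd + 1, &readfds, NULL, NULL, &tv) == -1)
        return -1;
    int idle_dropped = !FD_ISSET (idle [0], &readfds);
    int busy_kept = FD_ISSET (busy [0], &readfds) != 0;

    /* poll: events is input only, the result is written to revents */
    struct pollfd fds [2] = {
        { .fd = idle [0], .events = POLLIN, .revents = 0 },
        { .fd = busy [0], .events = POLLIN, .revents = 0 },
    };
    if (poll (fds, 2, 0) == -1)
        return -1;
    int events_kept = fds [0].events == POLLIN && fds [1].events == POLLIN;
    int revents_ok = fds [0].revents == 0 && (fds [1].revents & POLLIN) != 0;

    printf ("select dropped idle fd: %d, poll kept events: %d\n",
            idle_dropped, events_kept);

    close (idle [0]); close (idle [1]);
    close (busy [0]); close (busy [1]);
    return (idle_dropped && busy_kept && events_kept && revents_ok) ? 0 : -1;
}
```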

4.0 epoll

The epoll API is a fast I/O notification facility. It provides functionality similar to that of the select and poll calls. The epoll_create and epoll_create1 calls create an epoll instance for monitoring file descriptors for I/O. The epoll_ctl call adds, modifies or deletes a file descriptor in the set monitored by an epoll instance. epoll_wait waits until I/O is possible on one or more file descriptors, or a timeout occurs.

epoll is Linux specific.

By default, epoll is level-triggered (LT). It can be made edge-triggered (ET) by specifying the EPOLLET flag in events in the epoll_ctl call. Edge-triggered means that an event is reported only when the state of a file descriptor changes, for example, when new data arrives on it. Level-triggered simply means that an event is reported as long as the condition holds, for example, as long as there is unread data on the file descriptor. Level-triggered is simpler and easier to use, and we will use it in our example programs.
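The difference between the two modes shows up when epoll_wait is called twice without reading the pending data. The sketch below uses a hypothetical helper, second_wait_result, and a pipe instead of a socket. In level-triggered mode the second call reports the descriptor again. In edge-triggered mode it times out, because no new data has arrived.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register the read end of a pipe with EPOLLIN plus
   extra_flags (0 or EPOLLET), write one byte, and call epoll_wait twice
   without reading. Returns the result of the second call, or -1 on error. */
int second_wait_result (uint32_t extra_flags)
{
    int p [2];
    if (pipe (p) == -1)
        return -1;
    int epfd = epoll_create1 (0);
    if (epfd == -1)
        return -1;

    struct epoll_event ev;
    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p [0];
    if (epoll_ctl (epfd, EPOLL_CTL_ADD, p [0], &ev) == -1)
        return -1;

    if (write (p [1], "x", 1) != 1)
        return -1;

    struct epoll_event out;
    int first = epoll_wait (epfd, &out, 1, 100);    /* data has arrived */
    int second = epoll_wait (epfd, &out, 1, 100);   /* data is still unread */

    close (epfd);
    close (p [0]);
    close (p [1]);
    printf ("flags %u: first %d, second %d\n", (unsigned) extra_flags, first, second);
    return (first == 1) ? second : -1;
}
```

second_wait_result (0) is expected to return 1 and second_wait_result (EPOLLET) to return 0. This is why edge-triggered code usually reads from a non-blocking descriptor until the read fails with EAGAIN.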

4.1 epoll_create, epoll_create1

#include <sys/epoll.h>

int epoll_create (int size);
int epoll_create1 (int flags);

epoll_create creates a new epoll instance. The argument size is obsolete but must be greater than zero. epoll_create1 is the newer call, and it also creates a new epoll instance. Both epoll_create and epoll_create1 return the file descriptor for the newly created epoll instance. In epoll_create1, flags can be zero or EPOLL_CLOEXEC, which means that the close-on-exec flag (FD_CLOEXEC) should be set on the new file descriptor.
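The effect of the flags argument can be checked with fcntl. The helper epoll_fd_is_cloexec below is a hypothetical name used only for this sketch.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance with the given flags and
   report whether FD_CLOEXEC is set on its file descriptor.
   Returns 1 if set, 0 if not set, -1 on error. */
int epoll_fd_is_cloexec (int flags)
{
    int epfd = epoll_create1 (flags);
    if (epfd == -1)
        return -1;

    int fdflags = fcntl (epfd, F_GETFD);    /* read the descriptor flags */
    close (epfd);                           /* an epoll fd is closed like any other */
    if (fdflags == -1)
        return -1;

    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

epoll_fd_is_cloexec (0) should return 0 and epoll_fd_is_cloexec (EPOLL_CLOEXEC) should return 1.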

4.2 epoll_ctl

#include <sys/epoll.h>

typedef union epoll_data {
    void        *ptr;
    int          fd;
    uint32_t     u32;
    uint64_t     u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;      /* Epoll events */
    epoll_data_t data;        /* User data variable */
};

int epoll_ctl (int epfd, int op, int fd, struct epoll_event *event);

epoll_ctl performs a control operation on the epoll instance referred to by the file descriptor, epfd. The third parameter, fd, is the file descriptor on which the operation is to be performed. The fourth parameter is a pointer to struct epoll_event and specifies the events to be monitored for the file descriptor, fd. The second parameter, op, is the operation and it can be one of the following:

  • EPOLL_CTL_ADD: The file descriptor, fd, is to be added to the set of file descriptors monitored by the epoll instance, epfd. The events to be monitored are specified in the last parameter, event.
  • EPOLL_CTL_MOD: The events associated with the file descriptor, fd, are to be replaced with those specified in the last parameter, event.
  • EPOLL_CTL_DEL: The file descriptor, fd, is to be deleted from the set. The event parameter is ignored.

struct epoll_event contains events, which is an unsigned 32-bit integer containing a bit mask. The bit mask is composed by ORing zero or more of the event types which are listed below.

epoll_ctl - events

Event           Description
EPOLLIN         File descriptor is available for read.
EPOLLOUT        File descriptor is available for write.
EPOLLRDHUP      Stream socket peer closed connection.
EPOLLPRI        Exceptional condition on file descriptor.
EPOLLERR        Error condition on file descriptor.
EPOLLHUP        Hang up on file descriptor.
EPOLLET         Set edge-triggered behavior for file descriptor.
EPOLLONESHOT    Set one shot behavior for file descriptor. (File descriptor is disabled after one event.)
EPOLLWAKEUP     Ensure that system does not enter "suspend" or "hibernate" when this event is pending or is being processed. (EPOLLONESHOT and EPOLLET must be clear and the process should have the CAP_BLOCK_SUSPEND capability.)
EPOLLEXCLUSIVE  Sets up the exclusive mode of wake up for the epoll file descriptor to which this file descriptor is being attached. Useful for avoiding the thundering herd problem in certain scenarios.
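
The three operations can be tried out on the read end of a pipe. The following is a minimal sketch; epoll_ctl_walkthrough is a hypothetical name. Apart from the operations themselves, it checks two error cases: adding a file descriptor that is already registered fails with EEXIST, and deleting one that is not registered fails with ENOENT.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Walks the read end of a pipe through ADD, MOD and DEL on a fresh epoll
   instance. Returns 0 if every step behaved as expected, the number of
   the first step that did not, or -1 if the setup failed. */
int epoll_ctl_walkthrough (void)
{
    int p [2], efd, step;
    struct epoll_event ev;

    if (pipe (p) == -1)
        return -1;
    if ((efd = epoll_create1 (0)) == -1) {
        close (p [0]);
        close (p [1]);
        return -1;
    }

    do {
        step = 1;   /* register the descriptor for read events */
        ev.events = EPOLLIN;
        ev.data.fd = p [0];
        if (epoll_ctl (efd, EPOLL_CTL_ADD, p [0], &ev) == -1)
            break;

        step = 2;   /* a second ADD of the same descriptor gives EEXIST */
        if (epoll_ctl (efd, EPOLL_CTL_ADD, p [0], &ev) != -1 || errno != EEXIST)
            break;

        step = 3;   /* replace the event mask of the registered descriptor */
        ev.events = EPOLLIN | EPOLLONESHOT;
        if (epoll_ctl (efd, EPOLL_CTL_MOD, p [0], &ev) == -1)
            break;

        step = 4;   /* remove the descriptor from the monitored set */
        if (epoll_ctl (efd, EPOLL_CTL_DEL, p [0], &ev) == -1)
            break;

        step = 5;   /* it is no longer registered, so DEL gives ENOENT */
        if (epoll_ctl (efd, EPOLL_CTL_DEL, p [0], &ev) != -1 || errno != ENOENT)
            break;

        step = 0;   /* all steps behaved as expected */
    } while (0);

    close (efd);
    close (p [0]);
    close (p [1]);
    return step;
}
```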

4.3 epoll_wait, epoll_pwait

#include <sys/epoll.h>

int epoll_wait (int epfd, struct epoll_event *events, 
                int maxevents, int timeout);

int epoll_pwait (int epfd, struct epoll_event *events, 
                 int maxevents, int timeout, const sigset_t *sigmask);

epoll_wait waits and returns when at least one event is available or the timeout has occurred. epfd is the file descriptor of the epoll instance. The events are returned in the array events, the second parameter. A maximum of maxevents events can be returned, and maxevents must be greater than zero. The timeout is in milliseconds. A timeout of -1 makes epoll_wait block indefinitely, and a timeout of zero makes it return immediately.

If successful, epoll_wait returns the number of file descriptors ready for I/O, or zero if none are ready and the timeout occurred. In case of error, epoll_wait returns -1 and errno is set to the error.
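
The two successful return values can be illustrated with a pipe. This is a sketch under simple assumptions; wait_on_pipe and its parameters are hypothetical names used only here. When the pipe is empty, epoll_wait returns zero after the timeout. When a byte has been written, it returns one and the returned event carries the file descriptor that was stored in data.fd at registration time.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a new pipe for EPOLLIN, optionally writes one
   byte to the pipe, and waits for up to timeout_ms milliseconds.
   Returns the value returned by epoll_wait, or -1 if the setup failed.
   *reported_read_end is set to 1 if the first returned event is EPOLLIN
   on the read end of the pipe, else to 0. */
int wait_on_pipe (int make_readable, int timeout_ms, int *reported_read_end)
{
    int p [2], efd, n = -1;
    struct epoll_event ev, events [4];

    *reported_read_end = 0;
    if (pipe (p) == -1)
        return -1;

    if ((efd = epoll_create1 (0)) != -1) {
        ev.events = EPOLLIN;
        ev.data.fd = p [0];   /* handed back to us in the returned event */
        if (epoll_ctl (efd, EPOLL_CTL_ADD, p [0], &ev) != -1 &&
            (!make_readable || write (p [1], "x", 1) == 1)) {
            n = epoll_wait (efd, events, 4, timeout_ms);
            if (n > 0 && events [0].data.fd == p [0] &&
                (events [0].events & EPOLLIN))
                *reported_read_end = 1;
        }
        close (efd);
    }

    close (p [0]);
    close (p [1]);
    return n;
}
```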

Just as we had the pselect and ppoll calls vis-à-vis the select and poll calls respectively, we have the epoll_pwait call. However, the timeout is still in milliseconds. The last parameter is a signal mask, which is made the signal mask of the thread for the duration of the call. epoll_pwait returns when there are events, or the timeout has happened, or a signal that is not blocked by sigmask has been delivered. After the epoll_pwait call is over, the original signal mask is restored for the thread.

5.0 Example program using epoll for I/O multiplexing

The flight time server program given earlier is modified to use epoll instead of poll and is given below.

5.1 The server

/* 
 *           flight-time-server.c: record and provide time of a
 *                                 flight from the airport
 *
 */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/epoll.h>
#include <errno.h>
#include <syslog.h>
#include <unistd.h>
#include <stdbool.h>
#include <ctype.h>
#include <stdint.h>
#include <time.h>

#define FLIGHT_NUM_SIZE            15

#define SERVER_PORT                "4358"
#define STORE_FLIGHT               1
#define FLIGHT_TIME_STORED         2
#define FLIGHT_TIME                3
#define FLIGHT_TIME_RESULT         4
#define FLIGHT_NOT_FOUND           5
#define ERROR_IN_INPUT             9

#define BACKLOG                   10
#define NUM_FDS                    5

#define MAX_EVENTS                10

void error (char *msg);

struct message {
    int32_t message_id;
    char flight_no [FLIGHT_NUM_SIZE + 1];
    char departure [1 + 1]; // 'D': departure, 'A': arrival
    char date [10 + 1]; // dd/mm/yyyy
    char time [5 + 1];   // hh:mm
};

struct tnode {
    char *flight_no;
    bool departure; // true: departure, false: arrival
    time_t flight_time;
    struct tnode *left;
    struct tnode *right;
};

struct message recv_message, send_message;

struct tnode *add_to_tree (struct tnode *p, char *flight_no, bool departure, time_t flight_time);
struct tnode *find_flight_rec (struct tnode *p, char *flight_no);
void print_tree (struct tnode *p);
void trim (char *dest, char *src);

int main (int argc, char **argv)
{
    const char * const ident = "flight-time-server";

    openlog (ident, LOG_CONS | LOG_PID | LOG_PERROR, LOG_USER);
    syslog (LOG_USER | LOG_INFO, "%s", "Hello world!");
    
    struct addrinfo hints;
    memset(&hints, 0, sizeof (struct addrinfo));
    hints.ai_family = AF_UNSPEC;    /* allow IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* Stream socket */
    hints.ai_flags = AI_PASSIVE;    /* for wildcard IP address */

    struct addrinfo *result;
    int s; 
    if ((s = getaddrinfo (NULL, SERVER_PORT, &hints, &result)) != 0) {
        fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (s));
        exit (EXIT_FAILURE);
    }

    /* Scan through the list of address structures returned by 
       getaddrinfo. Stop when the socket and bind calls are successful. */

    int listener, optval = 1;
    socklen_t length;
    struct addrinfo *rptr;
    for (rptr = result; rptr != NULL; rptr = rptr -> ai_next) {
        listener = socket (rptr -> ai_family, rptr -> ai_socktype,
                       rptr -> ai_protocol);
        if (listener == -1)
            continue;

        if (setsockopt (listener, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof (int)) == -1)
            error("setsockopt");

        if (bind (listener, rptr -> ai_addr, rptr -> ai_addrlen) == 0)  // Success
            break;

        if (close (listener) == -1)
            error ("close");
    }

    if (rptr == NULL) {               // Not successful with any address
        fprintf(stderr, "Not able to bind\n");
        exit (EXIT_FAILURE);
    }

    freeaddrinfo (result);

    // Mark socket for accepting incoming connections using accept
    if (listen (listener, BACKLOG) == -1)
        error ("listen");

    int efd;
    if ((efd = epoll_create1 (0)) == -1)
        error ("epoll_create1");
    struct epoll_event ev, ep_event [MAX_EVENTS];

    // monitor the listening socket for incoming connections
    ev.events = EPOLLIN;
    ev.data.fd = listener;
    if (epoll_ctl (efd, EPOLL_CTL_ADD, listener, &ev) == -1)
        error ("epoll_ctl");

    int nfds = 0;

    socklen_t addrlen;
    struct sockaddr_storage client_saddr;
    char str [INET6_ADDRSTRLEN];
    struct sockaddr_in  *ptr;
    struct sockaddr_in6  *ptr1;
    struct tnode *root = NULL;

    while (1) {
        // wait until one or more of the registered sockets are ready
        if ((nfds = epoll_wait (efd, ep_event, MAX_EVENTS, -1)) == -1)
            error ("epoll_wait");

        // Some sockets are ready. Examine the returned events
        for (int i = 0; i < nfds; i++) {

            if ((ep_event [i].events & EPOLLIN) == EPOLLIN) {
                if (ep_event [i].data.fd == listener) {  // request for new connection
                    addrlen = sizeof (struct sockaddr_storage);
                    int fd_new;
                    if ((fd_new = accept (listener, (struct sockaddr *) &client_saddr, &addrlen)) == -1)
                        error ("accept");
                    // add fd_new to epoll
                    ev.events = EPOLLIN;
                    ev.data.fd = fd_new;
                    if (epoll_ctl (efd, EPOLL_CTL_ADD, fd_new, &ev) == -1)
                        error ("epoll_ctl");

                    // print IP address of the new client
                    str [0] = '\0';
                    if (client_saddr.ss_family == AF_INET) {
                        ptr = (struct sockaddr_in *) &client_saddr;
                        inet_ntop (AF_INET, &(ptr -> sin_addr), str, sizeof (str));
                    }
                    else if (client_saddr.ss_family == AF_INET6) {
                        ptr1 = (struct sockaddr_in6 *) &client_saddr;
                        inet_ntop (AF_INET6, &(ptr1 -> sin6_addr), str, sizeof (str));
                    }
                    else
                        fprintf (stderr, "Address family is neither AF_INET nor AF_INET6\n");
                    // log only if the address was converted to text
                    if (str [0])
                        syslog (LOG_USER | LOG_INFO, "%s %s", "Connection from client", str);
                
                }
                else  // data from an existing connection, receive it
                {
                    memset (&recv_message, '\0', sizeof (struct message));
                    ssize_t numbytes = recv (ep_event [i].data.fd, &recv_message, sizeof (struct message), 0);
   
                    if (numbytes == -1)
                        error ("recv");
                    else if (numbytes == 0) {
                        // connection closed by client
                        fprintf (stderr, "Socket %d closed by client\n", ep_event [i].data.fd);
                        // delete fd from epoll
                        if (epoll_ctl (efd, EPOLL_CTL_DEL, ep_event [i].data.fd, &ev) == -1)
                            error ("epoll_ctl");
                        if (close (ep_event [i].data.fd) == -1)
                            error ("close");
                    }
                    else 
                    {
                        // data from client
                        bool valid;
                        char temp_buf [FLIGHT_NUM_SIZE + 1];
                        
                        switch (ntohl (recv_message.message_id)) {
                            case STORE_FLIGHT:
                                   valid = true;
                                   // validate flight number
                                   if (recv_message.flight_no [FLIGHT_NUM_SIZE])
                                       recv_message.flight_no [FLIGHT_NUM_SIZE] = '\0';
                                   if (strlen (recv_message.flight_no) < 3)
                                       valid = false;
                                   trim (temp_buf, recv_message.flight_no);
                                   strcpy (recv_message.flight_no, temp_buf);
                                   bool departure;
                                   if (toupper (recv_message.departure [0]) == 'D')
                                       departure = true;
                                   else if (toupper (recv_message.departure [0]) == 'A')
                                       departure = false; 
                                   else
                                       valid = false;

                                   char delim [] = "/";
                                   char *mday, *month, *year, *saveptr;
                                   mday = month = year = NULL;
                                   // make sure the date is terminated before tokenizing
                                   recv_message.date [10] = '\0';
                                   mday = strtok_r (recv_message.date, delim, &saveptr);
                                   if (mday)
                                       month = strtok_r (NULL, delim, &saveptr);
                                   else 
                                       valid = false;
                                   if (month)
                                       year = strtok_r (NULL, delim, &saveptr);
                                   else 
                                       valid = false;
                                   if (!year)
                                       valid = false;
                                   char *hrs, *min;
                                   // get time
                                   if (recv_message.time [5])
                                       recv_message.time [5] = '\0';
                                   delim [0] = ':';
                                   hrs = min = NULL;
                                   hrs = strtok_r (recv_message.time, delim, &saveptr);
                                   if (hrs) 
                                       min = strtok_r (NULL, delim, &saveptr);
                                   if (!hrs || !min)
                                       valid = false;

                                   time_t ts;

                                   if (valid) {
                                       struct tm tm;

                                       memset (&tm, 0, sizeof (tm));
                                       // reject date and time fields that are not numbers
                                       if (sscanf (min, "%d", &tm.tm_min) != 1 ||
                                           sscanf (hrs, "%d", &tm.tm_hour) != 1 ||
                                           sscanf (mday, "%d", &tm.tm_mday) != 1 ||
                                           sscanf (month, "%d", &tm.tm_mon) != 1 ||
                                           sscanf (year, "%d", &tm.tm_year) != 1)
                                           valid = false;
                                       (tm.tm_mon)--;
                                       tm.tm_year -= 1900;
                                       tm.tm_isdst = -1;

                                       if ((ts = mktime (&tm)) == (time_t) -1)
                                           valid = false;
                                  
                                       time_t now;

                                       if ((now = time (NULL)) == (time_t) -1)
                                           error ("time");

                                       if (ts < now)
                                           valid = false;
                                   }

                                   if (!valid) {
                                       // send error message to client
                                       send_message.message_id = htonl (ERROR_IN_INPUT);
                                       // only the message id is sent for an error
                                       size_t msg_len = sizeof (send_message.message_id);
                                       if (send (ep_event [i].data.fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                   }
                                   else
                                   {
                                       // add flight data to tree
                                       root = add_to_tree (root, recv_message.flight_no, departure, ts);
                                       // send confirmation to client
                                       send_message.message_id = htonl (FLIGHT_TIME_STORED);
                                       strcpy (send_message.flight_no, recv_message.flight_no);
                                       strcpy (send_message.departure, (departure) ? "D" : "A");
                                       struct tm *tms;  
                                       if ((tms = localtime (&ts)) == NULL)  
                                            error ("localtime");                    
                                       snprintf (send_message.date, sizeof (send_message.date), "%02d/%02d/%d", tms -> tm_mday, 
                                            tms -> tm_mon + 1, tms -> tm_year + 1900);
                                       snprintf (send_message.time, sizeof (send_message.time), "%02d:%02d", tms -> tm_hour,
                                            tms -> tm_min);
                                       size_t msg_len = sizeof (struct message);
                                       if (send (ep_event [i].data.fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                   }
                                   break;
                            case FLIGHT_TIME:
                                   valid = true;
                                   // validate flight number
                                   if (recv_message.flight_no [FLIGHT_NUM_SIZE])
                                       recv_message.flight_no [FLIGHT_NUM_SIZE] = '\0';
                                   if (strlen (recv_message.flight_no) < 3)
                                       valid = false;
                                   if (!valid) {
                                       // send error message to client
                                       send_message.message_id = htonl (ERROR_IN_INPUT);
                                       // only the message id is sent for an error
                                       size_t msg_len = sizeof (send_message.message_id);
                                       if (send (ep_event [i].data.fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                       break;
                                   }
                                   char temp_buf [FLIGHT_NUM_SIZE + 1];
                                   trim (temp_buf, recv_message.flight_no);
                                   strcpy (recv_message.flight_no, temp_buf);
                                   struct tnode *ptr;
                                   ptr = find_flight_rec (root, recv_message.flight_no);
                                   if (!ptr) {
                                       memset (&send_message, '\0', sizeof (struct message));
                                       send_message.message_id = htonl (FLIGHT_NOT_FOUND);
                                       strcpy (send_message.flight_no, recv_message.flight_no);
                                       size_t msg_len = sizeof (struct message);
                                       if (send (ep_event [i].data.fd, &send_message, msg_len, 0) == -1)
                                           error ("send");
                                       break;
                                   }
                                   send_message.message_id = htonl (FLIGHT_TIME_RESULT);
                                   strcpy (send_message.flight_no, recv_message.flight_no);
                                   strcpy (send_message.departure, (ptr -> departure) ? "D" : "A");
                                   struct tm *tms;  
                                   if ((tms = localtime (&(ptr -> flight_time))) == NULL)  
                                        error ("localtime");                    
                                   snprintf (send_message.date, sizeof (send_message.date), "%02d/%02d/%d", tms -> tm_mday, 
                                            tms -> tm_mon + 1, tms -> tm_year + 1900);
                                   snprintf (send_message.time, sizeof (send_message.time), "%02d:%02d", tms -> tm_hour,
                                            tms -> tm_min);
                                   size_t msg_len = sizeof (struct message);
                                   if (send (ep_event [i].data.fd, &send_message, msg_len, 0) == -1)
                                       error ("send");
                                   break;

                        }

                    }
                }
            } // if (EPOLLIN)
        } // for
    } // while (1)

    exit (EXIT_SUCCESS);
} // main

// record the flight departure / arrival time    
struct tnode *add_to_tree (struct tnode *p, char *flight_no, bool departure, time_t flight_time)
{
    int res;

    if (p == NULL) {  // new entry
        if ((p = (struct tnode *) malloc (sizeof (struct tnode))) == NULL)
            error ("malloc");
        if ((p -> flight_no = strdup (flight_no)) == NULL)
            error ("strdup");
        p -> departure = departure;
        p -> flight_time = flight_time;
        p -> left = p -> right = NULL;
    }
    else if ((res = strcmp (flight_no, p -> flight_no)) == 0) { // entry exists
        p -> departure = departure;
        p -> flight_time = flight_time;
    }
    else if (res < 0) // less than flight_no for this node, put in left subtree
        p -> left = add_to_tree (p -> left, flight_no, departure, flight_time);
    else   // greater than flight_no for this node, put in right subtree
        p -> right = add_to_tree (p -> right, flight_no, departure, flight_time);
    return p;
}

// find node for the flight for which departure or arrival time is queried
struct tnode *find_flight_rec (struct tnode *p, char *flight_no)
{
    int res;

    if (!p) 
        return p;
    res = strcmp (flight_no, p -> flight_no);
    
    if (!res)
        return p;

    if (res < 0)
        return find_flight_rec (p -> left, flight_no);
    else 
        return find_flight_rec (p -> right, flight_no);
}

// print_tree: print the tree (in-order traversal)
void print_tree (struct tnode *p)
{
    if (p != NULL) {
        print_tree (p -> left);
        printf ("%s: %d %s\n\n", p -> flight_no, (int) p -> departure, ctime (&(p -> flight_time)));
        print_tree (p -> right);
    }
}

void error (char *msg)
{
    perror (msg);
    exit (1);
}

// trim: copy src to dest, dropping leading and trailing whitespace
void trim (char *dest, char *src)
{
    if (!src || !dest)
       return;

    int len = strlen (src);

    if (!len) {
        *dest = '\0';
        return;
    }
    char *ptr = src + len - 1;

    // remove trailing whitespace
    while (ptr > src) {
        if (!isspace ((unsigned char) *ptr))
            break;
        ptr--;
    }

    ptr++;

    char *q;
    // remove leading whitespace
    for (q = src; (q < ptr && isspace ((unsigned char) *q)); q++)
        ;

    while (q < ptr)
        *dest++ = *q++;

    *dest = '\0';
}

5.2 Client

The client is the same as in the case of poll.

5.3 Running the server and clients

We can compile the server and client programs and run them.

$ gcc server.c -o server
$ gcc client.c -o client
$ ./server
flight-time-server[30417]: Hello world!

Fig. 2: Running the epoll server and clients

6.0 epoll vs. poll

  1. In poll, one has to pass an array of struct pollfd. Each element of this array has a file descriptor, the requested events and the returned events. If a file descriptor is closed, it is difficult to delete it from this array. One option is to negate it, since poll ignores entries with a negative file descriptor. This technique has the limitation that it cannot be used for the file descriptor 0. In the case of epoll, one can add or delete file descriptors using the epoll_ctl call and monitor file descriptors for I/O using epoll_wait. So epoll has a cleaner interface and is easier to use.
  2. epoll is faster and scales well to a large number of file descriptors. With poll, the whole array has to be passed to, and scanned by, the kernel on every call. epoll_wait returns only the file descriptors that are ready.
  3. poll is specified by the POSIX standard. epoll is Linux specific.
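
The negation technique mentioned in the first point can be sketched as follows. The helper poll_readable_pipe is a hypothetical name, and the sketch assumes that the read end of the pipe is not file descriptor 0, which is the case as long as the standard input is open. The pipe has a byte waiting to be read, yet poll reports nothing when the file descriptor in the array entry is negated.

```c
#include <poll.h>
#include <unistd.h>

/* Creates a pipe with one byte waiting in it and polls its read end.
   If negate_fd is nonzero, the descriptor in the pollfd entry is negated
   so that poll ignores the entry. Returns the value returned by poll,
   or -1 if the setup failed. */
int poll_readable_pipe (int negate_fd)
{
    int p [2], n;
    struct pollfd pfd;

    if (pipe (p) == -1)
        return -1;
    if (write (p [1], "x", 1) != 1) {
        close (p [0]);
        close (p [1]);
        return -1;
    }

    pfd.fd = negate_fd ? -p [0] : p [0];   /* a negative fd is skipped by poll */
    pfd.events = POLLIN;
    pfd.revents = 0;
    n = poll (&pfd, 1, 0);                 /* do not block */

    close (p [0]);
    close (p [1]);
    return n;
}
```

With epoll there is no such bookkeeping. A single epoll_ctl call with EPOLL_CTL_DEL removes the file descriptor, as is done in the server above when a client closes its connection.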