File I/O in Linux

1.0 Input and Output (I/O)

All programs need to interact with the external world, which makes I/O important. Programs store data in files, which provide large persistent storage. In this post, we will look at the system calls and functions for file I/O and at the issues that govern the interaction between a program and an I/O device.

2.0 Buffered I/O

Reading from and writing to a hard disk takes far longer than reading from and writing to main memory. Programs also tend to access data at nearby locations in successive I/O calls. For example, take the case of sequential file access by a program. The program reads data at a particular location, the file offset advances by the amount of data read, and the next read starts from that new offset. Data is read from or written to a hard disk in units of blocks, where the block size is determined by the filesystem on the disk. The block size is commonly 4K bytes. Since a program usually asks for less than a whole block at a time, data has to be buffered: a block is read once and smaller chunks from it are given to the program in subsequent read calls.

There are two levels of buffering, the kernel level and the user level. The kernel keeps a copy of recently accessed disk blocks in main memory, in its page cache. When a process wants some data from a file via a read call, the kernel first checks the page cache, and if the data is available, it is given to the process. If the data is not available in the cache, the concerned disk block is read into the cache and the data is given to the process. Similarly, for a write call, the kernel first checks whether the concerned block is in the page cache. If it is, the block is modified and marked for writing to the disk. If it is not, the concerned block is read, updated and marked for writing to the disk. Since processes ask for data located in close proximity in successive read and write calls, the page cache helps in minimizing device access for I/O.

There is also buffering in the user space. If a program uses a library like the Standard I/O Library and makes an fread call, a larger amount of data, say 8K, is read and the amount of data requested is returned from it to the caller. In subsequent fread calls, the library buffer is checked for the requested data, and if the data is available in the buffer, it is given to the caller straightaway. If the request cannot be satisfied from the buffered data, the library makes a read system call to refill its buffer. This way, the process keeps working in the user space for longer, and system calls and the switches into the kernel that go with them are minimized.
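The user-space buffering described above can be sketched with the Standard I/O Library. This is only an illustration: the function name count_bytes_buffered is hypothetical, and the 8192-byte buffer simply mirrors the "say 8K" figure in the text rather than any value the library guarantees. The function asks for one byte per fread call, yet most of those calls are served from the buffer without entering the kernel.

```c
#include <stdio.h>

/* Count the bytes in a file by calling fread one byte at a time.
 * Returns the byte count, or -1 if the file cannot be opened or read.
 * (count_bytes_buffered is a hypothetical helper for illustration.) */
long count_bytes_buffered(const char *path)
{
    char iobuf[8192];                 /* user-space buffer handed to stdio */
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    /* Ask for full buffering with our 8K buffer. setvbuf must be called
     * before any other operation on the stream. */
    if (setvbuf(fp, iobuf, _IOFBF, sizeof iobuf) != 0) {
        fclose(fp);
        return -1;
    }

    long total = 0;
    char c;
    /* Each fread asks for a single byte; the library refills iobuf with
     * a larger read only when the buffer runs empty. */
    while (fread(&c, 1, 1, fp) == 1)
        total++;

    int failed = ferror(fp);
    fclose(fp);                       /* iobuf is not used after this point */
    return failed ? -1 : total;
}
```

Running such a program under strace and comparing the number of read system calls with the number of fread calls is one way to observe the effect.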

3.0 Synchronized I/O

For a write operation, data is only updated in the page cache, and is not immediately written on the hard disk. So, if for some reason, the system shuts down abruptly without being able to write the updated pages on the hard disk, there is the possibility of data loss.

Synchronized I/O means that when we make a write-like call, the data is physically written to the hard disk and all the control metadata is updated and, only then, the call returns. Synchronized I/O is not to be confused with synchronous I/O. Linux calls like open, read, write and close are all synchronous; they block by default and return only when the required functionality is done. It is a different matter that the functionality required of write is only that of writing to the page cache. However, if we say synchronized I/O, the write must actually write all the way down to the hard disk and update all the concerned control metadata.

Synchronized I/O comes with a performance penalty and is not enabled by default. It has to be requested by using certain flags or calls. There might be situations where synchronized I/O is desirable. In this post, we will look at ways to use synchronized I/O in Linux.

4.0 Primary I/O System Calls

The primary system calls for I/O are open, creat, read, write, lseek, close and unlink.

4.1 open

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int open (const char *pathname, int flags);
int open (const char *pathname, int flags, mode_t mode); 

The open system call opens the file identified by the first parameter, pathname. The second parameter, flags, specifies the access mode. flags must include one of the access modes, O_RDONLY, O_WRONLY, or O_RDWR, for read only, write only, and read and write respectively. Further file creation and status flags can be specified by OR-ing them with the access mode. If O_CREAT is specified and the file does not exist, it is created. If O_CREAT is specified, the second form of open needs to be used and the third parameter, mode, is to be specified. mode specifies the permission bits for the file for the user, group and others. For example, a mode value of octal 755 (for -rwxr-xr-x) gives the read, write and execute permissions to the owner, and the read and execute permissions to the group and others. Instead of octal 755, we can write the mode using symbolic constants,

S_IRUSR | S_IWUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH

And, we can replace S_IRUSR | S_IWUSR | S_IXUSR by S_IRWXU, which says all the three permissions for the User. And, we have the symbolic constant representation of the mode value octal 755, as

S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH

Similarly, the read, write and execute permissions for the group and others are S_IRWXG and S_IRWXO respectively. If the file is opened for writing, i.e., with access mode O_WRONLY or O_RDWR, and the O_TRUNC flag is specified, and the file already exists, it is truncated to length zero. Next, there is the O_EXCL flag. If O_EXCL is specified along with O_CREAT, and a file with pathname already exists, open fails and errno is set to EEXIST.
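As a minimal sketch of the flags and mode discussed above, the following helper creates a file with the symbolic form of octal 755 and uses O_CREAT together with O_EXCL to detect a file that already exists. The name create_exclusive and its return convention are assumptions made for this example. Note that the permissions actually set on the new file are mode with the bits in the process umask cleared.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a new file with mode 0755 (before the umask is applied).
 * Returns 0 if the file was created, 1 if it already existed,
 * and -1 on any other error. (Hypothetical helper for illustration.) */
int create_exclusive(const char *path)
{
    mode_t mode = S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;

    /* O_CREAT | O_EXCL: fail with EEXIST instead of opening an
     * existing file. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
    if (fd == -1)
        return (errno == EEXIST) ? 1 : -1;

    close(fd);
    return 0;
}
```

Because the existence check and the creation happen inside a single open call, two processes racing to create the same file cannot both succeed.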

There are two flags, O_DSYNC and O_SYNC, related to synchronized I/O. If O_DSYNC is specified as a part of flags, synchronized I/O is done ensuring data integrity. Similarly, if O_SYNC is specified as a part of flags, synchronized I/O is done ensuring file integrity. As an example, take the case of the write system call. If neither O_DSYNC nor O_SYNC is specified, write just needs to update the kernel page cache and return. This is the most efficient, but if the system stops abruptly, the buffers might not get written to the disk and some data is lost.

If the O_DSYNC flag is specified, there needs to be data integrity, which means that a subsequent read must return the data written earlier, even if the system had shut down abruptly in between. For this, the following needs to be done before the write call returns. The data passed in the write call needs to be written to the hard disk, and the metadata necessary for reading this data back also needs to be written to the hard disk. For example, if this data extends the size of the file, the file size needs to be updated in the inode before the write returns. But the file modification timestamp is not required for maintaining data integrity, so it need not be written.

What about the O_SYNC flag? If the O_SYNC flag is used, not just data integrity but also file integrity is to be ensured, and the file modification timestamp is also updated in the inode before write returns. So, compared to the O_SYNC flag, the O_DSYNC flag tries to save some disk access during the write call.
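A small sketch of a data-integrity write follows. The helper name write_dsync and the 0644 mode are assumptions made for the example. Replacing O_DSYNC with O_SYNC in the open call would request file integrity instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write text to path with O_DSYNC, so that write returns only after the
 * data and the metadata needed to read it back have reached the disk.
 * Returns 0 on success, -1 on error. (Hypothetical helper.) */
int write_dsync(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(text);
    /* For brevity, a short write is treated as a failure here. */
    int ok = (write(fd, text, len) == (ssize_t)len);

    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```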

open returns a file descriptor on success. If open fails, it returns -1 and errno is set accordingly.

4.2 creat

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int creat (const char *pathname, mode_t mode);

creat is equivalent to the open system call with flags O_CREAT | O_WRONLY | O_TRUNC. creat is there for historical reasons. We can safely ignore creat and use open with appropriate flags.

4.3 read

#include <unistd.h>

ssize_t read (int fd, void *buf, size_t count);

The read system call reads up to count bytes from the file identified by the descriptor, fd, into the buffer pointed to by buf. The number of bytes actually read is returned, and it may be less than count. At the end of file, 0 is returned. In case of error, -1 is returned and errno is set appropriately. The file offset is advanced by the number of bytes read.
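Since read may return fewer bytes than requested (for example, when reading from a pipe or when interrupted by a signal), callers that need a fixed amount of data usually loop. The following is a sketch of such a loop; read_full is a hypothetical helper name, not a library function.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Keep calling read until count bytes have arrived, end of file is
 * reached, or an error occurs. Returns the number of bytes read
 * (less than count only at end of file), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n == 0)
            break;                    /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;             /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```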

4.4 write

#include <unistd.h>

ssize_t write (int fd, const void *buf, size_t count);

The write system call writes up to count bytes from the buffer pointed to by buf to the file identified by the descriptor, fd. If the file had been opened with the O_DSYNC or O_SYNC flag, write returns only after the data has been written to the disk, as explained under open, above. write returns the number of bytes actually written, which may be less than count. In case of error, -1 is returned and errno is set appropriately.
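Because write can return after writing only part of the buffer, a common pattern is to loop until everything has been written. A minimal sketch, with write_all as a hypothetical helper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Keep calling write until all count bytes have been written.
 * Returns count on success, or -1 on error. */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = write(fd, p + done, count - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;             /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t)n;            /* partial write: continue after it */
    }
    return (ssize_t)done;
}
```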

4.5 lseek

#include <sys/types.h>
#include <unistd.h>

off_t lseek (int fd, off_t offset, int whence);

lseek repositions the read/write file offset of the file identified by the descriptor, fd. The last parameter, whence, decides how offset is interpreted and can have one of the following three values. If whence is SEEK_SET, the file offset is set to offset bytes from the beginning of the file. If whence is SEEK_CUR, the file offset is set to the current position plus offset bytes. If whence is SEEK_END, the file offset is set to the length of the file plus offset bytes. For SEEK_CUR and SEEK_END, offset may be negative.

If lseek is successful, it returns the resulting offset value in bytes from the beginning of the file. If there is an error, (off_t) -1 is returned and the errno is set accordingly.
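One use of the return value is finding the length of a file: seeking to offset 0 relative to SEEK_END returns the file length. The helper below is a sketch under that idea; the name file_size_of is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the length of the file at path in bytes, or (off_t) -1 on
 * error. */
off_t file_size_of(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return (off_t)-1;

    /* Offset 0 from the end of the file is the file length. */
    off_t end = lseek(fd, 0, SEEK_END);

    close(fd);
    return end;
}
```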

4.6 close

#include <unistd.h>

int close (int fd);

The close system call closes the file descriptor fd, which ends I/O on the file through that descriptor. close returns 0 on success. In case of error, -1 is returned and errno is set appropriately.

4.7 unlink

#include <unistd.h>

int unlink (const char *pathname);

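The unlink system call removes the name pathname from the filesystem. If that name was the last link to the file and no process has the file open, the file itself is deleted. If some process still has the file open, the file remains in existence until the last descriptor referring to it is closed. unlink returns 0 on success. In case of error, -1 is returned and errno is set appropriately.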
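A file whose last name has been removed stays usable through any descriptor that is still open on it, and its storage is reclaimed only when that descriptor is closed. This is sometimes used for scratch files, as in the sketch below; scratch_roundtrip is a hypothetical helper and the path is supplied by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Create a scratch file, unlink it at once, and keep using it through
 * the open descriptor. Returns 0 if data written after the unlink can
 * be read back, -1 otherwise. */
int scratch_roundtrip(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;

    /* The name disappears, but the file lives on while fd is open. */
    if (unlink(path) == -1) {
        close(fd);
        return -1;
    }

    char buf[4];
    int ok = write(fd, "data", 4) == 4
          && lseek(fd, 0, SEEK_SET) == 0
          && read(fd, buf, sizeof buf) == 4
          && memcmp(buf, "data", 4) == 0;

    close(fd);   /* last reference is gone; the file is discarded */
    return ok ? 0 : -1;
}
```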

5.0 Synchronized I/O, revisited

These system calls help in ensuring that the data written to a file is actually written to the underlying filesystem.

5.1 sync

#include <unistd.h>

void sync (void);

The sync call causes all the updated file data and metadata to be written to the underlying filesystems. As per POSIX, sync only has to schedule the writing of data and can return before data is written. But Linux waits for the data to be written and only then sync returns.

5.2 syncfs

#include <unistd.h>

int syncfs (int fd);

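syncfs is just like sync, but applies only to the filesystem containing the file identified by the descriptor fd. It returns 0 on success and -1 on error. syncfs is Linux-specific, and glibc declares it only when _GNU_SOURCE is defined before the headers are included.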

5.3 fsync

#include <unistd.h>

int fsync (int fd);

fsync writes all updated data and metadata for the file identified by the descriptor fd to the underlying filesystem, and returns only after the write is complete. However, doing fsync on a file does not necessarily result in the directory entry for the file also getting written to the filesystem. For this, a separate fsync on a descriptor for the directory containing the file needs to be done.
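The two-step flush described above, first the file and then its directory, might look like the following sketch. The function write_durably and its parameters are hypothetical, and the caller is assumed to pass the directory that actually contains path.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write text to path, flush the file, then flush the directory dir
 * that contains it. Returns 0 on success, -1 on error. */
int write_durably(const char *dir, const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(text);
    /* For brevity, a short write is treated as a failure here. */
    if (write(fd, text, len) != (ssize_t)len || fsync(fd) == -1) {
        close(fd);
        return -1;
    }
    if (close(fd) == -1)
        return -1;

    /* Flush the directory so that the new entry itself is on disk. */
    int dfd = open(dir, O_RDONLY);
    if (dfd == -1)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

A call such as write_durably("/var/tmp", "/var/tmp/state.txt", "ok\n") would flush both the file and its directory entry; the paths here are only examples.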

5.4 fdatasync

#include <unistd.h>

int fdatasync (int fd);

fdatasync is similar to fsync, except that only the metadata required for accessing the file's data is written. For example, the file length is updated but not the file modification timestamp. fdatasync tries to minimize the I/O while flushing the updated file data to the underlying filesystem.
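One way to use fdatasync is to issue several ordinary writes and then flush once, instead of paying for a disk write on every call as with O_DSYNC. The sketch below assumes a hypothetical helper, write_batch, that takes an array of strings.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write n strings to path using ordinary (page cache) writes, then
 * flush the data once with fdatasync.
 * Returns 0 on success, -1 on error. */
int write_batch(const char *path, const char *const lines[], size_t n)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    int ok = 1;
    for (size_t i = 0; ok && i < n; i++) {
        size_t len = strlen(lines[i]);
        /* For brevity, a short write is treated as a failure here. */
        ok = (write(fd, lines[i], len) == (ssize_t)len);
    }

    /* One flush for the whole batch: data plus the metadata needed to
     * read it back, such as the file length. */
    if (ok && fdatasync(fd) == -1)
        ok = 0;

    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```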

6.0 truncate and ftruncate

6.1 truncate

#include <unistd.h>
#include <sys/types.h>

int truncate (const char *path, off_t length);

truncate sets the length of the file identified by path to exactly length bytes. If the file is longer than length bytes, it is cut short and the extra data is lost. If the file is shorter than length bytes, it is extended and the added bytes read as null bytes ('\0'). The file must be writable by the calling process. truncate returns 0 on success. In case of error, -1 is returned and errno is set appropriately.

6.2 ftruncate

#include <unistd.h>
#include <sys/types.h>

int ftruncate (int fd, off_t length);

ftruncate is similar to truncate, except that it operates on the file identified by the descriptor, fd. The file must have been opened for writing. The file offset is not changed by ftruncate.
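Both directions, shrinking and extending, can be seen in a short sketch. The helper truncate_roundtrip is hypothetical; it writes six bytes, cuts the file to three, extends it to eight, and checks that the added bytes read back as null bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Shrink and then extend a file with ftruncate and verify its contents.
 * Returns 0 if the file holds "abc" followed by five null bytes,
 * -1 otherwise. */
int truncate_roundtrip(const char *path)
{
    static const char expect[8] = { 'a', 'b', 'c', 0, 0, 0, 0, 0 };
    char buf[16];

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    int ok = write(fd, "abcdef", 6) == 6
          && ftruncate(fd, 3) == 0          /* shrink: "def" is discarded */
          && ftruncate(fd, 8) == 0          /* extend: new bytes are '\0' */
          && lseek(fd, 0, SEEK_SET) == 0    /* ftruncate left the offset at 6 */
          && read(fd, buf, sizeof buf) == 8
          && memcmp(buf, expect, 8) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```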