A Comparison of Solaris, Linux, and FreeBSD Kernels
by Max Bruning
October 14, 2005

I spend most of my time teaching classes on Solaris internals, device drivers, and kernel crash dump analysis and debugging. When explaining to classes how various subsystems are implemented in Solaris, students often ask, "How does it work in Linux?" or, "In FreeBSD, it works like this, how about Solaris?" This article examines three of the basic subsystems of the kernel and compares implementation between Solaris 10, Linux 2.6, and FreeBSD 5.3.

The three subsystems examined are scheduling, memory management, and file system architecture. I chose these subsystems because they are common to any operating system (not just Unix and Unix-like systems), and they tend to be the most well-understood components of the operating system.

This article does not go into in-depth details on any of the subsystems described. For that, refer to the source code, various websites, and books on the subject. For specific books, see:
If you search the Web for Linux, FreeBSD, and Solaris comparisons, most of the hits discuss old (in some cases, Solaris 2.5, Linux 2.2, etc.) versions of the OSes. Many of the "facts" are incorrect for the newest releases, and some were incorrect for the releases they intended to describe. Of course, most of them also make value judgments on the merits of the OSes in question, and there is little information comparing the kernels themselves. The following sites seem more or less up to date:
One of the more interesting aspects of the three OSes is the number of similarities between them. Once you get past the different naming conventions, each OS takes fairly similar paths toward implementing the different concepts. Each OS supports time-shared scheduling of threads, demand paging with a not-recently-used page replacement algorithm, and a virtual file system layer to allow the implementation of different file system architectures. Ideas that originate in one OS often find their way into others. For instance, Linux also uses the concepts behind Solaris's slab memory allocator. Much of the terminology seen in the FreeBSD source is also present in Solaris. With Sun's move to open source Solaris, I expect to see much more cross-fertilization of features. Currently, the LXR project provides a source cross-reference browser for FreeBSD, Linux, and other Unix-related OSes, available at fxr.watson.org. It would be great to see OpenSolaris source added to that site.

Scheduling and Schedulers

The basic unit of scheduling in Solaris is the

Scheduling decisions are based on priority. In Linux and FreeBSD, the lower the priority value, the better. This is an inversion; a value closer to 0 represents a higher priority. In Solaris, the higher the value, the higher the priority. Table 1 shows the priority values of the different OSes.
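As a rough illustration of the two priority conventions, the following Python sketch picks the favored priority under each one. The function name and the sample values are hypothetical; they are not taken from Table 1 or from any kernel source.

```python
def best_priority(priorities, higher_is_better):
    """Return the priority value a scheduler would favor.

    higher_is_better=True models the Solaris convention; False models the
    Linux/FreeBSD convention, where a value closer to 0 wins.
    """
    return max(priorities) if higher_is_better else min(priorities)


runnable = [5, 20, 60]  # made-up priority values, for illustration only
print(best_priority(runnable, higher_is_better=True))   # Solaris-style choice
print(best_priority(runnable, higher_is_better=False))  # Linux/FreeBSD-style choice
```

The same list of numbers yields opposite winners, which is why priority values cannot be compared across the OSes without knowing the convention in use.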
All three OSes favor interactive threads/processes. Interactive threads run at better priority than compute-bound threads, but tend to run for shorter time slices. Solaris, FreeBSD, and Linux all use a per-CPU "runqueue." FreeBSD and Linux use an "active" queue and an "expired" queue. Threads are scheduled in priority order from the active queue. A thread moves from the active queue to the expired queue when it uses up its time slice (and possibly at other times to avoid starvation). When the active queue is empty, the kernel swaps the active and expired queues. FreeBSD has a third queue for "idle" threads. Threads run on this queue only when the other two queues are empty. Solaris uses a "dispatch queue" per CPU. If a thread uses up its time slice, the kernel gives it a new priority and returns it to the dispatch queue.

The "runqueues" for all three OSes have separate linked lists of runnable threads for different priorities. (Though FreeBSD uses one list per four priorities, both Solaris and Linux use a separate list for each priority.) Linux and FreeBSD use an arithmetic calculation based on run time versus sleep time of a thread (as a measure of "interactive-ness") to arrive at a priority for the thread. Solaris performs a table lookup.

None of the three OSes support "gang scheduling." Rather than schedule n threads, each OS schedules, in effect, the next thread to run. All three OSes have mechanisms to take advantage of caching (warm affinity) and load balancing. For hyperthreaded CPUs, FreeBSD has a mechanism to help keep threads on the same CPU node (though possibly a different hyperthread). Solaris has a similar mechanism, but it is under control of the user and application, and is not restricted to hyperthreads (called "processor sets" in Solaris and "processor groups" in FreeBSD).

One of the big differences between Solaris and the other two OSes is the
capability to support multiple "scheduling classes" on the system at the same
time. All three OSes support POSIX

The ability to add new scheduling classes to the system comes with a price. Every place in the kernel where a scheduling decision can be made (except for the actual act of choosing the thread to run) involves an indirect function call into scheduling-class-specific code. For instance, when a thread is going to sleep, it calls scheduling-class-dependent code that does whatever is necessary for sleeping in the class. On Linux and FreeBSD, the scheduling code simply does the needed action. There is no need for an indirect call. The extra layer means there is slightly more overhead for scheduling on Solaris (but more features).

Memory Management and Paging

In Solaris, every process has an "address space" made up of logical section
divisions called "segments." The segments of a process address space are
viewable via

Linux divides machine-dependent layers from machine-independent layers at a much higher level in the software. On Solaris and FreeBSD, much of the code dealing with, for instance, page fault handling is machine-independent. On Linux, the code to handle page faults is pretty much machine-dependent from the beginning of the fault handling. A consequence of this is that Linux can handle much of the paging code more quickly because there is less data abstraction (layering) in the code. However, the cost is that a change in the underlying hardware or model requires more changes to the code. Solaris and FreeBSD isolate such changes to the HAT and pmap layers respectively.

Segments, regions, and memory areas are delimited by:
For instance, the text of a program is in a segment/region/memory area. The mechanisms in the three OSes to manage address spaces are very similar, but the names of data structures are completely different. Again, more of the Linux code is machine-dependent than is true of the other two OSes.

Paging

All three operating systems use a variation of a least recently used
algorithm for page stealing/replacement. All three have a daemon process/thread
to do page replacement. On FreeBSD, the

FreeBSD has several page lists for keeping track of recently used pages.
These track "active," "inactive," "cached," and "free" pages. Pages move
between these linked lists depending on their uses. Frequently accessed pages
will tend to stay on the active list. Data pages of a process that exits can be
immediately placed on the free list. FreeBSD may swap entire processes out if
Linux also uses different linked lists of pages to facilitate an LRU-style algorithm. Linux divides physical memory into (possibly multiple sets of) three "zones": one for DMA pages, one for normal pages, and one for dynamically allocated memory. These zones seem to be very much an implementation detail caused by x86 architectural constraints. Pages move between "hot," "cold," and "free" lists. Movement between the lists is very similar to the mechanism on FreeBSD. Frequently accessed pages will be on the "hot" list. Free pages will be on the "cold" or "free" list.

Solaris uses a free list, hashed list, and vnode page list to maintain its variation of an LRU replacement algorithm. Instead of scanning the vnode or hash page lists (more or less the equivalent of the "active"/"hot" lists in the FreeBSD/Linux implementations), Solaris scans all pages using a "two-handed clock" algorithm as described in Solaris Internals and elsewhere. The two hands stay a fixed distance apart. The front hand ages the page by clearing reference bit(s) for the page. If no process has referenced the page since the front hand visited the page, the back hand will free the page (first asynchronously writing the page to disk if it is modified).

All three operating systems take NUMA locality into account during paging.
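The two-handed clock scan described for Solaris can be sketched roughly as follows. This is a simplified Python model, not Solaris code: `Page`, `clock_scan`, `hand_gap`, and the `touch` schedule are illustrative names, physical memory is reduced to a small circular list, and the asynchronous write to disk is reduced to clearing a flag.

```python
from dataclasses import dataclass


@dataclass
class Page:
    referenced: bool = False  # set when some process touches the page
    modified: bool = False    # dirty pages are written out before being freed
    free: bool = False


def clock_scan(pages, hand_gap, steps, touch=None):
    """Advance both hands `steps` times and return the indices of freed pages.

    `touch` optionally maps a step number to the page indices referenced
    during that step (a stand-in for processes using memory).
    """
    touch = touch or {}
    n = len(pages)
    freed = []
    front = 0
    for step in range(steps):
        for idx in touch.get(step, ()):
            pages[idx].referenced = True
        # Front hand: age the page by clearing its reference bit.
        pages[front].referenced = False
        # Back hand: trails the front hand by a fixed distance.
        if step >= hand_gap:
            back = (front - hand_gap) % n
            page = pages[back]
            if not page.referenced and not page.free:
                if page.modified:
                    page.modified = False  # stand-in for writing the page to disk
                page.free = True
                freed.append(back)
        front = (front + 1) % n
    return freed


pages = [Page() for _ in range(8)]
pages[3].modified = True
# Pages 0 and 2 are referenced after the front hand passes them, so they survive.
print(clock_scan(pages, hand_gap=2, steps=8, touch={1: [0], 3: [2]}))
```

The distance between the hands sets how long a page has to be referenced again before it becomes a candidate for freeing; a larger gap gives pages more time.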
The I/O buffer cache and the virtual memory page cache are merged into one
system page cache on all three OSes. The system page cache is used for
reads/writes of files as well as

File Systems

All three operating systems use a data abstraction layer to hide file system
implementation details from applications. In all three OSes, you use
VFS allows the implementation of many file system types on the system. This means that there is no reason that one of these operating systems could not access the file systems of the other OSes. Of course, this requires the relevant file system routines and data structures to be ported to the VFS of the OS in question. All three OSes allow the stacking of file systems. Table 2 lists file system types implemented in each OS, but it does not show all file system types.
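The general idea of a VFS layer, including stacking, can be sketched as below. This is a toy Python model under stated assumptions: `FileSystem`, `MemFS`, `UpperFS`, and `VFS` are hypothetical names, only a `read` operation is modeled, and none of it mirrors the actual kernel interfaces of Solaris, Linux, or FreeBSD.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Operations every file system type must supply to the generic layer."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


class MemFS(FileSystem):
    """A toy file system type that keeps file contents in a dictionary."""

    def __init__(self, files):
        self._files = dict(files)

    def read(self, path):
        return self._files[path]


class UpperFS(FileSystem):
    """A stacked file system: it filters the results of the layer below."""

    def __init__(self, lower):
        self._lower = lower

    def read(self, path):
        return self._lower.read(path).upper()


class VFS:
    """Generic layer: callers use one interface, whatever is mounted."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, fs):
        self._mounts[mount_point.rstrip("/") or "/"] = fs

    def read(self, path):
        # Choose the longest mount point that is a prefix of the path.
        matches = [m for m in self._mounts
                   if path == m or path.startswith(m.rstrip("/") + "/")]
        if not matches:
            raise FileNotFoundError(path)
        best = max(matches, key=len)
        return self._mounts[best].read(path[len(best):].lstrip("/"))


vfs = VFS()
vfs.mount("/", MemFS({"etc/motd": b"hello"}))
vfs.mount("/mnt", UpperFS(MemFS({"a.txt": b"data"})))
print(vfs.read("/etc/motd"))
print(vfs.read("/mnt/a.txt"))
```

Porting a file system to another OS corresponds, in this model, to writing a new `FileSystem` subclass; the callers of `VFS.read` do not change.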
Conclusions

Solaris, FreeBSD, and Linux are obviously benefiting from each other. With Solaris going open source, I expect this to continue at a faster rate. My impression is that change is most rapid in Linux. The benefit of this is that new technology is incorporated into the system quickly. Unfortunately, the documentation (and possibly some robustness) sometimes lags behind. Linux has many developers, and sometimes it shows. FreeBSD has been around (in some sense) the longest of the three systems. Solaris has its basis in a combination of BSD Unix and AT&T Bell Labs Unix. Solaris uses more data abstraction layering, and generally could support additional features quite easily because of this. However, most of the layering in the kernel is undocumented. Probably, source code access will change this.

A brief example to highlight differences is page fault handling. In
Solaris, when a page fault occurs, the code starts in a platform-specific trap
handler, then calls a generic

Kernel visibility and debugging tools are critical to get a correct
understanding of system behavior. Yes, you can read the source code, but I
maintain that you can easily misread the code. Having tools available to test
your hypothesis about how the code works is invaluable. In this respect, I see
Solaris with

Max Bruning currently teaches and consults on Solaris internals, device drivers, kernel (as well as application) crash analysis and debugging, networking internals, and specialized topics. Contact him at max at bruningsystems dot com or http://mbruning.blogspot.com/.