1. Introduction
This project is delivering two sets of functionality:
- The BrandZ infrastructure, which enables the creation of zones that
provide alternate operating environment personalities, or brands
on a Solaris(tm) 10 system.
- lx, a brand that supports the execution of Linux applications
This document describes the design and implementations of both BrandZ
and the lx brand.
2. BrandZ Infrastructure
2.1. Zones Integration
It is a goal of this project that all brand management should be
performed as simple extensions to the current zones infrastructure:
- The zones configuration utility will be extended to configure zones of
particular brands.
- The zone administration tools will be extended to
display the brand of any created zone, as well as to list the brand
types available.
BrandZ adds functionality to the zones infrastructure by allowing zones
to be assigned a brand type. This type is used to determine which scripts
are executed when a zone is installed and booted. In addition, a zone's
brand is used to properly identify the correct application type at
application launch time. All zones have an associated brand for
configuration purposes; the default is the 'native' brand.
2.1.1 Configuration
The zonecfg(1M) utility (as well as
libzonecfg) is modified to be brand-aware. There is a new
option to the 'create' command that allows a zone to be created with a
specific brand:
zonecfg:myzone> create -B lx
Once a zone has been assigned a brand, that brand cannot be changed or
removed.
The rest of the user interface for the zone configuration process
remains the same. To support this, each brand delivers a configuration
file at /usr/lib/brand/<name>/config.xml. The file
describes how to install, configure, and boot the zone. It also identifies
the name of the kernel module (if any) that provides kernel-level
functionality for the brand. A sample of the file follows:
<!DOCTYPE brand PUBLIC "-//Sun Microsystems Inc//DTD Brands//EN"
"file:///usr/share/lib/xml/dtd/brand.dtd.1">
<brand name="lx">
<install>/usr/lib/brand/lx/lx_install %z %R %*</install>
<boot>/usr/lib/brand/lx/lx_boot %R %z</boot>
<halt>/usr/lib/brand/lx/lx_halt %z</halt>
<postclone>/usr/bin/true</postclone>
<modname>lx_brand</modname>
<initname>/sbin/init</initname>
<platform name="platform.xml" />
</brand>
2.1.2 Installation
The current zone install mechanism is hardwired to execute
lucreatezone to install a zone. To support arbitrary
installation methods, the config.xml file contains a line of
the form:
<install>/usr/lib/lu/lucreatezone -z %z</install>
For the lx brand, this line will refer to
/usr/lib/brand/lx/lx_install instead of
lucreatezone. All of the command tags support the following
substitutions '%z' (for zonename) and '%R' (for zone root path). In
addition, the zoneadm(1M) command will pass additional arguments to the
program as specified on the command line. This will allow runtime
installation arguments, such as the location of source media:
# zoneadm -z linux install -d /net/installsrv/redhat30
2.1.3 Virtual Platform
The virtual platform consists of internal mounted filesystems, as well
as the device tree and any network devices. The virtual platform is
controlled by an XML file that is separate from the main brand config file,
config.xml. The properties it controls are outlined below.
2.1.3.1 Mounted Filesystems
Before handing off control to the virtual platform,
zoneadmd does some work of its own to set up well-known mounts
and do some basic verification. It performs Solaris-specific mounts
(/proc, /system/contract, etc.), which need to be
defined for each brand. We also need to mount native versions of some
files in order to support the interpretation layer (for native access to
/proc, for example).
The set of filesystems mounted by zoneadmd are specified
in the virtual platform configuration file:
<filesystem spec="/proc" dir="/native/proc" fstype="proc" />
<filesystem spec="/usr" dir="/native/usr" fstype="lofs" />
2.1.3.2 Device
Configuration
To create device nodes within a zone, zoneadmd calls
devfsadm. The devfsadm process walks the current list of
devices in /dev, and calls into libzonecfg to
determine if the device should be exported.
libzonecfg will consult a brand-specific platform configuration file
that describes how to apply the following semantics:
- Policy restrictions: Control which devices appear in the non-global
zone. This is the only aspect currently implemented. This is currently
hardcoded in
libzonecfg.
- Device inheritance: Describes a device which matches exactly a device
in the global zone. Currently this is assumed behavior, but it needs
to be configurable.
- Device renaming: Allows devices to appear in diffent locations within
the zone. This is needed to support devices that are emulated via
layered drivers.
The platform configuration file will have elements to perform all the
above tasks. A sample scheme could be:
<!-- Devices to create under /dev -->
<device match="pts/*" />
<device match="ptmx" />
<device match="random" />
<device match="urandom" />
<device match="zero" />
<device match="null" />
<device match="tty" />
<device match="tcp" />
<device match="tcp6" />
<device match="udp" />
<device match="udp6" />
<!-- Symlinks to create under /dev -->
<symlink source="stderr" target="./fd/2" />
<symlink source="stdin" target="./fd/0" />
<symlink source="stdout" target="./fd/1" />
<symlink source="systty" target="zconsole" />
<symlink source="log" target="/var/run/syslog" />
<!-- Renamed devices to create under /dev -->
<device match="brand/lx/ptmx_linux" name="ptmx" />
<!--
/dev/console can't be a symlink because this breaks
login security checks via /dev/securetty
-->
<device match="zconsole" name="console" />
2.1.4 Booting a Branded Zone
In addition to the install-time command attribute, the configuration
file can also provide an optional 'boot' script. This script will be run
after the zone has been brought to the 'Ready' state - immediately before
the zone is booted. This allows a brand to perform any needed setup tasks
after the virtual platform has been established, but before the first
branded process has been launched, giving the brand complete flexibility in
how the zone is brought online.
2.1.5 Spawning init
The init(1M) binary is spawned by the kernel, allowing the
kernel to control its restart. The name and path of a zone's init(1M)
binary is a per-zone attribute, which is stored in the brand configuration
file and passed to the kernel by zoneadm immediately prior to booting the
zone.
Linux does not allow kill(2) to send any signals to
init(1M) for which a signal handler has not been registered
(including SIGSTOP and SIGKILL, as by definition handlers
cannot be registered for those signals.) We solve this by having
lx_kill() make the same check and only allow a signal to be sent to the
Linux init process if a handler for the signal has already been
registered. The Linux man pages are silent on whether the Linux
kernel can send init an unhandled signal, so we do not
perform any particular checks to prevent the Solaris kernel from doing so.
2.1.6 Shutdown
The standard sysvinit Linux init package operates in much
the same way as our init did prior to the introduction of
smf(5). The 'init N' form of the command writes to the init
FIFO, which happens to be /dev/initctl. The 'lx' brand's
boot-time script creates this link from the global zone before the zone
boots up.
The init command then does a kill(1, SIGHUP)
to notify the running init process that there has been a change of
runlevel. We interpose on the kill(2) syscall and translate
PID 1 into whatever the zone-local PID is.
To actually power off or
reboot the system,
init calls the
reboot()
function with 'magic numbers' defined in
sys/reboot.h. This
function translates directly to the
reboot system call. This
call, and the associated magic numbers, will be translated into an
appropriate
uadmin() call.
The Linux simpleinit program works in a nearly identical
fashion. The main differences are in how it processes /etc/rc
scripts.
2.2. Kernel Integration
2.2.1 Brand Framework
Brands are loaded into the system in the form of brand modules. A
brand module is defined through a new linkage type:
struct modlbrand {
struct mod_ops *brand_modops;
char *brand_linkinfo;
struct brand *brand_branddef;
};
Each brand must declare struct brand as part of the
registration process.
typedef struct brand {
int b_version;
char *b_name;
struct brand_ops *b_ops;
struct brand_mach_ops *b_machops;
} brand_t;
This structure declares the name of the brand, the version of the brand
interface against which it was compiled, and ops vectors containing both
common and platform-dependent functionality. The b_version
field is currently used to determine whether a brand is eligible to be
loaded into the system. If the version number does not match the one
compiled into the kernel, then we simply refuse to load the brand module.
In theory, this version number could also be used to interpret the contents
of the two ops vectors, allowing us to continue supporting older brand
modules.
It is important to note that even though we have defined a linkage type
for brands and have implemented a versioning mechanism, we are not defining
a formal ABI for brands. The relationship between brands and the kernel is
so intimate that we cannot hope to properly support the development of
brands outside the ON consolidation. This does not mean that we will do
anything to prevent the development of brands outside of ON, but we must
minimize the possibility of an out-of-date brand accidentally damaging
something within the kernel.
2.2.2 System Call Interposition
The system call invocation mechanism is an implementation detail of
both the operating system and the hardware architecture. On SPARC
machines, a system call is initiated via a trap (Solaris chooses to use
trap numbers 8 and 9.) On x86 machines, there are a variety of different
methods available to enter the kernel: sysenter and
syscall instructions, lcalls, and
software-triggered interrupts. Solaris has used all of these mechanisms at
various points, and maintaining binary compatibility requires that we
continue to support them all.
Supporting a different version of Solaris requires interposing on each
of these mechanisms. Before we begin executing the system call support
code for the native version of Solaris, we must ensure that the process
does not belong to a brand that has a different implementation of those
calls. To do that, we must check the proc structure of the initiating
process to determine whether it belongs to a foreign brand, and if so,
whether that brand has chosen to interpose on that entry point. This check
is carried out by a BRAND_CALLBACK() macro placed at the top
of the handling routine for each of the existing entry points. If the
brand wishes to interpose on the entry point, control is passed to that
brand's kernel module. If not, control returns to the handler and the
standard Solaris system call code is executed.
Linux has a single means of entering the kernel: executing an 'int 80'
instruction. Since this mechanism is not used by any version of Solaris,
there is no existing mechanism on which we can interpose. Therefore the
creation of a Linux brand requires that we provide a new entry point into
the kernel specifically for this brand. As with the existing entry points,
the BRAND_CALLBACK() macro is executed by this handler. In this case,
there is no standard system call code to execute if the handler returns.
Instead the program is killed with a General Protection Fault.
While introducing a new entry point is a trivial task within the
Solaris engineering organization, it again demonstrates that the ability
for third parties to distribute radically new brands will be limited. A
third party may certainly distribute a modified version OpenSolaris that
includes the changes they need, but any brands that depend on those changes
would not work properly with our Solaris until/unless those changes were
adopted by Sun.
For performance reasons, the BRAND_CALLBACK() macro is
only invoked for branded processes. Each system call mechanism actually
has two different entry points defined: one for branded processes and one
for non-branded processes. A context handler is used to modify the
appropriate CPU registers or structures to switch to the appropriate set of
entry points during a context switch operation. The callback macro is only
a dozen or so instructions long, but this switching mechanism ensures that
non-branded processes are not subject to even that minimal overhead.
This interposition mechanism is executed before any other code in the
system call path. Specifically, it is executed before the standard system
call prologue. In a sense, this means that we are executing in kernel
mode, but don't consider ourselves to have entered the kernel. This
may sound esoteric, but it has concrete implications for the brand-specific
code. The brand code cannot take any locks, cannot do any I/O, or do
anything else that might cause the thread to block. This means that the
brand code can only do the simplest computations or transformations before
returning to userspace or to the standard system call flow.
As described below, the 'lx' brand returns immediately to userspace,
where the bulk of the emulation takes place. To implement a brand in the
kernel, the interposition routine could transform the incoming system call
into a brand() system call and return to the normal system call path.
After executing the standard system call prologue, control would be
vectored to the brand's xxx_brandsys() routine, where the emulation could
be carried out.
2.2.3 Other
Interposition Points
Other interposition points will be placed in the code paths for process
creation/exit, thread creation/teardown, signal delivery, and so on.
The interposition points are identified by means of a pair of ops
vectors (one generic and one platform-specific), similar to those used by
VFS for filesystems and the VM subsystem for segments. The generic ops
vector used by BrandZ is shown below:
struct brand_ops {
int (*b_brandsys)(int, int64_t *, uintptr_t, uintptr_t,
uintptr_t, uintptr_t, uintptr_t, uintptr_t);
void (*b_setbrand)(struct proc *);
void (*b_copy_procdata)(struct proc *, struct proc *);
void (*b_proc_exit)(struct proc *, klwp_t *);
void (*b_exec)();
void (*b_lwp_setrval)(klwp_t *, int, int);
void (*b_shmexit)(struct proc *);
int (*b_initlwp)(klwp_t *);
void (*b_forklwp)(klwp_t *, klwp_t *);
void (*b_freelwp)(klwp_t *);
void (*b_lwpexit)(klwp_t *);
int (*b_elfexec)(struct vnode *vp, struct execa *uap,
struct uarg *args, struct intpdata *idata, int level,
long *execsz, int setid, caddr_t exec_file,
struct cred *cred);
};
A brief description of each entry follows:
-
b_brandsys: Routine that implements a per-brand
'brandsys' system call that can be used for any brand-specific
functionality.
-
b_setbrand: Used to associate a new brand type with a
process. This is called when the zone's init process is hand-crafted, or
when a process uses zone_enter() to enter a branded zone.
-
b_copy_procdata: Copies per-process brand data at fork
time.
-
b_proc_exit: Called at process exit time to free any
brand-specific process state.
-
b_exec: Called at process exec time to initialize any
brand-specific process state.
-
b_lwp_setrval: set the syscall() return value for the
newly created lwp before the creating fork() system call returns.
-
b_shmexit: Called when a shared memory segment is
detached from an exiting branded process.
-
b_initlwp: Called by lwp_create() to
initialize any brand-specific per-lwp data.
-
b_forklwp: Called by forklwp() to copy any
brand-specific per-lwp data from the parent to child lwps.
-
b_freelwp: Called by lwp_create() on error
to do any brand-specific cleanup.
-
b_lwpexit: Called by lwp_exit() to do any
brand-specific cleanup.
-
b_elfexec: Called to load a branded executable.
The x86-specific ops vector is:
struct brand_mach_ops {
void (*b_sysenter)(void);
void (*b_int80)(void);
void (*b_int91)(void);
void (*b_syscall)(void);
void (*b_syscall32)(void);
greg_t (*b_fixsegreg)(greg_t, model_t);
};
The first 5 entries of this vector allow a brand to override the
standard system call paths with their own interpretations. The final entry
protects Solaris from brands that make different use of the segment
registers in userspace, and vice-versa.
The SPARC-specific ops vector is:
struct brand_mach_ops {
void (*b_syscall)(void);
void (*b_fasttrap)(void);
};
Of these routines, only the int80 entry of the x86 vector
is needed for the initial lx brand. The other entries are
included for completeness and are only used by a trivial 'Solaris 10' brand
used for basic testing on SPARC platforms.
These ops vectors are sufficient for the initial Linux brand. Adding
support for a new Linux distribution based on the 2.4 kernel can probably
also be done without modifying this interface. However, it is likely that
adding any new brand that is substantially different from the initial Linux
brand will require additional interposition points. For example, adding
support for a 2.6-based Linux distribution could require modifications.
Supporting a whole new operating system such as FreeBSD or Apple's OS X
would almost certainly require modifying this interface.
3. The lx Brand
3.1 Brands and Distributions
The lx Brand is intended to emulate the kernel interfaces
expected by the Red Hat Enterprise Linux 3 userspace. The freely available
CentOS 3 distribution is built from the same SRPMs as RHEL, so it is
expected to work as well.
The interface between the kernel and userspace is largely encapsulated
by the version of glibc that ships with the distribution. As
such, the interface we emulate will likely support other distributions that
make use of glibc 2.3.2. Debian 3.1 also uses this version of glibc, so
adding support for that distro to lx should be
straightforward.
The further removed one gets from that version of glibc, the
less likely it is that the current lx brand will be able to
support that distro. For example, RHEL 4 is based on glibc 2.4.7, which
represents a significant change to glibc. While adding support for RHEL 4
would likely require additional development work, we expect that the work
could be done within the scope of the lx brand. In the unlikely
event that the lx brand could not be extended to support RHEL 4,
we would introduce a new lx4 brand.
Finally, it should be noted that supporting a new distribution will
always require a new install script. For RPM-based distributions, it might
be sufficient to update the existing scripts with new package lists.
Adding support for distributions such as Debian, which do not use the RPM
packaging format, will require entirely new install scripts to be created.
It is relatively simple to have a variety of different install scripts
within a single brand, so simply changing the packaging format does not
require the creation of a new brand.
3.2 Installation of a Linux zone
Most Linux distributions' native installation procedures start by
probing and configuring the hardware on the system and partitioning the
hard drives. The user then selects which packages to install and sets
configuration variables such as the hostname and IP address. Finally the
installer lays out the filesystems, installs the packages, modifies some
configuration files, and reboots.
When installing a Linux zone, we can obviously skip the device probing
and disk partitioning. For the remainder of the installation procedure,
there are several different approaches we could take.
- One approach is to execute a distribution's installation tool
directly. Most of them are based on shell scripts, Python, or some
other interpreted language, so we could theoretically run the tools
before having a Linux environment running. This approach relies on
the existing install tools being fairly robust and would be hard to
sustain between releases if the tools change.
- Another option is to develop our own package installation tool, which
extracts the desired software from a standard set of distribution
media and copies it into the zone.
- A third option would be to adopt a flasharchive-like approach, in
which we would simply unpack a prebuilt tarball or cpio image into the
target filesystem. This image could be built by us, or by a customer
that already had an installed Linux system. If we were to build this
image ourselves, this approach would give us complete control over the
final image and and allow us to manually handle any particularly ugly
early-stage installation issues. This would also make it trivial for
a customer to "just try it out," an approach that has been successful
for VMware, Usermode Linux, and QEMU.
It is our intention to support the last two options.
We will provide an installation script that will extract a known set of
RPMs from a known set of Red Hat or CentOS installation media. We will
also allow a zone to be installed from a user-specified tarball. We will
document how a user can construct an installable tarball that can be used
for flasharchive-like installations.
3.3. Execution of Linux Binaries
When executing Linux binaries, we follow the same architectural
approach taken by the SunOS(tm) Binary Compatibility Project (SBCP), which
provides support for pre-Solaris 2.0 binaries.
3.3.1. Loading Linux Binaries
When a non-native ELF binary is execed inside a branded zone, the
brand's exec handler is given control. The lx brand-specific
exec handler execs the brand's Solaris support library (akin to the 'sbcp'
command) and maps the non-native ELF executable and its interpreter into
the address space.
The handler also places all the extra information needed to exec the
Linux binary on the stack in the form of aux vector entries. Specifically,
the handler passes the following values to the support library via the aux
vector:
AT_SUN_BRAND_BASE: Base address of non-native interpeter
AT_SUN_BRAND_LDDATA: Address of non-native interpreter's debug data
AT_SUN_BRAND_LDENTRY: Non-native interpreter's entry point
AT_SUN_BRAND_PHDR: Non-native executable ELF information needed by
AT_SUN_BRAND_PHENT: non-native interpreter
AT_SUN_BRAND_PHNUM:
AT_SUN_BRAND_ENTRY: Entry point of non-native executable
3.3.2. Running Linux Binaries
Rather than executing the Linux application directly, the exec handler
starts the program execution in the brand support library. The library
then runs whatever initialization code it needs before starting the Linux
application.
For its initialzation, the lx brand library uses the
brandsys(2) system call to pass the following data structure
to the kernel:
typedef struct lx_brand_registration {
uint_t lxbr_version; /* version number */
void *lxbr_handler; /* base address of handler */
void *lxbr_traceflag; /* address of trace flag */
} lx_brand_registration_t;
This structure contains a version number, ensuring that the brand
library and the brand kernel module are compatibile with one another. It
also contains two addresses in the application's address space: the
starting address of the system call emulation code and the address of a
flag indicating whether the process is being traced. The use of these two
addresses are discussed in sections 3.4 and 3.9 respectively.
Once the support library has finished initialization, it fixes up the
aux vector for the Linux interpreter to run, and jumps to the interpreter's
entry point. The brand's exec handler replaces the standard aux vector
entries with the corresponding values above, clears the above vectors (by
setting their type to AT_IGNORE), resets the stack to its
pre-Solaris-linker state, and then jumps to the non-native interpreter
which then runs the executable as it would on its own native system.
The advantages of this design are:
- No modifications to the Solaris runtime linker are necessary.
- Virtually all knowledge of branding is isolated to the kernel brand
module and userland brand support library.
- It keeps us aligned with the de-facto standard for non-native emulation
established with SunOS 4 BCP.
3.4. System Call Emulation
The most common method for applications to interact with the operating
system is through system calls. Therefore, the bulk of the work required
to support the execution of foreign binaries is to emulate the system call
interface to which those binaries have been written.
Linux applications do not use the syscall/sysenter
instructions. Instead, they use interrupt #80 to initiate a system call.
Because Solaris has no current handler for that interrupt, one had to be
added as part of this project. As noted above, the handler in the core
Solaris code does nothing but pass control to the int80_handler routine in
the brand module. It is then up to that brand to interpret and execute the
system call.
As with executable launching, the approach we have chosen to take is
the one originally implemented by the SBCP. In this model, the trap
handler in the kernel is nothing more than a trampoline, redirecting the
execution of the system call to a library in userspace. The library
performs any necessary data mangling and then calls the proper native
system call. Ideally, this method would require almost no code in the
kernel and would have no impact whatsoever on the behavior of a system with
no SunOS binaries running.
In practice, the user-space approach turns out to be less clean and
self-contained for Linux than for the original SunOS project. In that
case, the binary model being emulated was simpler than Solaris, less fully
featured, and still closely related. In the Linux case, there are system
calls that must be supported that do not have obvious equivalents in
Solaris (e.g., futex()) and there are differences in fundamental OS
abstractions (Linux 'threads' are almost full blown processes, each with
its own PID).
The steps involved in emulating a fairly straightforward Linux system
call are as follows:
- The Linux application marshalls parameters into registers and issues
an
int80
- The Solaris int80 handler checks to see if the process is branded. An
unbranded process will continue along the standard code path. For
int80, there is no standard behavior so the process would die
with a General Protection fault. Thus, a Solaris application cannot
successfully execute any Linux system calls.
- Solaris passes control to the brand-specific routine indicated in the
brandops structure.
- The lx brand module immediately trampolines into the
user-space emulation library.
- The emulation library does any necessary argument manipulation and
calls the appropriate Solaris system call(s).
- Solaris carries out the system call and returns to the brand library.
- The brand library performs any necessary manipulation of the return
values and error code.
- The brand library returns directly to the Linux application; it does
not return through the kernel.
The diagram below illustrates these steps:
Each Linux system call can be divided into one of three types:
pass-through, simple emulation, and complex emulation.
3.4.1 Pass-Through
A pass-through call is one that requires no data transformation and for
which the Solaris 10 semantics match those of the Linux system call. These
can be implemented in userland by immediately calling the equivalent system
call.
For example:
int
lx_read(int fd, void *buf, size_t bytes)
{
int rval;
rval = read(fd, buf, bytes);
return (rval < 0 ? -errno : rval);
}
Other examples of pass-through calls are close(), write(), mkdir(), and
munmap().
Although the arguments to the system call are identical, the method for
returning an error to the caller differs between Solaris and Linux. In
Solaris, the system call returns -1 and the error number is stored in the
thread-specific variable errno. In Linux, the error number is
returned as part of the rval.
There are also differences in the error numbers between Solaris and
Linux. The lx_read() routine is called by
lx_emulate(), which handles the translation between Linux and
Solaris error codes for all system calls.
3.4.2 Simple Emulation
One step up in complexity is a simple emulated system call. This is a
call where either the original arguments and/or return value require some
degree of simple transformation from the Solaris equivalent. Simple
transformations include changing data types or the moving of values into a
structure. These calls can be built entirely from standard Solaris system
calls and userland transformations.
For example:
int
lx_uname(uintptr_t p1)
{
struct lx_utsname *un = (struct lx_utsname *)p1;
char buf[LX_SYS_UTS_LN + 1];
strlcpy(un->sysname, LX_SYSNAME, LX_SYS_UTS_LN);
strlcpy(un->release, lx_release, LX_SYS_UTS_LN);
strlcpy(un->version, LX_VERSION, LX_SYS_UTS_LN);
strlcpy(un->machine, LX_MACHINE, LX_SYS_UTS_LN);
gethostname(un->nodename, sizeof (un->nodename));
if ((sysinfo(SI_SRPC_DOMAIN, buf, LX_SYS_UTS_LN) < 0))
un->domainname[0] = '\0';
else
strlcpy(un->domainname, buf, LX_SYS_UTS_LN);
return (0);
}
Other examples of simple emulated calls are stat(), mlock(), and
getdents().
3.4.3 Complex Emulation
The calls requiring the most in-kernel support are the complex emulated
system calls. These are calls that:
- Require significant transformation to input arguments or return values
- Are partially or wholly unique within the lx brand implementation
- Possibly require a new system call that has no underlying Solaris
counterpart
Some examples of complex emulated calls are clone(),
sigaction(), and futex(). The implementation of
the clone() system call is described below.
3.5 Other Issues
3.5.1 Linux Threading
Linux implements threads via the clone() system call.
Among the arguments to the call is a set of flags, four of which determine
the level of sharing within the address space: CLONE_VM, CLONE_FS,
CLONE_FILES, and CLONE_SIGHAND. When all four flags are clear, the clone
is equivalent to a fork; when they are all set, it is the equivalent to
creating another lwp in the address space. Any other combination of flags
reflects a thread/process construct that does not match any existing
Solaris model. Since these other combinations are rarely, if ever,
encountered on a system, this project will not be adding the abstractions
necessary to support them.
The following table lists all of the flags for the clone(2)
system call, and whether the 'lx' brand supports them. If an applications
uses an unsupported flag, or combination of flags, a detailed error message
is emitted and ENOTSUP is returned.
| Flag |
Supported? |
| CLONE_VM | Yes |
| CLONE_FS | Yes |
| CLONE_FILES | Yes |
| CLONE_SIGHAND | Yes |
| CLONE_PID | Yes |
| CLONE_PTRACE | Yes |
| CLONE_PARENT | Partial. Not
supported for fork()-style clone() operations. |
| CLONE_THREAD | Yes |
| CLONE_SYSVSEM | Yes |
| CLONE_SETTLS | Yes |
| CLONE_PARENT_SETTID | Yes |
| CLONE_CHILD_CLEARTID | Yes |
| CLONE_DETACH | Yes |
| CLONE_CHILD_SETTID | Yes |
When an application uses clone(2) to fork a new process,
the lx_clone() routine simply calls fork1(2).
When an application uses clone(2) to create a new thread, we
call the thr_create(3C) routine in the Solaris
libc.
The Linux application provides the address of a function at which the
new thread should begin executing as an argument to the system call.
However, the Linux kernel does not actually start execution at that
address. Instead, the kernel essentially does a fork(2) of a
new thread which, like a forked process, starts with exactly the same state
as the parent thread. As a result, the new thread starts executing in the
middle of the clone(2) system call, and it is the
glibc wrapper that causes it to jump to the user-specified
address.
This Linux implementation detail means that when we call
thr_create(3C) to create our new thread, we cannot provide the
user's start address to that routine. Instead, all new Linux threads begin
by executing a routine that we provide, called clone_start().
This routine does some final initialization, notifies the brand's kernel
module that we have created a new Linux thread, and then returns to
glibc.
A by-product of threads implementation in Linux is that every thread
has a unique PID. To mimic this behavior in the lx brand, every
thread created by a Linux binary reserves a PID from the PID list. This
reservation is performed as part of the clone_start() routine.
This reserved PID is never seen by Solaris processes, but it is used by
Linux processes. When a Linux thread calls getpid(2), it is
returned the standard Solaris PID of process. When it calls
gettid(2), it is returned the PID that was reserved at thread
creation time. Similarly, kill(2) sends a signal to the
entire process represented by the supplied PID, while tkill(2)
sends a signal to the specific thread represented by the supplied PID.
The Linux thread model supported by modern RedHat systems is provided by
the Native Posix Threads Library (NPTL). NPTL uses three consecutive
descriptor entries in the Global Descriptor Table (GDT) to manage thread
local storage. One of the arguments to the clone() is an
optional descriptor entry for TLS. More commonly used is the
set_thread_area() system call, which takes a descriptor as an
argument and returns the entry number in the GDT in which it has been
stored. The NPTL then uses this to initialize the %gs
register. The descriptors are per thread, so they have to be stored in per
thread storage and the GDT entries must be re-initialized on context
switch. This is done via a restore ctx operation.
Since both NPTL and the Solaris libc rely on
%gs to access per-thread data, we have added code to
virtualize its usage. The first thing our user-space emulation library
does is:
/*
* Save the Linux libc's %gs and switch to the Solaris libc's %gs
* segment so we have access to the Solaris errno, etc.
*/
pushl %gs
pushl $LWPGS_SEL
popl %gs
This sequence ensures that we always enter our Solaris code using the
well-known value used for our %gs. We also stash the current
value of %gs on the stack, so we can restore it prior to
returning to Linux code.
3.5.2 EFAULT/SIGSEGV
If the user-space emulation library were to access an argument from a
system call which had an invalid address, a SIGSEGV signal would be
generated. For proper Linux emulation, the desired result in this
situation is to generate an error return from the system call with an
EFAULT errno.
To deliver the expected behavior, we will introduce a new system call
(uucopy()), which copies data from one user address to
another. Any attempt to use an illegal address will cause the call to
return an error. Otherwise, the data will be copied as if we had performed
a standard bcopy() operation.
For example:
int
lx_system_call(int *arg)
{
int local_arg;
int rval;
/*
* catch EFAULT
*/
if ((rval = uucopy(arg, &arg, sizeof (int))) < 0)
return (rval); /* errno is set to EFAULT */
/*
* transform the arg, now in local_arg, to Solaris format
*/
return (solaris_system_call(&local_arg));
}
This functionality seems to be generically useful, so the
uucopy() call will be implemented in libc, where
it will be available to any application.
If the overhead imposed by this system call dramatically limits
performance, we may include an environment variable that causes the brand
library to perform a standard userspace copy rather than the kernel-based
copy. Setting this variable would lead to higher performance, but some
system calls would segfault rather than returning EFAULT.
3.6 Signal Handling
Delivering signals to a Linux process is complicated by differences in
signal numbering, stack structure and contents, and the action taken when a
signal handler exits. In addition, many signal-related structures, such as
sigset_ts, vary between Solaris and Linux.
The simplest transformation that must be done when sending signals is
to translate between Linux and Solaris signal numbers.
Major signal number differences between Linux and Solaris
| Number |
Linux |
Solaris |
| 10 | SIGUSR1 | SIGBUS |
| 12 | SIGUSR2 | SIGSYS |
| 16 | SIGSTKFLT | SIGUSR1 |
| 17 | SIGCHLD | SIGUSR2 |
| 18 | SIGCONT | SIGCHLD |
| 19 | SIGSTOP | SIGPWR |
| 20 | SIGTSTP | SIGWINCH |
| 21 | SIGTTIN | SIGURG |
| 22 | SIGTTOU | SIGPOLL |
| 23 | SIGURG | SIGSTOP |
| 24 | SIGXCPU | SIGTSTP |
| 25 | SIGXFSZ | SIGCONT |
| 26 | SIGVTALARM | SIGTTIN |
| 27 | SIGPROF | SIGTTOU |
| 28 | SIGWINCH | SIGVTALARM |
| 29 | SIGPOLL | SIGPROF |
| 30 | SIGPWR | SIGXCPU |
| 31 | SIGSYS | SIGXFSZ |
When a Linux process sends a signal using the kill(2)
system call, we translate the signal into the Solaris equivalent before
handing control off to the standard signalling mechanism. When a signal is
delivered to a Linux process, we translate the signal number from Solaris
back to Linux. Translating signals both at generation and at delivery time
ensures both that Solaris signals are sent properly to Linux applications
and that signals' default behavior works as expected.
One issue is that Linux supports 32 real time signals, with
SIGRTMIN typically starting at or near 32
(SIGRTMIN) and proceeding to 63 (SIGRTMAX)
(SIGRTMIN) is "at or near" 32 because glibc usually "steals"
one ore more of these signals for its own internal use, adjusting
SIGRTMIN and SIGRTMAX as needed.) Conversely,
Solaris actively uses signals 32-40 for other purposes and and only
supports 7 realtime signals, in the range 41 (SIGRTMIN) to 48
(SIGRTMAX).
At present, attempting to translate a Linux signal greater than 39
(corresponding to the maximum real time signal number Solaris can support)
will generate an error. We have not yet found an application that attempts
to send such a signal.
Branded processes are set up to ignore any Solaris signal for which
there is no direct Linux analog, preventing the delivery of untranslatable
signals from the global zone.
3.6.1 Signal Delivery
To support user-level signal handlers, BrandZ uses a double layer of
indirection to process and deliver signals to branded threads.
In a normal Solaris process, signal delivery is interposed on for any
thread registering a signal handler by libc. Libc needs to do various bits
of magic to provide thread-safe critical regions, so it registers its own
handler, named sigacthandler, using the
sigaction(2) system call. When a signal is received,
sigacthandler() is called, and after some processing, libc
calls the user's signal handler via a routine named
call_user_handler().
Adding a Linux branded thread to the mix complicates things somewhat.
First, when a thread receives a signal, it could be running with a Linux
value in the x86 %gs segment register as opposed to the value Solaris
threads expect; if control were passed directly to Solaris code, such as
libc's sigacthandler(), that code would experience a
segmentation fault the first time it tried to dereference a memory location
using %gs.
Second, the signal number translation referenced above must take place.
(As an example, SIGCONT is equivalent in function in Linux and
Solaris, but Linux' SIGCONT is signal 18 while Solaris' is
signal 25.) Further, as was the case with Solaris libc, before the
Linux signal handler is called, the value of the %gs segment register must
be restored to the value Linux code expects.
This need to translate signal numbers and manipulate the %gs register
means that while with standard Solaris libc, following a signal from
generation to delivery looks something like:
kernel ->
sigacthandler() ->
call_user_handler() ->
user signal handler
for BrandZ Linux threads, this instead would look like this:
kernel ->
lx_sigacthandler() ->
sigacthandler() ->
call_user_handler() ->
lx_call_user_handler() ->
Linux user signal handler
The new addtions are:
- lx_sigacthandler()
This routine is responsible for setting the %gs segment register to the
value Solaris expects, and for jumping to Solaris' libc signal
interposition handler, sigacthandler().
- lx_call_user_handler()
This routine is responsible for translating Solaris signal numbers to their
Linux equivalents, building a Linux signal stack based on the information
Solaris has provided, and passing the stack to the registered Linux signal
handler. It is, in effect, the Linux thread equivalent to libc's
call_user_handler.
Installing lx_sigacthandler() is a bit tricky, as normally
libc's sigacthandler() routine is hidden from user programs.
To facilitate this, a new private function was added to libc,
setsigaction():
void setsigacthandler(void (*new_handler)(int, siginfo_t
*, void *),
void (**old_handler)(int, siginfo_t *, void *))
The routine works by modifying the per-thread data structure that libc
already maintains that keeps track of the address of its own interposition
handler with the address passed in; the old handler's address is set in the
pointer pointed to by the second argument, if it is non-NULL, mimicking the
behavior of sigaction() itself. Once
setsigacthandler() has been executed, all future branded
threads this thread may create will automatically have the proper
interposition handler installed as the result of a normal
sigaction() call.
Note that none of this interposition is necessary unless a Linux thread
registers a user signal handler, because the default action for all signals
is the same between Solaris and Linux save for one signal,
SIGPWR. For this reason, BrandZ always installs its own
internal signal handler for SIGPWR that translates the action
to the Linux default, to terminate the process. (Solaris' default action
is to ignore SIGPWR.)
It is also important to note that when signals are not translated,
BrandZ relies upon code interposing upon the wait(2) system
call to translate signals to their proper values for any Linux threads
retrieving the status of others. So, while the Solaris signal number for a
particular signal is set in the data structures for a process (and would be
returned as the result of, for example, WTERMSIG()), the
BrandZ interposition upon wait(2) is responsible for
translating the value WTERMSIG(), and would return from a
Solaris signal number to the appropriate Linux value.
3.6.2 Returning From Signals
The process of returning to an interrupted thread of execution from a
user signal handler is entirely different between Solaris and Linux. While
Solaris generally expects to set the context to the interrupted one on a
normal return from a signal handler, in the normal case Linux instead sets
the return address from the signal handler to point to code that calls one
of two specific Linux system calls, sigreturn(2) or
rt_sigreturn(2). Thus, when a Linux signal handler completes
execution, instead of returning through what would, in Solaris' libc be a
call to setcontext(2), the sigreturn(2) or
rt_sigreturn(2) Linux system calls are responsible for
accomplishing much the same thing.
This trampoline code (for a call to sigreturn(2)) looks
like this:
pop %eax
mov LX_SYS_sigreturn, %eax
int $0x80
such that when the Linux user signal handler is eventually called, the
stack looks like this:
| Pointer to sigreturn trampoline code |
| Linux signal number |
| Pointer to Linux siginfo_t |
| Pointer to Linux ucontext_t |
| Linux ucontext_t |
| Linux fpstate |
| Linux siginfo_t |
BrandZ takes the approach of intercepting the Linux
sigreturn(2) (or rt_sigreturn(2)) system call in
order to turn it into a return through the libc call stack that Solaris
expects. This is done by the lx_sigreturn() or
lx_rt_sigreturn() routines, which remove the Linux signal
frame from the stack and pass the resulting stack pointer to another
routine, lx_sigreturn_tolibc(), which makes libc believe the
user signal handler it had called returned.
When control then returns to libc's call_user_handler()
routine, a setcontext(2) will be done that (in most cases)
returns the thread executing the code back to the location originally
interrupted by receipt of the signal.
One final complication in this process is the restoration of the %gs
segment register when returning from a user signal handler. Prior to
BrandZ, Solaris' libc forced the value of %gs to a known value when calling
setcontext() to return to an interrupted thread from a user
signal handler (as libc uses %gs internally as a pointer to curthread, it
is a way of ensuring a good "known value" for curthread.)
Since BrandZ requires that setcontext() restore a Linux value for %gs
when returning from a Linux signal handler, we made this forced restoration
optional on a per-process basis. This was accomplished by means of a new
private routine to libc:
void set_setcontext_enforcement(int on)
By default, the "curthread pointer" value enforcement is enabled. When
this routine is called with an argument of '0', the mechanism is disabled
for this process.
Shutting off this mechanism will not have any correctness or security
implications. Writing to the %gs segment register is not a privileged
operation and as such %gs can be set to any value at any time by user code.
The only drawback to disabling the mechanism is that if a bad value is set
for %gs, the broken application will likely segmentation fault deep within
libc.
3.7. Device and ioctl() support
3.7.1 Determining Which Devices to Support
Our investigation showed that he following devices are the minimum set
required to support Linux branded zones:
/dev/null
/dev/zero
/dev/ptmx
/dev/pts/*
/dev/tty
/dev/console
/dev/random
/dev/urandom
/dev/fd/*
The following devices were considered, but aren't actually necessary
for Linux branded zones.
/dev/ttyd? - These are serial port devices.
/dev/pty[pqr]? and /dev/tty[pqr]? - These are
old style terminal devices provided for compatibility purposes. They
currently do not exist in native non-global zones. The Unix98
specification replaced these devices with /dev/ptmx and
/dev/pts/*, which have been used as the standard terminal
devices for Solaris and Linux for a long time. While these devices do
still exist on Red Hat 2.4 systems, they have officially been unsupported
since Linux 2.1.115. An inspection of a Linux 2.4 system didn't reveal
any applications that were actually using these devices.
3.7.1.1 Networking Devices
Native Solaris non-global zones have a network interface that is
visible (reported via ifconfig), but there are no actual network device
nodes accessible via /dev. Certain higher level network
protocol devices are accessible in native zones:
/dev/arp, /dev/icmp, /dev/tcp, /dev/tcp6, /dev/udp, /dev/udp6
/dev/ticlts, /dev/ticots, /dev/ticotsord
Notably missing from the list above is:
/dev/ip
Looking at a native Linux 2.4 system, we see that the following network
devices exist:
/dev/inet/egp, /dev/inet/ggp, /dev/inet/icmp, /dev/inet/idp
/dev/inet/ip, /dev/inet/ipip, /dev/inet/pup, /dev/inet/rawip
/dev/inet/tcp, /dev/inet/udp
Documentation/devices.txt describes all of these as
'iBCS-2 compatibility devices', so we will not be supporting them.
Additionally, Linux does not create /dev entries for
networking devices. Network interface names are mapped to kernel modules
via aliases defined in /etc/modules.conf. Interface plumbing
(via ifconfig) is all done via ioctls to sockets. Reporting status of
interfaces (via ifconfig) is done either by socket ioctls or by accessing
files in /proc/net/. Finally, network accesses (telnet and ping) are all
done via socket calls.
This indicates that initial Linux zones networking support has no
actual device requirements, but it does require ioctl translations. (This
issue is addressed later in this document.) Given the lack of
device-specific configuration, the current native zones network interface
can be leveraged in Linux branded zones.
3.7.2 Major/Minor Device Number Mapping
Linux has explicitly hardcoded knowledge about how major and minor
device numbers map to drivers and paths. (See
Documentation/devices.txt in the Linux source.) But on
Solaris, major device number to driver mapping is dynamically defined via
/etc/name_to_major. Minor device number name space is managed
by individual drivers.
Also, there is not a 1:1 major/minor device number mapping between
Linux and Solaris devices nodes and emulation of the functionality provided
by a given Linux driver might involve using multiple Solaris drivers. For
example, in Linux, both /dev/null and /dev/random
are implemented using the same driver, so both of these devices have the
same major number. In Solaris, these devices are implemented with separate
drivers, each with its own major number.
Major/minor device numbers are exposed to Linux applications in
multiple places. Some examples are:
- the
stat(2) / statvfs(2) family of system
calls
- the filesystem name space via
/dev/pts/*
-
/proc/<pid>/stat
- certain ioctls (specifically
LX_TIOCGPTN )
One important question to answer is, is this device number translation
actually necessary? Are major and minor device numbers actually consumed
by Linux applications, and if so, how? As it turns out, the answer to this
question is unfortunately yes. glibc's
ptsname() does "sanity checks" of pts paths to makes sure they
have expected dev_t values. These "sanity checks" make assumptions about
expected major and minor device values.
Therefore, we will be required to provide a device number mapping
mechanism. For the required devices listed earlier, this gives us:
Device Solaris Driver Linux Major/Minor
------ -------------- -----------------
/dev/null mm 1 / 3
/dev/zero mm 1 / 5
/dev/random random 1 / 8
/dev/urandom random 1 / 9
/dev/tty sy 5 / 0
/dev/console zcons 5 / 1
/dev/ptmx ptm 5 / 2 [1]
/dev/pts/* pts 136 / * [2]
Notes:
[1] ptm is a clone device, so this translation is tricky.
Basically, the /dev/ptmx node in a native zone actually
points to the clone device. But when an open is done
on this device, the vnode that is returned actually corresponds
to a ptm node (and not a clone node).This means that
on a Solaris system, a stat of /dev/ptmx will return different
dev_t values than an fstat(2) of an fd that was created by
opening /dev/ptmx. On Linux, both of these operations need to
return the same result. So once again, we are mapping
multiple major/minor Solaris device numbers to a single
Linux device number.
[2] For pts devices, there should probably be no translation done
for device minor node numbers.
3.7.3 Ioctl Translation Support
3.7.3.1 Ioctl Support - General Issues
A quick investigation shows that most of the required ioctl support
isn't actually for devices at all. Most the necessary ioctls are for
non-device filesystem nodes. Here's a list of the most obviously needed
ioctls, broken up into categories of files that support these ioctls:
1) All file ioctls (regular files, streams devices, sockets, fifos):
FIONREAD, FIONBIO
2) Streams file ioctls (Streams device, sockets, fifos):
TCSBRK, TCXONC, TCFLSH, TIOCEXCL, TIOCNXCL, TIOCSPGRP,
TIOCSTI, TIOCSWINSZ, TIOCMBIS, TIOCMBIC, TIOCMSET,
TIOCSETD, FIOASYNC, FIOSETOWN, TCSETS, TCSETSW,
TCSETSF, TCSETA, TCSETAW, TCSETAF, TIOCGPGRP, TCSBRKP
3) Socket ioctls:
FIOGETOWN, SIOCSPGRP, SIOCGPGRP, SIOCATMARK,
SIOCGIFFLAGS, SIOCGIFADDR, SIOCGIFDSTADDR,
SIOCGIFBRDADDR, SIOCGIFNETMASK, SIOCGIFMETRIC,
SIOCGIFMTU, SIOCGIFCONF, SIOCGIFNUM
4) Streams device - ptm
TIOCGWINSZ, UNLKPT, LX_TIOCGPTN
5) Streams /w ttcompat module - pts
TIOCGETD
6) Streams /w ldterm, ptem or ttcompat module - pts
TCGETS
7) Streams /w ldterm or ptem module - pts
TCGETA
8) Streams device - pts
LX_TIOCSCTTY, LX_TIOCNOTTY
Most of these ioctls are streams ioctls, and since FIFOs and sockets
are implemented via streams in Solaris, any FIFO or socket supports most of
these ioctls. Of the 45 ioctls listed above, only 8 are actually
device-specific ioctls.
This indicates that doing ioctl translations via layered drivers is not
the best approach, since this would only address a minor subset of the
total ioctls that need to be supported. Because supporting non-device
ioctls will require the creation of a non-layered driver ioctl translation
mechanism, it seems more appropriate to handle device ioctls via this same
mechanism as well.
With this in mind, it's more interesting if the categories above are
renamed based in terms of their vnode v_type and v_rdev values. If we do
this, we get:
1) VREG, VFIFO, VSOCK, VCHR[ptm, pts, sy, zcons]
2) VFIFO, VSOCK, VCHR[ptm, pts, sy, zcons]
3) VSOCK
4) VCHR[ptm]
5, 6, 7, 8) VCHR[pts]
Supporting ioctls on these vnodes will require a switch table. In
addition to the ioctl number, the translation mechanism must look at the
type of the file descriptor an ioctl is targeting to determine what
translation needs to be done. Hence, the translation layer will need to
looking at the v_type and the major portion of v_dev associated with the
target file descriptor. These fields are easily accessible from the kernel
and are also available via st_mode and st_rdev from fstat(2). So this
translation could occur either in the kernel or in userland.
The only tricky part about this determination is that we don't want to
hard code the the major Solaris driver number into any translation code
since these number are allocated dynamically via
/etc/name_to_major in the global zone. Therefor device ioctl
translators should be bound to specific Solaris drivers by their driver
name, and when an application attempts to perform an ioctl to a driver the
translation code will need to be able to resolve the driver name to driver
major number mapping. This translation code should have not have any
impact on how devices are managed in the global zone.
3.7.3.2 Ioctl support - nits
There are other more minor issues surrounding ioctl support that are
worth mentioning.
Ioctl cmd symbols that represent the same ioctl command on Solaris and
Linux can have different underlying symbol values. For example, TCSBRK on
Solaris is 0x5405, while on Linux it's 0x5409. Any translation layers will
have to be aware of the Linux ioctl values and translate them into Solaris
ioctl values.
One final note: BrandZ will take an "opt-in" approach to ioctls.
Only those ioctls that we have explicitly added support
for will be executed. All others will return EINVAL. The alternative
approach, simply passing the unrecognized ioctls through to Solaris proper,
is risky for the following reasons:
- Nondeterministic behavior: Since ioctl cmd values can be different
from Solaris, and devices on Solaris and Linux can support different
behaviors, simply passing on unknown ioctls could result in
nondeterministic device and/or application behavior.
- Inadvertent breakage and maintainability issues: If an application
depends on certain ioctls that are not explicitly listed as supported
in brand-specific code, then there is an increased chance that a
developer might attempt to change the implementation of one of those
ioctls without knowing that the ioctl is also being consumed by a
brand, and thereby inadvertently break the brand. If all ioctls
supported by a brand are explicitly listed in the brand support code,
then when developers search for consumers of a given ioctl, they
will see that the ioctl is exported for application consumption in a
brand environment.
3.7.4 Device Minor Nodes and Path Management
Multiple zones (both native and non-native) will be sharing devices with
the global zone and with each other. Therefor, we must ensure that this device
sharing doesn't generate any security problems where one zone could affect
another because of device sharing. Here's a look at the device nodes that
will be present in zones and what types of risks/issues these devices
present.
Device Notes
------ -----
/dev/null read-write, doesn't have any consumer state
/dev/zero read-write, doesn't have any consumer state
/dev/random Writable only by root, protected from root
zone writes via the 'sys_devices' privilege.
/dev/urandom Writable only by root, protected from root
zone writes via the 'sys_devices' privilege.
/dev/tty Pass through device,
doesn't have any consumer state
/dev/ptmx Cloning device. Each open results in a unique
minor node, so a minor node can only exist
in one zone at any given time.
/dev/pts/* All minor nodes are visible in all zones.
Currently, this driver has been made zone
aware to prevent multiple zones from
accessing the same minor nodes concurrently.
/dev/console Minor nodes of this device should not be
shared across different zones. Each
instance of this device in a zone should
have a unique minor number. This is the
current behavior for native zones.
The other important aspect of device paths is how brand/zone-specific
device paths are seen from the global zone. Currently, the only
brand/zone-specific device that exists is the zcons console device. Here
are examples of zcons device paths that could exist in the global zone (with
two native zones booted):
/dev/zcons/<zone1_name>/masterconsole ->
/devices/pseudo/zconsnex@1/zcons@0:masterconsole
/dev/zcons/<zone1_name>/zoneconsole ->
/devices/pseudo/zconsnex@1/zcons@0:zoneconsole
/dev/zcons/<zone2_name>/masterconsole ->
/devices/pseudo/zconsnex@1/zcons@1:masterconsole
/dev/zcons/<zone2_name>/zoneconsole ->
/devices/pseudo/zconsnex@1/zcons@1:zoneconsole
These device paths are very zcons centric and don't really extend well
if we attempt to introduce any new brand-specific devices. If we decide to
add any new brand-specific devices, then these paths should probably be
changed. For instance, if we add a Linux brand-specific driver that layers
on top of /dev/ptmx, then the following device paths might
make more sense:
/dev/zones/<brand>/<zone_name>/masterconsole ->
/devices/pseudo/zonesnex@1/zcons@0:masterconsole
/dev/zones/<brand>/<zone_name>/zoneconsole ->
/devices/pseudo/zonesnex@1/zcons@0:zoneconsole
/dev/zones/<brand>/ptmx ->
/devices/pseudo/ptm_linux@0:ptmx
3.7.5 Device-Specific Issues
3.7.5.1 /dev/console
Each branded zone will need its own console device, just like native
zones today. Whenever possible, a brand should leverage the zcons driver
and use it as the /dev/console device in a non-native zones.
We have done this in the lx brand.
3.7.5.2 /dev/ptmx and /dev/pts
These devices need special management. When Linux applications access
these devices, we need to do two things.
-
After an open of
/dev/ptmx, we need to ensure that
the associated /dev/pts/* device link exists (since
it can be created asynchronously to ptmx opens), and
that it has its ownership changed to match that of the
process that opened this instance of /dev/ptmx.
-
We need to ensure that after any pts device is opened
by a Linux application, the following modules get pushed
onto its stream:
ptem, ldterm, ttcompat, and ldlinux
In Solaris, issue 1 above is done by libc`ptsname() with the help of an
su binary. Issue 2 above is done by client applications that are
allocating terminals (for example, this is done in xterm`spawn().)
For BrandZ, there two possible approaches for solving these problems.
The first is to continue with the initially prototyped approach and do
both of these things post open in an interception layer. For the ptm
device (issue 1 above), this interception layer would have to be in the
kernel since it involves changing the ownership of a device and
/dev is mounted as read only in zones. The post open pts
device processing (issue 2 above) could be done in the kernel or in
userland. This approach doesn't seem that great since its implementation
is spread across both userland and the kernel. It also involves
post-processing all opens to determine if additional work is necessary.
We will adopt an alternate approach, and replace the ptm driver with a
layered driver in Linux branded zones. The current ptmx driver will be
replaced with a self-cloning layered driver in the Linux zones. Upon open,
this layered driver will:
- Open an instance of the real
/dev/ptmx.
- Wait for the corresponding
/dev/pts/* node to be created.
- Set the permissions on the corresponding
/dev/pts/* node.
- Set up the auto push mechanism (via kstr_autopush() to automatically
push the required strmods onto the corresponding
/dev/pts/*
node when it is actually opened by a Linux application.
Upon final close of a Linux ptm node, this driver will remove the auto
push configuration it created and close the underlying
/dev/ptmx node that it opened.
3.7.5.3 /dev/fd/*
The entries in /dev/fd are not actually devices. The
entries in /dev/fd/ allow a process access to its own file
descriptors via another namespace. Thus, opens of entries in this
directory map to re-opens of the corresponding file descriptor in the
current process.
In Solaris /dev/fd is implemented via a filesystem.
readdir(3C)s of /dev/fd might not return an accurate
reflection of the current file descriptor state of a process, but opens of
specific entries in the directory will succeed if that file descriptor is
valid for the process performing the open.
In Linux, /dev/fd is implemented as a symbolic link to
/proc/self/fd. This /proc filesystem directory
is similar to the Solaris /proc/<pid>/fd directory in
that it contains an accurate representation of a processes current file
descriptor state. But aside from just providing access to the processes
current file descriptors, on Linux the files in this directory are actually
symbolic links to the underlying files referenced by the processes file
descriptors. This is similar to the functionality in Solaris provided by
/proc/<pid>/paths.
The most common uses for /dev/fd entries are for suid
shell script and as parameters to commands that don't natively support I/O
to stdin/stdout. Given these use cases it seems that a simple
mount of the existing Solaris /dev/fd filesystem in the Linux
zone should be sufficient for compatibility purposes. Of course, it is
possible that other Linux applications exist that utilize some of the
additional functionality of the Linux /dev/fd implementation
that isn't available in Solaris. (For example, /dev/fd on
Linux provides an accurate reflection of the current fd state of the
process and allows applications to determine file descriptor to file path
mappings.) If such applications are discovered, then we might need to
revisit this strategy.
3.7.5.4 audio devices
Linux has two different audio subsystems OSS and ALSA. To determine
which audio subsystem to support we identified some common/popular
applications that utilize audio and checked which subsystem they use. We
found:
OSS only:
skype, real/helix media player, flash, quake, sox
OSS or ALSA:
xmms (selectable via plugins)
Our survey identified no popular applications that require ALSA, so we
will only be supporting OSS audio.
Audio device access on Linux and Solaris is done via reads, writes, and
ioctls to different devices.
OSS devices:
/dev/dsp, /dev/mixer
/dev/dsp[0-9]+, /dev/mixer[0-9]+
Solaris devices:
/dev/audio, /dev/audioctl
/dev/sound/[0-9]+, /dev/sound/[0-9]+ctl
Unfortunately, we can't simply map the Solaris /dev/audio
and /dev/audioctl devices to /dev/dsp and
/dev/mixer devices in a Linux and expect the ioctl translator
to handle everything else for us. Some of the reason for this are:
- The admin/user may not always want a Linux branded zone to have access
to system audio devices.
- There may be multiple audio devices on a system each of which may
support only input, only output, or both input and output. In Solaris a
user can specify which audio device an application should access by
providing a
/dev/sound/* path to the desired device. But in
the Linux zone the admin might want the Linux audio device to map to
separate Solaris audio devices for input and/or output.
- Linux ioctl translation is done using dev_t major values. On Solaris
opening
/dev/audio will result in accessubg different device
drivers based of what the underlying audio hardware is, and these different
drivers may have different dev_t values. Hence, if audio devices were
directly imported the dev_t translator would need to have knowledge of
every potential audio device driver on the system, and as new audio drivers
are added to the system this translator would need to be made aware of them
as well.
- In Linux audio devices are character devices and support mmap
operations. On Solaris audio devices are streams based and do not support
mmap operations.
To deal with these problems the following components are providedL
- A way for the user to enable audio support in a zone via zonecfg. The
user enables audio via zonecfg boolean attribute called "audio". (The
absence of this attribute implies a value of false.) Adding this resource
to a zone via zonecfg looks like this:
--
zonecfg:centos> add attr
zonecfg:centos:attr> set name="audio"
zonecfg:centos:attr> set type=boolean
zonecfg:centos:attr> set value=true
zonecfg:centos:attr> end
zonecfg:centos> commit
zonecfg:centos> exit
--
By default when a Linux audio applications attempts to open
/dev/dsp this access is mapped to /dev/audio
under Solaris. (Linux application access to /dev/mixer are
mapped to /dev/audioctl.)
To allow an admin to control which Solaris devices a Linux zone can
send input/output to we provide two additional attributes that have
string values:
audio_inputdev = none | [0-9]+
audio_outputdev = none | [0-9]+
If audio_inputdev is set to none, then audio input is disabled. If
audio_inputdev is set to an integer, then when a Linux application attempts
to open /dev/dsp for input this access is mapped to
/dev/sound/<audio_inputdev attribute value>. The same
behavior applies to audio_outputdev for Linux audio output accesses.
If audio_inputdev or audio_outputdev exist but the audio attribute is
missing (or set to false) audio will not be enabled for the zone.
Currently there is no mechanism outside of zonecfg itself for verifying
zone attributes. So if a user specifies invalid types or values for these
attributes there is no way to alert them during the zone configuration
stage. (Later when attempting to boot the zone we will have an opportunity
to verify these attributes and can fail to boot the zone if they are
invalid.)
- Create a new layered driver to act as a Linux audio device.
This device will always be mapped into the zone. Linux can change the
ownership of this device nodes as it sees fit.
Since this layered driver will be accessing Solaris devices from within
the kernel there will be no problems with device ownership. The Solaris
audio devices will continue to be owned by whoever is logged in on the
console but a Linux zone will also be able to access the device whenever
necessary.
Luckily, the Solaris audio architecture includes an integrated audio
output stream mixer so that multiple programs can open the audio device and
output data at the same time. Unfortunately it's not possible to
virtualize all audio features in this same manner. For instance there can
only be one active audio recording stream on a given audio device at a
time. Also it would be difficult to virtualize global mixer controls.
Hence, by allowing a zone shared access to audio devices owned by the
console user, it would be possible for zone users and the console user to
compete for any non-virtualized audio resources. These limitations seem
acceptable if it's assumed that the user who owns the console is the same
user who is running Linux audio applications. We assume that this will
probably be the common case for most audio enabled Linux zones. If these
limitations are not acceptable then an admin always has the option of
updating /etc/logindevperm to deny console users access to
audio devices. (thereby giving a zone exclusive access to the device.)
This device is implemented as a character driver (instead of as a
streams driver.) This allows it to more easily support Linux memory mapped
audio device access. Theoretically, mmap device access could be simulated
in the ioctl translation layer but this would add quite a bit of extra
complexity to the translation system. It would require that the ioctl
translation layer start to maintain state across multiple ioctl operations,
something it currently does not do. It would also require user land helper
threads and support for handling and redirecting signals that might get
delivered to those threads.
- Provide hooks into the zone state transition brand callback mechanism
to propagate zone audio device configuration to the Linux layered audio
device.
This is done via a program that opens the layered Linux audio device
when the zone is booted, and uses ioctls to configure the device. When a
zone is halted this same program is used to notify the driver that any
previous configuration for a given zone is no longer valid.
This model provides each Linux zone with one audio device. There are no
inherent limits in this approach that would prevent future support for
multiple virtual audio devices in a linux zone. The zonecfg
attribute configuration mechanism could be extended to allow the the
administrator to specify what the mappings for additional audio devices
should be. The audio layered driver could also be enhanced to export
multiple audio and mixer minor nodes which would all be imported into any
linux zone on the system. Lastly, the zone boot configuration callbacks
could be enhanced to parse the additional zonecfgaudio
configuration attributes and pass the additional configuration information
down to the audio layered driver.
3.8 NFS
The BrandZ lx brand will have NFSv3/2 client support. NFSv4 is not
supported since the version of RedHat/CentOS that the lx brand currently
supports does not support NFSv4 itself. Having an NFSv3/2 server in a
branded zone is also not supported since this is an existing limitation on
native zones systems and BrandZ is not removing this limitation.
3.8.1 NFSv3/2 Client Support
NFS client support consists of two major components:
- A kernel filesystem module that can "mount" nfs shares from a server
and service accesses to that filesystem via VOP_ interfaces
lockd and statd rpc services to handle nfs
locking requests.
Due to licensing issues we can't port the Linux NFS kernel filesystem
module to Solaris,
On Solaris lockd is a userland daemon with a significant
kernel component: klmmod. Most of the lockd
functionality is actually implemented in the kernel in klmmod.
The kernel lockd component also uses private undocumented
interface (NSM_ADDR_PROGRAM) to communicate with statd.
On Linux lockd is actually entirely contained within the
kernel. When the kernel starts up the lockd services, it
creates a fake process that is visible via the ps command but
lacks most normal /proc style entries.
Given the how closely integrated the separate components of the NFS
client are on Solaris, and given that that most of the NFS client on Linux
is in the kernel and there for not usable by the lx brand, the
approach taken to support the NFS client in the lx brand was to
simply run the Solaris NFS client within the lx zone.
Adding support for using the all Solaris NFS client components in a zone
involved modifications in BrandZ, the lx brand, and base Solaris.
Some of these areas and the modifications that were required are described
below.
3.8.1.1 mount(2) support
The first component of support NFS client operations in an lx
branded zone is translating Linux NFS mount system call requests into
Solaris mount system call requests. This requires translating the
arguments and options strings normally passed to the Linux into formats
that the Solaris kernel is expecting.
3.8.1.2 Starting and stopping of the lockd and
statd daemons.
The version of RedHat/CentOS Linux that the lx brand support
has a rc.d startup script that is used to start, stop, and
check the status of statd and lockd. To support
the execution of the Solaris versions of lockd and
statd in a zone, after install we replace the Linux
lockd and statd binaries with symlinks to scripts
in /native/usr/lib/lx/brand that start the Solaris versions of
the two daemons. The startup script has also been modified to successfully
allow for the startup, stopping, and querying of status for
lockd and statd. This approach preserves the
existing techniques for administering NFS locking under RedHat/CentOS in
the lx brand.
lockd and statd are also run under different
user and group ids under Linux than Solaris. The startup wrapper scripts
pass command line options to lockd and statd to
indicate what user and group should be used during normal operations.
3.8.1.3 Running native Solaris processes in a zone
Normally all processes in a non-native zone are branded non-native
process. NFS client support is the only exception to this rule since it
introduces the execution of two native Solaris processes into a non-native
zone. To support the basic execution of a native Solaris process in the
following steps were taken.
3.8.1.3.1 Brand aware exec path
The internal kernel exec path path (starting at
exec_common()) was updated to have a new brand operation flag.
The possible values for this flag are: clear (no special brand operation),
brand-exec (indicates that the current process should become branded if the
exec operation succeeds), and native-exec (indicates that the current
process should become a native process if the exec operation succeeds.)
Note, these extended exec flags are not accessible through the normal
exec() system call path. The normal exec() system call path always
defaults to this flag being clear. To change the branding of a process via
exec, a special brand operation (invoked via the brandsys(2)
system call) is used.
3.8.1.3.2 Chroot
One Problem with running native Solaris binaries in a branded zone is
that both the native binary and native libraries that they use expect to be
able to access native Solaris paths and files that may not exist inside a
branded. Rather than implementing a path mapping mechanism to re-direct
filesystem accesses for native binaries to paths into /native, during the
startup of these daemons we do a chroot("/native"). We've also ensured
that there is enough of the native Solaris environment created in /native
to allow lockd and statd run properly.
3.8.1.4 Allowing lockd and statd to
communicate with Linux services/interfaces within the zone.
lockd and statd are fairly self contained but
they do require access to certain services for which the native Solaris
versions won't be available in a zone. An audit of lockd and
statd reveal that these daemons depend on access to the
following services:
- naming services (via libnsl.so)
- syslog (via libc.so)
Normally, these daemons simply access these services via local
libraries. These libraries in turn use local files, other libraries, and
various network based resources to resolve requests. In a branded zone
most of these resources will not be available. For example, we can't
expect the Solaris libnsl.so library to know how to parse Linux NIS
configuration files.
To handle these requests we need to be able to leverage existing Linux
services and interfaces. This requires translating certain Solaris
lockd and statd services requests into Linux
service requests, and then translating any results back into a format that
Solaris libraries and utilities are expecting. In the lx brand
we've decided to call this process of translating service requests
thunking. (akin to a 32-bit OS calling into 16-bit BIOS code.) To
service these requests we have created a thunking layer which translates
Solaris calls into Linux calls.
This thunking layer works as follows:
- When
lockd or statd make a request
that requires thunking, this request ends up getting directed into a
library in the process called lx_thunk.so (the mechanism used to direct
requests into this library varies based of the type of request being
serviced and is discussed further below).
- The lx_thunk.so library packs up this request and sends it via a door
to child Linux process called lx_thunk.
- If the lx_thunk process does not exist then the lx_thunk.so library
will
fork(2)/exec(2) it.
- The lx_thunk process is a one line /bin/sh script that attempts to
execute itself and is executed in a Linux shell. When the brand emulation
library (lx_brand.so) detects that it is executing as the lx_thunk process
and it is attempting to re-exec itself, the library takes over the process
and sets itself up as a doors server.
- When the lx_thunk process receives a door request from lx_thunk.so
library in a native process, it unpacks the request and uses a Linux thread
to call invoke Linux interfaces to service the request.
- Once it is done servicing the request it packs up any results and
returns them via the same door call that it received the request on.
This thunking layer means that now the lx brand is dependent
upon Linux interfaces so we need to worry about Linux interfaces changing
and breaking the lx_thunk server process. To help avoid this possibility,
most the Linux interfaces that we've chosen to use are extremely well known
and listed in the glibc ABI. All of the interfaces are used by many
applications outside of glibc. Here are the Linux interfaces currently
used by the lx_thunk process:
gethostbyname_r
gethostbyaddr_r
getservbyname_r
getservbyport_r
openlog
syslog
closelog
__progname
Also worth mentioning is the means by which service requests that
require thunking are directed to lx_thunk.so. To intercept name service
requests the lx brand is introducing a new libnsl.so plugin
name-to-address translation library. Libnsl already supports
name-to-address translation plugin libraries that can be specified via
netconfig(4). For lx branded zones there will be a custom
netconfig file installed into /native/etc/netconfig that will instruct
libnsl.so to redirect name service lookup requests to a new library called
lx_nametoaddr.so. This library will in turn resolve name service requests
using private interfaces exported from the thunking library, lx_thunk.so.
3.8.1.5 rpcbind vs portmap
lockd and statd are both rpc based services.
There for when they start up they must register with an rpc portmapper.
This registration is done via a standardized rpc protocol. The difficulty
with this registration is that there are multiple versions of the protocol.
Initially the protocol was called the portmapper protocol and only
supported IP based transports. This is the version of the protocol that
RedHat/CentOS support. Solaris long ago upgraded the version of the
protocol it supports to be rpcbind. rpcbind supports registrations on
non-ip based transports. The problems faced by BrandZ are that all the
libnsl.so interfaces designed for talking to a local portmapper assume that
the local portmapper supports the rpcbind protocol. In the case of an
lx branded zone we're actually running the Linux portmapper which
doesn't support the rpcbind protocol. To work around this a new command
line flag has been added to lockd and statd to
indicate that they should attempt to to use the portmapper protocol instead
of the rpcbind protocol for registering rpc services. Also, new private
interfaces have been added into libnsl.so to allow it to support
communication via the portmapper protocol instead of the rpcbind protocol.
3.8.1.6 Privilege retention for lockd and
statd
On Solaris, lockd and statd and privilege
aware daemons and upon startup they drop all privilege they won't need for
normal execution. When running in a Linux zone, these daemons need
additional privileges so that they can chroot(2),
fork(2), exec(2), and run the lx_thunk processes.
So the lx_thunk.so library which is preloaded into these processes also
prevents them from dropping the following privileges:
proc_exec
proc_fork
proc_chroot
file_dac_read
file_dac_write
file_dac_search
3.8.2 Automounters
Linux supports two automounters: "amd" and "automount".
amd is implemented as a userland NFS server. It mounts NFS filesystems
on directories where it will provide automount services, and specified
itself as the server for these NFS filesystems. To support amd only
required adding translation support for all the Linux system call mount
options it expects to work.
automount, the more common (and often default) automounter, is
substantially more complex than amd. automount relies on a filesystem
module called autofs. Upon startup, automount mounts the autofs filesystem
onto all automount controlled directories. As an option to the mount
command it passes a file descriptor that indicates a pipe will be used to
send requests to the automount process. automount listens for requests on
this pipe. When it gets a request, it looks up shares via whatever name
services are configured, executes mount(2) system calls as
necessary, and notifies the autofs filesystem that a request has been
serviced. The exact semantics of the interfaces between automount and
autofs are versioned and appear to differ based of the Linux kernel
version. To support automount the lx brand will introduce a new
filesystem module called lx_autofs. When the automount process attempts to
mount the autofs filesystem we will instead mount the lx_autofs filesystem
which will emulate the behavior one specific version of the autofs
filesystem.
3.9 Debugging and
Observability
3.9.1 Ptrace
In order to support Linux strace, it will be necessary to
duplicate almost the full functionality of the ptrace system
call. Rather than trying to implement this giant wad in the kernel, we
will implement it in userland on top of a native /proc, which
will be mounted at /native/proc. It is worth noting that the
Solaris /proc already has a ptrace compatibility flag, which
provides much of the semantics we want (e.g., interaction through wait(2)).
Two difficult parts about ptrace are tracing Linux system calls and
attaching to existing processes.
To implement system call tracing, we want to stop the program in
userland before the interposition library has done any work. To do this, we
introduce a per-brand scratch space in the ulwp_t structure, similar to
that used by the DTrace PID provider. When we want to trace a system call
enter or exit, we set this flag from within the kernel via a brand-specific
system call. In the trampoline code, we check the status of this flag and
issue another brand-specific system call that will stop us with an
appropriately formed stack. Note that it's generally not possible to
"hide" the interpositioning library when it comes to signals. Besides
trying to figure where we are in the brand library (if we're in it at all),
we will probably do more harm than good when trying to hide this
behavior.
When a debugger attaches to another process using PTRACE_ATTACH under
Linux, it actually becomes the parent for the target process in all ways -
except that getppid() still returns the original value. Since this is an
implementation detail and an undesirable pollution of the Solaris process
model, we will instead add a per-brand child list that the wait(2) system
call will also check. When we attach to a process, we add it to the
debugger's brand list.
Finally, there are also significant issues around multithreaded apps.
Trying to reconcile the differences between the Linux and Solaris threading
models as well as the differences between ptrace(2) and
/proc appears to be a nearly intractable problem. For at
least the initial BrandZ release, we do not expect to be able to support
ptrace(2) for multithreaded applications.
3.9.2 librtld_db: MDB and Solaris ptools Support
Since the Linux binaries and applications are standard ELF objects,
Solaris tools are able to process them in essentially the same way that
Solaris binaries are processed. The main objective in doing so is to
retrieve symbols and thereby aid debugging and observability.
mdb and the ptools (pstack, pmap, etc) use interfaces provided by
librtld_db to debug live processes and core files.
librtld_db discovers ELF objects which have been mapped into
the target's (a target can be either a live process or a core file) address
space and reports these back to the librtld_db client.
librtld_db library understands enough of the internals of the
Solaris runtime linker to iterate over the linker's private link maps and
process the objects it finds. librtld_db allows our tools to
debug the Solaris portions of a branded process (such as the brand library,
libc, etc.), but they cannot understand any Linux objects that are mapped
into our address space, because the Solaris linker only has Solaris objects
on its link maps.
In order to give Solaris tools visibility into Linux binaries, a brand
helper-library framework is implemented in librtld_db. When
librtld_db is asked to examine a branded target process or
core file, it uses the AT_SUN_BRANDNAME aux vector to get the brand name of
the process. Once it has the brand name, it dlopen(3C)s a
shared object from:
/usr/lib/brand/brandname/brandname_librtld_db.so.1
Brand helpers must have a vector of operations called from critical
librtld_db hooks (e.g. helper initialization, loaded object
iteration, etc).
Once loaded, the helper library is responsible for finding any
brand-specific information it needs, such as the brand equivalent of
AT_SUN_LDDATA (used by librtld_db to find the
Solaris link maps), and preparing to return details about the objects
loaded into the address space by the brand's linker.
When a client of librtld_db operating on a branded
process, asks to know what objects are loaded in the target,
librtld_db walks the Solaris link maps and iterates over each
object it finds there, handing information about each to the caller. It
then calls down into the helper library, which does the same for the
brand-specific objects used by the target. In this manner, the client of
librtld_db does not need any modification to operate on
branded processes, nor does it need any special knowledge of brands; all
data is passed via established librtld_db APIs.
Because mdb and the other ptools are Solaris binaries,
they may not run inside a non-Solaris branded zone. They must therefore run
from the global zone and may operate on individual processes within the
branded zone.
3.9.3 Core Files
Since dumping core is handled by the kernel, it will produce Solaris
core files that cannot be understood by the native Linux tools. While it
would be possible to provide a brand-specific tool to convert between
Solaris core files and Linux core files, it will not be provided as part of
the initial release. Unless there is significant customer demand for this
tool, it will not likely be included in subsequent releases either. Our
preferred method for examining Linux core files is through
mdb.
3.9.4 DTrace
With the addition of brand support in librtld_db.so, the
DTrace PID provider is able to instrument Linux processes correctly with no
additional work.
We have also added an lx-syscall provider, which allows
DTrace to trace the Linux system calls issued by the application.
3.10. /proc
The lx brand will deliver a lx_proc kernel module that
provides the necessary semantics of a Linux /proc filesystem.
Linux tends to use /proc as a dumping ground for all things
system-related, although this is reduced by the introduction of sysfs in
the 2.6 kernel. Thus, we will not be able to emulate a large number of
elements from within a zone. Examples of unsupported functionality include
physical device characteristics, the USB device tree, acccess to kernel
memory, etc. Because various commands expect these files to be present,
but do not actually act on their contents, a number of these files will
exist but otherwise be empty.
We are able to emulate the per-process directories completely. The
following table shows the support status of other /proc system files.
| File |
Supported |
Description |
| cmdline | empty | Kernel command line options |
| cpuinfo | empty | Physical chip characteristics |
| crypto | no | Kernel crypto module info |
| devices | empty | Major number mappings |
| dma | empty | ??? |
| driver/* | no | Per-driver configuration settings |
| execdomains | no | ??? |
| fb | no | Framebuffer device |
| filesystems | empty | Available kernel filesystems |
| fs/* | no | ??? |
| ide/* | no | Kernel IDE driver info |
| interrupts | empty | Kernel interrupt table |
| iomem | no | I/O memory regions |
| ioports | empty | I/O port bindings |
| irq/* | no | ??? |
| kcore | empty | Kernel core image |
| kmsg | empty | Kernel message queue |
| ksyms | no | Kernel symbols |
| loadavg | yes | System load average |
| locks | no | ??? |
| mdstat | no | ??? |
| meminfo | yes | Virtual memory information |
| misc | no | ??? |
| modules | no | Random module information |
| mounts | yes | System mount table |
| mpt/* | no | ??? |
| mtrr | no | ??? |
| net/* | no* | Various network configuration |
| partitions | empty | Parition table |
| pci | no | PCI device info |
| scsi/* | no | SCSI device info |
| slabinfo | no | Kernel Slab allocator stats |
| stat | yes | General system wide statistics |
| swaps | no | Swap device information |
| sys/* | no | ??? |
| sysrq-trigger | no | ??? |
| sysvipc/* | no | System V IPC statistcis |
| tty/* | no | ??? |
| uptime | yes | System uptime |
| version | empty | Kernel version |
4. Deliverables
4.1 Source delivered into ON
Below is a summary of the new sources being added to ON.
New code for brand support:
| usr/src/lib/libbrand | support for reading the new XML files defining brands and virtual platforms |
| usr/src/lib/brand/native/ | virtual platform and template files for non-branded zones |
| usr/src/uts/common/os/brand.c | manages the brandx framework, tracks loaded brands, and so on. |
| usr/src/uts/common/syscall/brandsys.c | the brandsys() system call |
New directories created as holding areas for per-brand code:
| usr/src/lib/brand | for the userspace components of brands |
| usr/src/uts/common/brand | for platform-independent brand code |
| usr/src/uts/intel/brand | for Intel-specific brand code |
| usr/src/uts/sparc/brand | for SPARC-specific brand code |
For the Solaris 10 'test' brand:
| usr/src/pkgdefs/SUNWs10r | The package containing all the kernel-space pieces of the s10 brand |
| usr/src/pkgdefs/SUNWs10u | The package containing all the user-space pieces of the s10 brand |
| usr/src/lib/brand/s10/s10_brand | contains the emulation code and the zones integration support. |
| usr/src/uts/common/brand/s10 | source for kernel-level s10 support |
| usr/src/uts/intel/s10_brand | where the s10 Intel brand module is built |
| usr/src/uts/sparc/s10_brand | where the s10 SPARC brand module is built |
For the lx brand:
| usr/src/pkgdefs/SUNWlxr | The package containing the kernel-space components of the lx brand |
| usr/src/pkgdefs/SUNWlxu | The package containing the user-space components of the lx brand |
| usr/src/lib/brand/lx/lx_brand | contains the emulation code, as well as the zones integration support. (install scripts and so on) |
| usr/src/lib/brand/lx/librtld_db | the rtld_db plugin for the lx brand |
| usr/src/uts/common/brand/lx | source for kernel-level lx support |
| usr/src/uts/common/brand/lx/procfs | source for the lx /proc |
| usr/src/uts/common/brand/lx/dtrace | source for the lx syscall provider |
| usr/src/uts/intel/lx_brand | where the lx brand module is built |
| usr/src/uts/intel/lx_syscall | where the lx syscall provider is built |
| usr/src/uts/intel/lx_proc | where the lx /proc filesystem is built |
4.2 Components installed with the lx brand
Userspace components:
| usr/lib/brand/lx/lx_brand.so.1 | lx emulation library |
| usr/lib/brand/lx/[amd64]/lx_librtld_db.so.1 | lx rtld_db plugin |
| usr/lib/brand/lx/[amd64]/lx_thunk.so | lx thunking support |
| usr/lib/brand/lx/[amd64]/lx_nametoaddr.so.1 | Wrapper for Solaris name services |
| usr/lib/brand/lx/config.xml | Definition of the lx brand |
| usr/lib/brand/lx/platform.xml | Definition of the lx virtual platform |
| usr/lib/brand/lx/SUNWblank.xml | Values for a blank zone configuration |
| usr/lib/brand/lx/SUNWdefault.xml | Values for a default zone configuration |
| usr/lib/brand/lx/install | The scripts needed to install supported distros |
| usr/lib/brand/lx/lx_audio_config | Audio device configuration |
| usr/lib/brand/lx/lx_lockd | Wrapper for Solaris lockd |
| usr/lib/brand/lx/lx_statd | Wrapper for Solaris statd |
| usr/lib/brand/lx/lx_native | Wrapper for native Solaris processes |
| usr/lib/brand/lx/lx_boot | Boot-time script |
| usr/lib/brand/lx/lx_halt | Halt-time script |
| usr/lib/brand/lx/lx_install | Main lx install script |
| usr/lib/devfsadm/linkmod/SUNW_lx_link_i386.so | Brand-aware devfsadm link module |
Kernel components:
| kernel/brand/[amd64]/lx_brand | lx brand kernel module |
| kernel/drv/[amd64]/lx_audio | lx Linux-compatible audio driver |
| kernel/drv/[amd64]/lx_systrace | lx syscall provider |
| kernel/drv/[amd64]/lx_ptm | lx Linux-compatible ptm driver |
| kernel/fs/[amd64]/lx_afs | lx automounter support |
| kernel/fs/[amd64]/lx_proc | lx /proc filesystem |
| kernel/strmod/[amd64]/ldlinux | lx Linux-compatible ldterm module |