OpenSolaris


Virtual Memory – HAT (Hardware Address Translation) Layer

Status 06/05/2007

With the 2nd release of source, the VM and HAT layers are functional. The kernel is in control of the MMU even though we haven't yet executed a bop_quisce. We are still relying on the PROM interface for the console, print, and network connection. However, there is a comfort level with the functionality of kernel memory management. If you have reviewed the Openfirmware task, you will have noticed the issues related to the ODW firmware not running in virtual memory mode. Overall, this masked a number of items, so that when we got VOF up and running we actually regressed in this area. However, that is behind us.

More details to follow along with the 2nd source release.

Initial Review of the HAT Layer – 1/1/06

Way back when, before even writing a line of code, Guy did an assessment of the original 2.6 code to see what was usable in a 2.11 port project. Below are his notes from that review.

Virtual Memory: HAT – Hardware Address Translation Layer – Guy Shaw

The Virtual Memory system can be considered the core of a Solaris system, and the implementation of Solaris virtual memory affects just about every other subsystem in the operating system. Rather than managing every byte of memory, Solaris uses page-size pieces of memory to minimize the amount of work the virtual memory system has to do to maintain virtual-to-physical memory mappings. Figure 4.1 shows how the management and translation of the virtual view of memory (the address space) to physical memory is performed by hardware known as the virtual memory management unit (MMU).

--------------------------------------------------------------------------------------

Guy has evaluated HAT interface changes based on a difference listing of 2.6 vs 2.10: usr/src/uts/common/vm/hat.h in /ws/on998-gate vs /ws/onnv-gate.

The following are new functions

hat_dump
hat_thread_exit
hat_unload_callback
hat_register_callback
hat_add_callback
hat_delete_callback
hat_getkpfnum_badcall
hat_reserve
hat_page_demote
/* Kernel Physical Mapping (segkpm) hat interface routines. */
hat_kpm_mapin
hat_kpm_mapout
hat_kpm_page2va
hat_kpm_vaddr2page
hat_kpm_fault
hat_kpm_mseghash_clear
hat_kpm_mseghash_update
hat_kpm_addmem_mseg_update
hat_kpm_addmem_mseg_insert
hat_kpm_addmem_memsegs_update
hat_kpm_mseg_reuse
hat_kpm_delmem_mseg_update
hat_kpm_split_mseg_update
hat_kpm_walk
va_to_pfn
va_to_pa
  • The following functions have been removed
hat_pageflip
  • The following functions have a change in function signature
hat_share
hat_unshare
hat_dump() is small and is entirely processor-independent code.
hat_thread_exit() is small but the underlying function that implements it,
hat_switch(), does processor-specific and mmu-specific things to switch 
from one thread to another. Not a big problem.
The hat callback family of functions is currently implemented on Sparc only.
We can just supply pacifiers to comply with the new interface.
hat_getkpfnum() is deprecated. There are a few places left that still call 
hat_getkpfnum(). Those have all been changed to hat_getkpfnum_badcall() 
so that hat_getkpfnum() can be eliminated from the HAT interface. That way, 
nobody is tempted to write new code that uses hat_getkpfnum(). 
hat_getkpfnum_badcall() is just the implementation of what used to be 
hat_getkpfnum(). This is an easy change.
hat_reserve() does nothing.
hat_page_demote() is a significant amount of work. Much of it is 
processor-independent, because it has to do with the way Solaris allocates 
and deallocates Hardware Mapping Entries (HMEs). However, this can be 
deferred, because it is only used for mappings of large page sizes. We 
don't have to exploit large page sizes in userland in the first cut.
Kernel Physical Mapping (segkpm) hat interface routines, hat_kpm_*(), are 
processor-specific, but are trivial. Many would be no-ops on PowerPC.
va_to_pfn() is used only at boot time, while the boot loader is in charge of
the MMU. It is illegal to use it after that. Whoever writes the boot stuff 
can do what he wants. We may need to coordinate on this item.
va_to_pa() is declared in the common hat interface but is really only 
implemented on Sparc, and only Sparc-specific code (drivers, etc.) calls it. 
We not only don't have to implement it, we don't even have to define it.
The removal of hat_pageflip() is not a problem. The PowerPC 
implementation just returned a status indicating that this feature was not 
supported.
The change to hat_share() and hat_unshare() involves adding an argument, a 
page size code, to indicate the desired page size for shared mappings. 
This can be made simple by restricting the variety of page sizes we will 
deal with. For starters, we don't even have to support Intimate Shared 
Memory (ISM) at all.
A few flags have been added for some functions:
HAT_RELOAD_SHARE
HAT_NO_KALLOC
HAT_LOAD_AUTOLPG
HAT_INIT
All these are either trivial or can be deferred. That's about it for 
interface changes. Please see below for comprehensive details on the HAT port.
Guy has composed a more structured overview, which can be seen below.
Study of the Feasibility of Reusing Solaris/PPC 2.6 HAT Code
  • Background

Sun has already done a port of Solaris to PowerPC.  In 1995, Solaris
release 2.5.1 supported PowerPC.  Additional work was done for Solaris 
2.6. After 2.6, PPC support was removed, for commercial reasons rather
than any technical failure.
A big question in considering how to do the new Solaris/PPC is:
How much code from the Solaris/PPC 2.6 release can and should be reused, if 
any?
This document is concerned with answering that question only for
the HAT layer and other processor-specific VM code, and portions of
the boot that deal with VM.
There are many pieces of processor-specific code involved in any
port of Solaris to a new processor, but the HAT layer is a large and
critical part.  Whether a new HAT layer is written from scratch or
existing code is reused and upgraded, it is necessary to have some
idea of the costs of the HAT layer in order to have any hope of
reasoning about the total costs of the porting project.
  • HAT Roles

The HAT layer has many roles with respect to other parts of the
system, including hardware and other software.  In order to make
decisions about the suitability of the existing code to be reused,
all of these roles must be examined, in light of the changes that
have taken place over the last decade.
The roles the HAT layer plays are:
  1. HAT/MMU – Manager of hardware MMU resources
  2. HAT/provider – Provider of kernel services
  3. HAT/consumer – Consumer of kernel services
  4. HAT/boot – Partner during boot
The term HAT/runtime is used to denote HAT/MMU, HAT/provider,
and HAT/consumer, together; that is, everything except HAT/boot.
Keep in mind that the boundary between HAT roles is just logical, for
the sake of decomposing the analysis of changes in requirements.  It is
not that there are separate files, packages, modules, or functions
(whatever) that keep the code for these roles separate.  There is
separate code for HAT/boot vs HAT/runtime, but it is not possible to
separate out HAT/runtime roles in any coarse-grain fashion.  A single
function can be HAT/provider in one line and call some function
(HAT/consumer) in the next, then immediately do some low-level TLB
management (HAT/MMU).
  • The Decision Process
It could be that the existing code is simply too far out of date with
respect to any combination of these four roles.  In that case the
issue of re-usability would be a no-brainer: just scrap the old code.
In the case of interaction with boot, a decision can be made separately
to scrap most of the boot-related code, but keep all code related to
the other HAT roles.
If there is enough value in the old code, then things are not quite
so simple, because decisions can be influenced by other factors, such
as schedule and budget and willingness to drop or defer development
of some functionality.
For example, if rapid bring-up is an absolute requirement then things
can be done for the sake of quick results, at the price of cleanup
or major revisions that will have to be done later.  Examples of possible
deferred HAT functionality are:
  1. support for Intimate Shared Memory (ISM)
  2. support for 64 bit machines
  3. support for multiprocessor machines
The following four sections will present an evaluation of the
suitability of the Solaris 2.6 code.  Each role will be evaluated
in terms of a quick go/no-go test, then in terms of time and
optional features.

HAT/MMU – Manager of hardware MMU resources

64-bit

Solaris/PPC 2.6 has no support for 64-bit models: not for a 64-bit kernel and not for 64-bit applications. That is bad news. But it does not necessarily mean that the Solaris/PPC 2.6 HAT is unsuitable for reuse. Even if we wrote a new HAT from scratch, it would still be more work to support both 32-bit and 64-bit kernels and applications. Since the new Solaris/PPC port will see use in embedded systems, we believe that we would not contemplate supporting only a 64-bit kernel, as is done on Sparc. Even if we did that, in order to eliminate one of the four combinations, it would not save as much as 1/4 of the effort.

What this means is that it boils down to 2 questions:

1) Is the 64-bit MMU hardware so fundamentally different that the 32-bit code cannot (or should not) be reused?

2) Was the Solaris/PPC 2.6 HAT code designed in a way that makes it unnecessarily difficult to support both 32-bit and 64-bit hardware?

You might think that the MMU hardware would be fundamentally different between 32-bit and 64-bit, and necessarily so. The major rewrite of Solaris/x86 HAT code was triggered by the port to AMD64. But, that was only the proximate cause.

Intel's x86 hardware was designed much earlier than PowerPC. Intel did not start out with a road-map for 64-bit kernel or userland. Solaris/x86 was not designed with a 64-bit future in mind. But, the PowerPC was designed from the beginning to be a 64-bit architecture with a 32-bit subset. That applies to the MMU design as well as ISA (Instruction Set Architecture).

The PowerPC has hashed page tables, unlike the x86, which has forward-mapped page tables. Also, there is one global page table, no per-process or per-group or separate kernel vs userland page tables. That is not a decision made by a kernel developer; it is pretty much dictated by the PowerPC MMU design, and we do not want to fight the hardware. 64-bit addresses have segment IDs that are 32 bits longer, but the role of the lower order bits in hashing and indexing into the page table is the same for 32-bit and 64-bit. The designers of Solaris/PPC knew this at the time and kept it in mind. Although it has not been put to the test, the code appears to be sufficiently 64-bit clean.

Conclusion: going to 64-bit HAT is nowhere near as traumatic as it was for x86 and AMD64.

Other aspects of 64-bit Solaris/PPC are outside the scope of this document. They would include design of a 64-bit ABI, link editor, etc. and getting consensus from all stake-holders. Historically, arriving at a consensus has been known to consume a great deal of time. But, reusing Solaris/PPC 2.6 HAT code does not add to this problem.

Location and size of page tables

There is a possibility of running into problems with a larger page table on a 64-bit system with more physical memory. This is because the pagetable is contiguous physical memory and a larger page table might conflict with something else that needs to be in lower memory or upper memory. But, I don't think this is too likely. In any case, this problem is not made worse by reusing Solaris/PPC 2.6 code.

MMU related traps

Older models of PowerPC generated traps for every TLB miss. Newer models can reload the TLB from the page tables without any traps; the only page fault is a major page fault. This is a welcome change. We could leave the trap handler in the code for the sake of older models, or we could purge it if we know we will never encounter hardware that generates TLB-miss traps. Better to have and not need than to need and not have. However, finding machines to test this case could be difficult.

Endianness

PowerPC can operate in either big-endian or little-endian mode. Solaris/PPC runs in little-endian mode. This decision was made primarily because of customer requirements at the time (1993-1995), not for any purely technical reason. Without those commercial requirements, there would be a slight advantage to running big-endian. For one thing, PowerPC page tables are big-endian, independent of the overall endian mode of the machine. The designers of Solaris/PPC knew at the time that this might be controversial and subject to change, and they coded accordingly. The HAT layer used accessor functions for all read and modify operations on the page tables. Other parts of Solaris have some endianness dependencies, but I believe they are not a huge problem. In fact, we have a running big-endian version of Solaris/PPC that was done as a feasibility project. So, we might be able to use that as a starting point.
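To illustrate the accessor approach, here is a minimal sketch in C. The names pte_word_load and pte_word_store are hypothetical and are not taken from the Solaris/PPC source. The sketch assumes 32-bit PTE words that are stored big-endian in memory, and it gives the same result whichever endian mode the CPU runs in.

```c
#include <stdint.h>

/*
 * Hypothetical accessors (names are assumptions for illustration).
 * PTE words are kept big-endian in memory, so the value is assembled
 * and split byte by byte instead of dereferencing a uint32_t pointer.
 */
static inline uint32_t
pte_word_load(const volatile uint8_t *p)
{
	return (((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3]);
}

static inline void
pte_word_store(volatile uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}
```

The byte-wise store is not atomic, so a real HAT would more likely use a word-sized store with byte reversal where needed. The point of the sketch is only that callers never see the in-memory byte order.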

Cache Implementation


The PowerPC architecture leaves the details of cache implementation pretty wide open so that each model can be free to implement caching in its own way. A model of PowerPC is allowed to implement no cache at all. It is possible that Solaris/PPC 2.6 code, which supported a small number of early PowerPC models, would have to be modified to handle a wider range of cache behaviors in order to support newer models. Besides cache geometry (number of levels, size of each level, line sizes), which ought to be parameterized, other implementation details are possible, which may require more significant code changes to support. For example, on some models, a cache might be virtually indexed, as is the case for some Sparc models. In that case, code to handle page coloring would need to be added for performance. At least in the case of the MPC-7450, this will not be necessary.

Multiprocessor


Solaris/PPC has been written with multiprocessor machines in mind. There was a working version running in the lab, but it was not integrated into Solaris 2.7, because Solaris/PPC had been canceled by then. Multiprocessor machines are much more common today, and so expectations are higher. A PPC port of Solaris would still require much effort to verify multiprocessor support, mostly in testing.


HAT/provider – Provider of kernel services

The HAT layer is inherently processor and platform dependent. For that reason, Solaris HAT interfaces are pretty well-defined, much more so than some other parts of the kernel which have not had to be ported several times over Solaris's life. Therefore, of all parts of the kernel, the HAT layer is among the least likely to suffer from illegitimate interfaces, such as unintended dependencies, spooky action at a distance, lack of decomposability, etc.

The entire legitimate HAT interface is defined in usr/src/uts/common/vm/hat.h. This basic source code structure has not changed. It was a good idea then, and it is a good idea now.

How has the HAT interface changed since Solaris 2.6? This can be answered by examining a difference listing of usr/src/uts/common/vm/hat.h between Solaris 2.6 and the current version of Solaris.

The two source code gates to be compared are:

/ws/on297-gate
/ws/onnv-gate

The following are new functions:

hat_dump
hat_thread_exit
hat_unload_callback
hat_register_callback
hat_add_callback
hat_delete_callback
hat_getkpfnum_badcall
hat_reserve
hat_page_demote
/* Kernel Physical Mapping (segkpm) hat interface routines. */
hat_kpm_mapin
hat_kpm_mapout
hat_kpm_page2va
hat_kpm_vaddr2page
hat_kpm_fault
hat_kpm_mseghash_clear
hat_kpm_mseghash_update
hat_kpm_addmem_mseg_update
hat_kpm_addmem_mseg_insert
hat_kpm_addmem_memsegs_update
hat_kpm_mseg_reuse
hat_kpm_delmem_mseg_update
hat_kpm_split_mseg_update
hat_kpm_walk
va_to_pfn
va_to_pa

The following functions have been removed:

hat_pageflip

The following functions have a change in function signature:

hat_share
hat_unshare

hat_dump() is small and is entirely processor-independent code.

hat_thread_exit() is small but the underlying function that implements it, hat_switch(), does processor-specific and MMU-specific things to switch from one thread to another. Not a big problem.

The hat callback family of functions is currently implemented on Sparc only. We can just supply pacifiers to comply with the new interface.

hat_getkpfnum() is deprecated. There are a few places left that still call hat_getkpfnum(). Those have all been changed to hat_getkpfnum_badcall() so that hat_getkpfnum() can be eliminated from the HAT interface. That way, nobody is tempted to write new code that uses hat_getkpfnum(). hat_getkpfnum_badcall() is just the implementation of what used to be hat_getkpfnum(). This is an easy change.

hat_reserve() does nothing.

hat_page_demote() is a significant amount of work. Much of it is processor-independent, because it has to do with the way Solaris allocates and deallocates Hardware Mapping Entries (HMEs). However, this can be deferred, because it is only used for mappings of large page sizes. We don't have to exploit large page sizes in userland in the first cut.

Kernel Physical Mapping (segkpm) hat interface routines, hat_kpm_*(), are processor-specific, but are trivial. Many would be no-ops on PowerPC.

va_to_pfn() is used only at boot time, while the boot loader is in charge of the MMU. It is illegal to use it after that. Whoever writes the boot stuff can do what he wants. We may need to coordinate on this item.

va_to_pa() is trivial, and is the same for all processors. It is just va_to_pfn() with the page offset of the given virtual address blended back in to give the corresponding physical address.
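The "blending" step can be sketched as follows. This is an illustration only: the sk_ names, the sk_pfn_t type, and the 4K page size are assumptions, and the function takes the page frame number as an argument rather than calling any real lookup routine.

```c
#include <stdint.h>

#define	SK_PAGESHIFT	12			/* assumed 4K base page */
#define	SK_PAGEOFFSET	((1u << SK_PAGESHIFT) - 1)

typedef uint64_t sk_pfn_t;	/* hypothetical page frame number type */

/*
 * Combine a page frame number with the page offset of the virtual
 * address to form the physical address.
 */
static uint64_t
sk_pa_from_pfn(sk_pfn_t pfn, uintptr_t va)
{
	return ((pfn << SK_PAGESHIFT) | (uint64_t)(va & SK_PAGEOFFSET));
}
```

For example, page frame 0x1234 and virtual address 0x00400abc give physical address 0x1234abc.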

The removal of hat_pageflip() is not a problem. The PowerPC implementation just returned a status indicating that this feature was not supported.

The change to hat_share() and hat_unshare() involves adding an argument, a page size code, to indicate the desired page size for shared mappings. This can be made simple by restricting the variety of page sizes we will deal with. For starters, we don't even have to support Intimate Shared Memory (ISM) at all.

Flags


A few flags have been added for some functions:

HAT_RELOAD_SHARE HAT_NO_KALLOC HAT_LOAD_AUTOLPG HAT_INIT

All these are either trivial or can be deferred.

Intimate Shared Memory (ISM)


ISM is strictly a performance feature. It does not involve any change to the HAT interface. ISM is a term used to refer to the cases when multiple processes can share not only mappings to the same physical memory, but also MMU resources used for those mappings. For example, in the case of x86, with forward-mapped page tables, entire pages of Page Table Entries (PTEs) can be shared, provided that the virtual addresses and size just happen to be suitable for sharing pages of PTEs. Let's use the term "PTE-page-span" to describe the size mapped by an entire page of PTEs. It is not required that all the mappings to the same physical memory have the same virtual address. But, the virtual addresses must all be aligned on a PTE-page-span boundary, and their sizes must be a multiple of the PTE-page-span. Any mappings that are less strict about VA alignment and size cannot share page tables without violating Unix memory mapping semantics and/or security principles. If VA alignment and size are even more strict, then 2nd-level and even higher level pages of directory entries could be shared. Something very similar has been done on Itanium and MIPS hardware, except that those machines have linear page tables, rather than forward-mapped.
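The alignment rule for sharing pages of PTEs can be written down as a small check. This is a sketch under stated assumptions: the function names are hypothetical, and the example numbers (4K pages, 1024 PTEs per page, giving a 4 MB PTE-page-span) correspond to the classic 32-bit x86 layout rather than to anything in the Solaris source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes mapped by one full page of PTEs (the "PTE-page-span"). */
static uint64_t
pte_page_span(uint64_t pagesize, uint64_t ptes_per_page)
{
	return (pagesize * ptes_per_page);
}

/*
 * A mapping can share whole pages of PTEs only if its virtual address
 * is aligned on a PTE-page-span boundary and its length is a nonzero
 * multiple of the span. The span must be a power of two.
 */
static bool
can_share_pte_pages(uint64_t va, uint64_t len, uint64_t span)
{
	uint64_t mask = span - 1;

	return (len != 0 && (va & mask) == 0 && (len & mask) == 0);
}
```

With a 4 MB span, an 8 MB mapping at 0x80000000 qualifies, while the same mapping shifted by one page does not.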

Solaris/PPC 2.6 does not implement ISM. But, the absence of ISM support is clean. That is, the fact that ISM was not supported in Solaris/PPC 2.6 does not affect any decision to reuse the existing code. The same work would have to be done whether adding functionality to the old code or writing all new code. Whether any functionality we add is easy or difficult, it is pure and simple addition of functionality. Nothing about the Solaris/PPC 2.6 HAT design involved work that would have to be undone or commitments to a way of doing things that we might regret.

PowerPC MMU does not have any such thing as pages of PTEs. The only possible way to support any sharing of MMU resources on PowerPC is to use Block Address Translation (BAT) registers. BAT registers are the only mechanism for mapping regions of memory with a page size larger than 4K. There are only a handful of BAT registers. ISM implemented this way would have more strict alignment requirements, because a single BAT entry with a large page size would require:

1) that all mappings be naturally aligned with respect to page size;

2) that the requested size be exactly 1 page size;

3) that the underlying physical memory be contiguous, at naturally aligned physical addresses.

An unlimited number of processes could share the same memory, but at any time, only a very small number of these mappings can be supported. On an embedded system, there might be an application for which this support is just perfect. Even a single very large mapping shared by 2 processes could be a big win for the right kind of application. It could save a great deal of pressure on the page table. Let's see … large mappings can save 256 PTEs per megabyte. A 1 GByte mapping for shared data could save 1/4 megaPTEs. In order to do this, there would have to be some mechanism for preventing physical memory from getting fragmented beyond redemption before we even get to the first userland process. There is no interface to do this. There would probably have to be something in /etc/system to tell the kernel to reserve physical memory early on.
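The arithmetic behind those figures is simple enough to check in code. The sketch below assumes a 4K base page; the function name is hypothetical.

```c
#include <stdint.h>

#define	SK_BASE_PAGESIZE	4096u	/* assumed 4K base page */

/* Number of base-page PTEs that one large mapping of this size replaces. */
static uint64_t
ptes_saved(uint64_t mapping_bytes)
{
	return (mapping_bytes / SK_BASE_PAGESIZE);
}
```

One megabyte divided by 4K gives 256 PTEs, and one gigabyte gives 262144 PTEs, which is one quarter of 2^20.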


HAT/consumer – Consumer of kernel services

The HAT layer, proper, is pretty low down in the dependency tree of all kernel services. This is especially true of the pure TLB management functions. We would be in trouble if the data types and functions provided by the kernel changed significantly in the last decade. But it looks like we are in pretty good shape.

Data types


The HAT does interact with some other kernel data structures.

  1. HAT uses page_t's, which describe pages of physical memory.

The machine-dependent page_t, machpage_t, is a pure extension of the page_t data type; the page_t structure is not modified in any other way. No part of Solaris uses the machpage_t extensions.

Functions


The HAT layer needs locking primitives and some atomic operations. Function calls are used and the data types used with these functions are either opaque objects or primitive data types. So, the HAT does depend on functions such as: mutex_*(), cv_*(), atomic_*(), cas(). The good news is that these functions are pretty low level and their interfaces are stable.

XXX Better separation of pure TLB management functions.

XXX Move to separate library

XXX It may be a good idea to change use of cv_*() functions


HAT/boot – Partner during boot

Solaris boot has changed considerably since 2.5.1 and 2.6. Almost all boot-related HAT code will have to be thrown out, no matter what. It is almost a complete write-off. Certainly, all the code related to boot-time device support is useless. Some snippets related to VOF, such as getting properties, can be used as a design suggestion.

XXX How much has VOF changed? Not much, we hope.

The good news is that modern boot makes many things easier. The basic problem of handing off allocated memory and mappings from boot to the kernel HAT is not much different, so some small pieces can be reused.

Another bit of good news is that some things that are done for good hygiene can be deferred. For example, we can just waste some memory owned by boot, bypassing the tricky hand-off code for those pages of memory. This is a good trade, for the sake of rapid bring-up.

Whether we reuse Solaris/PPC 2.6 code or not, I strongly recommend that we invest a great deal in enforcing the contract between boot and the HAT layer, much more than has been done for Sparc and x86, even more than was done for Solaris/IA64, which invested heavily in this. This kind of investment is one that is tempting to short-stroke in the interests of quick startup, but it pays big-time, unless all the developers are perfect in every way, or extremely lucky. In fact, I recommend that we deliberately change the contract, a few times during development, just to keep us safe from inadvertent dependency creep. For example, page table size and location can be changed, within reason; allocation of BAT registers can be changed, for no particular reason.

There is processor-dependent code to handle userland process address space allocations. It is not really part of the HAT, proper, but the developer who writes and maintains the HAT usually maintains this bit part, as well. In addition to changing HAT/boot contract, I recommend changing some aspects of VM layout, such as text start address. It is not that we cannot decide on a value and stick with it. It is a bit of a jolt to the system, just to keep things on track. Better to do it early, rather than later.

XXX More on boot/HAT contract, later.

Data types


XXX memseg structures changed?

Functions


Flow of control


XXX flow of control from startup() … hat_kern_setup()


The following is a quick overview of HAT features and an assessment of:

  1. how easy it is to implement;
  2. whether it can be deferred (from a purely technical perspective, not whether it is considered to be a critical requirement);
  3. how much demand there is, particularly for embedded systems;
  4. how urgent is the requirement;
  5. how much of an embarrassment would it be if this feature were not implemented, considering things like how much expectations have changed in the last decade, what has Linux already done, etc.
  6. how much more testing has to be done, let alone implementation cost. Some things could be coded right away, but have serious testing implications. For example, do we want to retain support for older models? That is a coding no-op, but a huge increase in testing complexity, hardware procurement, and so on.

                                                     critical   testing
Feature          easy?   defer?   demand   urgency   factor     burden
---------------  ------  -------  -------  --------  ---------  --------
Multiprocessor   ???     yes      high     soon      high       high
64-bit           maybe   yes      ???      low       low        high
endian change    yes     yes      ???      high      low        low
ISM              no      yes      low      low       low        medium


XXX UPOD schedule vs quick&dirty schedule

XXX UPOD := Under-Promise Over-Deliver


OPINION on HAT DATA Structures

The hat data structure should be an opaque data type, preferably void. That is, nothing outside the HAT should refer to "struct hat", but to hat_t. So, hat_t * is void *, as far as everyone is concerned, except the HAT implementation.

If we wrote the kernel in C++, we could make hat_t a class with private members.

We should be able to change all usage of "struct hat" to hat_t.

If that is not acceptable, then we ought to at least redefine the struct hat so that it has one member which is a simple data type and has an unlikely name, like _noneofyourbusiness. Failing that, we ought to be able to redefine struct hat so that all the members are the same order, data type, offset, and size, but the member names have been changed, for example by prefixing each member name with _.

vm/hat.h could have some preprocessor code like so:

    #if defined(HATIMPLEMENTATION)
    struct hat {
        sometype member1;
        …
    };
    #else
    struct hat {
        sometype __member1;
        …
    };
    #endif

END of OPINION
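For readers unfamiliar with the idiom, here is one way an opaque handle can be expressed in C. Note that this sketch uses an incomplete struct type rather than the void * suggested above; it hides the members in the same way but keeps compile-time type checking. All sk_ names and the refcnt member are hypothetical and are not taken from the Solaris source.

```c
#include <stdlib.h>

/* --- What a public header would expose (hypothetical names) --- */
typedef struct sk_hat sk_hat_t;		/* incomplete type: no members visible */

sk_hat_t *sk_hat_alloc(void);
void sk_hat_free(sk_hat_t *);
unsigned sk_hat_refcnt(const sk_hat_t *);

/* --- What only the HAT implementation file would contain --- */
struct sk_hat {
	unsigned refcnt;	/* illustrative member only */
};

sk_hat_t *
sk_hat_alloc(void)
{
	sk_hat_t *h = calloc(1, sizeof (*h));

	if (h != NULL)
		h->refcnt = 1;
	return (h);
}

void
sk_hat_free(sk_hat_t *h)
{
	free(h);
}

unsigned
sk_hat_refcnt(const sk_hat_t *h)
{
	return (h->refcnt);
}
```

Code that sees only the first part can pass sk_hat_t pointers around but cannot touch the members, which is the property the opinion above is asking for.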


HAT/boot Food Taster and HAT Debugging Tools

Most of this document covers changes to Solaris/PPC 2.6 code that are imposed by external factors: changing hardware, evolution of Solaris, changes to boot. But, there are a few changes recommended here, simply because they are an important improvement in HAT construction technology. By a very wide margin, the top two are:

  1. Extensive HAT/boot food taster
  2. HAT Debugging toolkit

These additions do not come free of cost, so they need to be mentioned here. However, they have a very good chance of leading to a net reduction in system bringup time, and they contribute to more reliable time budgets, because big surprises are reduced.


HAT/boot Food Taster

Things don't go well if the HAT consumes anything toxic. Things can go especially badly early on and in mysterious ways if the HAT inherits MMU state from boot which is not compatible with the state for which it is designed. I recommend spending some time up front in writing a significant amount of code which tests the MMU state and other conditions, as they are when HAT first takes control. If there is anything that is not in order, the HAT/boot food taster should not be sparing in its effort to explain clearly what is expected and what it got, and then die a quick and merciful death. Perhaps in the process it can deliberately trigger the debugger (either hardware debugger or kmdb), if present, in a special way. The alternative to fail-fast semantics is system delusion, so fail fast and fail noisily.

This sort of thing was done on Solaris/IA64, and has been proven to save a great deal of time when integrating work done by several developers each working on different pieces, and possibly misunderstanding the HAT/boot contract.

Even if it were the case that large pieces of HAT/boot code could be reused, the amount of work needed to add a HAT/boot food taster is pretty much the same for a HAT code update or for a brand new HAT layer.
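A food taster might look roughly like the following sketch. Everything here is an assumption made for illustration: the structure, its fields, and the three checks are hypothetical and do not describe the actual boot/HAT contract. The shape is the point: each check states what was expected and what was found, and the caller stops the system if anything is wrong.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of what boot hands over; fields are assumptions. */
struct sk_boot_mmu_state {
	uint64_t ptab_base;	/* physical address of the page table */
	uint64_t ptab_size;	/* size of the page table in bytes */
	int translation_on;	/* nonzero if address translation is enabled */
};

static int
sk_is_pow2(uint64_t x)
{
	return (x != 0 && (x & (x - 1)) == 0);
}

/*
 * Check each expectation, report expected vs. actual, and return the
 * number of violations. A real food taster would panic or enter the
 * debugger on a nonzero result instead of returning.
 */
static int
sk_food_taster(const struct sk_boot_mmu_state *s)
{
	int bad = 0;

	if (!sk_is_pow2(s->ptab_size)) {
		fprintf(stderr, "taster: page table size: expected a power "
		    "of two, got 0x%llx\n", (unsigned long long)s->ptab_size);
		bad++;
	} else if ((s->ptab_base & (s->ptab_size - 1)) != 0) {
		fprintf(stderr, "taster: page table base: expected alignment "
		    "0x%llx, got base 0x%llx\n",
		    (unsigned long long)s->ptab_size,
		    (unsigned long long)s->ptab_base);
		bad++;
	}
	if (!s->translation_on) {
		fprintf(stderr, "taster: translation: expected on, got off\n");
		bad++;
	}
	return (bad);
}
```

Returning a count rather than stopping at the first failure lets a single run report every broken expectation before the system dies.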


HAT Debugging Toolkit

It is common to sprinkle some ASSERTs in the code. Also, most HAT developers have a stash of HAT debugging aids, such as a handy-dandy pagetable hashing pocket calculator or pagetable walker/navigator. But, I believe it is a good idea to include a set of kernel functions and userland tools as a first class citizen in the software release. Solaris/IA64 had an extensive set of HAT debug helper functions, as well as userland tools to do HAT-specific monitoring of correctness and performance. Our new Solaris/PPC port can do even more because there are more hardware and software tools available, such as hardware debugger (when available) and DTrace.

The amount of work needed to add a HAT debugging toolkit is pretty much the same for a HAT code update or for a brand new HAT layer.

Some of the components of the HAT debugging toolkit are:

  1. Fault injection
  2. Pagetable and HME verifier
  3. Pagetable and HME statistics
  4. Pathological workloads (small scale)

Fault Injection


One example of the use of fault injection for the HAT layer is to arrange for races to be lost often. The function to atomically replace a PTE can be implemented with a version that deliberately causes it to fail with a specified probability, but being careful to limit the number of consecutive failures from the same caller so that we don't block forward progress of the system.
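A minimal single-threaded sketch of that idea follows. The names, the per-caller state structure, and the cap of three consecutive injected failures are all assumptions for illustration; a kernel version would wrap a real atomic compare-and-swap rather than the plain load and store used here.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define	SK_MAX_CONSEC_FAIL	3	/* assumed cap on back-to-back injected failures */

/* Per-caller injection state (hypothetical). */
struct sk_inject {
	unsigned fail_pct;	/* 0..100: chance of an injected failure */
	unsigned consec;	/* injected failures in a row so far */
};

/*
 * Replace *pte with newval if it still holds oldval, but sometimes
 * pretend the race was lost. After SK_MAX_CONSEC_FAIL injected
 * failures in a row the real operation is always attempted, so the
 * caller's retry loop keeps making forward progress.
 */
static bool
sk_pte_cas(uint32_t *pte, uint32_t oldval, uint32_t newval,
    struct sk_inject *inj)
{
	if (inj->consec < SK_MAX_CONSEC_FAIL &&
	    (unsigned)(rand() % 100) < inj->fail_pct) {
		inj->consec++;
		return (false);		/* injected: report a lost race */
	}
	inj->consec = 0;
	if (*pte != oldval)
		return (false);		/* genuine mismatch */
	*pte = newval;
	return (true);
}
```

With fail_pct set to 100, a caller sees exactly three failures and then a success, which exercises the retry path on every update without stalling the system.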

Pagetable and HME verifier


A pagetable and HME verifier is to the HAT data structures what fsck is to UFS filesystem metadata. On a running system things change too quickly to check consistency of the entire system, but at the time of a kernel panic, it can be done without disturbing the existing mappings. Also, most of the consistency checking is decomposable. That is, much can be determined about the internal consistency of a subset of page tables, and it can be tested quickly and nondestructively.

Pagetable and HME statistics


Before DTrace, it would have been very difficult to generate statistics about things like hash collisions without writing extra helper functions and rolling your own methods for enabling and disabling probing; and it was even more difficult to get statistics out of the kernel so that some userland monitoring / visualization program can slice and dice and present the data. DTrace makes this sort of thing a great deal easier. However, I believe the HAT may still have some hooks in it with DTrace and userland reporting programs in mind.

Pathological workloads


In order to exercise the logic for handling rare cases, such as clustering of hash collisions leading to full PTE groups, small workloads can be constructed that generate very unfortunate reference patterns.
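As a sketch of how such a reference pattern might be generated, the code below computes virtual addresses that all fall into the same PTE group. It assumes the 32-bit PowerPC primary hash (low 19 bits of the VSID XORed with the 16-bit page index), 4K pages, and a table whose number of PTE groups is a power of two; the function names are hypothetical, and the details should be checked against the processor manual for the target model.

```c
#include <stdint.h>

#define	SK_PAGESHIFT	12u	/* assumed 4K pages */

/*
 * Primary hash, as assumed here for 32-bit PowerPC: low 19 bits of
 * the VSID XORed with the 16-bit page index of the effective address,
 * reduced to the number of PTE groups (npteg, a power of two).
 */
static uint32_t
sk_pteg_index(uint32_t vsid, uint32_t ea, uint32_t npteg)
{
	uint32_t page_index = (ea >> SK_PAGESHIFT) & 0xFFFFu;

	return (((vsid & 0x7FFFFu) ^ page_index) & (npteg - 1));
}

/*
 * Fill out[] with n addresses that land in the same PTE group as
 * base. Stepping by npteg pages changes only page-index bits above
 * the table's index width. The caller must keep all addresses within
 * one segment so that the VSID stays the same.
 */
static void
sk_colliding_addrs(uint32_t base, uint32_t npteg, uint32_t *out, unsigned n)
{
	for (unsigned i = 0; i < n; i++)
		out[i] = base + i * (npteg << SK_PAGESHIFT);
}
```

With 1024 PTE groups the stride is 4 MB. Since a PTE group holds eight entries, touching nine or more of these pages in one segment is enough to overflow the primary group.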


There are several ideas for new functionality and performance enhancements, but they are not as important and certainly not as urgent as debugging aids. So, they will be collected in a document yet to come.