Virtual Memory – HAT (Hardware Address Translation) Layer

Status 06/05/2007

With the 2nd release of source, the VM and the HAT layers are functional. The kernel is in control of the MMU even though we haven't yet executed a bop_quisce. We are still relying on the PROM interface for the console, print and network connection. However, there is a comfort level with the functionality of kernel memory management. If you have reviewed the Openfirmware task you will have noticed the issues related to the ODW firmware not running in virtual memory mode. Overall this masked a number of items, such that when we got VOF up and running we actually regressed in this area. However, that is behind us. More details to follow along with the 2nd source release.

Initial Review of the HAT Layer – 1/1/06

Way back when, before even writing a line of code, Guy did an assessment of the original 2.6 code to see what was usable in a 2.11 port project. Below are his notes from that review.

Virtual Memory: HAT – Hardware Address Translation Layer – Guy Shaw

The Virtual Memory system can be considered the core of a Solaris system, and the implementation of Solaris virtual memory affects just about every other subsystem in the operating system. Rather than managing every byte of memory, Solaris uses page-size pieces of memory to minimize the amount of work the virtual memory system has to do to maintain virtual-to-physical memory mappings. Figure 4.1 shows how the management and translation of the virtual view of memory (the address space) to physical memory is performed by hardware known as the virtual memory management unit (MMU).

----------------------------------------

Guy has evaluated HAT interface changes based on a difference listing of 2.6 vs 2.10: usr/src/uts/common/vm/hat.h in /ws/on998-gate vs /ws/onnv-gate. The new functions are listed in the study below.
### Study of the Feasibility of Reusing Solaris PPC 2.6 HAT Code
HAT/MMU – Manager of hardware MMU resources
Solaris/PPC 2.6 has no support for 64-bit models: not for a 64-bit kernel and not for 64-bit applications. That is bad news. But it does not necessarily mean that the Solaris/PPC 2.6 HAT is unsuitable for reuse. If we wrote a new HAT from scratch, it would still be more work to support both 32-bit and 64-bit kernels and applications. Since the new Solaris/PPC port may see use in embedded systems, we believe that we would not contemplate supporting only a 64-bit kernel, as is done on Sparc. Even if we did that, in order to eliminate one of the four combinations, it would not save as much as 1/4 of the effort.

What this means is that it boils down to 2 questions:

1) Is the 64-bit MMU hardware so fundamentally different that the 32-bit code cannot (or should not) be reused?
2) Was the Solaris/PPC 2.6 HAT code designed in a way that makes it unnecessarily difficult to support both 32-bit and 64-bit hardware?

You might think that the MMU hardware would be fundamentally different between 32-bit and 64-bit, and necessarily so. The major rewrite of Solaris/x86 HAT code was triggered by the port to AMD64. But that was only the proximate cause. Intel's x86 hardware was designed much earlier than PowerPC. Intel did not start out with a road-map for a 64-bit kernel or userland, and Solaris/x86 was not designed with a 64-bit future in mind. The PowerPC, on the other hand, was designed from the beginning to be a 64-bit architecture with a 32-bit subset. That applies to the MMU design as well as the ISA (Instruction Set Architecture).

The PowerPC has hashed page tables, unlike the x86, which has forward-mapped page tables. Also, there is one global page table: no per-process, per-group, or separate kernel vs userland page tables. That is not a decision made by a kernel developer; it is pretty much dictated by the PowerPC MMU design, and we do not want to fight the hardware.
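To make the idea of a single hashed page table concrete, here is a minimal sketch of how a classic 32-bit PowerPC MMU picks a PTE group (PTEG) for a virtual address. It is not taken from the Solaris/PPC 2.6 source. The function and macro names are hypothetical, and the field widths (19-bit hash, 16-bit page index, 64-byte PTEGs, minimum table of 1024 PTEGs) are stated here as assumptions based on the 32-bit PowerPC architecture definition.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants for the classic 32-bit PowerPC MMU. */
#define PPC32_PAGE_SHIFT 12u        /* 4K pages */
#define PPC32_PTEG_SIZE  64u        /* 8 PTEs of 8 bytes each */
#define PPC32_HASH_MASK  0x7FFFFu   /* hash values are 19 bits wide */

/* Primary hash: low 19 bits of the VSID XOR the 16-bit page index. */
uint32_t
ppc32_hash_primary(uint32_t vsid, uint32_t ea)
{
	uint32_t page_index = (ea >> PPC32_PAGE_SHIFT) & 0xFFFFu;

	return ((vsid & PPC32_HASH_MASK) ^ page_index);
}

/* Secondary hash: one's complement of the primary hash. */
uint32_t
ppc32_hash_secondary(uint32_t primary)
{
	return (~primary & PPC32_HASH_MASK);
}

/*
 * Byte offset of the selected PTEG from the start of the page table.
 * n_ptegs is the table size in PTEGs: a power of two, at least 1024.
 */
uint32_t
ppc32_pteg_offset(uint32_t hash, uint32_t n_ptegs)
{
	assert(n_ptegs >= 1024u && (n_ptegs & (n_ptegs - 1u)) == 0);

	return ((hash & (n_ptegs - 1u)) * PPC32_PTEG_SIZE);
}
```

A lookup tries the PTEG chosen by the primary hash first and the PTEG chosen by the secondary hash second. Only the low-order bits of the hash select the PTEG, which is the property the following discussion of 64-bit addresses relies on.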
64-bit addresses have segment IDs that are 32 bits longer, but the role of the lower order bits in hashing and indexing into the page table is the same for 32-bit and 64-bit. The designers of Solaris/PPC knew this at the time and kept it in mind. Although it has not been put to the test, the code appears to be sufficiently 64-bit clean.

Conclusion: going to a 64-bit HAT is nowhere near as traumatic as it was for x86 and AMD64.

Other aspects of 64-bit Solaris/PPC are outside the scope of this document. They would include design of a 64-bit ABI, link editor, etc. and getting consensus from all stake-holders. Historically, arriving at a consensus has been known to consume a great deal of time. But reusing Solaris/PPC 2.6 HAT code does not add to this problem.

Location and size of page tables

There is a possibility of running into problems with a larger page table on a 64-bit system with more physical memory. This is because the page table is contiguous physical memory, and a larger page table might conflict with something else that needs to be in lower memory or upper memory. But I don't think this is too likely. In any case, this problem is not made worse by reusing Solaris/PPC 2.6 code.

MMU related traps

Older models of PowerPC generated traps for every TLB miss. Newer models can reload the TLB from the page tables without any traps; the only page fault is a major page fault. This is a welcome change. We could leave the trap handler in the code for the sake of older models, or we could purge it if we know we will never encounter hardware that generates TLB-miss traps. Better to have and not need than to need and not have. However, finding machines to test this case could be difficult.

Endianness

PowerPC can operate in either big-endian or little-endian mode. Solaris/PPC runs in little-endian mode. This decision was made primarily because of customer requirements at the time (1993-1995), not for any purely technical reason.
Without those commercial requirements, there would be a slight advantage to running big-endian. For one thing, PowerPC page tables are big-endian, independent of the overall endian mode of the machine. The designers of Solaris/PPC knew at the time that this might be controversial and subject to change, and they coded accordingly. The HAT layer used accessor functions for all read and modify operations on the page tables. Other parts of Solaris have some endianness dependencies, but I believe they are not a huge problem. In fact, we have a running big-endian version of Solaris/PPC that was done as a feasibility project. So, we might be able to use that as a starting point.

Cache Implementation

The PowerPC architecture leaves the details of cache implementation pretty wide open so that each model can be free to implement caching in its own way. A model of PowerPC is allowed to implement no cache at all. It is possible that Solaris/PPC 2.6 code, which supported a small number of early PowerPC models, would have to be modified to handle a wider range of cache behaviors in order to support newer models. Besides cache geometry (number of levels, size of each level, line sizes), which ought to be parameterized, other implementation details are possible, which may require more significant code changes to support. For example, on some models, a cache might be virtually indexed, as is the case for some Sparc models. In that case, code to handle page coloring would need to be added for performance. At least in the case of the MPC-7450, this will not be necessary.

Multiprocessor

Solaris/PPC has been written with multiprocessor machines in mind. There was a working version running in the lab, but it was not integrated into Solaris 2.7, because Solaris/PPC was canceled by then. Multiprocessor machines are much more common today, and so expectations are higher. A PPC port of Solaris would still require much effort to verify multiprocessor support, mostly in testing.
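Returning to the endianness point above: the value of routing every page table access through accessor functions is that the rest of the HAT never needs to know the endian mode of the machine. The following is a minimal sketch of that idea, not the Solaris/PPC 2.6 accessors themselves. The function names are hypothetical, and a real kernel would more likely use a single byte-reversed load or store so that the access is one atomic word operation.

```c
#include <stdint.h>

/*
 * Read one 32-bit page table word that is stored big-endian,
 * regardless of the endian mode the CPU is running in.
 * Byte-wise access is used for clarity; it is NOT atomic.
 */
uint32_t
pte_word_load(const volatile uint8_t *p)
{
	return (((uint32_t)p[0] << 24) |
	    ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) |
	    (uint32_t)p[3]);
}

/* Store one 32-bit page table word in big-endian byte order. */
void
pte_word_store(volatile uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}
```

With accessors like these in place, switching the kernel between little-endian and big-endian mode changes only the accessor implementation, not its callers.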
HAT/provider – Provider of kernel services

The HAT layer is inherently processor and platform dependent. For that reason, Solaris HAT interfaces are pretty well-defined, much more so than some other parts of the kernel which have not had to be ported several times over Solaris's life. Therefore, of all parts of the kernel, the HAT layer is among the least likely to suffer from illegitimate interfaces, such as unintended dependencies, spooky action at a distance, lack of decomposability, etc.

The entire legitimate HAT interface is defined in usr/src/uts/common/vm/hat.h. This basic source code structure has not changed. It was a good idea then, and it is a good idea now.

How has the HAT interface changed since Solaris 2.6? This can be answered by examining a difference listing of usr/src/uts/common/vm/hat.h between the two releases.
The two source code gates to be compared are:

    /ws/on297-gate
    /ws/onnv-gate

The following are new functions:

    hat_dump
    hat_thread_exit
    hat_unload_callback
    hat_register_callback
    hat_add_callback
    hat_delete_callback
    hat_getkpfnum_badcall
    hat_reserve
    hat_page_demote

    /* Kernel Physical Mapping (segkpm) hat interface routines. */
    hat_kpm_mapin
    hat_kpm_mapout
    hat_kpm_page2va
    hat_kpm_vaddr2page
    hat_kpm_fault
    hat_kpm_mseghash_clear
    hat_kpm_mseghash_update
    hat_kpm_addmem_mseg_update
    hat_kpm_addmem_mseg_insert
    hat_kpm_addmem_memsegs_update
    hat_kpm_mseg_reuse
    hat_kpm_delmem_mseg_update
    hat_kpm_split_mseg_update
    hat_kpm_walk

    va_to_pfn
    va_to_pa

The following functions have been removed:

    hat_pageflip

The following functions have a change in function signature:

    hat_share
    hat_unshare

hat_dump() is small and is entirely processor-independent code.

hat_thread_exit() is small, but the underlying function that implements it, hat_switch(), does processor-specific and MMU-specific things to switch from one thread to another. Not a big problem.

The hat callback family of functions is currently implemented on Sparc only. We can just supply pacifiers to comply with the new interface.

hat_getkpfnum() is deprecated. There are a few places left that still call hat_getkpfnum(). Those have all been changed to hat_getkpfnum_badcall() so that hat_getkpfnum() can be eliminated from the HAT interface. That way, nobody is tempted to write new code that uses hat_getkpfnum(). hat_getkpfnum_badcall() is just the implementation of what used to be hat_getkpfnum(). This is an easy change.

hat_reserve() does nothing.

hat_page_demote() is a significant amount of work. Much of it is processor-independent, because it has to do with the way Solaris allocates and deallocates Hardware Mapping Entries (HMEs). However, this can be deferred, because it is only used for mappings of large page sizes. We don't have to exploit large page sizes in userland in the first cut.
The Kernel Physical Mapping (segkpm) hat interface routines, hat_kpm_*(), are processor-specific, but are trivial. Many would be no-ops on PowerPC.

va_to_pfn() is used only at boot time, while the boot loader is in charge of the MMU. It is illegal to use it after that. Whoever writes the boot stuff can do what he wants. We may need to coordinate on this item.

va_to_pa() is trivial, and is the same for all processors. It is just va_to_pfn() with the page offset of the given virtual address blended back in to give the corresponding physical address.

The removal of hat_pageflip() is not a problem. The PowerPC implementation just returned a status indicating that this feature was not supported.

The change to hat_share() and hat_unshare() involves adding an argument, a page size code, to indicate the desired page size for shared mappings. This can be made simple by restricting the variety of page sizes we will deal with. For starters, we don't even have to support Intimate Shared Memory (ISM) at all.

Flags

A few flags have been added for some functions:

    HAT_RELOAD_SHARE
    HAT_NO_KALLOC
    HAT_LOAD_AUTOLPG
    HAT_INIT

All of these are either trivial or can be deferred.

Intimate Shared Memory (ISM)

ISM is strictly a performance feature. It does not involve any change to the HAT interface. ISM is a term used to refer to the cases when multiple processes can share not only mappings to the same physical memory, but also MMU resources used for those mappings. For example, in the case of x86, with forward-mapped page tables, entire pages of Page Table Entries (PTEs) can be shared, provided that the virtual addresses and size just happen to be suitable for sharing pages of PTEs. Let's use the term "PTE-page-span" to describe the size mapped by an entire page of PTEs. It is not required that all the mappings to the same physical memory have the same virtual address. But the virtual addresses must all be aligned on a PTE-page-span boundary, and their sizes must be a multiple of the PTE-page-span.
Any mappings that are less strict about VA alignment and size cannot share page tables without violating Unix memory mapping semantics and/or security principles. If VA alignment and size are even more strict, then 2nd-level and even higher level pages of directory entries could be shared. Something very similar has been done on Itanium and MIPS hardware, except that those machines have linear page tables, rather than forward-mapped.

Solaris/PPC 2.6 does not implement ISM. But the absence of ISM support is clean. That is, the fact that ISM was not supported in Solaris/PPC 2.6 does not affect any decision to reuse the existing code. The same work would have to be done whether adding functionality to the old code or writing all new code. Whether any functionality we add is easy or difficult, it is pure and simple addition of functionality. Nothing about the Solaris/PPC 2.6 HAT design involved work that would have to be undone or commitments to a way of doing things that we might regret.

The PowerPC MMU does not have any such thing as pages of PTEs. The only possible way to support any sharing of MMU resources on PowerPC is to use Block Address Translation (BAT) registers. BAT registers are the only mechanism for mapping regions of memory with a page size larger than 4K. There are only a handful of BAT registers. ISM implemented this way would have more strict alignment requirements, because a single BAT entry with a large page size would require:

1) that all mappings be naturally aligned with respect to page size;
2) that the requested size be exactly 1 page size;
3) that the underlying physical memory be contiguous and naturally aligned.

An unlimited number of processes could share the same memory, but at any time, only a very small number of these mappings can be supported. On an embedded system, there might be an application for which this support is just perfect.
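The three BAT requirements above reduce to a short eligibility test. The sketch below is illustrative only: the function name is hypothetical, and the block size limits (powers of two from 128 KB to 256 MB) are an assumption based on classic 32-bit PowerPC implementations that the text above does not spell out. The caller is assumed to pass the start of a physically contiguous range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed BAT block size limits for classic 32-bit PowerPC. */
#define BAT_MIN_LEN ((uint64_t)128 * 1024)
#define BAT_MAX_LEN ((uint64_t)256 * 1024 * 1024)

/*
 * Return true if [va, va + len) backed by the physically contiguous
 * range starting at pa could be mapped by a single BAT entry.
 */
bool
bat_can_map(uint64_t va, uint64_t pa, uint64_t len)
{
	/* Requirement 2: the size must be exactly one legal block size. */
	if (len < BAT_MIN_LEN || len > BAT_MAX_LEN)
		return (false);
	if ((len & (len - 1)) != 0)
		return (false);

	/* Requirements 1 and 3: VA and PA naturally aligned to the size. */
	return (((va | pa) & (len - 1)) == 0);
}
```

A check like this would be the gate in front of any ISM-style request. Anything that fails it falls back to ordinary 4K page table entries.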
Even a single very large mapping shared by 2 processes could be a big win for the right kind of application. It could save a great deal of pressure on the page table. Let's see … with 4K pages, large mappings can save 256 PTEs per megabyte. A 1 GByte mapping for shared data could save 1/4 megaPTEs. In order to do this, there would have to be some mechanism for preventing physical memory from getting fragmented beyond redemption before we even get to the first userland process. There is no interface to do this. There would probably have to be something in /etc/system to tell the kernel to reserve physical memory early on.

HAT/consumer – Consumer of kernel services

The HAT layer, proper, is pretty low down in the dependency tree of all kernel services. This is especially true of the pure TLB management functions. We would be in trouble if the data types and functions provided by the kernel changed significantly in the last decade. But it looks like we are in pretty good shape.

Data types

The HAT does interact with some other kernel data structures.
The machine-dependent page_t, machpage_t, is a pure extension of the page_t data type; the page_t structure is not modified in any other way. No part of Solaris uses the machpage_t extensions.

Functions

The HAT layer needs locking primitives and some atomic operations. Function calls are used, and the data types used with these functions are either opaque objects or primitive data types. So, the HAT does depend on functions such as: mutex_*(), cv_*(), atomic_*(), cas(). The good news is that these functions are pretty low level and their interfaces are stable.

XXX Better separation of pure TLB management functions.
XXX Move to separate library
XXX It may be a good idea to change use of cv_*() functions

HAT/boot – Partner during boot

Solaris boot has changed considerably since 2.5.1 and 2.6. Almost all boot-related HAT code will have to be thrown out, no matter what. It is almost a complete write-off. Certainly, all the code related to boot-time device support is useless. Some snippets related to VOF, such as getting properties, can be used as a design suggestion.

XXX How much has VOF changed? Not much, we hope.

The good news is that modern boot makes many things easier. The basic problem of handing off allocated memory and mappings from boot to the kernel HAT is not much different, so some small pieces can be reused. Another bit of good news is that some things that are done for good hygiene can be deferred. For example, we can just waste some memory owned by boot, bypassing the tricky hand-off code for those pages of memory. This is a good trade, for the sake of rapid bring-up.

Whether we reuse Solaris/PPC 2.6 code or not, I strongly recommend that we invest a great deal in enforcing the contract between boot and the HAT layer, much more than has been done for Sparc and x86, even more than was done for Solaris/IA64, which invested heavily in this.
This kind of investment is one that is tempting to short-stroke in the interests of quick startup, but it pays big-time, unless all the developers are perfect in every way, or extremely lucky. In fact, I recommend that we deliberately change the contract a few times during development, just to keep us safe from inadvertent dependency creep. For example, page table size and location can be changed, within reason; allocation of BAT registers can be changed, for no particular reason.

There is processor-dependent code to handle userland process address space allocations. It is not really part of the HAT, proper, but the developer who writes and maintains the HAT usually maintains this bit part, as well. In addition to changing the HAT/boot contract, I recommend changing some aspects of VM layout, such as the text start address. It is not that we cannot decide on a value and stick with it. It is a bit of a jolt to the system, just to keep things on track. Better to do it early, rather than later.

XXX More on boot/HAT contract, later.

Data types

XXX memseg structures changed?

Functions

Flow of control

XXX flow of control from startup() … hat_kern_setup()

The following is a quick overview of HAT features and an assessment of:
XXX UPOD schedule vs quick&dirty schedule
XXX UPOD := Under-Promise Over-Deliver

OPINION on HAT Data Structures

The hat data structure should be an opaque data type, preferably void. That is, nothing outside the HAT should refer to "struct hat", but to hat_t. So, hat_t * is void *, as far as everyone is concerned, except the HAT implementation. If we wrote the kernel in C++, we could make hat_t a class with private members. We should be able to change all usage of "struct hat" to hat_t. If that is not acceptable, then we ought to at least redefine the struct hat so that it has one member which is a simple data type and has an unlikely name, like _noneofyourbusiness. Failing that, we ought to be able to redefine struct hat so that all the members are the same order, data type, offset, and size, but the member names have been changed, for example by prefixing each member name with _. vm/hat.h could have some preprocessor code like so:

    #if defined(HATIMPLEMENTATION)
    struct hat {
        sometype member1;
        …
    };
    #else
    struct hat {
        sometype __member1;
        …
    };
    #endif

END of OPINION

HAT/boot Food Taster and HAT Debugging Tools

Most of this document covers changes to Solaris/PPC 2.6 code that are imposed by external factors: changing hardware, evolution of Solaris, changes to boot. But there are a few changes recommended here simply because they are an important improvement in HAT construction technology. By a very wide margin, the top two are:

    the HAT/boot Food Taster
    the HAT Debugging Toolkit
These additions do not come free of cost, so they need to be mentioned here. However, they have a very good chance of leading to a net reduction in system bringup time, and they contribute to more reliable time budgets, because big surprises are reduced.

HAT/boot Food Taster

Things don't go well if the HAT consumes anything toxic. Things can go especially badly early on and in mysterious ways if the HAT inherits MMU state from boot which is not compatible with the state for which it is designed. I recommend spending some time up front in writing a significant amount of code which tests the MMU state and other conditions, as they are when the HAT first takes control. If there is anything that is not in order, the HAT/boot food taster should not be sparing in its effort to explain clearly what is expected and what it got, and then die a quick and merciful death. Perhaps in the process it can deliberately trigger the debugger (either hardware debugger or kmdb), if present, in a special way. The alternative to fail-fast semantics is system delusion, so fail fast and fail noisily.

This sort of thing was done on Solaris/IA64, and has been proven to save a great deal of time when integrating work done by several developers each working on different pieces, and possibly misunderstanding the HAT/boot contract.

Even if it were the case that large pieces of HAT/boot code could be reused, the amount of work needed to add a HAT/boot food taster is pretty much the same for a HAT code update or for a brand new HAT layer.

HAT Debugging Toolkit

It is common to sprinkle some ASSERTs in the code. Also, most HAT developers have a stash of HAT debugging aids, such as a handy-dandy pagetable hashing pocket calculator or pagetable walker/navigator. But I believe it is a good idea to include a set of kernel functions and userland tools as a first class citizen in the software release.
Solaris/IA64 had an extensive set of HAT debug helper functions, as well as userland tools to do HAT-specific monitoring of correctness and performance. Our new Solaris/PPC port can do even more because there are more hardware and software tools available, such as a hardware debugger (when available) and DTrace.

The amount of work needed to add a HAT debugging toolkit is pretty much the same for a HAT code update or for a brand new HAT layer.

Some of the components of the HAT debugging toolkit are:
Fault Injection

One example of the use of fault injection for the HAT layer is to arrange for races to be lost often. The function to atomically replace a PTE can be implemented with a version that deliberately causes it to fail with a specified probability, while being careful to limit the number of consecutive failures from the same caller so that we don't block forward progress of the system.

Pagetable and HME verifier

A pagetable and HME verifier is to the HAT data structures what fsck is to UFS filesystem metadata. On a running system things change too quickly to check consistency of the entire system, but at the time of a kernel panic, it can be done without disturbing the existing mappings. Also, most of the consistency checking is decomposable. That is, much can be determined about the internal consistency of a subset of page tables, and it can be tested quickly and nondestructively.

Pagetable and HME statistics

Before DTrace, it would have been very difficult to generate statistics about things like hash collisions without writing extra helper functions and rolling your own methods for enabling and disabling probing; and it was even more difficult to get statistics out of the kernel so that some userland monitoring / visualization program could slice and dice and present the data. DTrace makes this sort of thing a great deal easier. However, I believe the HAT may still have some hooks in it with DTrace and userland reporting programs in mind.

Pathological workloads

In order to exercise the logic for handling rare cases, such as clustering of hash collisions leading to full PTE groups, small workloads can be constructed that generate very unfortunate reference patterns.

There are several ideas for new functionality and performance enhancements, but they are not as important and certainly not as urgent as debugging aids. So, they will be collected in a document yet to come.
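As an illustration of the fault injection idea described above, here is a minimal userland sketch of a PTE compare-and-swap wrapper that deliberately "loses the race" with a given probability, but never more than a bounded number of times in a row for the same caller. Everything here is hypothetical: the type and function names, the per-caller state structure, and the use of C11 atomics and rand() in place of whatever primitives a kernel implementation would use.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-caller fault injection state (hypothetical). */
typedef struct fi_state {
	unsigned fail_percent;     /* chance of an injected failure, 0..100 */
	unsigned max_consecutive;  /* cap on back-to-back injected failures */
	unsigned consecutive;      /* injected failures since last real try */
} fi_state_t;

/*
 * Atomically replace *pte with desired if it still equals expected.
 * With fault injection enabled, sometimes report failure without
 * touching the PTE, as if another CPU had won the race.
 */
bool
pte_cas_fi(fi_state_t *fi, _Atomic uint64_t *pte,
    uint64_t expected, uint64_t desired)
{
	if (fi->consecutive < fi->max_consecutive &&
	    (unsigned)(rand() % 100) < fi->fail_percent) {
		fi->consecutive++;
		return (false);		/* injected failure */
	}

	/* Cap reached or no injection this time: do the real operation. */
	fi->consecutive = 0;
	return (atomic_compare_exchange_strong(pte, &expected, desired));
}
```

Callers that already retry on a lost race need no changes. The cap on consecutive injected failures is what preserves forward progress.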