Recall the aspects of the virtual memory (VM) system:

  * Isolation (the illusion of it -- debugging support breaches it)
  * Page sharing between apps
  * Demand paging
  * VM as file I/O cache

Isolation is achieved with help from the hardware (see "Solaris x86
internals" pp. 79--83 and the note on PAE below).

The other properties above are all due to the modern VM design, which

  (a) maintains a "reverse" mapping for all physical pages: for each
      page, the OS records its usage

  (b) uses the page fault handler as the main workhorse for paging in
      blocks from block devices and delegates to it whenever possible
      (e.g., "open" means "mmap", and "mmap" means page table entry
      setup; a "read" will then cause a #PF handler to actually call
      the driver's block reading code)

  (c) relies on the ELF format's rich knowledge about the structure of
      executables and libraries.

Consider dynamic linking/loading design as motivated by the economics
of reusing and remapping library code. As code gains more and more
functionality, there is a break-even point between statically compiling
all the needed function code into executables (only the needed functions
are pasted into the final executable), and factoring shared code into
dynamically loaded libraries (aka shared objects). Shared object code is
trimmed off the executables' own .text, but now one must load the entire
page of a dynamically linked library where a needed function is located.
However, these loaded pages can be shared between all the virtual
address spaces that need them. Thus go the VM trade-offs.

Note that only the non-writable pages of shared object files can be
shared between processes for their whole lifetimes; writable pages of
data obviously must receive a separate copy in every address space as
soon as they are written to!
This design is called copy-on-write: a writable page is shared until
the first write into it, at which point the process that wrote into it
must receive its own private copy of that page (while other processes
that have not yet written to it may continue sharing the page as
loaded).

--------------[ Anatomy of address spaces: ]---------------------

"OpenSolaris Internals" Chapter 9.2 explains the theory of address
spaces. Chapter 9.4 explains how address spaces are implemented and
handled:

  proc_t.p_as -> struct as -> (AVL Tree) -> "struct seg" (AKA seg_t)

Cf.: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/seg.h#102

Observe seg_ops, the dispatch table of operations/methods that handle
the mapping of a (contiguous) memory segment, casting the void* s_data
member of seg_t into whatever type the operations apply to (e.g.,
segvn_data for segvn_ops, segmap_data for segmap_ops, and so on):
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/seg.h#117

In the terms of the C++ object system, seg_t is an "abstract class": a
common base class for a variety of classes that actually represent
objects with that common set of operations, but meaningless and not
instantiable on its own. In Java, such abstract non-instantiable class
definitions are called *interfaces*, to further distinguish them from
the instantiable classes.

Note that we have seen a similar pattern with vnode_t in VFS across
different file systems: see VOP_* macros and fop_* methods in
process-table-traversals.txt.

---------------------[ Address space life cycle ]---------------------

Life cycle of a "struct as":

  as_alloc() [once, at system boot/init time] -> as_dup() [by fork()]

as_dup() is how all address spaces after init's get created:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_as.c#774

Observe how AVL structures get copied in as_dup (the loop at line 791).
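The effect of copy-on-write across fork() can be observed from user
space. What follows is only a minimal sketch, not kernel code: the
function name value_seen_by_parent() is made up for this illustration,
and it assumes a POSIX system whose mmap() accepts MAP_ANONYMOUS (illumos
and Linux both do). A page mapped MAP_PRIVATE stays shared until the
child writes to it, so the parent should keep seeing its own value; the
same experiment with MAP_SHARED should show the child's write.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous page with the given sharing
 * flag (MAP_PRIVATE or MAP_SHARED), store 1 in it, fork, let the child
 * store 2, and return the value the parent reads afterwards.
 * Returns -1 on any error.
 */
int value_seen_by_parent(int sharing_flag)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    int *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                  sharing_flag | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *p = 1;                     /* parent touches the page before fork() */

    pid_t pid = fork();         /* the address space is duplicated here */
    if (pid < 0) {
        munmap(p, pagesize);
        return -1;
    }
    if (pid == 0) {
        *p = 2;                 /* first write in the child */
        _exit(0);
    }

    int status;
    int seen = -1;
    if (waitpid(pid, &status, 0) == pid)
        seen = *p;              /* read only after the child has finished */

    munmap(p, pagesize);
    return seen;
}
```

Calling value_seen_by_parent(MAP_PRIVATE) from a main() should yield 1
(the child wrote to its own private copy), while
value_seen_by_parent(MAP_SHARED) should yield 2.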
Look for the call to as_dup() in fork():
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/fork.c#286

---------------------[ The trajectory of a page fault ]---------------------

For the purposes of this trace, we'll start with trap(),
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/trap.c#455
which is the common point where Illumos C code starts handling traps and
faults.

Of course, to get to trap() from an instruction that, e.g., issues a
virtual address that lacks a matching entry in the hardware page tables,
or an instruction that can't be decoded in a valid way, execution must
be dispatched through the Interrupt Descriptor Table (IDT) to the entry
that corresponds to the particular kind of fault or trap. We say that
the hardware "raises" that exception: control flow that can no longer
continue due to an error (such as a virtual address unmapped in the
current page table pointed to by the current value of CR3) is switched
to a special entry point in the IDT.

These entry points, one per exceptional trapped condition, are encoded
in IDT entries; they typically lead to assembly routines such as
cmntrap(). These assembly routines line up the CPU-provided info about
the trap (such as the faulting virtual address) according to the C
calling convention, so that it can be fed into a C function. This is
what cmntrap() does, so that trap() can be called.

In turn, trap() dispatches on the trap code, trapno. Observe the big
case statement at
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/trap.c#579
where the cases are the trap numbers from the x86 CPU manual, encoded as
constants at
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/sys/trap.h#41

Page fault #PF is number 14, 0xe, called T_PGFLT.

Then pagefault() is called from trap() to handle the various cases of
#PF. It contains simple process-related checks, but all the real
process-specific work is done in as_fault().
Note how as_fault()
(http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_as.c#841)
redispatches the handling of the fault by finding the right segment in
the address space's AVL tree and then calling _that segment's_
SEGOP_FAULT (line 944):

  res = SEGOP_FAULT(hat, seg, raddr, ssize, type, rw);

This is very important. A page fault on a swapped-out memory page merely
means that the page needs to be brought back from disk. A page fault on
an unmapped address (no segment containing it) may mean either that a
SIGSEGV signal should be sent to the process OR that a new page should
be claimed for the process' stack. A page fault on a previously mmap-ed
file (such as a library shared object) means that a physical page needs
to be filled with the corresponding page-sized chunk of the file on
disk, and so on. Each segment invokes the right kind of operation for
its type, as set in its s_ops (and supported by the matching kind of
s_data).

See "x86 OpenSolaris internals", Section 6.2 about OpenSolaris' unified
handling of traps, faults, and exceptions.

---------------------[ Per-page hash table ]---------------------

The kernel maintains a "struct page" (page_t) data structure for every
physical page, explained in the beginning of Chapter 10.

This is the heart of mmap-ed file sharing (libraries, in particular) and
file I/O caching: it associates a physical page with a piece of a
file/device, i.e. maps

  (vnode, offset) --> an instance of page_t <--> phys. page

page_t is defined at:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/page.h#498

Look through the big comment that starts at line 149 of page.h and
explains the uses of this page_t structure! Multiple kernel functions
using this table means multiple locks protecting its different members,
and that comment tells the whole story. The story, especially the
"locking protocol", is a perfect example of OS programming
optimizations.
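The association of a file offset with a physical page can be pictured
as a chained hash table keyed by the pair (file identity, offset). The
sketch below is a toy user-space model under loud assumptions: every
toy_* name, the table size, and the hash mixing are invented for this
illustration and are NOT the kernel's page_t, page_hash, or
PAGE_HASH_FUNC. It only shows why two faults on the same piece of the
same file end up on the same physical page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 64            /* hypothetical; a power of two */

struct toy_page {                   /* toy stand-in for page_t */
    const void      *vp;            /* identity of the backing file */
    uint64_t         off;           /* page-aligned offset in that file */
    struct toy_page *hash_next;     /* chain within one hash bucket */
};

static struct toy_page *toy_hash[TOY_HASH_SIZE];

/* Toy mixing function; the real macro differs. */
static size_t toy_hash_func(const void *vp, uint64_t off)
{
    uintptr_t v = (uintptr_t)vp;
    return (size_t)(((v >> 6) ^ (off >> 12)) & (TOY_HASH_SIZE - 1));
}

/* Return the cached page for (vp, off), or NULL if none is cached. */
struct toy_page *toy_page_find(const void *vp, uint64_t off)
{
    struct toy_page *pp = toy_hash[toy_hash_func(vp, off)];
    while (pp != NULL && !(pp->vp == vp && pp->off == off))
        pp = pp->hash_next;
    return pp;
}

/* Model of a fault: reuse the cached page, else claim a free frame. */
struct toy_page *toy_fault(const void *vp, uint64_t off,
                           struct toy_page *free_frame)
{
    struct toy_page *pp = toy_page_find(vp, off);
    if (pp == NULL) {               /* miss: "read from disk" into frame */
        size_t i = toy_hash_func(vp, off);
        free_frame->vp = vp;
        free_frame->off = off;
        free_frame->hash_next = toy_hash[i];
        toy_hash[i] = free_frame;
        pp = free_frame;
    }
    return pp;
}

/* Two "processes" fault on the same library page; returns 1 if shared. */
int toy_demo(void)
{
    static int libc_file;           /* its address plays the vnode role */
    static struct toy_page frame_a, frame_b;
    struct toy_page *p1 = toy_fault(&libc_file, 0x3000, &frame_a);
    struct toy_page *p2 = toy_fault(&libc_file, 0x3000, &frame_b);
    return p1 == p2 && p1 == &frame_a;
}
```

In toy_demo() the second fault never consumes frame_b: the lookup finds
the page the first fault brought in, which is the sharing and caching
behavior described above.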
Lookup in the page hash table: page_find
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_page.c#993

Note that the hashing is done in a macro: PAGE_HASH_FUNC,
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/page.h#590

The hash table itself is page_hash, defined in
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/startup.c#306
next to other globals that enable other traversals of physical memory
pages. We've seen similar code in /proc traversal.

---------------------[ Segment Drivers ]---------------------

The segment drivers are named in Table 9.4 (p. 479).

Primary example: "seg_vn":
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/seg_vn.c#140

Observe the *static* methods of this driver being packed into a "struct
seg_ops", called segvn_ops.

Note that the void* s_data member of a "struct seg" created by the segvn
driver points to a segvn_data (p. 482 explains it), and thus through it
to the mapped file's vnode ("vp" member) and offset ("offset" member).

The same driver also handles anonymous mappings. In that case, the
segvn_data's "amp" member is used instead (shown in Fig. 9.11).

When faulting in a file-backed page:

  trap() -> pagefault() -> as_fault() -> segvn_fault() -> FS_getpage()

where FS is the underlying filesystem, predominantly zfs.

Observe the "dives" from abstraction layers to specific systems'
implementation workhorse methods and back:

  mmap() -> FS_map() -> segvn_create() -> hat_map()
            ufs_map
            zfs_map
            ...

---------------------[ Trap handling in Illumos ]---------------------

All traps are handled uniformly by trap(). This is a conscious design
decision: all registers are saved on the stack by the respective ASM
interrupt handler pointed to from the IDT, and then C routines are
presented with the same data structure. Observe the "struct regs *rp"
argument in trap(), and also "caddr_t addr" (which must be extracted
from the special register CR2 when the page fault handler is called).
This is done by cmntrap ("common trap"),
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/ml/locore.s#1098
which, in turn, starts with pushing all regs on the stack by INTR_PUSH,
for 64-bit and 32-bit respectively:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/amd64/sys/privregs.h#202
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/sys/privregs.h#153

Eventually, this register push yields the trapno in trap() as a C
struct member:

  type = rp->r_trapno;

-----------------------------------------------------------------------

Suggestions:

Using DTrace's vminfo::: provider, observe all four kinds of page
mappings in Fig. 9.2 (described on p. 457) in action for your favorite
process. E.g., write simple "memory hog" programs to malloc() a lot of
anonymous pages, or call functions with lots of stack-allocated local
arrays. Observe file sharing between processes.

Notice how *minor faults* are handled (see 9.4.4 for definitions).

See Table 9.3 for address space manipulation functions.

---------------------[ PAE, the pre-64-bit paging ]---------------------

I mentioned PAE as a stage between 32-bit x86 MMUs and the current
64-bit designs. It's a taste of how actual hardware progresses.

PAE overview: http://en.wikipedia.org/wiki/Physical_Address_Extension
(36 bits vs classical 32 bits of physical address space, i.e. 4GB ->
64GB)

Classic 32-bit page translation without PAE:

  CR3 -> 4 KB "page directory" (4 byte entry)*1024
      -------> 4 KB "page table" (4 byte entry)*1024

Page translation with PAE: (bit 5 of CR4 := 1)

  CR3 -> Page Dir Ptr Table (8 byte entry)*4
      -------> 4 KB "page directory" (8 byte entry)*512
          ----> 4 KB "page table" (8 byte entry)*512

Bit 7 in each PDE eliminates the last lookup stage when set; instead,
the rest of the address is interpreted as an offset into a 4MB page (no
PAE, 22 bits) or a 2MB page (with PAE, 21 bits).

Bit 0 in the PTE is the crucial "Page Present" bit.
When hardware translation sees this bit clear, it raises the #PF trap;
the handler swaps the page back in, and the faulting instruction is then
retried (the faulting virtual address is available in %cr2 on entry to
the #PF handler).

The PAE entry format was the first on x86 to carry the per-page NX bit,
as the top bit (bit 63) of the 64-bit PTE. It remains there to this day.

See "x86 OpenSolaris internals", Section 4.3.

For the OS-developer level of documentation on PTEs and PDEs in detail:
"Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A:
System Programming Guide", pp. 3-35 -- 3-45
http://download.intel.com/design/processor/manuals/253668.pdf
(from 2011: superseded but more readable). See also Section 3.12,
p. 3-51 for a brief summary of TLBs.

------------------------------------------------------------------------
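The two translation schemes in the PAE section can be made concrete by
splitting a 32-bit virtual address into its table indices. The sketch
below follows the entry counts given there (1024 entries = 10 index
bits, 512 entries = 9 bits, 4 entries = 2 bits, 4 KB page = 12 offset
bits). The struct and function names are made up for illustration, and
the example address is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

struct va_classic { unsigned pd, pt, offset; };        /* 10 + 10 + 12 */
struct va_pae     { unsigned pdpt, pd, pt, offset; };  /* 2 + 9 + 9 + 12 */

/* Classic 32-bit paging: 1024-entry directory, 1024-entry table. */
struct va_classic split_classic(uint32_t va)
{
    struct va_classic r;
    r.pd     = va >> 22;              /* top 10 bits */
    r.pt     = (va >> 12) & 0x3ff;    /* next 10 bits */
    r.offset = va & 0xfff;            /* low 12 bits */
    return r;
}

/* PAE paging: 4-entry PDPT, 512-entry directory, 512-entry table. */
struct va_pae split_pae(uint32_t va)
{
    struct va_pae r;
    r.pdpt   = va >> 30;              /* top 2 bits */
    r.pd     = (va >> 21) & 0x1ff;    /* next 9 bits */
    r.pt     = (va >> 12) & 0x1ff;    /* next 9 bits */
    r.offset = va & 0xfff;            /* low 12 bits */
    return r;
}

/* Bits of a 64-bit PAE-format entry discussed in the text. */
int entry_present(uint64_t e) { return (int)(e & 1); }          /* bit 0 */
int entry_large(uint64_t e)   { return (int)((e >> 7) & 1); }   /* bit 7, PDE */
int entry_nx(uint64_t e)      { return (int)((e >> 63) & 1); }  /* bit 63 */
```

For the address 0xC0801234, split_classic() gives directory index
0x302, table index 0x001 and offset 0x234, while split_pae() gives PDPT
index 3, directory index 0x004, table index 0x001 and the same offset.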