Problem 3: Map a given physical page into a process.

My intended solution was to manipulate the page coloring algorithm that picks a physical page to allocate for a requested virtual address. Specifically, the user can pass a desired virtual address as the first argument to mmap; although mmap is not obligated to honor this request, it will try if it can. If the virtual address makes sense, the kernel's page coloring algorithm will use it to pick a freelist of the right color---on which the targeted physical page is hopefully waiting, in a known position.

So the solution approach is to converge on that physical page by mapping out the kernel's colored freelists from MDB and finding the page there, and then, from the userland program, to manipulate the virtual addresses passed to mmap to nab the right freelist. Thus the key to this problem is page coloring, described in section 10.2.7. The textbook makes the default algorithm sound a bit daunting---but a look at the code shows that, in fact, it isn't.

1. Page coloring algorithm on x86.

As several of you noticed, the problem would be really simple on a Sparc, because on that architecture the kernel can be told to apply a trivial mapping based on the virtual address alone (see Table 10.2 on p. 516). Unfortunately, this option is not compiled into the kernel on x86---but we can still peek at the code, and it turns out to be a good starting point.

The page coloring policy is implemented in the AS_2_BIN macro. This macro is defined differently in sun4/vm/vm_dep.h and i86pc/vm/vm_dep.h and is then used in common/vm/vm_pagelist.c. For the Sparc (sun4/vm), http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/sun4/vm/vm_dep.h#625, the default 'case 0' looks pretty complicated, whereas 'case 1' is a one-liner, as expected from Table 10.2. But how does this macro look for x86?

    #define AS_2_BIN(as, seg, vp, addr, bin, szc) \
        bin = (((((uintptr_t)(addr) >> PAGESHIFT) + ((uintptr_t)(as) >> 4)) \
            & page_colors_mask) >> \
            (hw_page_array[szc].hp_shift - hw_page_array[0].hp_shift))

Notice that most of the arguments passed to the macro aren't used at all! The only non-trivial part is that the (virtual) address addr (shifted right to remove the offset into the 4K page) is added to the (kernel) address of the 'struct as' of the process for which the color mapping is computed (itself shifted right by 4). The page_colors_mask is simply 0x7f, so only the low seven bits of each shifted value, roughly two nybbles each from 'as' and 'addr', actually matter for color selection (addr: 0x........ ...xx..., as: 0x........ .....xx.). The parameter szc is 0 for 4K pages, so the last shift of the macro is a shift by zero and does not come into play for them.

This is not daunting at all. Essentially, 'as' is the only thing that adds some "randomness" to the page color mapping (i.e., something process-specific), and only 2 nybbles' worth of it.

2. Code path from mmap to the freelists.

How do we get from the mmap() called in a user process to the point where AS_2_BIN picks the page color for the virtual addresses we request of it?

First, we need to pick sensible addresses to map (say, mmap() will refuse to map a virtual address from a range that's already taken, or that belongs to the kernel). Mapping a page with NULL for the requested address and seeing what mmap picks will help find a sensible range.

Second, how often will mmap() honor requests on a non-busy system? The small mapper.c program shows that this, in fact, happens very often.

The getfreepage.d DTrace script captures the functions and arguments that go into a colored page allocation off of a freelist. The desired virtual address is passed down to the page_create_va() function as vaddr, its last argument. This is the function that actually grabs a physical page from one of the free lists---based on the vaddr, among other things.
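
Since the color of a 4K page depends only on vaddr and on the address of the process's 'struct as', a userland program can work out which address to request from mmap once that kernel address is known (it has to be read from the kernel, e.g. with MDB). The following is a minimal sketch of that arithmetic, not the mapper.c program mentioned above: it assumes PAGESHIFT is 12 and page_colors_mask is 0x7f as quoted, and the helper names color_4k and hint_for_color are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGESHIFT_4K     12      /* 4K pages: low 12 bits are the page offset */
#define PAGE_COLORS_MASK 0x7fu   /* value of page_colors_mask quoted in the text */

/* Color picked for a 4K page (szc == 0), mirroring the x86 AS_2_BIN macro. */
static unsigned
color_4k(uint64_t as, uint64_t addr)
{
    return (unsigned)(((addr >> PAGESHIFT_4K) + (as >> 4)) & PAGE_COLORS_MASK);
}

/*
 * Hypothetical helper: lowest page-aligned address at or above base
 * whose color is target, for the given 'struct as' kernel address.
 */
static uint64_t
hint_for_color(uint64_t as, uint64_t base, unsigned target)
{
    assert(target <= PAGE_COLORS_MASK);

    /* Round base up to a page boundary and work in page numbers. */
    uint64_t page = (base + ((uint64_t)1 << PAGESHIFT_4K) - 1) >> PAGESHIFT_4K;
    unsigned cur = color_4k(as, page << PAGESHIFT_4K);

    /* Consecutive pages have consecutive colors, modulo the color count. */
    unsigned step = (target - cur) & PAGE_COLORS_MASK;

    return (page + step) << PAGESHIFT_4K;
}
```

The value returned by hint_for_color would be passed as the first argument to mmap. Without MAP_FIXED it is only a hint, so the program should compare mmap's return value with the hint to see whether the request was honored.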

Note that page_create_va() was an innovation, intended to replace the older vaddr-oblivious page_create(), because knowing the virtual address at which a physical page will be mapped into a process turned out to significantly reduce cache thrashing.

page_create_va() calls AS_2_BIN to get the color for vaddr (vm_pagelist.c, line 3785) and then passes it to *get_page_func(), which is actually get_page_mnode_freelist() (line 3744). From there, the actual freelist is referenced via

    pp = PAGE_FREELISTS(mnode, szc, bin, mtype);

where for 4K pages, mnode is always 0, szc is 0, and mtype is 1 on a simple uniprocessor machine without NUMA. So only the 'bin' parameter, a.k.a. the color set by AS_2_BIN, matters non-trivially. (Following through the mentions of mnode and mtype and looking up the variables from which they are computed should further convince you of this, beyond DTrace's output.)

Note the asserts for a page taken from a freelist. These are the properties worth crashing the kernel over! This is what it really means for the page to be free and for its page_t structure to be coherent.

    /*
     * These were set before the page
     * was put on the free list,
     * they must still be set.
     */
    ASSERT(PP_ISFREE(pp));
    ASSERT(PP_ISAGED(pp));
    ASSERT(pp->p_vnode == NULL);
    ASSERT(pp->p_hash == NULL);
    ASSERT(pp->p_offset == (u_offset_t)-1);
    ASSERT(pp->p_szc == szc);
    ASSERT(PFN_2_MEM_NODE(pp->p_pagenum) == mnode);

Finally, PAGE_FREELISTS is again architecture-specific. For i86pc/vm we get:

    extern page_t ****page_freelists;

    #define PAGE_FREELISTS(mnode, szc, color, mtype) \
        (*(page_freelists[mtype][szc] + (color)))

3. Walking the freelists in the kernel.

The freelists above should be walked to match the PAGE_FREELISTS macro above, which is ultimately responsible for producing a pointer to the page_t.
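
To make the levels of indirection in PAGE_FREELISTS concrete, here is a small userland model of it. This is a sketch, not kernel code: the two-field page_t, the model_* tables, and their sizes are all made up for illustration; only the macro body and the four-star pointer type are taken from the i86pc excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's page_t; only two fields are modeled. */
typedef struct page {
    unsigned long p_pagenum;
    struct page  *p_next;
} page_t;

/* Same body as the i86pc macro quoted in the text; mnode is unused there too. */
#define PAGE_FREELISTS(mnode, szc, color, mtype) \
    (*(page_freelists[mtype][szc] + (color)))

/* Made-up sizes for the model: 2 mtypes, 1 page size, 128 colors. */
#define MODEL_NMTYPES 2
#define MODEL_NSZC    1
#define MODEL_NCOLORS 128

static page_t   *model_heads[MODEL_NMTYPES][MODEL_NSZC][MODEL_NCOLORS];
static page_t  **model_szc[MODEL_NMTYPES][MODEL_NSZC];
static page_t ***model_mtype[MODEL_NMTYPES];
static page_t ****page_freelists;

/* Wire up the tables so each level points at the next one down. */
static void
model_init(void)
{
    for (int m = 0; m < MODEL_NMTYPES; m++) {
        for (int s = 0; s < MODEL_NSZC; s++)
            model_szc[m][s] = model_heads[m][s];
        model_mtype[m] = model_szc[m];
    }
    page_freelists = model_mtype;
}

/* Head of the 4K freelist for a color: mtype 1 and szc 0, as in the text. */
static page_t *
head_4k(unsigned color)
{
    assert(color < MODEL_NCOLORS);
    return PAGE_FREELISTS(0, 0, color, 1);
}
```

Counting the loads in head_4k gives four: the page_freelists variable itself, the mtype slot, the szc slot, and the color slot. Only the last one yields a page_t pointer, which is why a debugger pipeline that stops one step early or goes one step too far prints nonsense.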
Mtype and szc for 4K pages are fixed at 1 and 0 respectively, so the running parameter will be color, for the 128 (0x80, to match the page_colors_mask 0x7f) colors.

Sadly, MDB lacks a scripting language, so commands to walk the freelists must either be written as a "walker" (in C, as an MDB module; look for the source code of some of the entries listed by ::walkers under /mdb/common/modules/genunix/ in OpenGrok), or fed from a shell script (or Perl or Python or Ruby, etc.).

So, to get the head (first) page of each freelist (targeting these would be the simplest, as they are the first to be allocated in their color---unless some other task beats you to it, of course):

    *page_freelists+8/K | ::map *. | ::array page_t* 7f | ::map *.

or

    *page_freelists+8/K | /K | ::array page_t* 7f | /K

/K (dereference as a 64-bit address) and ::map *. seem to be synonymous; ::map seems to be needed to do anything other than dereferencing addresses. Spaces inside ::map expressions are not allowed.

These are actual page_t pointers. For each such pointer, ::list page_t p_next will walk the corresponding freelist till it's done. E.g., this will print all free pages on color lists:

    *page_freelists+8/K | ::map *. | ::array page_t* 7f | ::map *. | ::list page_t p_next | ::print page_t p_pagenum

To dump the color N list (NOTE: N MUST be hex!):

    *page_freelists+8/K | ::map *. | ::map .+8*N | /K | ::list page_t p_next | ::print page_t p_pagenum

As per the asserts above, any pages that you encounter MUST have p_state 0x80, p_hash and p_vnode NULL, and p_offset -1 (0xffff ffff ffff ffff) to be free---or, chances are, you dereferenced one pointer too few or too many (it happens all the time when your base pointer type has **** :)).
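
Because the per-color pipeline has to be generated outside MDB, here is one possible generator, written in C to match the rest of the code in these notes. It is a sketch: the function names are made up, and it writes N with an explicit 0x prefix (an assumption on my part, chosen so the number cannot be misread whatever radix MDB is using) rather than as bare hex digits.

```c
#include <assert.h>
#include <stdio.h>

#define NCOLORS 0x80    /* 128 colors, matching page_colors_mask 0x7f */

/*
 * Format the "dump the color N list" pipeline from the text into buf,
 * substituting color for N in hex. Returns what snprintf returns.
 */
static int
mdb_color_cmd(char *buf, size_t len, unsigned color)
{
    assert(color < NCOLORS);
    return snprintf(buf, len,
        "*page_freelists+8/K | ::map *. | ::map .+8*0x%x | /K | "
        "::list page_t p_next | ::print page_t p_pagenum", color);
}

/* Write one pipeline per color to out, one per line. */
static void
emit_all_colors(FILE *out)
{
    char buf[256];

    for (unsigned c = 0; c < NCOLORS; c++) {
        mdb_color_cmd(buf, sizeof (buf), c);
        fprintf(out, "%s\n", buf);
    }
}
```

Calling emit_all_colors(stdout) from a small main and feeding the result to MDB's standard input is one way to dump every color list in a single run; the output still needs the sanity checks on p_state, p_hash, p_vnode and p_offset described above.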