I believe that the best way to understand most kernel data structures and
algorithms is to follow how they are traversed to provide common kernel
services. For example, we can see the structure of the process table by
following a /proc traversal done by "ps".

======== 0. What kernel functions support the /proc traversal

The DTrace script procdents.d puts probes at the implementations of the
getdents system call (on Illumos: "man -s2 getdents").

From UNIX getdents(2):

  DESCRIPTION
     This is not the function you are interested in. Look at readdir(3)
     for the POSIX conforming C library interface. This page documents
     the bare kernel system call interface.

So getdents is the system call behind readdir(3), the C library's way of
reading directories. You can convince yourself that getdents on /proc is
indeed at the heart of "ps" by running "truss ps" (I suggest running it as
"truss ps 2>&1 | less" to avoid scrolling back a lot).

A simple D script to trace the kernel functions invoked by ps in the course
of that getdents call:

--------------------------
#!/usr/sbin/dtrace -s
#pragma D option flowindent

/* Start tracing when ps enters any getdents* system call. */
syscall::getdents*:entry
/execname == "ps"/
{
        self->flag = 1;
}

/* Every kernel function entry and return in the flagged thread. */
fbt:::
/self->flag/
{
}

/* Stop after the first getdents* call returns. */
syscall::getdents*:return
/self->flag/
{
        self->flag = 0;
        exit(0);
}
-------------------------

borrowed from http://www.dtracebook.com/index.php/Kernel:ktrace.d

In the course d/ directory, I have
http://www.cs.dartmouth.edu/~sergey/cs258/d/procdents.d
which is a modification of this script.

======== 1. Following the VFS redispatches

Getdents* redispatches to prreaddir, the procfs-specific implementation
under VFS (see the introduction on VFS in the textbook's Ch.
1): http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/syscall/getdents.c#187

Note lines 227--229, where the real work of reading a directory is done:

227         (void) VOP_RWLOCK(vp, V_WRITELOCK_FALSE, NULL);
228         error = VOP_READDIR(vp, &auio, fp->f_cred, &sink, NULL, 0);
229         VOP_RWUNLOCK(vp, V_WRITELOCK_FALSE, NULL);

Since the underlying directory is subject to changes by other threads just
as we read it, a lock is taken. Since different filesystems may use
different lock implementations, and will definitely use different "readdir"
implementations as befits their internal data structures, all of these are
done through macros that mask the redispatches.

Observe VOP_READDIR:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/vnode.h#VOP_READDIR

It calls fop_readdir, which does the actual dispatch:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/vnode.c#fop_readdir

        err = (*(vp)->v_op->vop_readdir)(vp, uiop, cr, eofp, ct, flags);

THIS IS REALLY THE HEART OF VFS (see also other fop_* function definitions
that activate the right slot of the function pointer table).

Suggestion: find where these filesystem-specific tables are filled and
initialized.

The arrow operators go through v_op, the table of "methods" (function
pointers) appropriate for the vnode. The table is of type vnodeops_t,
defined at
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/vnode.h#939

Note the huge macro of all possible method names, VNODE_OPS, just above:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/vnode.h#824

The procfs-specific function dispatched to will be "prreaddir".

Suggestion: Confirm this and other function choices with DTrace.

======== 2. Inside the procfs module

Note: we are not talking about dynamically loadable kernel modules here;
this is DTrace's idea of a "module" (i.e., a partition of the namespace of
all available probes in the FBT provider, by kernel compilation unit).
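Going back to the VFS dispatch of section 1, here is a minimal userland
sketch of the same idea: a per-filesystem table of function pointers hangs
off each "vnode", and a generic fop_-style function calls through whatever
table the vnode carries. All toy_* names are hypothetical stand-ins of mine,
not illumos definitions, and the real vop_readdir takes more arguments.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vnode;

/* Stand-in for vnodeops_t: one slot per "method" (just one here). */
typedef struct toy_vnodeops {
	int (*vop_readdir)(struct toy_vnode *vp, int *eofp);
} toy_vnodeops_t;

/* Stand-in for vnode_t: every vnode carries its filesystem's table. */
typedef struct toy_vnode {
	const toy_vnodeops_t *v_op;
	int v_nentries;		/* pretend directory size */
} toy_vnode_t;

/* "procfs" flavor of readdir: reports all entries and signals EOF. */
int
toy_prreaddir(toy_vnode_t *vp, int *eofp)
{
	*eofp = 1;
	return (vp->v_nentries);
}

/* Another filesystem's flavor: refuses to list anything. */
int
toy_other_readdir(toy_vnode_t *vp, int *eofp)
{
	(void) vp;
	*eofp = 0;
	return (-1);
}

const toy_vnodeops_t toy_prvnodeops = { toy_prreaddir };
const toy_vnodeops_t toy_othervnodeops = { toy_other_readdir };

/* Generic layer: knows nothing about the filesystem behind vp. */
int
toy_fop_readdir(toy_vnode_t *vp, int *eofp)
{
	assert(vp != NULL && vp->v_op != NULL);
	return ((*(vp)->v_op->vop_readdir)(vp, eofp));
}
```

Two vnodes initialized with different tables, e.g.
{ &toy_prvnodeops, 3 } and { &toy_othervnodeops, 3 }, go through the same
toy_fop_readdir call and end up in different functions; that is the whole
trick.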
prreaddir consults the table of appropriate actions based on the pathname
of the /proc pseudo-directory that we want to read:
http://src.illumos.org/source/s?defs=prreaddir&project=illumos-gate :

        return (pr_readdir_function[pnp->pr_type](pnp, uiop, eofp));

The table itself:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/proc/prvnops.c#pr_readdir_function
It is indexed by the type of the /proc node; for the /proc directory
itself the selected action is pr_readdir_procdir.

Also note the catch-all action for "there is no meaningful /proc
interpretation of listing this directory", pr_readdir_notdir:

static int
pr_readdir_notdir(prnode_t *pnp, uio_t *uiop, int *eofp)
{
        return (ENOTDIR);
}

pr_readdir_procdir (line 4757) walks the actual "slots" of the process
table. Watch the running pointer

4779         proc_t *p;

iterating over all proc_t Process Descriptor Blocks. Specifically, the
while loop at
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/proc/prvnops.c#4786
increments the "slot number" n until another non-empty slot is found --
that is, pid_entry(n) returns non-NULL -- or until n reaches the maximum
number of processes, v.v_proc. Also skipped are "zone-invisible" processes,
but ignore these cases for now.

See http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/var.h
for the definition of "v". At early boot time, OpenSolaris reads the custom
settings of these variables from /etc/system, sets them in "v", and uses
them to determine the sizes of many process-related arrays allocated at
boot time (e.g., see pid_init discussed below).

======== 3. The structure of the process table.

The process table has many uses and supports many traversals (as evidenced
by the multiple pointers that a proc_t has to many other traversal-related
proc_t's).
We will concentrate on the structure imposed by the three boot-time
allocated arrays:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/pid.c#530

        procdir, pidhash, proc_lock

Suggestion: find out the role and function of pr_pid_cv, the fourth array.

procdir is an array of unions (see "union procent"), each holding a pointer
either to another cell ("slot") of procdir or to an actual proc_t.
procentfree points to the next free "slot".

proc_t's are extracted from procdir slots by slot number by pid_entry.
Note the weird-looking check for the pointer "pep" being between the start
and the end addresses of the procdir array
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/pid.c#562
-- this means that the content of the "slot" is NOT a pointer to an
allocated, active, valid proc_t; instead, that slot is empty and is used as
a part of the free list (see below).

procdir is "synchronized" with the struct pid of a process through struct
pid's member pid_prslot, which holds the number of the slot in procdir,
which in turn points to the proc_t of the process that this pid belongs to.
Observe it being set in pid_allocate(). With MDB, the kernel debugger, you
will be able to check all these on a live kernel with "mdb -k".

pidhash supports efficient lookups of a proc_t by its (integer) PID, found
in struct pid as its "int pid_id" member. The structure of pidhash is
apparent from the pid_lookup and prfind functions in
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/pid.c .

Suggestion: follow the code path from the "worker function" cfork() of the
various "fork" syscalls to pid_allocate through getproc():

        cfork > getproc > pid_allocate

and observe how new process descriptors proc_t get created and initialized,
with all their internal linkages, etc.

======== 4. Hash-by-pid structure

A similar hash table structure is also present in the Linux kernel (even
though pid is an integer there and does not get its own structure).
Hash-by-pid logic (used, e.g., for signal delivery) is all gathered in
pid.c. The two main structures are procdir[] and pidhash[].

The underlying idea, as far as I understand it, is as follows: in order to
look things up (specifically, a proc_t by its PID) through a hash table,
one needs to keep the collision list pointers somewhere. The lookup keys
being simply integers, some kind of a struct is needed to keep both the
integer PID and the collision linked list. Hence "struct pid".

pidhash[] entries point to pid structs, which contain the collision list
pointer pid_link, and the "slot" index of the corresponding proc_t pointer
in procdir[], pid_prslot. This pid_prslot provides a "back-pointer" from
struct pid to the corresponding proc_t.

procdir[] and pidhash[] are kept in sync by the proc_t-creating and freeing
functions (described below). procdir[] doubles as a freelist of slots for
proc_t's (its head pointed to by procentfree).

Note that most of these are "static", i.e., are explicitly internal to this
file's scope; pid_prslot is also not findable from elsewhere.
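The relationship between pidhash[], struct pid, and procdir[] described
above can be sketched in userland as follows. This is a toy model under my
own assumptions, not the kernel code: the toy_* names and the fixed table
sizes are hypothetical, there is no locking, and procdir is simplified to a
plain array of pointers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HASHSZ	16	/* power of two, so that masking works */
#define TOY_NPROC	16	/* number of procdir slots in this toy */

struct toy_proc {
	const char *p_name;
};

struct toy_pid {
	int pid_id;			/* the integer PID (lookup key) */
	int pid_prslot;			/* index of the proc's procdir slot */
	struct toy_pid *pid_link;	/* collision chain */
};

struct toy_pid *toy_pidhash[TOY_HASHSZ];
struct toy_proc *toy_procdir[TOY_NPROC];

/* Names an array element, so it is usable on either side of '='. */
#define TOY_HASHPID(pid)	(toy_pidhash[((pid) & (TOY_HASHSZ - 1))])

/* Put p into procdir slot 'slot' and link pidp at the head of its chain. */
void
toy_pid_link(struct toy_pid *pidp, struct toy_proc *p, int slot)
{
	assert(slot >= 0 && slot < TOY_NPROC);
	toy_procdir[slot] = p;
	pidp->pid_prslot = slot;
	pidp->pid_link = TOY_HASHPID(pidp->pid_id);
	TOY_HASHPID(pidp->pid_id) = pidp;
}

/* Walk the collision chain of the bucket that 'pid' hashes to. */
struct toy_pid *
toy_pid_lookup(int pid)
{
	struct toy_pid *pidp;

	for (pidp = TOY_HASHPID(pid); pidp != NULL; pidp = pidp->pid_link) {
		if (pidp->pid_id == pid)
			break;
	}
	return (pidp);
}

/* PID -> struct pid -> procdir slot -> proc. */
struct toy_proc *
toy_prfind(int pid)
{
	struct toy_pid *pidp = toy_pid_lookup(pid);

	return (pidp == NULL ? NULL : toy_procdir[pidp->pid_prslot]);
}
```

With 16 buckets, PIDs 5 and 21 land in the same bucket, so linking both and
then looking up 5 exercises the walk along pid_link.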
pid_prslot is nevertheless reachable from outside pid.c: notice in proc.h
the macro

        proc.h: #define p_slot          p_pidp->pid_prslot

and its use in prsubr.c, prcontrol.c, proc.c, and sig.c

[same deal with the macro
        proc.h: #define p_lock          p_lockp->pl_lock ]

----

Pid-hash structures are created in pid_init() in pid.c:

        startup() -> kern_setup1() -> pid_init()

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/os/sundep.c#216

Observe the initialization of p0 (the proc_t for the first process,
"sched") and its related struct pid, pid0. pid0 is statically defined in
pid.c, but p0 is in param.c, for the reason explained in
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/conf/param.c#75

---
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/main.c#proc_sched

/* well known processes */
proc_t *proc_sched;             /* memory scheduler */
proc_t *proc_init;              /* init */
proc_t *proc_pageout;           /* pageout daemon */
proc_t *proc_fsflush;           /* fsflush daemon */

pgcnt_t maxmem;         /* Maximum available memory in pages.   */
pgcnt_t freemem;        /* Current available memory in pages.   */
int     audit_active;
int     interrupts_unleashed;   /* set when we do the first spl0() */

kmem_cache_t *process_cache;    /* kmem cache for proc structures */
---

Read through pid_init() to see the pid-hash structures created. The
so-called "hash" is in fact just zeroing out the higher bits of the integer
PID:

static int pid_hashlen = 4;     /* desired average hash chain length */
static int pid_hashsz;          /* number of buckets in the hash table */
/* 4096 on my system, got set by
   pid_hashsz = 1 << highbit(v.v_proc / pid_hashlen); */

#define HASHPID(pid)    (pidhash[((pid)&(pid_hashsz-1))])

Note that HASHPID(pid) stands for the pidhash[] element, and is used as an
lvalue -- it is NOT just an integer hash, it's an array slot!

---

The structure of the hash table made out of pidhash[] and links in struct
pid is made obvious in the following functions:

pid_lookup: Locate the "struct pid" by the integer PID by walking the
collision list.
pid_allocate: See the new pid struct being allocated, linked into pidhash,
and made to "point" back to procdir's "prslot" where the associated proc_t
goes.

For one point where this function is called, see getproc() in fork.c, which
creates a new proc_t:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/fork.c#getproc
(notice how the new structure is requested from the special allocator, and
then zeroed out with bzero. We will study the Kmem allocator in depth
later.)

BTW, on line 1074 we see nproc++ (the number of processes on the system),
and then the newly created proc_t linked at the head of the active process
list. More of the fork()-ed child's setup follows.

Note: If you are confused as to the use of ->p_lock off of proc_t* because
the proc_t struct does not contain any member named p_lock, you are right.
Notice the following macro in proc.h:

        #define p_lock          p_lockp->pl_lock

---

Observe the permutation induced by pid_getlockslot on procdir slot numbers:

#define PLOCK_SHIFT     3

static int
pid_getlockslot(int prslot)
{
        int even = (v.v_proc >> PLOCK_SHIFT) << PLOCK_SHIFT;
        int perlap = even >> PLOCK_SHIFT;

        if (prslot >= even)
                return (prslot);

        return (((prslot % perlap) << PLOCK_SHIFT) + (prslot / perlap));
}

0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152 160 168
176 184 192 200 208 216 224 232 240 248 256 264 272 280 288 296 304 312
320 328 336 344 352 360 368 376 384 392 400 408 416 424 432 440 ...
... 9728 9736 9744 9752 9760 1 9 17 25 33 41 49 57 65 73 81 89 97 105 113
121 129 137 145 ...

Consecutive procdir slots thus get lock slots that are 8 entries apart.
This is done to keep the locks (and the cached bytes they occupy) of
sequentially created processes out of the same cache line.
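To play with this permutation in userland, here is a small sketch that
copies the arithmetic of pid_getlockslot but takes the process limit as a
parameter instead of reading v.v_proc. The toy_* names and the permutation
checker are my additions. The value 9768 used in the note below is my
inference from the listing above (9760 is followed by 1, so perlap must be
1221), not something read off a live system.

```c
#include <assert.h>

#define PLOCK_SHIFT	3
#define TOY_MAXPROC	1024	/* upper bound for the checker below */

/* Same arithmetic as pid_getlockslot, with v.v_proc passed in as v_proc. */
int
toy_getlockslot(int v_proc, int prslot)
{
	int even = (v_proc >> PLOCK_SHIFT) << PLOCK_SHIFT;
	int perlap = even >> PLOCK_SHIFT;

	assert(prslot >= 0 && prslot < v_proc);

	/* Slots past the last multiple of 8 map to themselves. */
	if (prslot >= even)
		return (prslot);

	return (((prslot % perlap) << PLOCK_SHIFT) + (prslot / perlap));
}

/* Returns 1 if every lock slot in 0..v_proc-1 is hit exactly once. */
int
toy_is_permutation(int v_proc)
{
	unsigned char seen[TOY_MAXPROC] = { 0 };
	int i;

	assert(v_proc > 0 && v_proc <= TOY_MAXPROC);
	for (i = 0; i < v_proc; i++) {
		int s = toy_getlockslot(v_proc, i);

		if (s < 0 || s >= v_proc || seen[s])
			return (0);
		seen[s] = 1;
	}
	return (1);
}
```

Calling toy_getlockslot(9768, n) for n = 0, 1, 2, ... gives 0, 8, 16, ...,
reaches 9760 at n = 1220, and wraps around to 1, 9, 17, ... from n = 1221
on, which matches the listing.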
The problem being avoided is known as "false sharing" -- for a short
definition, see http://en.wikipedia.org/wiki/False_sharing ; for a longer
discussion of how it occurs in userland programs, see
http://www.drdobbs.com/architecture-and-design/sharing-is-the-root-of-all-contention/214100002
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206

Recall that the CPU punishment for a cache miss is hundreds of CPU cycles
spent stalling, so the silly-looking arithmetic more than pays for itself!

----

Unlinking and freeing processes:

proc_entry_free(): [called from pid_exit(), which after that frees the
proc_t]

Observe the procdir slot of the proc_t being put on the head of the
freelist: procentfree now points to it.

        procdir[pidp->pid_prslot].pe_next = procentfree;
        procentfree = &procdir[pidp->pid_prslot];

The slot is referred to through its slot number in procdir, contained in
pid: pidp->pid_prslot.
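To see how procdir[] can double as a freelist, and why the "pointer falls
inside the procdir array" check is enough to tell an empty slot from an
occupied one, here is a userland sketch. It is a toy under my own
assumptions: the toy_* names and the 8-slot size are hypothetical, and there
is no pidlock and no SIDL check as in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NPROC	8

struct toy_proc {
	int p_dummy;
};

/* Each slot holds EITHER a proc pointer OR a link to the next free slot. */
union toy_procent {
	struct toy_proc *pe_proc;
	union toy_procent *pe_next;
};

union toy_procent toy_procdir[TOY_NPROC];
union toy_procent *toy_procentfree;

/* Chain all slots into the freelist: 0 -> 1 -> ... -> NULL. */
void
toy_procdir_init(void)
{
	int i;

	toy_procentfree = &toy_procdir[0];
	for (i = 0; i < TOY_NPROC - 1; i++)
		toy_procdir[i].pe_next = &toy_procdir[i + 1];
	toy_procdir[TOY_NPROC - 1].pe_next = NULL;
}

/* Pop a slot off the freelist, store p in it; returns slot number or -1. */
int
toy_slot_alloc(struct toy_proc *p)
{
	union toy_procent *pep = toy_procentfree;

	if (pep == NULL)
		return (-1);
	toy_procentfree = pep->pe_next;
	pep->pe_proc = p;
	return ((int)(pep - toy_procdir));
}

/* Push the slot back on the head of the freelist (cf. proc_entry_free). */
void
toy_slot_free(int slot)
{
	assert(slot >= 0 && slot < TOY_NPROC);
	toy_procdir[slot].pe_next = toy_procentfree;
	toy_procentfree = &toy_procdir[slot];
}

/* A slot whose content points back into procdir (or is NULL) is empty. */
struct toy_proc *
toy_pid_entry(int slot)
{
	uintptr_t a, lo, hi;

	assert(slot >= 0 && slot < TOY_NPROC);
	a = (uintptr_t)toy_procdir[slot].pe_next;
	lo = (uintptr_t)&toy_procdir[0];
	hi = (uintptr_t)&toy_procdir[TOY_NPROC];
	if (a == 0 || (a >= lo && a < hi))
		return (NULL);
	return (toy_procdir[slot].pe_proc);
}
```

After toy_procdir_init(), allocating two slots and freeing the first one
leaves the freed slot at the head of the freelist, so the next allocation
reuses it.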