============[ Per-CPU variables and magazines ]============

To understand why having different CPUs access the same memory address at the
same time creates huge performance bottlenecks on multi-processor systems,
watch Cliff Click's talk about modern CPUs. Keep in mind that a global lock's
variable is just such an address:

  http://www.infoq.com/presentations/click-crash-course-modern-hardware

(The talk proper starts at the 4 min mark; the explanation of the cache
coherence logic in hardware starts at the 36 min mark, on the slide titled
"Caches". The same logic underlies modern spinlock synchronization.)

------------[ Per-CPU Kmem magazines ]------------

Bonwick01.pdf explains the need for per-CPU allocation in Section 3 and goes
into the implementation details in Fig. 3.1b and Fig. 3.1c. The actual code
has changed somewhat, but is still recognizable:

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/kmem.c#2526

The most important statement on the "hot path" is getting the per-CPU magazine:

  2528		kmem_cpu_cache_t *ccp = KMEM_CPU_CACHE(cp);

KMEM_CPU_CACHE looks like a simple macro, but it actually goes much deeper
than a first glance reveals:

  182 /*
  183  * The "CPU" macro loads a cpu_t that refers to the cpu that the current
  184  * thread is running on at the time the macro is executed. A context switch
  185  * may occur immediately after loading this data structure, leaving this
  186  * thread pointing at the cpu_t for the previous cpu. This is not a problem;
  187  * we'd just end up checking the previous cpu's per-cpu cache, and then check
  188  * the other layers of the kmem cache if need be.
  189  *
  190  * It's not even a problem if the old cpu gets DR'ed out during the context
  191  * switch. The cpu-remove DR operation bzero()s the cpu_t, but doesn't free
  192  * it. So the cpu_t's cpu_cache_offset would read as 0, causing us to use
  193  * cpu 0's per-cpu cache.
  194  *
  195  * So, there is no need to disable kernel preemption while using the CPU macro
  196  * below since if we have been context switched, there will not be any
  197  * correctness problem, just a momentary use of a different per-cpu cache.
  198  */
  199
  200 #define KMEM_CPU_CACHE(cp) \
  201 	((kmem_cpu_cache_t *)((char *)(&cp->cache_cpu) + CPU->cpu_cache_offset))

Finally, in the CPU macro, we get to how a processor comes to know its own
identity:

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/cpuvar.h#565

and

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#2047

(Look up to lines 2039--2045: such symbol definitions, #ifdef-ed for "lint",
are meant to keep the lint(1) C code checker happy with appropriate type
definitions that are never actually compiled. Skip those and look for the
actual architecture-dependent definitions.)

  2047 #if defined(__amd64)
  2048
  2049 	ENTRY(curcpup)
  2050 	movq	%gs:CPU_SELF, %rax
  2051 	ret
  2052 	SET_SIZE(curcpup)
  2053
  2054 #elif defined(__i386)
  2055
  2056 	ENTRY(curcpup)
  2057 	movl	%gs:CPU_SELF, %eax
  2058 	ret
  2059 	SET_SIZE(curcpup)
  2060
  2061 #endif	/* __i386 */

So this is the actual definition, and it depends on the per-CPU GDT tables,
pointed to by each CPU's GDTR register, which give each processor its own %GS
segment. The %GS base, at last, can differ from processor to processor, even
though the compiled code is identical. A similar mechanism is used for
addressing thread-local storage: recall that threads run code that is
identical in every thread; only the CPU context data (the saved and restored
registers) differs.
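To make the %gs trick concrete, here is a small hypothetical sketch (not taken
from illumos) of a curcpup()-style accessor written with GCC/Clang inline
assembly for x86-64. The structure layout, the offset 16, and the name
my_curcpu() are all invented for illustration; the point is that every
processor's %gs base is programmed to point at that processor's own per-CPU
area, so this single identical instruction returns a different pointer on
every CPU. (In user space it would only work after the %gs base had been set
up, so treat it as a reading aid rather than a program to run as-is.)

    /* Hypothetical per-CPU structure; only the field offsets matter here. */
    struct mycpu {
            unsigned long   cpu_id;         /* offset 0  */
            unsigned long   cpu_flags;      /* offset 8  */
            struct mycpu    *cpu_self;      /* offset 16: points back to this structure */
    };

    /*
     * Analogue of curcpup(): load the "self" pointer from a fixed offset
     * off the %gs segment base.  Each processor's %gs base is assumed to
     * point at its own struct mycpu, so the same instruction yields a
     * per-processor result without any per-processor code.
     */
    static inline struct mycpu *
    my_curcpu(void)
    {
            struct mycpu *cp;

            __asm__ __volatile__("movq %%gs:16, %0" : "=r" (cp));
            return (cp);
    }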
Search for %gs in http://www.uclibc.org/docs/tls.pdf for more detail.

Note that the "hot path" in lines 2532--2558 is mostly debugging code; its
non-debugging essence is handing out the topmost "round" of the per-CPU
magazine, with the lock in the kmem_cpu_cache ccp protecting only against
thread preemption on the same CPU.

Also note that this logic is enclosed in an endless loop, "for (;;) {", so
that the request can still be served by swapping the per-CPU magazines with
kmem_cpu_reload() and then by going to the depot level with kmem_depot_alloc()
(line 2597). Only when all of these fail does a "break" leave the loop and
fall through to the raw slab allocation, kmem_slab_alloc() (line 2618).

Also notice that on the non-debug path the constructor is called, if one is
defined:

  2643		if (cp->cache_constructor != NULL &&
  2644		    cp->cache_constructor(buf, cp->cache_private, kmflag) != 0) {
  2645			atomic_inc_64(&cp->cache_alloc_fail);
  2646			kmem_slab_free(cp, buf);
  2647			return (NULL);
  2648		}

Finally, note the effects of _false sharing_ in caches, discussed in
Section 3.6 of the Bonwick paper; a padded per-CPU layout that avoids it is
sketched below.
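To picture the usual cure for false sharing, here is a minimal sketch; the
64-byte line size, the names, and the four-CPU array are assumptions, not
taken from the sources (the illumos kmem_cpu_cache_t is padded out for the
same reason). Each per-CPU element is padded and aligned to its own cache
line, so one CPU updating its element never invalidates the line another CPU
is reading.

    /* Assumed cache-line size for current x86 parts; purely illustrative. */
    #define MY_CACHE_LINE   64
    #define MY_NCPUS        4

    struct my_percpu_stat {
            unsigned long   ps_count;
            char            ps_pad[MY_CACHE_LINE - sizeof (unsigned long)];
    } __attribute__((aligned(MY_CACHE_LINE)));

    /*
     * One element per CPU, each in its own cache line.  Without the padding,
     * several counters would share a line, and every increment on one CPU
     * would bounce that line out of the other CPUs' caches.
     */
    struct my_percpu_stat my_stats[MY_NCPUS];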
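Finally, to tie the pieces of this section together, below is a user-space toy
model, not the illumos code, of the allocation loop just described. All names
(cache_alloc(), depot_alloc(), MAG_SIZE, NCPU) and sizes are invented; the
depot is just a locked stack of full magazines, the slab layer is stood in for
by malloc(3), and the per-CPU lookup is faked by passing a cpu id. What it
keeps is the shape of the hot path: pop the topmost round from the per-CPU
loaded magazine under the per-CPU lock, otherwise swap in the previous
magazine, otherwise reload from the depot, and only when all of that fails
break out to the slab stand-in and run the constructor.

    #include <stdlib.h>
    #include <pthread.h>

    #define NCPU        4       /* pretend number of CPUs (invented) */
    #define MAG_SIZE    8       /* rounds per magazine (invented) */

    typedef struct magazine {
            struct magazine *mag_next;              /* depot free-list linkage */
            void            *mag_round[MAG_SIZE];
    } magazine_t;

    typedef struct cpu_cache {
            pthread_mutex_t cc_lock;        /* contended only by threads on "this CPU" */
            magazine_t      *cc_loaded;     /* magazine we are allocating from */
            int             cc_rounds;      /* rounds left in cc_loaded */
            magazine_t      *cc_ploaded;    /* previously loaded magazine */
            int             cc_prounds;     /* rounds left in cc_ploaded */
    } cpu_cache_t;

    typedef struct cache {
            size_t          cache_bufsize;
            int             (*cache_constructor)(void *, void *, int);
            void            *cache_private;
            pthread_mutex_t cache_depot_lock;
            magazine_t      *cache_full_mags;       /* depot: stack of full magazines */
            cpu_cache_t     cache_cpu[NCPU];
    } cache_t;

    /* Depot layer: pop a full magazine, if the depot has one. */
    static magazine_t *
    depot_alloc(cache_t *cp)
    {
            magazine_t *mp;

            pthread_mutex_lock(&cp->cache_depot_lock);
            if ((mp = cp->cache_full_mags) != NULL)
                    cp->cache_full_mags = mp->mag_next;
            pthread_mutex_unlock(&cp->cache_depot_lock);
            return (mp);
    }

    /* Toy stand-in for CPU/KMEM_CPU_CACHE: the caller passes a cpu id. */
    void *
    cache_alloc(cache_t *cp, int cpuid, int kmflag)
    {
            cpu_cache_t *ccp = &cp->cache_cpu[cpuid % NCPU];
            magazine_t *fmp, *tmp;
            void *buf;

            pthread_mutex_lock(&ccp->cc_lock);
            for (;;) {
                    /* Hot path: hand out the topmost round of the loaded magazine. */
                    if (ccp->cc_rounds > 0) {
                            buf = ccp->cc_loaded->mag_round[--ccp->cc_rounds];
                            pthread_mutex_unlock(&ccp->cc_lock);
                            return (buf);
                    }
                    /* Loaded magazine empty: swap with the previous one if it has rounds. */
                    if (ccp->cc_prounds > 0) {
                            tmp = ccp->cc_loaded;
                            ccp->cc_loaded = ccp->cc_ploaded;
                            ccp->cc_rounds = ccp->cc_prounds;
                            ccp->cc_ploaded = tmp;
                            ccp->cc_prounds = 0;
                            continue;
                    }
                    /*
                     * Both magazines empty: try to reload a full one from the depot.
                     * (The real code also returns the spare empty magazine to the
                     * depot's empty-magazine list; this toy omits that.)
                     */
                    if ((fmp = depot_alloc(cp)) != NULL) {
                            ccp->cc_ploaded = ccp->cc_loaded;
                            ccp->cc_prounds = 0;
                            ccp->cc_loaded = fmp;
                            ccp->cc_rounds = MAG_SIZE;
                            continue;
                    }
                    break;  /* magazines and depot exhausted; drop to the "slab" layer */
            }
            pthread_mutex_unlock(&ccp->cc_lock);

            /* Slab-layer stand-in, followed by the constructor, as in kmem.c. */
            if ((buf = malloc(cp->cache_bufsize)) == NULL)
                    return (NULL);
            if (cp->cache_constructor != NULL &&
                cp->cache_constructor(buf, cp->cache_private, kmflag) != 0) {
                    free(buf);
                    return (NULL);
            }
            return (buf);
    }

A real use of this toy would still need to initialize the mutexes
(pthread_mutex_init() or PTHREAD_MUTEX_INITIALIZER) and to implement the free
path that refills magazines and hands full ones back to the depot; both are
omitted here to keep the allocation-side shape visible.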