============[ Per-CPU variables and magazines ]============

To understand why having different CPUs access the same memory address at the
same time creates huge performance bottlenecks on multi-processor systems,
watch Cliff Click's talk about modern CPUs. Keep in mind that a global lock's
variable is just such an address:

  http://www.infoq.com/presentations/click-crash-course-modern-hardware

(The talk proper starts at the 4 min mark; the explanation of the cache
coherence logic in hardware starts at the 36 min mark, on the slide titled
"Caches". The same logic underlies modern spinlock synchronization.)

------------[ Per-CPU Kmem magazines ]------------

Bonwick01.pdf explains the need for per-CPU allocation in Section 3 and goes
into the implementation details in Fig. 3.1b and Fig. 3.1c. The actual code
has changed somewhat, but is still recognizable:

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/kmem.c#2526

The most important statement on the "hot path" is getting the per-CPU magazine:

  2528		kmem_cpu_cache_t *ccp = KMEM_CPU_CACHE(cp);

KMEM_CPU_CACHE looks like a simple macro, but it actually goes much deeper
than a first glance reveals:

  182 /*
  183  * The "CPU" macro loads a cpu_t that refers to the cpu that the current
  184  * thread is running on at the time the macro is executed. A context switch
  185  * may occur immediately after loading this data structure, leaving this
  186  * thread pointing at the cpu_t for the previous cpu. This is not a problem;
  187  * we'd just end up checking the previous cpu's per-cpu cache, and then check
  188  * the other layers of the kmem cache if need be.
  189  *
  190  * It's not even a problem if the old cpu gets DR'ed out during the context
  191  * switch. The cpu-remove DR operation bzero()s the cpu_t, but doesn't free
  192  * it. So the cpu_t's cpu_cache_offset would read as 0, causing us to use
  193  * cpu 0's per-cpu cache.
  194  *
  195  * So, there is no need to disable kernel preemption while using the CPU macro
  196  * below since if we have been context switched, there will not be any
  197  * correctness problem, just a momentary use of a different per-cpu cache.
  198  */
  199
  200 #define KMEM_CPU_CACHE(cp) \
  201 	((kmem_cpu_cache_t *)((char *)(&cp->cache_cpu) + CPU->cpu_cache_offset))

Finally, in the CPU macro, we get to how a processor comes to know its own
identity:

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/cpuvar.h#565

and

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#2047

(Look up to lines 2039--2045: such symbol definitions, #ifdef-ed for "lint",
are meant to keep the lint(1) C code checker happy with appropriate type
definitions that are never actually compiled. Skip those and look for the
actual architecture-dependent definitions.)

  2047 #if defined(__amd64)
  2048
  2049 	ENTRY(curcpup)
  2050 	movq	%gs:CPU_SELF, %rax
  2051 	ret
  2052 	SET_SIZE(curcpup)
  2053
  2054 #elif defined(__i386)
  2055
  2056 	ENTRY(curcpup)
  2057 	movl	%gs:CPU_SELF, %eax
  2058 	ret
  2059 	SET_SIZE(curcpup)
  2060
  2061 #endif	/* __i386 */

So this is the actual definition, and it depends on the per-CPU GDT tables,
pointed to by each CPU's GDTR register, which give each processor its own %GS
segment. The %GS base, at last, can differ from processor to processor, even
though the compiled code is identical. A similar mechanism is used for
addressing thread-local storage: recall that threads run code that is
identical in every thread; only the CPU context data (the saved and restored
registers) differs.
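To make the %gs trick concrete, here is a small hypothetical sketch (not taken
from illumos) of a curcpup()-style accessor written with GCC/Clang inline
assembly for x86-64. The structure layout, the offset 16, and the name
my_curcpu() are all invented for illustration; the point is that every
processor's %gs base is programmed to point at that processor's own per-CPU
area, so this single identical instruction returns a different pointer on
every CPU. (In user space it would only work after the %gs base had been set
up, so treat it as a reading aid rather than a program to run as-is.)

    /* Hypothetical per-CPU structure; only the field offsets matter here. */
    struct mycpu {
            unsigned long   cpu_id;         /* offset 0  */
            unsigned long   cpu_flags;      /* offset 8  */
            struct mycpu    *cpu_self;      /* offset 16: points back to this structure */
    };

    /*
     * Analogue of curcpup(): load the "self" pointer from a fixed offset
     * off the %gs segment base.  Each processor's %gs base is assumed to
     * point at its own struct mycpu, so the same instruction yields a
     * per-processor result without any per-processor code.
     */
    static inline struct mycpu *
    my_curcpu(void)
    {
            struct mycpu *cp;

            __asm__ __volatile__("movq %%gs:16, %0" : "=r" (cp));
            return (cp);
    }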
Search for %gs in http://www.uclibc.org/docs/tls.pdf for more detail.

Note that the "hot path" in lines 2532--2558 is mostly debugging code; its
non-debugging essence is handing out the topmost "round" of the per-CPU
magazine, with the lock in the kmem_cpu_cache ccp protecting only against
thread preemption on the same CPU.

Also note that this logic is enclosed in an endless loop, "for (;;) {", so
that the request can still be served by swapping the per-CPU magazines with
kmem_cpu_reload() and then by going to the depot level with kmem_depot_alloc()
(line 2597). Only when all of these fail does a "break" leave the loop and
fall through to the raw slab allocation, kmem_slab_alloc() (line 2618).

Also notice that on the non-debug path the constructor is called, if one is
defined:

  2643		if (cp->cache_constructor != NULL &&
  2644		    cp->cache_constructor(buf, cp->cache_private, kmflag) != 0) {
  2645			atomic_inc_64(&cp->cache_alloc_fail);
  2646			kmem_slab_free(cp, buf);
  2647			return (NULL);
  2648		}

Finally, note the effects of _false sharing_ in caches, discussed in
Section 3.6 of the Bonwick paper; a padded per-CPU layout that avoids it is
sketched below.
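To picture the usual cure for false sharing, here is a minimal sketch; the
64-byte line size, the names, and the four-CPU array are assumptions, not
taken from the sources (the illumos kmem_cpu_cache_t is padded out for the
same reason). Each per-CPU element is padded and aligned to its own cache
line, so one CPU updating its element never invalidates the line another CPU
is reading.

    /* Assumed cache-line size for current x86 parts; purely illustrative. */
    #define MY_CACHE_LINE   64
    #define MY_NCPUS        4

    struct my_percpu_stat {
            unsigned long   ps_count;
            char            ps_pad[MY_CACHE_LINE - sizeof (unsigned long)];
    } __attribute__((aligned(MY_CACHE_LINE)));

    /*
     * One element per CPU, each in its own cache line.  Without the padding,
     * several counters would share a line, and every increment on one CPU
     * would bounce that line out of the other CPUs' caches.
     */
    struct my_percpu_stat my_stats[MY_NCPUS];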
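Finally, to tie the pieces of this section together, below is a user-space toy
model, not the illumos code, of the allocation loop just described. All names
(cache_alloc(), depot_alloc(), MAG_SIZE, NCPU) and sizes are invented; the
depot is just a locked stack of full magazines, the slab layer is stood in for
by malloc(3), and the per-CPU lookup is faked by passing a cpu id. What it
keeps is the shape of the hot path: pop the topmost round from the per-CPU
loaded magazine under the per-CPU lock, otherwise swap in the previous
magazine, otherwise reload from the depot, and only when all of that fails
break out to the slab stand-in and run the constructor.

    #include <stdlib.h>
    #include <pthread.h>

    #define NCPU        4       /* pretend number of CPUs (invented) */
    #define MAG_SIZE    8       /* rounds per magazine (invented) */

    typedef struct magazine {
            struct magazine *mag_next;              /* depot free-list linkage */
            void            *mag_round[MAG_SIZE];
    } magazine_t;

    typedef struct cpu_cache {
            pthread_mutex_t cc_lock;        /* contended only by threads on "this CPU" */
            magazine_t      *cc_loaded;     /* magazine we are allocating from */
            int             cc_rounds;      /* rounds left in cc_loaded */
            magazine_t      *cc_ploaded;    /* previously loaded magazine */
            int             cc_prounds;     /* rounds left in cc_ploaded */
    } cpu_cache_t;

    typedef struct cache {
            size_t          cache_bufsize;
            int             (*cache_constructor)(void *, void *, int);
            void            *cache_private;
            pthread_mutex_t cache_depot_lock;
            magazine_t      *cache_full_mags;       /* depot: stack of full magazines */
            cpu_cache_t     cache_cpu[NCPU];
    } cache_t;

    /* Depot layer: pop a full magazine, if the depot has one. */
    static magazine_t *
    depot_alloc(cache_t *cp)
    {
            magazine_t *mp;

            pthread_mutex_lock(&cp->cache_depot_lock);
            if ((mp = cp->cache_full_mags) != NULL)
                    cp->cache_full_mags = mp->mag_next;
            pthread_mutex_unlock(&cp->cache_depot_lock);
            return (mp);
    }

    /* Toy stand-in for CPU/KMEM_CPU_CACHE: the caller passes a cpu id. */
    void *
    cache_alloc(cache_t *cp, int cpuid, int kmflag)
    {
            cpu_cache_t *ccp = &cp->cache_cpu[cpuid % NCPU];
            magazine_t *fmp, *tmp;
            void *buf;

            pthread_mutex_lock(&ccp->cc_lock);
            for (;;) {
                    /* Hot path: hand out the topmost round of the loaded magazine. */
                    if (ccp->cc_rounds > 0) {
                            buf = ccp->cc_loaded->mag_round[--ccp->cc_rounds];
                            pthread_mutex_unlock(&ccp->cc_lock);
                            return (buf);
                    }
                    /* Loaded magazine empty: swap with the previous one if it has rounds. */
                    if (ccp->cc_prounds > 0) {
                            tmp = ccp->cc_loaded;
                            ccp->cc_loaded = ccp->cc_ploaded;
                            ccp->cc_rounds = ccp->cc_prounds;
                            ccp->cc_ploaded = tmp;
                            ccp->cc_prounds = 0;
                            continue;
                    }
                    /*
                     * Both magazines empty: try to reload a full one from the depot.
                     * (The real code also returns the spare empty magazine to the
                     * depot's empty-magazine list; this toy omits that.)
                     */
                    if ((fmp = depot_alloc(cp)) != NULL) {
                            ccp->cc_ploaded = ccp->cc_loaded;
                            ccp->cc_prounds = 0;
                            ccp->cc_loaded = fmp;
                            ccp->cc_rounds = MAG_SIZE;
                            continue;
                    }
                    break;  /* magazines and depot exhausted; drop to the "slab" layer */
            }
            pthread_mutex_unlock(&ccp->cc_lock);

            /* Slab-layer stand-in, followed by the constructor, as in kmem.c. */
            if ((buf = malloc(cp->cache_bufsize)) == NULL)
                    return (NULL);
            if (cp->cache_constructor != NULL &&
                cp->cache_constructor(buf, cp->cache_private, kmflag) != 0) {
                    free(buf);
                    return (NULL);
            }
            return (buf);
    }

A real use of this toy would still need to initialize the mutexes
(pthread_mutex_init() or PTHREAD_MUTEX_INITIALIZER) and to implement the free
path that refills magazines and hands full ones back to the depot; both are
omitted here to keep the allocation-side shape visible.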