======[[ Solaris/Illumos' adaptive locks ]]======

Reading: Chapter 17, especially 17.5 (see below for other material)

If you haven't watched the part of Cliff Click's lecture on caches and
memory access costs, do it now!
  http://www.infoq.com/presentations/click-crash-course-modern-hardware
-- especially the part at the 36 min mark where memory cache details start.
(same as the link at the top of percpu-allocations.txt)
(There is a version of this talk at https://www.youtube.com/watch?v=OyTA5EaAb3Y,
but unfortunately it lacks the slides)

OpenSolaris optimizes its mutexes based on two assumptions:

1. Critical sections are short, and once entered by a CPU, will be over
   very soon (faster than the context switches involved in blocking a
   waiting thread on a turnstile so that some other thread could run, and
   then waking it up). Hence the idea that a thread should spin if the
   mutex is held by a thread currently on a CPU, and block otherwise.

2. Most mutexes in the kernel can be adaptive (i.e., can block without
   causing scheduling trouble), are not hotly contested, and therefore
   most threads in most cases will find a mutex BOTH *adaptive* AND
   *not taken*.

   Hence this path -- adaptive && not taken -- and the mutex implementation
   data structure (see the union in mutex_impl.h) are *co-optimized* in
   assembly down to a single LOCK-prefixed (atomic) instruction and a return:

   http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/lock_prim.s#512 -- comment
   http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/lock_prim.s#558 -- code

   Note: according to the AMD64 calling convention, RDI holds the first
   argument to a function, which for mutex_enter() is the pointer to the
   mutex data structure from mutex_impl.h.

Question: what happens if kernel pre-emption hits on this path? Spinning
on a spin lock should not get pre-empted. See notes on kernel pre-emption
below.

=====[ Spin/Adaptive Lock Implementation ]=====

Described in Ch. 17 of the textbook, especially in 17.5.

mutex_enter() is defined in native assembly for each supported
architecture. For x86:
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/lock_prim.s

In MDB, mutex_enter::dis will show you the actual code.

After handling the simplest "fast path" case of an uncontested lock being
taken by a thread (see comment at line 514), this code hands off to C
code, mutex_vector_enter(), for the more complex cases, where it spins or
blocks on a queue (a "turnstile"), based on the lock type and the adaptive
heuristic: spin if the holder of the lock is currently running on some
CPU, block otherwise.

Read about the illustrious CMPXCHG instruction. E.g.:
  http://faydoc.tripod.com/cpu/cmpxchg.htm (quick summary),
  http://web.cs.du.edu/~dconnors/courses/comp3361/notes/lock.pdf
    (a poem in slides ;) -- we looked at slide 7)
and compare with the Wikipedia entry on CAS
(http://en.wikipedia.org/wiki/Compare-and-swap).

Note that another x86 instruction used for locking, XCHG, does not need
the LOCK prefix; it locks the bus by default, regardless of whether the
prefix is given.

Note also mutex_owner_running(), also in assembly, called from the C side
(http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/mutex.c),
especially in the body of mutex_vector_enter(), lines 413--455.
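To make the fast path concrete, here is a minimal C sketch (mine, not
illumos') of the logic that lock_prim.s packs into one LOCK CMPXCHG:
compare-and-swap NULL -> curthread, and fall into the slow path on
failure. Names prefixed my_ are hypothetical; the real fast path is
hand-written assembly, not compiled C.

  #include <stdint.h>
  #include <stdatomic.h>

  typedef struct { _Atomic uintptr_t m_owner; } my_mutex_t;

  void my_mutex_vector_enter(my_mutex_t *mp);   /* the C slow path */

  void my_mutex_enter(my_mutex_t *mp, uintptr_t curthread)
  {
      uintptr_t expected = 0;  /* MUTEX_NO_OWNER: adaptive and free */

      /* One atomic instruction when uncontended: if m_owner is NULL,
       * install our thread pointer and return.  On x86 this compiles
       * to LOCK CMPXCHG, just like the assembly fast path. */
      if (atomic_compare_exchange_strong(&mp->m_owner, &expected, curthread))
          return;

      /* Held, has waiters, or is a spin lock (whose memory is never
       * NULL, so the CAS always fails): take the slow path. */
      my_mutex_vector_enter(mp);
  }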
-----------[ mutex_enter()/mutex_exit() for all locks, a matter of style ]---------------

Note that mutex_enter() and mutex_exit() are used for BOTH pure spinlocks
that always spin (as appropriate for a driver's top-half lock) and for
adaptive locks that may spin or go to sleep on a turnstile. So there must
be some trick that encodes, in the opaque memory storage allocated for the
lock---and pointed to by the argument of mutex_enter()---that the lock is
a spinlock and can only spin. This trick must work with both the assembly
hot path of mutex_enter() and the C mutex_vector_enter().

This trick is described in the textbook, and also in the code:
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/sys/mutex_impl.h#40
and is aided by the macros

  96 #define MUTEX_TYPE_ADAPTIVE(lp)  (((lp)->m_owner & MUTEX_DEAD) == 0)
  97 #define MUTEX_TYPE_SPIN(lp)      ((lp)->m_spin.m_dummylock == LOCK_HELD_VALUE)

Namely, the opaque lock memory is seen as a union of two layouts: one for
strict spinlocks, with 0xFF (LOCK_HELD_VALUE) in the lowest byte, and one
for adaptive locks. The type of the lock is set by writing this value into
memory at initialization.

The optimized path of mutex_enter() is thus NEVER going to take a spinlock
in the assembly, because a spinlock's opaque memory is never NULL, and
only a NULL would make CMPXCHG atomically overwrite it with the current
thread's address and thus grab the (unheld) lock:

  if (accumulator == dst) {        // cmpxchgq  %rdx, (%rdi)
      ZF <- 1; dst <- src;         //           src    dst
  } else {                         //        thread    lock address
      ZF <- 0; accumulator <- dst;
  }

A strict spinlock's mutex_enter() therefore always falls through to
mutex_vector_enter(), and spins there.

An adaptive lock's opaque memory would be NULL (MUTEX_NO_OWNER) if the
lock is not held, mutex_impl.h:

  82 #define MUTEX_NO_OWNER ((kthread_id_t)NULL)

or a pointer to the owning thread if held, plus some more bits (such as
the "waiters" bit), using the fact that thread pointers are aligned and
thus always end in three zero bits:

  77 #define MUTEX_WAITERS   0x1
  78 #define MUTEX_DEAD      0x6
  79 #define MUTEX_THREAD    (-0x8)

  81 #define MUTEX_OWNER(lp)    ((kthread_id_t)((lp)->m_owner & MUTEX_THREAD))
  82 #define MUTEX_NO_OWNER     ((kthread_id_t)NULL)

Question: Find examples of spin-only locks in the kernel code!

-----------------[ A CPU's Sense of Self ]-----------------

mutex_vector_enter() checks the CPU's interrupt priority level to make
sure it is not above LOCK_LEVEL (must spin, cannot sleep on a queue) when
trying to acquire an adaptive lock (lines 372--378). mutex_vector_enter()
also updates some statistics for the CPU (line 380).

How does a CPU know which CPU it is? Essentially, from the value that is
set in one of its registers (GS, a segment selector). This value, along
with other values, is set during context switches (this is how a thread
finds its thread-local data, and also how the current process is found in
the context of a system call).

Getting the CPU a thread is running on uses the same trick as other
per-CPU data, via %gs:
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/i86_subr.s

  #define CPU  (curcpup())   /* Pointer to current CPU */

  ENTRY(curcpup)
      movl %gs:CPU_SELF, %eax
      ret

Q.: Where is the spin lock loop? Can you find it in the kernel assembly
(e.g., with "mdb -k")?

Also note the panic and assert conditions, lines 410, 457. Why are they
there?
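Before moving on from locks to CPUs: to make the union trick above
concrete, here is a simplified C rendering of the two overlaid layouts.
It is abridged, and the field names beyond m_owner and m_dummylock may
not match mutex_impl.h exactly; treat it as a sketch and consult the real
header.

  #include <stdint.h>

  #define LOCK_HELD_VALUE 0xFF

  /* Two interpretations of the same 8 bytes of "opaque" lock memory. */
  typedef union {
      struct {
          volatile uintptr_t m_owner;   /* NULL if free; else owner thread
                                         * pointer | waiters/dead bits */
      } m_adaptive;
      struct {
          volatile uint8_t m_dummylock; /* pinned to 0xFF at init: the fast
                                         * path's CMPXCHG(NULL -> curthread)
                                         * can never succeed, so control
                                         * always falls into C code */
          volatile uint8_t m_spinlock;  /* the byte actually spun on */
          uint8_t  m_filler[2];
          uint16_t m_oldspl;            /* saved interrupt priority level */
          uint16_t m_minspl;            /* ipl to raise to while held */
      } m_spin;
  } sketch_mutex_impl_t;

One init-time store of 0xFF thus deflects every subsequent fast-path
acquisition attempt into mutex_vector_enter(), which checks
MUTEX_TYPE_SPIN() and spins on the real lock byte instead.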
See also the definition of cpu_t:
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/cpuvar.h

Suggestion: Trace the use of cpu_t through the kmem allocator, to see how
the various CPU* macros enable non-blocking allocation of kmem cache
objects; observe the limitations of that approach (in particular, the
occasions where global locks get taken).

Finally, note the Big Theory Statement at the top of mutex.c.

======[ Kernel Pre-emption ]======

The Linux kernel was not pre-emptible until 2.6. The story:
  http://www.linuxjournal.com/article/5600
  http://lwn.net/Articles/22912/

A possibly useful discussion of what it means for a kernel to be
pre-emptible:
  http://stackoverflow.com/questions/817059/what-is-preemption-what-is-a-preemtible-kernel-what-is-it-good-for

The Solaris kernel, in contrast, was designed to be pre-emptible:
  https://blogs.oracle.com/sameers/entry/solaris_kernel_is_pre_emptible
(note the test described in this article: a special system call that
triggers a while(1) loop in the kernel).

Kernel implementation:
  http://lxr.free-electrons.com/source/include/linux/preempt.h
As usual, it's piled with macros, so refer to the kernel disassembly to
see how they actually expand (see browsing-linux-kernel-binary.txt for
some tips).

As you read these macros, notice the use of barrier() -- a pure compiler
barrier that keeps GCC from caching memory values in registers or
reordering memory accesses across it. Now that kernel code can be
pre-empted at any instruction just because a timer interrupt fired, it is
important to force such memory ordering. Memory barriers are explained in
this very useful paper:
  http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

======[ Preemption and spinlocks ]======

In a pre-emptible kernel, spinlocks are the only choice for drivers and
other critical operations that cannot block. See Chapter 17 for
explanations. For everything else there are adaptive locks, which choose
either to block or to spin based on a simple heuristic: if the holder of
the lock is running on another CPU, it won't hold the lock for long, and
it makes sense to spin; if the holder is blocked, then blocking is the
only option, because we don't know how long the owner will remain blocked.

Spinlocks on multiprocessor machines ultimately depend on cache coherency
logic to be efficient. Since locking the memory bus on every check of a
state-holding memory byte would likely be too wasteful, and the results of
non-locked reads cannot be relied on to persist even until the next
instruction in the current thread, implementors of locks must deal with
the case where the lock is snatched away just after its state was read and
found "free" (or convince themselves that this can never happen on a
particular platform). Keep this in mind when reading the code.

======[ Linux spinlocks ]======

Spinlocks show off several important issues about efficient
synchronization on real platforms. We will look at an older implementation
of spinlocks in Linux, from the 2.4/2.6-era kernels. Newer Linux kernels
use different, more efficient synchronization primitives. A generic sketch
of the underlying pattern follows; the actual Linux code is linked after
it.
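Here is a minimal, generic test-and-test-and-set loop in C11 -- a sketch
of the pattern, not the Linux code. The inner plain read spins in the
local cache line; only when the lock *looks* free do we retry the
bus-locking exchange, which may still fail because another CPU snatched
the lock in the window after our read.

  #include <stdatomic.h>

  typedef struct { atomic_int locked; } sketch_spinlock_t;

  static void sketch_spin_lock(sketch_spinlock_t *l)
  {
      for (;;) {
          if (atomic_exchange(&l->locked, 1) == 0)
              return;                 /* was free; now it's ours */

          /* Lock was held: spin on plain reads (cache-friendly, no
           * bus locking) until it looks free, then retry the xchg. */
          while (atomic_load_explicit(&l->locked, memory_order_relaxed))
              __builtin_ia32_pause(); /* GCC/x86: "rep; nop", see the
                                       * trivia section below */
      }
  }

  static void sketch_spin_unlock(sketch_spinlock_t *l)
  {
      atomic_store(&l->locked, 0);
  }

Note that correctness never rests on the plain read: the decision to take
the lock is made only by the atomic exchange, which is exactly how the
"snatched away" window is handled.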
The 2.4.37 x86 implementation:
  http://lxr.free-electrons.com/source/include/asm-i386/spinlock.h?v=2.4.37#L126

The actual code is in spin_lock_string:
  http://lxr.free-electrons.com/source/include/asm-i386/spinlock.h?v=2.4.37#L55
(the %0 in this macro string refers to the address of the lock variable,
lock->lock)

Note the trick of emitting the spinlock's actual busy-waiting loop into a
separate code section (see the definition of LOCK_SECTION_NAME,
".text.lock.").

More about the older Linux implementation of spinlocks:
  http://www.cs.fsu.edu/~xyuan/cop5611/spinlock.html
[note how the implementation's choice of atomic instructions changed over
time]

=== Trivia ===

What is this silly do { } while(0) in Linux?
  http://stackoverflow.com/questions/257418/do-while-0-what-is-it-good-for

What is rep nop?
  http://stackoverflow.com/questions/7086220/what-does-rep-nop-mean-in-x86-assembly

GCC extensions used in the kernel source, such as likely() and unlikely(),
and prefetch hints:
  http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/

A handy table of x86 conditional jumps:
  http://www.unixwiz.net/techtips/x86-jumps.html
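A self-contained illustration of the do { } while(0) idiom (my own
example, summarizing the Stack Overflow answer above): it turns a
multi-statement macro into a single statement that composes safely with
if/else.

  #include <stdio.h>
  #include <stdlib.h>

  /* Without the idiom: two statements hiding behind one name. */
  #define RESET_BAD(p)   free(p); (p) = NULL

  /* With the idiom: one statement, which the caller's ';' completes. */
  #define RESET_GOOD(p)  do { free(p); (p) = NULL; } while (0)

  int main(void)
  {
      char *buf = malloc(16);

      if (buf != NULL)
          RESET_GOOD(buf);      /* expands to a single statement */
      else
          puts("allocation failed");

      /* With RESET_BAD here instead, only free() would be guarded by
       * the if, '(p) = NULL;' would run unconditionally, and the
       * 'else' above would not even compile. */
      return 0;
  }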
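And likely()/unlikely(), covered in the IBM article above, are thin
wrappers over GCC's __builtin_expect(). A quick sketch of the definitions
(as in the kernel sources) and a typical hypothetical use:

  #define likely(x)    __builtin_expect(!!(x), 1)
  #define unlikely(x)  __builtin_expect(!!(x), 0)

  int process(int err)
  {
      if (unlikely(err))   /* GCC moves the error path out of line,  */
          return -1;       /* keeping the hot path a straight        */
      return 0;            /* fall-through, friendly to the I-cache  */
  }                        /* and the branch predictor               */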