======[[ Solaris/Illumos' adaptive locks ]]======

Reading: Chapter 17, especially 17.5 (see below for other material)

OpenSolaris optimizes its mutexes based on two assumptions:

1. Critical sections are short, and once entered by a CPU, will be over
   very soon (faster than the context switches involved in blocking a
   waiting thread on a turnstile so that some other thread can run, and
   then waking it up). Hence the idea that a thread should spin if the
   mutex is held by a thread currently running on a CPU, and block
   otherwise.

2. Most mutexes in the kernel can be adaptive (i.e., can block without
   causing scheduling trouble), are not hotly contested, and therefore
   most threads in most cases will find a mutex BOTH *adaptive* and
   *not taken*. Hence this path -- adaptive && not taken -- and the
   mutex implementation data structure (see the mutex_impl union in
   mutex_impl.h) are *co-optimized* in assembly down to a single
   lock-ed (atomic) instruction and a return:

   http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/lock_prim.s#512 -- comment
   http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/lock_prim.s#558 -- code

Note: according to the AMD64 calling convention, RDI holds the first
argument to a function, which for mutex_enter() is the pointer to the
mutex data structure from mutex_impl.h.

Question: what happens if kernel pre-emption hits on this path?

======[ Kernel Pre-emption ]======

What it means for a kernel to be preemptive:
http://stackoverflow.com/questions/817059/what-is-preemption-what-is-a-preemtible-kernel-what-is-it-good-for

The Linux kernel was not pre-emptable until 2.6. The story:
http://www.linuxjournal.com/article/5600
http://lwn.net/Articles/22912/
http://lxr.linux.no/#linux+v2.6.32/include/linux/preempt.h -- kernel implementation

The Solaris kernel, in contrast, was designed to be pre-emptable:
https://blogs.oracle.com/sameers/entry/solaris_kernel_is_pre_emptible

======[ Preemption and spinlocks ]======

In a pre-emptable kernel, spinlocks are the only choice for drivers and
other critical operations that cannot block. See Chapter 17 for
explanations. For everything else there are adaptive locks, which
choose to either block or spin based on a simple heuristic: if the
holder of the lock is running on another CPU, it will likely release
the lock soon, and it makes sense to spin; if the holder is blocked,
then blocking is the only option, because we don't know how long the
holder will remain blocked.

Spinlocks on multiprocessor machines ultimately depend on cache
coherency logic to be efficient. Locking the memory bus on every check
of a state-holding memory byte would be too wasteful, yet the result of
a non-locked read cannot be relied on to persist even until the next
instruction in the current thread. So implementors of locks must deal
with the case when the lock is snatched away just after its state was
read and found "free" (or convince themselves that this can never
happen on a particular platform). Keep this in mind when reading the
code; both points are illustrated in the sketches below.
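First, a minimal C sketch of the adaptive-mutex fast path described
above. This is not the actual illumos code (which is hand-written
assembly in lock_prim.s), just the idea: the lock word doubles as the
owner pointer, so "adaptive && not taken" reduces to one
compare-and-swap of the current thread pointer into a zero lock word.
The names my_adaptive_mutex_t, cur_thread(), and
my_mutex_vector_enter() are made up for illustration.

    #include <stdint.h>

    typedef struct kthread kthread_t;      /* stand-in for the real kthread_t */
    extern kthread_t *cur_thread(void);    /* stand-in for curthread */

    typedef struct {
        /*
         * 0 when free; otherwise the owner's thread pointer, possibly
         * with low "waiters" bits OR-ed in (cf. the union in mutex_impl.h).
         */
        volatile uintptr_t m_owner;
    } my_adaptive_mutex_t;

    void my_mutex_vector_enter(my_adaptive_mutex_t *);  /* the C slow path */

    void
    my_mutex_enter(my_adaptive_mutex_t *mp)
    {
        uintptr_t expected = 0;

        /*
         * The fast path: a single atomic CAS. If the mutex is free,
         * this one lock-ed instruction makes us the owner; return.
         */
        if (__atomic_compare_exchange_n(&mp->m_owner, &expected,
                (uintptr_t)cur_thread(), 0,
                __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return;

        /* Taken, or not a simple adaptive mutex: hand off to C. */
        my_mutex_vector_enter(mp);
    }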
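Second, a sketch of a test-and-test-and-set spinlock showing the race
just described: the plain read may find the lock "free", yet another
CPU can snatch it before our atomic swap executes, so only the atomic's
result is trusted. The type and function names are made up; the
__atomic_* builtins are GCC/Clang.

    typedef struct { volatile int locked; } my_spinlock_t;  /* made-up type */

    static void
    my_spin_lock(my_spinlock_t *lp)
    {
        for (;;) {
            /*
             * Test-and-set: atomically swap in 1. The OLD value tells
             * us whether we actually got the lock.
             */
            if (__atomic_exchange_n(&lp->locked, 1, __ATOMIC_ACQUIRE) == 0)
                return;                 /* it was 0: the lock is ours */

            /*
             * Spin on plain reads, staying within our own cache line
             * and off the bus. Seeing 0 here proves nothing: the lock
             * may be snatched between this read and our next atomic
             * exchange -- which is why the exchange above is what
             * decides, not this read.
             */
            while (lp->locked)
                __asm__ __volatile__("rep; nop");   /* the PAUSE hint */
        }
    }

    static void
    my_spin_unlock(my_spinlock_t *lp)
    {
        __atomic_store_n(&lp->locked, 0, __ATOMIC_RELEASE);
    }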
A few details in preparation.

Getting the CPU a thread is running on (recall that %gs is used to hold
"per-thread context" data):
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/i86_subr.s

    ENTRY(curcpup)
    movl %gs:CPU_SELF, %eax
    ret

    #define CPU  (curcpup())    /* Pointer to current CPU */

See also the definition of cpu_t:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/cpuvar.h

Suggestion: Trace the use of cpu_t and the various CPU* macros through
Vmem, to see their use in non-blocking allocation of Kmem cache
objects; observe the limitations of that approach (in particular, the
occasions where global locks get taken).

======[ Linux spinlocks ]======

Spinlocks show off several important issues about efficient
synchronization on real platforms. We will look at an older
implementation of spinlocks in Linux, as of the 2.6 kernel. Newer Linux
kernels use different, more efficient synchronization primitives.

http://lxr.linux.no/#linux-bk+v2.6.11.5/include/asm-i386/spinlock.h
(see spin_lock_string, line 45)

Note the 64-bit implementation's trick of emitting the spinlock's
actual busy-waiting loop into a separate section:
http://lxr.linux.no/#linux-bk+v2.6.11.5/include/asm-x86_64/spinlock.h (line 49)
+
http://lxr.linux.no/#linux-bk+v2.6.11.5/include/linux/spinlock.h
(see the definition of LOCK_SECTION_NAME, ".text.lock.")

More about the older Linux implementation of spinlocks:
http://www.cs.fsu.edu/~xyuan/cop5611/spinlock.html
[note how the implementation's choice of atomic instructions changed
over time]

=== Trivia ===

What is this silly do { } while(0) in Linux?
http://stackoverflow.com/questions/257418/do-while-0-what-is-it-good-for

What is rep nop?
http://stackoverflow.com/questions/7086220/what-does-rep-nop-mean-in-x86-assembly

GCC extensions used in the kernel source, such as likely() and
unlikely(), and prefetch hints:
http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/
(a short illustration of these idioms appears further below)

A handy table of x86 conditional jumps:
http://www.unixwiz.net/techtips/x86-jumps.html

=== Regarding the LOCK prefix and the F00F bug ===

For a discussion of caches and memory access costs, see
http://www.infoq.com/presentations/click-crash-course-modern-hardware
-- memory cache details start at the 36 min mark.

See the links in http://www.cs.dartmouth.edu/~sergey/cs258/2009/f00f-bug.txt
Unfortunately, the great x86.org article is no longer available, and
its old URL leads to a glitzy but vacuous site. But the article
survives at
http://www.rcollins.org/ddj/May98/F00FBug.html and
http://www.rcollins.org/Errata/Dec97/F00FBug.html !

=====[ Solaris' Adaptive Locks ]=====

Described in Ch. 17 of the textbook, especially in 17.5.

Excerpts from the old Sun site that covered the same material:
Adaptive locks:
http://sunsite.uakom.sk/sunworldonline/swol-09-1999/swol-09-insidesolaris.html
Turnstiles:
http://sunsite.uakom.sk/sunworldonline/swol-08-1999/swol-08-insidesolaris.html

Trace examples of actual lock usage. For example, consider the fork*()
family of system calls. To create a new process in the process table,
they must lock both the individual proc_t structs, to read and update
them, and the pointer links between these proc_t structs (these links
make up the process table's various structures via which it is
traversed: doubly linked lists, sibling lists, parent chains, hash
tables, and so on).
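(Returning briefly to the Trivia items above, a short illustration of
those idioms. The likely()/unlikely() definitions are the kernel's
actual ones; MY_DEBUG_LOG and the extern helpers are made up.)

    extern void record_timestamp(void);
    extern void write_log(const char *msg);
    extern void proceed(void);
    extern void handle_rare_error(void);

    /*
     * do { } while (0) turns a multi-statement macro into a single
     * statement, so it composes safely with an unbraced if/else:
     */
    #define MY_DEBUG_LOG(msg)       \
        do {                        \
            record_timestamp();     \
            write_log(msg);         \
        } while (0)

    /* The kernel's actual definitions: branch-layout hints for GCC. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    void
    example(int debugging, int err)
    {
        if (debugging)
            MY_DEBUG_LOG("entering");   /* a bare { } block here would
                                           orphan the else below */
        else
            proceed();

        if (unlikely(err))              /* tell GCC the error path is cold */
            handle_rare_error();
    }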
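Before digging into fork.c itself, here is a hedged, heavily condensed
sketch of the locking pattern described above. The lock and field names
(pidlock, p_lock, p_parent, p_child, p_sibling) are real illumos ones,
but this function is illustrative only; the actual getproc()/cfork()
code is considerably more involved.

    #include <sys/proc.h>    /* proc_t and its linkage fields */
    #include <sys/mutex.h>   /* kmutex_t, mutex_enter(), mutex_exit() */

    /*
     * Illustrative only: roughly the linking step that fork performs
     * under pidlock when adding a child to the process table.
     */
    static void
    link_child_into_table(proc_t *parent, proc_t *child)
    {
        /*
         * pidlock is the global mutex guarding the table-wide linkage:
         * parent/child/sibling chains, the active-process list, etc.
         */
        mutex_enter(&pidlock);
        child->p_parent  = parent;
        child->p_sibling = parent->p_child;   /* push onto sibling list */
        parent->p_child  = child;
        mutex_exit(&pidlock);

        /* Fields private to one process are guarded by its own p_lock. */
        mutex_enter(&child->p_lock);
        /* ... read/update this proc_t's own fields ... */
        mutex_exit(&child->p_lock);
    }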
Locate these locks being taken (mutex_enter()) and released
(mutex_exit()) in
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/fork.c

Note the _global_ locks: pidlock, pr_pidlock, and the array proc_lock
(http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/pid.c ,
lines 77-80; proc_lock is the trickiest). What do they protect?
[integrity of which data structures?]

Note that modern UNIX systems have many flavors of fork() to reduce
copying (and locking) of the process metadata (such as "struct as",
threads, etc.). See the man page(s) for vfork, vforkx, fork1, forkall
and how they differ, and observe the actual implementation in, e.g.,
forksys() (fork.c, line 108).

=== Implementation ===

mutex_enter() is defined in native assembly for each supported
architecture. For x86:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/lock_prim.s
After handling the simplest cases (see the comment at line 514), this
code hands off to C code, mutex_vector_enter(), for the more complex
ones.

Read about the illustrious CMPXCHG instruction. E.g.:
http://faydoc.tripod.com/cpu/cmpxchg.htm (quick summary),
http://web.cs.du.edu/~dconnors/courses/comp3361/notes/lock.pdf (a poem in slides ;))
and compare with the Wikipedia entry on CAS
(http://en.wikipedia.org/wiki/Compare-and-swap). A C-level sketch of
CAS semantics appears below.

Note also mutex_owner_running(), also in assembly, called from the C
side
(http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/mutex.c),
especially in the body of mutex_vector_enter(), lines 413-455.

Q.: Where is the spin loop? Can you find it in the kernel assembly
(e.g., with "mdb -k")? (A sketch of the slow-path logic follows below.)

Also note the panic and assert conditions, lines 410, 457. Why are they
there?

Finally, note the Big Theory Statement at the top of mutex.c.
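The semantics of a LOCK-prefixed CMPXCHG are exactly those of
compare-and-swap. They can be written out as C that must be imagined to
execute as one indivisible step (a sketch, not real code; in hardware
the whole sequence is a single atomic instruction):

    #include <stdint.h>

    /*
     * What "lock cmpxchg" does, atomically. In the mutex fast path,
     * expected == 0 (free) and desired == curthread.
     */
    static int
    compare_and_swap(volatile uintptr_t *addr, uintptr_t expected,
        uintptr_t desired)
    {
        /* --- one atomic, indivisible step begins --- */
        if (*addr == expected) {
            *addr = desired;
            return 1;       /* swap happened: we won the race */
        }
        return 0;           /* someone beat us; *addr is unchanged */
        /* --- atomic step ends --- */
    }

On x86, GCC's __atomic_compare_exchange_n builtin (used in the sketches
earlier in these notes) compiles down to exactly this instruction.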
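As a hint toward the "where is the spin loop?" question, here is a
hedged C condensation of the decision logic of mutex_vector_enter(),
reusing the made-up names from the fast-path sketch at the top of these
notes. The real code in mutex.c also keeps spin statistics, handles
spin locks, and manages the waiters bit carefully via turnstiles.

    /* Made-up helpers standing in for the real routines: */
    extern int  my_owner_running(my_adaptive_mutex_t *);    /* owner on a CPU? */
    extern void my_turnstile_block(my_adaptive_mutex_t *);  /* sleep; woken by exit */

    void
    my_mutex_vector_enter(my_adaptive_mutex_t *mp)
    {
        for (;;) {
            /* Retry the fast path: CAS our thread into a free lock word. */
            uintptr_t expected = 0;
            if (__atomic_compare_exchange_n(&mp->m_owner, &expected,
                    (uintptr_t)cur_thread(), 0,
                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
                return;

            /*
             * The adaptive heuristic. Owner running on another CPU:
             * it should release the lock soon, so spin -- this busy
             * loop is the spin loop the question asks about.
             */
            if (my_owner_running(mp)) {
                __asm__ __volatile__("rep; nop");   /* PAUSE, then retry */
                continue;
            }

            /*
             * Owner is off-CPU: spinning could last indefinitely, so
             * block on a turnstile until mutex_exit() wakes us.
             */
            my_turnstile_block(mp);
        }
    }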