========================[ Readings ]========================

Read Chapter 3 of Ruby Under the Microscope. Pay special attention to the
cut-ins that discuss the source code of the Ruby implementation. Follow the
book's discussion with the code open in your favorite editor.

Your objective is to see how Ruby code gets compiled to bytecode from
s-expressions and how the bytecode gets evaluated by the virtual machine,
and to notice the patterns of recursion, handling of different scopes, and
creating closures of functions and blocks over lexical scopes.

It also helps to note the tricks with which master programmers like
Yukihiro Matsumoto organize their C code. You will notice plenty of macros
that treat C pointers as objects or symbols, creating a de-facto type
system and a macro language not unlike LISP. In effect, a project such as
Ruby creates a new language on top of C---deliberately or unwittingly. In
the latter case, as the joke goes,

  Any sufficiently complicated C or Fortran program contains an ad hoc,
  informally-specified, bug-ridden, slow implementation of half of Common
  Lisp.
      -- https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

In fact, many PL researchers chose to simply implement their ideas in
Scheme :)

Compare the style of the Ruby implementation with that of the Emacs Elisp
virtual machine:
https://github.com/emacs-mirror/emacs/blob/master/src/bytecode.c

Read ahead: Chapters 4 and 5.

-------[ Getting Ruby sources & setting up browsing ]-------

Get the sources from https://github.com/ruby/ruby .

I use the ctags and etags programs to browse code in Vim and Emacs
respectively. Both editors let you jump around the tags and return to the
places in the code you detoured from to look up a definition of a macro or
a function. "ctags" is a separate package under MacPorts; etags is a part
of the "emacs" package. The "man ctags" page's "HOW TO" section gives a
brief intro for both editors.

I use Vim and "view" a lot, usually as follows: "alias v=view +'syn on'",
then "v -t function_var_or_macro_name" from the command line to find a tag.
From within Vim/view, pressing Ctrl+] on a name and Ctrl+t to get back work
best for me, but there are many others:
http://vim.wikia.com/wiki/Browsing_programs_with_tags

With Emacs, I use M-. and M-* , occasionally C-u M-. (but I have to look it
up as often as not). If you wonder why "go see a tag" is M-. and "go back"
is M-*, so do I :) At one point it occurred to me that a useful mnemonic
might be to think of an archery arrow facing away from you, with . for the
tip and * for the fletching. It works for me :)

To index specific files, do "ctags *.h *.c" or some such. To index an
entire project directory recursively, do "ctags -R ." in it. Use etags for
Emacs :)

Another good command-line tool is cscope. If you prefer using your own web
browser, then OpenGrok + a local webserver are a good bet. If you find a
free GUI that's nice and not heavy like Eclipse and does nice tab
completion (unlike Visual Studio), let me know :)

----------[ Getting the .inc files generated ]-------

The actual code for VM bytecode execution is only generated when you "make"
ruby. In Ruby's source directory, see README.md; I had to do "autoconf",
then "./configure", then "make". The .inc files such as vm.inc get made
(with Ruby! I had to switch my default old ruby 1.8 to a newer one for it
to succeed) among the first. You need not wait till the end of the
compilation if you only want to browse the code, not build and install
ruby2.4.
=================[ Why not just eval the AST? ]=================

In principle, an interpreted language could work straight off of the parsed
AST, evaluating it recursively from the top: call the evaluator function on
the top node. If it represents a value, return it. If it represents some
operation on the child nodes, call the evaluator recursively to evaluate
the child nodes and replace them with the resulting values, then perform
the operation.

Ruby 1.8 did just that, and so did early LISPs. This approach works, and
the evaluator functions for it come out naturally recursive & nice;
however, such recursion can be quite heavy, because it essentially
traverses the AST both from the top down and then back upward. Thus, faster
representations of code were eventually devised.

Still, recursive evaluation of the AST is a fundamental programming pattern
that you must master---because the very same pattern is now used by
compilers, as Chapter 3 of the Ruby book explains. It's easier to learn it
in LISP, though (also, see the above point about Scheme long being the
language of choice for PL research).

Suggestion: Write such a recursive evaluator for arithmetic expressions
given as S-expressions, in LISP. For example:

  (arithm-eval '(+ 2 (* 2 3)))  => 8
  (arithm-eval 1)               => 1
  (arithm-eval '(- 5 3))        => 2
  etc.

It should be a very simple function with a few cases for the actual
arithmetic operations, and writing such a function should come easily to
you. Remember, trees and recursion are made for each other: you almost
never need to write a function that does something for the whole tree; you
only need to handle a node and call the function on its immediate children
if any---and then take some action on the return values if needed.

Suggestion: write a (recursive) function that takes s-expressions ("sexps",
for short) of arithmetic expressions as above, and prints the corresponding
arithmetic expression in the ordinary notation. For example:

  (arithm-print '(+ 2 (* 2 3)))  => 2 + 2 * 3
  (arithm-print 1)               => 1
  (arithm-print '(- 5 3))        => 5 - 3
  etc.

Note that this function will be somewhat similar to _flatten_ from OnLisp's
Fig. 4.3 if done efficiently---except flatten gets the symbols in the wrong
order.

Note that an "arithm-read" that would take strings of ordinary infix-syntax
arithmetic operations and make a sexp out of them _with the proper order of
operation precedence_ is a more complex problem than either of the above.
Bison and similar tools solve it with variations of what's known as the
shift-reduce algorithm:
http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html
(or https://en.wikipedia.org/wiki/Shift-reduce_parser)

Personally, I prefer the Reverse Polish notation; it's a bytecode ready for
execution by a stack machine ( "3 2 * 2 +" , read and execute left to
right; you can think of execution as pushing each number onto a stack, and,
whenever you encounter an operation, consuming the two top tokens from the
stack and replacing them with the op's result). My first programmable
computer was a clone of the Texas Instruments microcalculator that did just
that; the top of the stack was what showed on the calculator's numeric
display. It was glorious :)

Suggestion: write a (recursive) evaluator for a list of tokens representing
a Reverse Polish notation arithmetic expression, e.g.:

  (rev-polish-eval '(3 2 * 2 +))  => 8

The "accumulator" argument in this case will hold the current contents of
the stack.
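For reference, here is one minimal Common Lisp sketch of what these three
functions could look like---one possible shape, not the only one; the
arithm-print sketch handles binary operators only and ignores precedence
and parentheses, matching the example output above:

;; A number evaluates to itself; otherwise, recursively evaluate the
;; children, then apply the operator symbol to the resulting values.
(defun arithm-eval (expr)
  (if (atom expr)
      expr
      (apply (car expr) (mapcar #'arithm-eval (cdr expr)))))

;; Print an arithmetic sexp in ordinary infix notation.
(defun arithm-print (expr)
  (if (atom expr)
      (format t "~a" expr)
      (progn (arithm-print (cadr expr))
             (format t " ~a " (car expr))
             (arithm-print (caddr expr)))))

;; Evaluate a list of Reverse Polish tokens; the optional STACK argument
;; is the accumulator holding the stack machine's stack.
(defun rev-polish-eval (tokens &optional stack)
  (cond ((null tokens) (car stack))
        ((numberp (car tokens))              ; a number: push it
         (rev-polish-eval (cdr tokens) (cons (car tokens) stack)))
        (t                                   ; an op: pop two, push the result
         (rev-polish-eval (cdr tokens)
                          (cons (funcall (car tokens) (cadr stack) (car stack))
                                (cddr stack))))))

CL-USER> (arithm-eval '(+ 2 (* 2 3)))     ; => 8
CL-USER> (rev-polish-eval '(3 2 * 2 +))   ; => 8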
=================[ So why bytecode and JITs? ]=================

Compare this with a CPU's linear fetch-decode-execute traversal of its
instruction stream. If any part of the code executes in a tight
loop---which is not infrequent in real programs---the up-and-down cost of
evaluating the AST is paid on every iteration. For the economics facet of
CS, this is too much---and this part happens to pay many PL researchers'
salaries.

Thus the solution is to push as much of the work needed to "interpret" the
program---i.e., to translate it to faster, simpler operations---up front,
and then, in the tight loops, to only run the results of this translation.
It turns out that the gains of translation can be huge (and this is,
perhaps, one of the most important lessons of PL practice to date).

This is why the recursive patterns now mostly apply to _compilation_ of
ASTs rather than their direct execution. The ASTs are compiled into
simpler-to-execute stack machine bytecode, such as in Elisp, Ruby's YARV,
Python's, etc. The compilation recursively visits all of the AST's nodes,
as described in the Ch. 3 treatment of compile.c and iseq_compile_each() on
p. 42. This is what the book means when it speaks of the tight relationship
between the compiler and the VM. (A toy version of this AST-to-bytecode
step, using the arithmetic sexps from the previous section, is sketched
below.)

The only hard limits here are the CPU's speed at its own elementary
instructions, and---most importantly---the limits of knowledge about values
that we have at translation time vs run time. For example, when we make
functions on the fly, like the lambdas returned by _make-adder_ and
_make-multiplier_, we *cannot* know the state (e.g., the values of
variables) that these functions would close over---that state will only be
prepared/computed at runtime.

Thus, a major challenge of PL has been to push as much program
interpretation work as possible forward of the tight loops, to "front-load"
it up to the limit of the knowledge that we cannot have until actual
runtime. We compile as much of a program as possible to either bytecode or
native code, and parametrize this code with what we don't know. This
parametrization can take the shape of referencing these unknown-till-runtime
values in some table, stack frame, or environment struct, where they reside
at known positions or offsets; it can also take the shape of
references-by-name in a hash table (such as with LISP's global symbol
tables for dynamic scoping). Whatever it is, its point is to separate what
we certainly know at compile time---plus whatever clues the programmer
provides---from those things we cannot possibly know.

The bytecodes of VMs such as Elisp's, Perl's, Python's, Ruby's, and Java's
various implementations, from the stack-based JVM to Android's
register-based Dalvik and the newer ART, are really bets on this
compile-vs-runtime separation and parametrization, taken on their
designers' intuition.

Other modern environments such as Node.js actually de-emphasize the
language clues in favor of just-in-time (JIT) compilation of whatever code
seems to be in most need of it---e.g., they compile the code or bytecode
snippets that appear to be executed repeatedly into native binary code such
as x86. With enough repetition, the cost of this compilation is amortized;
also, the actual execution environment may feature more constants than
variables, and may also exclude some execution paths, which makes for
simpler native assembly. This approach bets on the runtime being able to
better guess where a program's bottlenecks are than the programmer; we'll
see how this bet pans out.
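To make the compile-then-execute split concrete with the toy arithmetic
sexps from the previous section, here is a minimal sketch (the name
arithm-compile is mine, for illustration) of a recursive "compiler" that
flattens the AST into Reverse Polish tokens, which the rev-polish-eval
stack machine sketched above can then run as many times as needed without
revisiting the tree:

;; Emit code for the operands first (recursively), then the operator---
;; the postorder walk that naturally produces stack-machine code.
(defun arithm-compile (expr)
  (if (atom expr)
      (list expr)                                   ; a literal: push it
      (append (mapcan #'arithm-compile (cdr expr))  ; code for the children
              (list (car expr)))))                  ; then the operator

CL-USER> (arithm-compile '(+ 2 (* 2 3)))
(2 2 3 * +)
CL-USER> (rev-polish-eval (arithm-compile '(+ 2 (* 2 3))))
8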
These JITs are getting very fast; the problem is debugging them, or your
code, when something goes wrong. In the meantime, understanding the
language's internal VM bytecode remains important---and this is what we are
trying for with LISP and Ruby.

=================[ Tail-call optimization and Tail Recursion ]=================

Recursion is great to write, as it follows the natural structure of the
data, such as lists or trees, and thus minimizes the amount of code and
helps eliminate bugs in "utility" code like for-loop counter handling by
eliminating the need for that code---but it tends to cost.

There is one form of recursion, though, where its primary cost---creating
stack frames and traversing them backwards---can be optimized away. It is
called tail recursion. In tail recursion, the last operation of a recursive
function is the recursive call to itself, without any operations left to
perform or any values left to look up for these operations in the current
stack frame. In this case, climbing back up the stack has no point, and the
recursion ends with the last recursive call. This means that the stack need
not keep any local information in the function's frames, nor does it need
to keep the return addresses except the very first one from where the
function was called. This, in turn, means that the function's call frames
need not be stacked, but can instead simply reuse the same frame, in which
the return address is saved from the first call and copied onward. A
tail-recursive function, so optimized, can never run out of stack space.

This is the nature of Tail Call Optimization (TCO). Where a non-TCO program
runs out of stack, a TCO-optimized program will succeed, with the help of
the TCO-performing compiler or interpreter.

---------------[ A tail-recursive factorial ]---------------

Our example in class was the factorial in Steel Bank Common Lisp (SBCL).
Consider

CL-USER> (defun fact (x)
           (cond ((< x 2) 1)
                 (t (* x (fact (- x 1))))))

This function fails for a sufficiently large argument, as it is not
tail-recursive. After its call to self returns, there's still work to be
done: multiplying by the saved value of x. This needs both space to store
x, and space to store the position in the pending computation
(multiplication by x) to be returned to.

CL-USER> (fact 100000)

Control stack guard page temporarily disabled: proceed with caution
; Evaluation aborted on #.

It's actually quite amazing that it succeeds for (fact 10000) !

But the trick of using the accumulator (see OnLisp Fig. 4.3) allows us to
rewrite this function tail-recursively, by adding one argument, the
accumulator, which saves all the state that we want to pass forward. Then
backing up through the frames becomes unnecessary, and the SBCL compiler
can apply its TCO:

CL-USER> (defun fact (x res)
           (cond ((< x 2) res)
                 (t (fact (- x 1) (* res x)))))
WARNING: redefining COMMON-LISP-USER::FACT in DEFUN

Then

CL-USER> (fact 100000 1)

succeeds (but slowed down my Emacs to a crawl---indeed, the resulting
factorial, written as a decimal number, is over 450K digits long, all in
one line!)

Scheme mandates that all compliant implementations MUST be properly
tail-recursive. Common Lisps differ in this: read
http://0branch.com/notes/tco-cl.html for SBCL and GCL in particular. Python
repudiates TCO in principle, on the grounds that it complicates debugging
by collapsing the stack traces (of course, TCO is exactly about eliminating
unneeded stack frames!). Ruby takes a middle-ground position: it doesn't
mandate TCO, but allows and optionally implements it.
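As a small aside (my sketch, not from the book): the extra accumulator
argument can be hidden behind an &optional default of 1, so callers can
still write (fact 100000) with a single argument; the recursive call
remains in tail position, so the same optimization should still apply. This
mirrors the Ruby definition with a default argument in the next section.

CL-USER> (defun fact (x &optional (res 1))   ; res is the accumulator
           (cond ((< x 2) res)
                 (t (fact (- x 1) (* res x)))))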
---------------[ TCO in Ruby, by request to internals ]---------------

By default, Ruby has no TCO (but Ruby 2.3 has an impressively deep stack,
which suffices at least up to fact(10000)).

$ cat fact.rb
def fact n, acc = 1
  if n.zero?
    acc
  else
    fact n - 1, acc * n
  end
end

puts fact 1000

$ ruby2.0 fact.rb
40238726007709377354370243392300398571937486421071463254379991042993851239862902059204420848696940480047998861019719605863166687299480855890132382966994459099742450408707375991882362772718873251977950595099527612087497546249704360141827809464649629105639388743472000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

$ cat fact.rb
def fact n, acc = 1
  if n.zero?
    acc
  else
    fact n - 1, acc * n
  end
end

puts fact 10000

$ ruby2.0 fact.rb
fact.rb:2: stack level too deep (SystemStackError)
$ ruby2.3 fact.rb

However, when we go to 1000000, even that fails:

$ ruby2.3 fact.rb
fact.rb:5:in `fact': stack level too deep (SystemStackError)
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
         ... 10067 levels...
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:5:in `fact'
        from fact.rb:9:in `<main>'
But not if we ask Ruby to do TCO explicitly!

$ cat tco.rb
# http://stackoverflow.com/questions/824562/does-ruby-perform-tail-call-optimization
# See also http://nithinbekal.com/posts/ruby-tco/
# and http://blog.tdg5.com/tail-call-optimization-in-ruby-deep-dive/

source = <<-SOURCE
def fact n, acc = 1
  if n.zero?
    acc
  else
    fact n - 1, acc * n
  end
end
fact 100000
SOURCE

# This should print a very large number. If you don't set the TCO option
# to true, you should get the "stack too deep" error.
i_seq = RubyVM::InstructionSequence.new source, nil, nil, nil,
          tailcall_optimization: true, trace_instruction: false
#         ^^^^^^^^^^^^^^^^^^^^^^^^^^
# Indeed, prints "#" for 2.0--2.2
# For 2.3, prints the number for 10,000 anyway, but for 100000
# behaves as above

#puts i_seq.disasm

begin
  value = i_seq.eval
  p value
rescue SystemStackError => e
  p e
end

$ ruby2.3 tco.rb
$ ruby2.3 tco.rb | wc
       1       1  456575

The same works for 2.0:

$ ruby2.0 tco.rb | wc
       1       1  456575

Read the above links for the technical details of the TCO implementation;
note the differences between SBCL and GCL in the Common Lisp analysis of
TCO.