Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism --- Anderson et al., 1992

Problem: Threads are a natural construct for parallel programs; however, neither user-level nor kernel threads provide the correctness and performance needed to satisfy such applications.

- threads can be supported by either the kernel or user-level libraries;
- kernel threads are the wrong abstraction, and user-level threads usually have problems integrating with other system services, because of the lack of kernel support;
- the effectiveness of parallel computing depends to a great extent on the performance of the primitives available;
- one way to construct a parallel program is to use several processes with shared memory; this is suited for multiprogramming in a uniprocessor environment, but too inefficient for parallel computing;
- user-level threads are flexible, as they can be modified without any kernel changes;
- thread packages view each process as a virtual processor;
- multiprogramming, I/O, page faults and other OS aspects distort the equivalence between virtual and physical processors, generating bad performance and even incorrect behavior;
- if the kernel supports threads, each thread is scheduled separately;
- in the absence of multiprogramming and I/O, user-level threads are one order of magnitude better;
* user-level threads are built on top of kernel threads exactly as they are
* built on top of traditional processes; they have the same performance
* and the same problems;
- their solution is to use a kernel interface and a user-level thread package;
- same performance as ULTs when there is no kernel intervention;
- when there is kernel intervention, like in processor reallocation and I/O, the system mimics the behavior of KLTs;
- the difficulty in achieving these goals in a multiprogrammed multiprocessor is that the necessary control and scheduling information is distributed between the kernel and each application's address space;
* the problem: the kernel needs information about the level of parallelism in
* a process, while the ULTs need information about processor events;
- applications know which processors are currently allocated to them and have complete control over thread scheduling. The kernel controls the distribution of processors to address spaces (processes);
- the kernel notifies the address space's thread scheduler of kernel events affecting the address space;
- the thread system notifies the kernel about operations that can affect processor allocation decisions;
- scheduler activations is the name given to the mechanism used to provide this communication;
- there are significant inherent costs to managing threads in the kernel:
  - the cost of accessing thread management operations: the program must cross an extra protection boundary on every thread operation, even when the threads are in the same address space (see the sketch below);
  - the cost of generality: a single implementation is used by all applications;
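As a rough illustration of why user-level thread operations avoid the protection-boundary cost: a minimal user-level context switch built on POSIX ucontext. This is not the paper's thread system; the thread structure and the round-robin "scheduling" are purely illustrative. The point is that yielding is an ordinary procedure call that saves/restores registers in user space, with no system call.

    /* Minimal user-level threading sketch (illustrative, not from the paper). */
    #include <stdio.h>
    #include <ucontext.h>

    #define NTHREADS 2
    #define STACK_SIZE (64 * 1024)

    static ucontext_t main_ctx, thread_ctx[NTHREADS];
    static char stacks[NTHREADS][STACK_SIZE];
    static int current = 0;

    /* Yield back to the scheduler: pure user-level register save/restore. */
    static void ult_yield(void) {
        swapcontext(&thread_ctx[current], &main_ctx);
    }

    static void worker(void) {
        int me = current;
        for (int i = 0; i < 3; i++) {
            printf("thread %d, step %d\n", me, i);
            ult_yield();                 /* no kernel crossing involved */
        }
    }

    int main(void) {
        for (int i = 0; i < NTHREADS; i++) {
            getcontext(&thread_ctx[i]);
            thread_ctx[i].uc_stack.ss_sp = stacks[i];
            thread_ctx[i].uc_stack.ss_size = STACK_SIZE;
            thread_ctx[i].uc_link = &main_ctx;
            makecontext(&thread_ctx[i], worker, 0);
        }
        /* Trivial round-robin "virtual processor": the process schedules its
         * own threads entirely in user space. */
        for (int round = 0; round < 3; round++)
            for (current = 0; current < NTHREADS; current++)
                swapcontext(&main_ctx, &thread_ctx[current]);
        return 0;
    }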
- they show that ULTs are 10-30 times faster than KLTs, and KLTs are 4-12 times faster than processes;
- problems with kernel threads:
  - they block, resume, and are preempted without notification to the user level;
  - they are scheduled obliviously with respect to the user-level thread state;
- when ULTs are running on top of KLTs, time-slicing can lead to problems, as in the case of spin-locks;
- another problem is that the kernel's and the application's views of priority may differ;
- if a number of ULTs is running on top of a smaller number of KLTs, the KLTs may get blocked, leading to runnable ULTs and idle processors;

Their solution:
--------------
- combines the functionality of kernel threads with the performance and flexibility of user-level threads;
- the OS provides each user-level thread system (ULTSYS) with a virtual multiprocessor, whose number of processors may change over time;
- the kernel allocates processors to address spaces;
- the ULTSYS controls which threads get each processor;
- the kernel informs the ULTSYS when the # of processors changes;
- the kernel informs the ULTSYS when ULTs block or wake up;
- the ULTSYS notifies the kernel when more/fewer processors are needed;
- the programmer sees no difference;
* the increase in performance comes from the fact that there are always
* exactly as many running scheduler activations (vessels for running ULTs) as
* there are processors assigned to the address space;
- the thread's register state is saved by low-level kernel routines (handlers) and is passed to the ULTSYS as part of the upcall;
* interesting point: to execute an upcall, a processor is needed. Therefore,
* in situations where the interruption of a processor does not happen
* inherently, the kernel preempts one of the threads running in the address
* space that needs to be notified. If there is no such active thread, the
* system waits for a processor to be assigned;
- for example, when a processor is taken from an AS, the system preempts another processor running in the same AS and informs it about the TWO preemptions. Why doesn't the system use the first processor to do the upcall? To me, this is because the ULTSYS may decide to change the running thread, given that the first thread was blocked (lost the processor);
- in the case where the ULTSYS decides to preempt a thread in the presence of a higher-priority one, it informs the kernel about the processor to preempt. The kernel then preempts it and does an upcall;
- the interface is well defined, and the kernel needs no knowledge of the data structures used by the ULTSYS (a sketch of this interface appears below);
* an AS notifies the kernel when it has more runnable threads than processors
* or the opposite. Doesn't this generate instability in the system? I mean,
* if the application has its number of runnable threads vary a lot over time,
* too many communications with the kernel will happen. Later in the text: to
* avoid processor reallocations, an idle processor spins for a short period
* before notifying the kernel that it is available;
- to avoid dishonest reports by applications, the OS may penalize ULTSYSs with a big number of processors;
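A rough C rendering of the kernel/ULTSYS communication described above. The upcall points and system calls paraphrase the paper's interface, but the C names, types, and signatures here are illustrative assumptions, not the paper's definitions.

    /* Illustrative scheduler-activations interface; only the events themselves
     * come from the paper, the C names and types are assumptions. */

    typedef int sa_id_t;                            /* scheduler activation identifier */
    typedef struct machine_state machine_state_t;   /* saved registers passed with an upcall */

    /* Upcalls: kernel -> ULTSYS. Each one is delivered by starting a fresh
     * scheduler activation on a processor assigned to the address space. */
    void upcall_add_this_processor(int processor_id);      /* run a ready ULT here */
    void upcall_processor_preempted(sa_id_t preempted,     /* put its ULT back on the ready
                                                               list, pick another to run;
                                                               the kernel may report two
                                                               preemptions at once, as in the
                                                               example above */
                                    machine_state_t *state);
    void upcall_activation_blocked(sa_id_t blocked);        /* e.g., I/O or page fault */
    void upcall_activation_unblocked(sa_id_t unblocked,
                                     machine_state_t *state);

    /* Downcalls: ULTSYS -> kernel (system calls). They only influence processor
     * allocation; the kernel never inspects ULTSYS data structures. */
    void sys_add_more_processors(int how_many);
    void sys_this_processor_is_idle(void);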
- problems (poor performance or deadlock) may occur if a ULT is preempted while executing a critical section, e.g., when holding the ready-list lock (deadlock). In their solution, the ULTSYS checks (after an upcall) whether the preempted thread holds a lock; if so, the thread is continued temporarily, via a user-level context switch;
- the ULTSYS must be able to check whether a preempted thread was holding a lock. One solution is for a thread to set a flag when it enters a critical section; this imposes overhead on lock acquisition and release whether or not a preemption or page fault occurs (a sketch of this flag-based variant appears below). Their solution is to use two copies of every critical section, one used on the common path and the other when a preemption occurs; the second copy contains code to relinquish the processor at the end;
- they used just one application for testing;
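A minimal sketch of the flag-based variant mentioned above (the paper itself prefers duplicated critical-section code precisely to keep this bookkeeping off the common path). The types and helpers here (ult_t, ready_list_add, ult_switch_to) are hypothetical.

    /* Flag-based critical-section check, assuming hypothetical ULTSYS helpers. */
    #include <stdbool.h>

    typedef struct ult {
        bool in_critical_section;   /* set while the thread holds a thread-system lock */
        /* ... saved registers, stack, ready-list links ... */
    } ult_t;

    /* Hypothetical helpers provided elsewhere by the thread system. */
    void ready_list_add(ult_t *t);
    void ult_switch_to(ult_t *t);   /* user-level context switch into t */

    /* Maintaining the flag adds a write on every acquire and release, whether or
     * not a preemption or page fault ever occurs: that is the overhead noted above. */
    static inline void ult_lock(ult_t *self /*, the lock itself */) {
        self->in_critical_section = true;
        /* ... acquire the spin lock ... */
    }
    static inline void ult_unlock(ult_t *self /*, the lock itself */) {
        /* ... release the spin lock; if this thread was being continued
         * temporarily after a preemption, hand the processor back here ... */
        self->in_critical_section = false;
    }

    /* Called from the "processor preempted" upcall handler. */
    void handle_preempted_thread(ult_t *victim) {
        if (victim->in_critical_section) {
            ult_switch_to(victim);      /* continue it until it leaves the section */
        } else {
            ready_list_add(victim);     /* ordinary case: just make it runnable again */
        }
    }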