Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism --- Anderson et al., 1992

Problem: Threads are a natural construct for parallel programs; however, neither user-level nor kernel threads provide the correctness and performance needed to satisfy such applications.

- threads can be supported by either the kernel or user-level libraries;
- kernel threads are the wrong abstraction, and user-level threads usually have problems integrating with other system services, because of the lack of kernel support;
- the effectiveness of parallel computing depends to a great extent on the performance of the primitives available;
- one way to construct a parallel program is to use several processes with shared memory; this is suited for multiprogramming in a uniprocessor environment, but too inefficient for parallel computing;
- user-level threads are flexible, as they can be modified without any kernel changes;
- thread packages view each process as a virtual processor;
- multiprogramming, I/O, page faults and other OS aspects distort the equivalence between virtual and physical processors, generating bad performance and even incorrect behavior;
- if the kernel supports threads, each thread is scheduled separately;
- in the absence of multiprogramming and I/O, user-level threads are one order of magnitude better;
* user-level threads are built on top of kernel threads exactly as they are
* built on top of traditional processes; they have the same performance
* and the same problems;
- their solution is to use a kernel interface and a user-level thread package;
- same performance as ULTs when there is no kernel intervention;
- when there is kernel intervention, like in processor reallocation and I/O, the system mimics the behavior of KLTs;
- the difficulty in achieving these goals in a multiprogrammed multiprocessor is that the necessary control and scheduling information is distributed between the kernel and each application's address space;
* the problem: the kernel needs information about the level of parallelism in
* a process, while the ULTs need information about processor events;
- applications know which processors are currently allocated to them and have complete control over thread scheduling. The kernel controls the distribution of processors to address spaces (processes);
- the kernel notifies the address space's thread scheduler of kernel events affecting the address space;
- the thread system notifies the kernel about operations that can affect processor allocation decisions;
- scheduler activations is the name given to the mechanism used to provide this communication;
- there are significant inherent costs to managing threads in the kernel:
  - the cost of accessing thread management operations: the program must cross an extra protection boundary on every thread operation, even when the threads are in the same address space (see the sketch below);
  - the cost of generality: a single implementation is used by all applications;
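As a rough illustration of why user-level thread operations avoid the protection-boundary cost: a minimal user-level context switch built on POSIX ucontext. This is not the paper's thread system; the thread structure and the round-robin "scheduling" are purely illustrative. The point is that yielding is an ordinary procedure call that saves/restores registers in user space, with no system call.

    /* Minimal user-level threading sketch (illustrative, not from the paper). */
    #include <stdio.h>
    #include <ucontext.h>

    #define NTHREADS 2
    #define STACK_SIZE (64 * 1024)

    static ucontext_t main_ctx, thread_ctx[NTHREADS];
    static char stacks[NTHREADS][STACK_SIZE];
    static int current = 0;

    /* Yield back to the scheduler: pure user-level register save/restore. */
    static void ult_yield(void) {
        swapcontext(&thread_ctx[current], &main_ctx);
    }

    static void worker(void) {
        int me = current;
        for (int i = 0; i < 3; i++) {
            printf("thread %d, step %d\n", me, i);
            ult_yield();                 /* no kernel crossing involved */
        }
    }

    int main(void) {
        for (int i = 0; i < NTHREADS; i++) {
            getcontext(&thread_ctx[i]);
            thread_ctx[i].uc_stack.ss_sp = stacks[i];
            thread_ctx[i].uc_stack.ss_size = STACK_SIZE;
            thread_ctx[i].uc_link = &main_ctx;
            makecontext(&thread_ctx[i], worker, 0);
        }
        /* Trivial round-robin "virtual processor": the process schedules its
         * own threads entirely in user space. */
        for (int round = 0; round < 3; round++)
            for (current = 0; current < NTHREADS; current++)
                swapcontext(&main_ctx, &thread_ctx[current]);
        return 0;
    }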
- they show that ULTs are 10-30 times faster than KLTs, and KLTs are 4-12 times faster than processes;
- problems with kernel threads:
  - they block, resume, and are preempted without notification to the user level;
  - they are scheduled obliviously with respect to the user-level thread state;
- when ULTs are running on top of KLTs, time-slicing can lead to problems, as in the case of spin-locks;
- another problem is that the kernel's and the application's views of priority may differ;
- if a number of ULTs is running on top of a smaller number of KLTs, the KLTs may get blocked, leading to runnable ULTs and idle processors;

Their solution:
--------------
- combines the functionality of kernel threads with the performance and flexibility of user-level threads;
- the OS provides each user-level thread system (ULTSYS) with a virtual multiprocessor, whose number of processors may change over time;
- the kernel allocates processors to address spaces;
- the ULTSYS controls which threads get each processor;
- the kernel informs the ULTSYS when the # of processors changes;
- the kernel informs the ULTSYS when ULTs block or wake up;
- the ULTSYS notifies the kernel when more/fewer processors are needed;
- the programmer sees no difference;
* the increase in performance comes from the fact that there are always
* exactly as many running scheduler activations (vessels for running ULTs) as
* there are processors assigned to the address space;
- the thread's register state is saved by low-level kernel routines (handlers) and is passed to the ULTSYS as part of the upcall;
* interesting point: to execute an upcall, a processor is needed. Therefore,
* in situations where the interruption of a processor does not happen
* inherently, the kernel preempts one of the threads running in the address
* space that needs to be notified. If there is no such active thread, the
* system waits for a processor to be assigned;
- for example, when a processor is taken from an AS, the system preempts another processor running in the same AS and informs it about the TWO preemptions. Why doesn't the system use the first processor to do the upcall? To me, this is because the ULTSYS may decide to change the running thread, given that the first thread was blocked (lost the processor);
- in the case where the ULTSYS decides to preempt a thread in the presence of a higher-priority one, it informs the kernel about the processor to preempt. The kernel then preempts it and does an upcall;
- the interface is well defined, and the kernel needs no knowledge of the data structures used by the ULTSYS (a sketch of this interface appears below);
* an AS notifies the kernel when it has more runnable threads than processors
* or the opposite. Doesn't this generate instability in the system? I mean,
* if the application has its number of runnable threads vary a lot over time,
* too many communications with the kernel will happen. Later in the text: to
* avoid processor reallocations, an idle processor spins for a short period
* before notifying the kernel that it is available;
- to avoid dishonest reports by applications, the OS may penalize ULTSYSs with a big number of processors;
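A rough C rendering of the kernel/ULTSYS communication described above. The upcall points and system calls paraphrase the paper's interface, but the C names, types, and signatures here are illustrative assumptions, not the paper's definitions.

    /* Illustrative scheduler-activations interface; only the events themselves
     * come from the paper, the C names and types are assumptions. */

    typedef int sa_id_t;                            /* scheduler activation identifier */
    typedef struct machine_state machine_state_t;   /* saved registers passed with an upcall */

    /* Upcalls: kernel -> ULTSYS. Each one is delivered by starting a fresh
     * scheduler activation on a processor assigned to the address space. */
    void upcall_add_this_processor(int processor_id);      /* run a ready ULT here */
    void upcall_processor_preempted(sa_id_t preempted,     /* put its ULT back on the ready
                                                               list, pick another to run;
                                                               the kernel may report two
                                                               preemptions at once, as in the
                                                               example above */
                                    machine_state_t *state);
    void upcall_activation_blocked(sa_id_t blocked);        /* e.g., I/O or page fault */
    void upcall_activation_unblocked(sa_id_t unblocked,
                                     machine_state_t *state);

    /* Downcalls: ULTSYS -> kernel (system calls). They only influence processor
     * allocation; the kernel never inspects ULTSYS data structures. */
    void sys_add_more_processors(int how_many);
    void sys_this_processor_is_idle(void);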
- problems (poor performance or deadlock) may occur if a ULT is preempted while executing a critical section, e.g., when holding the ready-list lock (deadlock). In their solution, the ULTSYS checks (after an upcall) whether the preempted thread holds a lock; if so, the thread is continued temporarily, via a user-level context switch;
- the ULTSYS must be able to check whether a preempted thread was holding a lock. One solution is for a thread to set a flag when it enters a critical section; this imposes overhead on lock acquisition and release whether or not a preemption or page fault occurs (a sketch of this flag-based variant appears below). Their solution is to use two copies of every critical section, one used on the common path and the other when a preemption occurs; the second copy contains code to relinquish the processor at the end;
- they used just one application for testing;
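A minimal sketch of the flag-based variant mentioned above (the paper itself prefers duplicated critical-section code precisely to keep this bookkeeping off the common path). The types and helpers here (ult_t, ready_list_add, ult_switch_to) are hypothetical.

    /* Flag-based critical-section check, assuming hypothetical ULTSYS helpers. */
    #include <stdbool.h>

    typedef struct ult {
        bool in_critical_section;   /* set while the thread holds a thread-system lock */
        /* ... saved registers, stack, ready-list links ... */
    } ult_t;

    /* Hypothetical helpers provided elsewhere by the thread system. */
    void ready_list_add(ult_t *t);
    void ult_switch_to(ult_t *t);   /* user-level context switch into t */

    /* Maintaining the flag adds a write on every acquire and release, whether or
     * not a preemption or page fault ever occurs: that is the overhead noted above. */
    static inline void ult_lock(ult_t *self /*, the lock itself */) {
        self->in_critical_section = true;
        /* ... acquire the spin lock ... */
    }
    static inline void ult_unlock(ult_t *self /*, the lock itself */) {
        /* ... release the spin lock; if this thread was being continued
         * temporarily after a preemption, hand the processor back here ... */
        self->in_critical_section = false;
    }

    /* Called from the "processor preempted" upcall handler. */
    void handle_preempted_thread(ult_t *victim) {
        if (victim->in_critical_section) {
            ult_switch_to(victim);      /* continue it until it leaves the section */
        } else {
            ready_list_add(victim);     /* ordinary case: just make it runnable again */
        }
    }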