Linux Scheduling
Overview
The fundamental responsibility of a CPU scheduler is to manage what tasks are run by the system,
at what times they are run, and on what CPU(s) they are run. The CPU scheduler is essentially
responsible for managing the tasks of a system. In order to do this, a CPU scheduler must listen
for all events that might happen to that task, or at least the events that are of interest to the
scheduler, including a task being created, a task being destroyed (exiting), a task starting
execution for the first time, a task going to sleep, a task waking up, a task being preempted by
a higher priority task executing, a task resuming execution from a previous state, and any other
event that the scheduling algorithm relies on.
Every process and thread has a task_struct associated with it in the Linux kernel.
All task_structs except for init_task are allocated dynamically from a kmem_cache pool, which is allocated on a slab by fork_init() in
fork.c
struct task_struct init_task = INIT_TASK(init_task);
init_task is the internal kernel representation of the init process, which is the process used to boot the system and
start all other processes. It does this by calling fork() and then exec() from wi thin the newly forked process.
The memory that contains init_task is statically allocated at compile time within the binary kernel image. It becomes
available when the boot loader loads the kernel into memory.
The only entry point to begin execution of a task is through context_switch() in kernel/sched.c:
static inline void context_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next);
which switches to a task's mm and register state, and also has the effect of switching stacks. The only place
context_switch() is called is from within schedule(), which is the main scheduling function. Tasks can also be move
from one CPU to another to be executed.
do_exit() is called from the system call sys_exit(), or exit() as it is known in user space. These functions are located
in exit.c, and are the counterpart to fork.c and its functions; the code in exit.c, particularly do_exit(), performs task
cleanup before a task terminates. do_exit() does not deallocate the task_struct memory associated with the task that called
it (a task that calls do_exit() is always available through the current pointer). The last thing do_exit() does before
completing is call schedule(). schedule() dequeues the task from the runqueue, and performs a context switch to whatever
task is chosen to run next (which is possibly the idle task). schedule() performs this context switch by calling
context_switch(). Before context_switch() returns, it calls finish_task_switch(). Among other things, finish_task_switch()
checks whether the value in task_struct.state of the task equals TASK_DEAD. If so, it deallocates the task_struct by calling
a function which eventually calls another function and so on, until one of the functions eventually frees the memory associated
with the task_struct to the kmem_cache pool from which task_struct structures are allocated from (a pointer called
task_struct_cachep).
-idle task and cpu_idle() (rq->idle)
-per-CPU task runqueues
-task priorities, scheduling policies
Linux has 140 task priorities total. The priority values as seen from inside the kernel are different than what is
seen from userspace. Both in the kernel and in userspace, a task is run with one of the CPU scheduling policies
available in Linux. In userspace, a task can request to be scheduled using various different task scheduling policies
by making system calls (the task can only use one scheduling policy at any given time). These scheduling policies are
then used by the Linux kernel to associate each task in the system with one of two scheduling classes: real-time or
fair. These scheduling classes (which are actually objects, even though they are called classes) are then responsible
for deciding what to run on the CPU.
Internally, the lower priority values are considered to be higher priorities.
-group scheduling and sched_entity
-rt_mutexes and mutual exclusion trees, priority inversion and priority inheritance (and priority ceiling)
-task_struct->pi_lock
The Linux scheduler supports systems with multiple processors using scheduling domains. A scheduling domain is
simply used to organize the CPUs in the system into a hierarchy for purposes of scheduling. They are intended to
facilitate load balancing across different CPUs in the system, and to facilitate the load balancing to be done
fairly. A scheduling domain is represented by a struct sched_domain. Each sched_domain owns a singly linked list
of cpu scheduling groups, represented by struct sched_group, which divide the CPUs that the domain spans into
groups. These groups of a domain need not correspond to physical CPUs. For example, on either a multi-core or
simultaneous multithreading (SMT) processor, each CPU group may span one or more virtual CPUs, whereas the domain
might span the entire physical CPU. However, since the domains are built hierarchically, domains higher up in the
hierarchy might own a singly linked list of sched_groups, each of which corresponds to one physical CPU. Each of
these CPUs might be represented by a single sched_domain at a lower level in the hierarchy.
-struct root_domain
Entry points to task preemption.
Entry points to task blocking.
Entry points to task activation (runnable but not running).
Entry points to task execution (runnable and running).
Entry points to task stopped, neither blocked nor preempted (cpu migration).
Entry points to task resumed, neither activated nor executed (cpu migration).
Entry points to task dequeueing, not deactivation (cpu migration).
Entry points to task enqueueing, not activation (cpu migration).
sched_assume_ownership()
set_task_ready()
set_task_executing()
set_task_execution_stopped()
set_task_migrating()
set_task_migration_complete()
set_task_preempted()
set_task_resumed()
set_task_blocked()
set_task_unblocked()
set_task_completed()
sched_relinquish_ownership()
effectively track all state changes of a task and control its behavior
only three ways a process can stop execution:
it gets stopped, it blocks, or it completes
only two ways a process can stop execution:
voluntarily, involuntarily
voluntary:
go to sleep, finish execution
involuntary:
gets preempted, gets blocked on I/O or lock, gets migrated to different cpu, timer goes off, gets priority updated because of held rt_mutex
Sporadic Server
To implement SCHED_SPORADIC, all instances of the following events must be detected and handled:
High priority events
Events that occur during high priority execution which must be handled (only applies to SCHED_SPORADIC tasks).
All of these should be guarded by if() checks to verify that a task is running at its high priority before
any of the following actions are performed:
- A task arrives on the runqueue. This can happen as the result of a new task
being scheduled for the first time, or an existing task that was previously blocked waking up.
When this happens, a task's activation time should be posted. A task must be guaranteed to have
spare replenishments if it arrives on the runqueue at high priority. It is not strictly necessary
to guarantee the task has remaining execution capacity, but such a guarantee would probably be a
good idea for performance reasons (otherwise, when the task was selected to run, its timeslice
would expire immediately, and the task would be requeued at low priority). Currently, this is done
in enqueue_task_rt().
- A task is actually selected to run. When this happens, the task's execution
start time should be posted, and a timer should be started to limit the task's consumable runtime
to its allotted runtime (which is its remaining execution capacity). Again, a task selected to run
at its high priority must be guaranteed to have spare, non-pending replenishments (and it would also
be beneficial for performance reasons to guarantee such a task has remaining execution capacity, as
described in the previous paragraph, although again not strictly necessary). This is currently done
in pick_next_task_rt(), which is called from schedule(). Optimization: As an
optimization, when selecting a SCHED_SPORADIC task to run, make sure it has execution capacity
remaining, and that any high priority SCHED_SPORADIC tasks with no execution capacity remaining are
set to their low priority.
- A task blocks (on I/O or by going to sleep). When this happens, the task's
execution capacity should be lowered by the amount which it just consumed (lowering its priority
if necessary) and a replenishment operation should be scheduled. The replenishment should be scheduled
to occur at the task's activation time, plus the task's replenishment period. If this schedules the
task's last remaining replenishment, the task's priority should be lowered. Currently, this is done
in dequeue_task_rt().
- A task gets preempted. When this happens, a task's execution capacity should
be lowered by the amount which it just consumed. The amount charged should be noted and incorporated
into the amount of the next replenishment, when the next replenishment is scheduled. If the task's
execution capacity is set to zero as a result of this operation, the task's priority should
not be lowered, as then when it was eventually rescheduled (if ever), it would be
chosen to run at low priority and would therefore not cause a replenishment to be scheduled since
the task's execution time is not limited when it is running at low priority. Currently, this is
done in check_preempt_curr_rt() (can this be done from put_prev_task_rt()?).
- A task consumes all of its execution capacity (budget). When this happens, a
task should be set to its lower priority, and a replenishment operation should be scheduled. The
task must be guaranteed to have spare replenishments in this situation, as described in the
the first two high priority descriptions above. The replenishment should be scheduled to occur at
the task's activation time, plus the task's replenishment period. A task consuming all of
its execution capacity is one of the two points of transition to low priority for a task running at
high priority (in the original definition of the Sporadic Server, where the limit on number of pending
replenishments for any given task is unbounded, this is the only point of transition). Currently,
this is handled via hrtimers which are scheduled by pick_next_task_rt(), which is called from schedule().
When a task is chosen to run at its high priority, a timer is set to expire exactly when the task's
execution capacity is exhausted. The timer expiration causes hrtick() to be called, which calls
task_tick_rt(). scheduler_tick() also calls task_tick_rt() every jiffy.
- A task schedules its last spare replenishment (Single Unix Specification only).
When this happens, a task must be set to its lower priority until a replenishment is executed
and is therefore made available for future use to prevent the task from using execution capacity
that cannot be recovered. A task scheduling a replenishment which causes the total number
of pending replenishments to become equal to sched_ss_max_repl is the second of two points of
transition from high to low priority in the Single Unix Specification's definition of the Sporadic
Server (in the original definition of Sporadic Server, where the number of pending replenishments
for any given task is unbounded, this point of transition to low priority does not exist).
Low priority events
A SCHED_SPORADIC task running at its low priority behaves like a SCHED_FIFO task, and therefore no
events that occur when a task is running at its low priority require special handling (other than
an asynchronous replenishment operation occuring as described below).
- A task arrives on the runqueue (as a result of a task waking up or having its priority lowered).
- A task is actually selected to run.
- A task blocks (on I/O or by going to sleep).
- A task gets preempted.
No action must be taken for any of these events when a task is running at its low priority.
Events that can occur at both high and low priority
Asynchronous events that can occur at either high priority execution or low priority execution which
must be handled:
- Replenishment operations (described below)
High priority operations
- Setting a task's activation time. Whenever a task becomes available to run at
high priority, its activation time must be set. It must be guaranteed that the task also has
spare replenishments when this occurs.
- Setting a task's execution start time. Whenever a task is selected to run at
its high priority, its execution start time must be set. It must be guaranteed that the task also
has spare replenishments when this occurs.
- Lowering a task's execution capacity. Any time this operation causes a task's
execution capacity to be become less than or equal to zero, the task's priority must be lowered.
A replenishment equal to the amount of execution capacity charged should also be scheduled.
- Scheduling a replenishment. When a replenishment is scheduled, a task must have
remaining spare replenishments. Replenishment events should be scheduled to occur at the execution
start time of the task, plus period of the task. If the replenishment which is scheduled is the
task's last remaining replenishment, the task's priority must be lowered, regardless of its
remaining execution capacity. The task's execution capacity should also be lowered by the amount
of the scheduled replenishment. The replenishment should be scheduled to occur at a time in the
future equal to the task's activation time, plus the task's replenishment period.
Optimization: As an optimization, no replenishment should be scheduled if the
task's execution capacity is equal to its initial budget (in which case the amount charged to
the task's execution capacity must have been zero). This currently occurs in dequeue_task_rt()
when a task's policy is first set to SCHED_SPORADIC, due to the task's state being equal to
TASK_INTERRUPTIBLE when schedule() is called.
- Lowering a task's priority. When a task's priority is lowered, the system must
be forced to reschedule, if necessary.
- Replenishment operation (high priority). Replenishment operations can occur at any
time when a task is set to its high priority (even if it is not currently running or runnable). If
a replenishment occurs when a task is not running, the task's execution capacity is incremented by
the amount associated with the replenishment. If a replenishment occurs when a task is running, the
task's execution capacity is incremented by the amount associated with the replenishment, and the
execution capacity expiration timer currently associated with the task must be adjusted to take into
account the increase in execution capacity.
Low priority operations
- Raising a task's priority. When a task's priority is raised, the system must be
forced to reschedule, if necessary.
- Replenishment operation (low priority). Replenishment operations can occur at any time
when a task is set to its low priority (even if it is not currently running or runnable). When a
replenishment occurs on a task set to its low priority, the task's execution capacity must be
incremented, and the task must be set to its high priority. If the task is running or runnable, the
task's new activation time should be posted. Since a SCHED_SPORADIC task set to its low priority behaves
like a SCHED_FIFO task, this does not require cancelling or updated the expiration times of any timers.
A replenishment operation is the one point of transition to high priority for a task running at
low priority.
Changes Made
The following were changes that were made while implementing Sporadic Server in Linux according to
the guidelines layed out in the Single Unix Specification.
Per-task scheduler information
- struct sched_entity - scheduler execution tracking object, used for CFS group scheduling
- struct sched_rt_entity - scheduler execution tracking object, used for RT group scheduling
- struct __sched_settings - static scheduler configuration
- struct __sched_state - dynamic scheduler state
- struct ss_replenishment_operation - object representing a replenishment operation
SCHED_SPORADIC related changes
- #define SCHED_SPORADIC in sched.h
- updated rt_policy() in sched.c
System interfaces:
- added fields to sched_param
- refactored sched_param usages:
- sched_setscheduler()
- sys_sched_get_priority_max()
- sys_sched_get_priority_min()
- sys_sched_rr_get_interval()
other sched_param related changes:
- sched_idle_next()
- migration_call()
- normalize_task()
Changes to C library:
- #define SCHED_SPORADIC
- struct sched_param
Execution limitation
- modified scheduler_tick()
- modified hrtick()
- modified task_tick_rt()
Event scheduling and task accounting
- modified pick_next_task_rt()
- modified put_prev_task_rt()
- modified enqueue_task_rt()
- modified dequeue_task_rt()
Replenishment API:
- defined struct ss_replenishment_event in sched.h
- implemented __ss_schedule_replenishment() in sched_rt.c
- implemented ss_schedule_replenishment() in sched_rt.c
- implemented ss_replenishment_callback() in sched_rt.c
Task Priority API:
- implemented ss_set_low_priority() in sched_rt.c
- implemented ss_set_high_priority() in sched_rt.c
- implemented change_task_priority_locks_held() in sched.c
SCHED_SPORADIC ss_* task API:
- implemented ss_has_low_priority() in sched_rt.c
- implemented ss_has_high_priority() in sched_rt.c
- implemented ss_has_spare_replenishments() in sched_rt.c
- implemented ss_has_execution_capacity() in sched_rt.c
- implemented ss_set_execution_capacity() in sched_rt.c
- implemented ss_set_execution_capacity_full() in sched_rt.c
- implemented ss_set_execution_capacity_empty() in sched_rt.c
- implemented ss_add_execution_capacity() in sched_rt.c
task_struct support:
- defined struct __sched_settings in sched.h
- defined struct __sched_state in sched.h
- added sched_settings and sched_state fields to task_struct in sched.h
task_struct replenishment pool setup and cleanup:
- exit.c: do_exit() - used to cleanup and deallocate SCHED_SPORADIC per-task replenishment pool
- added p->sched_class->register_task()
- added p->sched_class->unregister_task()
- modified __setscheduler()
- added p->sched_class->set_task_param()
- added p->sched_class->get_task_param()
- added p->sched_class->validate_task_param()*
future changes (tentative)
- p->sched_class->set_curr_task()
- check_class_changed()
- p->sched_class->switched_to()
- p->sched_class->switched_from()
- p->sched_class->prio_changed()
- wake_up_new_task()
- p->sched_class->task_new()
still to do (cleanup):
- handle task preemption
- test SCHED_SPORADIC tasks calling sys_fork() and impact on internal kernel data structures, such as
per-task linked list of replenishments, and task remaining execution capacity
- update ktime_add_ns() call in ss_replenishment_callback() to ns_to_ktime() (added in 2.6.26)
- create replenishment_amount mutator function to avoid setting negative or greater than init_budget
values for replenishment amounts
- create delay mutator function to avoid setting negative or greater than repl_period replenishment
delays
- create accessors and mutators for activation_time, execution_start_time, and num_pending_replenishments
to avoid inconsistent error checking throughout the code, and to centralize all variable checks to one
place
 |
|
 |