705 lines
30 KiB
HTML
705 lines
30 KiB
HTML
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
||
|
"http://www.w3.org/TR/html4/loose.dtd">
|
||
|
<html>
|
||
|
<head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
|
||
|
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
|
||
|
|
||
|
<p>August 8, 2017</p>
|
||
|
<p>This article was contributed by Paul E. McKenney</p>
|
||
|
|
||
|
<h3>Introduction</h3>
|
||
|
|
||
|
<p>This document gives a rough visual overview of how Tree RCU's
|
||
|
grace-period memory ordering guarantee is provided.
|
||
|
|
||
|
<ol>
|
||
|
<li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
|
||
|
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
|
||
|
<li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
|
||
|
Tree RCU Grace Period Memory Ordering Building Blocks</a>
|
||
|
<li> <a href="#Tree RCU Grace Period Memory Ordering Components">
|
||
|
Tree RCU Grace Period Memory Ordering Components</a>
|
||
|
<li> <a href="#Putting It All Together">Putting It All Together</a>
|
||
|
</ol>
|
||
|
|
||
|
<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
|
||
|
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>
|
||
|
|
||
|
<p>RCU grace periods provide extremely strong memory-ordering guarantees
|
||
|
for non-idle non-offline code.
|
||
|
Any code that happens after the end of a given RCU grace period is guaranteed
|
||
|
to see the effects of all accesses prior to the beginning of that grace
|
||
|
period that are within RCU read-side critical sections.
|
||
|
Similarly, any code that happens before the beginning of a given RCU grace
|
||
|
period is guaranteed to see the effects of all accesses following the end
|
||
|
of that grace period that are within RCU read-side critical sections.
|
||
|
|
||
|
<p>Note well that RCU-sched read-side critical sections include any region
|
||
|
of code for which preemption is disabled.
|
||
|
Given that each individual machine instruction can be thought of as
|
||
|
an extremely small region of preemption-disabled code, one can think of
|
||
|
<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids.
|
||
|
|
||
|
<p>RCU updaters use this guarantee by splitting their updates into
|
||
|
two phases, one of which is executed before the grace period and
|
||
|
the other of which is executed after the grace period.
|
||
|
In the most common use case, phase one removes an element from
|
||
|
a linked RCU-protected data structure, and phase two frees that element.
|
||
|
For this to work, any readers that have witnessed state prior to the
|
||
|
phase-one update (in the common case, removal) must not witness state
|
||
|
following the phase-two update (in the common case, freeing).
|
||
|
|
||
|
<p>The RCU implementation provides this guarantee using a network
|
||
|
of lock-based critical sections, memory barriers, and per-CPU
|
||
|
processing, as is described in the following sections.
|
||
|
|
||
|
<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
|
||
|
Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>
|
||
|
|
||
|
<p>The workhorse for RCU's grace-period memory ordering is the
|
||
|
critical section for the <tt>rcu_node</tt> structure's
|
||
|
<tt>->lock</tt>.
|
||
|
These critical sections use helper functions for lock acquisition, including
|
||
|
<tt>raw_spin_lock_rcu_node()</tt>,
|
||
|
<tt>raw_spin_lock_irq_rcu_node()</tt>, and
|
||
|
<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
|
||
|
Their lock-release counterparts are
|
||
|
<tt>raw_spin_unlock_rcu_node()</tt>,
|
||
|
<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
|
||
|
<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
|
||
|
respectively.
|
||
|
For completeness, a
|
||
|
<tt>raw_spin_trylock_rcu_node()</tt>
|
||
|
is also provided.
|
||
|
The key point is that the lock-acquisition functions, including
|
||
|
<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
|
||
|
<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
|
||
|
acquisition of the lock.
|
||
|
|
||
|
<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
|
||
|
happening before one of the above lock-release functions will be seen
|
||
|
by all CPUs as happening before any access happening after a later
|
||
|
one of the above lock-acquisition functions.
|
||
|
Furthermore, any access happening before one of the
|
||
|
above lock-release function on any given CPU will be seen by all
|
||
|
CPUs as happening before any access happening after a later one
|
||
|
of the above lock-acquisition functions executing on that same CPU,
|
||
|
even if the lock-release and lock-acquisition functions are operating
|
||
|
on different <tt>rcu_node</tt> structures.
|
||
|
Tree RCU uses these two ordering guarantees to form an ordering
|
||
|
network among all CPUs that were in any way involved in the grace
|
||
|
period, including any CPUs that came online or went offline during
|
||
|
the grace period in question.
|
||
|
|
||
|
<p>The following litmus test exhibits the ordering effects of these
|
||
|
lock-acquisition and lock-release functions:
|
||
|
|
||
|
<pre>
|
||
|
1 int x, y, z;
|
||
|
2
|
||
|
3 void task0(void)
|
||
|
4 {
|
||
|
5 raw_spin_lock_rcu_node(rnp);
|
||
|
6 WRITE_ONCE(x, 1);
|
||
|
7 r1 = READ_ONCE(y);
|
||
|
8 raw_spin_unlock_rcu_node(rnp);
|
||
|
9 }
|
||
|
10
|
||
|
11 void task1(void)
|
||
|
12 {
|
||
|
13 raw_spin_lock_rcu_node(rnp);
|
||
|
14 WRITE_ONCE(y, 1);
|
||
|
15 r2 = READ_ONCE(z);
|
||
|
16 raw_spin_unlock_rcu_node(rnp);
|
||
|
17 }
|
||
|
18
|
||
|
19 void task2(void)
|
||
|
20 {
|
||
|
21 WRITE_ONCE(z, 1);
|
||
|
22 smp_mb();
|
||
|
23 r3 = READ_ONCE(x);
|
||
|
24 }
|
||
|
25
|
||
|
26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
|
||
|
</pre>
|
||
|
|
||
|
<p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”,
|
||
|
after all changes have propagated throughout the system.
|
||
|
Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
|
||
|
acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example
|
||
|
on PowerPC.
|
||
|
The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
|
||
|
<tt>WARN_ON()</tt> from triggering.
|
||
|
|
||
|
<p>This approach must be extended to include idle CPUs, which need
|
||
|
RCU's grace-period memory ordering guarantee to extend to any
|
||
|
RCU read-side critical sections preceding and following the current
|
||
|
idle sojourn.
|
||
|
This case is handled by calls to the strongly ordered
|
||
|
<tt>atomic_add_return()</tt> read-modify-write atomic operation that
|
||
|
is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
|
||
|
time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
|
||
|
The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
|
||
|
<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
|
||
|
an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
|
||
|
|
||
|
<table>
|
||
|
<tr><th> </th></tr>
|
||
|
<tr><th align="left">Quick Quiz:</th></tr>
|
||
|
<tr><td>
|
||
|
But what about CPUs that remain offline for the entire
|
||
|
grace period?
|
||
|
</td></tr>
|
||
|
<tr><th align="left">Answer:</th></tr>
|
||
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||
|
Such CPUs will be offline at the beginning of the grace period,
|
||
|
so the grace period won't expect quiescent states from them.
|
||
|
Races between grace-period start and CPU-hotplug operations
|
||
|
are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
|
||
|
<tt>->lock</tt> as described above.
|
||
|
</font></td></tr>
|
||
|
<tr><td> </td></tr>
|
||
|
</table>
|
||
|
|
||
|
<p>The approach must be extended to handle one final case, that
|
||
|
of waking a task blocked in <tt>synchronize_rcu()</tt>.
|
||
|
This task might be affinitied to a CPU that is not yet aware that
|
||
|
the grace period has ended, and thus might not yet be subject to
|
||
|
the grace period's memory ordering.
|
||
|
Therefore, there is an <tt>smp_mb()</tt> after the return from
|
||
|
<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
|
||
|
code path.
|
||
|
|
||
|
<table>
|
||
|
<tr><th> </th></tr>
|
||
|
<tr><th align="left">Quick Quiz:</th></tr>
|
||
|
<tr><td>
|
||
|
What? Where???
|
||
|
I don't see any <tt>smp_mb()</tt> after the return from
|
||
|
<tt>wait_for_completion()</tt>!!!
|
||
|
</td></tr>
|
||
|
<tr><th align="left">Answer:</th></tr>
|
||
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||
|
That would be because I spotted the need for that
|
||
|
<tt>smp_mb()</tt> during the creation of this documentation,
|
||
|
and it is therefore unlikely to hit mainline before v4.14.
|
||
|
Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
|
||
|
Jonathan Cameron for asking questions that sensitized me
|
||
|
to the rather elaborate sequence of events that demonstrate
|
||
|
the need for this memory barrier.
|
||
|
</font></td></tr>
|
||
|
<tr><td> </td></tr>
|
||
|
</table>
|
||
|
|
||
|
<p>Tree RCU's grace--period memory-ordering guarantees rely most
|
||
|
heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt>
|
||
|
field, so much so that it is necessary to abbreviate this pattern
|
||
|
in the diagrams in the next section.
|
||
|
For example, consider the <tt>rcu_prepare_for_idle()</tt> function
|
||
|
shown below, which is one of several functions that enforce ordering
|
||
|
of newly arrived RCU callbacks against future grace periods:
|
||
|
|
||
|
<pre>
|
||
|
1 static void rcu_prepare_for_idle(void)
|
||
|
2 {
|
||
|
3 bool needwake;
|
||
|
4 struct rcu_data *rdp;
|
||
|
5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
|
||
|
6 struct rcu_node *rnp;
|
||
|
7 struct rcu_state *rsp;
|
||
|
8 int tne;
|
||
|
9
|
||
|
10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
|
||
|
11 rcu_is_nocb_cpu(smp_processor_id()))
|
||
|
12 return;
|
||
|
13 tne = READ_ONCE(tick_nohz_active);
|
||
|
14 if (tne != rdtp->tick_nohz_enabled_snap) {
|
||
|
15 if (rcu_cpu_has_callbacks(NULL))
|
||
|
16 invoke_rcu_core();
|
||
|
17 rdtp->tick_nohz_enabled_snap = tne;
|
||
|
18 return;
|
||
|
19 }
|
||
|
20 if (!tne)
|
||
|
21 return;
|
||
|
22 if (rdtp->all_lazy &&
|
||
|
23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
|
||
|
24 rdtp->all_lazy = false;
|
||
|
25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
|
||
|
26 invoke_rcu_core();
|
||
|
27 return;
|
||
|
28 }
|
||
|
29 if (rdtp->last_accelerate == jiffies)
|
||
|
30 return;
|
||
|
31 rdtp->last_accelerate = jiffies;
|
||
|
32 for_each_rcu_flavor(rsp) {
|
||
|
33 rdp = this_cpu_ptr(rsp->rda);
|
||
|
34 if (rcu_segcblist_pend_cbs(&rdp->cblist))
|
||
|
35 continue;
|
||
|
36 rnp = rdp->mynode;
|
||
|
37 raw_spin_lock_rcu_node(rnp);
|
||
|
38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
|
||
|
39 raw_spin_unlock_rcu_node(rnp);
|
||
|
40 if (needwake)
|
||
|
41 rcu_gp_kthread_wake(rsp);
|
||
|
42 }
|
||
|
43 }
|
||
|
</pre>
|
||
|
|
||
|
<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really
|
||
|
matters for this discussion are lines 37–39.
|
||
|
We will therefore abbreviate this function as follows:
|
||
|
|
||
|
</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">
|
||
|
|
||
|
<p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt>
|
||
|
critical section, with the double line on top representing the additional
|
||
|
<tt>smp_mb__after_unlock_lock()</tt>.
|
||
|
|
||
|
<h3><a name="Tree RCU Grace Period Memory Ordering Components">
|
||
|
Tree RCU Grace Period Memory Ordering Components</a></h3>
|
||
|
|
||
|
<p>Tree RCU's grace-period memory-ordering guarantee is provided by
|
||
|
a number of RCU components:
|
||
|
|
||
|
<ol>
|
||
|
<li> <a href="#Callback Registry">Callback Registry</a>
|
||
|
<li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a>
|
||
|
<li> <a href="#Self-Reported Quiescent States">
|
||
|
Self-Reported Quiescent States</a>
|
||
|
<li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
|
||
|
<li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
|
||
|
<li> <a href="Forcing Quiescent States">Forcing Quiescent States</a>
|
||
|
<li> <a href="Grace-Period Cleanup">Grace-Period Cleanup</a>
|
||
|
<li> <a href="Callback Invocation">Callback Invocation</a>
|
||
|
</ol>
|
||
|
|
||
|
<p>Each of the following section looks at the corresponding component
|
||
|
in detail.
|
||
|
|
||
|
<h4><a name="Callback Registry">Callback Registry</a></h4>
|
||
|
|
||
|
<p>If RCU's grace-period guarantee is to mean anything at all, any
|
||
|
access that happens before a given invocation of <tt>call_rcu()</tt>
|
||
|
must also happen before the corresponding grace period.
|
||
|
The implementation of this portion of RCU's grace period guarantee
|
||
|
is shown in the following figure:
|
||
|
|
||
|
</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">
|
||
|
|
||
|
<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
|
||
|
it provides no ordering guarantees, either for itself or for
|
||
|
phase one of the update (which again will usually be removal of
|
||
|
an element from an RCU-protected data structure).
|
||
|
It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
|
||
|
which cannot become associated with a grace period until a later
|
||
|
call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
|
||
|
|
||
|
<p>One set of code paths shown on the left invokes
|
||
|
<tt>rcu_accelerate_cbs()</tt> via
|
||
|
<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
|
||
|
the current CPU is inundated with queued <tt>rcu_head</tt> structures)
|
||
|
or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
|
||
|
Another code path in the middle is taken only in kernels built with
|
||
|
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
|
||
|
<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
|
||
|
The final code path on the right is taken only in kernels built with
|
||
|
<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
|
||
|
<tt>rcu_accelerate_cbs()</tt> via
|
||
|
<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>,
|
||
|
<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
|
||
|
which in turn is invoked on a surviving CPU after the outgoing
|
||
|
CPU has been completely offlined.
|
||
|
|
||
|
<p>There are a few other code paths within grace-period processing
|
||
|
that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
|
||
|
However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
|
||
|
structures are associated with a future grace-period number under
|
||
|
the protection of the CPU's lead <tt>rcu_node</tt> structure's
|
||
|
<tt>->lock</tt>.
|
||
|
In all cases, there is full ordering against any prior critical section
|
||
|
for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and
|
||
|
also full ordering against any of the current task's or CPU's prior critical
|
||
|
sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>.
|
||
|
|
||
|
<p>The next section will show how this ordering ensures that any
|
||
|
accesses prior to the <tt>call_rcu()</tt> (particularly including phase
|
||
|
one of the update)
|
||
|
happen before the start of the corresponding grace period.
|
||
|
|
||
|
<table>
|
||
|
<tr><th> </th></tr>
|
||
|
<tr><th align="left">Quick Quiz:</th></tr>
|
||
|
<tr><td>
|
||
|
But what about <tt>synchronize_rcu()</tt>?
|
||
|
</td></tr>
|
||
|
<tr><th align="left">Answer:</th></tr>
|
||
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||
|
The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt>
|
||
|
to <tt>wait_rcu_gp()</tt>, which invokes it.
|
||
|
So either way, it eventually comes down to <tt>call_rcu()</tt>.
|
||
|
</font></td></tr>
|
||
|
<tr><td> </td></tr>
|
||
|
</table>
|
||
|
|
||
|
<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>
|
||
|
|
||
|
<p>Grace-period initialization is carried out by
|
||
|
the grace-period kernel thread, which makes several passes over the
|
||
|
<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
|
||
|
This means that showing the full flow of ordering through the
|
||
|
grace-period computation will require duplicating this tree.
|
||
|
If you find this confusing, please note that the state of the
|
||
|
<tt>rcu_node</tt> changes over time, just like Heraclitus's river.
|
||
|
However, to keep the <tt>rcu_node</tt> river tractable, the
|
||
|
grace-period kernel thread's traversals are presented in multiple
|
||
|
parts, starting in this section with the various phases of
|
||
|
grace-period initialization.
|
||
|
|
||
|
<p>The first ordering-related grace-period initialization action is to
|
||
|
advance the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt>
|
||
|
grace-period-number counter, as shown below:
|
||
|
|
||
|
</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
|
||
|
|
||
|
<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
|
||
|
which helps reject false-positive RCU CPU stall detection.
|
||
|
Note that only the root <tt>rcu_node</tt> structure is touched.
|
||
|
|
||
|
<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
|
||
|
based on CPUs having come online or gone offline since the start of
|
||
|
the previous grace period.
|
||
|
In the common case where the number of online CPUs for this <tt>rcu_node</tt>
|
||
|
structure has not transitioned to or from zero,
|
||
|
this pass will scan only the leaf <tt>rcu_node</tt> structures.
|
||
|
However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
|
||
|
structure has transitioned from zero,
|
||
|
<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
|
||
|
Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
|
||
|
structure has transitioned to zero,
|
||
|
<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
|
||
|
The diagram below shows the path of ordering if the leftmost
|
||
|
<tt>rcu_node</tt> structure onlines its first CPU and if the next
|
||
|
<tt>rcu_node</tt> structure has no online CPUs
|
||
|
(or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines
|
||
|
its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).
|
||
|
|
||
|
</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
|
||
|
|
||
|
<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
|
||
|
tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
|
||
|
<tt>->gp_seq</tt> field to the newly advanced value from the
|
||
|
<tt>rcu_state</tt> structure, as shown in the following diagram.
|
||
|
|
||
|
</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
|
||
|
|
||
|
<p>This change will also cause each CPU's next call to
|
||
|
<tt>__note_gp_changes()</tt>
|
||
|
to notice that a new grace period has started, as described in the next
|
||
|
section.
|
||
|
But because the grace-period kthread started the grace period at the
|
||
|
root (with the advancing of the <tt>rcu_state</tt> structure's
|
||
|
<tt>->gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
|
||
|
structure's <tt>->gp_seq</tt> field, each CPU's observation of
|
||
|
the start of the grace period will happen after the actual start
|
||
|
of the grace period.
|
||
|
|
||
|
<table>
|
||
|
<tr><th> </th></tr>
|
||
|
<tr><th align="left">Quick Quiz:</th></tr>
|
||
|
<tr><td>
|
||
|
But what about the CPU that started the grace period?
|
||
|
Why wouldn't it see the start of the grace period right when
|
||
|
it started that grace period?
|
||
|
</td></tr>
|
||
|
<tr><th align="left">Answer:</th></tr>
|
||
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||
|
In some deep philosophical and overly anthromorphized
|
||
|
sense, yes, the CPU starting the grace period is immediately
|
||
|
aware of having done so.
|
||
|
However, if we instead assume that RCU is not self-aware,
|
||
|
then even the CPU starting the grace period does not really
|
||
|
become aware of the start of this grace period until its
|
||
|
first call to <tt>__note_gp_changes()</tt>.
|
||
|
On the other hand, this CPU potentially gets early notification
|
||
|
because it invokes <tt>__note_gp_changes()</tt> during its
|
||
|
last <tt>rcu_gp_init()</tt> pass through its leaf
|
||
|
<tt>rcu_node</tt> structure.
|
||
|
</font></td></tr>
|
||
|
<tr><td> </td></tr>
|
||
|
</table>
|
||
|
|
||
|
<h4><a name="Self-Reported Quiescent States">
|
||
|
Self-Reported Quiescent States</a></h4>
|
||
|
|
||
|
<p>When all entities that might block the grace period have reported
|
||
|
quiescent states (or as described in a later section, had quiescent
|
||
|
states reported on their behalf), the grace period can end.
|
||
|
Online non-idle CPUs report their own quiescent states, as shown
|
||
|
in the following diagram:
|
||
|
|
||
|
</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">
|
||
|
|
||
|
<p>This is for the last CPU to report a quiescent state, which signals
|
||
|
the end of the grace period.
|
||
|
Earlier quiescent states would push up the <tt>rcu_node</tt> tree
|
||
|
only until they encountered an <tt>rcu_node</tt> structure that
|
||
|
is waiting for additional quiescent states.
|
||
|
However, ordering is nevertheless preserved because some later quiescent
|
||
|
state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>.
|
||
|
|
||
|
<p>Any number of events can lead up to a CPU invoking
|
||
|
<tt>note_gp_changes</tt> (or alternatively, directly invoking
|
||
|
<tt>__note_gp_changes()</tt>), at which point that CPU will notice
|
||
|
the start of a new grace period while holding its leaf
|
||
|
<tt>rcu_node</tt> lock.
|
||
|
Therefore, all execution shown in this diagram happens after the
|
||
|
start of the grace period.
|
||
|
In addition, this CPU will consider any RCU read-side critical
|
||
|
section that started before the invocation of <tt>__note_gp_changes()</tt>
|
||
|
to have started before the grace period, and thus a critical
|
||
|
section that the grace period must wait on.
|
||
|
|
||
|
<table>
|
||
|
<tr><th> </th></tr>
|
||
|
<tr><th align="left">Quick Quiz:</th></tr>
|
||
|
<tr><td>
|
||
|
But a RCU read-side critical section might have started
|
||
|
after the beginning of the grace period
|
||
|
(the advancing of <tt>->gp_seq</tt> from earlier), so why should
|
||
|
the grace period wait on such a critical section?
|
||
|
</td></tr>
|
||
|
<tr><th align="left">Answer:</th></tr>
|
||
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||
|
It is indeed not necessary for the grace period to wait on such
|
||
|
a critical section.
|
||
|
However, it is permissible to wait on it.
|
||
|
And it is furthermore important to wait on it, as this
|
||
|
lazy approach is far more scalable than a “big bang”
|
||
|
all-at-once grace-period start could possibly be.
|
||
|
</font></td></tr>
|
||
|
<tr><td> </td></tr>
|
||
|
</table>
|
||
|
|
||
|
<p>If the CPU does a context switch, a quiescent state will be
|
||
|
noted by <tt>rcu_node_context_switch()</tt> on the left.
|
||
|
On the other hand, if the CPU takes a scheduler-clock interrupt
|
||
|
while executing in usermode, a quiescent state will be noted by
|
||
|
<tt>rcu_sched_clock_irq()</tt> on the right.
|
||
|
Either way, the passage through a quiescent state will be noted
|
||
|
in a per-CPU variable.
|
||
|
|
||
|
<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
|
||
|
this CPU (for example, after the next scheduler-clock
|
||
|
interrupt), <tt>rcu_core()</tt> will invoke
|
||
|
<tt>rcu_check_quiescent_state()</tt>, which will notice the
|
||
|
recorded quiescent state, and invoke
|
||
|
<tt>rcu_report_qs_rdp()</tt>.
|
||
|
If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
|
||
|
really does apply to the current grace period, it invokes
|
||
|
<tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt>
|
||
|
tree as shown at the bottom of the diagram, clearing bits from
|
||
|
each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field,
|
||
|
and propagating up the tree when the result is zero.
|
||
|
|
||
|
<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
|
||
|
structure only if the current CPU is reporting the last quiescent
|
||
|
state for the subtree headed by that <tt>rcu_node</tt> structure.
|
||
|
A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
|
||
|
structure, then there will be a later traversal by another CPU
|
||
|
(or perhaps the same one) that proceeds upwards
|
||
|
from that point, and the <tt>rcu_node</tt> <tt>->lock</tt>
|
||
|
guarantees that the first CPU's quiescent state happens before the
|
||
|
remainder of the second CPU's traversal.
|
||
|
Applying this line of thought repeatedly shows that all CPUs'
|
||
|
quiescent states happen before the last CPU traverses through
|
||
|
the root <tt>rcu_node</tt> structure, the “last CPU”
|
||
|
being the one that clears the last bit in the root <tt>rcu_node</tt>
|
||
|
structure's <tt>->qsmask</tt> field.
|
||
|
|
||
|
<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>
|
||
|
|
||
|
<p>Due to energy-efficiency considerations, RCU is forbidden from
|
||
|
disturbing idle CPUs.
|
||
|
CPUs are therefore required to notify RCU when entering or leaving idle
|
||
|
state, which they do via fully ordered value-returning atomic operations
|
||
|
on a per-CPU variable.
|
||
|
The ordering effects are as shown below:
|
||
|
|
||
|
</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">
|
||
|
|
||
|
<p>The RCU grace-period kernel thread samples the per-CPU idleness
|
||
|
variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
|
||
|
structure's <tt>->lock</tt>.
|
||
|
This means that any RCU read-side critical sections that precede the
|
||
|
idle period (the oval near the top of the diagram above) will happen
|
||
|
before the end of the current grace period.
|
||
|
Similarly, the beginning of the current grace period will happen before
|
||
|
any RCU read-side critical sections that follow the
|
||
|
idle period (the oval near the bottom of the diagram above).
|
||
|
|
||
|
<p>Plumbing this into the full grace-period execution is described
|
||
|
<a href="#Forcing Quiescent States">below</a>.
|
||
|
|
||
|
<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>
|
||
|
|
||
|
<p>RCU is also forbidden from disturbing offline CPUs, which might well
|
||
|
be powered off and removed from the system completely.
|
||
|
CPUs are therefore required to notify RCU of their comings and goings
|
||
|
as part of the corresponding CPU hotplug operations.
|
||
|
The ordering effects are shown below:
|
||
|
|
||
|
</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">
|
||
|
|
||
|
<p>Because CPU hotplug operations are much less frequent than idle transitions,
|
||
|
they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
|
||
|
structure's <tt>->lock</tt> and update this structure's
|
||
|
<tt>->qsmaskinitnext</tt>.
|
||
|
The RCU grace-period kernel thread samples this mask to detect CPUs
|
||
|
having gone offline since the beginning of this grace period.
|
||
|
|
||
|
<p>Plumbing this into the full grace-period execution is described
|
||
|
<a href="#Forcing Quiescent States">below</a>.
|
||
|
|
||
|
<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>
|
||
|
|
||
|
<p>As noted above, idle and offline CPUs cannot report their own
|
||
|
quiescent states, and therefore the grace-period kernel thread
|
||
|
must do the reporting on their behalf.
|
||
|
This process is called “forcing quiescent states”, it is
|
||
|
repeated every few jiffies, and its ordering effects are shown below:
|
||
|
|
||
|
</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">
|
||
|
|
||
|
<p>Each pass of quiescent state forcing is guaranteed to traverse the
|
||
|
leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
|
||
|
states due to recently idled and/or offlined CPUs, then only the
|
||
|
leaves are traversed.
|
||
|
However, if there is a newly offlined CPU as illustrated on the left
|
||
|
or a newly idled CPU as illustrated on the right, the corresponding
|
||
|
quiescent state will be driven up towards the root.
|
||
|
As with self-reported quiescent states, the upwards driving stops
|
||
|
once it reaches an <tt>rcu_node</tt> structure that has quiescent
|
||
|
states outstanding from other CPUs.
|
||
|
|
||
|
<table>
|
||
|
<tr><th> </th></tr>
|
||
|
<tr><th align="left">Quick Quiz:</th></tr>
|
||
|
<tr><td>
|
||
|
The leftmost drive to root stopped before it reached
|
||
|
the root <tt>rcu_node</tt> structure, which means that
|
||
|
there are still CPUs subordinate to that structure on
|
||
|
which the current grace period is waiting.
|
||
|
Given that, how is it possible that the rightmost drive
|
||
|
to root ended the grace period?
|
||
|
</td></tr>
|
||
|
<tr><th align="left">Answer:</th></tr>
|
||
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||
|
Good analysis!
|
||
|
It is in fact impossible in the absence of bugs in RCU.
|
||
|
But this diagram is complex enough as it is, so simplicity
|
||
|
overrode accuracy.
|
||
|
You can think of it as poetic license, or you can think of
|
||
|
it as misdirection that is resolved in the
|
||
|
<a href="#Putting It All Together">stitched-together diagram</a>.
|
||
|
</font></td></tr>
|
||
|
<tr><td> </td></tr>
|
||
|
</table>
|
||
|
|
||
|
<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>
|
||
|
|
||
|
<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
|
||
|
breadth-first advancing all the <tt>->gp_seq</tt> fields, then it
|
||
|
advances the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field.
|
||
|
The ordering effects are shown below:
|
||
|
|
||
|
</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">
|
||
|
|
||
|
<p>As indicated by the oval at the bottom of the diagram, once
|
||
|
grace-period cleanup is complete, the next grace period can begin.
|
||
|
|
||
|
<table>
|
||
|
<tr><th> </th></tr>
|
||
|
<tr><th align="left">Quick Quiz:</th></tr>
|
||
|
<tr><td>
|
||
|
But when precisely does the grace period end?
|
||
|
</td></tr>
|
||
|
<tr><th align="left">Answer:</th></tr>
|
||
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||
|
There is no useful single point at which the grace period
|
||
|
can be said to end.
|
||
|
The earliest reasonable candidate is as soon as the last
|
||
|
CPU has reported its quiescent state, but it may be some
|
||
|
milliseconds before RCU becomes aware of this.
|
||
|
The latest reasonable candidate is once the <tt>rcu_state</tt>
|
||
|
structure's <tt>->gp_seq</tt> field has been updated,
|
||
|
but it is quite possible that some CPUs have already completed
|
||
|
phase two of their updates by that time.
|
||
|
In short, if you are going to work with RCU, you need to
|
||
|
learn to embrace uncertainty.
|
||
|
</font></td></tr>
|
||
|
<tr><td> </td></tr>
|
||
|
</table>
|
||
|
|
||
|
|
||
|
<h4><a name="Callback Invocation">Callback Invocation</a></h4>
|
||
|
|
||
|
<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
|
||
|
<tt>->gp_seq</tt> field has been updated, that CPU can begin
|
||
|
invoking its RCU callbacks that were waiting for this grace period
|
||
|
to end.
|
||
|
These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
|
||
|
which is usually invoked by <tt>__note_gp_changes()</tt>.
|
||
|
As shown in the diagram below, this invocation can be triggered by
|
||
|
the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on
|
||
|
the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
|
||
|
the right, but only for kernels build with
|
||
|
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
|
||
|
Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
|
||
|
<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
|
||
|
allows those callbacks to carry out (either directly or indirectly
|
||
|
via wakeup) the needed phase-two processing for each update.
|
||
|
|
||
|
</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">
|
||
|
|
||
|
<p>Please note that callback invocation can also be prompted by any
|
||
|
number of corner-case code paths, for example, when a CPU notes that
|
||
|
it has excessive numbers of callbacks queued.
|
||
|
In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
|
||
|
<tt>->lock</tt> before invoking callbacks, which preserves the
|
||
|
required ordering against the newly completed grace period.
|
||
|
|
||
|
<p>However, if the callback function communicates to other CPUs,
|
||
|
for example, doing a wakeup, then it is that function's responsibility
|
||
|
to maintain ordering.
|
||
|
For example, if the callback function wakes up a task that runs on
|
||
|
some other CPU, proper ordering must in place in both the callback
|
||
|
function and the task being awakened.
|
||
|
To see why this is important, consider the top half of the
|
||
|
<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
|
||
|
The callback might be running on a CPU corresponding to the leftmost
|
||
|
leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
|
||
|
a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
|
||
|
and the grace-period kernel thread might not yet have reached the
|
||
|
rightmost leaf.
|
||
|
In this case, the grace period's memory ordering might not yet have
|
||
|
reached that CPU, so again the callback function and the awakened
|
||
|
task must supply proper ordering.
|
||
|
|
||
|
<h3><a name="Putting It All Together">Putting It All Together</a></h3>
|
||
|
|
||
|
<p>A stitched-together diagram is
|
||
|
<a href="Tree-RCU-Diagram.html">here</a>.
|
||
|
|
||
|
<h3><a name="Legal Statement">
|
||
|
Legal Statement</a></h3>
|
||
|
|
||
|
<p>This work represents the view of the author and does not necessarily
|
||
|
represent the view of IBM.
|
||
|
|
||
|
</p><p>Linux is a registered trademark of Linus Torvalds.
|
||
|
|
||
|
</p><p>Other company, product, and service names may be trademarks or
|
||
|
service marks of others.
|
||
|
|
||
|
</body></html>
|