3402 lines
129 KiB
HTML
3402 lines
129 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
|
"http://www.w3.org/TR/html4/loose.dtd">
|
|
<html>
|
|
<head><title>A Tour Through RCU's Requirements [LWN.net]</title>
|
|
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
|
|
|
|
<h1>A Tour Through RCU's Requirements</h1>
|
|
|
|
<p>Copyright IBM Corporation, 2015</p>
|
|
<p>Author: Paul E. McKenney</p>
|
|
<p><i>The initial version of this document appeared in the
|
|
<a href="https://lwn.net/">LWN</a> articles
|
|
<a href="https://lwn.net/Articles/652156/">here</a>,
|
|
<a href="https://lwn.net/Articles/652677/">here</a>, and
|
|
<a href="https://lwn.net/Articles/653326/">here</a>.</i></p>
|
|
|
|
<h2>Introduction</h2>
|
|
|
|
<p>
|
|
Read-copy update (RCU) is a synchronization mechanism that is often
|
|
used as a replacement for reader-writer locking.
|
|
RCU is unusual in that updaters do not block readers,
|
|
which means that RCU's read-side primitives can be exceedingly fast
|
|
and scalable.
|
|
In addition, updaters can make useful forward progress concurrently
|
|
with readers.
|
|
However, all this concurrency between RCU readers and updaters does raise
|
|
the question of exactly what RCU readers are doing, which in turn
|
|
raises the question of exactly what RCU's requirements are.
|
|
|
|
<p>
|
|
This document therefore summarizes RCU's requirements, and can be thought
|
|
of as an informal, high-level specification for RCU.
|
|
It is important to understand that RCU's specification is primarily
|
|
empirical in nature;
|
|
in fact, I learned about many of these requirements the hard way.
|
|
This situation might cause some consternation, however, not only
|
|
has this learning process been a lot of fun, but it has also been
|
|
a great privilege to work with so many people willing to apply
|
|
technologies in interesting new ways.
|
|
|
|
<p>
|
|
All that aside, here are the categories of currently known RCU requirements:
|
|
</p>
|
|
|
|
<ol>
|
|
<li> <a href="#Fundamental Requirements">
|
|
Fundamental Requirements</a>
|
|
<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a>
|
|
<li> <a href="#Parallelism Facts of Life">
|
|
Parallelism Facts of Life</a>
|
|
<li> <a href="#Quality-of-Implementation Requirements">
|
|
Quality-of-Implementation Requirements</a>
|
|
<li> <a href="#Linux Kernel Complications">
|
|
Linux Kernel Complications</a>
|
|
<li> <a href="#Software-Engineering Requirements">
|
|
Software-Engineering Requirements</a>
|
|
<li> <a href="#Other RCU Flavors">
|
|
Other RCU Flavors</a>
|
|
<li> <a href="#Possible Future Changes">
|
|
Possible Future Changes</a>
|
|
</ol>
|
|
|
|
<p>
|
|
This is followed by a <a href="#Summary">summary</a>,
|
|
however, the answers to each quick quiz immediately follows the quiz.
|
|
Select the big white space with your mouse to see the answer.
|
|
|
|
<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2>
|
|
|
|
<p>
|
|
RCU's fundamental requirements are the closest thing RCU has to hard
|
|
mathematical requirements.
|
|
These are:
|
|
|
|
<ol>
|
|
<li> <a href="#Grace-Period Guarantee">
|
|
Grace-Period Guarantee</a>
|
|
<li> <a href="#Publish-Subscribe Guarantee">
|
|
Publish-Subscribe Guarantee</a>
|
|
<li> <a href="#Memory-Barrier Guarantees">
|
|
Memory-Barrier Guarantees</a>
|
|
<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally">
|
|
RCU Primitives Guaranteed to Execute Unconditionally</a>
|
|
<li> <a href="#Guaranteed Read-to-Write Upgrade">
|
|
Guaranteed Read-to-Write Upgrade</a>
|
|
</ol>
|
|
|
|
<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3>
|
|
|
|
<p>
|
|
RCU's grace-period guarantee is unusual in being premeditated:
|
|
Jack Slingwine and I had this guarantee firmly in mind when we started
|
|
work on RCU (then called “rclock”) in the early 1990s.
|
|
That said, the past two decades of experience with RCU have produced
|
|
a much more detailed understanding of this guarantee.
|
|
|
|
<p>
|
|
RCU's grace-period guarantee allows updaters to wait for the completion
|
|
of all pre-existing RCU read-side critical sections.
|
|
An RCU read-side critical section
|
|
begins with the marker <tt>rcu_read_lock()</tt> and ends with
|
|
the marker <tt>rcu_read_unlock()</tt>.
|
|
These markers may be nested, and RCU treats a nested set as one
|
|
big RCU read-side critical section.
|
|
Production-quality implementations of <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> are extremely lightweight, and in
|
|
fact have exactly zero overhead in Linux kernels built for production
|
|
use with <tt>CONFIG_PREEMPT=n</tt>.
|
|
|
|
<p>
|
|
This guarantee allows ordering to be enforced with extremely low
|
|
overhead to readers, for example:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 int x, y;
|
|
2
|
|
3 void thread0(void)
|
|
4 {
|
|
5 rcu_read_lock();
|
|
6 r1 = READ_ONCE(x);
|
|
7 r2 = READ_ONCE(y);
|
|
8 rcu_read_unlock();
|
|
9 }
|
|
10
|
|
11 void thread1(void)
|
|
12 {
|
|
13 WRITE_ONCE(x, 1);
|
|
14 synchronize_rcu();
|
|
15 WRITE_ONCE(y, 1);
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
Because the <tt>synchronize_rcu()</tt> on line 14 waits for
|
|
all pre-existing readers, any instance of <tt>thread0()</tt> that
|
|
loads a value of zero from <tt>x</tt> must complete before
|
|
<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must
|
|
also load a value of zero from <tt>y</tt>.
|
|
Similarly, any instance of <tt>thread0()</tt> that loads a value of
|
|
one from <tt>y</tt> must have started after the
|
|
<tt>synchronize_rcu()</tt> started, and must therefore also load
|
|
a value of one from <tt>x</tt>.
|
|
Therefore, the outcome:
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 0 && r2 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
cannot happen.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Wait a minute!
|
|
You said that updaters can make useful forward progress concurrently
|
|
with readers, but pre-existing readers will block
|
|
<tt>synchronize_rcu()</tt>!!!
|
|
Just who are you trying to fool???
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
First, if updaters do not wish to be blocked by readers, they can use
|
|
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will
|
|
be discussed later.
|
|
Second, even when using <tt>synchronize_rcu()</tt>, the other
|
|
update-side code does run concurrently with readers, whether
|
|
pre-existing or not.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
This scenario resembles one of the first uses of RCU in
|
|
<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>,
|
|
which managed a distributed lock manager's transition into
|
|
a state suitable for handling recovery from node failure,
|
|
more or less as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 #define STATE_NORMAL 0
|
|
2 #define STATE_WANT_RECOVERY 1
|
|
3 #define STATE_RECOVERING 2
|
|
4 #define STATE_WANT_NORMAL 3
|
|
5
|
|
6 int state = STATE_NORMAL;
|
|
7
|
|
8 void do_something_dlm(void)
|
|
9 {
|
|
10 int state_snap;
|
|
11
|
|
12 rcu_read_lock();
|
|
13 state_snap = READ_ONCE(state);
|
|
14 if (state_snap == STATE_NORMAL)
|
|
15 do_something();
|
|
16 else
|
|
17 do_something_carefully();
|
|
18 rcu_read_unlock();
|
|
19 }
|
|
20
|
|
21 void start_recovery(void)
|
|
22 {
|
|
23 WRITE_ONCE(state, STATE_WANT_RECOVERY);
|
|
24 synchronize_rcu();
|
|
25 WRITE_ONCE(state, STATE_RECOVERING);
|
|
26 recovery();
|
|
27 WRITE_ONCE(state, STATE_WANT_NORMAL);
|
|
28 synchronize_rcu();
|
|
29 WRITE_ONCE(state, STATE_NORMAL);
|
|
30 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The RCU read-side critical section in <tt>do_something_dlm()</tt>
|
|
works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt>
|
|
to guarantee that <tt>do_something()</tt> never runs concurrently
|
|
with <tt>recovery()</tt>, but with little or no synchronization
|
|
overhead in <tt>do_something_dlm()</tt>.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Why is the <tt>synchronize_rcu()</tt> on line 28 needed?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Without that extra grace period, memory reordering could result in
|
|
<tt>do_something_dlm()</tt> executing <tt>do_something()</tt>
|
|
concurrently with the last bits of <tt>recovery()</tt>.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
In order to avoid fatal problems such as deadlocks,
|
|
an RCU read-side critical section must not contain calls to
|
|
<tt>synchronize_rcu()</tt>.
|
|
Similarly, an RCU read-side critical section must not
|
|
contain anything that waits, directly or indirectly, on completion of
|
|
an invocation of <tt>synchronize_rcu()</tt>.
|
|
|
|
<p>
|
|
Although RCU's grace-period guarantee is useful in and of itself, with
|
|
<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>,
|
|
it would be good to be able to use RCU to coordinate read-side
|
|
access to linked data structures.
|
|
For this, the grace-period guarantee is not sufficient, as can
|
|
be seen in function <tt>add_gp_buggy()</tt> below.
|
|
We will look at the reader's code later, but in the meantime, just think of
|
|
the reader as locklessly picking up the <tt>gp</tt> pointer,
|
|
and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the
|
|
<tt>->a</tt> and <tt>->b</tt> fields.
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool add_gp_buggy(int a, int b)
|
|
2 {
|
|
3 p = kmalloc(sizeof(*p), GFP_KERNEL);
|
|
4 if (!p)
|
|
5 return -ENOMEM;
|
|
6 spin_lock(&gp_lock);
|
|
7 if (rcu_access_pointer(gp)) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
11 p->a = a;
|
|
12 p->b = a;
|
|
13 gp = p; /* ORDERING BUG */
|
|
14 spin_unlock(&gp_lock);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The problem is that both the compiler and weakly ordered CPUs are within
|
|
their rights to reorder this code as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool add_gp_buggy_optimized(int a, int b)
|
|
2 {
|
|
3 p = kmalloc(sizeof(*p), GFP_KERNEL);
|
|
4 if (!p)
|
|
5 return -ENOMEM;
|
|
6 spin_lock(&gp_lock);
|
|
7 if (rcu_access_pointer(gp)) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
<b>11 gp = p; /* ORDERING BUG */
|
|
12 p->a = a;
|
|
13 p->b = a;</b>
|
|
14 spin_unlock(&gp_lock);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
If an RCU reader fetches <tt>gp</tt> just after
|
|
<tt>add_gp_buggy_optimized</tt> executes line 11,
|
|
it will see garbage in the <tt>->a</tt> and <tt>->b</tt>
|
|
fields.
|
|
And this is but one of many ways in which compiler and hardware optimizations
|
|
could cause trouble.
|
|
Therefore, we clearly need some way to prevent the compiler and the CPU from
|
|
reordering in this manner, which brings us to the publish-subscribe
|
|
guarantee discussed in the next section.
|
|
|
|
<h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3>
|
|
|
|
<p>
|
|
RCU's publish-subscribe guarantee allows data to be inserted
|
|
into a linked data structure without disrupting RCU readers.
|
|
The updater uses <tt>rcu_assign_pointer()</tt> to insert the
|
|
new data, and readers use <tt>rcu_dereference()</tt> to
|
|
access data, whether new or old.
|
|
The following shows an example of insertion:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool add_gp(int a, int b)
|
|
2 {
|
|
3 p = kmalloc(sizeof(*p), GFP_KERNEL);
|
|
4 if (!p)
|
|
5 return -ENOMEM;
|
|
6 spin_lock(&gp_lock);
|
|
7 if (rcu_access_pointer(gp)) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
11 p->a = a;
|
|
12 p->b = a;
|
|
13 rcu_assign_pointer(gp, p);
|
|
14 spin_unlock(&gp_lock);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually
|
|
equivalent to a simple assignment statement, but also guarantees
|
|
that its assignment will
|
|
happen after the two assignments in lines 11 and 12,
|
|
similar to the C11 <tt>memory_order_release</tt> store operation.
|
|
It also prevents any number of “interesting” compiler
|
|
optimizations, for example, the use of <tt>gp</tt> as a scratch
|
|
location immediately preceding the assignment.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
But <tt>rcu_assign_pointer()</tt> does nothing to prevent the
|
|
two assignments to <tt>p->a</tt> and <tt>p->b</tt>
|
|
from being reordered.
|
|
Can't that also cause problems?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
No, it cannot.
|
|
The readers cannot see either of these two fields until
|
|
the assignment to <tt>gp</tt>, by which time both fields are
|
|
fully initialized.
|
|
So reordering the assignments
|
|
to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly
|
|
cause any problems.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
It is tempting to assume that the reader need not do anything special
|
|
to control its accesses to the RCU-protected data,
|
|
as shown in <tt>do_something_gp_buggy()</tt> below:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool do_something_gp_buggy(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 p = gp; /* OPTIMIZATIONS GALORE!!! */
|
|
5 if (p) {
|
|
6 do_something(p->a, p->b);
|
|
7 rcu_read_unlock();
|
|
8 return true;
|
|
9 }
|
|
10 rcu_read_unlock();
|
|
11 return false;
|
|
12 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
However, this temptation must be resisted because there are a
|
|
surprisingly large number of ways that the compiler
|
|
(to say nothing of
|
|
<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>)
|
|
can trip this code up.
|
|
For but one example, if the compiler were short of registers, it
|
|
might choose to refetch from <tt>gp</tt> rather than keeping
|
|
a separate copy in <tt>p</tt> as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool do_something_gp_buggy_optimized(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 if (gp) { /* OPTIMIZATIONS GALORE!!! */
|
|
<b> 5 do_something(gp->a, gp->b);</b>
|
|
6 rcu_read_unlock();
|
|
7 return true;
|
|
8 }
|
|
9 rcu_read_unlock();
|
|
10 return false;
|
|
11 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
If this function ran concurrently with a series of updates that
|
|
replaced the current structure with a new one,
|
|
the fetches of <tt>gp->a</tt>
|
|
and <tt>gp->b</tt> might well come from two different structures,
|
|
which could cause serious confusion.
|
|
To prevent this (and much else besides), <tt>do_something_gp()</tt> uses
|
|
<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool do_something_gp(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 p = rcu_dereference(gp);
|
|
5 if (p) {
|
|
6 do_something(p->a, p->b);
|
|
7 rcu_read_unlock();
|
|
8 return true;
|
|
9 }
|
|
10 rcu_read_unlock();
|
|
11 return false;
|
|
12 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha)
|
|
memory barriers in the Linux kernel.
|
|
Should a
|
|
<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a>
|
|
ever appear, then <tt>rcu_dereference()</tt> could be implemented
|
|
as a <tt>memory_order_consume</tt> load.
|
|
Regardless of the exact implementation, a pointer fetched by
|
|
<tt>rcu_dereference()</tt> may not be used outside of the
|
|
outermost RCU read-side critical section containing that
|
|
<tt>rcu_dereference()</tt>, unless protection of
|
|
the corresponding data element has been passed from RCU to some
|
|
other synchronization mechanism, most commonly locking or
|
|
<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>.
|
|
|
|
<p>
|
|
In short, updaters use <tt>rcu_assign_pointer()</tt> and readers
|
|
use <tt>rcu_dereference()</tt>, and these two RCU API elements
|
|
work together to ensure that readers have a consistent view of
|
|
newly added data elements.
|
|
|
|
<p>
|
|
Of course, it is also necessary to remove elements from RCU-protected
|
|
data structures, for example, using the following process:
|
|
|
|
<ol>
|
|
<li> Remove the data element from the enclosing structure.
|
|
<li> Wait for all pre-existing RCU read-side critical sections
|
|
to complete (because only pre-existing readers can possibly have
|
|
a reference to the newly removed data element).
|
|
<li> At this point, only the updater has a reference to the
|
|
newly removed data element, so it can safely reclaim
|
|
the data element, for example, by passing it to <tt>kfree()</tt>.
|
|
</ol>
|
|
|
|
This process is implemented by <tt>remove_gp_synchronous()</tt>:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool remove_gp_synchronous(void)
|
|
2 {
|
|
3 struct foo *p;
|
|
4
|
|
5 spin_lock(&gp_lock);
|
|
6 p = rcu_access_pointer(gp);
|
|
7 if (!p) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
11 rcu_assign_pointer(gp, NULL);
|
|
12 spin_unlock(&gp_lock);
|
|
13 synchronize_rcu();
|
|
14 kfree(p);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
This function is straightforward, with line 13 waiting for a grace
|
|
period before line 14 frees the old data element.
|
|
This waiting ensures that readers will reach line 7 of
|
|
<tt>do_something_gp()</tt> before the data element referenced by
|
|
<tt>p</tt> is freed.
|
|
The <tt>rcu_access_pointer()</tt> on line 6 is similar to
|
|
<tt>rcu_dereference()</tt>, except that:
|
|
|
|
<ol>
|
|
<li> The value returned by <tt>rcu_access_pointer()</tt>
|
|
cannot be dereferenced.
|
|
If you want to access the value pointed to as well as
|
|
the pointer itself, use <tt>rcu_dereference()</tt>
|
|
instead of <tt>rcu_access_pointer()</tt>.
|
|
<li> The call to <tt>rcu_access_pointer()</tt> need not be
|
|
protected.
|
|
In contrast, <tt>rcu_dereference()</tt> must either be
|
|
within an RCU read-side critical section or in a code
|
|
segment where the pointer cannot change, for example, in
|
|
code protected by the corresponding update-side lock.
|
|
</ol>
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Without the <tt>rcu_dereference()</tt> or the
|
|
<tt>rcu_access_pointer()</tt>, what destructive optimizations
|
|
might the compiler make use of?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Let's start with what happens to <tt>do_something_gp()</tt>
|
|
if it fails to use <tt>rcu_dereference()</tt>.
|
|
It could reuse a value formerly fetched from this same pointer.
|
|
It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time
|
|
manner, resulting in <i>load tearing</i>, in turn resulting a bytewise
|
|
mash-up of two distinct pointer values.
|
|
It might even use value-speculation optimizations, where it makes
|
|
a wrong guess, but by the time it gets around to checking the
|
|
value, an update has changed the pointer to match the wrong guess.
|
|
Too bad about any dereferences that returned pre-initialization garbage
|
|
in the meantime!
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
For <tt>remove_gp_synchronous()</tt>, as long as all modifications
|
|
to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>,
|
|
the above optimizations are harmless.
|
|
However, <tt>sparse</tt> will complain if you
|
|
define <tt>gp</tt> with <tt>__rcu</tt> and then
|
|
access it without using
|
|
either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
In short, RCU's publish-subscribe guarantee is provided by the combination
|
|
of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>.
|
|
This guarantee allows data elements to be safely added to RCU-protected
|
|
linked data structures without disrupting RCU readers.
|
|
This guarantee can be used in combination with the grace-period
|
|
guarantee to also allow data elements to be removed from RCU-protected
|
|
linked data structures, again without disrupting RCU readers.
|
|
|
|
<p>
|
|
This guarantee was only partially premeditated.
|
|
DYNIX/ptx used an explicit memory barrier for publication, but had nothing
|
|
resembling <tt>rcu_dereference()</tt> for subscription, nor did it
|
|
have anything resembling the <tt>smp_read_barrier_depends()</tt>
|
|
that was later subsumed into <tt>rcu_dereference()</tt> and later
|
|
still into <tt>READ_ONCE()</tt>.
|
|
The need for these operations made itself known quite suddenly at a
|
|
late-1990s meeting with the DEC Alpha architects, back in the days when
|
|
DEC was still a free-standing company.
|
|
It took the Alpha architects a good hour to convince me that any sort
|
|
of barrier would ever be needed, and it then took me a good <i>two</i> hours
|
|
to convince them that their documentation did not make this point clear.
|
|
More recent work with the C and C++ standards committees have provided
|
|
much education on tricks and traps from the compiler.
|
|
In short, compilers were much less tricky in the early 1990s, but in
|
|
2015, don't even think about omitting <tt>rcu_dereference()</tt>!
|
|
|
|
<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3>
|
|
|
|
<p>
|
|
The previous section's simple linked-data-structure scenario clearly
|
|
demonstrates the need for RCU's stringent memory-ordering guarantees on
|
|
systems with more than one CPU:
|
|
|
|
<ol>
|
|
<li> Each CPU that has an RCU read-side critical section that
|
|
begins before <tt>synchronize_rcu()</tt> starts is
|
|
guaranteed to execute a full memory barrier between the time
|
|
that the RCU read-side critical section ends and the time that
|
|
<tt>synchronize_rcu()</tt> returns.
|
|
Without this guarantee, a pre-existing RCU read-side critical section
|
|
might hold a reference to the newly removed <tt>struct foo</tt>
|
|
after the <tt>kfree()</tt> on line 14 of
|
|
<tt>remove_gp_synchronous()</tt>.
|
|
<li> Each CPU that has an RCU read-side critical section that ends
|
|
after <tt>synchronize_rcu()</tt> returns is guaranteed
|
|
to execute a full memory barrier between the time that
|
|
<tt>synchronize_rcu()</tt> begins and the time that the RCU
|
|
read-side critical section begins.
|
|
Without this guarantee, a later RCU read-side critical section
|
|
running after the <tt>kfree()</tt> on line 14 of
|
|
<tt>remove_gp_synchronous()</tt> might
|
|
later run <tt>do_something_gp()</tt> and find the
|
|
newly deleted <tt>struct foo</tt>.
|
|
<li> If the task invoking <tt>synchronize_rcu()</tt> remains
|
|
on a given CPU, then that CPU is guaranteed to execute a full
|
|
memory barrier sometime during the execution of
|
|
<tt>synchronize_rcu()</tt>.
|
|
This guarantee ensures that the <tt>kfree()</tt> on
|
|
line 14 of <tt>remove_gp_synchronous()</tt> really does
|
|
execute after the removal on line 11.
|
|
<li> If the task invoking <tt>synchronize_rcu()</tt> migrates
|
|
among a group of CPUs during that invocation, then each of the
|
|
CPUs in that group is guaranteed to execute a full memory barrier
|
|
sometime during the execution of <tt>synchronize_rcu()</tt>.
|
|
This guarantee also ensures that the <tt>kfree()</tt> on
|
|
line 14 of <tt>remove_gp_synchronous()</tt> really does
|
|
execute after the removal on
|
|
line 11, but also in the case where the thread executing the
|
|
<tt>synchronize_rcu()</tt> migrates in the meantime.
|
|
</ol>
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Given that multiple CPUs can start RCU read-side critical sections
|
|
at any time without any ordering whatsoever, how can RCU possibly
|
|
tell whether or not a given RCU read-side critical section starts
|
|
before a given instance of <tt>synchronize_rcu()</tt>?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
If RCU cannot tell whether or not a given
|
|
RCU read-side critical section starts before a
|
|
given instance of <tt>synchronize_rcu()</tt>,
|
|
then it must assume that the RCU read-side critical section
|
|
started first.
|
|
In other words, a given instance of <tt>synchronize_rcu()</tt>
|
|
can avoid waiting on a given RCU read-side critical section only
|
|
if it can prove that <tt>synchronize_rcu()</tt> started first.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
A related question is “When <tt>rcu_read_lock()</tt>
|
|
doesn't generate any code, why does it matter how it relates
|
|
to a grace period?”
|
|
The answer is that it is not the relationship of
|
|
<tt>rcu_read_lock()</tt> itself that is important, but rather
|
|
the relationship of the code within the enclosed RCU read-side
|
|
critical section to the code preceding and following the
|
|
grace period.
|
|
If we take this viewpoint, then a given RCU read-side critical
|
|
section begins before a given grace period when some access
|
|
preceding the grace period observes the effect of some access
|
|
within the critical section, in which case none of the accesses
|
|
within the critical section may observe the effects of any
|
|
access following the grace period.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
As of late 2016, mathematical models of RCU take this
|
|
viewpoint, for example, see slides 62 and 63
|
|
of the
|
|
<a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a>
|
|
presentation.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
The first and second guarantees require unbelievably strict ordering!
|
|
Are all these memory barriers <i> really</i> required?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Yes, they really are required.
|
|
To see why the first guarantee is required, consider the following
|
|
sequence of events:
|
|
</font>
|
|
|
|
<ol>
|
|
<li> <font color="ffffff">
|
|
CPU 1: <tt>rcu_read_lock()</tt>
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 1: <tt>q = rcu_dereference(gp);
|
|
/* Very likely to return p. */</tt>
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 0: <tt>list_del_rcu(p);</tt>
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 0: <tt>synchronize_rcu()</tt> starts.
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 1: <tt>do_something_with(q->a);
|
|
/* No smp_mb(), so might happen after kfree(). */</tt>
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 1: <tt>rcu_read_unlock()</tt>
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 0: <tt>synchronize_rcu()</tt> returns.
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 0: <tt>kfree(p);</tt>
|
|
</font>
|
|
</ol>
|
|
|
|
<p><font color="ffffff">
|
|
Therefore, there absolutely must be a full memory barrier between the
|
|
end of the RCU read-side critical section and the end of the
|
|
grace period.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
The sequence of events demonstrating the necessity of the second rule
|
|
is roughly similar:
|
|
</font>
|
|
|
|
<ol>
|
|
<li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt>
|
|
</font>
|
|
<li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts.
|
|
</font>
|
|
<li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt>
|
|
</font>
|
|
<li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp);
|
|
/* Might return p if no memory barrier. */</tt>
|
|
</font>
|
|
<li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns.
|
|
</font>
|
|
<li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt>
|
|
</font>
|
|
<li> <font color="ffffff">
|
|
CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt>
|
|
</font>
|
|
<li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt>
|
|
</font>
|
|
</ol>
|
|
|
|
<p><font color="ffffff">
|
|
And similarly, without a memory barrier between the beginning of the
|
|
grace period and the beginning of the RCU read-side critical section,
|
|
CPU 1 might end up accessing the freelist.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
The “as if” rule of course applies, so that any
|
|
implementation that acts as if the appropriate memory barriers
|
|
were in place is a correct implementation.
|
|
That said, it is much easier to fool yourself into believing
|
|
that you have adhered to the as-if rule than it is to actually
|
|
adhere to it!
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>
|
|
generate absolutely no code in some kernel builds.
|
|
This means that the compiler might arbitrarily rearrange consecutive
|
|
RCU read-side critical sections.
|
|
Given such rearrangement, if a given RCU read-side critical section
|
|
is done, how can you be sure that all prior RCU read-side critical
|
|
sections are done?
|
|
Won't the compiler rearrangements make that impossible to determine?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>
|
|
generate absolutely no code, RCU infers quiescent states only at
|
|
special locations, for example, within the scheduler.
|
|
Because calls to <tt>schedule()</tt> had better prevent calling-code
|
|
accesses to shared variables from being rearranged across the call to
|
|
<tt>schedule()</tt>, if RCU detects the end of a given RCU read-side
|
|
critical section, it will necessarily detect the end of all prior
|
|
RCU read-side critical sections, no matter how aggressively the
|
|
compiler scrambles the code.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
Again, this all assumes that the compiler cannot scramble code across
|
|
calls to the scheduler, out of interrupt handlers, into the idle loop,
|
|
into user-mode code, and so on.
|
|
But if your kernel build allows that sort of scrambling, you have broken
|
|
far more than just RCU!
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
Note that these memory-barrier requirements do not replace the fundamental
|
|
RCU requirement that a grace period wait for all pre-existing readers.
|
|
On the contrary, the memory barriers called out in this section must operate in
|
|
such a way as to <i>enforce</i> this fundamental requirement.
|
|
Of course, different implementations enforce this requirement in different
|
|
ways, but enforce it they must.
|
|
|
|
<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3>
|
|
|
|
<p>
|
|
The common-case RCU primitives are unconditional.
|
|
They are invoked, they do their job, and they return, with no possibility
|
|
of error, and no need to retry.
|
|
This is a key RCU design philosophy.
|
|
|
|
<p>
|
|
However, this philosophy is pragmatic rather than pigheaded.
|
|
If someone comes up with a good justification for a particular conditional
|
|
RCU primitive, it might well be implemented and added.
|
|
After all, this guarantee was reverse-engineered, not premeditated.
|
|
The unconditional nature of the RCU primitives was initially an
|
|
accident of implementation, and later experience with synchronization
|
|
primitives with conditional primitives caused me to elevate this
|
|
accident to a guarantee.
|
|
Therefore, the justification for adding a conditional primitive to
|
|
RCU would need to be based on detailed and compelling use cases.
|
|
|
|
<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3>
|
|
|
|
<p>
|
|
As far as RCU is concerned, it is always possible to carry out an
|
|
update within an RCU read-side critical section.
|
|
For example, that RCU read-side critical section might search for
|
|
a given data element, and then might acquire the update-side
|
|
spinlock in order to update that element, all while remaining
|
|
in that RCU read-side critical section.
|
|
Of course, it is necessary to exit the RCU read-side critical section
|
|
before invoking <tt>synchronize_rcu()</tt>, however, this
|
|
inconvenience can be avoided through use of the
|
|
<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members
|
|
described later in this document.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
But how does the upgrade-to-write operation exclude other readers?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
It doesn't, just like normal RCU updates, which also do not exclude
|
|
RCU readers.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
This guarantee allows lookup code to be shared between read-side
|
|
and update-side code, and was premeditated, appearing in the earliest
|
|
DYNIX/ptx RCU documentation.
|
|
|
|
<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2>
|
|
|
|
<p>
|
|
RCU provides extremely lightweight readers, and its read-side guarantees,
|
|
though quite useful, are correspondingly lightweight.
|
|
It is therefore all too easy to assume that RCU is guaranteeing more
|
|
than it really is.
|
|
Of course, the list of things that RCU does not guarantee is infinitely
|
|
long, however, the following sections list a few non-guarantees that
|
|
have caused confusion.
|
|
Except where otherwise noted, these non-guarantees were premeditated.
|
|
|
|
<ol>
|
|
<li> <a href="#Readers Impose Minimal Ordering">
|
|
Readers Impose Minimal Ordering</a>
|
|
<li> <a href="#Readers Do Not Exclude Updaters">
|
|
Readers Do Not Exclude Updaters</a>
|
|
<li> <a href="#Updaters Only Wait For Old Readers">
|
|
Updaters Only Wait For Old Readers</a>
|
|
<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections">
|
|
Grace Periods Don't Partition Read-Side Critical Sections</a>
|
|
<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods">
|
|
Read-Side Critical Sections Don't Partition Grace Periods</a>
|
|
</ol>
|
|
|
|
<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3>
|
|
|
|
<p>
|
|
Reader-side markers such as <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees
|
|
except through their interaction with the grace-period APIs such as
|
|
<tt>synchronize_rcu()</tt>.
|
|
To see this, consider the following pair of threads:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(x, 1);
|
|
5 rcu_read_unlock();
|
|
6 rcu_read_lock();
|
|
7 WRITE_ONCE(y, 1);
|
|
8 rcu_read_unlock();
|
|
9 }
|
|
10
|
|
11 void thread1(void)
|
|
12 {
|
|
13 rcu_read_lock();
|
|
14 r1 = READ_ONCE(y);
|
|
15 rcu_read_unlock();
|
|
16 rcu_read_lock();
|
|
17 r2 = READ_ONCE(x);
|
|
18 rcu_read_unlock();
|
|
19 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
After <tt>thread0()</tt> and <tt>thread1()</tt> execute
|
|
concurrently, it is quite possible to have
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 0)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>),
|
|
which would not be possible if <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> had much in the way of ordering
|
|
properties.
|
|
But they do not, so the CPU is within its rights
|
|
to do significant reordering.
|
|
This is by design: Any significant ordering constraints would slow down
|
|
these fast-path APIs.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Can't the compiler also reorder this code?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
No, the volatile casts in <tt>READ_ONCE()</tt> and
|
|
<tt>WRITE_ONCE()</tt> prevent the compiler from reordering in
|
|
this particular case.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3>
|
|
|
|
<p>
|
|
Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt>
|
|
exclude updates.
|
|
All they do is to prevent grace periods from ending.
|
|
The following example illustrates this:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 r1 = READ_ONCE(y);
|
|
5 if (r1) {
|
|
6 do_something_with_nonzero_x();
|
|
7 r2 = READ_ONCE(x);
|
|
8 WARN_ON(!r2); /* BUG!!! */
|
|
9 }
|
|
10 rcu_read_unlock();
|
|
11 }
|
|
12
|
|
13 void thread1(void)
|
|
14 {
|
|
15 spin_lock(&my_lock);
|
|
16 WRITE_ONCE(x, 1);
|
|
17 WRITE_ONCE(y, 1);
|
|
18 spin_unlock(&my_lock);
|
|
19 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt>
|
|
excluded the <tt>thread1()</tt> function's update,
|
|
the <tt>WARN_ON()</tt> could never fire.
|
|
But the fact is that <tt>rcu_read_lock()</tt> does not exclude
|
|
much of anything aside from subsequent grace periods, of which
|
|
<tt>thread1()</tt> has none, so the
|
|
<tt>WARN_ON()</tt> can and does fire.
|
|
|
|
<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3>
|
|
|
|
<p>
|
|
It might be tempting to assume that after <tt>synchronize_rcu()</tt>
|
|
completes, there are no readers executing.
|
|
This temptation must be avoided because
|
|
new readers can start immediately after <tt>synchronize_rcu()</tt>
|
|
starts, and <tt>synchronize_rcu()</tt> is under no
|
|
obligation to wait for these new readers.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Suppose that synchronize_rcu() did wait until <i>all</i>
|
|
readers had completed instead of waiting only on
|
|
pre-existing readers.
|
|
For how long would the updater be able to rely on there
|
|
being no readers?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
For no time at all.
|
|
Even if <tt>synchronize_rcu()</tt> were to wait until
|
|
all readers had completed, a new reader might start immediately after
|
|
<tt>synchronize_rcu()</tt> completed.
|
|
Therefore, the code following
|
|
<tt>synchronize_rcu()</tt> can <i>never</i> rely on there being
|
|
no readers.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections">
|
|
Grace Periods Don't Partition Read-Side Critical Sections</a></h3>
|
|
|
|
<p>
|
|
It is tempting to assume that if any part of one RCU read-side critical
|
|
section precedes a given grace period, and if any part of another RCU
|
|
read-side critical section follows that same grace period, then all of
|
|
the first RCU read-side critical section must precede all of the second.
|
|
However, this just isn't the case: A single grace period does not
|
|
partition the set of RCU read-side critical sections.
|
|
An example of this situation can be illustrated as follows, where
|
|
<tt>x</tt>, <tt>y</tt>, and <tt>z</tt> are initially all zero:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(a, 1);
|
|
5 WRITE_ONCE(b, 1);
|
|
6 rcu_read_unlock();
|
|
7 }
|
|
8
|
|
9 void thread1(void)
|
|
10 {
|
|
11 r1 = READ_ONCE(a);
|
|
12 synchronize_rcu();
|
|
13 WRITE_ONCE(c, 1);
|
|
14 }
|
|
15
|
|
16 void thread2(void)
|
|
17 {
|
|
18 rcu_read_lock();
|
|
19 r2 = READ_ONCE(b);
|
|
20 r3 = READ_ONCE(c);
|
|
21 rcu_read_unlock();
|
|
22 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
It turns out that the outcome:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 0 && r3 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
is entirely possible.
|
|
The following figure show how this can happen, with each circled
|
|
<tt>QS</tt> indicating the point at which RCU recorded a
|
|
<i>quiescent state</i> for each thread, that is, a state in which
|
|
RCU knows that the thread cannot be in the midst of an RCU read-side
|
|
critical section that started before the current grace period:
|
|
|
|
<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p>
|
|
|
|
<p>
|
|
If it is necessary to partition RCU read-side critical sections in this
|
|
manner, it is necessary to use two grace periods, where the first
|
|
grace period is known to end before the second grace period starts:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(a, 1);
|
|
5 WRITE_ONCE(b, 1);
|
|
6 rcu_read_unlock();
|
|
7 }
|
|
8
|
|
9 void thread1(void)
|
|
10 {
|
|
11 r1 = READ_ONCE(a);
|
|
12 synchronize_rcu();
|
|
13 WRITE_ONCE(c, 1);
|
|
14 }
|
|
15
|
|
16 void thread2(void)
|
|
17 {
|
|
18 r2 = READ_ONCE(c);
|
|
19 synchronize_rcu();
|
|
20 WRITE_ONCE(d, 1);
|
|
21 }
|
|
22
|
|
23 void thread3(void)
|
|
24 {
|
|
25 rcu_read_lock();
|
|
26 r3 = READ_ONCE(b);
|
|
27 r4 = READ_ONCE(d);
|
|
28 rcu_read_unlock();
|
|
29 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
Here, if <tt>(r1 == 1)</tt>, then
|
|
<tt>thread0()</tt>'s write to <tt>b</tt> must happen
|
|
before the end of <tt>thread1()</tt>'s grace period.
|
|
If in addition <tt>(r4 == 1)</tt>, then
|
|
<tt>thread3()</tt>'s read from <tt>b</tt> must happen
|
|
after the beginning of <tt>thread2()</tt>'s grace period.
|
|
If it is also the case that <tt>(r2 == 1)</tt>, then the
|
|
end of <tt>thread1()</tt>'s grace period must precede the
|
|
beginning of <tt>thread2()</tt>'s grace period.
|
|
This mean that the two RCU read-side critical sections cannot overlap,
|
|
guaranteeing that <tt>(r3 == 1)</tt>.
|
|
As a result, the outcome:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
cannot happen.
|
|
|
|
<p>
|
|
This non-requirement was also non-premeditated, but became apparent
|
|
when studying RCU's interaction with memory ordering.
|
|
|
|
<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods">
|
|
Read-Side Critical Sections Don't Partition Grace Periods</a></h3>
|
|
|
|
<p>
|
|
It is also tempting to assume that if an RCU read-side critical section
|
|
happens between a pair of grace periods, then those grace periods cannot
|
|
overlap.
|
|
However, this temptation leads nowhere good, as can be illustrated by
|
|
the following, with all variables initially zero:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(a, 1);
|
|
5 WRITE_ONCE(b, 1);
|
|
6 rcu_read_unlock();
|
|
7 }
|
|
8
|
|
9 void thread1(void)
|
|
10 {
|
|
11 r1 = READ_ONCE(a);
|
|
12 synchronize_rcu();
|
|
13 WRITE_ONCE(c, 1);
|
|
14 }
|
|
15
|
|
16 void thread2(void)
|
|
17 {
|
|
18 rcu_read_lock();
|
|
19 WRITE_ONCE(d, 1);
|
|
20 r2 = READ_ONCE(c);
|
|
21 rcu_read_unlock();
|
|
22 }
|
|
23
|
|
24 void thread3(void)
|
|
25 {
|
|
26 r3 = READ_ONCE(d);
|
|
27 synchronize_rcu();
|
|
28 WRITE_ONCE(e, 1);
|
|
29 }
|
|
30
|
|
31 void thread4(void)
|
|
32 {
|
|
33 rcu_read_lock();
|
|
34 r4 = READ_ONCE(b);
|
|
35 r5 = READ_ONCE(e);
|
|
36 rcu_read_unlock();
|
|
37 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
In this case, the outcome:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
is entirely possible, as illustrated below:
|
|
|
|
<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p>
|
|
|
|
<p>
|
|
Again, an RCU read-side critical section can overlap almost all of a
|
|
given grace period, just so long as it does not overlap the entire
|
|
grace period.
|
|
As a result, an RCU read-side critical section cannot partition a pair
|
|
of RCU grace periods.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
How long a sequence of grace periods, each separated by an RCU
|
|
read-side critical section, would be required to partition the RCU
|
|
read-side critical sections at the beginning and end of the chain?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
In theory, an infinite number.
|
|
In practice, an unknown number that is sensitive to both implementation
|
|
details and timing considerations.
|
|
Therefore, even in practice, RCU users must abide by the
|
|
theoretical rather than the practical answer.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>
|
|
|
|
<p>
|
|
These parallelism facts of life are by no means specific to RCU, but
|
|
the RCU implementation must abide by them.
|
|
They therefore bear repeating:
|
|
|
|
<ol>
|
|
<li> Any CPU or task may be delayed at any time,
|
|
and any attempts to avoid these delays by disabling
|
|
preemption, interrupts, or whatever are completely futile.
|
|
This is most obvious in preemptible user-level
|
|
environments and in virtualized environments (where
|
|
a given guest OS's VCPUs can be preempted at any time by
|
|
the underlying hypervisor), but can also happen in bare-metal
|
|
environments due to ECC errors, NMIs, and other hardware
|
|
events.
|
|
Although a delay of more than about 20 seconds can result
|
|
in splats, the RCU implementation is obligated to use
|
|
algorithms that can tolerate extremely long delays, but where
|
|
“extremely long” is not long enough to allow
|
|
wrap-around when incrementing a 64-bit counter.
|
|
<li> Both the compiler and the CPU can reorder memory accesses.
|
|
Where it matters, RCU must use compiler directives and
|
|
memory-barrier instructions to preserve ordering.
|
|
<li> Conflicting writes to memory locations in any given cache line
|
|
will result in expensive cache misses.
|
|
Greater numbers of concurrent writes and more-frequent
|
|
concurrent writes will result in more dramatic slowdowns.
|
|
RCU is therefore obligated to use algorithms that have
|
|
sufficient locality to avoid significant performance and
|
|
scalability problems.
|
|
<li> As a rough rule of thumb, only one CPU's worth of processing
|
|
may be carried out under the protection of any given exclusive
|
|
lock.
|
|
RCU must therefore use scalable locking designs.
|
|
<li> Counters are finite, especially on 32-bit systems.
|
|
RCU's use of counters must therefore tolerate counter wrap,
|
|
or be designed such that counter wrap would take way more
|
|
time than a single system is likely to run.
|
|
An uptime of ten years is quite possible, a runtime
|
|
of a century much less so.
|
|
As an example of the latter, RCU's dyntick-idle nesting counter
|
|
allows 54 bits for interrupt nesting level (this counter
|
|
is 64 bits even on a 32-bit system).
|
|
Overflowing this counter requires 2<sup>54</sup>
|
|
half-interrupts on a given CPU without that CPU ever going idle.
|
|
If a half-interrupt happened every microsecond, it would take
|
|
570 years of runtime to overflow this counter, which is currently
|
|
believed to be an acceptably long time.
|
|
<li> Linux systems can have thousands of CPUs running a single
|
|
Linux kernel in a single shared-memory environment.
|
|
RCU must therefore pay close attention to high-end scalability.
|
|
</ol>
|
|
|
|
<p>
|
|
This last parallelism fact of life means that RCU must pay special
|
|
attention to the preceding facts of life.
|
|
The idea that Linux might scale to systems with thousands of CPUs would
|
|
have been met with some skepticism in the 1990s, but these requirements
|
|
would have otherwise have been unsurprising, even in the early 1990s.
|
|
|
|
<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2>
|
|
|
|
<p>
|
|
These sections list quality-of-implementation requirements.
|
|
Although an RCU implementation that ignores these requirements could
|
|
still be used, it would likely be subject to limitations that would
|
|
make it inappropriate for industrial-strength production use.
|
|
Classes of quality-of-implementation requirements are as follows:
|
|
|
|
<ol>
|
|
<li> <a href="#Specialization">Specialization</a>
|
|
<li> <a href="#Performance and Scalability">Performance and Scalability</a>
|
|
<li> <a href="#Forward Progress">Forward Progress</a>
|
|
<li> <a href="#Composability">Composability</a>
|
|
<li> <a href="#Corner Cases">Corner Cases</a>
|
|
</ol>
|
|
|
|
<p>
|
|
These classes is covered in the following sections.
|
|
|
|
<h3><a name="Specialization">Specialization</a></h3>
|
|
|
|
<p>
|
|
RCU is and always has been intended primarily for read-mostly situations,
|
|
which means that RCU's read-side primitives are optimized, often at the
|
|
expense of its update-side primitives.
|
|
Experience thus far is captured by the following list of situations:
|
|
|
|
<ol>
|
|
<li> Read-mostly data, where stale and inconsistent data is not
|
|
a problem: RCU works great!
|
|
<li> Read-mostly data, where data must be consistent:
|
|
RCU works well.
|
|
<li> Read-write data, where data must be consistent:
|
|
RCU <i>might</i> work OK.
|
|
Or not.
|
|
<li> Write-mostly data, where data must be consistent:
|
|
RCU is very unlikely to be the right tool for the job,
|
|
with the following exceptions, where RCU can provide:
|
|
<ol type=a>
|
|
<li> Existence guarantees for update-friendly mechanisms.
|
|
<li> Wait-free read-side primitives for real-time use.
|
|
</ol>
|
|
</ol>
|
|
|
|
<p>
|
|
This focus on read-mostly situations means that RCU must interoperate
|
|
with other synchronization primitives.
|
|
For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt>
|
|
examples discussed earlier use RCU to protect readers and locking to
|
|
coordinate updaters.
|
|
However, the need extends much farther, requiring that a variety of
|
|
synchronization primitives be legal within RCU read-side critical sections,
|
|
including spinlocks, sequence locks, atomic operations, reference
|
|
counters, and memory barriers.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
What about sleeping locks?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
These are forbidden within Linux-kernel RCU read-side critical
|
|
sections because it is not legal to place a quiescent state
|
|
(in this case, voluntary context switch) within an RCU read-side
|
|
critical section.
|
|
However, sleeping locks may be used within userspace RCU read-side
|
|
critical sections, and also within Linux-kernel sleepable RCU
|
|
<a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a>
|
|
read-side critical sections.
|
|
In addition, the -rt patchset turns spinlocks into a
|
|
sleeping locks so that the corresponding critical sections
|
|
can be preempted, which also means that these sleeplockified
|
|
spinlocks (but not other sleeping locks!) may be acquire within
|
|
-rt-Linux-kernel RCU read-side critical sections.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
Note that it <i>is</i> legal for a normal RCU read-side
|
|
critical section to conditionally acquire a sleeping locks
|
|
(as in <tt>mutex_trylock()</tt>), but only as long as it does
|
|
not loop indefinitely attempting to conditionally acquire that
|
|
sleeping locks.
|
|
The key point is that things like <tt>mutex_trylock()</tt>
|
|
either return with the mutex held, or return an error indication if
|
|
the mutex was not immediately available.
|
|
Either way, <tt>mutex_trylock()</tt> returns immediately without
|
|
sleeping.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
It often comes as a surprise that many algorithms do not require a
|
|
consistent view of data, but many can function in that mode,
|
|
with network routing being the poster child.
|
|
Internet routing algorithms take significant time to propagate
|
|
updates, so that by the time an update arrives at a given system,
|
|
that system has been sending network traffic the wrong way for
|
|
a considerable length of time.
|
|
Having a few threads continue to send traffic the wrong way for a
|
|
few more milliseconds is clearly not a problem: In the worst case,
|
|
TCP retransmissions will eventually get the data where it needs to go.
|
|
In general, when tracking the state of the universe outside of the
|
|
computer, some level of inconsistency must be tolerated due to
|
|
speed-of-light delays if nothing else.
|
|
|
|
<p>
|
|
Furthermore, uncertainty about external state is inherent in many cases.
|
|
For example, a pair of veterinarians might use heartbeat to determine
|
|
whether or not a given cat was alive.
|
|
But how long should they wait after the last heartbeat to decide that
|
|
the cat is in fact dead?
|
|
Waiting less than 400 milliseconds makes no sense because this would
|
|
mean that a relaxed cat would be considered to cycle between death
|
|
and life more than 100 times per minute.
|
|
Moreover, just as with human beings, a cat's heart might stop for
|
|
some period of time, so the exact wait period is a judgment call.
|
|
One of our pair of veterinarians might wait 30 seconds before pronouncing
|
|
the cat dead, while the other might insist on waiting a full minute.
|
|
The two veterinarians would then disagree on the state of the cat during
|
|
the final 30 seconds of the minute following the last heartbeat.
|
|
|
|
<p>
|
|
Interestingly enough, this same situation applies to hardware.
|
|
When push comes to shove, how do we tell whether or not some
|
|
external server has failed?
|
|
We send messages to it periodically, and declare it failed if we
|
|
don't receive a response within a given period of time.
|
|
Policy decisions can usually tolerate short
|
|
periods of inconsistency.
|
|
The policy was decided some time ago, and is only now being put into
|
|
effect, so a few milliseconds of delay is normally inconsequential.
|
|
|
|
<p>
|
|
However, there are algorithms that absolutely must see consistent data.
|
|
For example, the translation between a user-level SystemV semaphore
|
|
ID to the corresponding in-kernel data structure is protected by RCU,
|
|
but it is absolutely forbidden to update a semaphore that has just been
|
|
removed.
|
|
In the Linux kernel, this need for consistency is accommodated by acquiring
|
|
spinlocks located in the in-kernel data structure from within
|
|
the RCU read-side critical section, and this is indicated by the
|
|
green box in the figure above.
|
|
Many other techniques may be used, and are in fact used within the
|
|
Linux kernel.
|
|
|
|
<p>
|
|
In short, RCU is not required to maintain consistency, and other
|
|
mechanisms may be used in concert with RCU when consistency is required.
|
|
RCU's specialization allows it to do its job extremely well, and its
|
|
ability to interoperate with other synchronization mechanisms allows
|
|
the right mix of synchronization tools to be used for a given job.
|
|
|
|
<h3><a name="Performance and Scalability">Performance and Scalability</a></h3>
|
|
|
|
<p>
|
|
Energy efficiency is a critical component of performance today,
|
|
and Linux-kernel RCU implementations must therefore avoid unnecessarily
|
|
awakening idle CPUs.
|
|
I cannot claim that this requirement was premeditated.
|
|
In fact, I learned of it during a telephone conversation in which I
|
|
was given “frank and open” feedback on the importance
|
|
of energy efficiency in battery-powered systems and on specific
|
|
energy-efficiency shortcomings of the Linux-kernel RCU implementation.
|
|
In my experience, the battery-powered embedded community will consider
|
|
any unnecessary wakeups to be extremely unfriendly acts.
|
|
So much so that mere Linux-kernel-mailing-list posts are
|
|
insufficient to vent their ire.
|
|
|
|
<p>
|
|
Memory consumption is not particularly important for in most
|
|
situations, and has become decreasingly
|
|
so as memory sizes have expanded and memory
|
|
costs have plummeted.
|
|
However, as I learned from Matt Mackall's
|
|
<a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a>
|
|
efforts, memory footprint is critically important on single-CPU systems with
|
|
non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus
|
|
<a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a>
|
|
was born.
|
|
Josh Triplett has since taken over the small-memory banner with his
|
|
<a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a>
|
|
project, which resulted in
|
|
<a href="#Sleepable RCU">SRCU</a>
|
|
becoming optional for those kernels not needing it.
|
|
|
|
<p>
|
|
The remaining performance requirements are, for the most part,
|
|
unsurprising.
|
|
For example, in keeping with RCU's read-side specialization,
|
|
<tt>rcu_dereference()</tt> should have negligible overhead (for
|
|
example, suppression of a few minor compiler optimizations).
|
|
Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> should have exactly zero overhead.
|
|
|
|
<p>
|
|
In preemptible environments, in the case where the RCU read-side
|
|
critical section was not preempted (as will be the case for the
|
|
highest-priority real-time process), <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> should have minimal overhead.
|
|
In particular, they should not contain atomic read-modify-write
|
|
operations, memory-barrier instructions, preemption disabling,
|
|
interrupt disabling, or backwards branches.
|
|
However, in the case where the RCU read-side critical section was preempted,
|
|
<tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts.
|
|
This is why it is better to nest an RCU read-side critical section
|
|
within a preempt-disable region than vice versa, at least in cases
|
|
where that critical section is short enough to avoid unduly degrading
|
|
real-time latencies.
|
|
|
|
<p>
|
|
The <tt>synchronize_rcu()</tt> grace-period-wait primitive is
|
|
optimized for throughput.
|
|
It may therefore incur several milliseconds of latency in addition to
|
|
the duration of the longest RCU read-side critical section.
|
|
On the other hand, multiple concurrent invocations of
|
|
<tt>synchronize_rcu()</tt> are required to use batching optimizations
|
|
so that they can be satisfied by a single underlying grace-period-wait
|
|
operation.
|
|
For example, in the Linux kernel, it is not unusual for a single
|
|
grace-period-wait operation to serve more than
|
|
<a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a>
|
|
of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation
|
|
overhead down to nearly zero.
|
|
However, the grace-period optimization is also required to avoid
|
|
measurable degradation of real-time scheduling and interrupt latencies.
|
|
|
|
<p>
|
|
In some cases, the multi-millisecond <tt>synchronize_rcu()</tt>
|
|
latencies are unacceptable.
|
|
In these cases, <tt>synchronize_rcu_expedited()</tt> may be used
|
|
instead, reducing the grace-period latency down to a few tens of
|
|
microseconds on small systems, at least in cases where the RCU read-side
|
|
critical sections are short.
|
|
There are currently no special latency requirements for
|
|
<tt>synchronize_rcu_expedited()</tt> on large systems, but,
|
|
consistent with the empirical nature of the RCU specification,
|
|
that is subject to change.
|
|
However, there most definitely are scalability requirements:
|
|
A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096
|
|
CPUs should at least make reasonable forward progress.
|
|
In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt>
|
|
is permitted to impose modest degradation of real-time latency
|
|
on non-idle online CPUs.
|
|
Here, “modest” means roughly the same latency
|
|
degradation as a scheduling-clock interrupt.
|
|
|
|
<p>
|
|
There are a number of situations where even
|
|
<tt>synchronize_rcu_expedited()</tt>'s reduced grace-period
|
|
latency is unacceptable.
|
|
In these situations, the asynchronous <tt>call_rcu()</tt> can be
|
|
used in place of <tt>synchronize_rcu()</tt> as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 struct foo {
|
|
2 int a;
|
|
3 int b;
|
|
4 struct rcu_head rh;
|
|
5 };
|
|
6
|
|
7 static void remove_gp_cb(struct rcu_head *rhp)
|
|
8 {
|
|
9 struct foo *p = container_of(rhp, struct foo, rh);
|
|
10
|
|
11 kfree(p);
|
|
12 }
|
|
13
|
|
14 bool remove_gp_asynchronous(void)
|
|
15 {
|
|
16 struct foo *p;
|
|
17
|
|
18 spin_lock(&gp_lock);
|
|
19 p = rcu_access_pointer(gp);
|
|
20 if (!p) {
|
|
21 spin_unlock(&gp_lock);
|
|
22 return false;
|
|
23 }
|
|
24 rcu_assign_pointer(gp, NULL);
|
|
25 call_rcu(&p->rh, remove_gp_cb);
|
|
26 spin_unlock(&gp_lock);
|
|
27 return true;
|
|
28 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
A definition of <tt>struct foo</tt> is finally needed, and appears
|
|
on lines 1-5.
|
|
The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt>
|
|
on line 25, and will be invoked after the end of a subsequent
|
|
grace period.
|
|
This gets the same effect as <tt>remove_gp_synchronous()</tt>,
|
|
but without forcing the updater to wait for a grace period to elapse.
|
|
The <tt>call_rcu()</tt> function may be used in a number of
|
|
situations where neither <tt>synchronize_rcu()</tt> nor
|
|
<tt>synchronize_rcu_expedited()</tt> would be legal,
|
|
including within preempt-disable code, <tt>local_bh_disable()</tt> code,
|
|
interrupt-disable code, and interrupt handlers.
|
|
However, even <tt>call_rcu()</tt> is illegal within NMI handlers
|
|
and from idle and offline CPUs.
|
|
The callback function (<tt>remove_gp_cb()</tt> in this case) will be
|
|
executed within softirq (software interrupt) environment within the
|
|
Linux kernel,
|
|
either within a real softirq handler or under the protection
|
|
of <tt>local_bh_disable()</tt>.
|
|
In both the Linux kernel and in userspace, it is bad practice to
|
|
write an RCU callback function that takes too long.
|
|
Long-running operations should be relegated to separate threads or
|
|
(in the Linux kernel) workqueues.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Why does line 19 use <tt>rcu_access_pointer()</tt>?
|
|
After all, <tt>call_rcu()</tt> on line 25 stores into the
|
|
structure, which would interact badly with concurrent insertions.
|
|
Doesn't this mean that <tt>rcu_dereference()</tt> is required?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes
|
|
any changes, including any insertions that <tt>rcu_dereference()</tt>
|
|
would protect against.
|
|
Therefore, any insertions will be delayed until after
|
|
<tt>->gp_lock</tt>
|
|
is released on line 25, which in turn means that
|
|
<tt>rcu_access_pointer()</tt> suffices.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
However, all that <tt>remove_gp_cb()</tt> is doing is
|
|
invoking <tt>kfree()</tt> on the data element.
|
|
This is a common idiom, and is supported by <tt>kfree_rcu()</tt>,
|
|
which allows “fire and forget” operation as shown below:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 struct foo {
|
|
2 int a;
|
|
3 int b;
|
|
4 struct rcu_head rh;
|
|
5 };
|
|
6
|
|
7 bool remove_gp_faf(void)
|
|
8 {
|
|
9 struct foo *p;
|
|
10
|
|
11 spin_lock(&gp_lock);
|
|
12 p = rcu_dereference(gp);
|
|
13 if (!p) {
|
|
14 spin_unlock(&gp_lock);
|
|
15 return false;
|
|
16 }
|
|
17 rcu_assign_pointer(gp, NULL);
|
|
18 kfree_rcu(p, rh);
|
|
19 spin_unlock(&gp_lock);
|
|
20 return true;
|
|
21 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
Note that <tt>remove_gp_faf()</tt> simply invokes
|
|
<tt>kfree_rcu()</tt> and proceeds, without any need to pay any
|
|
further attention to the subsequent grace period and <tt>kfree()</tt>.
|
|
It is permissible to invoke <tt>kfree_rcu()</tt> from the same
|
|
environments as for <tt>call_rcu()</tt>.
|
|
Interestingly enough, DYNIX/ptx had the equivalents of
|
|
<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not
|
|
<tt>synchronize_rcu()</tt>.
|
|
This was due to the fact that RCU was not heavily used within DYNIX/ptx,
|
|
so the very few places that needed something like
|
|
<tt>synchronize_rcu()</tt> simply open-coded it.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Earlier it was claimed that <tt>call_rcu()</tt> and
|
|
<tt>kfree_rcu()</tt> allowed updaters to avoid being blocked
|
|
by readers.
|
|
But how can that be correct, given that the invocation of the callback
|
|
and the freeing of the memory (respectively) must still wait for
|
|
a grace period to elapse?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
We could define things this way, but keep in mind that this sort of
|
|
definition would say that updates in garbage-collected languages
|
|
cannot complete until the next time the garbage collector runs,
|
|
which does not seem at all reasonable.
|
|
The key point is that in most cases, an updater using either
|
|
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the
|
|
next update as soon as it has invoked <tt>call_rcu()</tt> or
|
|
<tt>kfree_rcu()</tt>, without having to wait for a subsequent
|
|
grace period.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
But what if the updater must wait for the completion of code to be
|
|
executed after the end of the grace period, but has other tasks
|
|
that can be carried out in the meantime?
|
|
The polling-style <tt>get_state_synchronize_rcu()</tt> and
|
|
<tt>cond_synchronize_rcu()</tt> functions may be used for this
|
|
purpose, as shown below:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool remove_gp_poll(void)
|
|
2 {
|
|
3 struct foo *p;
|
|
4 unsigned long s;
|
|
5
|
|
6 spin_lock(&gp_lock);
|
|
7 p = rcu_access_pointer(gp);
|
|
8 if (!p) {
|
|
9 spin_unlock(&gp_lock);
|
|
10 return false;
|
|
11 }
|
|
12 rcu_assign_pointer(gp, NULL);
|
|
13 spin_unlock(&gp_lock);
|
|
14 s = get_state_synchronize_rcu();
|
|
15 do_something_while_waiting();
|
|
16 cond_synchronize_rcu(s);
|
|
17 kfree(p);
|
|
18 return true;
|
|
19 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a
|
|
“cookie” from RCU,
|
|
then line 15 carries out other tasks,
|
|
and finally, line 16 returns immediately if a grace period has
|
|
elapsed in the meantime, but otherwise waits as required.
|
|
The need for <tt>get_state_synchronize_rcu</tt> and
|
|
<tt>cond_synchronize_rcu()</tt> has appeared quite recently,
|
|
so it is too early to tell whether they will stand the test of time.
|
|
|
|
<p>
|
|
RCU thus provides a range of tools to allow updaters to strike the
|
|
required tradeoff between latency, flexibility and CPU overhead.
|
|
|
|
<h3><a name="Forward Progress">Forward Progress</a></h3>
|
|
|
|
<p>
|
|
In theory, delaying grace-period completion and callback invocation
|
|
is harmless.
|
|
In practice, not only are memory sizes finite but also callbacks sometimes
|
|
do wakeups, and sufficiently deferred wakeups can be difficult
|
|
to distinguish from system hangs.
|
|
Therefore, RCU must provide a number of mechanisms to promote forward
|
|
progress.
|
|
|
|
<p>
|
|
These mechanisms are not foolproof, nor can they be.
|
|
For one simple example, an infinite loop in an RCU read-side critical
|
|
section must by definition prevent later grace periods from ever completing.
|
|
For a more involved example, consider a 64-CPU system built with
|
|
<tt>CONFIG_RCU_NOCB_CPU=y</tt> and booted with <tt>rcu_nocbs=1-63</tt>,
|
|
where CPUs 1 through 63 spin in tight loops that invoke
|
|
<tt>call_rcu()</tt>.
|
|
Even if these tight loops also contain calls to <tt>cond_resched()</tt>
|
|
(thus allowing grace periods to complete), CPU 0 simply will
|
|
not be able to invoke callbacks as fast as the other 63 CPUs can
|
|
register them, at least not until the system runs out of memory.
|
|
In both of these examples, the Spiderman principle applies: With great
|
|
power comes great responsibility.
|
|
However, short of this level of abuse, RCU is required to
|
|
ensure timely completion of grace periods and timely invocation of
|
|
callbacks.
|
|
|
|
<p>
|
|
RCU takes the following steps to encourage timely completion of
|
|
grace periods:
|
|
|
|
<ol>
|
|
<li> If a grace period fails to complete within 100 milliseconds,
|
|
RCU causes future invocations of <tt>cond_resched()</tt> on
|
|
the holdout CPUs to provide an RCU quiescent state.
|
|
RCU also causes those CPUs' <tt>need_resched()</tt> invocations
|
|
to return <tt>true</tt>, but only after the corresponding CPU's
|
|
next scheduling-clock.
|
|
<li> CPUs mentioned in the <tt>nohz_full</tt> kernel boot parameter
|
|
can run indefinitely in the kernel without scheduling-clock
|
|
interrupts, which defeats the above <tt>need_resched()</tt>
|
|
strategem.
|
|
RCU will therefore invoke <tt>resched_cpu()</tt> on any
|
|
<tt>nohz_full</tt> CPUs still holding out after
|
|
109 milliseconds.
|
|
<li> In kernels built with <tt>CONFIG_RCU_BOOST=y</tt>, if a given
|
|
task that has been preempted within an RCU read-side critical
|
|
section is holding out for more than 500 milliseconds,
|
|
RCU will resort to priority boosting.
|
|
<li> If a CPU is still holding out 10 seconds into the grace
|
|
period, RCU will invoke <tt>resched_cpu()</tt> on it regardless
|
|
of its <tt>nohz_full</tt> state.
|
|
</ol>
|
|
|
|
<p>
|
|
The above values are defaults for systems running with <tt>HZ=1000</tt>.
|
|
They will vary as the value of <tt>HZ</tt> varies, and can also be
|
|
changed using the relevant Kconfig options and kernel boot parameters.
|
|
RCU currently does not do much sanity checking of these
|
|
parameters, so please use caution when changing them.
|
|
Note that these forward-progress measures are provided only for RCU,
|
|
not for
|
|
<a href="#Sleepable RCU">SRCU</a> or
|
|
<a href="#Tasks RCU">Tasks RCU</a>.
|
|
|
|
<p>
|
|
RCU takes the following steps in <tt>call_rcu()</tt> to encourage timely
|
|
invocation of callbacks when any given non-<tt>rcu_nocbs</tt> CPU has
|
|
10,000 callbacks, or has 10,000 more callbacks than it had the last time
|
|
encouragement was provided:
|
|
|
|
<ol>
|
|
<li> Starts a grace period, if one is not already in progress.
|
|
<li> Forces immediate checking for quiescent states, rather than
|
|
waiting for three milliseconds to have elapsed since the
|
|
beginning of the grace period.
|
|
<li> Immediately tags the CPU's callbacks with their grace period
|
|
completion numbers, rather than waiting for the <tt>RCU_SOFTIRQ</tt>
|
|
handler to get around to it.
|
|
<li> Lifts callback-execution batch limits, which speeds up callback
|
|
invocation at the expense of degrading realtime response.
|
|
</ol>
|
|
|
|
<p>
|
|
Again, these are default values when running at <tt>HZ=1000</tt>,
|
|
and can be overridden.
|
|
Again, these forward-progress measures are provided only for RCU,
|
|
not for
|
|
<a href="#Sleepable RCU">SRCU</a> or
|
|
<a href="#Tasks RCU">Tasks RCU</a>.
|
|
Even for RCU, callback-invocation forward progress for <tt>rcu_nocbs</tt>
|
|
CPUs is much less well-developed, in part because workloads benefiting
|
|
from <tt>rcu_nocbs</tt> CPUs tend to invoke <tt>call_rcu()</tt>
|
|
relatively infrequently.
|
|
If workloads emerge that need both <tt>rcu_nocbs</tt> CPUs and high
|
|
<tt>call_rcu()</tt> invocation rates, then additional forward-progress
|
|
work will be required.
|
|
|
|
<h3><a name="Composability">Composability</a></h3>
|
|
|
|
<p>
|
|
Composability has received much attention in recent years, perhaps in part
|
|
due to the collision of multicore hardware with object-oriented techniques
|
|
designed in single-threaded environments for single-threaded use.
|
|
And in theory, RCU read-side critical sections may be composed, and in
|
|
fact may be nested arbitrarily deeply.
|
|
In practice, as with all real-world implementations of composable
|
|
constructs, there are limitations.
|
|
|
|
<p>
|
|
Implementations of RCU for which <tt>rcu_read_lock()</tt>
|
|
and <tt>rcu_read_unlock()</tt> generate no code, such as
|
|
Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be
|
|
nested arbitrarily deeply.
|
|
After all, there is no overhead.
|
|
Except that if all these instances of <tt>rcu_read_lock()</tt>
|
|
and <tt>rcu_read_unlock()</tt> are visible to the compiler,
|
|
compilation will eventually fail due to exhausting memory,
|
|
mass storage, or user patience, whichever comes first.
|
|
If the nesting is not visible to the compiler, as is the case with
|
|
mutually recursive functions each in its own translation unit,
|
|
stack overflow will result.
|
|
If the nesting takes the form of loops, perhaps in the guise of tail
|
|
recursion, either the control variable
|
|
will overflow or (in the Linux kernel) you will get an RCU CPU stall warning.
|
|
Nevertheless, this class of RCU implementations is one
|
|
of the most composable constructs in existence.
|
|
|
|
<p>
|
|
RCU implementations that explicitly track nesting depth
|
|
are limited by the nesting-depth counter.
|
|
For example, the Linux kernel's preemptible RCU limits nesting to
|
|
<tt>INT_MAX</tt>.
|
|
This should suffice for almost all practical purposes.
|
|
That said, a consecutive pair of RCU read-side critical sections
|
|
between which there is an operation that waits for a grace period
|
|
cannot be enclosed in another RCU read-side critical section.
|
|
This is because it is not legal to wait for a grace period within
|
|
an RCU read-side critical section: To do so would result either
|
|
in deadlock or
|
|
in RCU implicitly splitting the enclosing RCU read-side critical
|
|
section, neither of which is conducive to a long-lived and prosperous
|
|
kernel.
|
|
|
|
<p>
|
|
It is worth noting that RCU is not alone in limiting composability.
|
|
For example, many transactional-memory implementations prohibit
|
|
composing a pair of transactions separated by an irrevocable
|
|
operation (for example, a network receive operation).
|
|
For another example, lock-based critical sections can be composed
|
|
surprisingly freely, but only if deadlock is avoided.
|
|
|
|
<p>
|
|
In short, although RCU read-side critical sections are highly composable,
|
|
care is required in some situations, just as is the case for any other
|
|
composable synchronization mechanism.
|
|
|
|
<h3><a name="Corner Cases">Corner Cases</a></h3>
|
|
|
|
<p>
|
|
A given RCU workload might have an endless and intense stream of
|
|
RCU read-side critical sections, perhaps even so intense that there
|
|
was never a point in time during which there was not at least one
|
|
RCU read-side critical section in flight.
|
|
RCU cannot allow this situation to block grace periods: As long as
|
|
all the RCU read-side critical sections are finite, grace periods
|
|
must also be finite.
|
|
|
|
<p>
|
|
That said, preemptible RCU implementations could potentially result
|
|
in RCU read-side critical sections being preempted for long durations,
|
|
which has the effect of creating a long-duration RCU read-side
|
|
critical section.
|
|
This situation can arise only in heavily loaded systems, but systems using
|
|
real-time priorities are of course more vulnerable.
|
|
Therefore, RCU priority boosting is provided to help deal with this
|
|
case.
|
|
That said, the exact requirements on RCU priority boosting will likely
|
|
evolve as more experience accumulates.
|
|
|
|
<p>
|
|
Other workloads might have very high update rates.
|
|
Although one can argue that such workloads should instead use
|
|
something other than RCU, the fact remains that RCU must
|
|
handle such workloads gracefully.
|
|
This requirement is another factor driving batching of grace periods,
|
|
but it is also the driving force behind the checks for large numbers
|
|
of queued RCU callbacks in the <tt>call_rcu()</tt> code path.
|
|
Finally, high update rates should not delay RCU read-side critical
|
|
sections, although some small read-side delays can occur when using
|
|
<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use
|
|
of <tt>smp_call_function_single()</tt>.
|
|
|
|
<p>
|
|
Although all three of these corner cases were understood in the early
|
|
1990s, a simple user-level test consisting of <tt>close(open(path))</tt>
|
|
in a tight loop
|
|
in the early 2000s suddenly provided a much deeper appreciation of the
|
|
high-update-rate corner case.
|
|
This test also motivated addition of some RCU code to react to high update
|
|
rates, for example, if a given CPU finds itself with more than 10,000
|
|
RCU callbacks queued, it will cause RCU to take evasive action by
|
|
more aggressively starting grace periods and more aggressively forcing
|
|
completion of grace-period processing.
|
|
This evasive action causes the grace period to complete more quickly,
|
|
but at the cost of restricting RCU's batching optimizations, thus
|
|
increasing the CPU overhead incurred by that grace period.
|
|
|
|
<h2><a name="Software-Engineering Requirements">
|
|
Software-Engineering Requirements</a></h2>
|
|
|
|
<p>
|
|
Between Murphy's Law and “To err is human”, it is necessary to
|
|
guard against mishaps and misuse:
|
|
|
|
<ol>
|
|
<li> It is all too easy to forget to use <tt>rcu_read_lock()</tt>
|
|
everywhere that it is needed, so kernels built with
|
|
<tt>CONFIG_PROVE_RCU=y</tt> will splat if
|
|
<tt>rcu_dereference()</tt> is used outside of an
|
|
RCU read-side critical section.
|
|
Update-side code can use <tt>rcu_dereference_protected()</tt>,
|
|
which takes a
|
|
<a href="https://lwn.net/Articles/371986/">lockdep expression</a>
|
|
to indicate what is providing the protection.
|
|
If the indicated protection is not provided, a lockdep splat
|
|
is emitted.
|
|
|
|
<p>
|
|
Code shared between readers and updaters can use
|
|
<tt>rcu_dereference_check()</tt>, which also takes a
|
|
lockdep expression, and emits a lockdep splat if neither
|
|
<tt>rcu_read_lock()</tt> nor the indicated protection
|
|
is in place.
|
|
In addition, <tt>rcu_dereference_raw()</tt> is used in those
|
|
(hopefully rare) cases where the required protection cannot
|
|
be easily described.
|
|
Finally, <tt>rcu_read_lock_held()</tt> is provided to
|
|
allow a function to verify that it has been invoked within
|
|
an RCU read-side critical section.
|
|
I was made aware of this set of requirements shortly after Thomas
|
|
Gleixner audited a number of RCU uses.
|
|
<li> A given function might wish to check for RCU-related preconditions
|
|
upon entry, before using any other RCU API.
|
|
The <tt>rcu_lockdep_assert()</tt> does this job,
|
|
asserting the expression in kernels having lockdep enabled
|
|
and doing nothing otherwise.
|
|
<li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt>
|
|
and <tt>rcu_dereference()</tt>, perhaps (incorrectly)
|
|
substituting a simple assignment.
|
|
To catch this sort of error, a given RCU-protected pointer may be
|
|
tagged with <tt>__rcu</tt>, after which sparse
|
|
will complain about simple-assignment accesses to that pointer.
|
|
Arnd Bergmann made me aware of this requirement, and also
|
|
supplied the needed
|
|
<a href="https://lwn.net/Articles/376011/">patch series</a>.
|
|
<li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt>
|
|
will splat if a data element is passed to <tt>call_rcu()</tt>
|
|
twice in a row, without a grace period in between.
|
|
(This error is similar to a double free.)
|
|
The corresponding <tt>rcu_head</tt> structures that are
|
|
dynamically allocated are automatically tracked, but
|
|
<tt>rcu_head</tt> structures allocated on the stack
|
|
must be initialized with <tt>init_rcu_head_on_stack()</tt>
|
|
and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>.
|
|
Similarly, statically allocated non-stack <tt>rcu_head</tt>
|
|
structures must be initialized with <tt>init_rcu_head()</tt>
|
|
and cleaned up with <tt>destroy_rcu_head()</tt>.
|
|
Mathieu Desnoyers made me aware of this requirement, and also
|
|
supplied the needed
|
|
<a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>.
|
|
<li> An infinite loop in an RCU read-side critical section will
|
|
eventually trigger an RCU CPU stall warning splat, with
|
|
the duration of “eventually” being controlled by the
|
|
<tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or,
|
|
alternatively, by the
|
|
<tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs
|
|
parameter.
|
|
However, RCU is not obligated to produce this splat
|
|
unless there is a grace period waiting on that particular
|
|
RCU read-side critical section.
|
|
<p>
|
|
Some extreme workloads might intentionally delay
|
|
RCU grace periods, and systems running those workloads can
|
|
be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt>
|
|
to suppress the splats.
|
|
This kernel parameter may also be set via <tt>sysfs</tt>.
|
|
Furthermore, RCU CPU stall warnings are counter-productive
|
|
during sysrq dumps and during panics.
|
|
RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and
|
|
<tt>rcu_sysrq_end()</tt> API members to be called before
|
|
and after long sysrq dumps.
|
|
RCU also supplies the <tt>rcu_panic()</tt> notifier that is
|
|
automatically invoked at the beginning of a panic to suppress
|
|
further RCU CPU stall warnings.
|
|
|
|
<p>
|
|
This requirement made itself known in the early 1990s, pretty
|
|
much the first time that it was necessary to debug a CPU stall.
|
|
That said, the initial implementation in DYNIX/ptx was quite
|
|
generic in comparison with that of Linux.
|
|
<li> Although it would be very good to detect pointers leaking out
|
|
of RCU read-side critical sections, there is currently no
|
|
good way of doing this.
|
|
One complication is the need to distinguish between pointers
|
|
leaking and pointers that have been handed off from RCU to
|
|
some other synchronization mechanism, for example, reference
|
|
counting.
|
|
<li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related
|
|
information is provided via event tracing.
|
|
<li> Open-coded use of <tt>rcu_assign_pointer()</tt> and
|
|
<tt>rcu_dereference()</tt> to create typical linked
|
|
data structures can be surprisingly error-prone.
|
|
Therefore, RCU-protected
|
|
<a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a>
|
|
and, more recently, RCU-protected
|
|
<a href="https://lwn.net/Articles/612100/">hash tables</a>
|
|
are available.
|
|
Many other special-purpose RCU-protected data structures are
|
|
available in the Linux kernel and the userspace RCU library.
|
|
<li> Some linked structures are created at compile time, but still
|
|
require <tt>__rcu</tt> checking.
|
|
The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this
|
|
purpose.
|
|
<li> It is not necessary to use <tt>rcu_assign_pointer()</tt>
|
|
when creating linked structures that are to be published via
|
|
a single external pointer.
|
|
The <tt>RCU_INIT_POINTER()</tt> macro is provided for
|
|
this task and also for assigning <tt>NULL</tt> pointers
|
|
at runtime.
|
|
</ol>
|
|
|
|
<p>
|
|
This not a hard-and-fast list: RCU's diagnostic capabilities will
|
|
continue to be guided by the number and type of usage bugs found
|
|
in real-world RCU usage.
|
|
|
|
<h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2>
|
|
|
|
<p>
|
|
The Linux kernel provides an interesting environment for all kinds of
|
|
software, including RCU.
|
|
Some of the relevant points of interest are as follows:
|
|
|
|
<ol>
|
|
<li> <a href="#Configuration">Configuration</a>.
|
|
<li> <a href="#Firmware Interface">Firmware Interface</a>.
|
|
<li> <a href="#Early Boot">Early Boot</a>.
|
|
<li> <a href="#Interrupts and NMIs">
|
|
Interrupts and non-maskable interrupts (NMIs)</a>.
|
|
<li> <a href="#Loadable Modules">Loadable Modules</a>.
|
|
<li> <a href="#Hotplug CPU">Hotplug CPU</a>.
|
|
<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
|
|
<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
|
|
<li> <a href="#Accesses to User Memory and RCU">
|
|
Accesses to User Memory and RCU</a>.
|
|
<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
|
|
<li> <a href="#Scheduling-Clock Interrupts and RCU">
|
|
Scheduling-Clock Interrupts and RCU</a>.
|
|
<li> <a href="#Memory Efficiency">Memory Efficiency</a>.
|
|
<li> <a href="#Performance, Scalability, Response Time, and Reliability">
|
|
Performance, Scalability, Response Time, and Reliability</a>.
|
|
</ol>
|
|
|
|
<p>
|
|
This list is probably incomplete, but it does give a feel for the
|
|
most notable Linux-kernel complications.
|
|
Each of the following sections covers one of the above topics.
|
|
|
|
<h3><a name="Configuration">Configuration</a></h3>
|
|
|
|
<p>
|
|
RCU's goal is automatic configuration, so that almost nobody
|
|
needs to worry about RCU's <tt>Kconfig</tt> options.
|
|
And for almost all users, RCU does in fact work well
|
|
“out of the box.”
|
|
|
|
<p>
|
|
However, there are specialized use cases that are handled by
|
|
kernel boot parameters and <tt>Kconfig</tt> options.
|
|
Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users
|
|
about new <tt>Kconfig</tt> options, which requires almost all of them
|
|
be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option.
|
|
|
|
<p>
|
|
This all should be quite obvious, but the fact remains that
|
|
Linus Torvalds recently had to
|
|
<a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a>
|
|
me of this requirement.
|
|
|
|
<h3><a name="Firmware Interface">Firmware Interface</a></h3>
|
|
|
|
<p>
|
|
In many cases, kernel obtains information about the system from the
|
|
firmware, and sometimes things are lost in translation.
|
|
Or the translation is accurate, but the original message is bogus.
|
|
|
|
<p>
|
|
For example, some systems' firmware overreports the number of CPUs,
|
|
sometimes by a large factor.
|
|
If RCU naively believed the firmware, as it used to do,
|
|
it would create too many per-CPU kthreads.
|
|
Although the resulting system will still run correctly, the extra
|
|
kthreads needlessly consume memory and can cause confusion
|
|
when they show up in <tt>ps</tt> listings.
|
|
|
|
<p>
|
|
RCU must therefore wait for a given CPU to actually come online before
|
|
it can allow itself to believe that the CPU actually exists.
|
|
The resulting “ghost CPUs” (which are never going to
|
|
come online) cause a number of
|
|
<a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>.
|
|
|
|
<h3><a name="Early Boot">Early Boot</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel's boot sequence is an interesting process,
|
|
and RCU is used early, even before <tt>rcu_init()</tt>
|
|
is invoked.
|
|
In fact, a number of RCU's primitives can be used as soon as the
|
|
initial task's <tt>task_struct</tt> is available and the
|
|
boot CPU's per-CPU variables are set up.
|
|
The read-side primitives (<tt>rcu_read_lock()</tt>,
|
|
<tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>,
|
|
and <tt>rcu_access_pointer()</tt>) will operate normally very early on,
|
|
as will <tt>rcu_assign_pointer()</tt>.
|
|
|
|
<p>
|
|
Although <tt>call_rcu()</tt> may be invoked at any
|
|
time during boot, callbacks are not guaranteed to be invoked until after
|
|
all of RCU's kthreads have been spawned, which occurs at
|
|
<tt>early_initcall()</tt> time.
|
|
This delay in callback invocation is due to the fact that RCU does not
|
|
invoke callbacks until it is fully initialized, and this full initialization
|
|
cannot occur until after the scheduler has initialized itself to the
|
|
point where RCU can spawn and run its kthreads.
|
|
In theory, it would be possible to invoke callbacks earlier,
|
|
however, this is not a panacea because there would be severe restrictions
|
|
on what operations those callbacks could invoke.
|
|
|
|
<p>
|
|
Perhaps surprisingly, <tt>synchronize_rcu()</tt> and
|
|
<tt>synchronize_rcu_expedited()</tt>,
|
|
will operate normally
|
|
during very early boot, the reason being that there is only one CPU
|
|
and preemption is disabled.
|
|
This means that the call <tt>synchronize_rcu()</tt> (or friends)
|
|
itself is a quiescent
|
|
state and thus a grace period, so the early-boot implementation can
|
|
be a no-op.
|
|
|
|
<p>
|
|
However, once the scheduler has spawned its first kthread, this early
|
|
boot trick fails for <tt>synchronize_rcu()</tt> (as well as for
|
|
<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt>
|
|
kernels.
|
|
The reason is that an RCU read-side critical section might be preempted,
|
|
which means that a subsequent <tt>synchronize_rcu()</tt> really does have
|
|
to wait for something, as opposed to simply returning immediately.
|
|
Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of
|
|
its kthreads are spawned, which doesn't happen until some time during
|
|
<tt>early_initcalls()</tt> time.
|
|
But this is no excuse: RCU is nevertheless required to correctly handle
|
|
synchronous grace periods during this time period.
|
|
Once all of its kthreads are up and running, RCU starts running
|
|
normally.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
How can RCU possibly handle grace periods before all of its
|
|
kthreads have been spawned???
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Very carefully!
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
During the “dead zone” between the time that the
|
|
scheduler spawns the first task and the time that all of RCU's
|
|
kthreads have been spawned, all synchronous grace periods are
|
|
handled by the expedited grace-period mechanism.
|
|
At runtime, this expedited mechanism relies on workqueues, but
|
|
during the dead zone the requesting task itself drives the
|
|
desired expedited grace period.
|
|
Because dead-zone execution takes place within task context,
|
|
everything works.
|
|
Once the dead zone ends, expedited grace periods go back to
|
|
using workqueues, as is required to avoid problems that would
|
|
otherwise occur when a user task received a POSIX signal while
|
|
driving an expedited grace period.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
And yes, this does mean that it is unhelpful to send POSIX
|
|
signals to random tasks between the time that the scheduler
|
|
spawns its first kthread and the time that RCU's kthreads
|
|
have all been spawned.
|
|
If there ever turns out to be a good reason for sending POSIX
|
|
signals during that time, appropriate adjustments will be made.
|
|
(If it turns out that POSIX signals are sent during this time for
|
|
no good reason, other adjustments will be made, appropriate
|
|
or otherwise.)
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
I learned of these boot-time requirements as a result of a series of
|
|
system hangs.
|
|
|
|
<h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel has interrupts, and RCU read-side critical sections are
|
|
legal within interrupt handlers and within interrupt-disabled regions
|
|
of code, as are invocations of <tt>call_rcu()</tt>.
|
|
|
|
<p>
|
|
Some Linux-kernel architectures can enter an interrupt handler from
|
|
non-idle process context, and then just never leave it, instead stealthily
|
|
transitioning back to process context.
|
|
This trick is sometimes used to invoke system calls from inside the kernel.
|
|
These “half-interrupts” mean that RCU has to be very careful
|
|
about how it counts interrupt nesting levels.
|
|
I learned of this requirement the hard way during a rewrite
|
|
of RCU's dyntick-idle code.
|
|
|
|
<p>
|
|
The Linux kernel has non-maskable interrupts (NMIs), and
|
|
RCU read-side critical sections are legal within NMI handlers.
|
|
Thankfully, RCU update-side primitives, including
|
|
<tt>call_rcu()</tt>, are prohibited within NMI handlers.
|
|
|
|
<p>
|
|
The name notwithstanding, some Linux-kernel architectures
|
|
can have nested NMIs, which RCU must handle correctly.
|
|
Andy Lutomirski
|
|
<a href="https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a>
|
|
with this requirement;
|
|
he also kindly surprised me with
|
|
<a href="https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a>
|
|
that meets this requirement.
|
|
|
|
<p>
|
|
Furthermore, NMI handlers can be interrupted by what appear to RCU
|
|
to be normal interrupts.
|
|
One way that this can happen is for code that directly invokes
|
|
<tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> to be called
|
|
from an NMI handler.
|
|
This astonishing fact of life prompted the current code structure,
|
|
which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt>
|
|
and <tt>rcu_irq_exit()</tt> invoking <tt>rcu_nmi_exit()</tt>.
|
|
And yes, I also learned of this requirement the hard way.
|
|
|
|
<h3><a name="Loadable Modules">Loadable Modules</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel has loadable modules, and these modules can
|
|
also be unloaded.
|
|
After a given module has been unloaded, any attempt to call
|
|
one of its functions results in a segmentation fault.
|
|
The module-unload functions must therefore cancel any
|
|
delayed calls to loadable-module functions, for example,
|
|
any outstanding <tt>mod_timer()</tt> must be dealt with
|
|
via <tt>del_timer_sync()</tt> or similar.
|
|
|
|
<p>
|
|
Unfortunately, there is no way to cancel an RCU callback;
|
|
once you invoke <tt>call_rcu()</tt>, the callback function is
|
|
eventually going to be invoked, unless the system goes down first.
|
|
Because it is normally considered socially irresponsible to crash the system
|
|
in response to a module unload request, we need some other way
|
|
to deal with in-flight RCU callbacks.
|
|
|
|
<p>
|
|
RCU therefore provides
|
|
<tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>,
|
|
which waits until all in-flight RCU callbacks have been invoked.
|
|
If a module uses <tt>call_rcu()</tt>, its exit function should therefore
|
|
prevent any future invocation of <tt>call_rcu()</tt>, then invoke
|
|
<tt>rcu_barrier()</tt>.
|
|
In theory, the underlying module-unload code could invoke
|
|
<tt>rcu_barrier()</tt> unconditionally, but in practice this would
|
|
incur unacceptable latencies.
|
|
|
|
<p>
|
|
Nikita Danilov noted this requirement for an analogous filesystem-unmount
|
|
situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU.
|
|
The need for <tt>rcu_barrier()</tt> for module unloading became
|
|
apparent later.
|
|
|
|
<p>
|
|
<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not,
|
|
repeat, <i>not</i>, obligated to wait for a grace period.
|
|
It is instead only required to wait for RCU callbacks that have
|
|
already been posted.
|
|
Therefore, if there are no RCU callbacks posted anywhere in the system,
|
|
<tt>rcu_barrier()</tt> is within its rights to return immediately.
|
|
Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not
|
|
necessarily need to wait for a grace period.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Wait a minute!
|
|
Each RCU callbacks must wait for a grace period to complete,
|
|
and <tt>rcu_barrier()</tt> must wait for each pre-existing
|
|
callback to be invoked.
|
|
Doesn't <tt>rcu_barrier()</tt> therefore need to wait for
|
|
a full grace period if there is even one callback posted anywhere
|
|
in the system?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Absolutely not!!!
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
Yes, each RCU callbacks must wait for a grace period to complete,
|
|
but it might well be partly (or even completely) finished waiting
|
|
by the time <tt>rcu_barrier()</tt> is invoked.
|
|
In that case, <tt>rcu_barrier()</tt> need only wait for the
|
|
remaining portion of the grace period to elapse.
|
|
So even if there are quite a few callbacks posted,
|
|
<tt>rcu_barrier()</tt> might well return quite quickly.
|
|
</font>
|
|
|
|
<p><font color="ffffff">
|
|
So if you need to wait for a grace period as well as for all
|
|
pre-existing callbacks, you will need to invoke both
|
|
<tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>.
|
|
If latency is a concern, you can always use workqueues
|
|
to invoke them concurrently.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h3><a name="Hotplug CPU">Hotplug CPU</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel supports CPU hotplug, which means that CPUs
|
|
can come and go.
|
|
It is of course illegal to use any RCU API member from an offline CPU,
|
|
with the exception of <a href="#Sleepable RCU">SRCU</a> read-side
|
|
critical sections.
|
|
This requirement was present from day one in DYNIX/ptx, but
|
|
on the other hand, the Linux kernel's CPU-hotplug implementation
|
|
is “interesting.”
|
|
|
|
<p>
|
|
The Linux-kernel CPU-hotplug implementation has notifiers that
|
|
are used to allow the various kernel subsystems (including RCU)
|
|
to respond appropriately to a given CPU-hotplug operation.
|
|
Most RCU operations may be invoked from CPU-hotplug notifiers,
|
|
including even synchronous grace-period operations such as
|
|
<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>.
|
|
|
|
<p>
|
|
However, all-callback-wait operations such as
|
|
<tt>rcu_barrier()</tt> are also not supported, due to the
|
|
fact that there are phases of CPU-hotplug operations where
|
|
the outgoing CPU's callbacks will not be invoked until after
|
|
the CPU-hotplug operation ends, which could also result in deadlock.
|
|
Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations
|
|
during its execution, which results in another type of deadlock
|
|
when invoked from a CPU-hotplug notifier.
|
|
|
|
<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3>
|
|
|
|
<p>
|
|
RCU depends on the scheduler, and the scheduler uses RCU to
|
|
protect some of its data structures.
|
|
The preemptible-RCU <tt>rcu_read_unlock()</tt>
|
|
implementation must therefore be written carefully to avoid deadlocks
|
|
involving the scheduler's runqueue and priority-inheritance locks.
|
|
In particular, <tt>rcu_read_unlock()</tt> must tolerate an
|
|
interrupt where the interrupt handler invokes both
|
|
<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
|
|
This possibility requires <tt>rcu_read_unlock()</tt> to use
|
|
negative nesting levels to avoid destructive recursion via
|
|
interrupt handler's use of RCU.
|
|
|
|
<p>
|
|
This scheduler-RCU requirement came as a
|
|
<a href="https://lwn.net/Articles/453002/">complete surprise</a>.
|
|
|
|
<p>
|
|
As noted above, RCU makes use of kthreads, and it is necessary to
|
|
avoid excessive CPU-time accumulation by these kthreads.
|
|
This requirement was no surprise, but RCU's violation of it
|
|
when running context-switch-heavy workloads when built with
|
|
<tt>CONFIG_NO_HZ_FULL=y</tt>
|
|
<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>.
|
|
RCU has made good progress towards meeting this requirement, even
|
|
for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
|
|
but there is room for further improvement.
|
|
|
|
<p>
|
|
It is forbidden to hold any of scheduler's runqueue or priority-inheritance
|
|
spinlocks across an <tt>rcu_read_unlock()</tt> unless interrupts have been
|
|
disabled across the entire RCU read-side critical section, that is,
|
|
up to and including the matching <tt>rcu_read_lock()</tt>.
|
|
Violating this restriction can result in deadlocks involving these
|
|
scheduler spinlocks.
|
|
There was hope that this restriction might be lifted when interrupt-disabled
|
|
calls to <tt>rcu_read_unlock()</tt> started deferring the reporting of
|
|
the resulting RCU-preempt quiescent state until the end of the corresponding
|
|
interrupts-disabled region.
|
|
Unfortunately, timely reporting of the corresponding quiescent state
|
|
to expedited grace periods requires a call to <tt>raise_softirq()</tt>,
|
|
which can acquire these scheduler spinlocks.
|
|
In addition, real-time systems using RCU priority boosting
|
|
need this restriction to remain in effect because deferred
|
|
quiescent-state reporting would also defer deboosting, which in turn
|
|
would degrade real-time latencies.
|
|
|
|
<p>
|
|
In theory, if a given RCU read-side critical section could be
|
|
guaranteed to be less than one second in duration, holding a scheduler
|
|
spinlock across that critical section's <tt>rcu_read_unlock()</tt>
|
|
would require only that preemption be disabled across the entire
|
|
RCU read-side critical section, not interrupts.
|
|
Unfortunately, given the possibility of vCPU preemption, long-running
|
|
interrupts, and so on, it is not possible in practice to guarantee
|
|
that a given RCU read-side critical section will complete in less than
|
|
one second.
|
|
Therefore, as noted above, if scheduler spinlocks are held across
|
|
a given call to <tt>rcu_read_unlock()</tt>, interrupts must be
|
|
disabled across the entire RCU read-side critical section.
|
|
|
|
<h3><a name="Tracing and RCU">Tracing and RCU</a></h3>
|
|
|
|
<p>
|
|
It is possible to use tracing on RCU code, but tracing itself
|
|
uses RCU.
|
|
For this reason, <tt>rcu_dereference_raw_check()</tt>
|
|
is provided for use by tracing, which avoids the destructive
|
|
recursion that could otherwise ensue.
|
|
This API is also used by virtualization in some architectures,
|
|
where RCU readers execute in environments in which tracing
|
|
cannot be used.
|
|
The tracing folks both located the requirement and provided the
|
|
needed fix, so this surprise requirement was relatively painless.
|
|
|
|
<h3><a name="Accesses to User Memory and RCU">
|
|
Accesses to User Memory and RCU</a></h3>
|
|
|
|
<p>
|
|
The kernel needs to access user-space memory, for example, to access
|
|
data referenced by system-call parameters.
|
|
The <tt>get_user()</tt> macro does this job.
|
|
|
|
<p>
|
|
However, user-space memory might well be paged out, which means
|
|
that <tt>get_user()</tt> might well page-fault and thus block while
|
|
waiting for the resulting I/O to complete.
|
|
It would be a very bad thing for the compiler to reorder
|
|
a <tt>get_user()</tt> invocation into an RCU read-side critical
|
|
section.
|
|
For example, suppose that the source code looked like this:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 rcu_read_lock();
|
|
2 p = rcu_dereference(gp);
|
|
3 v = p->value;
|
|
4 rcu_read_unlock();
|
|
5 get_user(user_v, user_p);
|
|
6 do_something_with(v, user_v);
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The compiler must not be permitted to transform this source code into
|
|
the following:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 rcu_read_lock();
|
|
2 p = rcu_dereference(gp);
|
|
3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
|
|
4 v = p->value;
|
|
5 rcu_read_unlock();
|
|
6 do_something_with(v, user_v);
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
If the compiler did make this transformation in a
|
|
<tt>CONFIG_PREEMPT=n</tt> kernel build, and if <tt>get_user()</tt> did
|
|
page fault, the result would be a quiescent state in the middle
|
|
of an RCU read-side critical section.
|
|
This misplaced quiescent state could result in line 4 being
|
|
a use-after-free access, which could be bad for your kernel's
|
|
actuarial statistics.
|
|
Similar examples can be constructed with the call to <tt>get_user()</tt>
|
|
preceding the <tt>rcu_read_lock()</tt>.
|
|
|
|
<p>
|
|
Unfortunately, <tt>get_user()</tt> doesn't have any particular
|
|
ordering properties, and in some architectures the underlying <tt>asm</tt>
|
|
isn't even marked <tt>volatile</tt>.
|
|
And even if it was marked <tt>volatile</tt>, the above access to
|
|
<tt>p->value</tt> is not volatile, so the compiler would not have any
|
|
reason to keep those two accesses in order.
|
|
|
|
<p>
|
|
Therefore, the Linux-kernel definitions of <tt>rcu_read_lock()</tt>
|
|
and <tt>rcu_read_unlock()</tt> must act as compiler barriers,
|
|
at least for outermost instances of <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> within a nested set of RCU read-side critical
|
|
sections.
|
|
|
|
<h3><a name="Energy Efficiency">Energy Efficiency</a></h3>
|
|
|
|
<p>
|
|
Interrupting idle CPUs is considered socially unacceptable,
|
|
especially by people with battery-powered embedded systems.
|
|
RCU therefore conserves energy by detecting which CPUs are
|
|
idle, including tracking CPUs that have been interrupted from idle.
|
|
This is a large part of the energy-efficiency requirement,
|
|
so I learned of this via an irate phone call.
|
|
|
|
<p>
|
|
Because RCU avoids interrupting idle CPUs, it is illegal to
|
|
execute an RCU read-side critical section on an idle CPU.
|
|
(Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat
|
|
if you try it.)
|
|
The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt>
|
|
event tracing is provided to work around this restriction.
|
|
In addition, <tt>rcu_is_watching()</tt> may be used to
|
|
test whether or not it is currently legal to run RCU read-side
|
|
critical sections on this CPU.
|
|
I learned of the need for diagnostics on the one hand
|
|
and <tt>RCU_NONIDLE()</tt> on the other while inspecting
|
|
idle-loop code.
|
|
Steven Rostedt supplied <tt>_rcuidle</tt> event tracing,
|
|
which is used quite heavily in the idle loop.
|
|
However, there are some restrictions on the code placed within
|
|
<tt>RCU_NONIDLE()</tt>:
|
|
|
|
<ol>
|
|
<li> Blocking is prohibited.
|
|
In practice, this is not a serious restriction given that idle
|
|
tasks are prohibited from blocking to begin with.
|
|
<li> Although nesting <tt>RCU_NONIDLE()</tt> is permitted, they cannot
|
|
nest indefinitely deeply.
|
|
However, given that they can be nested on the order of a million
|
|
deep, even on 32-bit systems, this should not be a serious
|
|
restriction.
|
|
This nesting limit would probably be reached long after the
|
|
compiler OOMed or the stack overflowed.
|
|
<li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence
|
|
out of that same <tt>RCU_NONIDLE()</tt>.
|
|
For example, the following is grossly illegal:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 RCU_NONIDLE({
|
|
2 do_something();
|
|
3 goto bad_idea; /* BUG!!! */
|
|
4 do_something_else();});
|
|
5 bad_idea:
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
It is just as illegal to transfer control into the middle of
|
|
<tt>RCU_NONIDLE()</tt>'s argument.
|
|
Yes, in theory, you could transfer in as long as you also
|
|
transferred out, but in practice you could also expect to get sharply
|
|
worded review comments.
|
|
</ol>
|
|
|
|
<p>
|
|
It is similarly socially unacceptable to interrupt an
|
|
<tt>nohz_full</tt> CPU running in userspace.
|
|
RCU must therefore track <tt>nohz_full</tt> userspace
|
|
execution.
|
|
RCU must therefore be able to sample state at two points in
|
|
time, and be able to determine whether or not some other CPU spent
|
|
any time idle and/or executing in userspace.
|
|
|
|
<p>
|
|
These energy-efficiency requirements have proven quite difficult to
|
|
understand and to meet, for example, there have been more than five
|
|
clean-sheet rewrites of RCU's energy-efficiency code, the last of
|
|
which was finally able to demonstrate
|
|
<a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>.
|
|
As noted earlier,
|
|
I learned of many of these requirements via angry phone calls:
|
|
Flaming me on the Linux-kernel mailing list was apparently not
|
|
sufficient to fully vent their ire at RCU's energy-efficiency bugs!
|
|
|
|
<h3><a name="Scheduling-Clock Interrupts and RCU">
|
|
Scheduling-Clock Interrupts and RCU</a></h3>
|
|
|
|
<p>
|
|
The kernel transitions between in-kernel non-idle execution, userspace
|
|
execution, and the idle loop.
|
|
Depending on kernel configuration, RCU handles these states differently:
|
|
|
|
<table border=3>
|
|
<tr><th><tt>HZ</tt> Kconfig</th>
|
|
<th>In-Kernel</th>
|
|
<th>Usermode</th>
|
|
<th>Idle</th></tr>
|
|
<tr><th align="left"><tt>HZ_PERIODIC</tt></th>
|
|
<td>Can rely on scheduling-clock interrupt.</td>
|
|
<td>Can rely on scheduling-clock interrupt and its
|
|
detection of interrupt from usermode.</td>
|
|
<td>Can rely on RCU's dyntick-idle detection.</td></tr>
|
|
<tr><th align="left"><tt>NO_HZ_IDLE</tt></th>
|
|
<td>Can rely on scheduling-clock interrupt.</td>
|
|
<td>Can rely on scheduling-clock interrupt and its
|
|
detection of interrupt from usermode.</td>
|
|
<td>Can rely on RCU's dyntick-idle detection.</td></tr>
|
|
<tr><th align="left"><tt>NO_HZ_FULL</tt></th>
|
|
<td>Can only sometimes rely on scheduling-clock interrupt.
|
|
In other cases, it is necessary to bound kernel execution
|
|
times and/or use IPIs.</td>
|
|
<td>Can rely on RCU's dyntick-idle detection.</td>
|
|
<td>Can rely on RCU's dyntick-idle detection.</td></tr>
|
|
</table>
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the
|
|
scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt>
|
|
and <tt>NO_HZ_IDLE</tt> do?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Because, as a performance optimization, <tt>NO_HZ_FULL</tt>
|
|
does not necessarily re-enable the scheduling-clock interrupt
|
|
on entry to each and every system call.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
However, RCU must be reliably informed as to whether any given
|
|
CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>,
|
|
also whether that CPU is executing in usermode, as discussed
|
|
<a href="#Energy Efficiency">earlier</a>.
|
|
It also requires that the scheduling-clock interrupt be enabled when
|
|
RCU needs it to be:
|
|
|
|
<ol>
|
|
<li> If a CPU is either idle or executing in usermode, and RCU believes
|
|
it is non-idle, the scheduling-clock tick had better be running.
|
|
Otherwise, you will get RCU CPU stall warnings. Or at best,
|
|
very long (11-second) grace periods, with a pointless IPI waking
|
|
the CPU from time to time.
|
|
<li> If a CPU is in a portion of the kernel that executes RCU read-side
|
|
critical sections, and RCU believes this CPU to be idle, you will get
|
|
random memory corruption. <b>DON'T DO THIS!!!</b>
|
|
|
|
<br>This is one reason to test with lockdep, which will complain
|
|
about this sort of thing.
|
|
<li> If a CPU is in a portion of the kernel that is absolutely
|
|
positively no-joking guaranteed to never execute any RCU read-side
|
|
critical sections, and RCU believes this CPU to to be idle,
|
|
no problem. This sort of thing is used by some architectures
|
|
for light-weight exception handlers, which can then avoid the
|
|
overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt>
|
|
at exception entry and exit, respectively.
|
|
Some go further and avoid the entireties of <tt>irq_enter()</tt>
|
|
and <tt>irq_exit()</tt>.
|
|
|
|
<br>Just make very sure you are running some of your tests with
|
|
<tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths
|
|
was in fact joking about not doing RCU read-side critical sections.
|
|
<li> If a CPU is executing in the kernel with the scheduling-clock
|
|
interrupt disabled and RCU believes this CPU to be non-idle,
|
|
and if the CPU goes idle (from an RCU perspective) every few
|
|
jiffies, no problem. It is usually OK for there to be the
|
|
occasional gap between idle periods of up to a second or so.
|
|
|
|
<br>If the gap grows too long, you get RCU CPU stall warnings.
|
|
<li> If a CPU is either idle or executing in usermode, and RCU believes
|
|
it to be idle, of course no problem.
|
|
<li> If a CPU is executing in the kernel, the kernel code
|
|
path is passing through quiescent states at a reasonable
|
|
frequency (preferably about once per few jiffies, but the
|
|
occasional excursion to a second or so is usually OK) and the
|
|
scheduling-clock interrupt is enabled, of course no problem.
|
|
|
|
<br>If the gap between a successive pair of quiescent states grows
|
|
too long, you get RCU CPU stall warnings.
|
|
</ol>
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
But what if my driver has a hardware interrupt handler
|
|
that can run for many seconds?
|
|
I cannot invoke <tt>schedule()</tt> from an hardware
|
|
interrupt handler, after all!
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt>
|
|
every so often.
|
|
But given that long-running interrupt handlers can cause
|
|
other problems, not least for response time, shouldn't you
|
|
work to keep your interrupt handler's runtime within reasonable
|
|
bounds?
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
But as long as RCU is properly informed of kernel state transitions between
|
|
in-kernel execution, usermode execution, and idle, and as long as the
|
|
scheduling-clock interrupt is enabled when RCU needs it to be, you
|
|
can rest assured that the bugs you encounter will be in some other
|
|
part of RCU or some other part of the kernel!
|
|
|
|
<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
|
|
|
|
<p>
|
|
Although small-memory non-realtime systems can simply use Tiny RCU,
|
|
code size is only one aspect of memory efficiency.
|
|
Another aspect is the size of the <tt>rcu_head</tt> structure
|
|
used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>.
|
|
Although this structure contains nothing more than a pair of pointers,
|
|
it does appear in many RCU-protected data structures, including
|
|
some that are size critical.
|
|
The <tt>page</tt> structure is a case in point, as evidenced by
|
|
the many occurrences of the <tt>union</tt> keyword within that structure.
|
|
|
|
<p>
|
|
This need for memory efficiency is one reason that RCU uses hand-crafted
|
|
singly linked lists to track the <tt>rcu_head</tt> structures that
|
|
are waiting for a grace period to elapse.
|
|
It is also the reason why <tt>rcu_head</tt> structures do not contain
|
|
debug information, such as fields tracking the file and line of the
|
|
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them.
|
|
Although this information might appear in debug-only kernel builds at some
|
|
point, in the meantime, the <tt>->func</tt> field will often provide
|
|
the needed debug information.
|
|
|
|
<p>
|
|
However, in some cases, the need for memory efficiency leads to even
|
|
more extreme measures.
|
|
Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field
|
|
shares storage with a great many other structures that are used at
|
|
various points in the corresponding page's lifetime.
|
|
In order to correctly resolve certain
|
|
<a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>,
|
|
the Linux kernel's memory-management subsystem needs a particular bit
|
|
to remain zero during all phases of grace-period processing,
|
|
and that bit happens to map to the bottom bit of the
|
|
<tt>rcu_head</tt> structure's <tt>->next</tt> field.
|
|
RCU makes this guarantee as long as <tt>call_rcu()</tt>
|
|
is used to post the callback, as opposed to <tt>kfree_rcu()</tt>
|
|
or some future “lazy”
|
|
variant of <tt>call_rcu()</tt> that might one day be created for
|
|
energy-efficiency purposes.
|
|
|
|
<p>
|
|
That said, there are limits.
|
|
RCU requires that the <tt>rcu_head</tt> structure be aligned to a
|
|
two-byte boundary, and passing a misaligned <tt>rcu_head</tt>
|
|
structure to one of the <tt>call_rcu()</tt> family of functions
|
|
will result in a splat.
|
|
It is therefore necessary to exercise caution when packing
|
|
structures containing fields of type <tt>rcu_head</tt>.
|
|
Why not a four-byte or even eight-byte alignment requirement?
|
|
Because the m68k architecture provides only two-byte alignment,
|
|
and thus acts as alignment's least common denominator.
|
|
|
|
<p>
|
|
The reason for reserving the bottom bit of pointers to
|
|
<tt>rcu_head</tt> structures is to leave the door open to
|
|
“lazy” callbacks whose invocations can safely be deferred.
|
|
Deferring invocation could potentially have energy-efficiency
|
|
benefits, but only if the rate of non-lazy callbacks decreases
|
|
significantly for some important workload.
|
|
In the meantime, reserving the bottom bit keeps this option open
|
|
in case it one day becomes useful.
|
|
|
|
<h3><a name="Performance, Scalability, Response Time, and Reliability">
|
|
Performance, Scalability, Response Time, and Reliability</a></h3>
|
|
|
|
<p>
|
|
Expanding on the
|
|
<a href="#Performance and Scalability">earlier discussion</a>,
|
|
RCU is used heavily by hot code paths in performance-critical
|
|
portions of the Linux kernel's networking, security, virtualization,
|
|
and scheduling code paths.
|
|
RCU must therefore use efficient implementations, especially in its
|
|
read-side primitives.
|
|
To that end, it would be good if preemptible RCU's implementation
|
|
of <tt>rcu_read_lock()</tt> could be inlined, however, doing
|
|
this requires resolving <tt>#include</tt> issues with the
|
|
<tt>task_struct</tt> structure.
|
|
|
|
<p>
|
|
The Linux kernel supports hardware configurations with up to
|
|
4096 CPUs, which means that RCU must be extremely scalable.
|
|
Algorithms that involve frequent acquisitions of global locks or
|
|
frequent atomic operations on global variables simply cannot be
|
|
tolerated within the RCU implementation.
|
|
RCU therefore makes heavy use of a combining tree based on the
|
|
<tt>rcu_node</tt> structure.
|
|
RCU is required to tolerate all CPUs continuously invoking any
|
|
combination of RCU's runtime primitives with minimal per-operation
|
|
overhead.
|
|
In fact, in many cases, increasing load must <i>decrease</i> the
|
|
per-operation overhead, witness the batching optimizations for
|
|
<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>,
|
|
<tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>.
|
|
As a general rule, RCU must cheerfully accept whatever the
|
|
rest of the Linux kernel decides to throw at it.
|
|
|
|
<p>
|
|
The Linux kernel is used for real-time workloads, especially
|
|
in conjunction with the
|
|
<a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>.
|
|
The real-time-latency response requirements are such that the
|
|
traditional approach of disabling preemption across RCU
|
|
read-side critical sections is inappropriate.
|
|
Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore
|
|
use an RCU implementation that allows RCU read-side critical
|
|
sections to be preempted.
|
|
This requirement made its presence known after users made it
|
|
clear that an earlier
|
|
<a href="https://lwn.net/Articles/107930/">real-time patch</a>
|
|
did not meet their needs, in conjunction with some
|
|
<a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a>
|
|
encountered by a very early version of the -rt patchset.
|
|
|
|
<p>
|
|
In addition, RCU must make do with a sub-100-microsecond real-time latency
|
|
budget.
|
|
In fact, on smaller systems with the -rt patchset, the Linux kernel
|
|
provides sub-20-microsecond real-time latencies for the whole kernel,
|
|
including RCU.
|
|
RCU's scalability and latency must therefore be sufficient for
|
|
these sorts of configurations.
|
|
To my surprise, the sub-100-microsecond real-time latency budget
|
|
<a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf">
|
|
applies to even the largest systems [PDF]</a>,
|
|
up to and including systems with 4096 CPUs.
|
|
This real-time requirement motivated the grace-period kthread, which
|
|
also simplified handling of a number of race conditions.
|
|
|
|
<p>
|
|
RCU must avoid degrading real-time response for CPU-bound threads, whether
|
|
executing in usermode (which is one use case for
|
|
<tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel.
|
|
That said, CPU-bound loops in the kernel must execute
|
|
<tt>cond_resched()</tt> at least once per few tens of milliseconds
|
|
in order to avoid receiving an IPI from RCU.
|
|
|
|
<p>
|
|
Finally, RCU's status as a synchronization primitive means that
|
|
any RCU failure can result in arbitrary memory corruption that can be
|
|
extremely difficult to debug.
|
|
This means that RCU must be extremely reliable, which in
|
|
practice also means that RCU must have an aggressive stress-test
|
|
suite.
|
|
This stress-test suite is called <tt>rcutorture</tt>.
|
|
|
|
<p>
|
|
Although the need for <tt>rcutorture</tt> was no surprise,
|
|
the current immense popularity of the Linux kernel is posing
|
|
interesting—and perhaps unprecedented—validation
|
|
challenges.
|
|
To see this, keep in mind that there are well over one billion
|
|
instances of the Linux kernel running today, given Android
|
|
smartphones, Linux-powered televisions, and servers.
|
|
This number can be expected to increase sharply with the advent of
|
|
the celebrated Internet of Things.
|
|
|
|
<p>
|
|
Suppose that RCU contains a race condition that manifests on average
|
|
once per million years of runtime.
|
|
This bug will be occurring about three times per <i>day</i> across
|
|
the installed base.
|
|
RCU could simply hide behind hardware error rates, given that no one
|
|
should really expect their smartphone to last for a million years.
|
|
However, anyone taking too much comfort from this thought should
|
|
consider the fact that in most jurisdictions, a successful multi-year
|
|
test of a given mechanism, which might include a Linux kernel,
|
|
suffices for a number of types of safety-critical certifications.
|
|
In fact, rumor has it that the Linux kernel is already being used
|
|
in production for safety-critical applications.
|
|
I don't know about you, but I would feel quite bad if a bug in RCU
|
|
killed someone.
|
|
Which might explain my recent focus on validation and verification.
|
|
|
|
<h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2>
|
|
|
|
<p>
|
|
One of the more surprising things about RCU is that there are now
|
|
no fewer than five <i>flavors</i>, or API families.
|
|
In addition, the primary flavor that has been the sole focus up to
|
|
this point has two different implementations, non-preemptible and
|
|
preemptible.
|
|
The other four flavors are listed below, with requirements for each
|
|
described in a separate section.
|
|
|
|
<ol>
|
|
<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a>
|
|
<li> <a href="#Sched Flavor">Sched Flavor (Historical)</a>
|
|
<li> <a href="#Sleepable RCU">Sleepable RCU</a>
|
|
<li> <a href="#Tasks RCU">Tasks RCU</a>
|
|
</ol>
|
|
|
|
<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a></h3>
|
|
|
|
<p>
|
|
The RCU-bh flavor of RCU has since been expressed in terms of
|
|
the other RCU flavors as part of a consolidation of the three
|
|
flavors into a single flavor.
|
|
The read-side API remains, and continues to disable softirq and to
|
|
be accounted for by lockdep.
|
|
Much of the material in this section is therefore strictly historical
|
|
in nature.
|
|
|
|
<p>
|
|
The softirq-disable (AKA “bottom-half”,
|
|
hence the “_bh” abbreviations)
|
|
flavor of RCU, or <i>RCU-bh</i>, was developed by
|
|
Dipankar Sarma to provide a flavor of RCU that could withstand the
|
|
network-based denial-of-service attacks researched by Robert
|
|
Olsson.
|
|
These attacks placed so much networking load on the system
|
|
that some of the CPUs never exited softirq execution,
|
|
which in turn prevented those CPUs from ever executing a context switch,
|
|
which, in the RCU implementation of that time, prevented grace periods
|
|
from ever ending.
|
|
The result was an out-of-memory condition and a system hang.
|
|
|
|
<p>
|
|
The solution was the creation of RCU-bh, which does
|
|
<tt>local_bh_disable()</tt>
|
|
across its read-side critical sections, and which uses the transition
|
|
from one type of softirq processing to another as a quiescent state
|
|
in addition to context switch, idle, user mode, and offline.
|
|
This means that RCU-bh grace periods can complete even when some of
|
|
the CPUs execute in softirq indefinitely, thus allowing algorithms
|
|
based on RCU-bh to withstand network-based denial-of-service attacks.
|
|
|
|
<p>
|
|
Because
|
|
<tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt>
|
|
disable and re-enable softirq handlers, any attempt to start a softirq
|
|
handlers during the
|
|
RCU-bh read-side critical section will be deferred.
|
|
In this case, <tt>rcu_read_unlock_bh()</tt>
|
|
will invoke softirq processing, which can take considerable time.
|
|
One can of course argue that this softirq overhead should be associated
|
|
with the code following the RCU-bh read-side critical section rather
|
|
than <tt>rcu_read_unlock_bh()</tt>, but the fact
|
|
is that most profiling tools cannot be expected to make this sort
|
|
of fine distinction.
|
|
For example, suppose that a three-millisecond-long RCU-bh read-side
|
|
critical section executes during a time of heavy networking load.
|
|
There will very likely be an attempt to invoke at least one softirq
|
|
handler during that three milliseconds, but any such invocation will
|
|
be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>.
|
|
This can of course make it appear at first glance as if
|
|
<tt>rcu_read_unlock_bh()</tt> was executing very slowly.
|
|
|
|
<p>
|
|
The
|
|
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a>
|
|
includes
|
|
<tt>rcu_read_lock_bh()</tt>,
|
|
<tt>rcu_read_unlock_bh()</tt>,
|
|
<tt>rcu_dereference_bh()</tt>,
|
|
<tt>rcu_dereference_bh_check()</tt>,
|
|
<tt>synchronize_rcu_bh()</tt>,
|
|
<tt>synchronize_rcu_bh_expedited()</tt>,
|
|
<tt>call_rcu_bh()</tt>,
|
|
<tt>rcu_barrier_bh()</tt>, and
|
|
<tt>rcu_read_lock_bh_held()</tt>.
|
|
However, the update-side APIs are now simple wrappers for other RCU
|
|
flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt
|
|
otherwise.
|
|
|
|
<h3><a name="Sched Flavor">Sched Flavor (Historical)</a></h3>
|
|
|
|
<p>
|
|
The RCU-sched flavor of RCU has since been expressed in terms of
|
|
the other RCU flavors as part of a consolidation of the three
|
|
flavors into a single flavor.
|
|
The read-side API remains, and continues to disable preemption and to
|
|
be accounted for by lockdep.
|
|
Much of the material in this section is therefore strictly historical
|
|
in nature.
|
|
|
|
<p>
|
|
Before preemptible RCU, waiting for an RCU grace period had the
|
|
side effect of also waiting for all pre-existing interrupt
|
|
and NMI handlers.
|
|
However, there are legitimate preemptible-RCU implementations that
|
|
do not have this property, given that any point in the code outside
|
|
of an RCU read-side critical section can be a quiescent state.
|
|
Therefore, <i>RCU-sched</i> was created, which follows “classic”
|
|
RCU in that an RCU-sched grace period waits for for pre-existing
|
|
interrupt and NMI handlers.
|
|
In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched
|
|
APIs have identical implementations, while kernels built with
|
|
<tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each.
|
|
|
|
<p>
|
|
Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels,
|
|
<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt>
|
|
disable and re-enable preemption, respectively.
|
|
This means that if there was a preemption attempt during the
|
|
RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt>
|
|
will enter the scheduler, with all the latency and overhead entailed.
|
|
Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look
|
|
as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly.
|
|
However, the highest-priority task won't be preempted, so that task
|
|
will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations.
|
|
|
|
<p>
|
|
The
|
|
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a>
|
|
includes
|
|
<tt>rcu_read_lock_sched()</tt>,
|
|
<tt>rcu_read_unlock_sched()</tt>,
|
|
<tt>rcu_read_lock_sched_notrace()</tt>,
|
|
<tt>rcu_read_unlock_sched_notrace()</tt>,
|
|
<tt>rcu_dereference_sched()</tt>,
|
|
<tt>rcu_dereference_sched_check()</tt>,
|
|
<tt>synchronize_sched()</tt>,
|
|
<tt>synchronize_rcu_sched_expedited()</tt>,
|
|
<tt>call_rcu_sched()</tt>,
|
|
<tt>rcu_barrier_sched()</tt>, and
|
|
<tt>rcu_read_lock_sched_held()</tt>.
|
|
However, anything that disables preemption also marks an RCU-sched
|
|
read-side critical section, including
|
|
<tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>,
|
|
<tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>,
|
|
and so on.
|
|
|
|
<h3><a name="Sleepable RCU">Sleepable RCU</a></h3>
|
|
|
|
<p>
|
|
For well over a decade, someone saying “I need to block within
|
|
an RCU read-side critical section” was a reliable indication
|
|
that this someone did not understand RCU.
|
|
After all, if you are always blocking in an RCU read-side critical
|
|
section, you can probably afford to use a higher-overhead synchronization
|
|
mechanism.
|
|
However, that changed with the advent of the Linux kernel's notifiers,
|
|
whose RCU read-side critical
|
|
sections almost never sleep, but sometimes need to.
|
|
This resulted in the introduction of
|
|
<a href="https://lwn.net/Articles/202847/">sleepable RCU</a>,
|
|
or <i>SRCU</i>.
|
|
|
|
<p>
|
|
SRCU allows different domains to be defined, with each such domain
|
|
defined by an instance of an <tt>srcu_struct</tt> structure.
|
|
A pointer to this structure must be passed in to each SRCU function,
|
|
for example, <tt>synchronize_srcu(&ss)</tt>, where
|
|
<tt>ss</tt> is the <tt>srcu_struct</tt> structure.
|
|
The key benefit of these domains is that a slow SRCU reader in one
|
|
domain does not delay an SRCU grace period in some other domain.
|
|
That said, one consequence of these domains is that read-side code
|
|
must pass a “cookie” from <tt>srcu_read_lock()</tt>
|
|
to <tt>srcu_read_unlock()</tt>, for example, as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 int idx;
|
|
2
|
|
3 idx = srcu_read_lock(&ss);
|
|
4 do_something();
|
|
5 srcu_read_unlock(&ss, idx);
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
As noted above, it is legal to block within SRCU read-side critical sections,
|
|
however, with great power comes great responsibility.
|
|
If you block forever in one of a given domain's SRCU read-side critical
|
|
sections, then that domain's grace periods will also be blocked forever.
|
|
Of course, one good way to block forever is to deadlock, which can
|
|
happen if any operation in a given domain's SRCU read-side critical
|
|
section can wait, either directly or indirectly, for that domain's
|
|
grace period to elapse.
|
|
For example, this results in a self-deadlock:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 int idx;
|
|
2
|
|
3 idx = srcu_read_lock(&ss);
|
|
4 do_something();
|
|
5 synchronize_srcu(&ss);
|
|
6 srcu_read_unlock(&ss, idx);
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
However, if line 5 acquired a mutex that was held across
|
|
a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>,
|
|
deadlock would still be possible.
|
|
Furthermore, if line 5 acquired a mutex that was held across
|
|
a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>,
|
|
and if an <tt>ss1</tt>-domain SRCU read-side critical section
|
|
acquired another mutex that was held across as <tt>ss</tt>-domain
|
|
<tt>synchronize_srcu()</tt>,
|
|
deadlock would again be possible.
|
|
Such a deadlock cycle could extend across an arbitrarily large number
|
|
of different SRCU domains.
|
|
Again, with great power comes great responsibility.
|
|
|
|
<p>
|
|
Unlike the other RCU flavors, SRCU read-side critical sections can
|
|
run on idle and even offline CPUs.
|
|
This ability requires that <tt>srcu_read_lock()</tt> and
|
|
<tt>srcu_read_unlock()</tt> contain memory barriers, which means
|
|
that SRCU readers will run a bit slower than would RCU readers.
|
|
It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt>
|
|
API, which, in combination with <tt>srcu_read_unlock()</tt>,
|
|
guarantees a full memory barrier.
|
|
|
|
<p>
|
|
Also unlike other RCU flavors, <tt>synchronize_srcu()</tt> may <b>not</b>
|
|
be invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
|
|
periods make use of timers and the possibility of timers being temporarily
|
|
“stranded” on the outgoing CPU.
|
|
This stranding of timers means that timers posted to the outgoing CPU
|
|
will not fire until late in the CPU-hotplug process.
|
|
The problem is that if a notifier is waiting on an SRCU grace period,
|
|
that grace period is waiting on a timer, and that timer is stranded on the
|
|
outgoing CPU, then the notifier will never be awakened, in other words,
|
|
deadlock has occurred.
|
|
This same situation of course also prohibits <tt>srcu_barrier()</tt>
|
|
from being invoked from CPU-hotplug notifiers.
|
|
|
|
<p>
|
|
SRCU also differs from other RCU flavors in that SRCU's expedited and
|
|
non-expedited grace periods are implemented by the same mechanism.
|
|
This means that in the current SRCU implementation, expediting a
|
|
future grace period has the side effect of expediting all prior
|
|
grace periods that have not yet completed.
|
|
(But please note that this is a property of the current implementation,
|
|
not necessarily of future implementations.)
|
|
In addition, if SRCU has been idle for longer than the interval
|
|
specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter
|
|
(25 microseconds by default),
|
|
and if a <tt>synchronize_srcu()</tt> invocation ends this idle period,
|
|
that invocation will be automatically expedited.
|
|
|
|
<p>
|
|
As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
|
|
a locking bottleneck present in prior kernel versions.
|
|
Although this will allow users to put much heavier stress on
|
|
<tt>call_srcu()</tt>, it is important to note that SRCU does not
|
|
yet take any special steps to deal with callback flooding.
|
|
So if you are posting (say) 10,000 SRCU callbacks per second per CPU,
|
|
you are probably totally OK, but if you intend to post (say) 1,000,000
|
|
SRCU callbacks per second per CPU, please run some tests first.
|
|
SRCU just might need a few adjustment to deal with that sort of load.
|
|
Of course, your mileage may vary based on the speed of your CPUs and
|
|
the size of your memory.
|
|
|
|
<p>
|
|
The
|
|
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a>
|
|
includes
|
|
<tt>srcu_read_lock()</tt>,
|
|
<tt>srcu_read_unlock()</tt>,
|
|
<tt>srcu_dereference()</tt>,
|
|
<tt>srcu_dereference_check()</tt>,
|
|
<tt>synchronize_srcu()</tt>,
|
|
<tt>synchronize_srcu_expedited()</tt>,
|
|
<tt>call_srcu()</tt>,
|
|
<tt>srcu_barrier()</tt>, and
|
|
<tt>srcu_read_lock_held()</tt>.
|
|
It also includes
|
|
<tt>DEFINE_SRCU()</tt>,
|
|
<tt>DEFINE_STATIC_SRCU()</tt>, and
|
|
<tt>init_srcu_struct()</tt>
|
|
APIs for defining and initializing <tt>srcu_struct</tt> structures.
|
|
|
|
<h3><a name="Tasks RCU">Tasks RCU</a></h3>
|
|
|
|
<p>
|
|
Some forms of tracing use “trampolines” to handle the
|
|
binary rewriting required to install different types of probes.
|
|
It would be good to be able to free old trampolines, which sounds
|
|
like a job for some form of RCU.
|
|
However, because it is necessary to be able to install a trace
|
|
anywhere in the code, it is not possible to use read-side markers
|
|
such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
|
|
In addition, it does not work to have these markers in the trampoline
|
|
itself, because there would need to be instructions following
|
|
<tt>rcu_read_unlock()</tt>.
|
|
Although <tt>synchronize_rcu()</tt> would guarantee that execution
|
|
reached the <tt>rcu_read_unlock()</tt>, it would not be able to
|
|
guarantee that execution had completely left the trampoline.
|
|
|
|
<p>
|
|
The solution, in the form of
|
|
<a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>,
|
|
is to have implicit
|
|
read-side critical sections that are delimited by voluntary context
|
|
switches, that is, calls to <tt>schedule()</tt>,
|
|
<tt>cond_resched()</tt>, and
|
|
<tt>synchronize_rcu_tasks()</tt>.
|
|
In addition, transitions to and from userspace execution also delimit
|
|
tasks-RCU read-side critical sections.
|
|
|
|
<p>
|
|
The tasks-RCU API is quite compact, consisting only of
|
|
<tt>call_rcu_tasks()</tt>,
|
|
<tt>synchronize_rcu_tasks()</tt>, and
|
|
<tt>rcu_barrier_tasks()</tt>.
|
|
In <tt>CONFIG_PREEMPT=n</tt> kernels, trampolines cannot be preempted,
|
|
so these APIs map to
|
|
<tt>call_rcu()</tt>,
|
|
<tt>synchronize_rcu()</tt>, and
|
|
<tt>rcu_barrier()</tt>, respectively.
|
|
In <tt>CONFIG_PREEMPT=y</tt> kernels, trampolines can be preempted,
|
|
and these three APIs are therefore implemented by separate functions
|
|
that check for voluntary context switches.
|
|
|
|
<h2><a name="Possible Future Changes">Possible Future Changes</a></h2>
|
|
|
|
<p>
|
|
One of the tricks that RCU uses to attain update-side scalability is
|
|
to increase grace-period latency with increasing numbers of CPUs.
|
|
If this becomes a serious problem, it will be necessary to rework the
|
|
grace-period state machine so as to avoid the need for the additional
|
|
latency.
|
|
|
|
<p>
|
|
RCU disables CPU hotplug in a few places, perhaps most notably in the
|
|
<tt>rcu_barrier()</tt> operations.
|
|
If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug
|
|
notifiers, it will be necessary to avoid disabling CPU hotplug.
|
|
This would introduce some complexity, so there had better be a <i>very</i>
|
|
good reason.
|
|
|
|
<p>
|
|
The tradeoff between grace-period latency on the one hand and interruptions
|
|
of other CPUs on the other hand may need to be re-examined.
|
|
The desire is of course for zero grace-period latency as well as zero
|
|
interprocessor interrupts undertaken during an expedited grace period
|
|
operation.
|
|
While this ideal is unlikely to be achievable, it is quite possible that
|
|
further improvements can be made.
|
|
|
|
<p>
|
|
The multiprocessor implementations of RCU use a combining tree that
|
|
groups CPUs so as to reduce lock contention and increase cache locality.
|
|
However, this combining tree does not spread its memory across NUMA
|
|
nodes nor does it align the CPU groups with hardware features such
|
|
as sockets or cores.
|
|
Such spreading and alignment is currently believed to be unnecessary
|
|
because the hotpath read-side primitives do not access the combining
|
|
tree, nor does <tt>call_rcu()</tt> in the common case.
|
|
If you believe that your architecture needs such spreading and alignment,
|
|
then your architecture should also benefit from the
|
|
<tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set
|
|
to the number of CPUs in a socket, NUMA node, or whatever.
|
|
If the number of CPUs is too large, use a fraction of the number of
|
|
CPUs.
|
|
If the number of CPUs is a large prime number, well, that certainly
|
|
is an “interesting” architectural choice!
|
|
More flexible arrangements might be considered, but only if
|
|
<tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only
|
|
if the inadequacy has been demonstrated by a carefully run and
|
|
realistic system-level workload.
|
|
|
|
<p>
|
|
Please note that arrangements that require RCU to remap CPU numbers will
|
|
require extremely good demonstration of need and full exploration of
|
|
alternatives.
|
|
|
|
<p>
|
|
RCU's various kthreads are reasonably recent additions.
|
|
It is quite likely that adjustments will be required to more gracefully
|
|
handle extreme loads.
|
|
It might also be necessary to be able to relate CPU utilization by
|
|
RCU's kthreads and softirq handlers to the code that instigated this
|
|
CPU utilization.
|
|
For example, RCU callback overhead might be charged back to the
|
|
originating <tt>call_rcu()</tt> instance, though probably not
|
|
in production kernels.
|
|
|
|
<p>
|
|
Additional work may be required to provide reasonable forward-progress
|
|
guarantees under heavy load for grace periods and for callback
|
|
invocation.
|
|
|
|
<h2><a name="Summary">Summary</a></h2>
|
|
|
|
<p>
|
|
This document has presented more than two decade's worth of RCU
|
|
requirements.
|
|
Given that the requirements keep changing, this will not be the last
|
|
word on this subject, but at least it serves to get an important
|
|
subset of the requirements set forth.
|
|
|
|
<h2><a name="Acknowledgments">Acknowledgments</a></h2>
|
|
|
|
I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar,
|
|
Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and
|
|
Andy Lutomirski for their help in rendering
|
|
this article human readable, and to Michelle Rankin for her support
|
|
of this effort.
|
|
Other contributions are acknowledged in the Linux kernel's git archive.
|
|
|
|
</body></html>
|