3923 lines
207 KiB
Plaintext
3923 lines
207 KiB
Plaintext
This is ../../gmp/doc/gmp.info, produced by makeinfo version 4.8 from
|
||
../../gmp/doc/gmp.texi.
|
||
|
||
This manual describes how to install and use the GNU multiple
|
||
precision arithmetic library, version 6.1.0.
|
||
|
||
Copyright 1991, 1993-2015 Free Software Foundation, Inc.
|
||
|
||
Permission is granted to copy, distribute and/or modify this
|
||
document under the terms of the GNU Free Documentation License, Version
|
||
1.3 or any later version published by the Free Software Foundation;
|
||
with no Invariant Sections, with the Front-Cover Texts being "A GNU
|
||
Manual", and with the Back-Cover Texts being "You have freedom to copy
|
||
and modify this GNU Manual, like GNU software". A copy of the license
|
||
is included in *Note GNU Free Documentation License::.
|
||
|
||
INFO-DIR-SECTION GNU libraries
|
||
START-INFO-DIR-ENTRY
|
||
* gmp: (gmp). GNU Multiple Precision Arithmetic Library.
|
||
END-INFO-DIR-ENTRY
|
||
|
||
|
||
File: gmp.info, Node: Exact Division, Next: Exact Remainder, Prev: Block-Wise Barrett Division, Up: Division Algorithms
|
||
|
||
15.2.5 Exact Division
|
||
---------------------
|
||
|
||
A so-called exact division is when the dividend is known to be an exact
|
||
multiple of the divisor. Jebelean's exact division algorithm uses this
|
||
knowledge to make some significant optimizations (*note References::).
|
||
|
||
The idea can be illustrated in decimal for example with 368154
|
||
divided by 543. Because the low digit of the dividend is 4, the low
|
||
digit of the quotient must be 8. This is arrived at from 4*7 mod 10,
|
||
using the fact 7 is the modular inverse of 3 (the low digit of the
|
||
divisor), since 3*7 == 1 mod 10. So 8*543=4344 can be subtracted from
|
||
the dividend leaving 363810. Notice the low digit has become zero.
|
||
|
||
The procedure is repeated at the second digit, with the next
|
||
quotient digit 7 (7 == 1*7 mod 10), subtracting 7*543=3801, leaving
|
||
325800. And finally at the third digit with quotient digit 6 (8*7 mod
|
||
10), subtracting 6*543=3258 leaving 0. So the quotient is 678.
|
||
|
||
Notice however that the multiplies and subtractions don't need to
|
||
extend past the low three digits of the dividend, since that's enough
|
||
to determine the three quotient digits. For the last quotient digit no
|
||
subtraction is needed at all. On a 2NxN division like this one, only
|
||
about half the work of a normal basecase division is necessary.
|
||
|
||
For an NxM exact division producing Q=N-M quotient limbs, the saving
|
||
over a normal basecase division is in two parts. Firstly, each of the
|
||
Q quotient limbs needs only one multiply, not a 2x1 divide and
|
||
multiply. Secondly, the crossproducts are reduced when Q>M to
|
||
Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2. Notice the savings are
|
||
complementary. If Q is big then many divisions are saved, or if Q is
|
||
small then the crossproducts reduce to a small number.
|
||
|
||
The modular inverse used is calculated efficiently by `binvert_limb'
|
||
in `gmp-impl.h'. This does four multiplies for a 32-bit limb, or six
|
||
for a 64-bit limb. `tune/modlinv.c' has some alternate implementations
|
||
that might suit processors better at bit twiddling than multiplying.
|
||
|
||
The sub-quadratic exact division described by Jebelean in "Exact
|
||
Division with Karatsuba Complexity" is not currently implemented. It
|
||
uses a rearrangement similar to the divide and conquer for normal
|
||
division (*note Divide and Conquer Division::), but operating from low
|
||
to high. A further possibility not currently implemented is
|
||
"Bidirectional Exact Integer Division" by Krandick and Jebelean which
|
||
forms quotient limbs from both the high and low ends of the dividend,
|
||
and can halve once more the number of crossproducts needed in a 2NxN
|
||
division.
|
||
|
||
A special case exact division by 3 exists in `mpn_divexact_by3',
|
||
supporting Toom-3 multiplication and `mpq' canonicalizations. It forms
|
||
quotient digits with a multiply by the modular inverse of 3 (which is
|
||
`0xAA..AAB') and uses two comparisons to determine a borrow for the next
|
||
limb. The multiplications don't need to be on the dependent chain, as
|
||
long as the effect of the borrows is applied, which can help chips with
|
||
pipelined multipliers.
|
||
|
||
|
||
File: gmp.info, Node: Exact Remainder, Next: Small Quotient Division, Prev: Exact Division, Up: Division Algorithms
|
||
|
||
15.2.6 Exact Remainder
|
||
----------------------
|
||
|
||
If the exact division algorithm is done with a full subtraction at each
|
||
stage and the dividend isn't a multiple of the divisor, then low zero
|
||
limbs are produced but with a remainder in the high limbs. For
|
||
dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, this
|
||
remainder r is of the form
|
||
|
||
a = q*d + r*b^n
|
||
|
||
n represents the number of zero limbs produced by the subtractions,
|
||
that being the number of limbs produced for q. r will be in the range
|
||
0<=r<d and can be viewed as a remainder, but one shifted up by a factor
|
||
of b^n.
|
||
|
||
Carrying out full subtractions at each stage means the same number
|
||
of cross products must be done as a normal division, but there's still
|
||
some single limb divisions saved. When d is a single limb some
|
||
simplifications arise, providing good speedups on a number of
|
||
processors.
|
||
|
||
The functions `mpn_divexact_by3', `mpn_modexact_1_odd' and the
|
||
internal `mpn_redc_X' functions differ subtly in how they return r,
|
||
leading to some negations in the above formula, but all are essentially
|
||
the same.
|
||
|
||
Clearly r is zero when a is a multiple of d, and this leads to
|
||
divisibility or congruence tests which are potentially more efficient
|
||
than a normal division.
|
||
|
||
The factor of b^n on r can be ignored in a GCD when d is odd, hence
|
||
the use of `mpn_modexact_1_odd' by `mpn_gcd_1' and `mpz_kronecker_ui'
|
||
etc (*note Greatest Common Divisor Algorithms::).
|
||
|
||
Montgomery's REDC method for modular multiplications uses operands
|
||
of the form of x*b^-n and y*b^-n and on calculating (x*b^-n)*(y*b^-n)
|
||
uses the factor of b^n in the exact remainder to reach a product in the
|
||
same form (x*y)*b^-n (*note Modular Powering Algorithm::).
|
||
|
||
Notice that r generally gives no useful information about the
|
||
ordinary remainder a mod d since b^n mod d could be anything. If
|
||
however b^n == 1 mod d, then r is the negative of the ordinary
|
||
remainder. This occurs whenever d is a factor of b^n-1, as for example
|
||
with 3 in `mpn_divexact_by3'. For a 32 or 64 bit limb other such
|
||
factors include 5, 17 and 257, but no particular use has been found for
|
||
this.
|
||
|
||
|
||
File: gmp.info, Node: Small Quotient Division, Prev: Exact Remainder, Up: Division Algorithms
|
||
|
||
15.2.7 Small Quotient Division
|
||
------------------------------
|
||
|
||
An NxM division where the number of quotient limbs Q=N-M is small can
|
||
be optimized somewhat.
|
||
|
||
An ordinary basecase division normalizes the divisor by shifting it
|
||
to make the high bit set, shifting the dividend accordingly, and
|
||
shifting the remainder back down at the end of the calculation. This
|
||
is wasteful if only a few quotient limbs are to be formed. Instead a
|
||
division of just the top 2*Q limbs of the dividend by the top Q limbs
|
||
of the divisor can be used to form a trial quotient. This requires
|
||
only those limbs normalized, not the whole of the divisor and dividend.
|
||
|
||
A multiply and subtract then applies the trial quotient to the M-Q
|
||
unused limbs of the divisor and N-Q dividend limbs (which includes Q
|
||
limbs remaining from the trial quotient division). The starting trial
|
||
quotient can be 1 or 2 too big, but all cases of 2 too big and most
|
||
cases of 1 too big are detected by first comparing the most significant
|
||
limbs that will arise from the subtraction. An addback is done if the
|
||
quotient still turns out to be 1 too big.
|
||
|
||
This whole procedure is essentially the same as one step of the
|
||
basecase algorithm done in a Q limb base, though with the trial
|
||
quotient test done only with the high limbs, not an entire Q limb
|
||
"digit" product. The correctness of this weaker test can be
|
||
established by following the argument of Knuth section 4.3.1 exercise
|
||
20 but with the v2*q>b*r+u2 condition appropriately relaxed.
|
||
|
||
|
||
File: gmp.info, Node: Greatest Common Divisor Algorithms, Next: Powering Algorithms, Prev: Division Algorithms, Up: Algorithms
|
||
|
||
15.3 Greatest Common Divisor
|
||
============================
|
||
|
||
* Menu:
|
||
|
||
* Binary GCD::
|
||
* Lehmer's Algorithm::
|
||
* Subquadratic GCD::
|
||
* Extended GCD::
|
||
* Jacobi Symbol::
|
||
|
||
|
||
File: gmp.info, Node: Binary GCD, Next: Lehmer's Algorithm, Prev: Greatest Common Divisor Algorithms, Up: Greatest Common Divisor Algorithms
|
||
|
||
15.3.1 Binary GCD
|
||
-----------------
|
||
|
||
At small sizes GMP uses an O(N^2) binary style GCD. This is described
|
||
in many textbooks, for example Knuth section 4.5.2 algorithm B. It
|
||
simply consists of successively reducing odd operands a and b using
|
||
|
||
a,b = abs(a-b),min(a,b)
|
||
strip factors of 2 from a
|
||
|
||
The Euclidean GCD algorithm, as per Knuth algorithms E and A,
|
||
repeatedly computes the quotient q = floor(a/b) and replaces a,b by v,
|
||
u - q v. The binary algorithm has so far been found to be faster than
|
||
the Euclidean algorithm everywhere. One reason the binary method does
|
||
well is that the implied quotient at each step is usually small, so
|
||
often only one or two subtractions are needed to get the same effect as
|
||
a division. Quotients 1, 2 and 3 for example occur 67.7% of the time,
|
||
see Knuth section 4.5.3 Theorem E.
|
||
|
||
When the implied quotient is large, meaning b is much smaller than
|
||
a, then a division is worthwhile. This is the basis for the initial a
|
||
mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both Nx1
|
||
and 1x1 cases). But after that initial reduction, big quotients occur
|
||
too rarely to make it worth checking for them.
|
||
|
||
|
||
The final 1x1 GCD in `mpn_gcd_1' is done in the generic C code as
|
||
described above. For two N-bit operands, the algorithm takes about
|
||
0.68 iterations per bit. For optimum performance some attention needs
|
||
to be paid to the way the factors of 2 are stripped from a.
|
||
|
||
Firstly it may be noted that in twos complement the number of low
|
||
zero bits on a-b is the same as b-a, so counting or testing can begin on
|
||
a-b without waiting for abs(a-b) to be determined.
|
||
|
||
A loop stripping low zero bits tends not to branch predict well,
|
||
since the condition is data dependent. But on average there's only a
|
||
few low zeros, so an option is to strip one or two bits arithmetically
|
||
then loop for more (as done for AMD K6). Or use a lookup table to get
|
||
a count for several bits then loop for more (as done for AMD K7). An
|
||
alternative approach is to keep just one of a or b odd and iterate
|
||
|
||
a,b = abs(a-b), min(a,b)
|
||
a = a/2 if even
|
||
b = b/2 if even
|
||
|
||
This requires about 1.25 iterations per bit, but stripping of a
|
||
single bit at each step avoids any branching. Repeating the bit strip
|
||
reduces to about 0.9 iterations per bit, which may be a worthwhile
|
||
tradeoff.
|
||
|
||
Generally with the above approaches a speed of perhaps 6 cycles per
|
||
bit can be achieved, which is still not terribly fast with for instance
|
||
a 64-bit GCD taking nearly 400 cycles. It's this sort of time which
|
||
means it's not usually advantageous to combine a set of divisibility
|
||
tests into a GCD.
|
||
|
||
Currently, the binary algorithm is used for GCD only when N < 3.
|
||
|
||
|
||
File: gmp.info, Node: Lehmer's Algorithm, Next: Subquadratic GCD, Prev: Binary GCD, Up: Greatest Common Divisor Algorithms
|
||
|
||
15.3.2 Lehmer's algorithm
|
||
-------------------------
|
||
|
||
Lehmer's improvement of the Euclidean algorithms is based on the
|
||
observation that the initial part of the quotient sequence depends only
|
||
on the most significant parts of the inputs. The variant of Lehmer's
|
||
algorithm used in GMP splits off the most significant two limbs, as
|
||
suggested, e.g., in "A Double-Digit Lehmer-Euclid Algorithm" by
|
||
Jebelean (*note References::). The quotients of two double-limb inputs
|
||
are collected as a 2 by 2 matrix with single-limb elements. This is
|
||
done by the function `mpn_hgcd2'. The resulting matrix is applied to
|
||
the inputs using `mpn_mul_1' and `mpn_submul_1'. Each iteration usually
|
||
reduces the inputs by almost one limb. In the rare case of a large
|
||
quotient, no progress can be made by examining just the most
|
||
significant two limbs, and the quotient is computed using plain
|
||
division.
|
||
|
||
The resulting algorithm is asymptotically O(N^2), just as the
|
||
Euclidean algorithm and the binary algorithm. The quadratic part of the
|
||
work are the calls to `mpn_mul_1' and `mpn_submul_1'. For small sizes,
|
||
the linear work is also significant. There are roughly N calls to the
|
||
`mpn_hgcd2' function. This function uses a couple of important
|
||
optimizations:
|
||
|
||
* It uses the same relaxed notion of correctness as `mpn_hgcd' (see
|
||
next section). This means that when called with the most
|
||
significant two limbs of two large numbers, the returned matrix
|
||
does not always correspond exactly to the initial quotient
|
||
sequence for the two large numbers; the final quotient may
|
||
sometimes be one off.
|
||
|
||
* It takes advantage of the fact the quotients are usually small.
|
||
The division operator is not used, since the corresponding
|
||
assembler instruction is very slow on most architectures. (This
|
||
code could probably be improved further, it uses many branches
|
||
that are unfriendly to prediction).
|
||
|
||
* It switches from double-limb calculations to single-limb
|
||
calculations half-way through, when the input numbers have been
|
||
reduced in size from two limbs to one and a half.
|
||
|
||
|
||
|
||
File: gmp.info, Node: Subquadratic GCD, Next: Extended GCD, Prev: Lehmer's Algorithm, Up: Greatest Common Divisor Algorithms
|
||
|
||
15.3.3 Subquadratic GCD
|
||
-----------------------
|
||
|
||
For inputs larger than `GCD_DC_THRESHOLD', GCD is computed via the HGCD
|
||
(Half GCD) function, as a generalization to Lehmer's algorithm.
|
||
|
||
Let the inputs a,b be of size N limbs each. Put S = floor(N/2) + 1.
|
||
Then HGCD(a,b) returns a transformation matrix T with non-negative
|
||
elements, and reduced numbers (c;d) = T^-1 (a;b). The reduced numbers
|
||
c,d must be larger than S limbs, while their difference abs(c-d) must
|
||
fit in S limbs. The matrix elements will also be of size roughly N/2.
|
||
|
||
The HGCD base case uses Lehmer's algorithm, but with the above stop
|
||
condition that returns reduced numbers and the corresponding
|
||
transformation matrix half-way through. For inputs larger than
|
||
`HGCD_THRESHOLD', HGCD is computed recursively, using the divide and
|
||
conquer algorithm in "On Scho"nhage's algorithm and subquadratic
|
||
integer GCD computation" by Mo"ller (*note References::). The recursive
|
||
algorithm consists of these main steps.
|
||
|
||
* Call HGCD recursively, on the most significant N/2 limbs. Apply the
|
||
resulting matrix T_1 to the full numbers, reducing them to a size
|
||
just above 3N/2.
|
||
|
||
* Perform a small number of division or subtraction steps to reduce
|
||
the numbers to size below 3N/2. This is essential mainly for the
|
||
unlikely case of large quotients.
|
||
|
||
* Call HGCD recursively, on the most significant N/2 limbs of the
|
||
reduced numbers. Apply the resulting matrix T_2 to the full
|
||
numbers, reducing them to a size just above N/2.
|
||
|
||
* Compute T = T_1 T_2.
|
||
|
||
* Perform a small number of division and subtraction steps to
|
||
satisfy the requirements, and return.
|
||
|
||
GCD is then implemented as a loop around HGCD, similarly to Lehmer's
|
||
algorithm. Where Lehmer repeatedly chops off the top two limbs, calls
|
||
`mpn_hgcd2', and applies the resulting matrix to the full numbers, the
|
||
sub-quadratic GCD chops off the most significant third of the limbs (the
|
||
proportion is a tuning parameter, and 1/3 seems to be more efficient
|
||
than, e.g, 1/2), calls `mpn_hgcd', and applies the resulting matrix.
|
||
Once the input numbers are reduced to size below `GCD_DC_THRESHOLD',
|
||
Lehmer's algorithm is used for the rest of the work.
|
||
|
||
The asymptotic running time of both HGCD and GCD is O(M(N)*log(N)),
|
||
where M(N) is the time for multiplying two N-limb numbers.
|
||
|
||
|
||
File: gmp.info, Node: Extended GCD, Next: Jacobi Symbol, Prev: Subquadratic GCD, Up: Greatest Common Divisor Algorithms
|
||
|
||
15.3.4 Extended GCD
|
||
-------------------
|
||
|
||
The extended GCD function, or GCDEXT, calculates gcd(a,b) and also
|
||
cofactors x and y satisfying a*x+b*y=gcd(a,b). All the algorithms used
|
||
for plain GCD are extended to handle this case. The binary algorithm is
|
||
used only for single-limb GCDEXT. Lehmer's algorithm is used for sizes
|
||
up to `GCDEXT_DC_THRESHOLD'. Above this threshold, GCDEXT is
|
||
implemented as a loop around HGCD, but with more book-keeping to keep
|
||
track of the cofactors. This gives the same asymptotic running time as
|
||
for GCD and HGCD, O(M(N)*log(N))
|
||
|
||
One difference to plain GCD is that while the inputs a and b are
|
||
reduced as the algorithm proceeds, the cofactors x and y grow in size.
|
||
This makes the tuning of the chopping-point more difficult. The current
|
||
code chops off the most significant half of the inputs for the call to
|
||
HGCD in the first iteration, and the most significant two thirds for
|
||
the remaining calls. This strategy could surely be improved. Also the
|
||
stop condition for the loop, where Lehmer's algorithm is invoked once
|
||
the inputs are reduced below `GCDEXT_DC_THRESHOLD', could maybe be
|
||
improved by taking into account the current size of the cofactors.
|
||
|
||
|
||
File: gmp.info, Node: Jacobi Symbol, Prev: Extended GCD, Up: Greatest Common Divisor Algorithms
|
||
|
||
15.3.5 Jacobi Symbol
|
||
--------------------
|
||
|
||
[This section is obsolete. The current Jacobi code actually uses a very
|
||
efficient algorithm.]
|
||
|
||
`mpz_jacobi' and `mpz_kronecker' are currently implemented with a
|
||
simple binary algorithm similar to that described for the GCDs (*note
|
||
Binary GCD::). They're not very fast when both inputs are large.
|
||
Lehmer's multi-step improvement or a binary based multi-step algorithm
|
||
is likely to be better.
|
||
|
||
When one operand fits a single limb, and that includes
|
||
`mpz_kronecker_ui' and friends, an initial reduction is done with
|
||
either `mpn_mod_1' or `mpn_modexact_1_odd', followed by the binary
|
||
algorithm on a single limb. The binary algorithm is well suited to a
|
||
single limb, and the whole calculation in this case is quite efficient.
|
||
|
||
In all the routines sign changes for the result are accumulated
|
||
using some bit twiddling, avoiding table lookups or conditional jumps.
|
||
|
||
|
||
File: gmp.info, Node: Powering Algorithms, Next: Root Extraction Algorithms, Prev: Greatest Common Divisor Algorithms, Up: Algorithms
|
||
|
||
15.4 Powering Algorithms
|
||
========================
|
||
|
||
* Menu:
|
||
|
||
* Normal Powering Algorithm::
|
||
* Modular Powering Algorithm::
|
||
|
||
|
||
File: gmp.info, Node: Normal Powering Algorithm, Next: Modular Powering Algorithm, Prev: Powering Algorithms, Up: Powering Algorithms
|
||
|
||
15.4.1 Normal Powering
|
||
----------------------
|
||
|
||
Normal `mpz' or `mpf' powering uses a simple binary algorithm,
|
||
successively squaring and then multiplying by the base when a 1 bit is
|
||
seen in the exponent, as per Knuth section 4.6.3. The "left to right"
|
||
variant described there is used rather than algorithm A, since it's
|
||
just as easy and can be done with somewhat less temporary memory.
|
||
|
||
|
||
File: gmp.info, Node: Modular Powering Algorithm, Prev: Normal Powering Algorithm, Up: Powering Algorithms
|
||
|
||
15.4.2 Modular Powering
|
||
-----------------------
|
||
|
||
Modular powering is implemented using a 2^k-ary sliding window
|
||
algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85
|
||
(*note References::). k is chosen according to the size of the
|
||
exponent. Larger exponents use larger values of k, the choice being
|
||
made to minimize the average number of multiplications that must
|
||
supplement the squaring.
|
||
|
||
The modular multiplies and squarings use either a simple division or
|
||
the REDC method by Montgomery (*note References::). REDC is a little
|
||
faster, essentially saving N single limb divisions in a fashion similar
|
||
to an exact remainder (*note Exact Remainder::).
|
||
|
||
|
||
File: gmp.info, Node: Root Extraction Algorithms, Next: Radix Conversion Algorithms, Prev: Powering Algorithms, Up: Algorithms
|
||
|
||
15.5 Root Extraction Algorithms
|
||
===============================
|
||
|
||
* Menu:
|
||
|
||
* Square Root Algorithm::
|
||
* Nth Root Algorithm::
|
||
* Perfect Square Algorithm::
|
||
* Perfect Power Algorithm::
|
||
|
||
|
||
File: gmp.info, Node: Square Root Algorithm, Next: Nth Root Algorithm, Prev: Root Extraction Algorithms, Up: Root Extraction Algorithms
|
||
|
||
15.5.1 Square Root
|
||
------------------
|
||
|
||
Square roots are taken using the "Karatsuba Square Root" algorithm by
|
||
Paul Zimmermann (*note References::).
|
||
|
||
An input n is split into four parts of k bits each, so with b=2^k we
|
||
have n = a3*b^3 + a2*b^2 + a1*b + a0. Part a3 must be "normalized" so
|
||
that either the high or second highest bit is set. In GMP, k is kept
|
||
on a limb boundary and the input is left shifted (by an even number of
|
||
bits) to normalize.
|
||
|
||
The square root of the high two parts is taken, by recursive
|
||
application of the algorithm (bottoming out in a one-limb Newton's
|
||
method),
|
||
|
||
s1,r1 = sqrtrem (a3*b + a2)
|
||
|
||
This is an approximation to the desired root and is extended by a
|
||
division to give s,r,
|
||
|
||
q,u = divrem (r1*b + a1, 2*s1)
|
||
s = s1*b + q
|
||
r = u*b + a0 - q^2
|
||
|
||
The normalization requirement on a3 means at this point s is either
|
||
correct or 1 too big. r is negative in the latter case, so
|
||
|
||
if r < 0 then
|
||
r = r + 2*s - 1
|
||
s = s - 1
|
||
|
||
The algorithm is expressed in a divide and conquer form, but as
|
||
noted in the paper it can also be viewed as a discrete variant of
|
||
Newton's method, or as a variation on the schoolboy method (no longer
|
||
taught) for square roots two digits at a time.
|
||
|
||
If the remainder r is not required then usually only a few high limbs
|
||
of r and u need to be calculated to determine whether an adjustment to
|
||
s is required. This optimization is not currently implemented.
|
||
|
||
In the Karatsuba multiplication range this algorithm is
|
||
O(1.5*M(N/2)), where M(n) is the time to multiply two numbers of n
|
||
limbs. In the FFT multiplication range this grows to a bound of
|
||
O(6*M(N/2)). In practice a factor of about 1.5 to 1.8 is found in the
|
||
Karatsuba and Toom-3 ranges, growing to 2 or 3 in the FFT range.
|
||
|
||
The algorithm does all its calculations in integers and the resulting
|
||
`mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'. The extended
|
||
precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.
|
||
|
||
|
||
File: gmp.info, Node: Nth Root Algorithm, Next: Perfect Square Algorithm, Prev: Square Root Algorithm, Up: Root Extraction Algorithms
|
||
|
||
15.5.2 Nth Root
|
||
---------------
|
||
|
||
Integer Nth roots are taken using Newton's method with the following
|
||
iteration, where A is the input and n is the root to be taken.
|
||
|
||
1 A
|
||
a[i+1] = - * ( --------- + (n-1)*a[i] )
|
||
n a[i]^(n-1)
|
||
|
||
The initial approximation a[1] is generated bitwise by successively
|
||
powering a trial root with or without new 1 bits, aiming to be just
|
||
above the true root. The iteration converges quadratically when
|
||
started from a good approximation. When n is large more initial bits
|
||
are needed to get good convergence. The current implementation is not
|
||
particularly well optimized.
|
||
|
||
|
||
File: gmp.info, Node: Perfect Square Algorithm, Next: Perfect Power Algorithm, Prev: Nth Root Algorithm, Up: Root Extraction Algorithms
|
||
|
||
15.5.3 Perfect Square
|
||
---------------------
|
||
|
||
A significant fraction of non-squares can be quickly identified by
|
||
checking whether the input is a quadratic residue modulo small integers.
|
||
|
||
`mpz_perfect_square_p' first tests the input mod 256, which means
|
||
just examining the low byte. Only 44 different values occur for
|
||
squares mod 256, so 82.8% of inputs can be immediately identified as
|
||
non-squares.
|
||
|
||
On a 32-bit system similar tests are done mod 9, 5, 7, 13 and 17,
|
||
for a total 99.25% of inputs identified as non-squares. On a 64-bit
|
||
system 97 is tested too, for a total 99.62%.
|
||
|
||
These moduli are chosen because they're factors of 2^24-1 (or 2^48-1
|
||
for 64-bits), and such a remainder can be quickly taken just using
|
||
additions (see `mpn_mod_34lsub1').
|
||
|
||
When nails are in use moduli are instead selected by the `gen-psqr.c'
|
||
program and applied with an `mpn_mod_1'. The same 2^24-1 or 2^48-1
|
||
could be done with nails using some extra bit shifts, but this is not
|
||
currently implemented.
|
||
|
||
In any case each modulus is applied to the `mpn_mod_34lsub1' or
|
||
`mpn_mod_1' remainder and a table lookup identifies non-squares. By
|
||
using a "modexact" style calculation, and suitably permuted tables,
|
||
just one multiply each is required, see the code for details. Moduli
|
||
are also combined to save operations, so long as the lookup tables
|
||
don't become too big. `gen-psqr.c' does all the pre-calculations.
|
||
|
||
A square root must still be taken for any value that passes these
|
||
tests, to verify it's really a square and not one of the small fraction
|
||
of non-squares that get through (i.e. a pseudo-square to all the tested
|
||
bases).
|
||
|
||
Clearly more residue tests could be done, `mpz_perfect_square_p' only
|
||
uses a compact and efficient set. Big inputs would probably benefit
|
||
from more residue testing, small inputs might be better off with less.
|
||
The assumed distribution of squares versus non-squares in the input
|
||
would affect such considerations.
|
||
|
||
|
||
File: gmp.info, Node: Perfect Power Algorithm, Prev: Perfect Square Algorithm, Up: Root Extraction Algorithms
|
||
|
||
15.5.4 Perfect Power
|
||
--------------------
|
||
|
||
Detecting perfect powers is required by some factorization algorithms.
|
||
Currently `mpz_perfect_power_p' is implemented using repeated Nth root
|
||
extractions, though naturally only prime roots need to be considered.
|
||
(*Note Nth Root Algorithm::.)
|
||
|
||
If a prime divisor p with multiplicity e can be found, then only
|
||
roots which are divisors of e need to be considered, much reducing the
|
||
work necessary. To this end divisibility by a set of small primes is
|
||
checked.
|
||
|
||
|
||
File: gmp.info, Node: Radix Conversion Algorithms, Next: Other Algorithms, Prev: Root Extraction Algorithms, Up: Algorithms
|
||
|
||
15.6 Radix Conversion
|
||
=====================
|
||
|
||
Radix conversions are less important than other algorithms. A program
|
||
dominated by conversions should probably use a different data
|
||
representation.
|
||
|
||
* Menu:
|
||
|
||
* Binary to Radix::
|
||
* Radix to Binary::
|
||
|
||
|
||
File: gmp.info, Node: Binary to Radix, Next: Radix to Binary, Prev: Radix Conversion Algorithms, Up: Radix Conversion Algorithms
|
||
|
||
15.6.1 Binary to Radix
|
||
----------------------
|
||
|
||
Conversions from binary to a power-of-2 radix use a simple and fast
|
||
O(N) bit extraction algorithm.
|
||
|
||
Conversions from binary to other radices use one of two algorithms.
|
||
Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method.
|
||
Repeated divisions by b^n are made, where b is the radix and n is the
|
||
biggest power that fits in a limb. But instead of simply using the
|
||
remainder r from such divisions, an extra divide step is done to give a
|
||
fractional limb representing r/b^n. The digits of r can then be
|
||
extracted using multiplications by b rather than divisions. Special
|
||
case code is provided for decimal, allowing multiplications by 10 to
|
||
optimize to shifts and adds.
|
||
|
||
Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
|
||
used. For an input t, powers b^(n*2^i) of the radix are calculated,
|
||
until a power between t and sqrt(t) is reached. t is then divided by
|
||
that largest power, giving a quotient which is the digits above that
|
||
power, and a remainder which is those below. These two parts are in
|
||
turn divided by the second highest power, and so on recursively. When
|
||
a piece has been divided down to less than `GET_STR_DC_THRESHOLD'
|
||
limbs, the basecase algorithm described above is used.
|
||
|
||
The advantage of this algorithm is that big divisions can make use
|
||
of the sub-quadratic divide and conquer division (*note Divide and
|
||
Conquer Division::), and big divisions tend to have less overheads than
|
||
lots of separate single limb divisions anyway. But in any case the
|
||
cost of calculating the powers b^(n*2^i) must first be overcome.
|
||
|
||
`GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent
|
||
the same basic thing, the point where it becomes worth doing a big
|
||
division to cut the input in half. `GET_STR_PRECOMPUTE_THRESHOLD'
|
||
includes the cost of calculating the radix power required, whereas
|
||
`GET_STR_DC_THRESHOLD' assumes that's already available, which is the
|
||
case when recursing.
|
||
|
||
Since the base case produces digits from least to most significant
|
||
but they want to be stored from most to least, it's necessary to
|
||
calculate in advance how many digits there will be, or at least be sure
|
||
not to underestimate that. For GMP the number of input bits is
|
||
multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up.
|
||
The result is either correct or one too big.
|
||
|
||
Examining some of the high bits of the input could increase the
|
||
chance of getting the exact number of digits, but an exact result every
|
||
time would not be practical, since in general the difference between
|
||
numbers 100... and 99... is only in the last few bits and the work to
|
||
identify 99... might well be almost as much as a full conversion.
|
||
|
||
The r/b^n scheme described above for using multiplications to bring
|
||
out digits might be useful for more than a single limb. Some brief
|
||
experiments with it on the base case when recursing didn't give a
|
||
noticeable improvement, but perhaps that was only due to the
|
||
implementation. Something similar would work for the sub-quadratic
|
||
divisions too, though there would be the cost of calculating a bigger
|
||
radix power.
|
||
|
||
Another possible improvement for the sub-quadratic part would be to
|
||
arrange for radix powers that balanced the sizes of quotient and
|
||
remainder produced, i.e. the highest power would be an b^(n*k)
|
||
approximately equal to sqrt(t), not restricted to a 2^i factor. That
|
||
ought to smooth out a graph of times against sizes, but may or may not
|
||
be a net speedup.
|
||
|
||
|
||
File: gmp.info, Node: Radix to Binary, Prev: Binary to Radix, Up: Radix Conversion Algorithms
|
||
|
||
15.6.2 Radix to Binary
|
||
----------------------
|
||
|
||
*This section needs to be rewritten, it currently describes the
|
||
algorithms used before GMP 4.3.*
|
||
|
||
Conversions from a power-of-2 radix into binary use a simple and fast
|
||
O(N) bitwise concatenation algorithm.
|
||
|
||
Conversions from other radices use one of two algorithms. Sizes
|
||
below `SET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method. Groups
|
||
of n digits are converted to limbs, where n is the biggest power of the
|
||
base b which will fit in a limb, then those groups are accumulated into
|
||
the result by multiplying by b^n and adding. This saves
|
||
multi-precision operations, as per Knuth section 4.4 part E (*note
|
||
References::). Some special case code is provided for decimal, giving
|
||
the compiler a chance to optimize multiplications by 10.
|
||
|
||
Above `SET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
|
||
used. First groups of n digits are converted into limbs. Then adjacent
|
||
limbs are combined into limb pairs with x*b^n+y, where x and y are the
|
||
limbs. Adjacent limb pairs are combined into quads similarly with
|
||
x*b^(2n)+y. This continues until a single block remains, that being
|
||
the result.
|
||
|
||
The advantage of this method is that the multiplications for each x
|
||
are big blocks, allowing Karatsuba and higher algorithms to be used.
|
||
But the cost of calculating the powers b^(n*2^i) must be overcome.
|
||
`SET_STR_PRECOMPUTE_THRESHOLD' usually ends up quite big, around 5000
|
||
digits, and on some processors much bigger still.
|
||
|
||
`SET_STR_PRECOMPUTE_THRESHOLD' is based on the input digits (and
|
||
tuned for decimal), though it might be better based on a limb count, so
|
||
as to be independent of the base. But that sort of count isn't used by
|
||
the base case and so would need some sort of initial calculation or
|
||
estimate.
|
||
|
||
The main reason `SET_STR_PRECOMPUTE_THRESHOLD' is so much bigger
|
||
than the corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that
|
||
`mpn_mul_1' is much faster than `mpn_divrem_1' (often by a factor of 5,
|
||
or more).
|
||
|
||
|
||
File: gmp.info, Node: Other Algorithms, Next: Assembly Coding, Prev: Radix Conversion Algorithms, Up: Algorithms
|
||
|
||
15.7 Other Algorithms
|
||
=====================
|
||
|
||
* Menu:
|
||
|
||
* Prime Testing Algorithm::
|
||
* Factorial Algorithm::
|
||
* Binomial Coefficients Algorithm::
|
||
* Fibonacci Numbers Algorithm::
|
||
* Lucas Numbers Algorithm::
|
||
* Random Number Algorithms::
|
||
|
||
|
||
File: gmp.info, Node: Prime Testing Algorithm, Next: Factorial Algorithm, Prev: Other Algorithms, Up: Other Algorithms
|
||
|
||
15.7.1 Prime Testing
|
||
--------------------
|
||
|
||
The primality testing in `mpz_probab_prime_p' (*note Number Theoretic
|
||
Functions::) first does some trial division by small factors and then
|
||
uses the Miller-Rabin probabilistic primality testing algorithm, as
|
||
described in Knuth section 4.5.4 algorithm P (*note References::).
|
||
|
||
For an odd input n, and with n = q*2^k+1 where q is odd, this
|
||
algorithm selects a random base x and tests whether x^q mod n is 1 or
|
||
-1, or an x^(q*2^j) mod n is 1, for 1<=j<=k. If so then n is probably
|
||
prime, if not then n is definitely composite.
|
||
|
||
Any prime n will pass the test, but some composites do too. Such
|
||
composites are known as strong pseudoprimes to base x. No n is a
|
||
strong pseudoprime to more than 1/4 of all bases (see Knuth exercise
|
||
22), hence with x chosen at random there's no more than a 1/4 chance a
|
||
"probable prime" will in fact be composite.
|
||
|
||
In fact strong pseudoprimes are quite rare, making the test much more
|
||
powerful than this analysis would suggest, but 1/4 is all that's proven
|
||
for an arbitrary n.
|
||
|
||
|
||
File: gmp.info, Node: Factorial Algorithm, Next: Binomial Coefficients Algorithm, Prev: Prime Testing Algorithm, Up: Other Algorithms
|
||
|
||
15.7.2 Factorial
|
||
----------------
|
||
|
||
Factorials are calculated by a combination of two algorithms. An idea is
|
||
shared among them: to compute the odd part of the factorial; a final
|
||
step takes account of the power of 2 term, by shifting.
|
||
|
||
For small n, the odd factor of n! is computed with the simple
|
||
observation that it is equal to the product of all positive odd numbers
|
||
smaller than n times the odd factor of [n/2]!, where [x] is the integer
|
||
part of x, and so on recursively. The procedure can be best illustrated
|
||
with an example,
|
||
|
||
23! = (23.21.19.17.15.13.11.9.7.5.3)(11.9.7.5.3)(5.3)2^19
|
||
|
||
Current code collects all the factors in a single list, with a loop
|
||
and no recursion, and compute the product, with no special care for
|
||
repeated chunks.
|
||
|
||
When n is larger, computation pass trough prime sieving. An helper
|
||
function is used, as suggested by Peter Luschny:
|
||
|
||
n
|
||
-----
|
||
n! | | L(p,n)
|
||
msf(n) = -------------- = | | p
|
||
[n/2]!^2.2^k p=3
|
||
|
||
Where p ranges on odd prime numbers. The exponent k is chosen to
|
||
obtain an odd integer number: k is the number of 1 bits in the binary
|
||
representation of [n/2]. The function L(p,n) can be defined as zero
|
||
when p is composite, and, for any prime p, it is computed with:
|
||
|
||
---
|
||
\ n
|
||
L(p,n) = / [---] mod 2 <= log (n) .
|
||
--- p^i p
|
||
i>0
|
||
|
||
With this helper function, we are able to compute the odd part of n!
|
||
using the recursion implied by n!=[n/2]!^2*msf(n)*2^k. The recursion
|
||
stops using the small-n algorithm on some [n/2^i].
|
||
|
||
Both the above algorithms use binary splitting to compute the
|
||
product of many small factors. At first as many products as possible
|
||
are accumulated in a single register, generating a list of factors that
|
||
fit in a machine word. This list is then split into halves, and the
|
||
product is computed recursively.
|
||
|
||
Such splitting is more efficient than repeated Nx1 multiplies since
|
||
it forms big multiplies, allowing Karatsuba and higher algorithms to be
|
||
used. And even below the Karatsuba threshold a big block of work can
|
||
be more efficient for the basecase algorithm.
|
||
|
||
|
||
File: gmp.info, Node: Binomial Coefficients Algorithm, Next: Fibonacci Numbers Algorithm, Prev: Factorial Algorithm, Up: Other Algorithms
|
||
|
||
15.7.3 Binomial Coefficients
|
||
----------------------------
|
||
|
||
Binomial coefficients C(n,k) are calculated by first arranging k <= n/2
|
||
using C(n,k) = C(n,n-k) if necessary, and then evaluating the following
|
||
product simply from i=2 to i=k.
|
||
|
||
k (n-k+i)
|
||
C(n,k) = (n-k+1) * prod -------
|
||
i=2 i
|
||
|
||
It's easy to show that each denominator i will divide the product so
|
||
far, so the exact division algorithm is used (*note Exact Division::).
|
||
|
||
The numerators n-k+i and denominators i are first accumulated into
|
||
as many fit a limb, to save multi-precision operations, though for
|
||
`mpz_bin_ui' this applies only to the divisors, since n is an `mpz_t'
|
||
and n-k+i in general won't fit in a limb at all.
|
||
|
||
|
||
File: gmp.info, Node: Fibonacci Numbers Algorithm, Next: Lucas Numbers Algorithm, Prev: Binomial Coefficients Algorithm, Up: Other Algorithms
|
||
|
||
15.7.4 Fibonacci Numbers
|
||
------------------------
|
||
|
||
The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed for
|
||
calculating isolated F[n] or F[n],F[n-1] values efficiently.
|
||
|
||
For small n, a table of single limb values in `__gmp_fib_table' is
|
||
used. On a 32-bit limb this goes up to F[47], or on a 64-bit limb up
|
||
to F[93]. For convenience the table starts at F[-1].
|
||
|
||
Beyond the table, values are generated with a binary powering
|
||
algorithm, calculating a pair F[n] and F[n-1] working from high to low
|
||
across the bits of n. The formulas used are
|
||
|
||
F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
|
||
F[2k-1] = F[k]^2 + F[k-1]^2
|
||
|
||
F[2k] = F[2k+1] - F[2k-1]
|
||
|
||
At each step, k is the high b bits of n. If the next bit of n is 0
|
||
then F[2k],F[2k-1] is used, or if it's a 1 then F[2k+1],F[2k] is used,
|
||
and the process repeated until all bits of n are incorporated. Notice
|
||
these formulas require just two squares per bit of n.
|
||
|
||
It'd be possible to handle the first few n above the single limb
|
||
table with simple additions, using the defining Fibonacci recurrence
|
||
F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to
|
||
be faster for only about 10 or 20 values of n, and including a block of
|
||
code for just those doesn't seem worthwhile. If they really mattered
|
||
it'd be better to extend the data table.
|
||
|
||
Using a table avoids lots of calculations on small numbers, and
|
||
makes small n go fast. A bigger table would make more small n go fast,
|
||
it's just a question of balancing size against desired speed. For GMP
|
||
the code is kept compact, with the emphasis primarily on a good
|
||
powering algorithm.
|
||
|
||
`mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only
|
||
interested in F[n]. In this case the last step of the algorithm can
|
||
become one multiply instead of two squares. One of the following two
|
||
formulas is used, according as n is odd or even.
|
||
|
||
F[2k] = F[k]*(F[k]+2F[k-1])
|
||
|
||
F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k
|
||
|
||
F[2k+1] here is the same as above, just rearranged to be a multiply.
|
||
For interest, the 2*(-1)^k term both here and above can be applied
|
||
just to the low limb of the calculation, without a carry or borrow into
|
||
further limbs, which saves some code size. See comments with
|
||
`mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.
|
||
|
||
|
||
File: gmp.info, Node: Lucas Numbers Algorithm, Next: Random Number Algorithms, Prev: Fibonacci Numbers Algorithm, Up: Other Algorithms
|
||
|
||
15.7.5 Lucas Numbers
|
||
--------------------
|
||
|
||
`mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of
|
||
Fibonacci numbers with the following simple formulas.
|
||
|
||
L[k] = F[k] + 2*F[k-1]
|
||
L[k-1] = 2*F[k] - F[k-1]
|
||
|
||
`mpz_lucnum_ui' is only interested in L[n], and some work can be
|
||
saved. Trailing zero bits on n can be handled with a single square
|
||
each.
|
||
|
||
L[2k] = L[k]^2 - 2*(-1)^k
|
||
|
||
And the lowest 1 bit can be handled with one multiply of a pair of
|
||
Fibonacci numbers, similar to what `mpz_fib_ui' does.
|
||
|
||
L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k
|
||
|
||
|
||
File: gmp.info, Node: Random Number Algorithms, Prev: Lucas Numbers Algorithm, Up: Other Algorithms
|
||
|
||
15.7.6 Random Numbers
|
||
---------------------
|
||
|
||
For the `urandomb' functions, random numbers are generated simply by
|
||
concatenating bits produced by the generator. As long as the generator
|
||
has good randomness properties this will produce well-distributed N bit
|
||
numbers.
|
||
|
||
For the `urandomm' functions, random numbers in a range 0<=R<N are
|
||
generated by taking values R of ceil(log2(N)) bits each until one
|
||
satisfies R<N. This will normally require only one or two attempts,
|
||
but the attempts are limited in case the generator is somehow
|
||
degenerate and produces only 1 bits or similar.
|
||
|
||
The Mersenne Twister generator is by Matsumoto and Nishimura (*note
|
||
References::). It has a non-repeating period of 2^19937-1, which is a
|
||
Mersenne prime, hence the name of the generator. The state is 624
|
||
words of 32-bits each, which is iterated with one XOR and shift for each
|
||
32-bit word generated, making the algorithm very fast. Randomness
|
||
properties are also very good and this is the default algorithm used by
|
||
GMP.
|
||
|
||
Linear congruential generators are described in many text books, for
|
||
instance Knuth volume 2 (*note References::). With a modulus M and
|
||
parameters A and C, an integer state S is iterated by the formula S <-
|
||
A*S+C mod M. At each step the new state is a linear function of the
|
||
previous, mod M, hence the name of the generator.
|
||
|
||
In GMP only moduli of the form 2^N are supported, and the current
|
||
implementation is not as well optimized as it could be. Overheads are
|
||
significant when N is small, and when N is large clearly the multiply
|
||
at each step will become slow. This is not a big concern, since the
|
||
Mersenne Twister generator is better in every respect and is therefore
|
||
recommended for all normal applications.
|
||
|
||
For both generators the current state can be deduced by observing
|
||
enough output and applying some linear algebra (over GF(2) in the case
|
||
of the Mersenne Twister). This generally means raw output is
|
||
unsuitable for cryptographic applications without further hashing or
|
||
the like.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Coding, Prev: Other Algorithms, Up: Algorithms
|
||
|
||
15.8 Assembly Coding
|
||
====================
|
||
|
||
The assembly subroutines in GMP are the most significant source of
|
||
speed at small to moderate sizes. At larger sizes algorithm selection
|
||
becomes more important, but of course speedups in low level routines
|
||
will still speed up everything proportionally.
|
||
|
||
Carry handling and widening multiplies that are important for GMP
|
||
can't be easily expressed in C. GCC `asm' blocks help a lot and are
|
||
provided in `longlong.h', but hand coding low level routines invariably
|
||
offers a speedup over generic C by a factor of anything from 2 to 10.
|
||
|
||
* Menu:
|
||
|
||
* Assembly Code Organisation::
|
||
* Assembly Basics::
|
||
* Assembly Carry Propagation::
|
||
* Assembly Cache Handling::
|
||
* Assembly Functional Units::
|
||
* Assembly Floating Point::
|
||
* Assembly SIMD Instructions::
|
||
* Assembly Software Pipelining::
|
||
* Assembly Loop Unrolling::
|
||
* Assembly Writing Guide::
|
||
|
||
|
||
File: gmp.info, Node: Assembly Code Organisation, Next: Assembly Basics, Prev: Assembly Coding, Up: Assembly Coding
|
||
|
||
15.8.1 Code Organisation
|
||
------------------------
|
||
|
||
The various `mpn' subdirectories contain machine-dependent code, written
|
||
in C or assembly. The `mpn/generic' subdirectory contains default code,
|
||
used when there's no machine-specific version of a particular file.
|
||
|
||
Each `mpn' subdirectory is for an ISA family. Generally 32-bit and
|
||
64-bit variants in a family cannot share code and have separate
|
||
directories. Within a family further subdirectories may exist for CPU
|
||
variants.
|
||
|
||
In each directory a `nails' subdirectory may exist, holding code with
|
||
nails support for that CPU variant. A `NAILS_SUPPORT' directive in each
|
||
file indicates the nails values the code handles. Nails code only
|
||
exists where it's faster, or promises to be faster, than plain code.
|
||
There's no effort put into nails if they're not going to enhance a
|
||
given CPU.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Basics, Next: Assembly Carry Propagation, Prev: Assembly Code Organisation, Up: Assembly Coding
|
||
|
||
15.8.2 Assembly Basics
|
||
----------------------
|
||
|
||
`mpn_addmul_1' and `mpn_submul_1' are the most important routines for
|
||
overall GMP performance. All multiplications and divisions come down to
|
||
repeated calls to these. `mpn_add_n', `mpn_sub_n', `mpn_lshift' and
|
||
`mpn_rshift' are next most important.
|
||
|
||
On some CPUs assembly versions of the internal functions
|
||
`mpn_mul_basecase' and `mpn_sqr_basecase' give significant speedups,
|
||
mainly through avoiding function call overheads. They can also
|
||
potentially make better use of a wide superscalar processor, as can
|
||
bigger primitives like `mpn_addmul_2' or `mpn_addmul_4'.
|
||
|
||
The restrictions on overlaps between sources and destinations (*note
|
||
Low-level Functions::) are designed to facilitate a variety of
|
||
implementations. For example, knowing `mpn_add_n' won't have partly
|
||
overlapping sources and destination means reading can be done far ahead
|
||
of writing on superscalar processors, and loops can be vectorized on a
|
||
vector processor, depending on the carry handling.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Carry Propagation, Next: Assembly Cache Handling, Prev: Assembly Basics, Up: Assembly Coding
|
||
|
||
15.8.3 Carry Propagation
|
||
------------------------
|
||
|
||
The problem that presents most challenges in GMP is propagating carries
|
||
from one limb to the next. In functions like `mpn_addmul_1' and
|
||
`mpn_add_n', carries are the only dependencies between limb operations.
|
||
|
||
On processors with carry flags, a straightforward CISC style `adc' is
|
||
generally best. AMD K6 `mpn_addmul_1' however is an example of an
|
||
unusual set of circumstances where a branch works out better.
|
||
|
||
On RISC processors generally an add and compare for overflow is
|
||
used. This sort of thing can be seen in `mpn/generic/aors_n.c'. Some
|
||
carry propagation schemes require 4 instructions, meaning at least 4
|
||
cycles per limb, but other schemes may use just 1 or 2. On wide
|
||
superscalar processors performance may be completely determined by the
|
||
number of dependent instructions between carry-in and carry-out for
|
||
each limb.
|
||
|
||
On vector processors good use can be made of the fact that a carry
|
||
bit only very rarely propagates more than one limb. When adding a
|
||
single bit to a limb, there's only a carry out if that limb was
|
||
`0xFF...FF' which on random data will be only 1 in 2^mp_bits_per_limb.
|
||
`mpn/cray/add_n.c' is an example of this, it adds all limbs in
|
||
parallel, adds one set of carry bits in parallel and then only rarely
|
||
needs to fall through to a loop propagating further carries.
|
||
|
||
On the x86s, GCC (as of version 2.95.2) doesn't generate
|
||
particularly good code for the RISC style idioms that are necessary to
|
||
handle carry bits in C. Often conditional jumps are generated where
|
||
`adc' or `sbb' forms would be better. And so unfortunately almost any
|
||
loop involving carry bits needs to be coded in assembly for best
|
||
results.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Cache Handling, Next: Assembly Functional Units, Prev: Assembly Carry Propagation, Up: Assembly Coding
|
||
|
||
15.8.4 Cache Handling
|
||
---------------------
|
||
|
||
GMP aims to perform well both on operands that fit entirely in L1 cache
|
||
and those which don't.
|
||
|
||
Basic routines like `mpn_add_n' or `mpn_lshift' are often used on
|
||
large operands, so L2 and main memory performance is important for them.
|
||
`mpn_mul_1' and `mpn_addmul_1' are mostly used for multiply and square
|
||
basecases, so L1 performance matters most for them, unless assembly
|
||
versions of `mpn_mul_basecase' and `mpn_sqr_basecase' exist, in which
|
||
case the remaining uses are mostly for larger operands.
|
||
|
||
For L2 or main memory operands, memory access times will almost
|
||
certainly be more than the calculation time. The aim therefore is to
|
||
maximize memory throughput, by starting a load of the next cache line
|
||
while processing the contents of the previous one. Clearly this is
|
||
only possible if the chip has a lock-up free cache or some sort of
|
||
prefetch instruction. Most current chips have both these features.
|
||
|
||
Prefetching sources combines well with loop unrolling, since a
|
||
prefetch can be initiated once per unrolled loop (or more than once if
|
||
the loop covers more than one cache line).
|
||
|
||
On CPUs without write-allocate caches, prefetching destinations will
|
||
ensure individual stores don't go further down the cache hierarchy,
|
||
limiting bandwidth. Of course for calculations which are slow anyway,
|
||
like `mpn_divrem_1', write-throughs might be fine.
|
||
|
||
The distance ahead to prefetch will be determined by memory latency
|
||
versus throughput. The aim of course is to have data arriving
|
||
continuously, at peak throughput. Some CPUs have limits on the number
|
||
of fetches or prefetches in progress.
|
||
|
||
If a special prefetch instruction doesn't exist then a plain load
|
||
can be used, but in that case care must be taken not to attempt to read
|
||
past the end of an operand, since that might produce a segmentation
|
||
violation.
|
||
|
||
Some CPUs or systems have hardware that detects sequential memory
|
||
accesses and initiates suitable cache movements automatically, making
|
||
life easy.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Functional Units, Next: Assembly Floating Point, Prev: Assembly Cache Handling, Up: Assembly Coding
|
||
|
||
15.8.5 Functional Units
|
||
-----------------------
|
||
|
||
When choosing an approach for an assembly loop, consideration is given
|
||
to what operations can execute simultaneously and what throughput can
|
||
thereby be achieved. In some cases an algorithm can be tweaked to
|
||
accommodate available resources.
|
||
|
||
Loop control will generally require a counter and pointer updates,
|
||
costing as much as 5 instructions, plus any delays a branch introduces.
|
||
CPU addressing modes might reduce pointer updates, perhaps by allowing
|
||
just one updating pointer and others expressed as offsets from it, or
|
||
on CISC chips with all addressing done with the loop counter as a
|
||
scaled index.
|
||
|
||
The final loop control cost can be amortised by processing several
|
||
limbs in each iteration (*note Assembly Loop Unrolling::). This at
|
||
least ensures loop control isn't a big fraction the work done.
|
||
|
||
Memory throughput is always a limit. If perhaps only one load or
|
||
one store can be done per cycle then 3 cycles/limb will the top speed
|
||
for "binary" operations like `mpn_add_n', and any code achieving that
|
||
is optimal.
|
||
|
||
Integer resources can be freed up by having the loop counter in a
|
||
float register, or by pressing the float units into use for some
|
||
multiplying, perhaps doing every second limb on the float side (*note
|
||
Assembly Floating Point::).
|
||
|
||
Float resources can be freed up by doing carry propagation on the
|
||
integer side, or even by doing integer to float conversions in integers
|
||
using bit twiddling.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Floating Point, Next: Assembly SIMD Instructions, Prev: Assembly Functional Units, Up: Assembly Coding
|
||
|
||
15.8.6 Floating Point
|
||
---------------------
|
||
|
||
Floating point arithmetic is used in GMP for multiplications on CPUs
|
||
with poor integer multipliers. It's mostly useful for `mpn_mul_1',
|
||
`mpn_addmul_1' and `mpn_submul_1' on 64-bit machines, and
|
||
`mpn_mul_basecase' on both 32-bit and 64-bit machines.
|
||
|
||
With IEEE 53-bit double precision floats, integer multiplications
|
||
producing up to 53 bits will give exact results. Breaking a 64x64
|
||
multiplication into eight 16x32->48 bit pieces is convenient. With
|
||
some care though six 21x32->53 bit products can be used, if one of the
|
||
lower two 21-bit pieces also uses the sign bit.
|
||
|
||
For the `mpn_mul_1' family of functions on a 64-bit machine, the
|
||
invariant single limb is split at the start, into 3 or 4 pieces.
|
||
Inside the loop, the bignum operand is split into 32-bit pieces. Fast
|
||
conversion of these unsigned 32-bit pieces to floating point is highly
|
||
machine-dependent. In some cases, reading the data into the integer
|
||
unit, zero-extending to 64-bits, then transferring to the floating
|
||
point unit back via memory is the only option.
|
||
|
||
Converting partial products back to 64-bit limbs is usually best
|
||
done as a signed conversion. Since all values are smaller than 2^53,
|
||
signed and unsigned are the same, but most processors lack unsigned
|
||
conversions.
|
||
|
||
|
||
|
||
Here is a diagram showing 16x32 bit products for an `mpn_mul_1' or
|
||
`mpn_addmul_1' with a 64-bit limb. The single limb operand V is split
|
||
into four 16-bit parts. The multi-limb operand U is split in the loop
|
||
into two 32-bit parts.
|
||
|
||
+---+---+---+---+
|
||
|v48|v32|v16|v00| V operand
|
||
+---+---+---+---+
|
||
|
||
+-------+---+---+
|
||
x | u32 | u00 | U operand (one limb)
|
||
+---------------+
|
||
|
||
---------------------------------
|
||
|
||
+-----------+
|
||
| u00 x v00 | p00 48-bit products
|
||
+-----------+
|
||
+-----------+
|
||
| u00 x v16 | p16
|
||
+-----------+
|
||
+-----------+
|
||
| u00 x v32 | p32
|
||
+-----------+
|
||
+-----------+
|
||
| u00 x v48 | p48
|
||
+-----------+
|
||
+-----------+
|
||
| u32 x v00 | r32
|
||
+-----------+
|
||
+-----------+
|
||
| u32 x v16 | r48
|
||
+-----------+
|
||
+-----------+
|
||
| u32 x v32 | r64
|
||
+-----------+
|
||
+-----------+
|
||
| u32 x v48 | r80
|
||
+-----------+
|
||
|
||
p32 and r32 can be summed using floating-point addition, and
|
||
likewise p48 and r48. p00 and p16 can be summed with r64 and r80 from
|
||
the previous iteration.
|
||
|
||
For each loop then, four 49-bit quantities are transferred to the
|
||
integer unit, aligned as follows,
|
||
|
||
|-----64bits----|-----64bits----|
|
||
+------------+
|
||
| p00 + r64' | i00
|
||
+------------+
|
||
+------------+
|
||
| p16 + r80' | i16
|
||
+------------+
|
||
+------------+
|
||
| p32 + r32 | i32
|
||
+------------+
|
||
+------------+
|
||
| p48 + r48 | i48
|
||
+------------+
|
||
|
||
The challenge then is to sum these efficiently and add in a carry
|
||
limb, generating a low 64-bit result limb and a high 33-bit carry limb
|
||
(i48 extends 33 bits into the high half).
|
||
|
||
|
||
File: gmp.info, Node: Assembly SIMD Instructions, Next: Assembly Software Pipelining, Prev: Assembly Floating Point, Up: Assembly Coding
|
||
|
||
15.8.7 SIMD Instructions
|
||
------------------------
|
||
|
||
The single-instruction multiple-data support in current microprocessors
|
||
is aimed at signal processing algorithms where each data point can be
|
||
treated more or less independently. There's generally not much support
|
||
for propagating the sort of carries that arise in GMP.
|
||
|
||
SIMD multiplications of say four 16x16 bit multiplies only do as much
|
||
work as one 32x32 from GMP's point of view, and need some shifts and
|
||
adds besides. But of course if say the SIMD form is fully pipelined
|
||
and uses less instruction decoding then it may still be worthwhile.
|
||
|
||
On the x86 chips, MMX has so far found a use in `mpn_rshift' and
|
||
`mpn_lshift', and is used in a special case for 16-bit multipliers in
|
||
the P55 `mpn_mul_1'. SSE2 is used for Pentium 4 `mpn_mul_1',
|
||
`mpn_addmul_1', and `mpn_submul_1'.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Software Pipelining, Next: Assembly Loop Unrolling, Prev: Assembly SIMD Instructions, Up: Assembly Coding
|
||
|
||
15.8.8 Software Pipelining
|
||
--------------------------
|
||
|
||
Software pipelining consists of scheduling instructions around the
|
||
branch point in a loop. For example a loop might issue a load not for
|
||
use in the present iteration but the next, thereby allowing extra
|
||
cycles for the data to arrive from memory.
|
||
|
||
Naturally this is wanted only when doing things like loads or
|
||
multiplies that take several cycles to complete, and only where a CPU
|
||
has multiple functional units so that other work can be done in the
|
||
meantime.
|
||
|
||
A pipeline with several stages will have a data value in progress at
|
||
each stage and each loop iteration moves them along one stage. This is
|
||
like juggling.
|
||
|
||
If the latency of some instruction is greater than the loop time
|
||
then it will be necessary to unroll, so one register has a result ready
|
||
to use while another (or multiple others) are still in progress.
|
||
(*note Assembly Loop Unrolling::).
|
||
|
||
|
||
File: gmp.info, Node: Assembly Loop Unrolling, Next: Assembly Writing Guide, Prev: Assembly Software Pipelining, Up: Assembly Coding
|
||
|
||
15.8.9 Loop Unrolling
|
||
---------------------
|
||
|
||
Loop unrolling consists of replicating code so that several limbs are
|
||
processed in each loop. At a minimum this reduces loop overheads by a
|
||
corresponding factor, but it can also allow better register usage, for
|
||
example alternately using one register combination and then another.
|
||
Judicious use of `m4' macros can help avoid lots of duplication in the
|
||
source code.
|
||
|
||
Any amount of unrolling can be handled with a loop counter that's
|
||
decremented by N each time, stopping when the remaining count is less
|
||
than the further N the loop will process. Or by subtracting N at the
|
||
start, the termination condition becomes when the counter C is less
|
||
than 0 (and the count of remaining limbs is C+N).
|
||
|
||
Alternately for a power of 2 unroll the loop count and remainder can
|
||
be established with a shift and mask. This is convenient if also
|
||
making a computed jump into the middle of a large loop.
|
||
|
||
The limbs not a multiple of the unrolling can be handled in various
|
||
ways, for example
|
||
|
||
* A simple loop at the end (or the start) to process the excess.
|
||
Care will be wanted that it isn't too much slower than the
|
||
unrolled part.
|
||
|
||
* A set of binary tests, for example after an 8-limb unrolling, test
|
||
for 4 more limbs to process, then a further 2 more or not, and
|
||
finally 1 more or not. This will probably take more code space
|
||
than a simple loop.
|
||
|
||
* A `switch' statement, providing separate code for each possible
|
||
excess, for example an 8-limb unrolling would have separate code
|
||
for 0 remaining, 1 remaining, etc, up to 7 remaining. This might
|
||
take a lot of code, but may be the best way to optimize all cases
|
||
in combination with a deep pipelined loop.
|
||
|
||
* A computed jump into the middle of the loop, thus making the first
|
||
iteration handle the excess. This should make times smoothly
|
||
increase with size, which is attractive, but setups for the jump
|
||
and adjustments for pointers can be tricky and could become quite
|
||
difficult in combination with deep pipelining.
|
||
|
||
|
||
File: gmp.info, Node: Assembly Writing Guide, Prev: Assembly Loop Unrolling, Up: Assembly Coding
|
||
|
||
15.8.10 Writing Guide
|
||
---------------------
|
||
|
||
This is a guide to writing software pipelined loops for processing limb
|
||
vectors in assembly.
|
||
|
||
First determine the algorithm and which instructions are needed.
|
||
Code it without unrolling or scheduling, to make sure it works. On a
|
||
3-operand CPU try to write each new value to a new register, this will
|
||
greatly simplify later steps.
|
||
|
||
Then note for each instruction the functional unit and/or issue port
|
||
requirements. If an instruction can use either of two units, like U0
|
||
or U1 then make a category "U0/U1". Count the total using each unit
|
||
(or combined unit), and count all instructions.
|
||
|
||
Figure out from those counts the best possible loop time. The goal
|
||
will be to find a perfect schedule where instruction latencies are
|
||
completely hidden. The total instruction count might be the limiting
|
||
factor, or perhaps a particular functional unit. It might be possible
|
||
to tweak the instructions to help the limiting factor.
|
||
|
||
Suppose the loop time is N, then make N issue buckets, with the
|
||
final loop branch at the end of the last. Now fill the buckets with
|
||
dummy instructions using the functional units desired. Run this to
|
||
make sure the intended speed is reached.
|
||
|
||
Now replace the dummy instructions with the real instructions from
|
||
the slow but correct loop you started with. The first will typically
|
||
be a load instruction. Then the instruction using that value is placed
|
||
in a bucket an appropriate distance down. Run the loop again, to check
|
||
it still runs at target speed.
|
||
|
||
Keep placing instructions, frequently measuring the loop. After a
|
||
few you will need to wrap around from the last bucket back to the top
|
||
of the loop. If you used the new-register for new-value strategy above
|
||
then there will be no register conflicts. If not then take care not to
|
||
clobber something already in use. Changing registers at this time is
|
||
very error prone.
|
||
|
||
The loop will overlap two or more of the original loop iterations,
|
||
and the computation of one vector element result will be started in one
|
||
iteration of the new loop, and completed one or several iterations
|
||
later.
|
||
|
||
The final step is to create feed-in and wind-down code for the loop.
|
||
A good way to do this is to make a copy (or copies) of the loop at the
|
||
start and delete those instructions which don't have valid antecedents,
|
||
and at the end replicate and delete those whose results are unwanted
|
||
(including any further loads).
|
||
|
||
The loop will have a minimum number of limbs loaded and processed,
|
||
so the feed-in code must test if the request size is smaller and skip
|
||
either to a suitable part of the wind-down or to special code for small
|
||
sizes.
|
||
|
||
|
||
File: gmp.info, Node: Internals, Next: Contributors, Prev: Algorithms, Up: Top
|
||
|
||
16 Internals
|
||
************
|
||
|
||
*This chapter is provided only for informational purposes and the
|
||
various internals described here may change in future GMP releases.
|
||
Applications expecting to be compatible with future releases should use
|
||
only the documented interfaces described in previous chapters.*
|
||
|
||
* Menu:
|
||
|
||
* Integer Internals::
|
||
* Rational Internals::
|
||
* Float Internals::
|
||
* Raw Output Internals::
|
||
* C++ Interface Internals::
|
||
|
||
|
||
File: gmp.info, Node: Integer Internals, Next: Rational Internals, Prev: Internals, Up: Internals
|
||
|
||
16.1 Integer Internals
|
||
======================
|
||
|
||
`mpz_t' variables represent integers using sign and magnitude, in space
|
||
dynamically allocated and reallocated. The fields are as follows.
|
||
|
||
`_mp_size'
|
||
The number of limbs, or the negative of that when representing a
|
||
negative integer. Zero is represented by `_mp_size' set to zero,
|
||
in which case the `_mp_d' data is unused.
|
||
|
||
`_mp_d'
|
||
A pointer to an array of limbs which is the magnitude. These are
|
||
stored "little endian" as per the `mpn' functions, so `_mp_d[0]'
|
||
is the least significant limb and `_mp_d[ABS(_mp_size)-1]' is the
|
||
most significant. Whenever `_mp_size' is non-zero, the most
|
||
significant limb is non-zero.
|
||
|
||
Currently there's always at least one limb allocated, so for
|
||
instance `mpz_set_ui' never needs to reallocate, and `mpz_get_ui'
|
||
can fetch `_mp_d[0]' unconditionally (though its value is then
|
||
only wanted if `_mp_size' is non-zero).
|
||
|
||
`_mp_alloc'
|
||
`_mp_alloc' is the number of limbs currently allocated at `_mp_d',
|
||
and naturally `_mp_alloc >= ABS(_mp_size)'. When an `mpz' routine
|
||
is about to (or might be about to) increase `_mp_size', it checks
|
||
`_mp_alloc' to see whether there's enough space, and reallocates
|
||
if not. `MPZ_REALLOC' is generally used for this.
|
||
|
||
The various bitwise logical functions like `mpz_and' behave as if
|
||
negative values were twos complement. But sign and magnitude is always
|
||
used internally, and necessary adjustments are made during the
|
||
calculations. Sometimes this isn't pretty, but sign and magnitude are
|
||
best for other routines.
|
||
|
||
Some internal temporary variables are setup with `MPZ_TMP_INIT' and
|
||
these have `_mp_d' space obtained from `TMP_ALLOC' rather than the
|
||
memory allocation functions. Care is taken to ensure that these are
|
||
big enough that no reallocation is necessary (since it would have
|
||
unpredictable consequences).
|
||
|
||
`_mp_size' and `_mp_alloc' are `int', although `mp_size_t' is
|
||
usually a `long'. This is done to make the fields just 32 bits on some
|
||
64 bits systems, thereby saving a few bytes of data space but still
|
||
providing plenty of range.
|
||
|
||
|
||
File: gmp.info, Node: Rational Internals, Next: Float Internals, Prev: Integer Internals, Up: Internals
|
||
|
||
16.2 Rational Internals
|
||
=======================
|
||
|
||
`mpq_t' variables represent rationals using an `mpz_t' numerator and
|
||
denominator (*note Integer Internals::).
|
||
|
||
The canonical form adopted is denominator positive (and non-zero),
|
||
no common factors between numerator and denominator, and zero uniquely
|
||
represented as 0/1.
|
||
|
||
It's believed that casting out common factors at each stage of a
|
||
calculation is best in general. A GCD is an O(N^2) operation so it's
|
||
better to do a few small ones immediately than to delay and have to do
|
||
a big one later. Knowing the numerator and denominator have no common
|
||
factors can be used for example in `mpq_mul' to make only two cross
|
||
GCDs necessary, not four.
|
||
|
||
This general approach to common factors is badly sub-optimal in the
|
||
presence of simple factorizations or little prospect for cancellation,
|
||
but GMP has no way to know when this will occur. As per *Note
|
||
Efficiency::, that's left to applications. The `mpq_t' framework might
|
||
still suit, with `mpq_numref' and `mpq_denref' for direct access to the
|
||
numerator and denominator, or of course `mpz_t' variables can be used
|
||
directly.
|
||
|
||
|
||
File: gmp.info, Node: Float Internals, Next: Raw Output Internals, Prev: Rational Internals, Up: Internals
|
||
|
||
16.3 Float Internals
|
||
====================
|
||
|
||
Efficient calculation is the primary aim of GMP floats and the use of
|
||
whole limbs and simple rounding facilitates this.
|
||
|
||
`mpf_t' floats have a variable precision mantissa and a single
|
||
machine word signed exponent. The mantissa is represented using sign
|
||
and magnitude.
|
||
|
||
most least
|
||
significant significant
|
||
limb limb
|
||
|
||
_mp_d
|
||
|---- _mp_exp ---> |
|
||
_____ _____ _____ _____ _____
|
||
|_____|_____|_____|_____|_____|
|
||
. <------------ radix point
|
||
|
||
<-------- _mp_size --------->
|
||
|
||
The fields are as follows.
|
||
|
||
`_mp_size'
|
||
The number of limbs currently in use, or the negative of that when
|
||
representing a negative value. Zero is represented by `_mp_size'
|
||
and `_mp_exp' both set to zero, and in that case the `_mp_d' data
|
||
is unused. (In the future `_mp_exp' might be undefined when
|
||
representing zero.)
|
||
|
||
`_mp_prec'
|
||
The precision of the mantissa, in limbs. In any calculation the
|
||
aim is to produce `_mp_prec' limbs of result (the most significant
|
||
being non-zero).
|
||
|
||
`_mp_d'
|
||
A pointer to the array of limbs which is the absolute value of the
|
||
mantissa. These are stored "little endian" as per the `mpn'
|
||
functions, so `_mp_d[0]' is the least significant limb and
|
||
`_mp_d[ABS(_mp_size)-1]' the most significant.
|
||
|
||
The most significant limb is always non-zero, but there are no
|
||
other restrictions on its value, in particular the highest 1 bit
|
||
can be anywhere within the limb.
|
||
|
||
`_mp_prec+1' limbs are allocated to `_mp_d', the extra limb being
|
||
for convenience (see below). There are no reallocations during a
|
||
calculation, only in a change of precision with `mpf_set_prec'.
|
||
|
||
`_mp_exp'
|
||
The exponent, in limbs, determining the location of the implied
|
||
radix point. Zero means the radix point is just above the most
|
||
significant limb. Positive values mean a radix point offset
|
||
towards the lower limbs and hence a value >= 1, as for example in
|
||
the diagram above. Negative exponents mean a radix point further
|
||
above the highest limb.
|
||
|
||
Naturally the exponent can be any value, it doesn't have to fall
|
||
within the limbs as the diagram shows, it can be a long way above
|
||
or a long way below. Limbs other than those included in the
|
||
`{_mp_d,_mp_size}' data are treated as zero.
|
||
|
||
The `_mp_size' and `_mp_prec' fields are `int', although the
|
||
`mp_size_t' type is usually a `long'. The `_mp_exp' field is usually
|
||
`long'. This is done to make some fields just 32 bits on some 64 bits
|
||
systems, thereby saving a few bytes of data space but still providing
|
||
plenty of precision and a very large range.
|
||
|
||
|
||
The following various points should be noted.
|
||
|
||
Low Zeros
|
||
The least significant limbs `_mp_d[0]' etc can be zero, though
|
||
such low zeros can always be ignored. Routines likely to produce
|
||
low zeros check and avoid them to save time in subsequent
|
||
calculations, but for most routines they're quite unlikely and
|
||
aren't checked.
|
||
|
||
Mantissa Size Range
|
||
The `_mp_size' count of limbs in use can be less than `_mp_prec' if
|
||
the value can be represented in less. This means low precision
|
||
values or small integers stored in a high precision `mpf_t' can
|
||
still be operated on efficiently.
|
||
|
||
`_mp_size' can also be greater than `_mp_prec'. Firstly a value is
|
||
allowed to use all of the `_mp_prec+1' limbs available at `_mp_d',
|
||
and secondly when `mpf_set_prec_raw' lowers `_mp_prec' it leaves
|
||
`_mp_size' unchanged and so the size can be arbitrarily bigger than
|
||
`_mp_prec'.
|
||
|
||
Rounding
|
||
All rounding is done on limb boundaries. Calculating `_mp_prec'
|
||
limbs with the high non-zero will ensure the application requested
|
||
minimum precision is obtained.
|
||
|
||
The use of simple "trunc" rounding towards zero is efficient,
|
||
since there's no need to examine extra limbs and increment or
|
||
decrement.
|
||
|
||
Bit Shifts
|
||
Since the exponent is in limbs, there are no bit shifts in basic
|
||
operations like `mpf_add' and `mpf_mul'. When differing exponents
|
||
are encountered all that's needed is to adjust pointers to line up
|
||
the relevant limbs.
|
||
|
||
Of course `mpf_mul_2exp' and `mpf_div_2exp' will require bit
|
||
shifts, but the choice is between an exponent in limbs which
|
||
requires shifts there, or one in bits which requires them almost
|
||
everywhere else.
|
||
|
||
Use of `_mp_prec+1' Limbs
|
||
The extra limb on `_mp_d' (`_mp_prec+1' rather than just
|
||
`_mp_prec') helps when an `mpf' routine might get a carry from its
|
||
operation. `mpf_add' for instance will do an `mpn_add' of
|
||
`_mp_prec' limbs. If there's no carry then that's the result, but
|
||
if there is a carry then it's stored in the extra limb of space and
|
||
`_mp_size' becomes `_mp_prec+1'.
|
||
|
||
Whenever `_mp_prec+1' limbs are held in a variable, the low limb
|
||
is not needed for the intended precision, only the `_mp_prec' high
|
||
limbs. But zeroing it out or moving the rest down is unnecessary.
|
||
Subsequent routines reading the value will simply take the high
|
||
limbs they need, and this will be `_mp_prec' if their target has
|
||
that same precision. This is no more than a pointer adjustment,
|
||
and must be checked anyway since the destination precision can be
|
||
different from the sources.
|
||
|
||
Copy functions like `mpf_set' will retain a full `_mp_prec+1' limbs
|
||
if available. This ensures that a variable which has `_mp_size'
|
||
equal to `_mp_prec+1' will get its full exact value copied.
|
||
Strictly speaking this is unnecessary since only `_mp_prec' limbs
|
||
are needed for the application's requested precision, but it's
|
||
considered that an `mpf_set' from one variable into another of the
|
||
same precision ought to produce an exact copy.
|
||
|
||
Application Precisions
|
||
`__GMPF_BITS_TO_PREC' converts an application requested precision
|
||
to an `_mp_prec'. The value in bits is rounded up to a whole limb
|
||
then an extra limb is added since the most significant limb of
|
||
`_mp_d' is only non-zero and therefore might contain only one bit.
|
||
|
||
`__GMPF_PREC_TO_BITS' does the reverse conversion, and removes the
|
||
extra limb from `_mp_prec' before converting to bits. The net
|
||
effect of reading back with `mpf_get_prec' is simply the precision
|
||
rounded up to a multiple of `mp_bits_per_limb'.
|
||
|
||
Note that the extra limb added here for the high only being
|
||
non-zero is in addition to the extra limb allocated to `_mp_d'.
|
||
For example with a 32-bit limb, an application request for 250
|
||
bits will be rounded up to 8 limbs, then an extra added for the
|
||
high being only non-zero, giving an `_mp_prec' of 9. `_mp_d' then
|
||
gets 10 limbs allocated. Reading back with `mpf_get_prec' will
|
||
take `_mp_prec' subtract 1 limb and multiply by 32, giving 256
|
||
bits.
|
||
|
||
Strictly speaking, the fact the high limb has at least one bit
|
||
means that a float with, say, 3 limbs of 32-bits each will be
|
||
holding at least 65 bits, but for the purposes of `mpf_t' it's
|
||
considered simply to be 64 bits, a nice multiple of the limb size.
|
||
|
||
|
||
File: gmp.info, Node: Raw Output Internals, Next: C++ Interface Internals, Prev: Float Internals, Up: Internals
|
||
|
||
16.4 Raw Output Internals
|
||
=========================
|
||
|
||
`mpz_out_raw' uses the following format.
|
||
|
||
+------+------------------------+
|
||
| size | data bytes |
|
||
+------+------------------------+
|
||
|
||
The size is 4 bytes written most significant byte first, being the
|
||
number of subsequent data bytes, or the twos complement negative of
|
||
that when a negative integer is represented. The data bytes are the
|
||
absolute value of the integer, written most significant byte first.
|
||
|
||
The most significant data byte is always non-zero, so the output is
|
||
the same on all systems, irrespective of limb size.
|
||
|
||
In GMP 1, leading zero bytes were written to pad the data bytes to a
|
||
multiple of the limb size. `mpz_inp_raw' will still accept this, for
|
||
compatibility.
|
||
|
||
The use of "big endian" for both the size and data fields is
|
||
deliberate, it makes the data easy to read in a hex dump of a file.
|
||
Unfortunately it also means that the limb data must be reversed when
|
||
reading or writing, so neither a big endian nor little endian system
|
||
can just read and write `_mp_d'.
|
||
|
||
|
||
File: gmp.info, Node: C++ Interface Internals, Prev: Raw Output Internals, Up: Internals
|
||
|
||
16.5 C++ Interface Internals
|
||
============================
|
||
|
||
A system of expression templates is used to ensure something like
|
||
`a=b+c' turns into a simple call to `mpz_add' etc. For `mpf_class' the
|
||
scheme also ensures the precision of the final destination is used for
|
||
any temporaries within a statement like `f=w*x+y*z'. These are
|
||
important features which a naive implementation cannot provide.
|
||
|
||
A simplified description of the scheme follows. The true scheme is
|
||
complicated by the fact that expressions have different return types.
|
||
For detailed information, refer to the source code.
|
||
|
||
To perform an operation, say, addition, we first define a "function
|
||
object" evaluating it,
|
||
|
||
struct __gmp_binary_plus
|
||
{
|
||
static void eval(mpf_t f, const mpf_t g, const mpf_t h)
|
||
{
|
||
mpf_add(f, g, h);
|
||
}
|
||
};
|
||
|
||
And an "additive expression" object,
|
||
|
||
__gmp_expr<__gmp_binary_expr<mpf_class, mpf_class, __gmp_binary_plus> >
|
||
operator+(const mpf_class &f, const mpf_class &g)
|
||
{
|
||
return __gmp_expr
|
||
<__gmp_binary_expr<mpf_class, mpf_class, __gmp_binary_plus> >(f, g);
|
||
}
|
||
|
||
The seemingly redundant `__gmp_expr<__gmp_binary_expr<...>>' is used
|
||
to encapsulate any possible kind of expression into a single template
|
||
type. In fact even `mpf_class' etc are `typedef' specializations of
|
||
`__gmp_expr'.
|
||
|
||
Next we define assignment of `__gmp_expr' to `mpf_class'.
|
||
|
||
template <class T>
|
||
mpf_class & mpf_class::operator=(const __gmp_expr<T> &expr)
|
||
{
|
||
expr.eval(this->get_mpf_t(), this->precision());
|
||
return *this;
|
||
}
|
||
|
||
template <class Op>
|
||
void __gmp_expr<__gmp_binary_expr<mpf_class, mpf_class, Op> >::eval
|
||
(mpf_t f, mp_bitcnt_t precision)
|
||
{
|
||
Op::eval(f, expr.val1.get_mpf_t(), expr.val2.get_mpf_t());
|
||
}
|
||
|
||
where `expr.val1' and `expr.val2' are references to the expression's
|
||
operands (here `expr' is the `__gmp_binary_expr' stored within the
|
||
`__gmp_expr').
|
||
|
||
This way, the expression is actually evaluated only at the time of
|
||
assignment, when the required precision (that of `f') is known.
|
||
Furthermore the target `mpf_t' is now available, thus we can call
|
||
`mpf_add' directly with `f' as the output argument.
|
||
|
||
Compound expressions are handled by defining operators taking
|
||
subexpressions as their arguments, like this:
|
||
|
||
template <class T, class U>
|
||
__gmp_expr
|
||
<__gmp_binary_expr<__gmp_expr<T>, __gmp_expr<U>, __gmp_binary_plus> >
|
||
operator+(const __gmp_expr<T> &expr1, const __gmp_expr<U> &expr2)
|
||
{
|
||
return __gmp_expr
|
||
<__gmp_binary_expr<__gmp_expr<T>, __gmp_expr<U>, __gmp_binary_plus> >
|
||
(expr1, expr2);
|
||
}
|
||
|
||
And the corresponding specializations of `__gmp_expr::eval':
|
||
|
||
template <class T, class U, class Op>
|
||
void __gmp_expr
|
||
<__gmp_binary_expr<__gmp_expr<T>, __gmp_expr<U>, Op> >::eval
|
||
(mpf_t f, mp_bitcnt_t precision)
|
||
{
|
||
// declare two temporaries
|
||
mpf_class temp1(expr.val1, precision), temp2(expr.val2, precision);
|
||
Op::eval(f, temp1.get_mpf_t(), temp2.get_mpf_t());
|
||
}
|
||
|
||
The expression is thus recursively evaluated to any level of
|
||
complexity and all subexpressions are evaluated to the precision of `f'.
|
||
|
||
|
||
File: gmp.info, Node: Contributors, Next: References, Prev: Internals, Up: Top
|
||
|
||
Appendix A Contributors
|
||
***********************
|
||
|
||
Torbjo"rn Granlund wrote the original GMP library and is still the main
|
||
developer. Code not explicitly attributed to others, was contributed by
|
||
Torbjo"rn. Several other individuals and organizations have contributed
|
||
GMP. Here is a list in chronological order on first contribution:
|
||
|
||
Gunnar Sjo"din and Hans Riesel helped with mathematical problems in
|
||
early versions of the library.
|
||
|
||
Richard Stallman helped with the interface design and revised the
|
||
first version of this manual.
|
||
|
||
Brian Beuning and Doug Lea helped with testing of early versions of
|
||
the library and made creative suggestions.
|
||
|
||
John Amanatides of York University in Canada contributed the function
|
||
`mpz_probab_prime_p'.
|
||
|
||
Paul Zimmermann wrote the REDC-based mpz_powm code, the
|
||
Scho"nhage-Strassen FFT multiply code, and the Karatsuba square root
|
||
code. He also improved the Toom3 code for GMP 4.2. Paul sparked the
|
||
development of GMP 2, with his comparisons between bignum packages.
|
||
The ECMNET project Paul is organizing was a driving force behind many
|
||
of the optimizations in GMP 3. Paul also wrote the new GMP 4.3 nth
|
||
root code (with Torbjo"rn).
|
||
|
||
Ken Weber (Kent State University, Universidade Federal do Rio Grande
|
||
do Sul) contributed now defunct versions of `mpz_gcd', `mpz_divexact',
|
||
`mpn_gcd', and `mpn_bdivmod', partially supported by CNPq (Brazil)
|
||
grant 301314194-2.
|
||
|
||
Per Bothner of Cygnus Support helped to set up GMP to use Cygnus'
|
||
configure. He has also made valuable suggestions and tested numerous
|
||
intermediary releases.
|
||
|
||
Joachim Hollman was involved in the design of the `mpf' interface,
|
||
and in the `mpz' design revisions for version 2.
|
||
|
||
Bennet Yee contributed the initial versions of `mpz_jacobi' and
|
||
`mpz_legendre'.
|
||
|
||
Andreas Schwab contributed the files `mpn/m68k/lshift.S' and
|
||
`mpn/m68k/rshift.S' (now in `.asm' form).
|
||
|
||
Robert Harley of Inria, France and David Seal of ARM, England,
|
||
suggested clever improvements for population count. Robert also wrote
|
||
highly optimized Karatsuba and 3-way Toom multiplication functions for
|
||
GMP 3, and contributed the ARM assembly code.
|
||
|
||
Torsten Ekedahl of the Mathematical department of Stockholm
|
||
University provided significant inspiration during several phases of
|
||
the GMP development. His mathematical expertise helped improve several
|
||
algorithms.
|
||
|
||
Linus Nordberg wrote the new configure system based on autoconf and
|
||
implemented the new random functions.
|
||
|
||
Kevin Ryde worked on a large number of things: optimized x86 code,
|
||
m4 asm macros, parameter tuning, speed measuring, the configure system,
|
||
function inlining, divisibility tests, bit scanning, Jacobi symbols,
|
||
Fibonacci and Lucas number functions, printf and scanf functions, perl
|
||
interface, demo expression parser, the algorithms chapter in the
|
||
manual, `gmpasm-mode.el', and various miscellaneous improvements
|
||
elsewhere.
|
||
|
||
Kent Boortz made the Mac OS 9 port.
|
||
|
||
Steve Root helped write the optimized alpha 21264 assembly code.
|
||
|
||
Gerardo Ballabio wrote the `gmpxx.h' C++ class interface and the C++
|
||
`istream' input routines.
|
||
|
||
Jason Moxham rewrote `mpz_fac_ui'.
|
||
|
||
Pedro Gimeno implemented the Mersenne Twister and made other random
|
||
number improvements.
|
||
|
||
Niels Mo"ller wrote the sub-quadratic GCD, extended GCD and jacobi
|
||
code, the quadratic Hensel division code, and (with Torbjo"rn) the new
|
||
divide and conquer division code for GMP 4.3. Niels also helped
|
||
implement the new Toom multiply code for GMP 4.3 and implemented helper
|
||
functions to simplify Toom evaluations for GMP 5.0. He wrote the
|
||
original version of mpn_mulmod_bnm1, and he is the main author of the
|
||
mini-gmp package used for gmp bootstrapping.
|
||
|
||
Alberto Zanoni and Marco Bodrato suggested the unbalanced multiply
|
||
strategy, and found the optimal strategies for evaluation and
|
||
interpolation in Toom multiplication.
|
||
|
||
Marco Bodrato helped implement the new Toom multiply code for GMP
|
||
4.3 and implemented most of the new Toom multiply and squaring code for
|
||
5.0. He is the main author of the current mpn_mulmod_bnm1,
|
||
mpn_mullo_n, and mpn_sqrlo. Marco also wrote the functions mpn_invert
|
||
and mpn_invertappr, and improved the speed of integer root extraction.
|
||
He is the author of the current combinatorial functions: binomial,
|
||
factorial, multifactorial, primorial.
|
||
|
||
David Harvey suggested the internal function `mpn_bdiv_dbm1',
|
||
implementing division relevant to Toom multiplication. He also worked
|
||
on fast assembly sequences, in particular on a fast AMD64
|
||
`mpn_mul_basecase'. He wrote the internal middle product functions
|
||
`mpn_mulmid_basecase', `mpn_toom42_mulmid', `mpn_mulmid_n' and related
|
||
helper routines.
|
||
|
||
Martin Boij wrote `mpn_perfect_power_p'.
|
||
|
||
Marc Glisse improved `gmpxx.h': use fewer temporaries (faster),
|
||
specializations of `numeric_limits' and `common_type', C++11 features
|
||
(move constructors, explicit bool conversion, UDL), make the conversion
|
||
from `mpq_class' to `mpz_class' explicit, optimize operations where one
|
||
argument is a small compile-time constant, replace some heap
|
||
allocations by stack allocations. He also fixed the eofbit handling of
|
||
C++ streams, and removed one division from `mpq/aors.c'.
|
||
|
||
David S Miller wrote assembly code for SPARC T3 and T4.
|
||
|
||
Mark Sofroniou cleaned up the types of mul_fft.c, letting it work
|
||
for huge operands.
|
||
|
||
Ulrich Weigand ported GMP to the powerpc64le ABI.
|
||
|
||
(This list is chronological, not ordered after significance. If you
|
||
have contributed to GMP but are not listed above, please tell
|
||
<gmp-devel@gmplib.org> about the omission!)
|
||
|
||
The development of floating point functions of GNU MP 2, were
|
||
supported in part by the ESPRIT-BRA (Basic Research Activities) 6846
|
||
project POSSO (POlynomial System SOlving).
|
||
|
||
The development of GMP 2, 3, and 4.0 was supported in part by the
|
||
IDA Center for Computing Sciences.
|
||
|
||
The development of GMP 4.3, 5.0, and 5.1 was supported in part by
|
||
the Swedish Foundation for Strategic Research.
|
||
|
||
Thanks go to Hans Thorsen for donating an SGI system for the GMP
|
||
test system environment.
|
||
|
||
|
||
File: gmp.info, Node: References, Next: GNU Free Documentation License, Prev: Contributors, Up: Top
|
||
|
||
Appendix B References
|
||
*********************
|
||
|
||
B.1 Books
|
||
=========
|
||
|
||
* Jonathan M. Borwein and Peter B. Borwein, "Pi and the AGM: A Study
|
||
in Analytic Number Theory and Computational Complexity", Wiley,
|
||
1998.
|
||
|
||
* Richard Crandall and Carl Pomerance, "Prime Numbers: A
|
||
Computational Perspective", 2nd edition, Springer-Verlag, 2005.
|
||
`http://www.math.dartmouth.edu/~carlp/'
|
||
|
||
* Henri Cohen, "A Course in Computational Algebraic Number Theory",
|
||
Graduate Texts in Mathematics number 138, Springer-Verlag, 1993.
|
||
`http://www.math.u-bordeaux.fr/~cohen/'
|
||
|
||
* Donald E. Knuth, "The Art of Computer Programming", volume 2,
|
||
"Seminumerical Algorithms", 3rd edition, Addison-Wesley, 1998.
|
||
`http://www-cs-faculty.stanford.edu/~knuth/taocp.html'
|
||
|
||
* John D. Lipson, "Elements of Algebra and Algebraic Computing", The
|
||
Benjamin Cummings Publishing Company Inc, 1981.
|
||
|
||
* Alfred J. Menezes, Paul C. van Oorschot and Scott A. Vanstone,
|
||
"Handbook of Applied Cryptography",
|
||
`http://www.cacr.math.uwaterloo.ca/hac/'
|
||
|
||
* Richard M. Stallman and the GCC Developer Community, "Using the
|
||
GNU Compiler Collection", Free Software Foundation, 2008,
|
||
available online `https://gcc.gnu.org/onlinedocs/', and in the GCC
|
||
package `https://ftp.gnu.org/gnu/gcc/'
|
||
|
||
B.2 Papers
|
||
==========
|
||
|
||
* Yves Bertot, Nicolas Magaud and Paul Zimmermann, "A Proof of GMP
|
||
Square Root", Journal of Automated Reasoning, volume 29, 2002, pp.
|
||
225-252. Also available online as INRIA Research Report 4475,
|
||
June 2002, `http://hal.inria.fr/docs/00/07/21/13/PDF/RR-4475.pdf'
|
||
|
||
* Christoph Burnikel and Joachim Ziegler, "Fast Recursive Division",
|
||
Max-Planck-Institut fuer Informatik Research Report MPI-I-98-1-022,
|
||
`http://data.mpi-sb.mpg.de/internet/reports.nsf/NumberView/1998-1-022'
|
||
|
||
* Torbjo"rn Granlund and Peter L. Montgomery, "Division by Invariant
|
||
Integers using Multiplication", in Proceedings of the SIGPLAN
|
||
PLDI'94 Conference, June 1994. Also available
|
||
`https://gmplib.org/~tege/divcnst-pldi94.pdf'.
|
||
|
||
* Niels Mo"ller and Torbjo"rn Granlund, "Improved division by
|
||
invariant integers", IEEE Transactions on Computers, 11 June 2010.
|
||
`https://gmplib.org/~tege/division-paper.pdf'
|
||
|
||
* Torbjo"rn Granlund and Niels Mo"ller, "Division of integers large
|
||
and small", to appear.
|
||
|
||
* Tudor Jebelean, "An algorithm for exact division", Journal of
|
||
Symbolic Computation, volume 15, 1993, pp. 169-180. Research
|
||
report version available
|
||
`ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1992/92-35.ps.gz'
|
||
|
||
* Tudor Jebelean, "Exact Division with Karatsuba Complexity -
|
||
Extended Abstract", RISC-Linz technical report 96-31,
|
||
`ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1996/96-31.ps.gz'
|
||
|
||
* Tudor Jebelean, "Practical Integer Division with Karatsuba
|
||
Complexity", ISSAC 97, pp. 339-341. Technical report available
|
||
`ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1996/96-29.ps.gz'
|
||
|
||
* Tudor Jebelean, "A Generalization of the Binary GCD Algorithm",
|
||
ISSAC 93, pp. 111-116. Technical report version available
|
||
`ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1993/93-01.ps.gz'
|
||
|
||
* Tudor Jebelean, "A Double-Digit Lehmer-Euclid Algorithm for
|
||
Finding the GCD of Long Integers", Journal of Symbolic
|
||
Computation, volume 19, 1995, pp. 145-157. Technical report
|
||
version also available
|
||
`ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1992/92-69.ps.gz'
|
||
|
||
* Werner Krandick and Tudor Jebelean, "Bidirectional Exact Integer
|
||
Division", Journal of Symbolic Computation, volume 21, 1996, pp.
|
||
441-455. Early technical report version also available
|
||
`ftp://ftp.risc.uni-linz.ac.at/pub/techreports/1994/94-50.ps.gz'
|
||
|
||
* Makoto Matsumoto and Takuji Nishimura, "Mersenne Twister: A
|
||
623-dimensionally equidistributed uniform pseudorandom number
|
||
generator", ACM Transactions on Modelling and Computer Simulation,
|
||
volume 8, January 1998, pp. 3-30. Available online
|
||
`http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/mt.ps.gz'
|
||
(or .pdf)
|
||
|
||
* R. Moenck and A. Borodin, "Fast Modular Transforms via Division",
|
||
Proceedings of the 13th Annual IEEE Symposium on Switching and
|
||
Automata Theory, October 1972, pp. 90-96. Reprinted as "Fast
|
||
Modular Transforms", Journal of Computer and System Sciences,
|
||
volume 8, number 3, June 1974, pp. 366-386.
|
||
|
||
* Niels Mo"ller, "On Scho"nhage's algorithm and subquadratic integer
|
||
GCD computation", in Mathematics of Computation, volume 77,
|
||
January 2008, pp. 589-607.
|
||
|
||
* Peter L. Montgomery, "Modular Multiplication Without Trial
|
||
Division", in Mathematics of Computation, volume 44, number 170,
|
||
April 1985.
|
||
|
||
* Arnold Scho"nhage and Volker Strassen, "Schnelle Multiplikation
|
||
grosser Zahlen", Computing 7, 1971, pp. 281-292.
|
||
|
||
* Kenneth Weber, "The accelerated integer GCD algorithm", ACM
|
||
Transactions on Mathematical Software, volume 21, number 1, March
|
||
1995, pp. 111-122.
|
||
|
||
* Paul Zimmermann, "Karatsuba Square Root", INRIA Research Report
|
||
3805, November 1999,
|
||
`http://hal.inria.fr/inria-00072854/PDF/RR-3805.pdf'
|
||
|
||
* Paul Zimmermann, "A Proof of GMP Fast Division and Square Root
|
||
Implementations",
|
||
`http://www.loria.fr/~zimmerma/papers/proof-div-sqrt.ps.gz'
|
||
|
||
* Dan Zuras, "On Squaring and Multiplying Large Integers", ARITH-11:
|
||
IEEE Symposium on Computer Arithmetic, 1993, pp. 260 to 271.
|
||
Reprinted as "More on Multiplying and Squaring Large Integers",
|
||
IEEE Transactions on Computers, volume 43, number 8, August 1994,
|
||
pp. 899-908.
|
||
|
||
|
||
File: gmp.info, Node: GNU Free Documentation License, Next: Concept Index, Prev: References, Up: Top
|
||
|
||
Appendix C GNU Free Documentation License
|
||
*****************************************
|
||
|
||
Version 1.3, 3 November 2008
|
||
|
||
Copyright (C) 2000-2002, 2007, 2008 Free Software Foundation, Inc.
|
||
`http://fsf.org/'
|
||
|
||
Everyone is permitted to copy and distribute verbatim copies
|
||
of this license document, but changing it is not allowed.
|
||
|
||
0. PREAMBLE
|
||
|
||
The purpose of this License is to make a manual, textbook, or other
|
||
functional and useful document "free" in the sense of freedom: to
|
||
assure everyone the effective freedom to copy and redistribute it,
|
||
with or without modifying it, either commercially or
|
||
noncommercially. Secondarily, this License preserves for the
|
||
author and publisher a way to get credit for their work, while not
|
||
being considered responsible for modifications made by others.
|
||
|
||
This License is a kind of "copyleft", which means that derivative
|
||
works of the document must themselves be free in the same sense.
|
||
It complements the GNU General Public License, which is a copyleft
|
||
license designed for free software.
|
||
|
||
We have designed this License in order to use it for manuals for
|
||
free software, because free software needs free documentation: a
|
||
free program should come with manuals providing the same freedoms
|
||
that the software does. But this License is not limited to
|
||
software manuals; it can be used for any textual work, regardless
|
||
of subject matter or whether it is published as a printed book.
|
||
We recommend this License principally for works whose purpose is
|
||
instruction or reference.
|
||
|
||
1. APPLICABILITY AND DEFINITIONS
|
||
|
||
This License applies to any manual or other work, in any medium,
|
||
that contains a notice placed by the copyright holder saying it
|
||
can be distributed under the terms of this License. Such a notice
|
||
grants a world-wide, royalty-free license, unlimited in duration,
|
||
to use that work under the conditions stated herein. The
|
||
"Document", below, refers to any such manual or work. Any member
|
||
of the public is a licensee, and is addressed as "you". You
|
||
accept the license if you copy, modify or distribute the work in a
|
||
way requiring permission under copyright law.
|
||
|
||
A "Modified Version" of the Document means any work containing the
|
||
Document or a portion of it, either copied verbatim, or with
|
||
modifications and/or translated into another language.
|
||
|
||
A "Secondary Section" is a named appendix or a front-matter section
|
||
of the Document that deals exclusively with the relationship of the
|
||
publishers or authors of the Document to the Document's overall
|
||
subject (or to related matters) and contains nothing that could
|
||
fall directly within that overall subject. (Thus, if the Document
|
||
is in part a textbook of mathematics, a Secondary Section may not
|
||
explain any mathematics.) The relationship could be a matter of
|
||
historical connection with the subject or with related matters, or
|
||
of legal, commercial, philosophical, ethical or political position
|
||
regarding them.
|
||
|
||
The "Invariant Sections" are certain Secondary Sections whose
|
||
titles are designated, as being those of Invariant Sections, in
|
||
the notice that says that the Document is released under this
|
||
License. If a section does not fit the above definition of
|
||
Secondary then it is not allowed to be designated as Invariant.
|
||
The Document may contain zero Invariant Sections. If the Document
|
||
does not identify any Invariant Sections then there are none.
|
||
|
||
The "Cover Texts" are certain short passages of text that are
|
||
listed, as Front-Cover Texts or Back-Cover Texts, in the notice
|
||
that says that the Document is released under this License. A
|
||
Front-Cover Text may be at most 5 words, and a Back-Cover Text may
|
||
be at most 25 words.
|
||
|
||
A "Transparent" copy of the Document means a machine-readable copy,
|
||
represented in a format whose specification is available to the
|
||
general public, that is suitable for revising the document
|
||
straightforwardly with generic text editors or (for images
|
||
composed of pixels) generic paint programs or (for drawings) some
|
||
widely available drawing editor, and that is suitable for input to
|
||
text formatters or for automatic translation to a variety of
|
||
formats suitable for input to text formatters. A copy made in an
|
||
otherwise Transparent file format whose markup, or absence of
|
||
markup, has been arranged to thwart or discourage subsequent
|
||
modification by readers is not Transparent. An image format is
|
||
not Transparent if used for any substantial amount of text. A
|
||
copy that is not "Transparent" is called "Opaque".
|
||
|
||
Examples of suitable formats for Transparent copies include plain
|
||
ASCII without markup, Texinfo input format, LaTeX input format,
|
||
SGML or XML using a publicly available DTD, and
|
||
standard-conforming simple HTML, PostScript or PDF designed for
|
||
human modification. Examples of transparent image formats include
|
||
PNG, XCF and JPG. Opaque formats include proprietary formats that
|
||
can be read and edited only by proprietary word processors, SGML or
|
||
XML for which the DTD and/or processing tools are not generally
|
||
available, and the machine-generated HTML, PostScript or PDF
|
||
produced by some word processors for output purposes only.
|
||
|
||
The "Title Page" means, for a printed book, the title page itself,
|
||
plus such following pages as are needed to hold, legibly, the
|
||
material this License requires to appear in the title page. For
|
||
works in formats which do not have any title page as such, "Title
|
||
Page" means the text near the most prominent appearance of the
|
||
work's title, preceding the beginning of the body of the text.
|
||
|
||
The "publisher" means any person or entity that distributes copies
|
||
of the Document to the public.
|
||
|
||
A section "Entitled XYZ" means a named subunit of the Document
|
||
whose title either is precisely XYZ or contains XYZ in parentheses
|
||
following text that translates XYZ in another language. (Here XYZ
|
||
stands for a specific section name mentioned below, such as
|
||
"Acknowledgements", "Dedications", "Endorsements", or "History".)
|
||
To "Preserve the Title" of such a section when you modify the
|
||
Document means that it remains a section "Entitled XYZ" according
|
||
to this definition.
|
||
|
||
The Document may include Warranty Disclaimers next to the notice
|
||
which states that this License applies to the Document. These
|
||
Warranty Disclaimers are considered to be included by reference in
|
||
this License, but only as regards disclaiming warranties: any other
|
||
implication that these Warranty Disclaimers may have is void and
|
||
has no effect on the meaning of this License.
|
||
|
||
2. VERBATIM COPYING
|
||
|
||
You may copy and distribute the Document in any medium, either
|
||
commercially or noncommercially, provided that this License, the
|
||
copyright notices, and the license notice saying this License
|
||
applies to the Document are reproduced in all copies, and that you
|
||
add no other conditions whatsoever to those of this License. You
|
||
may not use technical measures to obstruct or control the reading
|
||
or further copying of the copies you make or distribute. However,
|
||
you may accept compensation in exchange for copies. If you
|
||
distribute a large enough number of copies you must also follow
|
||
the conditions in section 3.
|
||
|
||
You may also lend copies, under the same conditions stated above,
|
||
and you may publicly display copies.
|
||
|
||
3. COPYING IN QUANTITY
|
||
|
||
If you publish printed copies (or copies in media that commonly
|
||
have printed covers) of the Document, numbering more than 100, and
|
||
the Document's license notice requires Cover Texts, you must
|
||
enclose the copies in covers that carry, clearly and legibly, all
|
||
these Cover Texts: Front-Cover Texts on the front cover, and
|
||
Back-Cover Texts on the back cover. Both covers must also clearly
|
||
and legibly identify you as the publisher of these copies. The
|
||
front cover must present the full title with all words of the
|
||
title equally prominent and visible. You may add other material
|
||
on the covers in addition. Copying with changes limited to the
|
||
covers, as long as they preserve the title of the Document and
|
||
satisfy these conditions, can be treated as verbatim copying in
|
||
other respects.
|
||
|
||
If the required texts for either cover are too voluminous to fit
|
||
legibly, you should put the first ones listed (as many as fit
|
||
reasonably) on the actual cover, and continue the rest onto
|
||
adjacent pages.
|
||
|
||
If you publish or distribute Opaque copies of the Document
|
||
numbering more than 100, you must either include a
|
||
machine-readable Transparent copy along with each Opaque copy, or
|
||
state in or with each Opaque copy a computer-network location from
|
||
which the general network-using public has access to download
|
||
using public-standard network protocols a complete Transparent
|
||
copy of the Document, free of added material. If you use the
|
||
latter option, you must take reasonably prudent steps, when you
|
||
begin distribution of Opaque copies in quantity, to ensure that
|
||
this Transparent copy will remain thus accessible at the stated
|
||
location until at least one year after the last time you
|
||
distribute an Opaque copy (directly or through your agents or
|
||
retailers) of that edition to the public.
|
||
|
||
It is requested, but not required, that you contact the authors of
|
||
the Document well before redistributing any large number of
|
||
copies, to give them a chance to provide you with an updated
|
||
version of the Document.
|
||
|
||
4. MODIFICATIONS
|
||
|
||
You may copy and distribute a Modified Version of the Document
|
||
under the conditions of sections 2 and 3 above, provided that you
|
||
release the Modified Version under precisely this License, with
|
||
the Modified Version filling the role of the Document, thus
|
||
licensing distribution and modification of the Modified Version to
|
||
whoever possesses a copy of it. In addition, you must do these
|
||
things in the Modified Version:
|
||
|
||
A. Use in the Title Page (and on the covers, if any) a title
|
||
distinct from that of the Document, and from those of
|
||
previous versions (which should, if there were any, be listed
|
||
in the History section of the Document). You may use the
|
||
same title as a previous version if the original publisher of
|
||
that version gives permission.
|
||
|
||
B. List on the Title Page, as authors, one or more persons or
|
||
entities responsible for authorship of the modifications in
|
||
the Modified Version, together with at least five of the
|
||
principal authors of the Document (all of its principal
|
||
authors, if it has fewer than five), unless they release you
|
||
from this requirement.
|
||
|
||
C. State on the Title page the name of the publisher of the
|
||
Modified Version, as the publisher.
|
||
|
||
D. Preserve all the copyright notices of the Document.
|
||
|
||
E. Add an appropriate copyright notice for your modifications
|
||
adjacent to the other copyright notices.
|
||
|
||
F. Include, immediately after the copyright notices, a license
|
||
notice giving the public permission to use the Modified
|
||
Version under the terms of this License, in the form shown in
|
||
the Addendum below.
|
||
|
||
G. Preserve in that license notice the full lists of Invariant
|
||
Sections and required Cover Texts given in the Document's
|
||
license notice.
|
||
|
||
H. Include an unaltered copy of this License.
|
||
|
||
I. Preserve the section Entitled "History", Preserve its Title,
|
||
and add to it an item stating at least the title, year, new
|
||
authors, and publisher of the Modified Version as given on
|
||
the Title Page. If there is no section Entitled "History" in
|
||
the Document, create one stating the title, year, authors,
|
||
and publisher of the Document as given on its Title Page,
|
||
then add an item describing the Modified Version as stated in
|
||
the previous sentence.
|
||
|
||
J. Preserve the network location, if any, given in the Document
|
||
for public access to a Transparent copy of the Document, and
|
||
likewise the network locations given in the Document for
|
||
previous versions it was based on. These may be placed in
|
||
the "History" section. You may omit a network location for a
|
||
work that was published at least four years before the
|
||
Document itself, or if the original publisher of the version
|
||
it refers to gives permission.
|
||
|
||
K. For any section Entitled "Acknowledgements" or "Dedications",
|
||
Preserve the Title of the section, and preserve in the
|
||
section all the substance and tone of each of the contributor
|
||
acknowledgements and/or dedications given therein.
|
||
|
||
L. Preserve all the Invariant Sections of the Document,
|
||
unaltered in their text and in their titles. Section numbers
|
||
or the equivalent are not considered part of the section
|
||
titles.
|
||
|
||
M. Delete any section Entitled "Endorsements". Such a section
|
||
may not be included in the Modified Version.
|
||
|
||
N. Do not retitle any existing section to be Entitled
|
||
"Endorsements" or to conflict in title with any Invariant
|
||
Section.
|
||
|
||
O. Preserve any Warranty Disclaimers.
|
||
|
||
If the Modified Version includes new front-matter sections or
|
||
appendices that qualify as Secondary Sections and contain no
|
||
material copied from the Document, you may at your option
|
||
designate some or all of these sections as invariant. To do this,
|
||
add their titles to the list of Invariant Sections in the Modified
|
||
Version's license notice. These titles must be distinct from any
|
||
other section titles.
|
||
|
||
You may add a section Entitled "Endorsements", provided it contains
|
||
nothing but endorsements of your Modified Version by various
|
||
parties--for example, statements of peer review or that the text
|
||
has been approved by an organization as the authoritative
|
||
definition of a standard.
|
||
|
||
You may add a passage of up to five words as a Front-Cover Text,
|
||
and a passage of up to 25 words as a Back-Cover Text, to the end
|
||
of the list of Cover Texts in the Modified Version. Only one
|
||
passage of Front-Cover Text and one of Back-Cover Text may be
|
||
added by (or through arrangements made by) any one entity. If the
|
||
Document already includes a cover text for the same cover,
|
||
previously added by you or by arrangement made by the same entity
|
||
you are acting on behalf of, you may not add another; but you may
|
||
replace the old one, on explicit permission from the previous
|
||
publisher that added the old one.
|
||
|
||
The author(s) and publisher(s) of the Document do not by this
|
||
License give permission to use their names for publicity for or to
|
||
assert or imply endorsement of any Modified Version.
|
||
|
||
5. COMBINING DOCUMENTS
|
||
|
||
You may combine the Document with other documents released under
|
||
this License, under the terms defined in section 4 above for
|
||
modified versions, provided that you include in the combination
|
||
all of the Invariant Sections of all of the original documents,
|
||
unmodified, and list them all as Invariant Sections of your
|
||
combined work in its license notice, and that you preserve all
|
||
their Warranty Disclaimers.
|
||
|
||
The combined work need only contain one copy of this License, and
|
||
multiple identical Invariant Sections may be replaced with a single
|
||
copy. If there are multiple Invariant Sections with the same name
|
||
but different contents, make the title of each such section unique
|
||
by adding at the end of it, in parentheses, the name of the
|
||
original author or publisher of that section if known, or else a
|
||
unique number. Make the same adjustment to the section titles in
|
||
the list of Invariant Sections in the license notice of the
|
||
combined work.
|
||
|
||
In the combination, you must combine any sections Entitled
|
||
"History" in the various original documents, forming one section
|
||
Entitled "History"; likewise combine any sections Entitled
|
||
"Acknowledgements", and any sections Entitled "Dedications". You
|
||
must delete all sections Entitled "Endorsements."
|
||
|
||
6. COLLECTIONS OF DOCUMENTS
|
||
|
||
You may make a collection consisting of the Document and other
|
||
documents released under this License, and replace the individual
|
||
copies of this License in the various documents with a single copy
|
||
that is included in the collection, provided that you follow the
|
||
rules of this License for verbatim copying of each of the
|
||
documents in all other respects.
|
||
|
||
You may extract a single document from such a collection, and
|
||
distribute it individually under this License, provided you insert
|
||
a copy of this License into the extracted document, and follow
|
||
this License in all other respects regarding verbatim copying of
|
||
that document.
|
||
|
||
7. AGGREGATION WITH INDEPENDENT WORKS
|
||
|
||
A compilation of the Document or its derivatives with other
|
||
separate and independent documents or works, in or on a volume of
|
||
a storage or distribution medium, is called an "aggregate" if the
|
||
copyright resulting from the compilation is not used to limit the
|
||
legal rights of the compilation's users beyond what the individual
|
||
works permit. When the Document is included in an aggregate, this
|
||
License does not apply to the other works in the aggregate which
|
||
are not themselves derivative works of the Document.
|
||
|
||
If the Cover Text requirement of section 3 is applicable to these
|
||
copies of the Document, then if the Document is less than one half
|
||
of the entire aggregate, the Document's Cover Texts may be placed
|
||
on covers that bracket the Document within the aggregate, or the
|
||
electronic equivalent of covers if the Document is in electronic
|
||
form. Otherwise they must appear on printed covers that bracket
|
||
the whole aggregate.
|
||
|
||
8. TRANSLATION
|
||
|
||
Translation is considered a kind of modification, so you may
|
||
distribute translations of the Document under the terms of section
|
||
4. Replacing Invariant Sections with translations requires special
|
||
permission from their copyright holders, but you may include
|
||
translations of some or all Invariant Sections in addition to the
|
||
original versions of these Invariant Sections. You may include a
|
||
translation of this License, and all the license notices in the
|
||
Document, and any Warranty Disclaimers, provided that you also
|
||
include the original English version of this License and the
|
||
original versions of those notices and disclaimers. In case of a
|
||
disagreement between the translation and the original version of
|
||
this License or a notice or disclaimer, the original version will
|
||
prevail.
|
||
|
||
If a section in the Document is Entitled "Acknowledgements",
|
||
"Dedications", or "History", the requirement (section 4) to
|
||
Preserve its Title (section 1) will typically require changing the
|
||
actual title.
|
||
|
||
9. TERMINATION
|
||
|
||
You may not copy, modify, sublicense, or distribute the Document
|
||
except as expressly provided under this License. Any attempt
|
||
otherwise to copy, modify, sublicense, or distribute it is void,
|
||
and will automatically terminate your rights under this License.
|
||
|
||
However, if you cease all violation of this License, then your
|
||
license from a particular copyright holder is reinstated (a)
|
||
provisionally, unless and until the copyright holder explicitly
|
||
and finally terminates your license, and (b) permanently, if the
|
||
copyright holder fails to notify you of the violation by some
|
||
reasonable means prior to 60 days after the cessation.
|
||
|
||
Moreover, your license from a particular copyright holder is
|
||
reinstated permanently if the copyright holder notifies you of the
|
||
violation by some reasonable means, this is the first time you have
|
||
received notice of violation of this License (for any work) from
|
||
that copyright holder, and you cure the violation prior to 30 days
|
||
after your receipt of the notice.
|
||
|
||
Termination of your rights under this section does not terminate
|
||
the licenses of parties who have received copies or rights from
|
||
you under this License. If your rights have been terminated and
|
||
not permanently reinstated, receipt of a copy of some or all of
|
||
the same material does not give you any rights to use it.
|
||
|
||
10. FUTURE REVISIONS OF THIS LICENSE
|
||
|
||
The Free Software Foundation may publish new, revised versions of
|
||
the GNU Free Documentation License from time to time. Such new
|
||
versions will be similar in spirit to the present version, but may
|
||
differ in detail to address new problems or concerns. See
|
||
`https://www.gnu.org/copyleft/'.
|
||
|
||
Each version of the License is given a distinguishing version
|
||
number. If the Document specifies that a particular numbered
|
||
version of this License "or any later version" applies to it, you
|
||
have the option of following the terms and conditions either of
|
||
that specified version or of any later version that has been
|
||
published (not as a draft) by the Free Software Foundation. If
|
||
the Document does not specify a version number of this License,
|
||
you may choose any version ever published (not as a draft) by the
|
||
Free Software Foundation. If the Document specifies that a proxy
|
||
can decide which future versions of this License can be used, that
|
||
proxy's public statement of acceptance of a version permanently
|
||
authorizes you to choose that version for the Document.
|
||
|
||
11. RELICENSING
|
||
|
||
"Massive Multiauthor Collaboration Site" (or "MMC Site") means any
|
||
World Wide Web server that publishes copyrightable works and also
|
||
provides prominent facilities for anybody to edit those works. A
|
||
public wiki that anybody can edit is an example of such a server.
|
||
A "Massive Multiauthor Collaboration" (or "MMC") contained in the
|
||
site means any set of copyrightable works thus published on the MMC
|
||
site.
|
||
|
||
"CC-BY-SA" means the Creative Commons Attribution-Share Alike 3.0
|
||
license published by Creative Commons Corporation, a not-for-profit
|
||
corporation with a principal place of business in San Francisco,
|
||
California, as well as future copyleft versions of that license
|
||
published by that same organization.
|
||
|
||
"Incorporate" means to publish or republish a Document, in whole or
|
||
in part, as part of another Document.
|
||
|
||
An MMC is "eligible for relicensing" if it is licensed under this
|
||
License, and if all works that were first published under this
|
||
License somewhere other than this MMC, and subsequently
|
||
incorporated in whole or in part into the MMC, (1) had no cover
|
||
texts or invariant sections, and (2) were thus incorporated prior
|
||
to November 1, 2008.
|
||
|
||
The operator of an MMC Site may republish an MMC contained in the
|
||
site under CC-BY-SA on the same site at any time before August 1,
|
||
2009, provided the MMC is eligible for relicensing.
|
||
|
||
|
||
ADDENDUM: How to use this License for your documents
|
||
====================================================
|
||
|
||
To use this License in a document you have written, include a copy of
|
||
the License in the document and put the following copyright and license
|
||
notices just after the title page:
|
||
|
||
Copyright (C) YEAR YOUR NAME.
|
||
Permission is granted to copy, distribute and/or modify this document
|
||
under the terms of the GNU Free Documentation License, Version 1.3
|
||
or any later version published by the Free Software Foundation;
|
||
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
|
||
Texts. A copy of the license is included in the section entitled ``GNU
|
||
Free Documentation License''.
|
||
|
||
If you have Invariant Sections, Front-Cover Texts and Back-Cover
|
||
Texts, replace the "with...Texts." line with this:
|
||
|
||
with the Invariant Sections being LIST THEIR TITLES, with
|
||
the Front-Cover Texts being LIST, and with the Back-Cover Texts
|
||
being LIST.
|
||
|
||
If you have Invariant Sections without Cover Texts, or some other
|
||
combination of the three, merge those two alternatives to suit the
|
||
situation.
|
||
|
||
If your document contains nontrivial examples of program code, we
|
||
recommend releasing these examples in parallel under your choice of
|
||
free software license, such as the GNU General Public License, to
|
||
permit their use in free software.
|
||
|
||
|
||
File: gmp.info, Node: Concept Index, Next: Function Index, Prev: GNU Free Documentation License, Up: Top
|
||
|
||
Concept Index
|
||
*************
|
||
|
||
|