arf.h – arbitrary-precision floating-point numbers¶
A variable of type arf_t
holds an arbitrary-precision binary
floating-point number: that is, a rational number of the form
\(x \cdot 2^y\) where \(x, y \in \mathbb{Z}\) and \(x\) is odd,
or one of the special values zero, plus infinity, minus infinity,
or NaN (not-a-number).
There is currently no support for negative zero, unsigned infinity,
or a NaN with a payload.
The exponent of a finite and nonzero floating-point number can be
defined in different
ways: for example, as the component y above, or as the unique
integer e such that
\(x \cdot 2^y = m \cdot 2^e\) where \(0.5 \le |m| < 1\).
The internal representation of an arf_t
stores the
exponent in the latter format.
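The two exponent conventions can be compared on ordinary doubles using only the C standard library. The following is a minimal sketch; the helper name `exponent_conventions` is hypothetical and not part of arf.h, and it only handles finite, nonzero inputs.

```c
#include <math.h>

/* Hypothetical helper (not part of arf.h).
   For a finite, nonzero double v, computes:
     - x odd and y such that v = x * 2^y  (first convention)
     - e such that v = m * 2^e with 0.5 <= |m| < 1  (second convention) */
void exponent_conventions(double v, long long *x, int *y, int *e)
{
    double m = frexp(v, e);                  /* v = m * 2^e, 0.5 <= |m| < 1 */
    long long n = (long long) ldexp(m, 53);  /* exact: m has at most 53 bits */
    int shift = *e - 53;                     /* v = n * 2^shift */

    /* Strip trailing zero bits so that the mantissa becomes odd. */
    while (n % 2 == 0)
    {
        n /= 2;
        shift++;
    }

    *x = n;
    *y = shift;
}
```

For v = 12 this gives x = 3, y = 2 and e = 4. In general e equals y plus the bit length of |x|.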
Except where otherwise noted, functions have the following semantics:
Functions taking prec and rnd parameters at the end of the argument list and returning an int flag round the result in the output variable to prec bits in the direction specified by rnd. The return flag is 0 if the result is exact (not rounded) and 1 if the result is inexact (rounded). Correct rounding is guaranteed: the result is the floating-point number obtained by viewing the inputs as exact numbers, in principle carrying out the mathematical operation exactly, and rounding the resulting real number to the nearest representable floating-point number whose mantissa has at most the specified number of bits, in the specified direction of rounding. In particular, the error is at most 1 ulp with directed rounding modes and 0.5 ulp when rounding to nearest.
Other functions perform the operation exactly.
Since exponents are bignums, overflow or underflow cannot occur.
Types, macros and constants¶
-
type arf_struct¶
-
type arf_t¶
An arf_struct contains four words: an fmpz exponent (exp), a size field tracking the number of limbs used (one bit of this field is also used for the sign of the number), and two more words. The last two words hold the value directly if there are at most two limbs, and otherwise contain one alloc field (tracking the total number of allocated limbs, not all of which might be used) and a pointer to the actual limbs. Thus, up to 128 bits on a 64-bit machine and 64 bits on a 32-bit machine, no space outside of the arf_struct is used.
An arf_t is defined as an array of length one of type arf_struct, permitting an arf_t to be passed by reference.
-
type arf_rnd_t¶
Specifies the rounding mode for the result of an approximate operation.
-
ARF_RND_DOWN¶
Specifies that the result of an operation should be rounded to the nearest representable number in the direction towards zero.
-
ARF_RND_UP¶
Specifies that the result of an operation should be rounded to the nearest representable number in the direction away from zero.
-
ARF_RND_FLOOR¶
Specifies that the result of an operation should be rounded to the nearest representable number in the direction towards minus infinity.
-
ARF_RND_CEIL¶
Specifies that the result of an operation should be rounded to the nearest representable number in the direction towards plus infinity.
-
ARF_RND_NEAR¶
Specifies that the result of an operation should be rounded to the nearest representable number, rounding to even if there is a tie between two values.
-
ARF_PREC_EXACT¶
If passed as the precision parameter to a function, indicates that no rounding is to be performed. Warning: use of this value is unsafe in general. It must only be passed as input under the following two conditions:
The operation in question can inherently be viewed as an exact operation in \(\mathbb{Z}[\tfrac{1}{2}]\) for all possible inputs, provided that the precision is large enough. Examples include addition, multiplication, conversion from integer types to arbitrary-precision floating-point types, and evaluation of some integer-valued functions.
The exact result of the operation will certainly fit in memory. Note that, for example, adding two numbers whose exponents are far apart can easily produce an exact result that is far too large to store in memory.
The typical use case is to work with small integer values, double precision constants, and the like. It is also useful when writing test code. If in doubt, simply try with some convenient high precision instead of using this special value, and check that the result is exact.
Memory management¶
Special values¶
-
int arf_is_nan(const arf_t x)¶
Returns nonzero iff x respectively equals 0, 1, \(+\infty\), \(-\infty\), NaN.
-
int arf_is_normal(const arf_t x)¶
Returns nonzero iff x is a finite, nonzero floating-point value, i.e. not one of the special values 0, \(+\infty\), \(-\infty\), NaN.
-
int arf_is_special(const arf_t x)¶
Returns nonzero iff x is one of the special values 0, \(+\infty\), \(-\infty\), NaN, i.e. not a finite, nonzero floating-point value.
-
int arf_is_finite(const arf_t x)¶
Returns nonzero iff x is a finite floating-point value, i.e. not one of the values \(+\infty\), \(-\infty\), NaN. (Note that this is not equivalent to the negation of arf_is_inf().)
Assignment, rounding and conversions¶
-
int arf_set_round_fmpz(arf_t res, const fmpz_t x, slong prec, arf_rnd_t rnd)¶
Sets res to x, rounded to prec bits in the direction specified by rnd.
-
int arf_set_round_fmpz_2exp(arf_t res, const fmpz_t x, const fmpz_t e, slong prec, arf_rnd_t rnd)¶
Sets res to \(x \cdot 2^e\), rounded to prec bits in the direction specified by rnd.
-
void arf_get_fmpz_2exp(fmpz_t m, fmpz_t e, const arf_t x)¶
Sets m and e to the unique integers such that \(x = m \cdot 2^e\) and m is odd, provided that x is a nonzero finite fraction. If x is zero, both m and e are set to zero. If x is infinite or NaN, the result is undefined.
-
void arf_frexp(arf_t m, fmpz_t e, const arf_t x)¶
Writes x as \(m \cdot 2^e\), where \(0.5 \le |m| < 1\) if x is a normal value. If x is a special value, copies this to m and sets e to zero. Note: for the inverse operation (ldexp), use arf_mul_2exp_fmpz().
-
double arf_get_d(const arf_t x, arf_rnd_t rnd)¶
Returns x rounded to a double in the direction specified by rnd. This method rounds correctly when overflowing or underflowing the double exponent range (this was not the case in an earlier version).
-
int arf_get_mpfr(mpfr_t res, const arf_t x, mpfr_rnd_t rnd)¶
Sets the MPFR variable res to the value of x. If the precision of res is too small to allow x to be represented exactly, the value is rounded in the specified MPFR rounding mode. The return value (-1, 0 or 1) indicates the direction of rounding, following the convention of the MPFR library.
If x has an exponent too large or small to fit in the MPFR type, the result overflows to an infinity or underflows to a (signed) zero, and the corresponding MPFR exception flags are set.
-
int arf_get_fmpz(fmpz_t res, const arf_t x, arf_rnd_t rnd)¶
Sets res to x rounded to the nearest integer in the direction specified by rnd. If rnd is ARF_RND_NEAR, rounds to the nearest even integer in case of a tie. Returns the inexact flag (beware: this means the return value is nonzero exactly when x is not an integer).
This method aborts if x is infinite or NaN, or if the exponent of x is so large that allocating memory for the result fails.
Warning: this method will allocate a huge amount of memory to store the result if the exponent of x is huge. Memory allocation could succeed even if the required space is far larger than the physical memory available on the machine, resulting in swapping. It is recommended to check that x is within a reasonable range before calling this method.
-
slong arf_get_si(const arf_t x, arf_rnd_t rnd)¶
Returns x rounded to the nearest integer in the direction specified by rnd. If rnd is ARF_RND_NEAR, rounds to the nearest even integer in case of a tie. Aborts if x is infinite, NaN, or the value is too large to fit in a slong.
-
int arf_get_fmpz_fixed_si(fmpz_t res, const arf_t x, slong e)¶
Converts x to a mantissa with predetermined exponent, i.e. sets res to an integer y such that \(y \times 2^e \approx x\), truncating if necessary. Returns 0 if exact and 1 if truncation occurred.
The warnings for arf_get_fmpz() apply.
-
void arf_ceil(arf_t res, const arf_t x)¶
Sets res to \(\lfloor x \rfloor\) and \(\lceil x \rceil\) respectively. The result is always represented exactly, requiring no more bits to store than the input. To round the result to a floating-point number with a lower precision, call arf_set_round() afterwards.
Comparisons and bounds¶
-
int arf_equal(const arf_t x, const arf_t y)¶
-
int arf_equal_si(const arf_t x, slong y)¶
-
int arf_equal_ui(const arf_t x, ulong y)¶
-
int arf_equal_d(const arf_t x, double y)¶
Returns nonzero iff x and y are exactly equal. NaN is not treated specially, i.e. NaN compares as equal to itself.
For comparison with a double, the values -0 and +0 are both treated as zero, and all NaN values are treated as identical.
-
int arf_cmp_d(const arf_t x, double y)¶
Returns negative, zero, or positive, depending on whether x is respectively smaller, equal, or greater compared to y. Comparison with NaN is undefined.
-
int arf_cmpabs_2exp_si(const arf_t x, slong e)¶
Compares x (respectively its absolute value) with \(2^e\).
-
int arf_sgn(const arf_t x)¶
Returns \(-1\), \(0\) or \(+1\) according to the sign of x. The sign of NaN is undefined.
-
void arf_max(arf_t res, const arf_t a, const arf_t b)¶
Sets res respectively to the minimum and the maximum of a and b.
-
slong arf_bits(const arf_t x)¶
Returns the number of bits needed to represent the absolute value of the mantissa of x, i.e. the minimum precision sufficient to represent x exactly. Returns 0 if x is a special value.
-
int arf_is_int_2exp_si(const arf_t x, slong e)¶
Returns nonzero iff x equals \(n 2^e\) for some integer n.
-
void arf_abs_bound_lt_2exp_fmpz(fmpz_t res, const arf_t x)¶
Sets res to the smallest integer b such that \(|x| < 2^b\). If x is zero, infinity or NaN, the result is undefined.
Magnitude functions¶
-
void arf_get_mag_lower(mag_t res, const arf_t x)¶
Sets res to a lower bound for the absolute value of x.
-
void mag_init_set_arf(mag_t res, const arf_t x)¶
Initializes res and sets it to an upper bound for x.
-
void mag_fast_init_set_arf(mag_t res, const arf_t x)¶
Initializes res and sets it to an upper bound for x. Assumes that the exponent of res is small (this function is unsafe).
-
void arf_mag_set_ulp(mag_t res, const arf_t x, slong prec)¶
Sets res to the magnitude of the unit in the last place (ulp) of x at precision prec.
Shallow assignment¶
-
void arf_init_set_mag_shallow(arf_t z, const mag_t x)¶
Initializes z to a shallow copy of x. A shallow copy just involves copying struct data (no heap allocation is performed).
The target variable z may not be cleared or modified in any way (it can only be used as constant input to functions), and may not be used after x has been cleared. Moreover, after x has been assigned shallowly to z, no modification of x is permitted as long as z is in use.
Random number generation¶
-
void arf_randtest(arf_t res, flint_rand_t state, slong bits, slong mag_bits)¶
Generates a finite random number whose mantissa has precision at most bits and whose exponent has at most mag_bits bits. The values are distributed non-uniformly: special bit patterns are generated with high probability in order to allow the test code to exercise corner cases.
-
void arf_randtest_not_zero(arf_t res, flint_rand_t state, slong bits, slong mag_bits)¶
Identical to arf_randtest(), except that zero is never produced as an output.
-
void arf_randtest_special(arf_t res, flint_rand_t state, slong bits, slong mag_bits)¶
Identical to arf_randtest(), except that the output occasionally is set to an infinity or NaN.
-
void arf_urandom(arf_t res, flint_rand_t state, slong bits, arf_rnd_t rnd)¶
Sets res to a uniformly distributed random number in the interval \([0, 1]\). The method uses rounding from integers to floats based on the rounding mode rnd.
Input and output¶
-
void arf_printd(const arf_t x, slong d)¶
Prints x as a decimal floating-point number, rounding to d digits. Rounding is faithful (at most 1 ulp error).
-
char *arf_get_str(const arf_t x, slong d)¶
Returns x as a decimal floating-point number, rounding to d digits. Rounding is faithful (at most 1 ulp error).
-
void arf_fprint(FILE *file, const arf_t x)¶
Prints x as an integer mantissa and exponent to the stream file.
-
void arf_fprintd(FILE *file, const arf_t y, slong d)¶
Prints y as a decimal floating-point number to the stream file, rounding to d digits. Rounding is faithful (at most 1 ulp error).
-
char *arf_dump_str(const arf_t x)¶
Allocates a string and writes a binary representation of x to it that can be read by arf_load_str(). The returned string needs to be deallocated with flint_free.
-
int arf_load_str(arf_t x, const char *str)¶
Parses str into x. Returns a nonzero value if str is not formatted correctly.
-
int arf_dump_file(FILE *stream, const arf_t x)¶
Writes a binary representation of x to stream that can be read by arf_load_file(). Returns a nonzero value if the data could not be written.
-
int arf_load_file(arf_t x, FILE *stream)¶
Reads x from stream. Returns a nonzero value if the data is not formatted correctly or the read failed. Note that the data is assumed to be delimited by a whitespace or end-of-file, i.e., when writing multiple values with arf_dump_file() make sure to insert a whitespace to separate consecutive values.
Addition and multiplication¶
-
int arf_add_fmpz(arf_t res, const arf_t x, const fmpz_t y, slong prec, arf_rnd_t rnd)¶
Sets res to \(x + y\).
-
int arf_add_fmpz_2exp(arf_t res, const arf_t x, const fmpz_t y, const fmpz_t e, slong prec, arf_rnd_t rnd)¶
Sets res to \(x + y 2^e\).
-
int arf_sub_fmpz(arf_t res, const arf_t x, const fmpz_t y, slong prec, arf_rnd_t rnd)¶
Sets res to \(x - y\).
-
int arf_mul_fmpz(arf_t res, const arf_t x, const fmpz_t y, slong prec, arf_rnd_t rnd)¶
Sets res to \(x \cdot y\).
-
int arf_addmul_fmpz(arf_t z, const arf_t x, const fmpz_t y, slong prec, arf_rnd_t rnd)¶
Performs a fused multiply-add \(z = z + x \cdot y\), updating z in-place.
-
int arf_submul_fmpz(arf_t z, const arf_t x, const fmpz_t y, slong prec, arf_rnd_t rnd)¶
Performs a fused multiply-subtract \(z = z - x \cdot y\), updating z in-place.
Summation¶
-
int arf_sum(arf_t res, arf_srcptr terms, slong len, slong prec, arf_rnd_t rnd)¶
Sets res to the sum of the array terms of length len, rounded to prec bits in the direction specified by rnd. The sum is computed as if done without any intermediate rounding error, with only a single rounding applied to the final result. Unlike repeated calls to arf_add() with infinite precision, this function does not overflow if the magnitudes of the terms are far apart. Warning: this function is implemented naively, and the running time is quadratic with respect to len in the worst case.
Dot products¶
-
void arf_approx_dot(arf_t res, const arf_t initial, int subtract, arf_srcptr x, slong xstep, arf_srcptr y, slong ystep, slong len, slong prec, arf_rnd_t rnd)¶
Computes an approximate dot product, with the same meaning of the parameters as arb_dot(). This operation is not correctly rounded: the final rounding is done in the direction rnd but intermediate roundings are implementation-defined.
Division¶
Square roots¶
-
int arf_sqrt_fmpz(arf_t res, const fmpz_t x, slong prec, arf_rnd_t rnd)¶
Sets res to \(\sqrt{x}\). The result is NaN if x is negative.
-
int arf_rsqrt(arf_t res, const arf_t x, slong prec, arf_rnd_t rnd)¶
Sets res to \(1/\sqrt{x}\). The result is NaN if x is negative, and \(+\infty\) if x is zero.
-
int arf_root(arf_t res, const arf_t x, ulong k, slong prec, arf_rnd_t rnd)¶
Sets res to \(x^{1/k}\). The result is NaN if x is negative. Warning: this function is a wrapper around the MPFR root function. It gets slow and uses much memory for large k. Consider working with arb_root_ui() for large k instead of using this function directly.
Complex arithmetic¶
-
int arf_complex_mul(arf_t e, arf_t f, const arf_t a, const arf_t b, const arf_t c, const arf_t d, slong prec, arf_rnd_t rnd)¶
-
int arf_complex_mul_fallback(arf_t e, arf_t f, const arf_t a, const arf_t b, const arf_t c, const arf_t d, slong prec, arf_rnd_t rnd)¶
Computes the complex product \(e + fi = (a + bi)(c + di)\), rounding both \(e\) and \(f\) correctly to prec bits in the direction specified by rnd. The first bit in the return code indicates inexactness of \(e\), and the second bit indicates inexactness of \(f\).
If any of the components a, b, c, d is zero, two real multiplications and no additions are done. This convention is used even if any other part contains an infinity or NaN, and the behavior with infinite/NaN input is defined accordingly.
The fallback version is implemented naively, for testing purposes. No squaring optimization is implemented.
Low-level methods¶
-
int _arf_get_integer_mpn(nn_ptr y, nn_srcptr xp, slong xn, slong exp)¶
Given a floating-point number x represented by xn limbs at xp and an exponent exp, writes the integer part of x to y, returning whether the result is inexact. The correct number of limbs is written (no limbs are written if the integer part of x is zero). Assumes that xp[0] is nonzero and that the top bit of xp[xn-1] is set.
-
int _arf_set_mpn_fixed(arf_t z, nn_srcptr xp, slong xn, slong fixn, int negative, slong prec, arf_rnd_t rnd)¶
Sets z to the fixed-point number having xn total limbs and fixn fractional limbs, negated if negative is set, rounding z to prec bits in the direction rnd and returning whether the result is inexact. Both xn and fixn must be nonnegative and not so large that the bit shift would overflow an slong, but otherwise no assumptions are made about the input.
-
int _arf_set_round_ui(arf_t z, ulong x, int sgnbit, slong prec, arf_rnd_t rnd)¶
Sets z to the integer x, negated if sgnbit is 1, rounded to prec bits in the direction specified by rnd. There are no assumptions on x.
-
int _arf_set_round_uiui(arf_t z, slong *fix, ulong hi, ulong lo, int sgnbit, slong prec, arf_rnd_t rnd)¶
Sets the mantissa of z to the two-limb mantissa given by hi and lo, negated if sgnbit is 1, rounded to prec bits in the direction specified by rnd. Requires that not both hi and lo are zero. Writes the exponent shift to fix without writing the exponent of z directly.
-
int _arf_set_round_mpn(arf_t z, slong *exp_shift, nn_srcptr x, slong xn, int sgnbit, slong prec, arf_rnd_t rnd)¶
Sets the mantissa of z to the mantissa given by the xn limbs in x, negated if sgnbit is 1, rounded to prec bits in the direction specified by rnd. Returns the inexact flag. Requires that xn is positive and that the top limb of x is nonzero. If x has leading zero bits, writes the shift to exp_shift. This method does not write the exponent of z directly. Requires that x does not point to the limbs of z.