Differences in cumulative sum.

I have a poser for Monday morning.

As part of a rebinning/regrouping algorithm I have an array of data
for which I work out the cumulative sum:

Wave var_init
make/n=(numpnts(var_init))/d cum_var_init = 0
cum_var_init = sum(var_init, 0, p)	// cum_var_init[p] is the running total of var_init up to point p (default x scaling)


Please note that all values in var_init are non-negative (they are
variances) and are stored in double precision, meaning that values in
cum_var_init should be monotonically non-decreasing. I then need to
work out differences between different (interpolated) point numbers in
that accumulated wave, e.g.

Wave W_var
W_var[4] = cum_var_init[7.75] - cum_var_init[5.1]
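
For reference, here is a minimal command-line reproduction of the setup (illustrative values only): var_init is zero from point 5 onward, so cum_var_init is flat there and the interpolated difference below should ideally be exactly zero.

Make/O/D/N=10 var_init = (p < 5) ? 1.3 : 0
Make/O/D/N=10 cum_var_init = 0
cum_var_init = sum(var_init, 0, p)
Print/D cum_var_init[7.75] - cum_var_init[5.1]	// ideally zero, but can come out around -1e-14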


We now come to the interesting part. Some successive points in
cum_var_init should be identical, because var_init contains a lot of
zeroes. However, when the interpolated subtraction is carried out in a
region where cum_var_init is unchanging, some of the values in W_var
come out negative, on the order of -1e-14. That is roughly two orders
of magnitude larger than machine precision, and it is a problem because
W_var represents a variance and should never be negative.
So where does the discrepancy arise: in the cumulative sum function, or
in the interpolated subtraction? Or should I be doing such an
interpolation in a different way?

One might say that oh, it doesn't matter, it's only 1 part in 10^14,
just put a tolerance expression in there. However, what would happen
when I need to deal with variances on the order of 1e-14?
test_overflow.pxp (56.27 KB)
OK- I tracked down the code that does interpolated look-ups from waves. The problem is pretty certainly floating-point truncation errors. In the test case you posted, the numbers involved are on the order of 40,000 and the negative number is -7.27596e-12, or one part in 5e15.

The interpolation is not done in the straightforward way involving a subtraction; it's done as a weighted average. Here's the code:

A= lhpoint+1-pnum;
B= 1-A;

DBL(dest[0])= A*DBL(dest[0]) + B*DBL(rhYval[0]);

On the right side, dest has already been filled with the number from the wave point below the requested fractional point. rhYval contains the value from the wave point above. Don't worry about the DBL macro; it is involved with adapting the code to different number types, and with tricks to make things fast. The point here is that even if dest and rhYval contain identical numbers, the arithmetic involved cannot be guaranteed to add up exactly to the original number.
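
To see how that can go wrong even when the two points are identical, here is a small stand-alone sketch (not our internal code, just the same arithmetic done with double-precision Variables in a user function; the number used is purely illustrative):

Function DemoWeightedAverage()
	// both neighbouring wave points hold the same value (illustrative number)
	Variable x = 40000.123456789

	// weighted-average lookup at fractional point 7.75
	Variable A1 = 7 + 1 - 7.75
	Variable B1 = 1 - A1
	Variable v1 = A1*x + B1*x

	// weighted-average lookup at fractional point 5.1
	Variable A2 = 5 + 1 - 5.1
	Variable B2 = 1 - A2
	Variable v2 = A2*x + B2*x

	// both should equal x; rounding in the products can leave tiny
	// discrepancies of either sign, which is what shows up in W_var
	Print/D v1 - x, v2 - x, v1 - v2
End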

I'm sure that this is done to save some floating-point arithmetic, and the time that would take.

You basically have three choices:

1) Do the interpolation yourself, using a subtraction of the two point values as the first step (see the sketch after this list). If the numbers are truly identical, the subtraction gives a clean zero and subsequent computations should come out to exactly zero.

2) Check for the result being some acceptably small fraction of the numbers you're subtracting, and if small, then assign a clean zero.

3) Simply take the absolute value. That will result in some numbers that are vanishingly small and positive rather than being vanishingly small and negative.
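
Here is a minimal sketch of option 1, with the option-2 tolerance check folded in. The function names and the 1e-12 tolerance are just illustrative choices, and it assumes the fractional point numbers lie strictly inside the wave:

// Linear interpolation at fractional point pnum, subtracting first so that
// identical neighbouring points give an exactly zero difference.
Function InterpAtPoint(w, pnum)
	Wave w
	Variable pnum

	Variable p0 = floor(pnum)
	Variable frac = pnum - p0
	return w[p0] + frac*(w[p0+1] - w[p0])
End

// Difference between two interpolated lookups, clamped to a clean zero when
// the result is an acceptably small fraction of the values being subtracted.
Function InterpDiff(w, pHigh, pLow)
	Wave w
	Variable pHigh, pLow

	Variable vHigh = InterpAtPoint(w, pHigh)
	Variable vLow = InterpAtPoint(w, pLow)
	Variable d = vHigh - vLow
	if (abs(d) < 1e-12 * max(abs(vHigh), abs(vLow)))
		d = 0
	endif
	return d
End

Your W_var[4] would then be computed as InterpDiff(cum_var_init, 7.75, 5.1).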

I haven't asked Larry, but I suspect he would resist changing our code because the problem is rare and only consequential in rare cases, and it would add extra operations in a fundamental operation that is often in the innermost part of a loop, where execution speed is required, and changing it would cause computations that people have used in the past to change in subtle ways.

John Weeks
WaveMetrics, Inc.
support@wavemetrics.com
johnweeks wrote:
OK- I tracked down the code that does interpolated look-ups from waves. The problem is pretty certainly floating-point truncation errors. In the test case you posted, the numbers involved are on the order of 40,000 and the negative number is -7.27596e-12, or one part in 5e15.


The numbers are of that magnitude, but I'm interested in differences between nearby points, which are 4 orders of magnitude smaller.


johnweeks wrote:
The interpolation is done as a weighted average.
A= lhpoint+1-pnum;
B= 1-A;
DBL(dest[0])= A*DBL(dest[0]) + B*DBL(rhYval[0]);

On the right side, dest has already been filled with the number from the wave point below the requested fractional point. rhYval contains the value from the wave point above.

I'm sure that this is done to save some floating-point arithmetic, and the time that would take.


As far as I can tell, with the formula you show you have 2 additions, 2 subtractions and 2 multiplications.
If you do it like this:
dest[0] = dest[0] + pnum * (rhYval[0] - dest[0])
you have 1 multiplication, 1 addition and 1 subtraction. That is three floating-point operations fewer (I'm assuming that pnum is the fraction of the point that is required to add on).
I'm not trying to make trouble, only to understand.

John Weeks wrote:
3) Simply take the absolute value. That will result in some numbers that are vanishingly small and positive rather than being vanishingly small and negative.

I've done this.

John Weeks wrote:
the problem is rare and only consequential in rare cases, and it would add extra operations in a fundamental operation that is often in the innermost part of a loop, where execution speed is required, and changing it would cause computations that people have used in the past to change in subtle ways.

I disagree here. If I came across the problem, other people will have seen it as well. Perhaps I don't understand why, but I don't think it takes more fp operations. Whilst it may cause computations that people have used in the past to change slightly, one could argue they were probably being done in slightly the wrong way. Whilst I agree that the magnitude of the problem is vanishingly small, the sign really does matter when someone is going to apply a function that will give undefined behaviour, like a sqrt or log.
andyfaff wrote:
As far as I can tell, with the formula you show you have 2 additions, 2 subtractions and 2 multiplications.
If you do it like this:
dest[0] = dest[0] + pnum * (rhYval[0] - dest[0])
you have 1 multiplication, 1 addition and 1 subtraction. That is three floating-point operations fewer (I'm assuming that pnum is the fraction of the point that is required to add on).
I'm not trying to make trouble, only to understand.

Except that it doesn't work... You need to replace pnum with the fractional position between the two points, and that adds three more operations:
frac = (p - p0)/(p1 - p0)

Another thing to think about is that any scheme that involves subtracting the two Y values, and then adding something to one of the Y values, will suffer from floating-point truncation. It's the small differences of large numbers problem: the difference between the Y values is likely to be much smaller than the Y values themselves. So the weighted average scheme should be more accurate in general, since multiplication involves only the fractional last-bit error, whereas subtraction and then addition suffers from an error that gets larger as the numbers get closer.
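
For concreteness, here is a little side-by-side sketch (illustrative only) that evaluates both forms on the same inputs; y0 and y1 are the values at the points below and above the lookup, and frac is the fractional part of the requested point number:

Function CompareInterpForms(y0, y1, frac)
	Variable y0, y1, frac

	Variable A = 1 - frac
	Variable B = frac
	Variable weighted = A*y0 + B*y1	// weighted-average form, as in the lookup code
	Variable lerp = y0 + frac*(y1 - y0)	// subtract-first form

	Print/D weighted, lerp, weighted - lerp
End

Calling it with two identical y values (e.g. CompareInterpForms(40000.123456789, 40000.123456789, 0.1)) lets you compare how the two forms behave in the flat-region case discussed above.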
Quote:
John Weeks wrote:
the problem is rare and only consequential in rare cases, and it would add extra operations in a fundamental operation that is often in the innermost part of a loop, where execution speed is required, and changing it would cause computations that people have used in the past to change in subtle ways.

I disagree here. If I came across the problem, other people will have seen it as well. Perhaps I don't understand why, but I don't think it takes more fp operations. Whilst it may cause computations that people have used in the past to change slightly, one could argue they were probably being done in slightly the wrong way. Whilst I agree that the magnitude of the problem is vanishingly small, the sign really does matter when someone is going to apply a function that will give undefined behaviour, like a sqrt or log.

I'm afraid the bottom line is that depending on floating-point operations being exactly zero, or having an accurate sign near zero, is something you just shouldn't do. The message in this case is that floating-point truncation can bite you unexpectedly.

John Weeks
WaveMetrics, Inc.
support@wavemetrics.com