Multithreading in accessing the same wave

I want to speed up my program, maybe by using threadsafe and multithreading. However, my code with a threadsafe structure only speeds up by 10%. I guess the bottleneck is that the worker needs to wait until the other worker finishes accessing the wave w. Do you have any suggestions on how to enable real multithreading in my situation?

 

Function Test()
    
    make/n=1000/O/D w, tt
    w = 0   // initialize
    tt = 1.05^p // create a log spaced t
 
    variable t1 = startMSTimer
 
    variable i
    for(i=1; i<100000; i++)
        worker(w, tt, i)
    endfor
 
    print stopMSTimer(t1)
 
 
End
 
ThreadSafe Function worker(w, tt, i)
    wave w, tt
    variable i
    
    w += 1/i^2*exp(-i^2*tt/10)
 
End

 

You are iterating explicitly on wave tt over i and summing implicitly into wave w over p.

A few fixes might increase speed in a minor way.

* feed the function i^2 rather than calculating i^2 inside the function
* feed the function tt/10 rather than calculating tt/10 inside the function

I imagine this might solved by a clever approach using MatrixOP and a matrix. I see a SUM(WAVE) function may also be useful. Finally, perhaps the order of operations should be reversed. Sum the function over i into a single value V(tt[p], i) and fill that sum (implicitly) into w[p].

Here is a 57% speed gain as JJ predicted:

Function Test()
    make/n=1000/O/D w, tt
    w = 0       // initialize
    tt = 1.05^p // create a log spaced t
 
    variable i, t1 = startMSTimer
    
    tt /= 10
    for(i=1; i<100000; i++)
        worker(w, tt, i^2, 1/i^2)
    endfor
 
    print stopMSTimer(t1)
End
 
Function worker(wave w, wave tt, variable i, variable inv_i)
    multithread w += inv_i*exp(-i*tt)
End

Maybe using MatrixOp in an intelligent way could speed this up even more. The bottleneck is surely the per-point wave assignment now.

Thank you, JJ and Chozo. I implemented the fixes, and the execution speed indeed increased by more than 50%. The problem happens when I increase the iteration number of the for loop. The multithreading on wave assignment is not enough to keep the execution time in a reasonable range. I will try to implement the MatrixOP with the sum(wave). Thank you for the advice.

"The problem happens when I increase the iteration number of the for loop. The multithreading on wave assignment is not enough to keep the execution time in a reasonable range."

 

Yes, if your wave has only 1000 points using multithreaded wave assignment is not very efficient.  You need to parallelize the for loop.  Below is my code thar runs 6 times faster on my pc.  However I have a question (for WM): since all threads use the same assignment:

w += 1/i^2*exp(-i^2*tt/10)

it will only be correct if the "+=" uses a lock (or an atomic operation) during the assignment at each point of the wave, is that how it works ? (I guess this should be the meaning of "threadsafe ?)

Edit: Just out of curiosity I made a test of simultaneous wave assignment by different threads, it looks like there are some artifacts of a race condition. 

 

Function Test()
    
   make/n=1000/O/D w, tt
   w= 0   // initialize
   tt = 1.05^p // create a log spaced t
 
    variable t1 = startMSTimer
 
    variable i
    variable nloops=100000
    
    Variable nthreads= ThreadProcessorCount
     Variable threadGroupID= ThreadGroupCreate(nthreads)
    
    variable loopsperthread=floor(nloops/nthreads)
    variable it
 
    
    for(it=0; it<nthreads; it+=1)
     
           variable i1=1+it*loopsperthread 
            variable i2   
            if(it<nthreads-1)
              i2=i1+loopsperthread-1
              else
              i2=nloops
            endif  
                     
            ThreadStart threadGroupID,it, worker(w, tt, i1, i2)
            
    endfor
    
    do
            Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
     while( threadGroupStatus != 0 )
 
    print stopMSTimer(t1)
 
 
End
 
 
ThreadSafe Function worker(w, tt, i1, i2)
    wave w, tt
    variable i1,i2
   
   variable i
   for(i=i1; i<=i2; i++) 
   w += 1/i^2*exp(-i^2*tt/10)
   endfor
 
End

 

> it will only be correct if the "+=" uses a lock (or an atomic operation) during the assignment at each point of the wave, is that how it works ? (I guess this should be the meaning of "threadsafe ?)

Exactly this "+=" assignment has me entirely sideways. I believe the faster (and more secure) approach might instead be a) create a set of columns to store either each explicit i iteration step or each implicit p iteration step and b) sum the columns along rows to recover the "w +=" equivalent behavior. IOW, rather than lumping everything into one step (aka python expert mode) to try to multi-thread, chunk the major processing into two steps and stream-line + multi-thread each one.

You are correct to worry about writing to the same wave in multiple threads. That is likely at some point to cause a crash if two threads write to the same element of the wave simultaneously. If you haven't seen a crash, it's because you got lucky and the threads didn't write simultaneously to the same wave element.

The Multithread keyword divides up the destination wave into chunks and assigns each chunk to a thread. That way, the threads all operate on different wave elements and they cannot write simultaneously to the same wave element. (There are ways to violate that, of course: if your worker function uses creative wave indexing, maybe w[i+1], then you will likely cause a crash.

DisplayHelpTopic "Automatic Parallel Processing with MultiThread"

Look at the end of that topic for "If the right-hand expression involves calling user-defined functions, those functions must be threadsafe (see ThreadSafe Functions) and must also follow these rules:"

The right way to write your function would be to make a separate copy of `w` for each thread and pass the copies to the worker. After the threads have all finished their assigned work, you would need more code to add each of the worker waves into the final result wave. It might be possible to speed up that last bit with Multithread. Naturally, making multiple waves would involve overhead. Making them free wave would save some time, though as of Igor 8 it might not be much time. At Igor 8 I implemented fast wave name lookup using hash tables instead of walking the list of waves.

Hi John,

Thanks you for your quick reply and explanations !

"Look at the end of that topic for "If the right-hand expression involves calling user-defined functions, those functions must be threadsafe (see ThreadSafe Functions) and must also follow these rules:"

Yes, I had looked at it, but it was still not clear for me how it works when multiple threads want to access the same data, in my opinion the manual lacks some explanations about it. At some point I thought that in a "threadsafe" function a statement

w +=  ...

would be compiled as :

Interlocked w += ...

(where I'm using a non-existent Igor keyword, just invented it :) but could be implemented maybe ? )

best

Pawel

 

 

 

 

 

 

Here again, I sense that you should split out the "w += " portion of the calculations. Use more memory to build calculations that put results into rows or columns in a matrix, where results for any one position in each row or column are independent of results for an equivalent position in every other row or column. Multi-thread those calculations to the maximum. At the end, sum the rows or columns appropriately (i.e. collapse the matrix by one dimension along rows or columns) to create your "w += " equivalent.

Interlocked w += ...

That would require locking and unlocking a mutex for pretty much every element in the wave, since we don't have a good idea of what a user function might do. Locking and unlocking a mutex is a very expensive thing to do, so it would probably be a performance reducer.

 

From the cited list of rules:

1.    Do not do anything to waves that are passed as parameters that might disturb memory. For example, do not change the number of points in the wave or change its data type or kill it or write to a text wave.

Changing number type or redimensioning both most likely involve moving the wave's data in memory, which invalidates any pointers to the wave held by another thread. We don't have a way for a thread's wave references to know they need to refresh what they point to. Reducing the number of points in a wave probably leaves the data in place in memory, but other threads don't know that they can't index all the way to what used to be the end of the wave. Text waves are special- they either are allocated as a huge blob of bytes or as many pointers to the text bytes. Either way, writing new text will invalidate pointers to the text that are being used by other threads.

2.    Do not write to a variable that is passed by reference.

A variable passed by reference is a pointer to the variable's data, so if two threads simultaneously write to that variable, they are contending for the same bytes in memory. At least a scramble of bytes will result, in many cases a crash.

3.    Note that any waves or global variables created by the function will disappear when then wave assignment is finished.
4.    Each thread has its own private data folder tree. You can not use WAVE, NVAR or SVAR to access objects in the main thread.

To avoid lots of issues with threads accessing waves, each thread has it's own data folder tree, which also means it's own lists of waves. If a function creates a global wave or variable, since it is in the thread-specific data folder tree, no other threads can find it. And when the thread is finished (you or Multithread has call ThreadGroupWait) the data folder tree is destroyed, so that wave or variable is also destroyed. Likewise, WAVE, NVAR or SVAR have access only to the thread's private data folder tree, and no access to the root folder's tree. This architecture was designed to prevent access to waves and variables from multiple threads simultaneously, which is sure to result in a crash. And waves that might appear in a graph should not be messed with by a thread!

I hope my little tutorial on thread safety is helpful. I can't say I'm any kind of expert on parallel processing, but I've been dealing with in Igor's internals for quite a while. Parallel processing is difficult to understand and do right! It has all the complexity of regular programming, except that your code running in multiple threads has potential for that code to interact with itself in ways that break the linear execution order you rely on.

Interlocked w += ...

That would require locking and unlocking a mutex for pretty much every element in the wave, since we don't have a good idea of what a user function might do. Locking and unlocking a mutex is a very expensive thing to do, so it would probably be a performance reducer.

 

From the cited list of rules:

1.    Do not do anything to waves that are passed as parameters that might disturb memory. For example, do not change the number of points in the wave or change its data type or kill it or write to a text wave.

Changing number type or redimensioning both most likely involve moving the wave's data in memory, which invalidates any pointers to the wave held by another thread. We don't have a way for a thread's wave references to know they need to refresh what they point to. Reducing the number of points in a wave probably leaves the data in place in memory, but other threads don't know that they can't index all the way to what used to be the end of the wave. Text waves are special- they either are allocated as a huge blob of bytes or as many pointers to the text bytes. Either way, writing new text will invalidate pointers to the text that are being used by other threads.

2.    Do not write to a variable that is passed by reference.

A variable passed by reference is a pointer to the variable's data, so if two threads simultaneously write to that variable, they are contending for the same bytes in memory. At least a scramble of bytes will result, in many cases a crash.

3.    Note that any waves or global variables created by the function will disappear when the wave assignment is finished.
4.    Each thread has its own private data folder tree. You can not use WAVE, NVAR or SVAR to access objects in the main thread.

To avoid lots of issues with threads accessing waves, each thread has it's own data folder tree, which also means it's own lists of waves. If a function creates a global wave or variable, since it is in the thread-specific data folder tree, no other threads can find it. And when the thread is finished (you or Multithread has call ThreadGroupWait) the data folder tree is destroyed, so that wave or variable is also destroyed. Likewise, WAVE, NVAR or SVAR have access only to the thread's private data folder tree, and no access to the root folder's tree. This architecture was designed to prevent access to waves and variables from multiple threads simultaneously, which is sure to result in a crash. And waves that might appear in a graph should not be messed with by a thread!

I hope my little tutorial on thread safety is helpful. I can't say I'm any kind of expert on parallel processing, but I've been dealing with in Igor's internals for quite a while. Parallel processing is difficult to understand and do right! It has all the complexity of regular programming, except that your code running in multiple threads has potential for that code to interact with itself in ways that break the linear execution order you rely on.

I should say, "pointer to a wave's data" is the same as Igor "wave reference". It is literally an internal variable that contains the index in the machine's memory where the wave's data starts.

2.    Do not write to a variable that is passed by reference.

 

ok, so as I understand that's the rule that has been infringed since a wave is always passed by reference.  Still I think the manual is not very clear about that. Actually we can write to a wave point as long as we make sure that other threads do not write to the SAME point of the wave, the rule is then respected because different points behave like different variables passed by reference....  Aaaah, even though I have quite a lot of experience with multithreading in .NET (C#) this formulation of rules was somewhat unclear for me, sorry for my misunderstanding - and thanks again for the explanations.

"That would require locking and unlocking a mutex for pretty much every element in the wave,     "

In .NET there are primitives for atomic incrementation of a variable, I'm not sure now but I thought they were lock-free, using only memory barriers (but maybe I'm wrong).

 

In .NET there are primitives for atomic incrementation of a variable, I'm not sure now but I thought they were lock-free, using only memory barriers (but maybe I'm wrong).

I think that would require that every element of a wave (and two for a complex element) would have to be an atomic variable. Since we write in C++, that would be std::atomic<something>. The correct memory access would be  std::memory_order_seq_cst. That isn't as bad as locking and unlocking a mutex, but not great. And making every single point of a wave an atomic is stretch.

I am a bit confused by Pawel and John's discussion. To prevent program crashes, should I avoid using multithreading with "w +=" for wave assignment in Igor Pro?

I imagine this might solved by a clever approach using MatrixOP and a matrix. I see a SUM(WAVE) function may also be useful. Finally, perhaps the order of operations should be reversed. Sum the function over i into a single value V(tt[p], i) and fill that sum (implicitly) into w[p].

I am trying to code with JJ's suggestion. Is the following scheme consistent with the comment?

  1. Create a temporary 2D wave, represented as V(tt[p], i).
  2. Make a thread-safe worker to compute the value for specific indices p and i (using 1/i^2*exp(-i^2*tt[p]/10)) and assign that result to V(tt[p], i).
  3. I use the MatrixOP function to sum the columns of V, which will give me the resulting 1D wave.

 

> Is the following scheme consistent with the comment?

The row goes over index p. The column goes over the index i. Suppose the matrix is value[npts][nis]. Calculate as

make/FREE/D/N=(npts,nis) value
duplicate/FREE tt ttd
ttd /= 10
value[][] = (1/q^2)*exp(-q^2*ttd[p]))

Perhaps the above can also be a MatrixOP/O value = ... calculation. Perhaps the step can be wrapped in a thread safe function, whereby passing (q^2) as a prefactor into the function would avoid the squaring calculation being done at every npts*nis cell in the matrix. After the last step, sum value along the q direction (perhaps with a MatrixOP) to obtain the equivalent of the "w += " step.

I am a bit confused by Pawel and John's discussion. To prevent program crashes, should I avoid using multithreading with "w +=" for wave assignment in Igor Pro?

No, that's fine. The Multithread keyword breaks up any wave assignment into independent blocks, so the threads write to memory locations that are all distinct. My warning about a user function the right-hand side was that you should not write a right-hand side user function that accesses wave elements other than p, q, r, s. The indices represented by p, q, r, s are part of the block assigned to a given thread. The indices p+1, p-1, etc., at the edges of the assignment are outside the block assigned to the thread.