parallel processing issue

I want to extract and sum up "beams" from a 3D wave depending on a ROI or mask wave, which I do using the code below:


function GetSum(cube, mask)
	wave cube, mask

	Make/O/N=(DimSize(cube, 2)) total=0
		
	variable cx, cy	
	for(cx = 0; cx < DimSize(Cube, 0); cx += 1) 		
			for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
					
				if (mask[cx][cy] ==1)			// mask[cx][cy] can be 1 or 0				
					ImageTransform /Beam ={(cx), (cy)} getBeam cube
					wave W_Beam
					total+=W_Beam	
				endif		
			
			endfor
	endfor		
	
	KillWaves/Z W_Beam
end


For the size of my 3D wave this process takes about 2-3 seconds. To improve performance I copied and modified the parallel processing example from the help files to this:


Threadsafe function workerFunc(cube, total, mask, col)
	wave cube, total, mask
	variable col
	
	variable i
	for (i=0; i<DimSize(cube, 0);i+=1)
		
		if (mask[i][col] ==1)	
			ImageTransform /Beam ={(i), (col)} getBeam cube
			wave W_Beam
			total+=W_Beam	
		endif
	endfor

end


Function MT_GetSum(cube, mask)
	WAVE cube, mask
	
	Make/O/N=(DimSize(cube, 2)) total=0
	
	Variable ncol= DimSize(cube,1)
	Variable i,col,nthreads= ThreadProcessorCount
	Variable threadGroupID= ThreadGroupCreate(nthreads)
	
	for(col=0; col<ncol;)
		for(i=0; i<nthreads; i+=1)
		
			ThreadStart threadGroupID,i,workerFunc(cube,total, mask,col)
			
			col+=1
			if( col>=ncol )
				break
			endif
		endfor
		
		do
			Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
		while( threadGroupStatus != 0 )
	endfor
	Variable dummy= ThreadGroupRelease(threadGroupID)
End



This gives a 60-70% speedup but it appears that wave "total" from the two methods does not contain identical values. The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.
Any ideas what I'm doing wrong?

I don't have a lot of time but from a quick look at your post:

ChrLie wrote:

Threadsafe function workerFunc(cube, total, mask, col)
	wave cube, total, mask
	variable col
	
	variable i
	for (i=0; i<DimSize(cube, 0);i+=1)
		
		if (mask[i][col] ==1)	
			ImageTransform /Beam ={(i), (col)} getBeam cube
			wave W_Beam
			total+=W_Beam	
		endif
	endfor

end



Think about what happens in your parallel calculation. Igor spawns a bunch of threads and these start crunching away, independently and regardless of what other threads may be doing. This also means that if different threads write to the same memory, each will modify this memory as it pleases and the result is unpredictable. Read the section called "Example atomic operation" here to understand what that means with respect to your problem.

Igor tries to prevent these problems from occurring by giving each thread its own, private data folder hierarchy, such that all waves created by the thread are accessible only by that thread. The exception, of course, are function arguments. In other words, every thread refers to the same "total" wave.

So when you do this:

total+=W_Beam	

The result is undefined because multiple threads can modify it simultaneously.

Your only option is to provide each thread with its own copy of total (so if there are 8 threads, create 8 copies of total, all initialized to zero), and to add all of these waves together into the final result after the threads have completed.


Also, the loop in which you call ThreadStart has issues, but I don't have time now so maybe someone else will chip in on that.
Looking at this again, your ThreadStart loop does look OK. I misread it due to reading it in a hurry, sorry. However, my comment above still stands.

I played with this a bit:

function GetSum(cube, mask)	// this is your original function
	wave cube, mask
 
	Make/O/N=(DimSize(cube, 2))/D total=0
 
	variable cx, cy	
	for(cx = 0; cx < DimSize(Cube, 0); cx += 1) 		
			for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
 
				if (mask[cx][cy] ==1)			// mask[cx][cy] can be 1 or 0				
					ImageTransform /Beam ={(cx), (cy)} getBeam cube
					wave W_Beam
					total+=W_Beam[p]
				endif		
 
			endfor
	endfor		
 
	KillWaves/Z W_Beam
end

Function GetSum2(cube, mask)	// MatrixOP-based
	wave cube, mask
	
	Make/O/N=(DimSize(cube, 2))/D total=0
 	
	MatrixOP /FREE Multiplied = cube * mask
	
	variable i
	for (i = 0; i < DimSize(Multiplied, 2); i+=1)
		ImageTransform /P=(i) sumPlane, Multiplied
		total[i] += V_value
	endfor
End

function GetSum3(cube, mask)	// MultiThreaded
	wave cube, mask
 
	Make/O/N=(DimSize(cube, 2))/D total=0
	
	MultiThread total = WorkerFunc3(cube, mask, p)
end

ThreadSafe Function WorkerFunc3(cube, mask, plane)
	wave cube, mask
	variable plane
	
	variable i, j, total = 0
	for (i = 0; i < DimSize(cube, 0); i+=1)
		for (j = 0; j < DimSize(cube, 1); j+=1)
			if (mask[i][j] == 1)
				total += cube[i][j][plane]
			endif
		endfor
	endfor
	
	return total
End

function GetSum4(cube, mask)	// MultiThreaded + MatrixOP
	wave cube, mask
 
	Make/O/N=(DimSize(cube, 2))/D total=0
	
	MultiThread total = WorkerFunc4(cube, mask, p)
end

ThreadSafe Function WorkerFunc4(cube, mask, plane)
	wave cube, mask
	variable plane
	
	MatrixOP /FREE W_Sum = sum(mask * cube[][][plane])
	return W_Sum[0]
End

Function DoTiming()
	Make /o/n=(128,128) /D M_Mask = round(enoise(0.5) + 0.5)
	Make /o/n=(128,128, 512) /D/O M_cube = enoise(1)
	
	variable timerRef, time1, time2, time3, time4
	
	timerRef = StartMSTimer
	GetSum(m_cube, m_mask)
	time1 = StopMSTimer(timerRef) / 1e6
	
	timerRef = StartMSTimer
	GetSum2(m_cube, m_mask)
	time2 = StopMSTimer(timerRef) / 1e6
	
	timerRef = StartMSTimer
	GetSum3(m_cube, m_mask)
	time3 = StopMSTimer(timerRef) / 1e6
	
	timerRef = StartMSTimer
	GetSum4(m_cube, m_mask)
	time4 = StopMSTimer(timerRef) / 1e6
	
	Printf "Original: %f, MatrixOP: %f s, MultiThread: %f s, MatrixOP + MultiThread: %f s\r", time1, time2, time3, time4
End


Result on my computer:

•DoTiming()
  Original: 0.343414 s, MatrixOP: 0.056957 s, MultiThread: 0.389801 s, MatrixOP + MultiThread: 0.008289 s


In my implementation, multithreading of the 'naive' Igor code actually leads to a slowdown. Using MatrixOP, however, is much faster than the naive code and multithreaded MatrixOP appears to be the performance winner.

Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.

ChrLie wrote: The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.

I think that this is probably due to a mistake in your code. But be wary of using single-precision waves when not necessary. Use doubles (Make /D) when practical.

I'd be interested to hear what the performance difference is on your dataset.
741 wrote:
Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.


GetSum4() is awesome!

Thinking of "planes" instead of "beams" clearly makes a difference. The result for my cube:


  Original: 2.831555, MatrixOP: 0.350114 s, MultiThread: 6.146490 s, MatrixOP + MultiThread: 0.097558 s


As for differences in totals in the initial post; I somehow had the feeling that all threads working on one wave (or memory) might be the problem but your solution is much more elegant anyway!

Thanks a lot!