Hi,
Recently I have been writing OpenCL code.
To get better data-transfer performance, the host buffer needs to be aligned to a 4096-byte boundary, as described here:
https://software.intel.com/en-us/articles/getting-the-most-from-opencl-…
When using the CL_MEM_USE_HOST_PTR flag, if we want to guarantee a zero copy buffer on Intel processor graphics, we need to ensure that we adhere to two device-dependent alignment and size rules. We must create a buffer that is aligned to a 4096 byte boundary and have a size that is a multiple of 64 bytes.
I think the latter is fulfilled. For the former requirement, I tested the pointer to the Igor wave data in C using the following code:
    if (((uintptr_t)IgorPtr % 4096u) != 0)
    {
        return -903;
    }
IgorPtr was obtained from WaveData(waveHandle).
In this case, -903 was returned to Igor when the XOP was executed, so the wave data is not aligned to 4096 bytes.
By contrast, if I create an aligned pointer in C and check it:
    int *pbuf = (int *)_aligned_malloc(sizeof(int) * 1024, 4096);
    if (((uintptr_t)pbuf % 4096u) != 0)
    {
        return -903;
    }
-903 was NOT returned, so the alignment was correct.
So my question is: how can I create a wave whose data buffer has a specific alignment, e.g. 4096 bytes?
Have you looked at https://github.com/pdedecker/IgorCL?
Regarding your question: if you need memory aligned to a certain boundary and cannot enforce that at creation time (as is the case for Igor waves), the usual approach is to make the wave a bit larger and skip the first x bytes so that the real data starts at the desired boundary. https://stackoverflow.com/a/227900 has some details on that.
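The over-allocate-and-skip approach described above could look roughly like this in C. This is only a sketch: align_up and padded_size are hypothetical helper names (they are not part of the XOP Toolkit or IgorCL), and the code assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes to request so that `payload`
 * bytes still fit after skipping up to alignment-1 leading bytes. */
size_t padded_size(size_t payload, size_t alignment)
{
    return payload + alignment - 1;
}

/* Hypothetical helper: return the first address at or after `ptr`
 * that is a multiple of `alignment` (assumed to be a power of two). */
void *align_up(void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t aligned = (addr + (alignment - 1)) & ~(uintptr_t)(alignment - 1);

    /* Advance the original pointer by the number of bytes to skip. */
    return (unsigned char *)ptr + (aligned - addr);
}
```

In an XOP, the idea would be to make the wave large enough to hold padded_size(nbytes, 4096) bytes, pass the pointer from WaveData(waveHandle) through align_up, and give the result to OpenCL. The number of skipped bytes depends on where the wave data currently lives, so it should be recomputed each time the pointer is fetched rather than stored.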
Thanks for the reminder about the IgorCL work; I can probably get some hints from it about how the transfer was optimized.
At the moment I already have working code, except that the transfer takes about 2-3 times longer than expected, so I am trying to find out why.
Your link probably gives me what I wanted. I will try it later, thanks a lot.
FYI, I have been looking into whether we could add a flag to Make to force alignment to a user-specified boundary. This appears to be impractical. The data of a wave actually starts at the end of a block of wave info whose final field is: double wData[1]; There are about 2500 places in the code where we directly access this using, for example, double x= xwP->wData[i];
To support alignment, we would have to change double wData[1] to double *wDataPtr, which could then be allocated separately. That of course changes 2500 places in the code.
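To illustrate the two layouts being compared, here is a rough C sketch. The struct names and the header fields npnts and type are made up for illustration; only wData and wDataPtr come from the description above, and the real wave structure is certainly larger.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inline layout: the data is the last field of the header,
 * so its address is always (start of block + fixed offset). */
struct wave_inline {
    long npnts;        /* made-up header field */
    int type;          /* made-up header field */
    double wData[1];   /* data starts here and runs past the struct */
};

/* Hypothetical indirect layout: the header only points at the data,
 * which can then come from any allocator, including an aligned one. */
struct wave_indirect {
    long npnts;        /* made-up header field */
    int type;          /* made-up header field */
    double *wDataPtr;  /* separately allocated data */
};

/* Fixed distance from the start of the block to the first data byte. */
size_t inline_data_offset(void)
{
    return offsetof(struct wave_inline, wData);
}
```

With the inline layout the data address is tied to the address of the whole block, so the data cannot be given its own alignment without also constraining where the header sits. With the indirect layout every access such as xwP->wData[i] would have to become xwP->wDataPtr[i].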
While researching Apple Metal MPSMatrixMultiplication, I also found that to use shared memory (to avoid copying), the newBufferWithBytesNoCopy method takes a pointer that has to be allocated with vm_allocate or mmap. Memory allocated by malloc is specifically disallowed. The memory must both start and end on a virtual memory page boundary.
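For reference, a buffer that both starts and ends on a page boundary can be obtained from mmap by rounding the length up to a whole number of pages. The following is a minimal POSIX sketch, not Metal-specific code; page_alloc, page_free and round_up_to_page are hypothetical names chosen for this example.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc in strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON /* older BSD/macOS spelling */
#endif

/* Round nbytes up to a whole number of virtual memory pages. */
size_t round_up_to_page(size_t nbytes)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (nbytes + page - 1) / page * page;
}

/* Map an anonymous region of at least nbytes. mmap returns a page-aligned
 * address, and the rounded length makes the region end on a page boundary.
 * Returns NULL on failure; the mapped length is stored in *mapped_len. */
void *page_alloc(size_t nbytes, size_t *mapped_len)
{
    assert(mapped_len != NULL);

    size_t len = round_up_to_page(nbytes);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *mapped_len = len;
    return p;
}

/* Release a region obtained from page_alloc. */
void page_free(void *p, size_t mapped_len)
{
    munmap(p, mapped_len);
}
```

The caller has to keep the mapped length, because munmap needs it and because the no-copy buffer would presumably be created with that rounded length rather than the requested one.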
There are about 2500 places in the code where we directly access this using, for example, double x= xwP->wData[i]
Doing that manually will be no fun at all. From following the git mailing list I came across coccinelle [1], which lets you write rules for code refactoring. It is designed only for C though. In [2] I saw the DMS Software Reengineering Toolkit, which claims to do the same for C++.
Then there is also clang-rename [3], but AFAIK that only works if your code base is built with cmake, as it requires a compilation database.
July 4, 2018 at 12:33 pm - Permalink
July 4, 2018 at 12:49 pm - Permalink
August 1, 2018 at 10:52 am - Permalink
In reply to FYI, I have been looking… by Larry Hutchinson
[1]: http://coccinelle.lip6.fr
[2]: https://stackoverflow.com/a/2427608
[3]: https://clang.llvm.org/extra/clang-rename.html
August 2, 2018 at 04:57 am - Permalink