Open-sea hell !

Brucey(Posted 2009) [#1]
The last thing you want when you are stuck in a boat, thousands of miles out in the ocean, is a storm... That would be open-sea hell !

Fortunately for the rest of us, we can sit here and do something fun with BlitzMax :-)

For this latest "module for a weekend" experiment, you will currently need a Mac with Snow Leopard, and perhaps an NVidia graphics card, although it works on my Mac Mini using its dual-core CPU.

BaH.OpenCL is a WIP wrapper for OpenCL, which allows you to write BlitzMax apps that can run code on supported GPUs :-)

Obviously, you can't sit and code a whole game on the GPU, since that's not what this library is designed for, but you can throw computation-intensive calculations at it, and it will happily go off and spread them across lots of available processor cores.
Think of it a bit like having multi-threading without having to worry about the threads :-)

The second included example uses both cores on my Mac (according to the very unscientific "top", which showed it running at 179% CPU).

The code you write for OpenCL is a variation of C99, and is compiled up at runtime by the framework.
It also has the capability to interact directly with OpenGL data, which might be useful for some texture manipulation or such things.
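
As a rough sketch of the "threads without the threads" idea, here is the work-item model emulated in plain C. This is not OpenCL and not the module's API: the names `square_item` and `run_items` are made up for illustration, and the loop only stands in for what the runtime would do in parallel on a real device.

```c
#include <stddef.h>

/* Stand-in for a kernel body: handles exactly one element, identified by gid.
   In real OpenCL C the index would come from get_global_id(0). */
static void square_item(size_t gid, const float *input, float *output)
{
    output[gid] = input[gid] * input[gid];
}

/* Stand-in for the runtime: invokes the "kernel" once per work item.
   A real device would run many of these invocations at the same time. */
static void run_items(size_t global_size, const float *input, float *output)
{
    for (size_t gid = 0; gid < global_size; gid++) {
        square_item(gid, input, output);
    }
}
```

The point is that you only write the per-element function; how the elements get spread across cores is the framework's job.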

The OpenCL API is quite extensive, with many functions for manipulating ints, floats, vectors, etc in very efficient ways. Check out the Quick Reference Card (PDF link on that page, and API starts from the middle of the second page) for a summary of the commands.

In theory, the module should work on all platforms, but for non Mac platforms, you would need to set up the appropriate NVidia or AMD/ATI SDK - which I have not had time to look into at the moment.

Many of the Types and functions have been implemented, but it might take some work to get things just right, given the nature of the way it works.

Currently available from the SVN repository at googlecode.

Fun fun ;-)


Brucey(Posted 2009) [#2]
On a plus note, the module compiles without error on Windows - without having any SDKs installed.

:-)


plash(Posted 2009) [#3]
The module compiles on XP-SP3 but neither example will compile.

Here's the error log for example_01.bmx



Brucey(Posted 2009) [#4]
The module compiles on XP-SP3 but neither example will compile.

I wouldn't expect them to, without something to link to...
... like a DLL shipped with an SDK, or suchlike.


plash(Posted 2009) [#5]
Hrm.. Looks like you have to be signed up as a developer to get the AMD SDK.
I won't bother then.


Brucey(Posted 2009) [#6]
Example output from the example_01 :
Executing:example_01.debug

Device Count = 1

OpenCL Device information :
    deviceType               : CL_DEVICE_TYPE_CPU
    vendorName               : Intel
    deviceName               : Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
    driverVersion            : 1.0
    deviceVersion            : OpenCL 1.0
    maxComputeUnits          : 2
    maxWorkItemDimensions    : 3
    maxWorkItemSizes         : (1, 1, 1)
    maxWorkGroupSize         : 1
    maxClockFrequency        : 2000
    maxMemAllocSize          : 1073741824
    imageSupport             : 1
    maxReadImageArgs         : 128
    maxWriteImageArgs        : 8
    image2dMaxWidth          : 8192
    image2dMaxHeight         : 8192
    image3dMaxWidth          : 2048
    image3dMaxHeight         : 2048
    image3dMaxDepth          : 2048
    maxSamplers              : 16
    maxParameterSize         : 4096
    globalMemCacheSize       : 4194304
    globalMemSize            : 1610612733
    maxConstantBufferSize    : 65536
    maxConstantArgs          : 8
    localMemSize             : 16384
    errorCorrectionSupport   : 0
    profilingTimerResolution : 1
    endianLittle             : 1
    profile                  : FULL_PROFILE
    extensions               : cl_khr_fp64
                               cl_khr_global_int32_base_atomics
                               cl_khr_global_int32_extended_atomics
                               cl_khr_local_int32_base_atomics
                               cl_khr_local_int32_extended_atomics
                               cl_khr_byte_addressable_store
                               cl_APPLE_gl_sharing
                               cl_APPLE_SetMemObjectDestructor
                               cl_APPLE_ContextLoggingFunctions
                              

This is what you get when you don't have a "proper" graphics card :-p


Looks like you have to be signed up as a developer to get the AMD SDK.

So I see. But at least it's free :-)


plash(Posted 2009) [#7]
But at least it's free :-)
Yeah.


Brucey(Posted 2009) [#8]
Win32 examples should now compile without the link errors.

They will expect a DLL called 'OpenCL.DLL' if you try to run them though... That's the name of the DLL that comes with the ATI SDK. Hopefully it'll be the same for the NVidia SDK, or we'll have a problem :-p

Getting there...


slenkar(Posted 2009) [#9]
opencl is begging for a realtime ray-tracing engine to be written :)


jkrankie(Posted 2009) [#10]
Looking forward to having the time to play with this :)

Cheers
Charlie


Brucey(Posted 2009) [#11]
Don't get too excited... Calling functions against the ATI library results in all kinds of weird and wonderful data corruption.

This may be an issue of using MinGW.
For example, I call a function that populates an int. On returning, I print the value of the int. It prints 1. I print the value again. It prints a very large number.

I've read one report of a user having similar issues on the AMD forums, so it's hopefully not a Brucey issue per se... but I also cannot rule that out :-p


Going to try the NVidia SDK now... sigh.


Brucey(Posted 2009) [#12]
Well, I got the NVidia version to run without crashing, although it returns 0 devices here.

Quite understandable, given there's no NVidia graphics card, and Windows is running in Parallels... although I had hoped for perhaps CPU availability - but since none of the drivers install properly without a valid graphics card, I'm not overly concerned at this point.

Oh well :-)


beanage(Posted 2010) [#13]
Excited to see there's already an attempt on this! Uh, is it still available at google code? I feel a bit stupid not finding it..


theHand(Posted 2010) [#14]
Awesome, man!
So this is your current focus, eh? Looks great!
Oh, and ATI's stuff is maybe halfway between standards-compliant and "hacked up", I think. Go figure. I always buy Nvidia because they like standards. :)


Brucey(Posted 2010) [#15]
Apparently ATI have released (yet) another update to their SDK. Hopefully this has fixed the previous issues.
Although I don't have any of the requisite hardware, it should always at least let you call into the library without spewing crap back at you. So when you ask it - how many processors can I use? - it should at least say 0...

Of course, if it's a MinGW issue, then things will just need to be "worked around", I suppose...


Tachyon(Posted 2010) [#16]
I would like to see this fluid simulation http://www.blitzbasic.com/Community/posts.php?topic=87491 pushed through OpenCL. Unfortunately, I don't have the time to put it together. Maybe someone else can as a test?


ker2x(Posted 2010) [#17]
Friendly greetings !!

I'm new to BlitzMax (bought it yesterday), switched from Purebasic.
I'm not a programmer (linux sysadmin and postgresql DBA).
I know (more or less, as a full time sysad) various languages, played with OpenCL from Java and failed at using OpenCL from C/C++ ...

One of the main reasons to switch to BlitzMax is your OpenCL module (and the wide range of supported game engines, and the fact that BlitzMax uses FASM, like Purebasic, so I can optimize my code using SSE2/3 code).

Currently downloading the latest trunk of your impressive module list. I hope to be able to play with openCL (fractal, MonteCarlo, ...) without too much pain.

So, simply, Thank you ! I'll provide feedback as soon as possible :)


ker2x(Posted 2010) [#18]
Running example 1, I got :
RUNTIME ERROR:Attempt to index array element beyond array length

According to the debug (using BLide free edition) :
Function _setDevice:TCLDevice(list:TCLDevice[], index:Int, devicePtr:Byte Ptr, deviceType:Int)
    Local device:TCLDevice = _create(devicePtr)
    device.deviceType = deviceType
    list[index] = device ' <=== it fails here
    Return device
End Function

The debug console shows : Local list:TCLDevice[0]=Null
I'd say that it fails at detecting an OpenCL device.

Windows 7 64bits ultimate, NVidia 8800GTX, with openCL Driver and (unrelated) CUDA SDK installed.

OpenCL usually works on this computer.

Same problem with example 2.

Edit : Stacktrace :
~>Unhandled Exception:Attempt to index array element beyond array length
~>
~>StackTrace{
~>@...;
~>Function example_01
~>Local devices:TCLDevice[]=Null
~>@...;
~>Function GetDevices
~>Local deviceType:Int=-1
~>@...;
~>Function _setDevice
~>Local list:TCLDevice[]=$002c0770
~>Local index:Int=4204125
~>Local devicePtr:Byte Ptr=$001b2e50
~>Local deviceType:Int=4204125
~>Local device:TCLDevice=$002c0790
~>}


ker2x(Posted 2010) [#19]
More info about my card here (uploaded a few minutes ago) : http://www.ozone3d.net/gpu/db/index.php?which=3a1322bb3d43f0e5d6f6f8c4f3e83698


Brucey(Posted 2010) [#20]
My biggest problem is a lack of hardware to test the different drivers on Windows.
When you try to install the drivers, they expect a supporting gfx card.. if you don't have that installed, usually the drivers won't install.

On Mac, you just load the library and it all "just works", even if you don't have the supporting hardware, because it falls-back to the CPU.


ker2x(Posted 2010) [#21]
I added some debug.

For some reason, in _setDevice :
Local deviceType:Int=4204125
So :
Local index:Int=4204125
But list.length = 1

Here is the patched code :
Function _setDevice:TCLDevice(list:TCLDevice[], index:Int, devicePtr:Byte Ptr, deviceType:Int)
    Local device:TCLDevice = _create(devicePtr)
    device.deviceType = deviceType
    DebugLog "list length = " + list.Length
    DebugLog "index = " + index
    list[index] = device
    Return device
End Function

And the debug output :
DebugLog:list length = 1
DebugLog:index = 4204125

Now, why is the deviceType so obviously wrong ? ... Debug in progress :)


ker2x(Posted 2010) [#22]
As a workaround, i replaced :
list[index] = device
with
list[0] = device

OpenCL Device information :
deviceType :
vendorName : NVIDIA Corporation
deviceName : GeForce 8800 GTX
driverVersion : 195.62
deviceVersion : OpenCL 1.0 CUDA
maxComputeUnits : 16
maxWorkItemDimensions : 3
maxWorkItemSizes : (512, 512, 64)
maxWorkGroupSize : 512
maxClockFrequency : 1350
maxMemAllocSize : 201326592
imageSupport : 1
maxReadImageArgs : 128
maxWriteImageArgs : 8
image2dMaxWidth : 8192
image2dMaxHeight : 8192
image3dMaxWidth : 2048
image3dMaxHeight : 2048
image3dMaxDepth : 2048
maxSamplers : 16
maxParameterSize : 4352
globalMemCacheSize : 0
globalMemSize : 805306368
maxConstantBufferSize : 65536
maxConstantArgs : 9
localMemSize : 16384
errorCorrectionSupport : 0
profilingTimerResolution : 1000
endianLittle : 1
profile : FULL_PROFILE
extensions : cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query


As a first project... looks like I didn't choose the easy way... *grins*


ker2x(Posted 2010) [#23]
Example 2 fails too...
Not in the same place, but probably for the same reason.

Function InitDevice:TCLPlatform(deviceType:Int)
    Local devices:TCLDevice[] = GetDevices(deviceType)
    If devices.length > 0 Then
        Local device:TCLDevice = devices[0]

        Local this:TCLPlatform = New TCLPlatform
        this.platformPtr = bmx_ocl_platform_init(this, device.devicePtr) ' <= error here
        this.device = device
        Return this
    End If
End Function

bad refs:obj=$2023e80 refs=$20207e1


ker2x(Posted 2010) [#24]
It makes no sense to me ...

in : bmx_ocl_platform_getdevices(...) {
...

BBArray * list = _bah_opencl_TCLDevice__newDeviceList(size);

When I add a debug in : Function _newDeviceList:TCLDevice[] (Count:Int)
It shows that Count = 1, so size = 1. Great !

Following :

for (int n = 0; n < size; n++) {
    cl_device_id device = devices[n];
    cl_device_type type;
    size_t s = 0;

    err = clGetDeviceInfo(device, CL_DEVICE_TYPE, sizeof(type), &type, &s);

    MaxCLDevice * dev = new MaxCLDevice(device);
    BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, n, dev, type);
    dev->SetHandle(obj);
}

Size should, obviously, still be "1"
and n = 0 on the only pass through the loop

Function _setDevice:TCLDevice(list:TCLDevice[], index:Int, devicePtr:Byte Ptr, deviceType:Int)

So, here, index should be = 0
But ... it's = 4204125

*sigh*


Brucey(Posted 2010) [#25]
But ... it's = 4204125

I know... welcome to my world... :-p

I'm not entirely sure what's going on with it.
It's almost as if the data is being corrupted along the way somewhere.


ker2x(Posted 2010) [#26]
heh, thx :)

Could it be some 64bits oddity ?
Also... i'm compiling using MinGW

F:\MinGW\bin>gcc.exe --version
gcc.exe (GCC) 3.4.5 (mingw-vista special r3)


Brucey(Posted 2010) [#27]
Also... i'm compiling using MinGW

I'm also hoping that MinGW isn't the limiting factor here.


ker2x(Posted 2010) [#28]
I bought BlitzMax to avoid doing OpenCL in C/C++ ... and the first thing I do is debugging a BlitzMax library written in ... C !

Looks like your world is in the same galaxy as mine ;)


Brucey(Posted 2010) [#29]
It would probably be easier for me to work with if I had a box with Windows + NVidia... and a box with Windows + ATI...
Alas... that is not the case, but I'm working through it.

The fact that it does work as it should on Mac, means that I'm not going to give up on it :-p


Brucey(Posted 2010) [#30]
I appear to have made a little progress...

In my debug... the size returned from clGetDeviceInfo() would be 1... then change to something else.
Some tweaking has it remaining as 1. But it's still crashing. Still, it *feels* like a step in the right direction...


ker2x(Posted 2010) [#31]
Oops, i forgot to tell you that (line 351) :

BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, n, dev, type);

The bmx code see : 4204125

But when I change it to :
BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, 1, dev, type);

The bmx code see : 1

If i keep :
BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, n, dev, type);

and change :
for (int n = 0; n < size; n++) {
with :
for (int n = 0; n < 1; n++) {

i still have 4204125

...


Brucey(Posted 2010) [#32]
I have got that bit working now... and it seems to work up to the point of the return from bmx_ocl_platform_getdevices() - at which point it crashes.

It is even detecting a "type 2" device, which I think is CPU.

I'll try and remove my debug and check in the changes, and see if you can get it to run a little further.
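
For reference, a small sketch of how the device-type value can be decoded. The bit values below are the ones defined in the OpenCL 1.0 cl.h header (where cl_device_type is a 64-bit bitfield), so "type 2" is indeed CL_DEVICE_TYPE_CPU. The DEMO_ names and the `device_type_name` helper are hypothetical, redefined here only so the sketch compiles without any SDK; they are not part of the module.

```c
#include <stdint.h>

/* Same bit values as CL_DEVICE_TYPE_* in the OpenCL 1.0 cl.h header,
   renamed with a DEMO_ prefix so no SDK is needed to compile this. */
#define DEMO_TYPE_DEFAULT     ((uint64_t)1 << 0)
#define DEMO_TYPE_CPU         ((uint64_t)1 << 1)
#define DEMO_TYPE_GPU         ((uint64_t)1 << 2)
#define DEMO_TYPE_ACCELERATOR ((uint64_t)1 << 3)
#define DEMO_TYPE_ALL         ((uint64_t)0xFFFFFFFF)
#define DEMO_TYPE_KNOWN       (DEMO_TYPE_DEFAULT | DEMO_TYPE_CPU | \
                               DEMO_TYPE_GPU | DEMO_TYPE_ACCELERATOR)

/* Returns a readable name, or "invalid" when the value is zero or has bits
   set that no OpenCL 1.0 device type uses (a hint that it is garbage). */
static const char *device_type_name(uint64_t type)
{
    if (type == DEMO_TYPE_ALL) return "CL_DEVICE_TYPE_ALL";
    if (type == 0 || (type & ~DEMO_TYPE_KNOWN) != 0) return "invalid";
    if (type & DEMO_TYPE_CPU) return "CL_DEVICE_TYPE_CPU";
    if (type & DEMO_TYPE_GPU) return "CL_DEVICE_TYPE_GPU";
    if (type & DEMO_TYPE_ACCELERATOR) return "CL_DEVICE_TYPE_ACCELERATOR";
    return "CL_DEVICE_TYPE_DEFAULT";
}
```

With this check, the 4204125 seen in the earlier debug output comes back as "invalid", while 2 and 4 decode to CPU and GPU. The deviceType of -1 in the earlier stack trace is most likely CL_DEVICE_TYPE_ALL (0xFFFFFFFF) viewed as a signed 32-bit Int.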


Brucey(Posted 2010) [#33]
Updated SVN.

I'm now investigating the crash out of bmx_ocl_platform_getdevices().


ker2x(Posted 2010) [#34]
Updated and tested with a clean example and module : still 4204125


ker2x(Posted 2010) [#35]
I found some info on : http://forums.nvidia.com/index.php?showtopic=96942 and http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2070


ker2x(Posted 2010) [#36]
I updated to the latest BETA NVidia driver : both examples work \o/

Sorry about all the mess :) :)


ker2x(Posted 2010) [#37]
In your openCL code :

__kernel void inverseSquareroot(__global float *a, __global float *result) {
    int gid = get_global_id(0);
    result[gid] = 1.0f/sqrt(a[gid]);
}

It's faster if you use :
__kernel void inverseSquareroot(__global float *a, __global float *result) {
    int gid = get_global_id(0);
    result[gid] = rsqrt(a[gid]);
}


Brucey(Posted 2010) [#38]
Sorry about all the mess

No problem. I'm just glad it's working for you now! :-)

The problem with OpenCL at the moment is that it is still very early in its development. And the drivers are developing with it. But at least we can see some forward progress!!

There may be some more things I am missing in the implementation... if you see anything important, let me know and I'll try to sort it out.

Thanks for your patience, and I hope the rest of your BlitzMax experience is better than the beginning ! :-)


ker2x(Posted 2010) [#39]
I bought a nettop powered by an nvidia ION platform.
Currently installing everything needed to play with openCL on it.
I'll tell you if it works, and how slow it is :)


ker2x(Posted 2010) [#40]
i can't find what i'm missing... i try to compile the module but it cannot find the opencl.lib


ker2x(Posted 2010) [#41]
Mmm nope, it's ok, the error was totally unrelated.
And i couldn't compile this module "normally" with bmk or the BLide IDE because i put the module into mod\opencl.mod instead of mod\bah.mod\opencl.mod

now it's ok :
OpenCL Device information :
deviceType : CL_DEVICE_TYPE_GPU
vendorName : NVIDIA Corporation
deviceName : ION LE
driverVersion : 190.89
deviceVersion : OpenCL 1.0
maxComputeUnits : 1 <====== LOL :)
maxWorkItemDimensions : 3
maxWorkItemSizes : (512, 512, 64)
maxWorkGroupSize : 512
maxClockFrequency : 1100
maxMemAllocSize : 134217728
imageSupport : 1
maxReadImageArgs : 128
maxWriteImageArgs : 8
image2dMaxWidth : 8192
image2dMaxHeight : 8192
image3dMaxWidth : 2048
image3dMaxHeight : 2048
image3dMaxDepth : 2048
maxSamplers : 16
maxParameterSize : 4352
globalMemCacheSize : 0
globalMemSize : 131792896
maxConstantBufferSize : 65536
maxConstantArgs : 9
localMemSize : 16384
errorCorrectionSupport : 0
profilingTimerResolution : 1000
endianLittle : 1
profile : FULL_PROFILE
extensions : cl_khr_byte_addressable_store
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics


ker2x(Posted 2010) [#42]
example 2 :
Time elapsed for OpenCL run : 42219
Time elapsed for BlitzMax run : 33343

Yup... the CPU code is faster than the GPU (probably because the ION shares some of its memory with the system, and example 2 is not computation-intensive but mostly I/O bound).

BUT .... it works !! That's all i want :)
I can play and crash everything, i'll never have to reboot my main desktop computer again :)

For reference, example 2 on my main computer :
Time elapsed for OpenCL run : 1551
Time elapsed for BlitzMax run : 8238

a 8800GTX on Win7 64bits, Quadcore Q6600, 8GB of RAM.
(8800GTX opencl details posted 3 days ago on this forum)

My ION platform is a Packard-Bell (ewww!) C1600


Brucey(Posted 2010) [#43]
Time elapsed for OpenCL run : 1551
Time elapsed for BlitzMax run : 8238

Well, that at least shows some of the potential, even if that example is very simple. Given a "proper" test, I am sure CPU-based code will be blown away!

Nice to know that it works on different systems :-)


ker2x(Posted 2010) [#44]
Oops, i forgot that i was in debug mode.
In release mode :

Time elapsed for OpenCL run : 1548
Time elapsed for BlitzMax run : 446

The CPU is faster on both computers.
But, again, the example code is very inefficient for a GPU : too much memory access, not enough math :)

I will write more complex OpenCL code and release it so you can provide a better example to unleash the massive power of GPGPU :)


Brucey(Posted 2010) [#45]
But, again, the example code is very inefficient for a GPU

Yeah. I didn't really want to implement anything very big... there's only so much time in my day ;-)

Thanks for testing it though. Your feedback helps a lot !


ker2x(Posted 2010) [#46]
Is there a way to pass a cl_int as kernel arguments instead of a cl_mem ?

i have a compile error : unable to convert int to TCLmem (obviously !)


ker2x(Posted 2010) [#47]
Hehehe, I found why the OpenCL code was slow. It was not what I thought.

Change :
kernelSquare.Execute(1, Size)
with :
kernelSquare.Execute(1, Size, 256)
Enjoy :

Time elapsed for OpenCL run : 59
Time elapsed for BlitzMax run : 450

1.00000000 : 1.00000000
4.00000000 : 4.00000000
9.00000000 : 9.00000000
16.0000000 : 16.0000000
25.0000000 : 25.0000000
36.0000000 : 36.0000000
49.0000000 : 49.0000000
64.0000000 : 64.0000000
81.0000000 : 81.0000000
100.000000 : 100.000000
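
A note on that third argument, assuming it is the local work size (the work-group size), which is not spelled out above: in OpenCL 1.x the global size has to be a whole multiple of the local size, and the local size must not exceed the device's maxWorkGroupSize (512 on the 8800 GTX listed earlier). A minimal sketch of the usual rounding, with hypothetical helper names:

```c
#include <stddef.h>

/* Round global_size up to the next multiple of local_size.
   local_size must be greater than zero. */
static size_t round_up_global(size_t global_size, size_t local_size)
{
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}

/* Number of work-groups the rounded-up launch is split into. */
static size_t work_group_count(size_t global_size, size_t local_size)
{
    return round_up_global(global_size, local_size) / local_size;
}
```

If the size gets rounded up, the kernel needs a bounds check (something like "if gid >= size, return") so the padding work items don't write past the end of the buffers.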


ker2x(Posted 2010) [#48]
I used the OpenCL profiler with example 2, with some modifications :
Const Size:Int = 1024 * 1024 * 64
kernelSquare.Execute(1, Size, 512)
And removing the CPU square function.

~93% of total time is spent on memory transfer (Host<->Device)
~6% of time in CPU time to "execute" the "Square" kernel
~1% in real GPU execution

And still around 10x faster than cpu code :)
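
To put those percentages in perspective, here is a back-of-the-envelope sketch using Amdahl's law (my choice of model, not something the profiler reports). `amdahl_speedup` is a hypothetical helper: p is the fraction of the run time being improved, s is how much faster that part gets.

```c
/* Amdahl's law: overall speedup when a fraction p of the total run time
   is made s times faster and the rest is left unchanged. */
static double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

With the figures above, making the ~1% of real GPU execution arbitrarily fast gains only about 1% overall, while halving the ~93% spent on transfers (p = 0.93, s = 2) gives roughly 1.87x. So for this example, the work to do is on the memory transfers, not on the kernel.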


Brucey(Posted 2010) [#49]
Is there a way to pass a cl_int as kernel arguments instead of a cl_mem ?

Yes... via SetArgInt() SetArgFloat() etc... which I appear not to have implemented yet.
Will sort that oversight out ASAP :-/


DavidDC(Posted 2010) [#50]
And still around 10x faster than cpu code

Now there's a statement to catch the eye! Thanks for exploring this.


ker2x(Posted 2010) [#51]
I would like to see this fluid simulation http://www.blitzbasic.com/Community/posts.php?topic=87491 pushed through OpenCL. Unfortunately, I don't have the time to put it together. Maybe someone else can as a test?


I'll try this weekend, if i can understand the code. :)
(remember that I discovered BlitzMax last weekend and I'm not really a developer ^^ )


Brucey(Posted 2010) [#52]
I found a bug...

workDim supports values of 1 - 3.
Anything more than 1, and globalWorkSize and localWorkSize should be "arrays" of the same size.

My plan for this, is to create a new Method, called ExecuteDim(), and for Execute() remove the workDim parameter.
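
To illustrate the rule, here is a sketch in plain C of the shape check a multi-dimensional launch implies: workDim between 1 and 3, one global and one local size per dimension, and (per OpenCL 1.x) each global size a whole multiple of its local size. `total_work_items` is a hypothetical helper, not the planned ExecuteDim() itself.

```c
#include <stddef.h>

/* Returns the total number of work items for the launch, or 0 if the
   shape is not valid. Both arrays must hold work_dim entries. */
static size_t total_work_items(unsigned work_dim,
                               const size_t *global_work_size,
                               const size_t *local_work_size)
{
    if (work_dim < 1 || work_dim > 3) {
        return 0;
    }
    size_t total = 1;
    for (unsigned d = 0; d < work_dim; d++) {
        size_t global = global_work_size[d];
        size_t local = local_work_size[d];
        /* Each dimension must be non-empty and evenly divisible. */
        if (global == 0 || local == 0 || global % local != 0) {
            return 0;
        }
        total *= global;
    }
    return total;
}
```

For example, a 2D launch of 512 x 256 items in 16 x 16 groups is valid and covers 131072 work items, while a workDim of 4 is rejected.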


ker2x(Posted 2010) [#53]
My plan for this, is to create a new Method, called ExecuteDim(), and for Execute() remove the workDim parameter.


Looks good :)


ker2x(Posted 2010) [#54]
I created a github repository to OpenCLize this nice fluid simulation, here : http://github.com/ker2x/BM_Fluid


ker2x(Posted 2010) [#55]
Here is a patch for example 2 using ExecuteDim, including some documentation :




ker2x(Posted 2010) [#56]
oops, should be :
kernelSquare.ExecuteDim(workDim, globalWorkSize, localWorkSize)


instead of :
kernelSquare.ExecuteDim(1, globalWorkSize, localWorkSize)



Pete Rigz(Posted 2010) [#57]
Really nice work all. I'd love to have a play with this, getting Device Count = 0 on the first example though. Have an nvidia 8800GTS with the latest beta drivers (on win7 64). I'd say it's finding the opencl.dll ok as first time I ran this it didn't do anything until I installed the latest drivers. Anything obvious you think I might be missing?

Guess I can just go play on the mac, but it'd be nice to get it working on my pc :)


ker2x(Posted 2010) [#58]
Try using the drivers provided here : http://developer.nvidia.com/object/opencl-download.html

I also suggest the openCL profiler and Sample+SDK :)


Brucey(Posted 2010) [#59]
If you do this :
Local workDim:Int = 1
Local globalWorkSize:Int[workDim]
Local localWorkSize:Int[workDim]

you only have to change one thing.

:-)


ker2x(Posted 2010) [#60]
Indeed ;)


Pete Rigz(Posted 2010) [#61]
Those drivers got it working, thanks!


Samichan(Posted 2011) [#62]
Is this still being developed?

I got the module to build fine but examples crash with EAV at TCLPlatform.GetDevices()


both of these crash the same way

Windows 7 64
nvidia geforce gtx295

Windows 7 32
nvidia geforce 9600gt


Updated drivers and sdks today.
The example binaries (OCL and CUDA both) bundled with the sdk work fine on both machines, but the VC projects fail miserably as usual (trying to use VC++ 2008 E)


BLaBZ(Posted 2012) [#63]
Does this still work? Can I use OpenCL in BlitzMAX?