Open-sea hell !
| ||
The last thing you want when you are stuck in a boat, thousands of miles out in the ocean, is a storm... That would be open-sea hell ! Fortunately for the rest of us, we can sit here and do something fun with BlitzMax :-)

For this latest "module for a weekend" experiment, you will currently need a Mac with Snow Leopard, and perhaps an NVidia graphics card, although it works on my Mac Mini using its dual-core CPU.

BaH.OpenCL is a WIP wrapper for OpenCL, which allows you to write BlitzMax apps that can run code on supporting GPUs :-) Obviously, you can't sit and code a game on the GPU, since that's not what this library is designed for, but you can throw computation-intensive calculations at it, and it will happily go off and throw lots of available processor cores at them. Think of it a bit like having multi-threading without having to worry about the threads :-) The second included example uses both cores on my Mac (according to the very unscientific "top", which showed it running at 179% CPU).

The code you write for OpenCL is a variation of C99, and is compiled at runtime by the framework. It also has the capability to interact directly with OpenGL data, which might be useful for some texture manipulation or such things. The OpenCL API is quite extensive, with many functions for manipulating ints, floats, vectors, etc in very efficient ways. Check out the Quick Reference Card (PDF link on that page, and API starts from the middle of the second page) for a summary of the commands.

In theory, the module should work on all platforms, but for non-Mac platforms you would need to set up the appropriate NVidia or AMD/ATI SDK, which I have not had time to look into at the moment. Many of the Types and functions have been implemented, but it might take some work to get things just right, given the nature of the way it works.

Currently available from the SVN repository at googlecode. Fun fun ;-) |
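For anyone wondering what "multi-threading without the threads" means in practice, here is a rough sketch in plain C of the execution model. It is not the module's API and not real OpenCL: `square_work_item` and `run_over_range` are made-up names, and the loop only emulates sequentially what an OpenCL runtime would be free to spread across cores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an OpenCL kernel body: squares one element.
   In real OpenCL the index would come from get_global_id(0). */
static void square_work_item(size_t gid, const float *in, float *out)
{
    out[gid] = in[gid] * in[gid];
}

/* Sequential emulation of launching the kernel over a 1-D range.
   An OpenCL runtime may run these iterations in parallel and in any
   order, so each work-item must only write its own element. */
static void run_over_range(size_t global_size, const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    for (size_t gid = 0; gid < global_size; gid++) {
        square_work_item(gid, in, out);
    }
}
```

The point of the design is that you only write the per-element body; how the range is carved up between cores is left to the framework.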
| ||
On a plus note, the module compiles without error on Windows - without having any SDKs installed. :-) |
| ||
The module compiles on XP-SP3 but neither example will compile. Here's the error log for example_01.bmx |
| ||
"The module compiles on XP-SP3 but neither example will compile."
I wouldn't expect them to, without something to link to... like a DLL shipped with an SDK, or suchlike. |
| ||
Hrm.. Looks like you have to be signed up as a developer to get the AMD SDK. I won't bother then. |
| ||
Example output from example_01:

    Executing:example_01.debug
    Device Count = 1
    OpenCL Device information :
    deviceType : CL_DEVICE_TYPE_CPU
    vendorName : Intel
    deviceName : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
    driverVersion : 1.0
    deviceVersion : OpenCL 1.0
    maxComputeUnits : 2
    maxWorkItemDimensions : 3
    maxWorkItemSizes : (1, 1, 1)
    maxWorkGroupSize : 1
    maxClockFrequency : 2000
    maxMemAllocSize : 1073741824
    imageSupport : 1
    maxReadImageArgs : 128
    maxWriteImageArgs : 8
    image2dMaxWidth : 8192
    image2dMaxHeight : 8192
    image3dMaxWidth : 2048
    image3dMaxHeight : 2048
    image3dMaxDepth : 2048
    maxSamplers : 16
    maxParameterSize : 4096
    globalMemCacheSize : 4194304
    globalMemSize : 1610612733
    maxConstantBufferSize : 65536
    maxConstantArgs : 8
    localMemSize : 16384
    errorCorrectionSupport : 0
    profilingTimerResolution : 1
    endianLittle : 1
    profile : FULL_PROFILE
    extensions : cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions

This is what you get when you don't have a "proper" graphics card :-p

"Looks like you have to be signed up as a developer to get the AMD SDK."
So I see. But at least it's free :-) |
| ||
"But at least it's free :-)"
Yeah. |
| ||
Win32 examples should now compile without the link errors. They will expect a DLL called 'OpenCL.DLL' if you try to run them though... That's the name of the DLL that comes with the ATI SDK. Hopefully it'll be the same for the NVidia SDK, or we'll have a problem :-p Getting there... |
| ||
OpenCL is begging for a realtime ray-tracing engine to be written :) |
| ||
Looking forward to having the time to play with this :) Cheers Charlie |
| ||
Don't get too excited... Calling functions against the ATI library results in all kinds of weird and wonderful data corruption. This may be an issue of using MinGW. For example, I call a function that populates an int and, on returning, print the value of the int. It prints 1. I print the value again. It prints a very large number. I've read one report of a user having similar issues on the AMD forums, so it's hopefully not a Brucey issue per se... but I also cannot rule that out :-p Going to try the NVidia SDK now... sigh. |
| ||
Well, I got the NVidia version to run without crashing, although it returns 0 devices here. Quite understandable, given there's no NVidia graphics card, and Windows is running in Parallels... although I had hoped for perhaps CPU availability, but since none of the drivers install properly without a valid graphics card, I'm not overly concerned at this point. Oh well :-) |
| ||
Excited to see there's already an attempt on this! Uh, is it still available at google code? I feel a bit stupid not finding it.. |
| ||
Awesome, man! So this is your current focus, eh? Looks great! Oh, and ATI's stuff is maybe halfway between standards-compliant and "hacked up", I think. Go figure. I always buy Nvidia because they like standards. :) |
| ||
Apparently ATI have released (yet) another update to their SDK. Hopefully this has fixed the previous issues. Although I don't have any of the requisite hardware, it should always at least let you call into the library without spewing crap back at you. So when you ask it - how many processors can I use? - it should at least say 0... Of course, if it's a MinGW issue, then things will just need to be "worked around", I suppose... |
| ||
I would like to see this fluid simulation http://www.blitzbasic.com/Community/posts.php?topic=87491 pushed through OpenCL. Unfortunately, I don't have the time to put it together. Maybe someone else can as a test? |
| ||
Friendly greetings !! I'm new to BlitzMax (bought it yesterday), having switched from PureBasic. I'm not a programmer (Linux sysadmin and PostgreSQL DBA). I know (more or less, as a full-time sysadmin) various languages, played with OpenCL from Java and failed at using OpenCL from C/C++... One of the main reasons to switch to BlitzMax is your OpenCL module (and the wide range of supported game engines, and the fact that BlitzMax uses FASM, like PureBasic, so I can optimize my code using SSE2/3 code). Currently downloading the latest trunk of your impressive module list. I hope to be able to play with OpenCL (fractals, Monte Carlo, ...) without too much pain. So, simply, thank you ! I'll provide feedback as soon as possible :) |
| ||
Running example 1, I got:

    RUNTIME ERROR:Attempt to index array element beyond array length

According to the debugger (using BLide free edition):

    Function _setDevice:TCLDevice(list:TCLDevice[], index:Int, devicePtr:Byte Ptr, deviceType:Int)
        Local device:TCLDevice = _create(devicePtr)
        device.deviceType = deviceType
        list[index] = device   ' <=== IT FAILS HERE
        Return device
    End Function

The debug console shows:

    Local list:TCLDevice[0]=Null

I'd say that it fails at detecting an OpenCL device. Windows 7 64-bit Ultimate, NVidia 8800GTX, with OpenCL driver and (unrelated) CUDA SDK installed. OpenCL usually works on this computer. Same problem with example 2.

Edit: Stack trace:

    ~>Unhandled Exception:Attempt to index array element beyond array length
    ~>
    ~>StackTrace{
    ~>@...;
    ~>Function example_01
    ~>Local devices:TCLDevice[]=Null
    ~>@...;
    ~>Function GetDevices
    ~>Local deviceType:Int=-1
    ~>@...;
    ~>Function _setDevice
    ~>Local list:TCLDevice[]=$002c0770
    ~>Local index:Int=4204125
    ~>Local devicePtr:Byte Ptr=$001b2e50
    ~>Local deviceType:Int=4204125
    ~>Local device:TCLDevice=$002c0790
    ~>} |
| ||
More info about my card here (uploaded a few minutes ago): http://www.ozone3d.net/gpu/db/index.php?which=3a1322bb3d43f0e5d6f6f8c4f3e83698 |
| ||
My biggest problem is a lack of hardware to test the different drivers on Windows. When you try to install the drivers, they expect a supporting gfx card... if you don't have that installed, usually the drivers won't install. On Mac, you just load the library and it all "just works", even if you don't have the supporting hardware, because it falls back to the CPU. |
| ||
I added some debug output. For some reason, in _setDevice:

    Local deviceType:Int=4204125

So:

    Local index:Int=4204125

But list.length = 1. Here is the patched code:

    Function _setDevice:TCLDevice(list:TCLDevice[], index:Int, devicePtr:Byte Ptr, deviceType:Int)
        Local device:TCLDevice = _create(devicePtr)
        device.deviceType = deviceType
        DebugLog "list length = " + list.Length
        DebugLog "index = " + index
        list[index] = device
        Return device
    End Function

And the debug output:

    DebugLog:list length = 1
    DebugLog:index = 4204125

Now, why is the deviceType obviously wrong? ... Debug in progress :) |
| ||
As a workaround, I replaced:

    list[index] = device

with:

    list[0] = device

    OpenCL Device information :
    deviceType :
    vendorName : NVIDIA Corporation
    deviceName : GeForce 8800 GTX
    driverVersion : 195.62
    deviceVersion : OpenCL 1.0 CUDA
    maxComputeUnits : 16
    maxWorkItemDimensions : 3
    maxWorkItemSizes : (512, 512, 64)
    maxWorkGroupSize : 512
    maxClockFrequency : 1350
    maxMemAllocSize : 201326592
    imageSupport : 1
    maxReadImageArgs : 128
    maxWriteImageArgs : 8
    image2dMaxWidth : 8192
    image2dMaxHeight : 8192
    image3dMaxWidth : 2048
    image3dMaxHeight : 2048
    image3dMaxDepth : 2048
    maxSamplers : 16
    maxParameterSize : 4352
    globalMemCacheSize : 0
    globalMemSize : 805306368
    maxConstantBufferSize : 65536
    maxConstantArgs : 9
    localMemSize : 16384
    errorCorrectionSupport : 0
    profilingTimerResolution : 1000
    endianLittle : 1
    profile : FULL_PROFILE
    extensions : cl_khr_byte_addressable_store cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query

As a first project... looks like I didn't choose the easy way... *grins* |
| ||
Example 2 fails too... Not in the same place, but probably for the same reason.

    Function InitDevice:TCLPlatform(deviceType:Int)
        Local devices:TCLDevice[] = GetDevices(deviceType)
        If devices.length > 0 Then
            Local device:TCLDevice = devices[0]
            Local this:TCLPlatform = New TCLPlatform
            this.platformPtr = bmx_ocl_platform_init(this, device.devicePtr)   ' <= ERROR HERE
            this.device = device
            Return this
        End If
    End Function

    bad refs:obj=$2023e80 refs=$20207e1 |
| ||
It makes no sense to me... In:

    bmx_ocl_platform_getdevices(...) {
        ...
        BBArray * list = _bah_opencl_TCLDevice__newDeviceList(size);

When I add a debug statement in:

    Function _newDeviceList:TCLDevice[] (Count:Int)

it shows that Count = 1, so size = 1. Great ! Following that:

    for (int n = 0; n < size; n++) {
        cl_device_id device = devices[n];
        cl_device_type type;
        size_t s = 0;
        err = clGetDeviceInfo(device, CL_DEVICE_TYPE, sizeof(type), &type, &s);
        MaxCLDevice * dev = new MaxCLDevice(device);
        BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, n, dev, type);
        dev->SetHandle(obj);
    }

size should, obviously, still be 1, so the loop runs once with n = 0.

    Function _setDevice:TCLDevice(list:TCLDevice[], index:Int, devicePtr:Byte Ptr, deviceType:Int)

So, here, index should be 0. But... it's 4204125 *sigh* |
| ||
"But ... it's = 4204125"
I know... welcome to my world... :-p I'm not entirely sure what's going on with it. It's almost as if the data is being corrupted along the way somewhere. |
| ||
Heh, thanks :) Could it be some 64-bit oddity? Also... I'm compiling using MinGW:

    F:\MinGW\bin>gcc.exe --version
    gcc.exe (GCC) 3.4.5 (mingw-vista special r3) |
| ||
"Also... i'm compiling using MinGW"
I'm also hoping that MinGW isn't the limiting factor here. |
| ||
I bought BlitzMax to avoid doing OpenCL in C/C++... and the first thing I do is debugging a BlitzMax library written in... C ! Looks like your world is in the same galaxy as mine ;) |
| ||
It would probably be easier for me to work with if I had a box with Windows + NVidia... and a box with Windows + ATI... Alas... that is not the case, but I'm working through it. The fact that it does work as it should on Mac, means that I'm not going to give up on it :-p |
| ||
I appear to have made a little progress... In my debug... the size returned from clGetDeviceInfo() would be 1... then change to something else. Some tweaking has it remaining as 1. But it's still crashing. Still, it *feels* like a step in the right direction... |
| ||
Oops, I forgot to tell you that with (line 351):

    BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, n, dev, type);

the bmx code sees 4204125. But when I change it to:

    BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, 1, dev, type);

the bmx code sees 1. If I keep:

    BBObject * obj = _bah_opencl_TCLDevice__setDevice(list, n, dev, type);

and change:

    for (int n = 0; n < size; n++) {

to:

    for (int n = 0; n < 1; n++) {

I still get 4204125... |
| ||
I have got that bit working now... and it seems to work up to the point of the return from bmx_ocl_platform_getdevices() - at which point it crashes. It is even detecting a "type 2" device, which I think is CPU. I'll try and remove my debug and check in the changes, and see if you can get it to run a little further. |
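On the "type 2" device: in the OpenCL 1.0 header, cl_device_type is a 64-bit bitfield and CL_DEVICE_TYPE_CPU is (1 << 1), so 2 does indeed mean CPU. Below is a small C sketch of how a glue layer might decode the value and narrow it to a 32-bit Int. The constants are redefined under local names so it builds without an SDK, and both helper functions are hypothetical, not part of the module.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same values as the CL_DEVICE_TYPE_* macros in CL/cl.h (OpenCL 1.0),
   redefined under local names so no SDK is needed to compile this. */
#define DEV_TYPE_DEFAULT     (UINT64_C(1) << 0)
#define DEV_TYPE_CPU         (UINT64_C(1) << 1)
#define DEV_TYPE_GPU         (UINT64_C(1) << 2)
#define DEV_TYPE_ACCELERATOR (UINT64_C(1) << 3)
#define DEV_TYPE_ALL         UINT64_C(0xFFFFFFFF)

/* Hypothetical helper: name a device type reported for a single device.
   Anything that is not exactly one known flag is reported as "unknown",
   which makes corrupted values such as 4204125 easy to spot. */
static const char *device_type_name(uint64_t type)
{
    switch (type) {
    case DEV_TYPE_DEFAULT:     return "CL_DEVICE_TYPE_DEFAULT";
    case DEV_TYPE_CPU:         return "CL_DEVICE_TYPE_CPU";
    case DEV_TYPE_GPU:         return "CL_DEVICE_TYPE_GPU";
    case DEV_TYPE_ACCELERATOR: return "CL_DEVICE_TYPE_ACCELERATOR";
    default:                   return "unknown";
    }
}

/* Hypothetical helper: narrow the 64-bit bitfield to a signed 32-bit
   value, as a BlitzMax Int parameter would hold it. */
static int32_t device_type_to_bmx_int(uint64_t type)
{
    assert(type <= UINT64_C(0xFFFFFFFF)); /* all defined flags fit in 32 bits */
    uint32_t low = (uint32_t)type;
    int32_t out;
    memcpy(&out, &low, sizeof out);       /* reinterpret the bits, no overflow */
    return out;
}
```

With this mapping, CL_DEVICE_TYPE_ALL comes out as -1, which would match the deviceType:Int=-1 shown for GetDevices in the stack trace earlier in the thread.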
| ||
Updated SVN. I'm now investigating the crash out of bmx_ocl_platform_getdevices(). |
| ||
Updated and tested with a clean example and module: 4204125 |
| ||
I found some info on : http://forums.nvidia.com/index.php?showtopic=96942 and http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2070 |
| ||
I updated to the latest beta NVidia driver: both examples work \o/ Sorry about all the mess :) :) |
| ||
In your OpenCL code:

    __kernel void inverseSquareroot(__global float *a, __global float *result)
    {
        int gid = get_global_id(0);
        result[gid] = 1.0f/sqrt(a[gid]);
    }

It's faster if you use:

    __kernel void inverseSquareroot(__global float *a, __global float *result)
    {
        int gid = get_global_id(0);
        result[gid] = rsqrt(a[gid]);
    }

|
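When swapping one built-in for another like this, it can be worth checking on the host that the results are still close enough. Here is a minimal sketch in plain C; `max_inverse_sqrt_residual` is a made-up helper, and it assumes the input and result buffers have already been read back from the device. It avoids calling sqrt at all by checking that r * r * a stays close to 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side check for an inverse square root kernel.
   For an exact result r = 1/sqrt(a), r * r * a equals 1, so the
   returned value is the largest |r*r*a - 1| over all elements.
   A small relative error e in r shows up as roughly 2*e here. */
static double max_inverse_sqrt_residual(const float *a, const float *result,
                                        size_t n)
{
    assert(a != NULL && result != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double r = (double)result[i];
        double residual = r * r * (double)a[i] - 1.0;
        if (residual < 0.0) {
            residual = -residual;
        }
        if (residual > worst) {
            worst = residual;
        }
    }
    return worst;
}
```

What counts as "close enough" depends on the application, so the threshold is left to the caller.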
| ||
"Sorry about all the mess"
No problem. I'm just glad it's working for you now! :-) The problem with OpenCL at the moment is that it is still very early in its development, and the drivers are developing with it. But at least we can see some forward progress!! There may be some more things I am missing in the implementation... if you see anything important, let me know and I'll try to sort it out. Thanks for your patience, and I hope the rest of your BlitzMax experience is better than the beginning ! :-) |
| ||
I bought a nettop powered by the NVidia ION platform. Currently installing everything needed to play with OpenCL on it. I'll tell you if it works, and how slow it is :) |
| ||
I can't find what I'm missing... I try to compile the module but it cannot find opencl.lib |
| ||
Mmm nope, it's OK, the error was totally unrelated. And I couldn't compile this module "normally" with bmk or the BLide IDE because I put the module into mod\opencl.mod instead of mod\bah.mod\opencl.mod. Now it's OK:

    OpenCL Device information :
    deviceType : CL_DEVICE_TYPE_GPU
    vendorName : NVIDIA Corporation
    deviceName : ION LE
    driverVersion : 190.89
    deviceVersion : OpenCL 1.0
    maxComputeUnits : 1 <====== LOL :)
    maxWorkItemDimensions : 3
    maxWorkItemSizes : (512, 512, 64)
    maxWorkGroupSize : 512
    maxClockFrequency : 1100
    maxMemAllocSize : 134217728
    imageSupport : 1
    maxReadImageArgs : 128
    maxWriteImageArgs : 8
    image2dMaxWidth : 8192
    image2dMaxHeight : 8192
    image3dMaxWidth : 2048
    image3dMaxHeight : 2048
    image3dMaxDepth : 2048
    maxSamplers : 16
    maxParameterSize : 4352
    globalMemCacheSize : 0
    globalMemSize : 131792896
    maxConstantBufferSize : 65536
    maxConstantArgs : 9
    localMemSize : 16384
    errorCorrectionSupport : 0
    profilingTimerResolution : 1000
    endianLittle : 1
    profile : FULL_PROFILE
    extensions : cl_khr_byte_addressable_store cl_nv_compiler_options cl_nv_device_attribute_query cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics |
| ||
Example 2:

    Time elapsed for OpenCL run : 42219
    Time elapsed for BlitzMax run : 33343

Yup... the CPU code is faster than the GPU (probably because the ION shares some or all of its memory with the system, and example 2 is not computation-intensive but mostly I/O bound). BUT... it works !! That's all I want :) I can play and crash everything, and I'll never have to reboot my main desktop computer again :)

For reference, example 2 on my main computer:

    Time elapsed for OpenCL run : 1551
    Time elapsed for BlitzMax run : 8238

That's an 8800GTX on Win7 64-bit, quad-core Q6600, 8GB of RAM (8800GTX OpenCL details posted 3 days ago on this forum). My ION platform is a Packard-Bell (ewww!) C1600. |
| ||
"Time elapsed for OpenCL run : 1551
Time elapsed for BlitzMax run : 8238"
Well, that at least shows some of the potential, even if that example is very simple. Given a "proper" test, I am sure CPU-based code will be blown away! Nice to know that it works on different systems :-) |
| ||
Oops, I forgot that I was in debug mode. In release mode:

    Time elapsed for OpenCL run : 1548
    Time elapsed for BlitzMax run : 446

The CPU is faster on both computers. But, again, the example code is very inefficient for a GPU: too much memory access, not enough math :) I will write more complex OpenCL code and release it so you can provide a better example to unleash the massive power of GPGPU :) |
| ||
"But, again, the exemple code are very inefficient for GPU"
Yeah. I didn't really want to implement anything very big... there's only so much time in my day ;-) Thanks for testing it though. Your feedback helps a lot ! |
| ||
Is there a way to pass a cl_int as a kernel argument instead of a cl_mem? I get a compile error: unable to convert int to TCLmem (obviously !) |
| ||
Hehehe, I found why the OpenCL code was slow. It was not what I thought. Change:

    kernelSquare.Execute(1, Size)

to:

    kernelSquare.Execute(1, Size, 256)

Enjoy:

    Time elapsed for OpenCL run : 59
    Time elapsed for BlitzMax run : 450
    1.00000000 : 1.00000000
    4.00000000 : 4.00000000
    9.00000000 : 9.00000000
    16.0000000 : 16.0000000
    25.0000000 : 25.0000000
    36.0000000 : 36.0000000
    49.0000000 : 49.0000000
    64.0000000 : 64.0000000
    81.0000000 : 81.0000000
    100.000000 : 100.000000 |
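One thing to watch when passing an explicit local size: assuming the third argument to Execute() ends up as OpenCL's local work size (which is how it reads from the discussion here), OpenCL 1.0 requires the global work size to be a whole multiple of it. A common trick is to round the global size up and have the kernel skip the padding work-items. A rough C sketch, with `round_up_global_size` being a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is at
   least work_items. The kernel then needs a guard such as
   "if (gid < work_items)" so the padding work-items do nothing. */
static size_t round_up_global_size(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = work_items % local_size;
    if (remainder == 0) {
        return work_items;
    }
    return work_items + (local_size - remainder);
}
```

For example, 1000 elements with a local size of 256 would be launched as 1024 work-items, with the last 24 doing nothing.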
| ||
I used the OpenCL profiler with example 2, with some modifications:

    Const Size:Int = 1024 * 1024 * 64
    kernelSquare.Execute(1, Size, 512)

and with the CPU square function removed.

~93% of total time is spent on memory transfer (Host<->Device)
~6% of time is CPU time to "execute" the "Square" kernel
~1% is real GPU execution

And it is still around 10x faster than the CPU code :) |
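Those percentages say a lot about where to optimise. As a back-of-the-envelope sketch (plain C, using Amdahl's law, with `overall_speedup` being a made-up helper), you can estimate how much the whole run would gain from speeding up just one part:

```c
#include <assert.h>

/* Amdahl's law: overall speedup when only `fraction` of the run time
   (0..1) is made `factor` times faster and the rest is unchanged. */
static double overall_speedup(double fraction, double factor)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

Plugging in the numbers above: making the 1% of real GPU execution a billion times faster gives about 1.01x overall, while merely halving the 93% spent on transfers gives about 1.87x. So for this example the transfers are the thing to attack, not the kernel.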
| ||
"Is there a way to pass a cl_int as kernel arguments instead of a cl_mem ?"
Yes... via SetArgInt(), SetArgFloat(), etc... which I appear not to have implemented yet. Will sort that oversight out ASAP :-/ |
| ||
"And still around 10x faster than cpu code"
Now there's a statement to catch the eye! Thanks for exploring this. |
| ||
"I would like to see this fluid simulation http://www.blitzbasic.com/Community/posts.php?topic=87491 pushed through OpenCL. Unfortunately, I don't have the time to put it together. Maybe someone else can as a test?"
I'll try this weekend, if I can understand the code. :) (Remember that I discovered BlitzMax last weekend and I'm not really a developer ^^ ) |
| ||
I found a bug... workDim supports values of 1 - 3. Anything more than 1, and globalWorkSize and localWorkSize should be "arrays" of the same size. My plan for this, is to create a new Method, called ExecuteDim(), and for Execute() remove the workDim parameter. |
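To make the constraint concrete, here is a rough C sketch of the kind of check that could be run before a launch. It is not the module's code: `nd_range_is_valid` is a made-up name, and the limits are meant to be the maxWorkItemSizes and maxWorkGroupSize values a device reports, like the ones posted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pre-launch check. global, local and max_item_sizes are
   arrays with at least work_dim entries. Returns 1 if the geometry is
   consistent, 0 otherwise. */
static int nd_range_is_valid(unsigned work_dim,
                             const size_t *global, const size_t *local,
                             const size_t *max_item_sizes,
                             size_t max_group_size)
{
    assert(global != NULL && local != NULL && max_item_sizes != NULL);
    if (work_dim < 1 || work_dim > 3) {
        return 0;                       /* only 1 to 3 dimensions */
    }
    size_t group_items = 1;
    for (unsigned d = 0; d < work_dim; d++) {
        if (local[d] == 0 || local[d] > max_item_sizes[d]) {
            return 0;                   /* per-dimension device limit */
        }
        if (global[d] == 0 || global[d] % local[d] != 0) {
            return 0;                   /* global must be a multiple of local */
        }
        group_items *= local[d];
    }
    return group_items <= max_group_size; /* total items per work-group */
}
```

With the 8800 GTX limits of (512, 512, 64) and 512, a 2-D launch of 1024 x 1024 with 16 x 16 groups passes, while 32 x 32 groups fail because 1024 items per group exceeds 512.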
| ||
"My plan for this, is to create a new Method, called ExecuteDim(), and for Execute() remove the workDim parameter."
Looks good :) |
| ||
I created a github repository to OpenCLize this nice fluid simulation, here : http://github.com/ker2x/BM_Fluid |
| ||
Here is a patch for example 2 using ExecuteDim, including some documentation: |
| ||
Oops, it should be:

    kernelSquare.ExecuteDim(workDim, globalWorkSize, localWorkSize)

instead of:

    kernelSquare.ExecuteDim(1, globalWorkSize, localWorkSize) |
| ||
Really nice work all. I'd love to have a play with this, getting Device Count = 0 on the first example though. Have an nvidia 8800GTS with the latest beta drivers (on win7 64). I'd say it's finding the opencl.dll ok as first time I ran this it didn't do anything until I installed the latest drivers. Anything obvious you think I might be missing? Guess I can just go play on the mac, but it'd be nice to get it working on my pc :) |
| ||
Try using the drivers provided here : http://developer.nvidia.com/object/opencl-download.html I also suggest the openCL profiler and Sample+SDK :) |
| ||
If you do this:

    Local workDim:Int = 1
    Local globalWorkSize:Int[workDim]
    Local localWorkSize:Int[workDim]

you only have to change one thing. :-) |
| ||
Indeed ;) |
| ||
Those drivers got it working, thanks! |
| ||
Is this still being developed? I got the module to build fine, but the examples crash with an EAV at TCLPlatform.GetDevices(). Both of these crash the same way:

Windows 7 64, nvidia geforce gtx295
Windows 7 32, nvidia geforce 9600gt

Updated drivers and SDKs today. The example binaries (OCL and CUDA both) bundled with the SDK work fine on both machines, but the VC projects fail miserably as usual (trying to use VC++ 2008 E) |
| ||
Does this still work? Can I use OpenCL in BlitzMAX? |