Parallel Task Processing


BLaBZ(Posted 2015) [#1]
Has anyone implemented something like this -

https://software.intel.com/en-us/articles/designing-the-framework-of-a-parallel-game-engine

Any thoughts?

Does Unreal or CryEngine implement this? I don't believe Unity does, but it seems this may be a necessary piece of "know-how" in the game developer's toolbox in the future.


Yasha(Posted 2015) [#2]
That's certainly a valid use of the technology as it currently stands.

Personally I don't think this is the way of the future, though (perhaps it's the "way of the present") - actually, I think thread-based parallelism is likely to become less important to the average game developer and application programmer from here on out, not more.

Why? Because the multithreading model of parallelism isn't actually a fundamental paradigm shift or gamechanger technology. You're still programming sequentially, just with a small advantage gained from having a tiny handful of sequential programs running at once. Fundamentally, this model doesn't scale. You can make great use of four cores, and pretty decent use of 8 or 16, but once you get to 32 or 64 you're really running out of things to do in one program that's still structured around long sequential operations over broad, loose, uneven data structures. That article only splits the engine in a very coarse, high-level way, into a half dozen large task groups that are each still fundamentally sequential. You can split some of them further, but by the logic that leads to task-based division of structure, you end up with tighter and tighter data coupling the more finely you try to divide a task area (e.g. "physics" can be split off from "graphics" easily, but trying to split an active simulation in two requires much more synchronisation).
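
To make that concrete, here's a rough sketch of the coarse task-level split in C++ (C++ only because that's where the tools mentioned later - GCC, LLVM, OpenCL - live; the World type and the task functions are invented purely for illustration). The frame is still dominated by a sequential dependency chain, so adding threads only buys a small, fixed amount of overlap:

```cpp
// Coarse task parallelism: a handful of big, mostly sequential jobs per frame.
// Hypothetical types and task names, for illustration only.
#include <functional>
#include <thread>
#include <vector>

struct World { std::vector<float> positions, velocities; };

void step_physics(World& w)            { /* long sequential update of w */ }
void update_audio(const World& w)      { /* reads physics results */ }
void build_render_list(const World& w) { /* reads physics results */ }

void run_frame(World& w)
{
    // Physics must finish before the systems that read its output can start,
    // so most of the frame is still a sequential dependency chain.
    std::thread physics(step_physics, std::ref(w));
    physics.join();

    // Audio and rendering only read the world, so they can overlap each other.
    std::thread audio(update_audio, std::cref(w));
    std::thread render(build_render_list, std::cref(w));
    audio.join();
    render.join();
}
```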

My thinking is that the real way forward is vectorising everything, i.e. pushing all the major data crunching onto the GPU or GPU-like devices (like future expansions of AVX), for massive, low-level parallelism within single tasks. This flips the axes somewhat - tasks become compressed in time and expand in breadth - and it reduces the need for multiple processors to engage with one program (you can go back to sequencing tasks at the program toplevel). Vectorisation is effectively unlimited along its own axis (as GPUs are enthusiastically demonstrating), as long as you can redesign the basic architecture of your task around functionally pure execution units operating on rigidly structured data without sharing. This requires abandoning a lot of traditional gamedev code and data structures (i.e. deep instruction sequences operating in nondeterministic loops over trees), but it can be done for all simulation and logic code.
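
As a very rough sketch of that alternative (plain C++ again rather than actual GPU code, and the particle structure is invented for illustration), the work becomes one tiny, pure kernel applied uniformly across flat arrays - exactly the shape SIMD units or a GPU dispatch can spread across many lanes:

```cpp
// Data parallelism: one pure, identical operation per element, over flat
// structure-of-arrays data with no sharing between elements. Illustrative only.
#include <cstddef>
#include <vector>

struct Particles {
    std::vector<float> x, vx;   // rigidly structured, no cross-element sharing
};

// The "execution unit": pure and identical for every element.
inline float integrate(float x, float vx, float dt) { return x + vx * dt; }

void step(Particles& p, float dt)
{
    // Every iteration is independent, so this loop can be mapped onto SIMD
    // lanes or GPU threads; the breadth of the data is the axis that scales.
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] = integrate(p.x[i], p.vx[i], dt);
}
```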

Edit: the correct term for the "flipped axis", which I forgot while writing this, is "data parallelism" as opposed to "task parallelism".


ziggy(Posted 2015) [#3]
Yasha: I think the implementation of this runtime architecture should be kept out of the programming paradigm as much as possible, and the distribution of tasks left to a JIT-like compiler, since it should be decided according to the machine's current load. Expecting the code to be written with this in mind sounds a bit like the idea of an "inline" keyword in a procedural language: it was just a way of acknowledging that compilers weren't good enough at the time.
You can code in a way that makes this kind of optimization more likely to happen, in the same way you avoid creating garbage in GC languages, etc., but a lot of the time the compiler is smarter than you!


Yasha(Posted 2015) [#4]
That's an interesting idea, and it would be awesome. Personally I'm not aware of any language runtimes that can actually do that at the moment, though (GCC and LLVM still struggle to spot vectorisation opportunities that aren't either hinted or blatantly obvious; JS doesn't even try). I'm not sure the compilers are as smart as all that yet.

The thing about vectoring - as opposed to threading - is that it requires all operation channels to be identical. This makes handling the code easy for a machine because dependencies are easy to work out, but actually applying said code is only possible for data structured in certain ways. (e.g. it's easy to mistakenly write OpenCL code that never parallelises at all.)
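
A small illustration of the difference (plain C++ standing in for OpenCL/SIMD, with made-up function names): the first loop has completely independent channels and is trivially vectorisable, while the second has a loop-carried dependency, so a straightforward compiler has to run it in sequence no matter how many lanes the hardware offers:

```cpp
#include <cstddef>

// Independent channels: out[i] depends only on in[i], so GCC/LLVM can map
// this onto identical SIMD lanes without any hints.
void scale(float* out, const float* in, float k, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}

// Loop-carried dependency: each iteration reads the previous result, so the
// channels are not identical and the naive form cannot be vectorised as-is.
void prefix_sum(float* data, std::size_t n)
{
    for (std::size_t i = 1; i < n; ++i)
        data[i] += data[i - 1];
}
```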

You're definitely right about one big point though: having to think about enabling optimization is still the archaic way to design. The gamechanger would be programmers not needing to think about it, because they no longer write code that prevents optimization in the first place. In the same way that inlining is easy with current programs/languages because nobody directly manipulates the stack with `volatile auto` variables or whatever any more, vectoring would be easy in a future language/style where people don't even try to write to shared globals or run data-dependent loops and the like, because they're used to getting the job done with a uniform dataflow. They wouldn't even think in terms of mutable shared memory, just as coders in current GC languages don't usually think about the fact that object lifetimes end any more.
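
For instance (a made-up example, C++17 for illustration): the first version below funnels everything through a shared global, which is exactly the habit that blocks the tools, while the second expresses the same count as a pure map-plus-reduce over the data, which parallelises without any shared writes:

```cpp
#include <functional>
#include <numeric>
#include <vector>

int g_hits = 0;   // shared mutable global: every "lane" would contend on this

void count_hits_shared(const std::vector<float>& scores, float cutoff)
{
    for (float s : scores)
        if (s > cutoff) ++g_hits;
}

int count_hits_pure(const std::vector<float>& scores, float cutoff)
{
    // Uniform dataflow: map each element to 0/1 independently, then reduce.
    return std::transform_reduce(scores.begin(), scores.end(), 0, std::plus<>{},
                                 [cutoff](float s) { return s > cutoff ? 1 : 0; });
}
```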


An analogy (not related to parallelism): code would be "better" in general if everyone used a const-by-default style. Not because languages with that are easier to optimise, or because the compiler can't spot variables that are never mutated - but because if everyone worked without the idea of reusing variables unless absolutely necessary, they'd think in terms of a cleaner flow graph in the first place.
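
A tiny made-up example of what that looks like in practice (C++ again, values arbitrary): the first version keeps overwriting one mutable variable whose meaning changes at every step, while the second gives each intermediate result its own const name, so the flow of values is spelled out directly:

```cpp
// Reuse style: one variable, three different meanings over its lifetime.
float price_reuse(float price)
{
    float v = price;
    v = v * 1.2f;   // now it's the taxed price
    v = v - 5.0f;   // now it's the discounted, taxed price
    return v;
}

// Const-by-default style: each intermediate has exactly one meaning.
float price_const(float price)
{
    const float taxed      = price * 1.2f;
    const float discounted = taxed - 5.0f;
    return discounted;
}
```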


*(Posted 2015) [#5]
But wouldn't the idea of not reusing variables just make programs with bigger memory footprints? Surely reusing variables makes smaller, tighter programs?


Yasha(Posted 2015) [#6]
No, there's usually no direct connection between whether you do something like reuse variables and what the compiler emits. In practice, most compilers unpack your code under the hood so that variables are never reused anyway, before they begin their real work. As ziggy says above, the compiler is smarter than you. It isn't going to let a squishy human mess up its register allocations.
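
That "unpacking" is the standard single-assignment rewrite that compilers like GCC and LLVM perform before register allocation. A rough sketch of the idea (the function is made up, and the comments show the compiler's view rather than real output):

```cpp
// Footprint and register use are decided after the compiler has renamed
// every assignment into its own temporary, so reusing 't' saves nothing.
int reused(int a, int b)
{
    int t = a + b;   // compiler's view: t1 = a + b
    t = t * 2;       // compiler's view: t2 = t1 * 2
    t = t - a;       // compiler's view: t3 = t2 - a
    return t;        // registers are assigned to t1..t3, not to "t" itself
}
```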

Consider the usefulness of being able to just see those LLVM-style transformations for yourself though. The programmer who can do that has more freedom to create interesting code in the first place, never mind optimization.


*(Posted 2015) [#7]
Ah I see what ya mean now :)

Yeah that would be a brilliant idea :)