Ref counting and Mark and sweep


ziggy(Posted 2013) [#1]
Has anyone any experience mixing the two? Reference counting could theoretically (not sure if it is a good idea) allow for faster collection of non-cyclic garbage at a very low cost, and mark-and-sweep could then run with less overhead (as most dead objects would already have been collected by the ref counting), so pauses would be smaller...
But, as with everything related to performance... you can't be sure unless you actually test it. So, before I spend a month or so on it: has anyone tested this before, and is there any conclusion that could be helpful?


Yasha(Posted 2013) [#2]
I wrote an obscenely long post on this topic yesterday which may be of some interest to you: http://www.blitzbasic.com/Community/post.php?topic=100424&post=1185874


Short version: there is no "best" solution to memory management - it depends very much on the language, and the programs you expect to be written in that language.

-- Which language? (If this is for your own language, which languages is it most like?)
-- Is it a toy, a research project, or production quality?
-- Do you care about real-time interactivity or only overall speed?
-- How many objects do you expect it to allocate?
-- How often? What is their lifespan likely to be?
-- How large are they? How much do they vary?

BlitzMax, for instance, counts as a language with a "small" number of "large" objects with "long" lifespans (objects with many fields, allocated into a collection and used throughout the playtime of a level). Lisp is a language with a "large" number of "varying" objects with "either long or short" lifespans (you might have some big vectors, but there are a lot of tiny cons cells to manage). Haskell is a language with a "very large" number of "small" objects with "extremely short" lifespans (huge numbers of temporary tuples created between function calls - think GB/sec).


The more you tend towards large numbers of small objects changing owners frequently, the less efficient refcounting becomes. Reference counting for a language at Haskell's extreme will be significantly slower than tracing, and if your objects are generally small, the count field itself may become a significant waste of memory (if you're a Lisp mostly playing with two-word cons cells, adding a third word for the count can increase memory usage by over 100%... see the sketch below). A language manipulating small objects will also benefit greatly from a tracing GC's ability to move objects around in memory to make the layout more cache-friendly. A language that uses a lot of similar tiny objects like cons cells may save on memory by compressing them into smaller spaces or sharing objects (e.g. hash-consing and cdr-coding are techniques that allow Lisps to share the same space for multiple objects; cdr-coding can remove all of the memory overhead of list nodes, for instance); using refcounting would prevent this sort of optimisation.
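For concreteness, here's what that overhead looks like in C. This is a hedged sketch; the layouts are illustrative, not any particular Lisp's real representation:

```c
#include <stdint.h>

/* A classic two-word cons cell: just car and cdr. */
typedef struct cons {
    void *car;
    void *cdr;
} cons;                   /* 16 bytes on a typical 64-bit machine */

/* The same cell with an inline reference count. */
typedef struct rc_cons {
    uintptr_t refcount;   /* the extra word */
    void *car;
    void *cdr;
} rc_cons;                /* 24 bytes raw, a 50% increase; allocators with
                             power-of-two size classes will often round three
                             words up to four, doubling the footprint */
```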

The more you tend towards small numbers of large objects, the more efficient refcounting becomes as there are fewer updates and it uses less memory. Towards the BlitzMax end of the scale refcounting may start to outperform tracing.

The general guideline nowadays is to assume tracing is faster than refcounting until you've proven otherwise. Note that a production-quality tracing GC will not search the entire dataset but will divide the data into generations, so it normally only needs to trace a few objects at a time (something like 90% of objects never live past their first assignment, so you mostly only need to repeatedly scan the last ~1MB of allocations). A production-quality reference counter will use static analysis (à la ARC) to remove as many of the refcount updates as possible at compile time; the goal with reference counting is, bizarrely, to avoid counting references wherever possible. Note also that performance can mean two different things: total execution time (almost always better with tracing), which disregards pauses and only measures time from start to finish; and real-time execution, which uses algorithms that are slower overall to ensure no noticeable pauses creep in.
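To illustrate what compile-time elision of refcount updates means in practice, here's a hedged sketch in C (rc_retain/rc_release and rc_obj are invented names, not ARC's actual runtime):

```c
#include <stddef.h>

/* Invented refcounting runtime, for illustration only. */
typedef struct rc_obj { size_t refcount; } rc_obj;

static void rc_retain(rc_obj *o)  { o->refcount++; }
static void rc_release(rc_obj *o) { if (--o->refcount == 0) { /* free */ } }

void use(rc_obj *o);   /* some operation that only borrows o */

/* What a naive refcounting compiler emits around a borrowed use: */
void naive(rc_obj *o) {
    rc_retain(o);      /* count goes up... */
    use(o);
    rc_release(o);     /* ...and straight back down */
}

/* What static analysis (ARC-style) can prove: the caller already keeps
 * o alive for the whole call, so the balanced retain/release pair is
 * redundant and can be deleted at compile time. */
void optimised(rc_obj *o) {
    use(o);            /* no counting on this path at all */
}
```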


In either case, a naive one-pass mark-and-sweep that doesn't move objects, and a naive reference counter that updates counts on every access, will both be slow compared to production runtimes. They are also by far the best options for a toy or research compiler (assuming it's not GC research), because they're simple and readable and you don't normally care about eking out every last bit of performance.
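For reference, the naive reference counter really is only a few lines. A minimal sketch, assuming an invented object header (cycles will leak, and a real runtime would also release the object's own references before freeing it):

```c
#include <stdlib.h>

typedef struct obj {
    size_t refcount;
    /* ...payload... */
} obj;

obj *obj_new(size_t size) {
    obj *o = malloc(size);
    if (o) o->refcount = 1;     /* the creator holds the first reference */
    return o;
}

void obj_retain(obj *o) {
    if (o) o->refcount++;       /* updated every time a new owner appears */
}

void obj_release(obj *o) {
    if (o && --o->refcount == 0)
        free(o);                /* collection is immediate and local */
}
```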

For a production-grade project... start by using a third-party GC and tweak as necessary. The Boehm GC is general-purpose and can slot into any language and is fast enough for real-world use. Although it pauses, it is likely to be much faster than a naive implementation of any algorithm because... well, it's a massive corporate effort.
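For a sense of how little code slotting Boehm in takes, here's a small sketch using its standard C API (assuming a typical libgc install; compile with -lgc):

```c
#include <gc.h>     /* Boehm-Demers-Weiser collector */
#include <stdio.h>

typedef struct node {
    struct node *next;
    int value;
} node;

int main(void) {
    GC_INIT();      /* recommended for portability */

    /* Allocate freely; there is no free() anywhere in this program. */
    node *head = NULL;
    for (int i = 1; i <= 1000000; i++) {
        node *n = GC_MALLOC(sizeof(node));  /* reclaimed once unreachable */
        n->value = i;
        n->next = head;
        head = n;
        if (i % 1000 == 0)
            head = NULL;    /* drop the whole list; the GC will find it */
    }

    printf("done\n");
    return 0;
}
```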

Other third-party options include ARC for refcounting (compile to something you can run through Clang), or the GC behind RScheme, which is real-time (no pauses) and designed to be retargetable.


On the matter of combining approaches, conventional wisdom is that combining refcounting and tracing isn't especially effective as you will end up with the worst of both worlds (the trace will have to run either way to catch the cycles, especially if it's conservative rather than precise; and you still have the overhead of the refcounts while you aren't tracing). The advanced research compilers MLKit and Stalin both ignore refcounting and instead use regions (a kind of pool allocation), which can be dramatically faster as long as the programs written don't exhibit "unusual" object lifespans (e.g. a lot of objects being created deep down the call chain and passed all the way up to the top level, bypassing region boundaries).
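To make the region idea concrete, here's a minimal arena-allocator sketch in C. The names and the fixed-size block are my own simplifications, not MLKit's or Stalin's actual machinery:

```c
#include <stdlib.h>

/* Objects are bump-allocated from a block and the whole region is
 * freed at once, with no per-object bookkeeping at all. */
typedef struct region {
    char  *block;
    size_t used;
    size_t capacity;
} region;

region *region_new(size_t capacity) {
    region *r = malloc(sizeof(region));
    r->block = malloc(capacity);
    r->used = 0;
    r->capacity = capacity;
    return r;
}

void *region_alloc(region *r, size_t size) {
    size = (size + 7) & ~(size_t)7;     /* keep 8-byte alignment */
    if (r->used + size > r->capacity)
        return NULL;                    /* real code: chain another block */
    void *p = r->block + r->used;
    r->used += size;
    return p;
}

/* Everything in the region dies together; this is why objects that
 * escape upward past region boundaries break the model. */
void region_free(region *r) {
    free(r->block);
    free(r);
}
```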


Finally... if your language has "interesting" semantics, you don't need to be constrained by convention! e.g. if you use a linear type system, you can get rid of runtime memory management entirely (except moving, if you want it)! The downside is that the language requires you to program in a very different way, and to be more aware of dataflow and ownership.


ziggy(Posted 2013) [#3]
Thanks Yasha!
It's a personal research project that isn't aimed at production quality; it's more of a learning project.
My idea was/is to try to get a working generational mark-and-sweep GC. I think the C++ implementation of the Boehm GC takes into account threading and all the abominations you can do with pointers in C++. I don't think I need something that complex for a start but, I don't know, it's the first time I've worked on creating a GC!


Yasha(Posted 2013) [#4]
If it's your very first time, a really simple GC to copy might be PicoLisp's: http://picolisp.com/5000/!wiki?home

One-pass, no moving, fits in ~20 lines of C (trivially possible to make it nonrecursive too, maybe make that your first upgrade). It's a good demonstration too of how sometimes "fast enough" is all you want, because its simplicity means it actually is pretty fast. Simplest GC implementation I know of.
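To give a flavour (this is not PicoLisp's actual code; the layout and names are invented), here's a minimal mark-and-sweep sketch along those lines, with the mark phase already made nonrecursive via an explicit stack, i.e. the "first upgrade" suggested above:

```c
#include <stdlib.h>

typedef struct cell {
    struct cell *car, *cdr;     /* NULL or pointers to other cells */
    int marked;
    struct cell *next_alloc;    /* intrusive list of every cell */
} cell;

static cell *all_cells = NULL;

cell *cell_new(void) {
    cell *c = calloc(1, sizeof(cell));
    c->next_alloc = all_cells;  /* remember it so sweep can find it */
    all_cells = c;
    return c;
}

static void mark(cell *root) {
    cell *stack[1024];          /* explicit mark stack; fixed size for brevity */
    int top = 0;
    if (root) stack[top++] = root;
    while (top > 0) {
        cell *c = stack[--top];
        if (!c || c->marked) continue;
        c->marked = 1;
        if (top <= 1022) {      /* real code: grow the stack or use pointer reversal */
            stack[top++] = c->car;
            stack[top++] = c->cdr;
        }
    }
}

static void sweep(void) {
    cell **link = &all_cells;
    while (*link) {
        cell *c = *link;
        if (c->marked) {
            c->marked = 0;      /* clear the bit for the next collection */
            link = &c->next_alloc;
        } else {
            *link = c->next_alloc;
            free(c);            /* unreachable: reclaim immediately */
        }
    }
}

void gc(cell *root) { mark(root); sweep(); }
```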

It is not generational; sadly, I can't recommend a "simple" generational GC to look at.


ziggy(Posted 2013) [#5]
@Yasha: How safe would it be to do the final collection of dead objects, in a mark-and-sweep, on a separate thread? If they're dead, shouldn't it be safe by definition? (I'm just sitting at my table with some paper and a pen, trying to organize ideas, but as you seem to have walked this road before, maybe you have something useful to say. I mean, something more.)

EDIT: The only thing that comes to mind is that any full memory copy to prevent fragmentation should (maybe?) be performed once the dead objects' memory is freed?


Yasha(Posted 2013) [#6]
There's some information that might help introduce this topic here and here.

I'll warn you now though: as with every other time threading rears its head, adding it to GC is going to be hard. I would investigate things like incremental or generational GC first if possible.


Anyway, the direct answer to your question is (annoyingly, as always) "it depends". Basically it depends on what is involved in freeing objects. Simple plain old data is normally not "freed" as such, especially in a copying GC, because it will have been allocated from a larger pool block and will just be forgotten about, to be overwritten either by a new object or when the data in the pool is compacted.

-- If the object isn't plain old data but something external that needs to be finalised, then wherever that external resource lives will need to be locked for the freeing thread.
-- Only one thread can compact a given pool at a time.
-- If a thread needs to access data from elsewhere in the program while freeing an object, it will need locked access to whatever is allocated there.

So, to grossly oversimplify, what it comes down to is that you end up having multiple smaller pools that objects are equally likely to be allocated from; some of these can be collected by threads in the background while others are doing useful work. Any one pool, though, has to be treated as a whole unit: that pool's work thread will be paused while it's being collected. If a collector thread needs access to information in a work thread's pool to finalise an object, the work will get paused; and if a work thread needs access to data in a pool currently being compacted, the work will have to wait. So it's only a pause-reduction system, not a complete pause-elimination system; eventually a collector thread is likely to have to lock a work thread for one reason or another (in practice a large enough number of pools may well be enough to eliminate all noticeable pauses, so it's still worth investigating). If you read the link, Microsoft are also happy to pause threads for young-generation collections because they feel it's fast enough not to be a problem in practice.
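As a hedged sketch of that per-pool idea in C with pthreads (names and structure invented for illustration, not any production collector's design):

```c
#include <pthread.h>
#include <stddef.h>

/* Each pool pairs its memory with a lock, so a collector thread can
 * sweep one pool while work threads keep using the others. */
typedef struct pool {
    pthread_mutex_t lock;   /* held by whoever is touching this pool */
    char  *block;
    size_t used, capacity;
} pool;

/* Work thread: allocate from "its" pool; it blocks only if a
 * collector currently holds this pool's lock. */
void *pool_alloc(pool *p, size_t size) {
    void *result = NULL;
    pthread_mutex_lock(&p->lock);
    if (p->used + size <= p->capacity) {
        result = p->block + p->used;
        p->used += size;
    }
    pthread_mutex_unlock(&p->lock);
    return result;
}

/* Collector thread: collect pools one at a time in the background.
 * Only the pool being collected is paused; the rest keep working. */
void collect_one(pool *p, void (*sweep_pool)(pool *)) {
    pthread_mutex_lock(&p->lock);   /* pauses that pool's work thread */
    sweep_pool(p);                  /* mark/sweep/compact just this pool */
    pthread_mutex_unlock(&p->lock);
}
```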

For eliminating pauses altogether you're likely to have better results from an incremental GC. The two could also be combined; it may be more practical to rely on single-threaded means to reduce or eliminate pauses, and to split the heap into separate areas simply for the purpose of enabling multithreading at all, rather than using multithreading as your principal GC accelerator.


(As for my own experience... extremely limited. I have never investigated implementing a complicated GC; in my own work, I'm mainly interested in static analysis to try to reduce the need for it at all, and just use a naive one to clean up the extras.)