Jack Dermody

C# Block Allocation

Block allocation is a technique that can improve the performance of object allocation: a large block of memory is reserved at a time, and objects are then handed out from that block.

This is particularly useful when creating a large number of smaller objects that are all deallocated at once when some unit of work is complete.

Benefits of Block Allocation

There are several benefits to using block allocation: many small allocations are replaced by a few large ones, reducing per-object overhead; objects in a block sit in contiguous memory, which reduces fragmentation and improves cache locality; and freeing a whole block at once simplifies memory management.

Drawbacks of Block Allocation

There are also a few drawbacks to using block allocation: memory is reserved in large chunks, so a partially filled block wastes space; individual objects cannot be freed ahead of their block; and taking over allocation adds complexity, including the risk of leaks when the memory is unmanaged.

When to Use Block Allocation

Block allocation is a good choice when you need to allocate a large number of objects, reduce memory fragmentation, or simplify memory management.

.NET Block Allocation

This approach works efficiently in languages like C++ with the support of the placement new operator.

However, it can also be adapted for use in .NET with some considerations, offering considerable performance improvements.

Scenario

Let's consider a scenario where we process a substantial amount of data and generate summaries as we proceed.

In large datasets, profiling reveals that a significant portion of time is spent on object allocation.

Since we know that we'll require many such objects, is there a way to optimize this process?

interface IEvent
{
    ulong Id { get; }
    double Score { get; }
    byte Attribute1 { get; }
    byte Attribute2 { get; }
    float ConsolidatedScore { get; }
}

struct SomeStruct : IEvent
{
    public ulong Id { get; set; }
    public double Score { get; set; }
    public byte Attribute1 { get; set; }
    public byte Attribute2 { get; set; }
    public float ConsolidatedScore { get; set; }
}

class SomeClass : IEvent
{
    public ulong Id { get; set; }
    public double Score { get; set; }
    public byte Attribute1 { get; set; }
    public byte Attribute2 { get; set; }
    public float ConsolidatedScore { get; set; }
}

The initial approach might be to investigate whether using a struct for the objects yields better performance. Structs are stored inline rather than as separate heap objects: a local struct lives on the stack, and an array of structs is a single block of contiguous memory. Indeed, in some cases, this approach can be faster.

However, a caveat arises when dealing with boxing. If we refer to the events via an interface or store them in a data structure that is not backed by an array, then the structs will be boxed, resulting in significantly worse performance compared to using a class.
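To see the caveat concretely, here is a minimal sketch (using trimmed, one-property copies of the IEvent and SomeStruct types above): assigning a struct to an interface-typed variable allocates a separate copy on the heap.

```csharp
using System;

// Trimmed copies of the IEvent/SomeStruct types above, kept to one property.
interface IEvent { ulong Id { get; } }
struct SomeStruct : IEvent { public ulong Id { get; set; } }

class BoxingDemo
{
    static void Main()
    {
        var evt = new SomeStruct { Id = 1 };
        IEvent boxed = evt;          // boxing: a separate copy is allocated on the heap
        evt.Id = 2;                  // mutates the original value, not the boxed copy
        Console.WriteLine(boxed.Id); // prints 1
    }
}
```

Every such assignment pays for a heap allocation, which is why the boxed-struct benchmark below is so much slower.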

Let's compare the performance of allocating objects using classes and structs:

using System;
using System.Diagnostics;

static void TimeIterations(string name, Action<int> callback)
{
    var sw = new Stopwatch();
    sw.Start();
    for (var i = 0; i < 1024 * 1024 * 32; i++)
        callback(i);
    sw.Stop();
    Console.WriteLine($"{name}: {sw.ElapsedMilliseconds:N0}ms");
}

var blocks = new List<SomeClass>();
var blocks2 = new List<SomeStruct>();
var blocks3 = new List<IEvent>();
TimeIterations("gc allocator (class)", i => blocks.Add(new SomeClass()));
TimeIterations("gc allocator (struct)", i => blocks2.Add(new SomeStruct()));
TimeIterations("gc allocator (boxing struct)", i => blocks3.Add(new SomeStruct()));

class allocation: 4,048ms

struct allocation: 3,714ms

boxing struct allocation: 7,414ms

From the results, we observe a slight improvement when using structs compared to classes, but a significant penalty when dealing with boxed structs.

Block Allocator

To achieve even better performance, we can take control of memory allocations ourselves using the unmanaged heap (not garbage-collected).

Of course, it is essential to remember to free the memory to avoid memory leaks.

We will also need to enable unsafe compilation and write code that resembles traditional C++.
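For reference, in an SDK-style project the unsafe switch is a single property in the csproj:

```xml
<PropertyGroup>
  <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
</PropertyGroup>
```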

unsafe struct DataWrapper
{
    readonly SomeStruct* _ptr;

    public DataWrapper(SomeStruct* ptr)
    {
        _ptr = ptr;
    }

    public ulong Id {
        get => _ptr->Id;
        set => _ptr->Id = value;
    }

    public double Score {
        get => _ptr->Score;
        set => _ptr->Score = value;
    }

    public byte Attribute1 {
        get => _ptr->Attribute1;
        set => _ptr->Attribute1 = value;
    }

    public byte Attribute2 {
        get => _ptr->Attribute2;
        set => _ptr->Attribute2 = value;
    }

    public float ConsolidatedScore {
        get => _ptr->ConsolidatedScore;
        set => _ptr->ConsolidatedScore = value;
    }
}

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

class UnmanagedAllocator<T> : IDisposable
{
    readonly int _size, _blockSize, _allocationSize;
    readonly List<IntPtr> _buffers = new List<IntPtr>();
    readonly Func<IntPtr, T> _wrapperCreator;
    IntPtr _curr = IntPtr.Zero;
    int _index = -1;

    public UnmanagedAllocator(int typeSize, Func<IntPtr, T> wrapperCreator, int blockSize = 32768)
    {
        _size = typeSize;
        _blockSize = blockSize;
        _allocationSize = _size * _blockSize;
        _wrapperCreator = wrapperCreator;
    }

    public T Allocate()
    {
        if (_index < 0 || _index >= _blockSize) {
            _buffers.Add(_curr = Marshal.AllocHGlobal(_allocationSize));
            _index = 0;
        }

        return _wrapperCreator(_curr + (_index++ * _size));
    }

    public void Dispose()
    {
        foreach (var buffer in _buffers)
            Marshal.FreeHGlobal(buffer);
        _buffers.Clear();
        _curr = IntPtr.Zero;
        _index = -1; // force a fresh block on the next Allocate
    }
}

// an unsafe context is required for sizeof(SomeStruct) and the pointer cast
var allocator = new UnmanagedAllocator<DataWrapper>(
    sizeof(SomeStruct),
    ptr => new DataWrapper((SomeStruct*) ptr)
);

To avoid returning pointers from the allocator, we provide a "safe" wrapper that hides the pointer. The block allocator reserves enough memory for 32,768 structs each time it runs out, significantly reducing the number of individual allocations, which in this scenario goes down from 33,554,432 to just 1024!
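The harness call for the block allocator looks roughly like this (a sketch assuming the TimeIterations helper and the allocator above; the label is illustrative):

```csharp
TimeIterations("block allocator", i => {
    var evt = allocator.Allocate();   // a DataWrapper over unmanaged memory
    evt.Id = (ulong)i;                // writes go straight through the pointer
});
```

When the unit of work completes, disposing the allocator releases every block at once.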

Results

The block allocator completes the same benchmark in roughly half the time of allocating structs via the CLR.

If we use StructLayout with Pack=1, we may also be able to reduce the allocation size, albeit with some potential performance cost.
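As a sketch of the idea (the PackedStruct name is illustrative): with Pack = 1, the padding that normally aligns ConsolidatedScore is removed, shrinking each element from 24 to 22 bytes.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct PackedStruct
{
    public ulong Id;                 // offset 0
    public double Score;             // offset 8
    public byte Attribute1;          // offset 16
    public byte Attribute2;          // offset 17
    public float ConsolidatedScore;  // offset 18 (unaligned: reads may be slightly slower)
}

class PackDemo
{
    static void Main()
    {
        // 8 + 8 + 1 + 1 + 4 = 22 bytes; the default layout pads this to 24
        Console.WriteLine(Marshal.SizeOf<PackedStruct>()); // prints 22
    }
}
```

Over 33 million elements that saves roughly 64MB, at the cost of unaligned access to the float field.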

Since the memory is unmanaged, there is no need to pin it, and the pointers remain valid until the allocator is disposed.

This lends itself to a unit of work pattern: create an allocator, complete the unit of work while allocating lots of objects, and then dispose of the allocator when done, freeing all allocated objects at once.

This approach can lead to significant performance improvements in scenarios involving large-scale object allocation.