Allocations are good, actually


date: 2019-04-07 | tags: programming, Rust | duration: 02:21 minutes

Something that you can often hear in the Rust community, especially from people who were previously C or C++ developers, is the adage "allocations are slow".

The other day a friend asked me how to create a consecutive list of numbers. I pointed her at (0..).take(x).collect(), which can be collected into a Vec<_> with a count of her choice. It did make me think, however, about how this could be done more elegantly in an allocation-free manner.
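Spelled out as a function (the name consecutive is just for illustration), that suggestion amounts to:

// Illustrative helper: collect the first `x` consecutive numbers.
fn consecutive(x: usize) -> Vec<u32> {
    (0..).take(x).collect()
}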

It led me to come up with the following code, which creates a [_; _] array, depending on which integer representation and length you choose.

// Note: `x` must be a compile-time constant here, because the
// size of the array `[0; x]` has to be known at compile time.
(0..)
  .zip(0..x)
  .fold([0; x], |mut acc, (step, i)| {
    acc[i] = step;
    acc
  })

So with this in mind, I wanted to run some comparisons. I chose the numbers so that 32768 consecutive numbers would be generated, and I compiled the example in both Debug and Release mode. (All of these measurements were done with rustc 1.33.0 (2aa4c46cf 2019-02-28).)

Let's start with the non-allocating version.
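(A quick note on methodology: the exact main() behind these numbers isn't shown here, but a minimal harness along these lines will do. The println! uses the result, so the optimizer can't throw the whole computation away.)

// Sketch of a benchmark harness for the non-allocating variant;
// for the allocating run, swap the body for the collect() version below.
const LENGTH: usize = 1024 * 32;

fn main() {
    let nums = (0..)
        .zip(0..LENGTH)
        .fold([0u32; LENGTH], |mut acc, (num, idx)| {
            acc[idx] = num;
            acc
        });
    // Use the result so the work can't be optimized out entirely.
    println!("{}", nums[LENGTH - 1]);
}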

$ time target/debug/playground
target/debug/playground  1.45s user 0.00s system 99% cpu 1.457 total
$ time target/release/playground
target/release/playground  0.27s user 0.00s system 99% cpu 0.270 total

Cool! So as you can see, the Release build is more than five times as fast. And performance-wise this seems quite reasonable.

Let's see how an allocating implementation stacks up against it. The code used here is the following.

let vec: Vec<u32> = (0..)
  .take(1024 * 32)
  .collect();

So how fast is this gonna be?

$ time target/debug/playground
target/debug/playground  0.01s user 0.00s system 93% cpu 0.010 total
$ time target/release/playground
target/release/playground  0.00s user 0.00s system 85% cpu 0.005 total

What? ...it's faster?!

Well, I guess this goes to show that it's not as simple as saying "allocations are bad". Avoiding allocations at all costs can actually slow you down.

Thanks for coming to my TED talk!

. . .

Yes but why?

Okay maybe you're more curious than that and want to understand what's going on here. So come along, let's read some assembly!

Let's focus mostly on the release profile here, because Debug generates a lot of code that makes it harder to understand. So we have two code snippets that we should throw into godbolt to see what rustc does.

// This doesn't allocate
const LENGTH: usize = 1024 * 32;

pub fn slice() -> [u32; LENGTH] {
    (0..)
        .zip(0..LENGTH)
        .fold([0; LENGTH], |mut acc, (num, idx)| {
            acc[idx] = num;
            acc
        })
}

// This does
pub fn vec() -> Vec<u32> {
    (0..).take(1024 * 32).collect()
}

Let's have a look at the assembly that the vec() function generates.

.LCPI0_0:
        .long   0
        .long   1
        .long   2
        .long   3

# ... snip ...

example::vec:
        push    rbx
        mov     rbx, rdi
        mov     edi, 131072
        mov     esi, 4
        call    qword ptr [rip + __rust_alloc@GOTPCREL]
        test    rax, rax
        je      .LBB0_4
        movdqa  xmm0, xmmword ptr [rip + .LCPI0_0]
        mov     ecx, 28
        movdqa  xmm8, xmmword ptr [rip + .LCPI0_1]
        movdqa  xmm9, xmmword ptr [rip + .LCPI0_2]
        movdqa  xmm10, xmmword ptr [rip + .LCPI0_3]
        movdqa  xmm4, xmmword ptr [rip + .LCPI0_4]
        movdqa  xmm5, xmmword ptr [rip + .LCPI0_5]
        movdqa  xmm6, xmmword ptr [rip + .LCPI0_6]
        movdqa  xmm7, xmmword ptr [rip + .LCPI0_7]
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_8]
.LBB0_2:
        movdqa  xmm2, xmm0
        paddd   xmm2, xmm8
        # ... snip ...
        ret
.LBB0_4:
        mov     edi, 131072
        mov     esi, 4
        call    qword ptr [rip + _ZN5alloc5alloc18...@GOTPCREL]
        ud2

(full code dump here)

As you can see, this uses movdqa, the "Move Aligned Packed Integer Values" instruction on x86_64. From some x86 docs:

Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second operand) to the destination operand (the first operand).

Basically, LLVM can figure out that our numbers form a predictable sequence, so it vectorizes the loop: instead of storing one counter at a time, it fills the buffer several elements per instruction using SIMD registers.
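To get a feel for what that means, here is a rough scalar sketch of what the vectorized loop does conceptually; it's illustrative only, but the real codegen does write each group of four u32s with a single 128-bit store:

// Conceptual sketch of the vectorized fill; not the actual codegen.
// The real loop performs each group of four stores as one SIMD write.
let mut buf = vec![0u32; 1024 * 32];
let mut i = 0;
while i < buf.len() {
    buf[i] = i as u32;
    buf[i + 1] = i as u32 + 1;
    buf[i + 2] = i as u32 + 2;
    buf[i + 3] = i as u32 + 3;
    i += 4;
}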

We can already see why the non-allocating code is going to be slower: the code that assigns the numbers is less understandable to the compiler (it threads the whole array through fold, assigning values to it sequentially), so it does not end up being batched in the same way.
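Hand-desugared, the fold's semantics look roughly like this (a sketch of what the language guarantees, not of the generated code), which hints at why the optimizer has a harder time with it:

// Rough hand-desugaring of the fold's semantics; illustrative only.
const LENGTH: usize = 1024 * 32;
let mut acc = [0u32; LENGTH];
for (num, idx) in (0..).zip(0..LENGTH) {
    // Each fold step takes the accumulator by value and returns it.
    // Arrays are Copy, so in unoptimized builds this can mean a real
    // 128 KiB copy per iteration, which would match the slow Debug time.
    let mut next = acc;
    next[idx] = num;
    acc = next;
}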

That's not to say that allocating code is going to be this fast on every platform (many RISC instruction sets offer far fewer vectorization facilities), and this doesn't even take embedded targets into account.

But there you have it.

LLVM is magic...

... and saying "allocations are bad" really isn't telling the whole story.