mystrlen vs android-bionic’s strlen on ARM CPU

Gabriel Gonzalez
4 min readOct 2, 2012

After my last post I got a request from @trufae asking for a benchmark against Android’s bionic library so there I go. I begun compiling the available bionic’s strlen version and running the same test as in my previous post.

The first thing I noticed reading the bionic’s strlen version is the they use a better way to detect the null character, which reduced by 1 instruction my straight forward byte-to-byte search which, of course, would lead to a performance improvement. They also unrolled the loop a little bit more up to 4 ldr instructions.

# This is inline C assembly 
sub %[t], %[v], %[mask], lsr #7
and %[t], %[t], %[mask]
bics %[t], %[t], %[v]

versus

andnes	v4, v3, v1, LSR #8 
andnes v4, v3, v1, LSR #16
andnes v4, v3, v1, LSR #24

From now on I will only show fitted version of the lines to reduce noise due to scheduling and other OS related delays.

As can be seen the bionic’s version outperforms mystrlen much more than mystrlen outperforms uclibc version. To test how much each bionic improvement affects performance I first imported to mystrlen the 3-instruction version of null character finding, which didn’t lead to noticeable performance improvement.

So the key must be in the unrolling. I unrolled mystrlen but, instead of using 4 ldr instruction, I used the block copy to reduce the number of instructions. The results are shown in the next figure:

Surprisingly, mystrlen outperform bionic’s strlen! Here are the numbers for the y = mx + c plots:

Now mystrlen is around 14% faster than bionic’s version and 29% faster than uClibc’s version.

Why? This is a little bit weird so I had a look at the binary produced by gcc for the bionic’s version and I found that the code wasn’t even close to the original inline assembly, gcc had just un-unrolled the code. Even using better optimization flags the generated code wasn’t better:

86e4: ldr r6, [r3], #4 
86e8: sub r4, r4, r3
86ec: sub r5, r6, r2, lsr #7
86f0: and r5, r5, r2
86f4: bics r5, r5, r6
86f8: ldreq r6, [r3], #4
86fc: beq 86ec

Conclusion
With the new null checking algorithm and unrolling 4 loops mystrlen, is 29% faster than the uClibc version, which now makes a real difference. The processor where I have been conducting the tests is a arm926ejs and doesn’t even have the PLD instruction (preload data) which, I bet, would improve the performance of the algorithm.

Here is the latest version of mystrlen:

.globl mystrlen 
mystrlen:
stmfd sp!, {v1, v2, v3, v4, v5, v6, v7, lr}
mov v7, a1
ldr v6, =0x80808080
##
## un-aligned address till we get aligned
##
1:
tst v7, #3
beq 0f
ldrb v1, [v7], #1
tst v1, #0xFF
beq 4f
bal 1b
## un-rolling strings
## as few instructions in the loop
## as possible
## - Check whether any position equals 0
0:
ldmfd v7!, {v1, v2, v3, v4}
sub v5, v1, v6, LSR #7
and v5, v5, v6
bics v5, v5, v1
bne 1f
sub v5, v2, v6, LSR #7
and v5, v5, v6
bics v5, v5, v2
bne 2f
sub v5, v3, v6, LSR #7
and v5, v5, v6
bics v5, v5, v3
bne 3f
sub v5, v4, v6, LSR #7
and v5, v5, v6
bics v5, v5, v4
bne 4f beq 0b
4:
mov v1, v4
bal 0f
3:
mov v1, v3
sub v7, v7, #4
bal 0f
2:
mov v1, v2
sub v7, v7, #8
bal 0f
1:
sub v7, v7, #12
## After the loop we calculate the diff
## between the end and the begining of the str
##
0:
tst v1, #0xFF
subeq v7, v7, #1
tst v1, #0xFF00
subeq v7, v7, #1
tst v1, #0xFF0000
subeq v7, v7, #1
tst v1, #0xFF000000
4:
subeq v7, v7, #1 s
ub a1, v7, a1
ldmfd sp!, {v1, v2, v3, v4, v5, v6, v7, pc}

Originally published at http://www.gabrielgonzalezgarcia.com on October 2, 2012.

--

--