mystrlen vs android-bionic’s strlen on ARM CPU

4 min readOct 2, 2012

After my last post I got a request from @trufae asking for a benchmark against Android’s bionic library so there I go. I begun compiling the available bionic’s strlen version and running the same test as in my previous post.

The first thing I noticed reading the bionic’s strlen version is the they use a better way to detect the null character, which reduced by 1 instruction my straight forward byte-to-byte search which, of course, would lead to a performance improvement. They also unrolled the loop a little bit more up to 4 ldr instructions.

# This is inline C assembly 
sub %[t], %[v], %[mask], lsr #7 
and %[t], %[t], %[mask] 
bics %[t], %[t], %[v]

versus

andnes	v4, v3, v1, LSR #8 
andnes	v4, v3, v1, LSR #16 
andnes	v4, v3, v1, LSR #24

From now on I will only show fitted version of the lines to reduce noise due to scheduling and other OS related delays.

As can be seen the bionic’s version outperforms mystrlen much more than mystrlen outperforms uclibc version. To test how much each bionic improvement affects performance I first imported to mystrlen the 3-instruction version of null character finding, which didn’t lead to noticeable performance improvement.

So the key must be in the unrolling. I unrolled mystrlen but, instead of using 4 ldr instruction, I used the block copy to reduce the number of instructions. The results are shown in the next figure:

Surprisingly, mystrlen outperform bionic’s strlen! Here are the numbers for the y = mx + c plots:

Now mystrlen is around 14% faster than bionic’s version and 29% faster than uClibc’s version.

Why? This is a little bit weird so I had a look at the binary produced by gcc for the bionic’s version and I found that the code wasn’t even close to the original inline assembly, gcc had just un-unrolled the code. Even using better optimization flags the generated code wasn’t better:

86e4: ldr r6, [r3], #4 
86e8: sub r4, r4, r3 
86ec: sub r5, r6, r2, lsr #7 
86f0: and r5, r5, r2 
86f4: bics r5, r5, r6 
86f8: ldreq r6, [r3], #4 
86fc: beq 86ec

Conclusion
With the new null checking algorithm and unrolling 4 loops mystrlen, is 29% faster than the uClibc version, which now makes a real difference. The processor where I have been conducting the tests is a arm926ejs and doesn’t even have the PLD instruction (preload data) which, I bet, would improve the performance of the algorithm.

Here is the latest version of mystrlen:

.globl mystrlen 
mystrlen: 
stmfd	sp!, {v1, v2, v3, v4, v5, v6, v7, lr} 
mov	v7, a1 
ldr	v6, =0x80808080 ## 
## un-aligned address till we get aligned 
## 
1:	
tst	v7, #3 
beq	0f 
ldrb	v1, [v7], #1 
tst	v1, #0xFF 
beq	4f 
bal	1b ## un-rolling strings 
## as few instructions in the loop 
## as possible
## - Check whether any position equals 0 
0:	
ldmfd	v7!, {v1, v2, v3, v4} 
sub	v5, v1, v6, LSR #7 
and	v5, v5, v6 
bics	v5, v5, v1 
bne	1f 
sub	v5, v2, v6, LSR #7 
and	v5, v5, v6 
bics	v5, v5, v2 
bne	2f 
sub	v5, v3, v6, LSR #7 
and	v5, v5, v6 
bics	v5, v5, v3 
bne	3f 
sub	v5, v4, v6, LSR #7 
and	v5, v5, v6 
bics	v5, v5, v4 
bne	4f beq	0b 
4:	
mov	v1, v4 
bal	0f 
3:	
mov	v1, v3 
sub	v7, v7, #4 
bal	0f 
2:	
mov	v1, v2 
sub	v7, v7, #8 
bal	0f 
1:	
sub	v7, v7, #12 
## After the loop we calculate the diff 
## between the end and the begining of the str 
## 
0:	
tst	v1, #0xFF 
subeq	v7, v7, #1 
tst	v1, #0xFF00 
subeq	v7, v7, #1 
tst	v1, #0xFF0000 
subeq	v7, v7, #1 
tst	v1, #0xFF000000 
4:	
subeq	v7, v7, #1 s
ub	a1, v7, a1 
ldmfd	sp!, {v1, v2, v3, v4, v5, v6, v7, pc}

Originally published at http://www.gabrielgonzalezgarcia.com on October 2, 2012.

mystrlen vs android-bionic’s strlen on ARM CPU

Written by Gabriel Gonzalez