ARM Optimized strlen()
Few weeks ago I started a personal project to enable lua programming under u-boot to allow easy access of peripherals and create small embeddable initialization scripts which I have missed during professional Embedded Development.
Another goal of this projects is to optimize u-boot’s ARM code so the first optimization go through string manipulation functions, today strlen().
The current strlen code is just a simple C loop looking for 0x0 to then return the number of Bytes read. This is far from optimum since it really miss some interesting ARM features and algorithmic improvements.
Here is the code I have come up after studying different scenarios, it sure isn’t perfect so please drop a line with improvements or fixes!
.globl mystrlen mystrlen: stmfd sp!, {v1, v2, v3, v4, lr} mov v2, a1 mov v3, #255 ## ## un-aligned address till we get aligned ## 1: ands v4, v2, #3 beq 0f ldrb v1, [v2], #1 ands v4, v3, v1 beq 1f bal 1b ## un-rolling strings ## as few instructions in the loop ## as possible ## - Check whether any position equals 0 0: ldr v1, [v2], #4 ands v4, v3, v1 andnes v4, v3, v1, LSR #8 andnes v4, v3, v1, LSR #16 andnes v4, v3, v1, LSR #24 bne 0b ## After the loop we calculate the diff ## between the end and the begining of the str ## ands v4, v3, v1 subeq v2, v2, #1 andnes v4, v3, v1, LSR #8 subeq v2, v2, #1 andnes v4, v3, v1, LSR #16 subeq v2, v2, #1 andnes v4, v3, v1, LSR #24 1: subeq v2, v2, #1 sub a1, v2, a1 ldmfd sp!, {v1, v2, v3, v4, pc}
The algorithm can be divided in three different stages:
- Lines 11–16: When the memory address is un-aligned we go byte-by-byte till we find 0x0 or we reach an 4-Bytes aligned memory so we jump to the unrolling loop
- Lines 22–27: When on aligned memory we read a word and mask each Byte looking for 0x0. Thanks to ARM instruction set we can save a few cycles when finding 0x0 in the least significant Byte.
- Lines 32–39: Onces finished depending whether the end-of-string Byte is we subtract from 1 to 4 bytes of the current mem address so we can easily return the length
Bencharmk
I have checked the performance of this algorithms against a simple C loop looking for ‘\0’ and uClibc’s implementation. The tests where carried calculating the lengths for strings from 1 to 99999 characters averaged over 3 trials to reduce outliers.
There is a huge difference between uClibc’s strlen/mystrlen and the simple C loop, so my optimization has passed the first test
This figure shows both algorithms are really close in performance, it seems mystrlen is just slightly below uClibc’s strlen but not for being even noticeable.
This figure shows the for each string length the time difference, when it is above 0 it tells mystrlen is better than uClibc’s strlen. Given that the time units are micro-seconds we can say both algorithms has the same performance; further data analysis shows mystrlen is slightly better.
Average: 0.88
Std Dev: 13.75
Conclusion
The ARM optimization shown here is at least as good as uClibc’s ARM specific strlen. One funny thing is that for 1 char mystrlen is 5x faster than uClibc’s strlen.
Originally published at http://www.gabrielgonzalezgarcia.com on September 27, 2012.