strchr() optimized for ARM and performance analysis
Note: This optimization are for an ARMv5 processor (arm926ejs) further improvement could be achieved in a newer ARM versions
The next function I want to play with is strchr() used for locating a char in a given string; returns a pointer to the first occurrence of the character or NULL if not found.
char *strchr(const char *s, int c);
The code I have come up with has 4 differentiated sections as can be seen in the code listing below:
- Lines 12–18: Address is un-aligned so we look for the given and null character and then leave or we hit an aligned address so we continue to the word-based code
- Lines 25–35: Read a word, check whether there is a 0x0 in any byte and save the shifted data for checking against the given char. If any of these two conditions matches we leave.
- Lines 39–61: After leaving the word-code we must check whether we read a null char and return 0 or the byte read matches the char we are looking for and return the proper address.
- Lines 67–69: Exit code for the un-aligned code, just check whether we left with 0 or not.
.globl mystrchr mystrchr: stmfd sp!, { v1, v2, v3, v4, v5, v6, lr } # a1 : src # a2 : char mov v6, #255 # When src pointer is un-aligned go Byte by Byte # till we reach an aligned address 1: and v1, a1, #3 beq 1f ldrb v1, [a1], #1 cmp v1, #0 cmpne v1, a2 beq exit bal 1b # Aligned mode # 1. Read Word # 2. Check whether we got a \0 and keep the shifted data for later use # 3. Find char # 4. If any of the above is true, break 1: ldr v5, [a1], #4 ands v1, v6, v5 andnes v2, v6, v5, LSR #8 andnes v3, v6, v5, LSR #16 andnes v4, v6, v5, LSR #24 eornes v5, a2, v1 eornes v5, a2, v2 eornes v5, a2, v3 eornes v5, a2, v4 beq check bal 1b # Char Found? # Check for 0x0 and reduce src pointer to the matched char check: cmp v1, #0 beq error eors v5, a2, v1 subeq a1, a1, #4 beq ret cmp v2, #0 beq error eors v1, a2, v2 subeq a1, a1, #3 beq ret cmp v3, #0 beq error eors v1, a2, v3 subeq a1, a1, #2 beq ret cmp v4, #0 beq error eors v1, a2, v4 subeq a1, a1, #1 beq ret # We shouldnt reach here, but just in case return NULL mov a1, #0 # If we leave with 0 do not substract exit: cmp v1, #0 subne a1, a1, #1 error: moveq a1, #0 ret: ldmfd sp!, { v1, v2, v3, v4, v5, v6, pc }
Benchmark Analysis
The next figure shows the performance of this algorithm against the one built in uClibc and shows that the longer the string length is the better mystrchr() outperforms the uClibc:
The next plot shows the difference between both times, positive means mystrch() did better job than uClibc’s implementation:
To better understand the data and quantify the performance we need to fit the previous two data-sets as shown in the next figure:
The parameters of the fitted lines (y = mx + c) are as follows:
The above analysis shows mystrchr() is around 9% faster than uClibc’s version but it is only noticeable for large input strings which is the less likely scenario. For a common usage both functions show the same performance.
The funny thing is that uClibc’s seems to be kind of partially optimized ASM function but it is not, it is just a simple C loop as follow and it is not even a good compiler job as can be also seen in the listing below, problem might just simply be that the 8 instructions in the loop for checking the exit condition are just too much.
Wchar *Wstrchr(register const Wchar *s, Wint c) { do { if (*s == ((Wchar)c)) { return (Wchar *) s; /* silence the warning */ } } while (*s++); return NULL; }00000000: 0: e20110ff and r1, r1, #255 ; 0xff 4: e5d03000 ldrb r3, [r0] 8: e1530001 cmp r3, r1 c: 012fff1e bxeq lr 10: e3530000 cmp r3, #0 14: 12800001 addne r0, r0, #1 18: 1afffff9 bne 4 1c: e1a00003 mov r0, r3 20: e12fff1e bx lr
Originally published at http://www.gabrielgonzalezgarcia.com on October 1, 2012.