GSoC status update #2 (memcmp) DRAFT
05-06-2024
Table of Contents
What is memcmp even?
The man page on FreeBSD has the following to say:
MEMCMP(3) FreeBSD Library Functions Manual MEMCMP(3)
NAME
memcmp – compare byte string
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <string.h>
int
memcmp(const void *b1, const void *b2, size_t len);
DESCRIPTION
The memcmp() function compares byte string b1 against byte string b2.
Both strings are assumed to be len bytes long.
RETURN VALUES
The memcmp() function returns zero if the two strings are identical,
otherwise returns the difference between the first two differing bytes
(treated as unsigned char values, so that ‘\200’ is greater than ‘\0’,
for example). Zero-length strings are always identical.
SEE ALSO
bcmp(3), strcasecmp(3), strcmp(3), strcoll(3), strxfrm(3),
timingsafe_memcmp(3), wmemcmp(3)
STANDARDS
The memcmp() function conforms to ISO/IEC 9899:1990 (“ISO C90”).
FreeBSD 15.0-CURRENT August 15, 2016 FreeBSD 15.0-CURRENT
Now when starting off coding anything I like to start off with a minimum
example, a PoC (Proof of Concept). So for these string functions I first make a
small wrapper program to either call the libc function or my assembly function.
Then I compare the output so get some quick turnaround times before integrating
it into libc and running the proper unit tests.
So I did just that, and to my surprise libc wasn’t returning the expected
values.
Here be dragons 🐉
So when I run my little wrapper program with the following:
#include <stdio.h>
#include <string.h>
int main() {
printf("%i\n", memcmp("\200", "\0", 1));
}
It returns 1, thats odd. However, ISO/IEC 9899:1990 only specifies that a number greater than, equal to, or less than zero shall be returned. So going by the standard it’s a valid return value. But it doesn’t match up with what the FreeBSD man page states.
But if we edit our program a bit like the following we notice something interesting.
#include <stdio.h>
#include <string.h>
int main() {
printf("%i\n", memcmp((char[] ) {"\200"}, "\0", 1));
}
It returns 128
, as expected!
The reason for this is that on Aarch64
we have inherited string functions from
the Arm Optimized Routines
. These functions aren’t originally written with
FreeBSD in mind and this has simply gone unnoticed. If we instead glance over at
the manpage for memcmp on glibc
we see the following regarding its return
value:
RETURN VALUE
The memcmp() function returns an integer less than, equal to, or greater than
zero if the first n bytes of s1 is found, respectively, to be less than, to
match, or be greater than the first n bytes of s2.
For a nonzero return value, the sign is determined by the sign of the difference
between the first pair of bytes (interpreted as unsigned char) that differ in s1
and s2.
If n is zero, the return value is zero.
I strongly suspect that nobody is checking the return value beyond <, =, or >, except for the units tests… Which do fail on FreeBSD!
I generated this result page by using the kyua
command kyua report-html -r
${TEST}.db
, everything you need to know about the FreeBSD test suite is
available on the FreeBSD wiki.
Finally, these inherited Arm Optimized Routines
are located in
/usr/src/contrib/arm-optimized-routines
and plugged in to the string functions
Makefile.inc
like the following:
AARCH64_STRING_FUNCS= \
memchr \
memcmp \
memcpy \
memmove \
memrchr \
memset \
stpcpy \
strchr \
strchrnul \
strcmp \
strcpy \
strlen \
strncmp \
strnlen \
strrchr
#
# Add the above functions. Generate an asm file that includes the needed
# Arm Optimized Routines file defining the function name to the libc name.
# Some file need multiple macros defined or a weak symbol added we can
# override the generated file in these cases.
#
.for FUNC in ${AARCH64_STRING_FUNCS}
.if !exists(${FUNC}.S)
${FUNC}.S:
printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
CLEANFILES+= ${FUNC}.S
.endif
MDSRCS+= ${FUNC}.S
CFLAGS.${FUNC}.S+=-I${SRCTOP}/contrib/arm-optimized-routines/string
.endfor
There we can swap out the faulty one for a compliant compiler generated implementation, but lets get going with porting ours!
@@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
strchrnul \
strcmp \
strcpy \
- memcmp \
strncmp \
strnlen \
strrchr
+MDSRCS+= \
+ memcmp.S
#
# Add the above functions. Generate an asm file that includes the needed
# Arm Optimized Routines file defining the function name to the libc name.
These functions already have ARM NEON enchancements from the Arm Optimized Routines
memchr | x
memcmp | x
memcpy | NO! (SWAR)
memmove | Implemented in memcpy
memrchr | x
memset | x
stpcpy | x
strchr | x
strchrnul | x
strcmp | NO! (SWAR)
strcpy | x
strlen | x
strncmp | NO! (SWAR)
strnlen | x
strrchr | x
Now I wont be touching memcpy as its implementation could benefit from another
design (see XXX). But if we not ignore the functions with pre-existing
implementations by ARM we are left with the following list. But lets finish off
memcmp
first because we already got started on it and maybe we can beat Arm at
their own game.
FUNCTION AMD64
bcmp S1
index S1
memccpy S1
rindex S1
stpncpy S1
strcat S1
strcmp S1
strcspn S2
strlcat S1
strlcpy S1
strncat S1
strncmp S1
strncpy S1
strpbrk S2
strsep S2
strspn S2
timingsafe_bcmp S1
Scalar compiler output
After the above small hiccup I took a look at the .c
code for memcmp and its
generated assembly. Nothing particularly notable here
amd64 SIMD implementation
Now this is a long one
ARCHENTRY(memcmp, baseline)
cmp $32, %rdx # enough to permit use of the long kernel?
ja .Llong
test %rdx, %rdx # zero bytes buffer?
je .L0
/*
* Compare strings of 1--32 bytes. We want to do this by
* loading into two xmm registers and then comparing. To avoid
* crossing into unmapped pages, we either load 32 bytes from
* the start of the buffer or 32 bytes before its end, depending
* on whether there is a page boundary between the overread area
* or not.
*/
/* check for page boundaries overreads */
lea 31(%rdi), %eax # end of overread
lea 31(%rsi), %r8d
lea -1(%rdi, %rdx, 1), %ecx # last character in buffer
lea -1(%rsi, %rdx, 1), %r9d
xor %ecx, %eax
xor %r9d, %r8d
test $PAGE_SIZE, %eax # are they on different pages?
jz 0f
/* fix up rdi */
movdqu -32(%rdi, %rdx, 1), %xmm0
movdqu -16(%rdi, %rdx, 1), %xmm1
lea -8(%rsp), %rdi # end of replacement buffer
sub %rdx, %rdi # start of replacement buffer
movdqa %xmm0, -40(%rsp) # copy to replacement buffer
movdqa %xmm1, -24(%rsp)
0: test $PAGE_SIZE, %r8d
jz 0f
/* fix up rsi */
movdqu -32(%rsi, %rdx, 1), %xmm0
movdqu -16(%rsi, %rdx, 1), %xmm1
lea -40(%rsp), %rsi # end of replacement buffer
sub %rdx, %rsi # start of replacement buffer
movdqa %xmm0, -72(%rsp) # copy to replacement buffer
movdqa %xmm1, -56(%rsp)
/* load data and compare properly */
0: movdqu 16(%rdi), %xmm1
movdqu 16(%rsi), %xmm3
movdqu (%rdi), %xmm0
movdqu (%rsi), %xmm2
mov %edx, %ecx
mov $-1, %edx
shl %cl, %rdx # ones where the buffer is not
pcmpeqb %xmm3, %xmm1
pcmpeqb %xmm2, %xmm0
pmovmskb %xmm1, %ecx
pmovmskb %xmm0, %eax
shl $16, %ecx
or %ecx, %eax # ones where the buffers match
or %edx, %eax # including where the buffer is not
not %eax # ones where there is a mismatch
#ifndef BCMP
bsf %eax, %edx # location of the first mismatch
cmovz %eax, %edx # including if there is no mismatch
movzbl (%rdi, %rdx, 1), %eax # mismatching bytes
movzbl (%rsi, %rdx, 1), %edx
sub %edx, %eax
#endif
ret
/* empty input */
.L0: xor %eax, %eax
ret
/* compare 33+ bytes */
ALIGN_TEXT
.Llong: movdqu (%rdi), %xmm0 # load head
movdqu (%rsi), %xmm2
mov %rdi, %rcx
sub %rdi, %rsi # express rsi as distance from rdi
and $~0xf, %rdi # align rdi to 16 bytes
movdqu 16(%rsi, %rdi, 1), %xmm1
pcmpeqb 16(%rdi), %xmm1 # compare second half of this iteration
add %rcx, %rdx # pointer to last byte in buffer
jc .Loverflow # did this overflow?
0: pcmpeqb %xmm2, %xmm0
pmovmskb %xmm0, %eax
xor $0xffff, %eax # any mismatch?
jne .Lmismatch_head
add $64, %rdi # advance to next iteration
jmp 1f # and get going with the loop
/*
* If we got here, a buffer length was passed to memcmp(a, b, len)
* such that a + len < a. While this sort of usage is illegal,
* it is plausible that a caller tries to do something like
* memcmp(a, b, SIZE_MAX) if a and b are known to differ, intending
* for memcmp() to stop comparing at the first mismatch. This
* behaviour is not guaranteed by any version of ISO/IEC 9899,
* but usually works out in practice. Let's try to make this
* case work by comparing until the end of the address space.
*/
.Loverflow:
mov $-1, %rdx # compare until the end of memory
jmp 0b
/* process buffer 32 bytes at a time */
ALIGN_TEXT
0: movdqu -32(%rsi, %rdi, 1), %xmm0
movdqu -16(%rsi, %rdi, 1), %xmm1
pcmpeqb -32(%rdi), %xmm0
pcmpeqb -16(%rdi), %xmm1
add $32, %rdi # advance to next iteration
1: pand %xmm0, %xmm1 # 0xff where both halves matched
pmovmskb %xmm1, %eax
cmp $0xffff, %eax # all bytes matched?
jne .Lmismatch
cmp %rdx, %rdi # end of buffer reached?
jb 0b
/* less than 32 bytes left to compare */
movdqu -16(%rdx), %xmm1 # load 32 byte tail through end pointer
movdqu -16(%rdx, %rsi, 1), %xmm3
movdqu -32(%rdx), %xmm0
movdqu -32(%rdx, %rsi, 1), %xmm2
pcmpeqb %xmm3, %xmm1
pcmpeqb %xmm2, %xmm0
pmovmskb %xmm1, %ecx
pmovmskb %xmm0, %eax
shl $16, %ecx
or %ecx, %eax # ones where the buffers match
not %eax # ones where there is a mismatch
#ifndef BCMP
bsf %eax, %ecx # location of the first mismatch
cmovz %eax, %ecx # including if there is no mismatch
add %rcx, %rdx # pointer to potential mismatch
movzbl -32(%rdx), %eax # mismatching bytes
movzbl -32(%rdx, %rsi, 1), %edx
sub %edx, %eax
#endif
ret
#ifdef BCMP
.Lmismatch:
mov $1, %eax
.Lmismatch_head:
ret
#else /* memcmp */
.Lmismatch_head:
tzcnt %eax, %eax # location of mismatch
add %rax, %rcx # pointer to mismatch
movzbl (%rcx), %eax # mismatching bytes
movzbl (%rcx, %rsi, 1), %ecx
sub %ecx, %eax
ret
.Lmismatch:
movdqu -48(%rsi, %rdi, 1), %xmm1
pcmpeqb -48(%rdi), %xmm1 # reconstruct xmm1 before PAND
pmovmskb %xmm0, %eax # mismatches in first 16 bytes
pmovmskb %xmm1, %edx # mismatches in second 16 bytes
shl $16, %edx
or %edx, %eax # mismatches in both
not %eax # matches in both
tzcnt %eax, %eax # location of mismatch
add %rax, %rdi # pointer to mismatch
movzbl -64(%rdi), %eax # mismatching bytes
movzbl -64(%rdi, %rsi, 1), %ecx
sub %ecx, %eax
ret
#endif
ARCHEND(memcmp, baseline)
Apart from the comments the code does the following:
p1
p2
p3
/* CMEQ
input C a l l m e m a y b e
43 61 6c 6c 20 6d 65 20 6d 61 ...
mask 20 20 20 20 20 20 20 20 20 20 ..
res 00 00 00 00 FF 00 00 FF 00 00 ..
*/
B = Bytes (8bit)
H = Halfwords (16bit)
S = Single words (32bit)
D = Double words (64bit)