GSoC status update #2 (memcmp) DRAFT

05-06-2024


What is memcmp even?

The man page on FreeBSD has the following to say:

MEMCMP(3)              FreeBSD Library Functions Manual              MEMCMP(3)

NAME
     memcmp – compare byte string

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <string.h>

     int
     memcmp(const void *b1, const void *b2, size_t len);

DESCRIPTION
     The memcmp() function compares byte string b1 against byte string b2.
     Both strings are assumed to be len bytes long.

RETURN VALUES
     The memcmp() function returns zero if the two strings are identical,
     otherwise returns the difference between the first two differing bytes
     (treated as unsigned char values, so that ‘\200’ is greater than ‘\0’,
     for example).  Zero-length strings are always identical.

SEE ALSO
     bcmp(3), strcasecmp(3), strcmp(3), strcoll(3), strxfrm(3),
     timingsafe_memcmp(3), wmemcmp(3)

STANDARDS
     The memcmp() function conforms to ISO/IEC 9899:1990 (“ISO C90”).

FreeBSD 15.0-CURRENT            August 15, 2016           FreeBSD 15.0-CURRENT

Now, when starting to code anything I like to start off with a minimal example, a PoC (proof of concept). So for these string functions I first make a small wrapper program that calls either the libc function or my assembly function. Then I compare the outputs, which gives me quick turnaround times before integrating the routine into libc and running the proper unit tests.
So I did just that, and to my surprise libc wasn't returning the expected values.
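
For reference, a minimal sketch of the kind of wrapper I mean, where my_memcmp is just a stand-in name for the assembly routine under test (linked in separately):

#include <stdio.h>
#include <string.h>

/* my assembly routine; the name is a stand-in, not part of libc */
int my_memcmp(const void *, const void *, size_t);

int
main(void)
{
	char a[] = "\200\1\2";
	char b[] = "\0\1\2";

	/* eyeball the libc result against my routine's result */
	printf("libc: %d, mine: %d\n",
	    memcmp(a, b, sizeof(a)),
	    my_memcmp(a, b, sizeof(a)));
	return (0);
}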

Here be dragons 🐉

So when I run my little wrapper program with the following:

#include <stdio.h>
#include <string.h>

int main() {
    printf("%i\n", memcmp("\200", "\0", 1));
}

It returns 1; that's odd. However, ISO/IEC 9899:1990 only specifies that a number greater than, equal to, or less than zero shall be returned, so going by the standard it's a valid return value. But it doesn't match up with what the FreeBSD man page states.

But if we edit our program a bit, like the following, we notice something interesting.

#include <stdio.h>
#include <string.h>

int main() {
  printf("%i\n", memcmp((char[] ) {"\200"}, "\0", 1));
}

It returns 128, as expected!

The reason for this is that on AArch64 we have inherited string functions from the Arm Optimized Routines. These functions weren't originally written with FreeBSD in mind, and this discrepancy has simply gone unnoticed. If we instead glance over at the glibc man page for memcmp, we see the following regarding its return value:

RETURN VALUE 

    The memcmp() function returns an integer less than, equal to, or greater than
    zero if the first n bytes of s1 is found, respectively, to be less than, to
    match, or be greater than the first n bytes of s2.

    For a nonzero return value, the sign is determined by the sign of the difference
    between the first pair of bytes (interpreted as unsigned char) that differ in s1
    and s2.

    If n is zero, the return value is zero.

I strongly suspect that nobody is checking the return value beyond <, =, or >, except for the unit tests… which do fail on FreeBSD!
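
After all, the strictly portable way to consume the result is to reduce it to a sign, something like this (sign() here is my own helper, not a libc function):

#include <stdio.h>
#include <string.h>

/* reduce a memcmp() result to -1, 0, or 1 */
static int
sign(int x)
{
	return ((x > 0) - (x < 0));
}

int
main(void)
{
	char a[] = "\200";
	char b[] = "\0";

	/* prints 1 whether memcmp() returned 1 or 128 */
	printf("%d\n", sign(memcmp(a, b, 1)));
	return (0);
}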

I generated this result page with the kyua command kyua report-html -r ${TEST}.db; everything you need to know about the FreeBSD test suite is available on the FreeBSD wiki.

Finally, these inherited Arm Optimized Routines live in /usr/src/contrib/arm-optimized-routines and are plugged into the string functions' Makefile.inc like the following:

AARCH64_STRING_FUNCS= \
    memchr \
    memcmp \
    memcpy \
    memmove \
    memrchr \
    memset \
    stpcpy \
    strchr \
    strchrnul \
    strcmp \
    strcpy \
    strlen \
    strncmp \
    strnlen \
    strrchr

#
# Add the above functions. Generate an asm file that includes the needed
# Arm Optimized Routines file defining the function name to the libc name.
# Some file need multiple macros defined or a weak symbol added we can
# override the generated file in these cases.
#
.for FUNC in ${AARCH64_STRING_FUNCS}
.if !exists(${FUNC}.S)
${FUNC}.S:
    printf '/* %sgenerated by libc/aarch64/string/Makefile.inc */\n' @ > ${.TARGET}
    printf '#define __%s_aarch64 %s\n' ${FUNC} ${FUNC} >> ${.TARGET}
    printf '#include "aarch64/%s.S"\n' ${FUNC} >> ${.TARGET}
CLEANFILES+=    ${FUNC}.S
.endif

MDSRCS+=    ${FUNC}.S
CFLAGS.${FUNC}.S+=-I${SRCTOP}/contrib/arm-optimized-routines/string
.endfor
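
For memcmp, for example, the loop above generates a memcmp.S equivalent to the following (reconstructed from the printf lines):

/* @generated by libc/aarch64/string/Makefile.inc */
#define __memcmp_aarch64 memcmp
#include "aarch64/memcmp.S"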

There we can swap out the faulty routine for a compliant, compiler-generated implementation, but let's get going with porting ours!

@@ -15,11 +15,12 @@ AARCH64_STRING_FUNCS= \
        strchrnul \
        strcmp \
        strcpy \
-       memcmp \
        strncmp \
        strnlen \
        strrchr

+MDSRCS+= \
+       memcmp.S
 #
 # Add the above functions. Generate an asm file that includes the needed
 # Arm Optimized Routines file defining the function name to the libc name.

These functions already have Arm NEON enhancements from the Arm Optimized Routines:

    FUNCTION           |   NEON
    memchr             |   x
    memcmp             |   x
    memcpy             |   NO! (SWAR)
    memmove            |   Implemented in memcpy
    memrchr            |   x
    memset             |   x
    stpcpy             |   x
    strchr             |   x
    strchrnul          |   x
    strcmp             |   NO! (SWAR)
    strcpy             |   x
    strlen             |   x
    strncmp            |   NO! (SWAR)
    strnlen            |   x
    strrchr            |   x

Now, I won't be touching memcpy, as its implementation could benefit from another design (see XXX). If we now ignore the functions with pre-existing implementations by Arm, we are left with the following list. But let's finish off memcmp first, since we already got started on it, and maybe we can beat Arm at their own game.

           FUNCTION          AMD64
           bcmp              S1
           index             S1
           memccpy           S1
           rindex            S1
           stpncpy           S1
           strcat            S1
           strcmp            S1
           strcspn           S2
           strlcat           S1
           strlcpy           S1
           strncat           S1
           strncmp           S1
           strncpy           S1
           strpbrk           S2
           strsep            S2
           strspn            S2
           timingsafe_bcmp   S1

Scalar compiler output

After the above small hiccup I took a look at the .c code for memcmp and its generated assembly. Nothing particularly notable here.
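
It is essentially the obvious byte loop; a sketch along the lines of the portable memcmp.c (not the verbatim FreeBSD source):

#include <string.h>

int
memcmp(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	while (n-- != 0) {
		if (*p1 != *p2)
			return (*p1 - *p2);	/* difference of first mismatching bytes */
		p1++, p2++;
	}
	return (0);				/* all bytes matched */
}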

memcmp assembly listing

amd64 SIMD implementation

Now this is a long one:

ARCHENTRY(memcmp, baseline)
    cmp     $32, %rdx       # enough to permit use of the long kernel?
    ja      .Llong

    test        %rdx, %rdx      # zero bytes buffer?
    je      .L0

    /*
     * Compare strings of 1--32 bytes.  We want to do this by
     * loading into two xmm registers and then comparing.  To avoid
     * crossing into unmapped pages, we either load 32 bytes from
     * the start of the buffer or 32 bytes before its end, depending
     * on whether there is a page boundary between the overread area
     * or not.
     */

    /* check for page boundaries overreads */
    lea     31(%rdi), %eax      # end of overread
    lea     31(%rsi), %r8d
    lea     -1(%rdi, %rdx, 1), %ecx # last character in buffer
    lea     -1(%rsi, %rdx, 1), %r9d
    xor     %ecx, %eax
    xor     %r9d, %r8d
    test        $PAGE_SIZE, %eax    # are they on different pages?
    jz      0f

    /* fix up rdi */
    movdqu      -32(%rdi, %rdx, 1), %xmm0
    movdqu      -16(%rdi, %rdx, 1), %xmm1
    lea     -8(%rsp), %rdi      # end of replacement buffer
    sub     %rdx, %rdi      # start of replacement buffer
    movdqa      %xmm0, -40(%rsp)    # copy to replacement buffer
    movdqa      %xmm1, -24(%rsp)

0:  test        $PAGE_SIZE, %r8d
    jz      0f

    /* fix up rsi */
    movdqu      -32(%rsi, %rdx, 1), %xmm0
    movdqu      -16(%rsi, %rdx, 1), %xmm1
    lea     -40(%rsp), %rsi     # end of replacement buffer
    sub     %rdx, %rsi      # start of replacement buffer
    movdqa      %xmm0, -72(%rsp)    # copy to replacement buffer
    movdqa      %xmm1, -56(%rsp)

    /* load data and compare properly */
0:  movdqu      16(%rdi), %xmm1
    movdqu      16(%rsi), %xmm3
    movdqu      (%rdi), %xmm0
    movdqu      (%rsi), %xmm2
    mov     %edx, %ecx
    mov     $-1, %edx
    shl     %cl, %rdx       # ones where the buffer is not
    pcmpeqb     %xmm3, %xmm1
    pcmpeqb     %xmm2, %xmm0
    pmovmskb    %xmm1, %ecx
    pmovmskb    %xmm0, %eax
    shl     $16, %ecx
    or      %ecx, %eax      # ones where the buffers match
    or      %edx, %eax      # including where the buffer is not
    not     %eax            # ones where there is a mismatch
#ifndef BCMP
    bsf     %eax, %edx      # location of the first mismatch
    cmovz       %eax, %edx      # including if there is no mismatch
    movzbl      (%rdi, %rdx, 1), %eax   # mismatching bytes
    movzbl      (%rsi, %rdx, 1), %edx
    sub     %edx, %eax
#endif
    ret

    /* empty input */
.L0:    xor     %eax, %eax
    ret

    /* compare 33+ bytes */
    ALIGN_TEXT
.Llong: movdqu      (%rdi), %xmm0       # load head
    movdqu      (%rsi), %xmm2
    mov     %rdi, %rcx
    sub     %rdi, %rsi      # express rsi as distance from rdi
    and     $~0xf, %rdi     # align rdi to 16 bytes
    movdqu      16(%rsi, %rdi, 1), %xmm1
    pcmpeqb     16(%rdi), %xmm1     # compare second half of this iteration
    add     %rcx, %rdx      # pointer to last byte in buffer
    jc      .Loverflow      # did this overflow?
0:  pcmpeqb     %xmm2, %xmm0
    pmovmskb    %xmm0, %eax
    xor     $0xffff, %eax       # any mismatch?
    jne     .Lmismatch_head
    add     $64, %rdi       # advance to next iteration
    jmp     1f          # and get going with the loop

    /*
     * If we got here, a buffer length was passed to memcmp(a, b, len)
     * such that a + len < a.  While this sort of usage is illegal,
     * it is plausible that a caller tries to do something like
     * memcmp(a, b, SIZE_MAX) if a and b are known to differ, intending
     * for memcmp() to stop comparing at the first mismatch.  This
     * behaviour is not guaranteed by any version of ISO/IEC 9899,
     * but usually works out in practice.  Let's try to make this
     * case work by comparing until the end of the address space.
     */
.Loverflow:
    mov     $-1, %rdx       # compare until the end of memory
    jmp     0b

    /* process buffer 32 bytes at a time */
    ALIGN_TEXT
0:  movdqu      -32(%rsi, %rdi, 1), %xmm0
    movdqu      -16(%rsi, %rdi, 1), %xmm1
    pcmpeqb     -32(%rdi), %xmm0
    pcmpeqb     -16(%rdi), %xmm1
    add     $32, %rdi       # advance to next iteration
1:  pand        %xmm0, %xmm1        # 0xff where both halves matched
    pmovmskb    %xmm1, %eax
    cmp     $0xffff, %eax       # all bytes matched?
    jne     .Lmismatch
    cmp     %rdx, %rdi      # end of buffer reached?
    jb      0b

    /* less than 32 bytes left to compare */
    movdqu      -16(%rdx), %xmm1    # load 32 byte tail through end pointer
    movdqu      -16(%rdx, %rsi, 1), %xmm3
    movdqu      -32(%rdx), %xmm0
    movdqu      -32(%rdx, %rsi, 1), %xmm2
    pcmpeqb     %xmm3, %xmm1
    pcmpeqb     %xmm2, %xmm0
    pmovmskb    %xmm1, %ecx
    pmovmskb    %xmm0, %eax
    shl     $16, %ecx
    or      %ecx, %eax      # ones where the buffers match
    not     %eax            # ones where there is a mismatch
#ifndef BCMP
    bsf     %eax, %ecx      # location of the first mismatch
    cmovz       %eax, %ecx      # including if there is no mismatch
    add     %rcx, %rdx      # pointer to potential mismatch
    movzbl      -32(%rdx), %eax     # mismatching bytes
    movzbl      -32(%rdx, %rsi, 1), %edx
    sub     %edx, %eax
#endif
    ret

#ifdef BCMP
.Lmismatch:
    mov     $1, %eax
.Lmismatch_head:
    ret
#else /* memcmp */
.Lmismatch_head:
    tzcnt       %eax, %eax      # location of mismatch
    add     %rax, %rcx      # pointer to mismatch
    movzbl      (%rcx), %eax        # mismatching bytes
    movzbl      (%rcx, %rsi, 1), %ecx
    sub     %ecx, %eax
    ret

.Lmismatch:
    movdqu      -48(%rsi, %rdi, 1), %xmm1
    pcmpeqb     -48(%rdi), %xmm1    # reconstruct xmm1 before PAND
    pmovmskb    %xmm0, %eax     # mismatches in first 16 bytes
    pmovmskb    %xmm1, %edx     # mismatches in second 16 bytes
    shl     $16, %edx
    or      %edx, %eax      # mismatches in both
    not     %eax            # matches in both
    tzcnt       %eax, %eax      # location of mismatch
    add     %rax, %rdi      # pointer to mismatch
    movzbl      -64(%rdi), %eax     # mismatching bytes
    movzbl      -64(%rdi, %rsi, 1), %ecx
    sub     %ecx, %eax
    ret
#endif
ARCHEND(memcmp, baseline)
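
The subtlest part of the short-input path is the page-boundary test: it XORs the address of the last byte of the 32-byte overread with the address of the last byte of the buffer and tests the PAGE_SIZE bit. A C sketch of the same check, assuming 4 KiB pages as in the listing:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096	/* assumed 4 KiB pages */

/*
 * Would a 32-byte load from p overread into a different page than the
 * one holding the last byte of the len-byte buffer (len <= 32)?  The
 * two addresses are less than a page apart, so they lie in different
 * pages exactly when the PAGE_SIZE bit of their XOR is set.
 */
static int
overread_crosses_page(const void *p, size_t len)
{
	uintptr_t end_of_overread = (uintptr_t)p + 31;
	uintptr_t last_byte = (uintptr_t)p + len - 1;

	return (((end_of_overread ^ last_byte) & PAGE_SIZE) != 0);
}

If the bit is set, the routine instead loads the 32 bytes ending at the buffer's last byte and copies them to a replacement buffer on the stack, as the "fix up" blocks in the listing do.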

Apart from the comments, what the code boils down to is byte-wise compare-equal operations; on AArch64 the corresponding instruction is CMEQ. For example, comparing an input string against a vector of spaces (0x20):

  /*
   * CMEQ
   *
   * input    C  a  l  l     m  e     m  a  y  b  e
   *         43 61 6c 6c 20 6d 65 20 6d 61 79 62 65
   *
   * mask    20 20 20 20 20 20 20 20 20 20 20 20 20
   *
   * res     00 00 00 00 FF 00 00 FF 00 00 00 00 00
   */


These are the AArch64 SIMD element-size suffixes:

    B = bytes (8-bit)
    H = halfwords (16-bit)
    S = single words (32-bit)
    D = double words (64-bit)
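
To connect this back to the compare loop: AArch64 has no direct pmovmskb counterpart, so a common trick (the Arm Optimized Routines use it as well) is to narrow the CMEQ result with SHRN into a 64-bit mask holding four bits per byte. A sketch with NEON intrinsics, where first_mismatch16 is a hypothetical helper and not code from the port:

#include <arm_neon.h>
#include <stdint.h>

/*
 * Index of the first mismatching byte between two 16-byte chunks,
 * or -1 if they are identical.
 */
static int
first_mismatch16(const uint8_t *s1, const uint8_t *s2)
{
	uint8x16_t a = vld1q_u8(s1);
	uint8x16_t b = vld1q_u8(s2);
	uint8x16_t eq = vceqq_u8(a, b);		/* CMEQ: 0xff where bytes are equal */
	/* SHRN each 16-bit lane right by 4: 4 mask bits per input byte */
	uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(
	    vshrn_n_u16(vreinterpretq_u16_u8(eq), 4)), 0);

	mask = ~mask;				/* nibbles set at mismatches */
	if (mask == 0)
		return (-1);
	return ((int)(__builtin_ctzll(mask) / 4));
}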