Dinoxor

Re-implementing bitwise operations as abstractions in AArch64 NEON registers.

                          _._
                        _/:|:
                       /||||||.
                       ||||||||.
                      /|||||||||:
                     /|||||||||||
                    .|||||||||||||
                    | ||||||||||||:
                  _/| |||||||||||||:_=---.._
                  | | |||||:'''':||  '~-._  '-.
                _/| | ||'         '-._   _:    ;
                | | | '               '~~     _;
                | '                _.=._    _-~
             _.~                  {     '-_'
     _.--=.-~       _.._          {_       }
 _.-~   @-,        {    '-._     _. '~==+  |
('          }       \_      \_.=~       |  |
`======='  /_         ~-_    )         <_oo_>
  `-----~~/ /'===...===' +   /
         <_oo_>         /  //
                       /  // 
                      <_oo_>

I was bored and missing my favorite bipolar bear because she was being deported :( Fuck you Microsoft. I wanted to teach low-level Rust while I was learning arm64 ASM. I bought the book. This was before I had a manic episode and swore off any more computer knowledge. This post is going to be me, reading my own code back to myself to figure out what it does. I love obfuscation and bit twiddling because I can visualize the registers and perform a certain number of operations in my head before I get frustrated and need to be on the computer again.

What I do remember about writing it is that I wanted to make software slightly harder to reverse engineer by replacing the simple xor instruction with a set of instructions that has the same effect. I wanted to do this by utilizing the 128-bit NEON registers on my AArch64 processor.

All the (heavily commented) code is available on my Github, if you'd like to pull it and follow along.

Exclusive Or Truth Table

I did some research to figure out what the xor logic does and whether there were alternative ways to do the same thing. I had some fruitful discussions with a few LLMs and perused the web. "With two inputs, XOR is true if and only if the inputs differ (one is true, one is false)." Sometimes bitwise operations are easier to visualize when I look at a truth table. There's a nice colored version on the Wikipedia page but the one below will suffice.

A  B  A⊕B
F  F  F
F  T  T
T  F  T
T  T  F

Our goal is to XOR one byte with another byte, so that when we XOR 0b01010101 with 0b10101010 we expect the result to be 0b11111111.
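
For reference, here is the ordinary C version of what we're about to re-implement the hard way. The 0b literals are a compiler extension, but Clang and GCC both accept them.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t a = 0b01010101;
    uint8_t b = 0b10101010;
    uint8_t result = a ^ b;      /* the single xor we want to replace */
    printf("0x%02x\n", result);  /* prints 0xff, i.e. 0b11111111 */
    return 0;
}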

Let's take a look at the entrypoint to our assembly procedure.

_dinoxor

_dinoxor:
    stp x29, x30, [sp, #-16]!
    mov x29, sp

    mov x2, #0
    eor v0.16b, v0.16b, v0.16b

    bl spread_bits_to_bytes
    mov x0, x1
    bl spread_bits_to_bytes

    bl prepare_xor_truth_table

    bl prepare_multiplication_table

    bl calculate_xor_result

    ldp x29, x30, [sp], #16

    ret

The first instruction stores the frame pointer, which resides in x29, and the link register, which lives in x30, to the address calculated by subtracting 16 bytes from the stack pointer.

[sp, #-16] means "subtract 16 bytes from the stack pointer and use the result as the address to store to." The ! (exclamation point) at the end writes that decremented address back into the stack pointer. The stack is where you store data for your procedure; it grows down, meaning that as you add things to it you subtract from the address that points to its top.

Dinoxor first instruction

Before advancing to the next instruction we can use lldb to subtract 16 from the stack pointer and check that our instruction behaves as we expect.

(lldbinit) p/x $sp 
(unsigned long) 0x000000016fdfb5a0
(lldbinit) p/x $sp - 16
(unsigned long) 0x000000016fdfb590

We can then print the two 64-bit values starting at that address ($sp - 16).

(lldbinit)  mem read --format x --size 8 --count 2 '$sp - 16'
0x16fdfb590: 0x000000016fdfb64b 0x000000016fdfb619

Now let's advance to the next instruction.

Dinoxor second instruction

The above image shows that the stack pointer (SP) has been modified to contain the value that we expected (0x000000016fdfb590). Now let's verify the 64 bits it points to contain the values in x29 and x30.

(lldbinit)  mem read --format x --size 8 --count 2 $sp
0x16fdfb590: 0x000000016fdfb5e0 0x0000000100000750

Great, we can now see that

0x16fdfb590: 0x000000016fdfb64b 0x000000016fdfb619

has become

0x16fdfb590: 0x000000016fdfb5e0 0x0000000100000750

Then we move the stack pointer into the frame pointer. The value 0x000000016FDFB590 in SP should replace the value 0x000000016FDFB5E0 in x29.

Basically what we're doing is stashing the frame pointer and link register so that when we are ready to return from our procedure, we can restore them to their initial state. This way, the process that called our procedure can continue its execution. There are official docs on the calling convention available from ARM, but the best explanation I found was the accepted answer from this StackOverflow question:

A callee-save register must be saved by the callee (in opposition to a caller-save register, where the caller saves the register); so, if this is the ABI you are using, you do not have to save r10 before calling another function (the other function is responsible for saving it).

You can just substitute x for r. The x denotes a 64-bit register, while r generally denotes a 32-bit register but can also be used to refer to 64-bit registers.

The following 2 instructions are relatively simple.

mov x2, #0
eor v0.16b, v0.16b, v0.16b

The first moves the immediate value 0 into the x2 register. The next XORs, or in ARM parlance eor's (exclusive OR's), the v0 register, which is the first of the general purpose NEON registers. I think of v0-v31 exactly like x0-x30, except the v registers can hold 128 bits that can be accessed and manipulated in what are called lanes. ARM has great reference documentation on NEON programming. When you eor an operand against itself and store the result in itself, this has the effect of "zeroing out the register." In summation, we expect both x2 and v0 to contain the value 0 after these two instructions are run.

Let's step through these two instructions and check.

Dinoxor fourth instruction

(lldbinit) reg read $x2
      x2 = 0x0000000000000000
(lldbinit) reg read $v0
      v0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}

This shows both that the registers contain 0, and also that lldb prints the general purpose x registers as a single 64-bit hex value and the NEON v registers as sixteen 8-bit hex values. Each of these values in v0 is called an 8-bit lane.

When a procedure is called in AArch64, the AAPCS64 (AArch64 Procedure Call Standard) specifies that the x0 - x7 registers will contain the parameters passed to the procedure. Our dinoxor procedure expects 2 parameters, each a 1-byte operand value that should be eor'd. These parameters will be in x0 and x1.

Let's take a look at a C program that will run our assembly procedure.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

extern uint8_t dinoxor(uint8_t x0, uint8_t x1);

void printBits(unsigned int num)
{
    printf("0b");
    /* print from the most significant bit down so the string reads like a normal binary literal */
    for (int bit = (sizeof(uint8_t) * 8) - 1; bit >= 0; bit--)
    {
        printf("%u", (num >> bit) & 0x01);
    }
    printf("\n");
}

int main() {
    uint8_t a = 0b01010101;
    uint8_t b = 0b10101010;

    uint8_t ret = dinoxor(a, b);

    printBits(ret);

    return 0;
}

When debugging this program, we expect x0 to contain 0b01010101 and x1 to contain 0b10101010 before we branch to the spread_bits_to_bytes procedure.

Dinoxor fifth instruction

(lldbinit) bits $x0
0b01010101
(lldbinit) bits $x1
0b10101010

Now that we can see our expectation has been validated, I need to point out that bits is not a built-in lldb command. It is a command script loaded by way of ~/.lldbinit.py. Here is the script:

import lldb

def bits(debugger, command, result, internal_dict):
    frame = debugger.GetSelectedTarget().GetProcess().GetSelectedThread().GetSelectedFrame()
    expr = command.strip()
    value = frame.EvaluateExpression(expr)
    if value.IsValid():
        num = value.GetValueAsUnsigned()
        print(f"0b{num:08b}")
    else:
        print("Invalid expression")

The next instruction, bl spread_bits_to_bytes, will branch execution to the address of the procedure spread_bits_to_bytes. bl stands for Branch with Link: it takes the address of the next instruction, in our case the address containing mov x0, x1, and stores it in LR so that when spread_bits_to_bytes returns, execution resumes at that address.

spread_bits_to_bytes and spread_bit_loop

Let's take a look at the procedure we just branched to and the procedure that follows. Notice that spread_bits_to_bytes does not end with a ret. That means the program will continue execution straight into the procedure defined under it, because the instructions in an assembly file are laid out, and executed, in order, one after another.

spread_bits_to_bytes:
    eor v1.16b, v1.16b, v1.16b
    eor v2.16b, v2.16b, v2.16b

    mov w2, #0

spread_bit_loop:
    lsr w3, w0, w2
    and w3, w3, #0x01
    
    mov w4, w3
    
    ext v2.16b, v0.16b, v0.16b, #1
    ins v2.b[0], w4
    mov v0.16b, v2.16b

    add w2, w2, #1
    cmp w2, #8
    b.lt spread_bit_loop

    ext v2.16b, v0.16b, v0.16b, #1
    
    ret

spread_bits_to_bytes only prepares the registers we will be working with. It zeroes out the NEON registers v1 and v2 and sets w2 to 0. w2 is just a way to address the least significant 32 bits of x2. w2 is set to 0 so it will act as a counter in the loop that follows in spread_bit_loop. We want the loop to run once for each bit in w0, so we need it to run 8 times.

eor v1.16b, v1.16b, v1.16b
eor v2.16b, v2.16b, v2.16b

mov w2, #0

Execution then advances to the next instruction, defined in spread_bit_loop, where the actual logic is done.

lsr w3, w0, w2
and w3, w3, #0x01

mov w4, w3

ext v2.16b, v0.16b, v0.16b, #1
ins v2.b[0], w4
mov v0.16b, v2.16b

add w2, w2, #1
cmp w2, #8
b.lt spread_bit_loop

ext v2.16b, v0.16b, v0.16b, #1

In the first iteration of the loop w2 will contain 0, so the first instruction, lsr w3, w0, w2, will not have any effect. LSR stands for logical shift right. It shifts the bits in w0 right by the amount stored in w2 and stores the shifted result in w3; the bits that fall off the end are discarded. Let's look at some pseudocode.

# lsr w3, w0, w2
lsr w3, 0b01010101, 1

# Move each bit one position to the right and place a 0 in the, now empty, most significant bit's (MSB) place.
# First shift the LSB to the right, it falls off our value
0b01010101 -> 0b0101010_ 1
# Shift each bit that follows
0b01010101 -> 0b_0101010
# Place a 0 in the MSB
0b_0101010 -> 0b00101010

# Result
# w0 = 0b01010101
# w2 = 1
# w3 = 0b00101010

Now that we have that explanation out of the way, let's step through the first iteration. Our counter in w2 begins at 0, so the first lsr instruction won't have any effect. We step through it to the next instruction.

and w3, w3, #0x01

This instruction performs a logical AND of the byte in w3 and 1. and compares each bit of its two operands and returns a 1 in each bit position where both values are 1. Let's look at another truth table.

A  B  A∧B
F  F  F
F  T  F
T  F  F
T  T  T

Now some pseudocode.

0b01010101 # w3
0b00000001 # 1
----------
0b00000001 # Result, the only bit place that contains 1 in both bytes is the least significant bit.

We step through the instruction and see that w3 now contains 1. This has the effect of isolating each bit of our input so that we can move them, one by one, into their own byte lane in a NEON register.

(lldbinit) bits $w3
0b00000001
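
If it helps, the lsr/and pair is just the shift-and-mask idiom you would write in C. Here's a small sketch of what the loop does to each bit of our input:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t input = 0b01010101;              /* w0 */
    for (int i = 0; i < 8; i++) {            /* w2, the loop counter */
        uint8_t bit = (input >> i) & 0x01;   /* lsr w3, w0, w2  then  and w3, w3, #0x01 */
        printf("bit %d = %d\n", i, bit);
    }
    return 0;
}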

The next instruction moves the value we just stored in w3 to w4 so that when we work with this bit we don't mutate the value in w3.

mov w4, w3

Now we see our first NEON specific instruction.

ext v2.16b, v0.16b, v0.16b, #1

ext is the, somewhat complicated, Extract vector from pair of vectors instruction. Because both source operands here are v0, it effectively rotates v0 by one byte and stores the result in v2: it reads 16 consecutive bytes from the concatenation of v0 with itself, starting at byte index 1, so every byte moves down one lane and the byte in lane 0 wraps around to lane 15. This first iteration has no visible effect, because v0 is still 0, so v2 will also contain 0.
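
If it helps to see it outside of registers, here is a C model of what that ext does to the sixteen lanes, using the same lane numbering lldb displays. It's only a model of the effect, not what the hardware literally does.

#include <stdio.h>
#include <stdint.h>

/* Model of ext v2.16b, v0.16b, v0.16b, #1: with both sources equal,
 * lane i of the result is lane (i + 1) mod 16 of v0, so every byte
 * moves down one lane and lane 0 wraps around to lane 15. */
static void ext_rotate_by_one(const uint8_t v0[16], uint8_t v2[16]) {
    for (int i = 0; i < 16; i++)
        v2[i] = v0[(i + 1) % 16];
}

int main(void) {
    uint8_t v0[16] = {0x01};   /* 0x01 in lane 0, everything else 0 */
    uint8_t v2[16];
    ext_rotate_by_one(v0, v2);
    for (int i = 0; i < 16; i++)
        printf("0x%02x ", v2[i]);
    printf("\n");              /* the 0x01 ends up in lane 15 */
    return 0;
}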

ins v2.b[0], w4

This next instruction inserts the byte in w4 into the first lane of v2. Remember that we previously moved the bit from the and operation into w4, so now we are at the instruction where we are actually moving the least significant bit of our original input to a NEON register and interpreting it as a whole ass byte. Here is the result.

(lldbinit) p $v2
(unsigned char __attribute__((ext_vector_type(16)))) (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)

ARM's documentation states that the purpose of the INS instruction is to insert a vector element from another vector element, but apparently it also works for inserting from a regular non-NEON register into a vector element, because the compiler changes the instruction to mov.b v2[0], w4, which moves the byte into the first lane of v2.

Let's move on to the next two instructions.

mov v0.16b, v2.16b
add w2, w2, #1

The first instruction moves all lanes of v2 into v0. The second increments our counter in w2 by 1.

Next up.

cmp w2, #8
b.lt spread_bit_loop

We compare the value in w2 to 8, and if the value is less than 8 we branch back to the beginning of spread_bit_loop. The value of w2 is 1, so we branch.

Now that we know the instructions, let's step through them all and inspect the important values after each instruction.

---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000001
w4 = 0b00000001
---
lsr w3, w0, w2 // Shift the input byte right by the current bit position to bring the target bit to the LSB
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00101010
w4 = 0b00000001
---
and w3, w3, #0x01 // Isolate the LSB (which is now the target bit)
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000001
---
mov w4, w3 // Move the processed bit to w4 (to ensure w4 is correctly updated before duplication)
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
ext v2.16b, v0.16b, v0.16b, #1  // Shift v0 left by one byte to make space for the new byte
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
ins v2.b[0], w4 // Insert the new byte at position 0 of v2
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
mov v0.16b, v2.16b // Move the temporary result back to v0
---
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
add w2, w2, #1 // Increment the bit position counter
---
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000010
w3 = 0b00000000
w4 = 0b00000000
---
cmp w2, #8 // Compare the counter with 8 (number of bits in a byte)
---
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000010
w3 = 0b00000000
w4 = 0b00000000
---
b.lt spread_bit_loop       // If the counter is less than 8, continue the loop

You can see above that the byte in the least significant lane of v0 wrapped around to the most significant lane when the vector was rotated. Let's now look at w0, and list v0 printed at the end of every iteration.

On ARM NEON lanes' order of significance:

This confused me at first and will probably throw me off later, so I'll take the time to note it. When LLDB prints NEON register values, it lists the lanes from least significant to most significant, UNLIKE printing bits, where the least significant bit is all the way to the right, at the end of the bit string.

w0 = 0b01010101
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01)
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01)
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01)

As you can see above, the bit pattern from w0 has been reproduced as a byte pattern in the upper half of v0.

0b01010101
0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01
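
Here is a C model of the whole spread_bits_to_bytes loop, treating v0 as a sixteen-byte array indexed in the same lane order lldb prints. It's only a sketch of the register dance above, not the real implementation.

#include <stdio.h>
#include <stdint.h>

/* C model of spread_bit_loop: rotate the "register" one lane, then
 * insert the next bit of the input as a whole byte in lane 0. */
static void spread_bits_to_bytes(uint8_t input, uint8_t v0[16]) {
    for (int i = 0; i < 8; i++) {
        uint8_t v2[16];
        for (int j = 0; j < 16; j++)     /* ext v2.16b, v0.16b, v0.16b, #1 */
            v2[j] = v0[(j + 1) % 16];
        v2[0] = (input >> i) & 0x01;     /* lsr + and + ins v2.b[0], w4 */
        for (int j = 0; j < 16; j++)     /* mov v0.16b, v2.16b */
            v0[j] = v2[j];
    }
}

int main(void) {
    uint8_t v0[16] = {0};
    spread_bits_to_bytes(0b01010101, v0);
    for (int i = 0; i < 16; i++)
        printf("0x%02x ", v0[i]);   /* the bits show up as bytes in the upper lanes */
    printf("\n");
    return 0;
}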

After this ret is executed, we return to the frame that called spread_bits_to_bytes, to the address of the instruction after the call. Let's go back and look at the _dinoxor procedure where we land.

    bl spread_bits_to_bytes
    mov x0, x1 // <-- We land here.
    bl spread_bits_to_bytes

We move the second operand to _dinoxor, the byte that we are XORing against, into x0, and we again branch to spread_bits_to_bytes to repeat the same logic with a different input. Let's analyze our operand in register x0.

(lldbinit) bits $x0
0b10101010

Now let's let the loop iterate 8 times and check the value of the v2 NEON register before we return.

(lldbinit) p $v2
(unsigned char __attribute__((ext_vector_type(16)))) (0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01)

Here we can see that the first operand's data wasn't lost! It was carried along and the logic holds: the least significant 8 lanes contain the first operand, each bit inflated into a byte and placed in a lane matching its bit position, and the second operand is found in the most significant 8 lanes with the same properties.

Let's split up the value in v2 and compare each half to its corresponding operand to help see the pattern visually.

[ 0b01010101 (bits listed LSB first)           , 0b10101010 (bits listed LSB first)              ]
[ 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]

[    1    0    1    0     1    0    1    0 ] [    0     1     0     1     0     1     0     1 ]
[ 0x01 0x00 0x01 0x00, 0x01 0x00 0x01 0x00 ] [ 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]

Now we're ready to look at the next procedure.

prepare_xor_truth_table

We can think of the Exclusive Or truth table as a two-dimensional matrix flattened into one dimension. Refer to the truth table at the beginning of this post. We can unfold the table into one dimension like so:

A  B  A⊕B
F  F  F
F  T  T
T  F  T
T  T  F

Remove the operand columns.

A⊕B
F
T
T
F

Now, interpret the answer column along a single axis.

FTTF

We need this pattern to work with 128-bit NEON registers, so we represent each entry as a byte and repeat the 4-byte pattern 4 times to fill all 16 lanes. The resulting NEON register should be as follows:

v0 = [ 0x00 0x01 0x01 0x00 0x00 0x01 0x01 0x00 0x00 0x01 0x01 0x00 0x00 0x01 0x01 0x00 ]

The code to do this is quite simple.

prepare_xor_truth_table:
    movz w8, #0x0001, lsl #16
    movk w8, #0x0100
    
    dup v0.4s, w8
    
    ret

When we branch to this procedure $w8 is 0.

(lldbinit) bits $w8
0b00000000

The first instruction, movz, moves a 16-bit immediate value, in this case #0x0001 (the decimal value 1, or 0b00000001 in bits), shifted left by 16 places, leaving us with the value 0b10000000000000000. In practice, the compiler optimizes this instruction to mov w8, #0x10000, as the bit value of #0x10000 is 0b10000000000000000.

The next instruction, movk, or Move wide with keep, moves an, optionally shifted, 16-bit value into a destination register without altering the other bits of the register. Moving the value 0x100, or 0b100000000 in bits, into w8 gives us the result 0b10000000100000000.

Now we need to somehow turn this value into 4 individual bytes and duplicate that pattern 4 times to fill up a NEON register with the correct pattern. Lucky for us, the dup instruction lets us duplicate the general purpose register w8 into a NEON register while specifying the arrangement of the bytes as they are duplicated. The .4s in v0.4s means interpret the v0 register as 4 separate 32-bit scalar values and duplicate w8 into each of them.

The pattern is made clearer when we include leading 0's when inspecting w8, to display all 32 bits.

w8 = 0b10000000100000000
w8 = 0b00000000000000010000000100000000
w8 = 0b00000000_00000001_00000001_00000000
w8 = 0x 00 01 01 00

Now let's imagine v0 as four 32-bit scalar values.

v0 = [ 0x00000000 0x00000000 0x00000000 0x00000000 ]

Now duplicating w8 into each of them gives us the XOR truth table that we need.

(lldbinit) p $v0
(unsigned char __attribute__((ext_vector_type(16)))) (0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00)
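
A C sketch of the same trick, building the 32-bit pattern and stamping it across the sixteen lanes. This relies on little-endian byte order, which is what AArch64 macOS uses.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* movz w8, #0x0001, lsl #16  then  movk w8, #0x0100 */
    uint32_t w8 = (0x0001u << 16) | 0x0100u;   /* 0x00010100 */

    /* dup v0.4s, w8: copy the 32-bit value into each of the four words */
    uint8_t v0[16];
    for (int i = 0; i < 4; i++)
        memcpy(&v0[i * 4], &w8, sizeof w8);

    for (int i = 0; i < 16; i++)
        printf("0x%02x ", v0[i]);   /* 0x00 0x01 0x01 0x00, four times */
    printf("\n");
    return 0;
}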

We then return execution to the next instruction in our main dinoxor procedure, where we approach another unconditional branch.

bl prepare_multiplication_table

prepare_multiplication_table

The idea that allows us to XOR two binary values by using multiplication is quite simple, even if the rest of this post does not make it appear as such.

The algorithm to XOR x and y using a lookup table, when both x and y are binary values in the set {0, 1}, is truth_table[index], where index = 2x + y.

Let's try it, looking at the 1 dimensional truth table we built earlier.

0110

Now let's build a table of indexes into this array.

x  y  2x+y
0  0  0
1  0  2
0  1  1
1  1  3

Now if we use the last column as an index to our truth table we get:

0 ^ 0 = 0
1 ^ 0 = 1
0 ^ 1 = 1
1 ^ 1 = 0

The algorithm works!
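
A quick C check of the index trick, comparing the table lookup against the real ^ operator:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint8_t truth_table[4] = {0, 1, 1, 0};   /* our one-dimensional XOR table */

    for (uint8_t x = 0; x <= 1; x++) {
        for (uint8_t y = 0; y <= 1; y++) {
            uint8_t looked_up = truth_table[2 * x + y];
            printf("%d ^ %d = %d (real xor: %d)\n", x, y, looked_up, x ^ y);
        }
    }
    return 0;
}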

The procedure that builds the table of multipliers we need to compute that index for every lane of a NEON register is quite simple.

prepare_multiplication_table:
    movi v1.8b, #0x02
    movi v8.8b, #0x01
    
    mov v1.d[1], v8.d[0]
    
    ret

The first instruction moves the immediate value 2 into the lower half of v1. The notation v1.8b means interpret v1 as 8 byte-sized lanes, only giving us access to the least significant 64 bits, and copy the value 2 into each lane.

(lldbinit) p $v1
(unsigned char __attribute__((ext_vector_type(16)))) (0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)

The next instruction moves the immediate value 1 into the lower half of v8 in exactly the same way. The reason for the 1's is that we will end up multiplying this entire table against a register that contains our first parameter in its lower lanes and the second parameter in its upper lanes.

(lldbinit) p $v8
(unsigned char __attribute__((ext_vector_type(16)))) (0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)

Then we just move the lower half of v8 into the empty upper half of v1 with the mov v1.d[1], v8.d[0] instruction. This completes our multiplication table, waiting patiently in v1.

(lldbinit) p $v1
(unsigned char __attribute__((ext_vector_type(16)))) (0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01)

Finally, we return back to our main dinoxor procedure, where we find that our next instruction is our last unconditional branch.

bl calculate_xor_result

calculate_xor_result

Let's pause for a moment and examine our register state heading into this procedure.

v0 = [ 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00 ]
v1 = [ 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01 ]
v2 = [ 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
v3 = [ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]
  • v0: contains our truth table.
  • v1: contains our multiplication table.
  • v2: contains our two initial parameters, the first in the lower 8 lanes, the second in the higher 8 lanes.
  • v3: contains 0.

Now let's look at the procedure as a whole.

calculate_xor_result:
    mul v3.16b, v2.16b, v1.16b

    mov v3.d[1], v2.d[1]
    ext.16b v1, v3, v3, #8
    add.16b v1, v3, v1
    mov.d v1[1], xzr
    tbl.8b v1, {v0}, v1

    movz x1, #0x0201, lsl #0
    movk x1, #0x0804, lsl #16
    movk x1, #0x2010, lsl #32
    movk x1, #0x8040, lsl #48
    mov v0.d[0], x1

    mul v1.16b, v1.16b, v0.16b
    addv b0, v1.8b
    umov w0, v0.b[0]

    ret

First we multiply each lane of v2, the register that contains the two parameters we would like to XOR, by the corresponding lane of our multiplication table in v1, and store the result in v3.

v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]

Then, with mov v3.d[1], v2.d[1], we move the second parameter passed to dinoxor, which is now spread across the most significant 8 lanes of v2, into the most significant 8 lanes of v3, replacing the bytes already there (they happen to be identical, since the upper half of the multiplication table is all 1's). Here is the result.

v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]

// To reiterate what we just explained.

     |- Bottom half of (mult table x first param) ----|- Second parameter to dinoxor ------------------| 
v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 | 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]

The next instruction, ext.16b v1, v3, v3, #8, swaps the two halves of v3 and stores the result in v1, placing the second parameter to dinoxor in the bottom half and overwriting our original multiplication table because we are finished with it and would like to reuse the register.

     |-  Second parameter to dinoxor -----------------|- Multiplication table x First param -----------| 
v1 = [ 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 | 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 ]

Then the instruction add.16b v1, v3, v1 adds each lane in v1 to its corresponding lane in v3 and stores the result in v1, overwriting its previous contents. Let's look at pseudocode of the instruction.

       |- Second parameter to dinoxor ------------------|- Multiplication table x First param ---------- | 
  v1 = [ 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 | 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 ]
       |- Multiplication table x First param -----------|- Second parameter to dinoxor ------------------| 
+ v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 | 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
----------------------------------------------------------------------------------------------------------
= v1 = [ 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01 | 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01 ]
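
Viewed in C, the half-swap followed by the lane-wise add is just a way of computing 2x + y for every bit position at once (the result ends up duplicated in both halves). A sketch, with the lane values hard-coded from the dump above:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* v3 after the mul/mov: 2*a_bit in lanes 0-7, b_bit in lanes 8-15 */
    uint8_t v3[16] = {2, 0, 2, 0, 2, 0, 2, 0,  0, 1, 0, 1, 0, 1, 0, 1};
    uint8_t v1[16], sum[16];

    for (int i = 0; i < 16; i++)     /* ext.16b v1, v3, v3, #8: swap the halves */
        v1[i] = v3[(i + 8) % 16];
    for (int i = 0; i < 16; i++)     /* add.16b v1, v3, v1 */
        sum[i] = v3[i] + v1[i];

    for (int i = 0; i < 8; i++)      /* every lane now holds 2*a_bit + b_bit */
        printf("%d ", sum[i]);
    printf("\n");
    return 0;
}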

The next instruction, mov.d v1[1], xzr, zeroes out the upper 8 lanes of v1. We only require one of the halves because they are exactly the same.

v1 = [ 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]

The next instruction, tbl.8b v1, {v0}, v1, performs a table lookup using the indices in v1 and the truth table in v0, and stores the result in v1, overwriting its initial contents. There is an impressive explanation that includes a graph in ARM's documentation on lookup table instructions.

v0 = [ 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00 ]
v1 = [ 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]

When we use each value of v1 as an index into our truth table and overwrite v1 with the result we're left with the following register state.

v1 = [ 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]
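
In C terms, tbl is just an ordinary array index applied to every lane at once. A small model of this step, with the table and index values hard-coded from the register dumps above:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* v0: the XOR truth table; v1: the 2*a + b indices (upper half already zeroed) */
    const uint8_t truth_table[16] = {0, 1, 1, 0,  0, 1, 1, 0,  0, 1, 1, 0,  0, 1, 1, 0};
    uint8_t indices[8] = {2, 1, 2, 1, 2, 1, 2, 1};
    uint8_t result[8];

    for (int i = 0; i < 8; i++)           /* tbl.8b v1, {v0}, v1 */
        result[i] = truth_table[indices[i]];

    for (int i = 0; i < 8; i++)
        printf("0x%02x ", result[i]);     /* 0x01 in every lane: 1^0 and 0^1 are both 1 */
    printf("\n");
    return 0;
}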

Now we have the exclusive-OR'd value spread across the least significant lanes of the v1 register, and we need a way to compress these bytes back down to a single byte in a general purpose register. I attempted several variations of loops and instructions and was unsuccessful. I did, however, find a magic number that, when multiplied lane-wise against the least significant half of v1 and then summed across lanes, collapses the byte pattern back into a single byte.

movz x1, #0x0201, lsl #0   // Load the lower 16 bits of x1 with 0x0201
movk x1, #0x0804, lsl #16  // Load the next 16 bits of x1 with 0x0804
movk x1, #0x2010, lsl #32  // Load the next 16 bits of x1 with 0x2010
movk x1, #0x8040, lsl #48  // Load the upper 16 bits of x1 with 0x8040

These instructions create the magic number, 9241421688590303745 in decimal or 0b10000000_01000000_00100000_00010000_00001000_00000100_00000010_00000001. I find this particular pattern beautiful and was only able to formulate it by actually typing out the bit string and putzing about on a calculator.
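
To convince myself the magic number works, I find it helpful to model the mul and addv that follow below in C: each lane of the XOR result holds a 0 or a 1, and multiplying lane j by 2^j before summing is just the usual positional expansion of a byte. A sketch:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* v1 lower half: the XOR result, one bit per lane (here 0xff spread out) */
    uint8_t xor_bits[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    /* the magic number 0x8040201008040201, viewed byte by byte: 2^0 .. 2^7 */
    uint8_t weights[8]  = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};

    uint8_t result = 0;
    for (int i = 0; i < 8; i++)     /* mul v1.16b, v1.16b, v0.16b then addv b0, v1.8b */
        result += xor_bits[i] * weights[i];

    printf("0x%02x\n", result);     /* 0xff */
    return 0;
}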

Then the instruction mov.d v0[0], x1, moves this value into the lower half of v0.

v0 = [ 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00 ]

The top half of v0 still contains our truth table so we can just ignore it. Next, mul v1.16b, v1.16b, v0.16b multiplies v1, containing our abstracted XOR result, by v0, which now contains our magic number in its lower half. Then the next instruction, addv b0, v1.8b, sums the lower 8 lanes of v1 and stores the result in the first byte-sized lane of v0.

v0 = [ 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]

Finally, we move the first lane of v0 into the w0 register, which is the return value of our entire procedure.

(lldbinit) bits $w0
0b11111111

That's it! We've done it! Our expected return value is in the register that will return our result to the calling function. Now all that's left is to ret back to our main _dinoxor procedure and restore the frame pointer, link register, and stack pointer, returning control of execution to the calling process along with our result in w0, which, if you don't remember, is just a way to reference the least significant 32 bits of the 64-bit register x0.

ldp x29, x30, [sp], #16

ret

Future plans

I've already implemented this in Rust using inline assembly and utilizing quickcheck to provide generative testing. These posts take quite a lot of energy, so I'll have one ready as soon as humanly possible. I'd love to hear from you if you found this useful in any way or have any suggestions!