Dinoxor
Re-implementing bitwise operations as abstractions in aarch64 neon registers.
_._
_/:|:
/||||||.
||||||||.
/|||||||||:
/|||||||||||
.|||||||||||||
| ||||||||||||:
_/| |||||||||||||:_=---.._
| | |||||:'''':|| '~-._ '-.
_/| | ||' '-._ _: ;
| | | ' '~~ _;
| ' _.=._ _-~
_.~ { '-_'
_.--=.-~ _.._ {_ }
_.-~ @-, { '-._ _. '~==+ |
(' } \_ \_.=~ | |
`=======' /_ ~-_ ) <_oo_>
`-----~~/ /'===...===' + /
<_oo_> / //
/ //
<_oo_>
I was bored an missing my favorite bipolar bear because she was being deported :( Fuck you Microsoft. I wanted to teach low-level Rust while I was learning arm64 ASM. I bought the book. This was before I had a manic episode and swore off any more computer knowledge. This post is going to be me, reading myself my own code to figure out what it does. I love obfuscation and bit twiddling because I can visualize the registers and perform a certain amount of operations in my head before I get frustrated and need to be on the computer again.
What I do remember about writing it, is that I wanted to make software slightly harder to reverse engineer by replacing the simple xor
instruction with a set of instructions that has the same effect. I wanted to this by utilizing the 128-bit Neon registers on my AArch64 processor.
All the (heavily commented) code is available on my Github, if you'd like to pull it and follow along.
Exclusive Or Truth Table
I did some research to figure out what the xor
logic does and whether there were alternative ways to do the same thing. I had some fruitful discussions with a few LLMs and perused the web. "With two inputs, XOR is true if and only if the inputs differ (one is true, one is false)." Sometimes bitwise operations are easier to visualize when I look at a truth table. There's a nice colored version on the wikipedia page but the one below will suffice.
A | B | A⊕B |
---|---|---|
F | F | F |
F | T | T |
T | F | T |
T | T | F |
Or goal is to XOR one byte with another byte. So that when we XOR 0b01010101
by 0b10101010
we expect the result to be 0b11111111
.
Let's take a look at the entrypoint to our assembly procedure.
_dinoxor
_dinoxor:
stp x29, x30, [sp, #-16]!
mov x29, sp
mov x2, #0
eor v0.16b, v0.16b, v0.16b
bl spread_bits_to_bytes
mov x0, x1
bl spread_bits_to_bytes
bl prepare_xor_truth_table
bl prepare_multiplication_table
bl calculate_xor_result
ldp x29, x30, [sp], #16
ret
The first operation transfers the link register (frame pointer) which resides in x29
and the stack address which lives in x30
to the address calculated when you subtract 16 bytes from the stack pointer.
[sp, #-16]
This part means "subtract 16 bytes from the stack pointer register and calculate the address in the register. The !
(exclamation point) at the end, mutates the stack pointer to contain the value you get when you subtract 16 from the stack pointer." The stack is where you store data for your procedure, it grows down, meaning that as you add things to it you subtract from the address that points to it's start.
Before advancing to the next instruction we can use lldb to subtract 16 from the stack pointer and check that our instruction behaves as we expect.
(lldbinit) p/x $sp
(unsigned long) 0x000000016fdfb5a0
(lldbinit) p/x $sp - 16
(unsigned long) 0x000000016fdfb590
We can then print the two 64bit values following the address that the stack pointer points to.
(lldbinit) mem read --format x --size 8 --count 2 '$sp - 16'
0x16fdfb590: 0x000000016fdfb64b 0x000000016fdfb619
Now let's advance to the next instruction.
The above image shows that the stack pointer (SP
) has been modified to contain the value that we expected (0x000000016fdfb590
). Now let's verify the 64 bits it points to contain the values in x29
and x30
.
(lldbinit) mem read --format x --size 8 --count 2 $sp
0x16fdfb590: 0x000000016fdfb5e0 0x0000000100000750
Great, we can now see that
0x16fdfb590: 0x000000016fdfb64b 0x000000016fdfb619
has become
0x16fdfb590: 0x000000016fdfb5e0 0x0000000100000750
Then we move the stack pointer to the link register. The value 0x000000016FDFB590
in SP
should replace the value 0x000000016FDFB5E0
in x29
.
Basically what we're doing, is stashing the stack pointer and link register so that when we are ready to return from our procedure, we can restore the stack pointer and link register to their inital state. This way, the process that called our procedure can continue its execution. There are official docs on the calling convention available from ARM but the best explanation I found was the accepted answer from this StackOverflow question:
A callee-save register must be saved by the callee (in opposition to a caller-save register, where the caller saves the register); so, if this is the ABI you are using, you do not have to save r10 before calling another function (the other function is responsible for saving it).
You can just substitute x
for r
. The x
denotes a 64bit register while r
generally denotes a 32bit register but can also be used to refer to 64bit registers.
The following 2 instructions are relatively simple.
mov x2, #0
eor v0.16b, v0.16b, v0.16b
The first moves the immediate value 0
into the x2
register. The next XORs, or in ARM parlance eor
's (exclusive OR's), the v0
register which is the first of the general purpose NEON registers. I think of v0-v31
exactly like x0-x27
except the v
registers can hold 128 bits that can be accessed and manipulated in what are called lanes. ARM has great reference documentation on NEON programming. When you eor
an operand against itself and store the result in itself, this has the effect of "zeroing out the register." In summation, we expect both x2
and v0
to contain the value 0
after these two instructions are run.
Let's step through these two instuctions and check.
(lldbinit) reg read $x2
x2 = 0x0000000000000000
(lldbinit) reg read $v0
v0 = {0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00}
This shows both that the registers contain 0
, and also that lldb prints the general purpose x
registers as a 64bit hex value and the NEON v
registers as 16 8bit hex values. Each of these values in v0
is called an 8bit lane.
When a procedure is called in aarch64, the AAPCS64 (AArch64 Procedure Call Standard) specifies that the x0 - x7
registers will contain the paramaters passed to the procedure. Our dinoxor
procedure expects 2 parameters to be passed, each a 1 byte operand value that should be eor
'd. This parameters will be in x0
and x1
.
Let's take a look at a C program that will run our assembly procedure.
#include <stdio.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
extern uint8_t dinoxor(uint8_t x0, uint8_t x1);
void printBits(unsigned int num)
{
printf("0b");
for(int bit=0;bit<(sizeof(uint8_t) * 8); bit++)
{
printf("%i", num & 0x01);
num = num >> 1;
}
printf("\n");
}
int main() {
uint8_t a = 0b01010101;
uint8_t b = 0b10101010;
uint8_t ret = dinoxor(a, b);
printBits(ret);
return 0;
}
When debugging this program, we expect x0
to contain 0b01010101
and x1
to contain 0b10101010
before we branch to the spread_bits_to_bytes
procedure.
(lldbinit) bits $x0
0b01010101
(lldbinit) bits $x1
0b10101010
Now that we can see our expectation has been validated, I need to point out that bits
is not a built-in lldb command. Is is a command script loaded by way of ~./lldbinit.py
. Here is the script:
import lldb
def bits(debugger, command, result, internal_dict):
frame = debugger.GetSelectedTarget().GetProcess().GetSelectedThread().GetSelectedFrame()
expr = command.strip()
value = frame.EvaluateExpression(expr)
if value.IsValid():
num = value.GetValueAsUnsigned()
print(f"0b{num:08b}")
else:
print("Invalid expression")
The next instruction, bl spread_bits_to_bytes
will branch execution to the address of the procedure spread_bits_to_bytes
. bl
stands for Branch with Link which takes the address of the next instruction, in our case the address containing mov x0, x1
, and stores it in LR
so that when we return from spread_bits_to_bytes
the program will branch to the address that it is to resume execution from.
spread_bits_to_bytes
and spread_bit_loop
Let's take a look at the procedure we just branched to and the procedure that follows. Notice that spread_bits_to_bytes
does not end with a ret
. That means, that the program will continue execution straight into the procedure defined under it, because each instruction defined in an assembly file is executed procedurally, or in order, one line at a time.
spread_bits_to_bytes:
eor v1.16b, v1.16b, v1.16b
eor v2.16b, v2.16b, v2.16b
mov w2, #0
spread_bit_loop:
lsr w3, w0, w2
and w3, w3, #0x01
mov w4, w3
ext v2.16b, v0.16b, v0.16b, #1
ins v2.b[0], w4
mov v0.16b, v2.16b
add w2, w2, #1
cmp w2, #8
b.lt spread_bit_loop
ext v2.16b, v0.16b, v0.16b, #1
ret
spread_bits_to_bytes
only prepares the destination registers we will be working with. It zeroes out the NEON registers v1
and v2
and sets w2
to 0
. w2
is just way to address only the least significant bit, or LSB of x2
. w2
is set to 0
so it will act as a counter in the loop that follows in spread_bit_loop
. We want the procedure to run once for each bit in w0
, so we need it to run 8
times.
eor v1.16b, v1.16b, v1.16b
eor v2.16b, v2.16b, v2.16b
mov w2, #0
Execution then advances to the next instuction, defined in spread_bit_loop
, where the actual logic is done.
lsr w3, w0, w2
and w3, w3, #0x01
mov w4, w3
ext v2.16b, v0.16b, v0.16b, #1
ins v2.b[0], w4
mov v0.16b, v2.16b
add w2, w2, #1
cmp w2, #8
b.lt spread_bit_loop
ext v2.16b, v0.16b, v0.16b, #1
In the first interation of the loop w2
will contain 0
. So the first instruction, lsr w3, w0, w2
will not have any effect. LSR
stands for logical shift right. In this case it will shift the bits in w0
by the value stored in w2
and the bit that falls off the end will get stored in w3
. Let's look at some psuedocode.
# lsr w3, w0, w2
lsr w3, 0b01010101, 1
# Move each bit one position to the right and place a 0 in the, now empty, most significant bit's (MSB) place.
# First shift the LSB to the right, it falls off our value
0b01010101 -> 0b0101010_ 1
# Shift each bit that follows
0b01010101 -> 0b_0101010
# Place a 0 in the MSB
0b_0101010 -> 0b00101010
# Result
# w0 = 0b01010101
# w2 = 1
# w3 = 0b00101010
Now that we have that explanation out of the way, let's step through the first iteration. Our counter in w2
begins at 0
, so the first lsr
instruction won't have any effect. We step through it to the next instruction.
and w3, w3, #0x01
This instruction performs a Logical AND of the byte in w3
and 1
. and
will compare each bit in two operands and return a 1
in each bit place where the values are both 1. Let's look at another truth table.
A | B | A∧B |
---|---|---|
F | F | F |
F | T | F |
T | F | F |
T | T | T |
Now some psuedocode.
0b01010101 # w3
0b00000001 # 1
----------
0b00000001 # Result, the only bit place that contains 1 in both bytes is the least significant bit.
We step throught the instruction and see that w3
now contains 1
. This has the effect of isolating each bit of our input so that we can move them, one-by-one into their own byte lane in a NEON register.
(lldbinit) bits $w3
0b00000001
The next instruction moves the value we just stored in w3
to w4
so that when we work with this bit we don't mutate the value in w3
.
mov w4, w3
Now we see our first NEON specific instruction.
ext v2.16b, v0.16b, v0.16b, #1
ext
is the, somewhat complicated, Extract vector from pair of vectors instruction. In our case it has the effect of shifting the v0
register by one byte and storing the result in v2
. This is because we are extracting bytes from 2 vectors but starting from the least significant byte so that by the time the first vector is extracted, there is no more room left in the destination register. This first iteration has no effect, because v0
is still 0
so v2
will also contain 0
.
ins v2.b[0], w4
This next instruction inserts the byte in w4
into the first lane of v1
. Remember that we previously moved the bit from the and
operation into w4
, so now we are at the instruction where we are actually moving the least significant bit of our original input to a NEON register and interpreting it as a whole ass byte. Here is the result.
(lldbinit) p $v2
(unsigned char __attribute__((ext_vector_type(16)))) (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
ARM's documentation states that the purpose of the INS instruction is to insert vector element from another vector element, but apparently it works for inserting from a regular non-NEON register into a vector element because the compiler changes the instruction to mov.b v2[0], w4
, which moves the byte into the first ane of v2
.
Let's move on to the next two instructions.
mov v0.16b, v2.16b
add w2, w2, #1
The first instruction moves all lanes of v2
into v0
. The second increments our counter in w2
by 1
.
Next up.
cmp w2, #8
b.lt spread_bit_loop
We compare the value in w2
to 8
, and if the value is less than 8
we branch to the beginning of the spread_bit_loop
procedure. The value of w2
is 0
so we branch.
Now that we know the instructions, let's step through them all and inspect the important values after each instruction.
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000001
w4 = 0b00000001
---
lsr w3, w0, w2 // Shift the input byte right by the current bit position to bring the target bit to the LSB
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00101010
w4 = 0b00000001
---
and w3, w3, #0x01 // Isolate the LSB (which is now the target bit)
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000001
---
mov w4, w3 // Move the processed bit to w4 (to ensure w4 is correctly updated before duplication)
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
ext v2.16b, v0.16b, v0.16b, #1 // Shift v0 left by one byte to make space for the new byte
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
ins v2.b[0], w4 // Insert the new byte at position 0 of v2
---
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
mov v0.16b, v2.16b // Move the temporary result back to v0
---
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000001
w3 = 0b00000000
w4 = 0b00000000
---
add w2, w2, #1 // Increment the bit position counter
---
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000010
w3 = 0b00000000
w4 = 0b00000000
---
cmp w2, #8 // Compare the counter with 8 (number of bits in a byte)
---
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v2 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
w0 = 0b01010101
w2 = 0b00000010
w3 = 0b00000000
w4 = 0b00000000
---
b.lt spread_bit_loop // If the counter is less than 8, continue the loop
You can see above that the byte in the least significant lane of v0
wrapped around to the most significant lane when the vector was shifted left. Let's now look at w0
, and list v0
printed at the end of every iteration.
On ARM NEON lanes' order of significance:
This confused me at first and will probably throw me off later so I'll take the time to note it. When LLDB prints NEON register values, it does so by listing the lanes from least significant to most significant UNLIKE printing bits, where the least significant bit is the value all the way to the right, at the end of the bit string.
w0 = 0b01010101
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01)
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01)
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01)
v0 = (0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00)
v0 = (0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01)
As you can see above, the bit pattern from w0
has been reproduced as a byte pattern in the upper half of v0
.
0b01010101
0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01
After this ret
is executed, we return to the frame that called spread_bits_to_bytes
, to the address of the instruction after the call. Let's go back and look at the _dinoxor
procedure where we land.
bl spread_bits_to_bytes
mov x0, x1 // <-- We land here.
bl spread_bits_to_bytes
We move the second operand to _dinoxor
, which is the second parameter, or byte that we are xor'ing against into x0
, and we again branch to spread_bits_to_bytes
. Now we repeat the same logic with a different input. Let's analyze our operand in register x0
.
(lldbinit) bits $x0
0b10101010
Now let's let the loop iterate 8 times and check the value of the v2
NEON register before we return.
(lldbinit) p $v2
(unsigned char __attribute__((ext_vector_type(16)))) (0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01)
Here we can see that v2
wasn't overwritten! It was modified and the logic holds, so that the most significant 8 lanes contain the first operand, each bit inflated into a byte and placed in a lane matching it's bit location in a byte, and the second operand is found in the least significant 8 lanes with the same properties.
Let's split up the value in v2
and compare each half to their corresponding operand to help see the pattern visually.
[ 0b10101010 , 0b01010101 ]
[ 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
[ 1 0 1 0 1 0 1 0 ] [ 0 1 0 1 0 1 0 1 ]
[ 0x01 0x00 0x01 0x00, 0x01 0x00 0x01 0x00 ] [ 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
Now we're ready to look at the next procedure.
prepare_xor_truth_table
We can think of an Exclusive Or Truth table as a two-dimensional matrix in one dimension. Refer the truth table in the beginning of this post. We can unfold the table into one dimension like so:
A | B | A⊕B |
---|---|---|
F | F | F |
F | T | T |
T | F | T |
T | T | F |
Remove the operand columns.
A⊕B |
---|
F |
T |
T |
F |
Now, interperate the answer column on only an x-axis.
F | T | T | F |
---|
We need this pattern to work with 128bit NEON registers, so we need to repeat this pattern as bytes across 4 lanes. The resulting neon register should be as follows:
v0 = [ 0x00 0x01 0x01 0x00 0x00 0x01 0x01 0x00 0x00 0x01 0x01 0x00 0x00 0x01 0x01 0x00 ]
The code to do this is quite simple.
prepare_xor_truth_table:
movz w8, #0x0001, lsl #16
movk w8, #0x0100
dup v0.4s, w8
ret
When we branch to this procedure $w8
is 0
.
(lldbinit) bits $w8
0b00000000
The first instruction, movz
shifts a 32bit immediate value, in this case #0x0001
which is equal to the decimal value 1
or the bit value 0b00000001
, left by 16 places, leaving us with the value 0b10000000000000000
. In practice, the compiler optimizes this instruction to mov w8, #0x10000
, as the bit value of #0x10000
is 0b10000000000000000
.
The next instruction movk
or Move wide with keep moves an, optionally shifted, 32bit value into a destination register without altering the previous data in the register. Moving the value 0x100
, or 0b100000000
in bits, to w8
gives us the result 0b10000000100000000
.
Now we need to somehow turn this value into 4 individual bytes and duplicate that pattern 4 times to fill up a NEON register with the correct pattern. Lucky for us the dup
instruction lets us duplicate the general purpose register w8
into a neon register while specifying the arrangement of the bytes when they are duplicated. The .4s
in v0.4s
means interperate the v0
register as 4 seperate 32bit scalar values and duplicate w8
into them.
The pattern is made clearer when we include leading 0
's when inspecting w8
, to display all 32bits.
w8 = 0b10000000100000000
w8 = 0b00000000000000010000000100000000
w8 = 0b00000000_00000001_00000001_00000000
w8 = 0x 00 01 01 00
Now lets imagine v0
as four 32bit scalar values.
v0 = [ 0x00000000 0x00000000 0x00000000 0x00000000 ]
Now duplicating them in this fashion gives us the xor truth table that we need.
(lldbinit) p $v0
(unsigned char __attribute__((ext_vector_type(16)))) (0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00
We then return execution the next instruction in our main dinoxor
procedure, and we approach another unconditional branch.
bl prepare_multiplication_table
prepare_multiplication_table
The idea that allows us to XOR two binary values by using multiplication is quite simple, even if the rest of this post does not make it appear as such.
The algorithm to XOR x
and y
by using a lookup table, when both x
and y
are a binary value, or in the set {0, 1}, is truth_table[index]
where index = 2x+y
.
Let's try it, looking at the 1 dimensional truth table we built earlier.
0 | 1 | 1 | 0 |
---|
Now let's build a table of indexes into this array.
x | y | 2x+1 |
---|---|---|
0 | 0 | 0 |
1 | 0 | 2 |
0 | 1 | 1 |
1 | 1 | 3 |
Now if we use the last column as an index to our truth table we get:
0 ^ 0 = 0
1 ^ 0 = 1
0 ^ 1 = 1
1 ^ 1 = 0
The algorithm works!
The procedure to build this one dimensional lookup table in a NEON register is quite simple.
prepare_multiplication_table:
movi v1.8b, #0x02
movi v8.8b, #0x01
mov v1.d[1], v8.d[0]
ret
The first instruction moves the immediate value 2
into the lower half of v
. The notation v1.8b
means interperate v1
as 8
byte size lanes, only giving us access to the least significant 64bits and copying the value 2
to each lane.
(lldbinit) p $v1
(unsigned char __attribute__((ext_vector_type(16)))) (0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
The next instruction moves the immediate value 1
into the lower half of v8
the same exact way. The reason we move 1
into each of the lower lanes is that we end up multiplying this entire register against a register that contains our first parameter in the lower lanes of a register and the second parameter in the upper lanes of the parameter.
(lldbinit) p $v8
(unsigned char __attribute__((ext_vector_type(16)))) (0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
Then we just move the lower half of v8
into the empty, upper half of v1
with the mov v1.d[1], v8.d[0]
instruction. This completes our multiplication table waiting patiently in v1
.
(lldbinit) p $v1
(unsigned char __attribute__((ext_vector_type(16)))) (0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
Finally, we return back to our main dinoxor
procedure. Where we find that our next instruction is out last non-conditional branch.
bl calculate_xor_result
calculate_xor_result
Let's pause for a moment and examine our register state heading into this procedure.
v0 = [ 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00 ]
v1 = [ 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01 ]
v2 = [ 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
v3 = [ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]
v0
: contains our truth table.v1
: contains our multiplication table.v2
: contains our two intial parameters, the first in the lower 8 lanes, the second in the higher 8 lanes.v3
: contains 0.
Now let's look at the procedure as a whole.
calculate_xor_result:
mul v3.16b, v2.16b, v1.16b
mov v3.d[1], v2.d[1]
ext.16b v1, v3, v3, #8
add.16b v1, v3, v1
mov.d v1[1], xzr
tbl.8b v1, {v0}, v1
movz x1, #0x0201, lsl #0
movk x1, #0x0804, lsl #16
movk x1, #0x2010, lsl #32
movk x1, #0x8040, lsl #48
mov v0.d[0], x1
mul v1.16b, v1.16b, v0.16b
addv b0, v1.8b
umov w0, v0.b[0]
ret
First we multiply each value in our multiplication table by each value in our NEON register that contains our 2 parameters we would like to XOR and store the result in v3
.
v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
Then with mov v3.d[1], v2.d[1]
, we move the second parameter that was passed to dinoxor
that is now spread across the most significant 8 lanes of v2
to the most significant 8 lanes of the register v3
, replacing the bytes already there. Here is the result.
v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
// To reiterate what we just explained.
|- Bottom half of (mult table x first param) ----|- Second parameter to dinoxor ------------------|
v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 | 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
The next instruction ext.16b v1, v3, v3, #8
, has the effect of moving the Second parameter to dinoxor) into the bottom half of v1
, overwriting half of our original multiplication table because we are finished with it, and we would like to reuse the register.
|- Second parameter to dinoxor -----------------|- Multiplication table x First param -----------|
v1 = [ 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 | 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 ]
Then the instruction add.16b v1, v3, v1
adds each lane in v1
to its corresponding lane in v3
and stores the result inv1
overwriting its previous contents. Let's look at psuedocode of the instruction.
|- Second parameter to dinoxor ------------------|- Multiplication table x First param ---------- |
v1 = [ 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 | 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 ]
|- Multiplication table x First param -----------|- Second parameter to dinoxor ------------------|
+ v3 = [ 0x02, 0x00, 0x02, 0x00, 0x02, 0x00, 0x02, 0x00 | 0x00, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00, 0x01 ]
----------------------------------------------------------------------------------------------------------
= v1 = [ 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01 | 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01 ]
The next instruction mov.d v1[1], xzr
, zeroes out the upper 8 lanes of v1
. We only require one of the halfs because they are exactly the same.
v1 = [ 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]
The next instruction tbl.8b v1, {v0}, v1
, performs a table lookup using the indices in v1
and the truth table in v0
and stores the result in v1
overwriting it's initial contents. There is an impressive explanation that includes a graph in ARM's documentation on lookup table instructions.
v0 = [ 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00 ]
v1 = [ 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x02, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]
When we use each value of v1
as an index into our truth table and overwrite v1
with the result we're left with the following register state.
v1 = [ 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]
Now we have the Exclusive Or'd value spread across the least significant bits of the v1
register and we need a way to compress these back down to a single byte in a multi-purpose register. I attempted several variations of loops and instructions and was unsucessful. I did however find a magic number that when multiplied by the least significant half of v1
has the effect of reproducing the byte pattern into the first byte sized lane of v1
when each of the lanes are summed together.
movz x1, #0x0201, lsl #0 // Load the lower 16 bits of x1 with 0x0201
movk x1, #0x0804, lsl #16 // Load the next 16 bits of x1 with 0x0804
movk x1, #0x2010, lsl #32 // Load the next 16 bits of x1 with 0x2010
movk x1, #0x8040, lsl #48 // Load the upper 16 bits of x1 with 0x8040
These instructions create the magic number, 9241421688590303745
in decimal or 0b10000000_01000000_00100000_00010000_00001000_00000100_00000010_00000001
. I find this particular pattern beautiful and was only able to formulate it by actually typing out the bit string and putzing about on a calculator.
Then the instruction mov.d v0[0], x1
, moves this value into the lower half of v0
.
v0 = [ 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01, 0x00 ]
The top half of v0
still contains our truth table so we can just ignore it. Next mul v1.16b, v1.16b, v0.16b
, multiplies v0
containing our abstracted XOR result by v1
which contains our magic number. Then the next instruction addv b0, v1.8b
, sums each lane of v1
and stores the result in the first byte sized line of v0
.
v0 = [ 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 ]
Finally, we move the first lane of v0
into the w0
register, which is the return value of our entire procedure.
(lldbinit) bits $w0
0b11111111
That's it! We've done it! Our expected return value is in the register that will return our result to a calling function. Now all that's left is to ret
back to our main _dinoxor
procedure and restore the Stack Pointer and Link Register, returning control of execution to the calling process along with our result in w0
, which if you don't remember is just a way to reference the least significant 32 bits of the 64bit register x0
.
ldp x29, x30, [sp], #16
ret
Future plans
I've already implemented this in Rust using inline assembly and utilizing quickcheck to provide generative testing. These posts take quite a lot of energy, so I'll have one ready as soon as humanly possible. I'd love to hear from you if you found this useful in any way or have any suggestions!