Generative Testing Inline Assembly in Rust
thechinesegovernment
d888888b d888888b
d888 8888b d888888 888b
d88 88 898888b d8888 888 88b
d8P 88888888b d88888888888 b8b
88 8888888888 88888888888 88
88 88888888888 8888888888 88
98b 88888888888P 988888888 d8P
988 888 8888P _=_ 9888898 88 88P
9888 888888P q(-_-)p 98888 888P
9888888P '_) (_` 9888888P
88 /__/ \ 88
88 _(<_ / )_ 88
d88b (__\_\_|_/__) d88b
Recently, I learned what a small amount of the RISC instructions provided by aarch64 processors actually do (I read like 100 pages of Azeria's book). Governments are constantly caught lacking by the security community. RC4 is still found obfuscating code in big hacks. RC4 is fast, and this is almost always the explanation given in write-ups breaking down the reverse engineering of these tools. It always leaves me thinking, "Our computers are so much faster now than when you probably wrote this. Are you just copying source files from 1999 over and over again?" The answer is probably yes, but whatever.
How hard is it to implement ChaCha20? It turns out that, in Rust, it's not hard at all. Even the C code I was using as reference seemed pretty simple, possibly simpler than the Rust implementation. Before we get into the encryption algorithms I implemented to benchmark my stupid little routines, let's learn about Rust's inline assembly feature.
Introduction
First, I wrote aarch64 assembly that obfuscates the eor instruction. You can find the source code on my Github if you'd like the C and assembly files. They are also available, in a less comprehensive package, in the reference directory of this project's Github repository.
There is great documentation provided by Rust's maintainers. It is comprehensive, and a lot to take in at once for someone new to the feature. That's why I find it easy to only take what I need. Let's not focus on the extraneous information and everything that could be done; instead we'll just focus on the parts we use.
src/lib.rs
#![feature(test)]

#[cfg(test)]
extern crate quickcheck;
#[cfg(test)]
#[macro_use(quickcheck)]
extern crate quickcheck_macros;

pub mod dinoxor;
pub mod chacha20;

#[cfg(test)]
mod tests;
We want our project to be a library, so that our dinoxor function can be called from other projects using our code as a dependency. We want this to be easily shared via crates.io. It's already available, so if you'd like to use it in your 1337 anti-cheat VM, add a line for it under [dependencies] in your Cargo.toml file.
[dependencies]
thechinesegovernment = "*"
And use it like this:
use thechinesegovernment::dinoxor::dinoxor;

let result = dinoxor(0b11101011, 0b11111111);
assert_eq!(result, 0b10100);
This will eor ("Exclusive-Or", a.k.a. "XOR") two byte-sized values. It will not do so quickly. Well, quicker than your eyes can perceive, but not quick relative to the single aarch64 processor instruction eor. I've added benchmarks so we can see exactly how much slower this inline assembly-powered obfuscation routine runs.
src/tests.rs
extern crate test;

use test::{black_box, Bencher};

use crate::dinoxor::dinoxor;

#[bench]
fn bench_xor(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(0xFF);
        (0x00..n).fold(0, |old: u8, new| old ^ new)
    });
}

#[bench]
fn bench_dinoxor(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(0xFF);
        (0x00..n).fold(0, |old: u8, new| dinoxor(old, new))
    });
}
Rust has great docs on what black_box does here. It basically prevents the compiler from optimizing away any code that uses that piece of data. Bencher is the structure that runs your code and keeps track of stats. Above, we are benchmarking our obscure eor (dinoxor) against the regular eor (^).
λ cargo bench
Compiling thechinesegovernment v0.2.0 (/Users/tg/Projects/thechinesegovernment)
Finished `bench` profile [optimized] target(s) in 0.55s
Running unittests src/lib.rs (target/release/deps/thechinesegovernment-861a3ea09a6c5c29)
running 9 tests
test tests::test_calculate_xor_result ... ignored
test tests::test_chacha_known_vector ... ignored
test tests::test_chacha_properties ... ignored
test tests::test_dinoxor ... ignored
test tests::test_prepare_multiplication_table ... ignored
test tests::test_prepare_xor_truth_table ... ignored
test tests::test_spread_bits_to_bytes_registers ... ignored
test tests::bench_dinoxor ... bench: 7,095.95 ns/iter (+/- 111.96)
test tests::bench_xor ... bench: 6.29 ns/iter (+/- 0.20)
test result: ok. 0 passed; 0 failed; 7 ignored; 2 measured; 0 filtered out; finished in 3.49s
I was actually a little blown away by this. Our assembly, compiled by rustc, inline, as part of a greater Rust library, is only ~1,100 times slower than the plain ^ operator. If you go and actually read the God-forsaken things I am doing to these bits and where I'm putting them, you'll understand my amazement.
The asm! Macro
I left the code heavily commented for educational purposes.
use std::arch::asm;

pub fn dinoxor(x: u8, y: u8) -> u8 {
    let result: u8;

    unsafe {
        asm!(
            "mov x2, #0",                        // Initialize x2 to 0 (not used later)
            "eor v0.16b, v0.16b, v0.16b",        // Clear the contents of v0 (set all bits to 0)
            "mov w0, {x:w}",                     // Move the first operand byte into w0
            "bl {spread_bits_to_bytes}",         // Spread the first operand byte into the lower half of v2
            "mov w0, {y:w}",                     // Move the second operand byte into w0
            "bl {spread_bits_to_bytes}",         // Spread the second operand byte into the lower half of v2
                                                 // (shifting the previous value to the upper half)
            "bl {prepare_xor_truth_table}",      // Initialize the XOR truth table in v0
            "bl {prepare_multiplication_table}", // Initialize the multiplication table in v1
            "bl {calculate_xor_result}",         // Calculate and leave the XOR'd byte in w0
            x = in(reg) x,
            y = in(reg) y,
            spread_bits_to_bytes = sym spread_bits_to_bytes,
            prepare_xor_truth_table = sym prepare_xor_truth_table,
            prepare_multiplication_table = sym prepare_multiplication_table,
            calculate_xor_result = sym calculate_xor_result,
            lateout("w0") result,
            clobber_abi("C"),
        );
    }

    result
}
The asm! macro takes a list of assembly instructions as strings. Variables can be interpolated into the strings using the "{}" interpolation syntax. The operands, such as input/output registers and linked symbols for function calls, are placed after the list of instructions.
List of instructions.
"mov x2, #0",
"eor v0.16b, v0.16b, v0.16b",
"mov w0, {x:w}",
"bl {spread_bits_to_bytes}",
"mov w0, {y:w}",
"bl {spread_bits_to_bytes}",
"bl {prepare_xor_truth_table}",
"bl {prepare_multiplication_table}",
"bl {calculate_xor_result}",
Register assignments.
x = in(reg) x,
y = in(reg) y,
lateout("w0") result,
clobber_abi("C"),
This instructs the compiler to bind the variables x and y to registers holding the first two arguments passed into the function. lateout("w0") result then reads the return value out of w0 after the asm block runs, w0 being where a function's result lives per the aarch64 calling conventions. clobber_abi("C") tells the compiler that our bl calls may clobber any register the C ABI doesn't preserve, so it won't keep values it still needs in those registers across the asm block.
Linker instructions.
spread_bits_to_bytes = sym spread_bits_to_bytes,
prepare_xor_truth_table = sym prepare_xor_truth_table,
prepare_multiplication_table = sym prepare_multiplication_table,
calculate_xor_result = sym calculate_xor_result
These sym operands bind our helper functions' addresses to names we can branch to (bl) from our inline asm.
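To get a feel for why this is slow, here is a rough scalar model of what those helpers do, written in plain Rust (my sketch, not the actual NEON code): spread each bit of a byte into its own byte-sized lane, then resolve every bit pair through a lookup in an XOR truth table instead of a single eor.

```rust
// Scalar model (illustrative only): the real helpers do this with NEON
// vector registers instead of arrays.
fn spread_bits(x: u8) -> [u8; 8] {
    // One byte per bit, most significant bit first.
    core::array::from_fn(|i| (x >> (7 - i)) & 1)
}

fn model_dinoxor(x: u8, y: u8) -> u8 {
    // XOR truth table, indexed by 2*a + b for bits a and b.
    let truth = [0u8, 1, 1, 0];
    let (xs, ys) = (spread_bits(x), spread_bits(y));
    xs.iter()
        .zip(ys)
        .fold(0, |acc, (&a, b)| (acc << 1) | truth[(2 * a + b) as usize])
}
```

model_dinoxor(a, b) agrees with a ^ b for every pair of bytes; it just takes a comically scenic route to get there.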
I find this surprisingly simple when broken down as such. The Rust documentation is great but a little overwhelming if you just want to compile a lil assembly that calls a function with params.
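If you want a minimal, self-contained version of the pattern to play with, here's a sketch (a hypothetical add_one, not part of this crate) that interpolates one input and one output operand into a single instruction, with a plain-Rust fallback so it still compiles off aarch64:

```rust
fn add_one(x: u64) -> u64 {
    #[cfg(target_arch = "aarch64")]
    let out = {
        let r: u64;
        // {r} and {x} interpolate the operands declared after the template.
        unsafe { std::arch::asm!("add {r}, {x}, #1", x = in(reg) x, r = out(reg) r) };
        r
    };
    // Fallback so the sketch runs on other architectures too.
    #[cfg(not(target_arch = "aarch64"))]
    let out = x + 1;
    out
}
```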
ChaCha20
At the time I wrote this, when I would dig into any of my favorite applications' encryption algorithms, I would find that they were using ChaCha20. A long time ago when I looked at the Signal Protocol it was using ChaCha20 and other stuff. When I looked at OTRv4 it was using ChaCha20 and other stuff. ChaCha20 has replaced or is replacing anything that was or is still using AES I guess. Idk, I'm not a cryptographer.
I figured it would be useful to understand how stream ciphers work, especially ChaCha20. What better way to do so than writing your own implementation and making it slow af!
My implementation is on my Github. I'm not going to be able to explain it better than Wikipedia or the in-depth documentation provided in the source code, but I'll give it a shot.
You need to make a message written on an infinitely long fruit roll up impossible to read by anybody who doesn't have the secret password. You eat the fruit roll-up and poop it out in exactly the same form but the message that was written on it when you ate it needs to appear completely random to anyone who does not have the secret password. It's like a human centipede situation since fruit roll-ups are essentially plastic.
The fruit roll-ups get swallowed in 64 byte increments 🫃. You spell out your secret password by cutting pixie dust into a 32 byte super-secret password like "Lets just waste some more time !" and promptly snorting it, where it will be kept in your favorite neurotransmitters. The ones that communicate with your stomach and tell it how to digest things. You then decide on a "Completely random 12 byte" name for the message session and snort that too. You write down "I AM CHACHA20, TASTE MY MERCY" on a piece of paper and swallow that too.
Every time a 64 byte chunk of fruit roll-up hits your stomach, your neurotransmitters rotate all the letters from all the different inputs around in your stomach in such a way that it is impossible to tell what they used to say. You can only perform the same dance, using the same information, to reverse the process and make the message readable again. The encrypted fruit roll-up message is expelled from your anus to its intended receiver like some sort of fucked up delicious human centipede type beat.
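Metaphor aside, the "rotate all the letters around" step is ChaCha's quarter round, which mixes four 32-bit state words using nothing but additions, XORs, and rotations. A plain-Rust sketch of the quarter round as specified in RFC 8439 (not my implementation's exact code):

```rust
// One ChaCha quarter round (RFC 8439, section 2.1): an ARX construction,
// i.e. wrapping Add, Rotate, Xor.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(16);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(12);
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(8);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(7);
}
```

A ChaCha20 block applies 20 rounds of this, alternating over the columns and diagonals of the 4x4 state, then adds the result to the original state to squeeze out 64 bytes of keystream.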
If we substitute our dinoxor for the ^ in ChaCha20::process_with_dinoxor and benchmark it to see how noticeable such a change would be, we're met with pretty surprising results.
test tests::bench_chacha_known_vector_dinoxor ... bench: 1,588.69 ns/iter (+/- 10.42)
test tests::bench_chacha_known_vector_eor ... bench: 392.46 ns/iter (+/- 3.10)
Using dinoxor only slows down our encryption/decryption process by a factor of ~4!
In the future I'd like to research how to optimize the dinoxor algorithm even further and experiment with fucking up other perfectly sound processor instructions. Hope you have some fun with it.
dg4l <3