Generative Testing Inline Assembly in Rust

thechinesegovernment

      d888888b                         d888888b
   d888    8888b                    d888888   888b
 d88    88  898888b               d8888  888     88b
d8P        88888888b             d88888888888     b8b
88        8888888888             88888888888       88
88       88888888888             8888888888        88
98b     88888888888P             988888888        d8P
 988     888  8888P      _=_      9888898  88    88P
   9888   888888P      q(-_-)p       98888    888P
      9888888P         '_) (_`         9888888P
         88            /__/  \            88
         88          _(<_   / )_          88
        d88b        (__\_\_|_/__)        d88b

Recently, I learned what a small number of the RISC instructions provided by aarch64 processors actually do (I read like 100 pages of Azeria's book). Governments are constantly caught lacking by the security community. RC4 is still found obfuscating code in big hacks. RC4 is fast, and that's almost always the explanation given in the write-ups breaking down the reverse engineering of these tools. It always leaves me thinking, "Our computers are so much faster now than when you probably wrote this. Are you just copying source files from 1999 over and over again?" The answer is probably yes, but whatever.

How hard is it to implement ChaCha20? It turns out that, in Rust, it's not hard at all. Even the C code I was using as a reference seemed pretty simple, possibly simpler than the Rust implementation. Before we get into the encryption algorithms I implemented to benchmark my stupid little routines, let's learn about Rust's inline assembly feature.

Introduction

First, I wrote aarch64 assembly that obfuscates the eor instruction. You can find the source code on my GitHub if you'd like the C and assembly files. They are also available, in a less comprehensive package, in the reference directory of this project's GitHub repository.

There is great documentation provided by Rust's maintainers. It is comprehensive, which is a lot to take in at once for someone new to the feature. That's why I find it easier to only take what I need. Let's not dwell on the extraneous information and everything that could be done; we'll just focus on the parts we use.

src/lib.rs

#![feature(test)]
#[cfg(test)]
extern crate quickcheck;
#[cfg(test)]
#[macro_use(quickcheck)]
extern crate quickcheck_macros;

pub mod dinoxor;
pub mod chacha20;
#[cfg(test)]
mod tests;

We want our project to be a library, so that our dinoxor function can be called from other projects using our code as a dependency. We also want it to be easily shared via crates.io. It's already available, so if you'd like to use it in your 1337 anti-cheat VM, add a line for it under [dependencies] in your Cargo.toml file.

[dependencies]
thechinesegovernment = "*"

And use it like this:

use thechinesegovernment::dinoxor::dinoxor;

let result = dinoxor(0b11101011, 0b11111111);
assert_eq!(result, 0b10100);

This will eor ("Exclusive-OR", a.k.a. "XOR") two byte-sized values. It will not do so quickly. Quicker than your eyes can perceive, sure, but not quick relative to the single aarch64 eor instruction. I've added benchmarks so we can see exactly how much slower this inline assembly-powered obfuscation routine runs.

src/tests.rs

use crate::dinoxor::dinoxor;
use test::{black_box, Bencher}; // needs `extern crate test;` beside #![feature(test)] at the crate root

#[bench]
fn bench_xor(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(0xFF);

        (0x00..n).fold(0, |old: u8, new| old ^ new)
    });
}

#[bench]
fn bench_dinoxor(b: &mut Bencher) {
    b.iter(|| {
        let n = black_box(0xFF);

        (0x00..n).fold(0, |old: u8, new| dinoxor(old, new))
    });
}

Rust has great docs on what black_box does here. It basically prevents the compiler from optimizing away any code that uses that piece of data. Bencher is the structure that runs your code and keeps track of the stats. Above, we are benchmarking our obscured eor (dinoxor) against the regular eor (^).
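If you want to poke at the fold outside the bench harness, the same trick is available on stable as std::hint::black_box. A minimal sketch (mine, not from the project):

```rust
use std::hint::black_box;

fn main() {
    // Same fold the benchmark runs: XOR every value in 0..n together.
    // Without black_box, the compiler could constant-fold the answer at build time.
    let n = black_box(0xFFu8);
    let folded = (0x00..n).fold(0u8, |old, new| old ^ new);
    println!("{folded:#010b}"); // prints 0b11111111
}
```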

λ cargo bench
   Compiling thechinesegovernment v0.2.0 (/Users/tg/Projects/thechinesegovernment)
    Finished `bench` profile [optimized] target(s) in 0.55s
     Running unittests src/lib.rs (target/release/deps/thechinesegovernment-861a3ea09a6c5c29)

running 9 tests
test tests::test_calculate_xor_result ... ignored
test tests::test_chacha_known_vector ... ignored
test tests::test_chacha_properties ... ignored
test tests::test_dinoxor ... ignored
test tests::test_prepare_multiplication_table ... ignored
test tests::test_prepare_xor_truth_table ... ignored
test tests::test_spread_bits_to_bytes_registers ... ignored

test tests::bench_dinoxor ... bench:       7,095.95 ns/iter (+/- 111.96)
test tests::bench_xor     ... bench:           6.29 ns/iter (+/- 0.20)

test result: ok. 0 passed; 0 failed; 7 ignored; 2 measured; 0 filtered out; finished in 3.49s

I was actually a little blown away by this. Our assembly, compiled by rustc, inline, as part of a greater Rust library, is only ~1,100 times slower than the single ^ operation. If you go and actually read the God-forsaken things I am doing to these bits and where I'm putting them, you'll understand my amazement.

The asm! Macro

I left the code heavily commented for educational purposes.

pub fn dinoxor(x: u8, y: u8) -> u8 {
    let result: u8;

    unsafe {
        asm!(
            "mov x2, #0",                           // Initialize x2 to 0 (not used later)
            "eor v0.16b, v0.16b, v0.16b",           // Clear the contents of v0 (set all bits to 0)

            "mov w0, {x:w}",                        // Move the first operand byte into w0

            "bl {spread_bits_to_bytes}",            // Call the spread_bits_to_bytes function to load the first operand byte into the lower half of v2
            "mov w0, {y:w}",                        // Move the second operand byte into w0
            "bl {spread_bits_to_bytes}",            // Call the spread_bits_to_bytes function to load the second operand byte into the lower half of v2 
                                                    // (shifting the previous value to upper)

            "bl {prepare_xor_truth_table}",         // Call the prepare_xor_truth_table function to initialize the XOR truth table in v0
            "bl {prepare_multiplication_table}",    // Call the prepare_multiplication_table function to initialize the multiplication table in v1
            "bl {calculate_xor_result}",            // Call the calculate_xor_result function to calculate and store the XOR'd byte in w0.

            "mov {result:w}, w0",                   // Move the result from w0 to the result variable

            x = in(reg) x,
            y = in(reg) y,
            result = inout(reg) x => result,

            spread_bits_to_bytes = sym spread_bits_to_bytes,
            prepare_xor_truth_table = sym prepare_xor_truth_table,
            prepare_multiplication_table = sym prepare_multiplication_table,
            calculate_xor_result = sym calculate_xor_result
        );
    }

    result
}

The asm! macro takes a list of assembly instructions as strings. Variables can be interpolated into the strings using the "{}" interpolation syntax. The operands, such as input/output registers and linked symbols for function calls, are placed after the list of instructions.
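As a minimal sketch of just that interpolation syntax (my own toy, not part of dinoxor), here's the whole dance with a single instruction. The fallback is only there so the snippet compiles on non-aarch64 targets:

```rust
#[cfg(target_arch = "aarch64")]
fn plain_eor(x: u8, y: u8) -> u8 {
    let result: u8;
    unsafe {
        core::arch::asm!(
            // `:w` picks the 32-bit w-view of whatever register got allocated.
            "eor {result:w}, {x:w}, {y:w}", // result = x ^ y in one instruction
            x = in(reg) x,
            y = in(reg) y,
            result = out(reg) result,
        );
    }
    result
}

#[cfg(not(target_arch = "aarch64"))]
fn plain_eor(x: u8, y: u8) -> u8 {
    x ^ y // portable stand-in so the example runs anywhere
}

fn main() {
    assert_eq!(plain_eor(0b11101011, 0b11111111), 0b10100);
    println!("ok");
}
```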

List of instructions.

"mov x2, #0",
"eor v0.16b, v0.16b, v0.16b",

"mov w0, {x:w}",

"bl {spread_bits_to_bytes}",
"mov w0, {y:w}",  
"bl {spread_bits_to_bytes}",

"bl {prepare_xor_truth_table}",
"bl {prepare_multiplication_table}",
"bl {calculate_xor_result}",      

Register assignments.

x = in(reg) x,
y = in(reg) y,
result = inout(reg) x => result,

This instructs the compiler to bind the variables x and y to input registers of its choosing. The result operand is an inout: the compiler picks a register, initializes it with x, and when the block finishes, whatever value the assembly left in that register is written back out to the result variable. One caveat worth knowing: because the template bl's into other functions, the aarch64 calling convention lets those calls trash caller-saved registers, and a clobber_abi("C") clause is the usual way to tell the compiler exactly which registers that puts at risk.

Symbol operands.

spread_bits_to_bytes = sym spread_bits_to_bytes,
prepare_xor_truth_table = sym prepare_xor_truth_table,
prepare_multiplication_table = sym prepare_multiplication_table,
calculate_xor_result = sym calculate_xor_result

These bind our helper functions' memory locations to symbols branch-able from our inline asm.
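Here's a stripped-down sketch of that mechanic with a hypothetical helper (double is mine, not one of the project's functions). Argument in x0, bl through the sym, result back out of x0, per the aarch64 calling convention; again the fallback only exists so this compiles anywhere:

```rust
#[cfg(target_arch = "aarch64")]
extern "C" fn double(x: u64) -> u64 {
    x.wrapping_mul(2)
}

#[cfg(target_arch = "aarch64")]
fn call_helper(x: u64) -> u64 {
    let out: u64;
    unsafe {
        core::arch::asm!(
            "mov x0, {x}",   // first argument goes in x0 per the aarch64 ABI
            "bl {double}",   // branch-and-link to the Rust helper via its sym
            "mov {out}, x0", // return value comes back in x0
            x = in(reg) x,
            double = sym double,
            out = out(reg) out,
            clobber_abi("C"), // the call may trash caller-saved registers and lr
        );
    }
    out
}

#[cfg(not(target_arch = "aarch64"))]
fn call_helper(x: u64) -> u64 {
    x.wrapping_mul(2) // portable stand-in so the example runs anywhere
}

fn main() {
    assert_eq!(call_helper(21), 42);
    println!("ok");
}
```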

I find this surprisingly simple when broken down as such. The Rust documentation is great but a little overwhelming if you just want to compile a lil assembly that calls a function with params.

ChaCha20

At the time I wrote this, when I would dig into any of my favorite applications' encryption algorithms, I would find that they were using ChaCha20. A long time ago when I looked at the Signal Protocol it was using ChaCha20 and other stuff. When I looked at OTRv4 it was using ChaCha20 and other stuff. ChaCha20 has replaced or is replacing anything that was or is still using AES I guess. Idk, I'm not a cryptographer.

I figured it would be useful to understand how stream ciphers work, especially ChaCha20. What better way to do so than writing your own implementation and making it slow af!

My implementation is on my GitHub. I'm not going to be able to explain it better than the Wikipedia article or the in-depth documentation provided in the source code, but I'll give it a shot.

You need to make a message written on an infinitely long fruit roll-up impossible to read by anybody who doesn't have the secret password. You eat the fruit roll-up and poop it out in exactly the same form, but the message that was written on it when you ate it needs to appear completely random to anyone who does not have the secret password. It's like a human centipede situation, since fruit roll-ups are essentially plastic.

The fruit roll-ups get swallowed in 64 byte increments 🫃 (a ChaCha20 block). You spell out your secret password by cutting pixie dust into a 32 byte super-secret password like "Lets just waste some more time !" and promptly snorting it, where it will be kept in your favorite neurotransmitters. The ones that communicate with your stomach and tell it how to digest things. You then decide on a "completely random" 12 byte name for the message session and snort that too. You write down "I AM CHACHA20, TASTE MY MERCY" on a piece of paper and swallow that too.

Every time a 64 byte chunk of fruit roll-up hits your stomach, your neurotransmitters rotate all the letters from all the different inputs around in your stomach in such a way that it is impossible to tell what they used to say. You can only perform the same dance, using the same information, to reverse the process and make the message readable again. The encrypted fruit roll-up message is expelled from your anus to its intended receiver like some sort of fucked up delicious human centipede type beat.
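That letter-rotating stomach dance has a name: the quarter round. Stripped of all obfuscation, a minimal plain-Rust sketch of it looks like this (checked against the test vector in RFC 8439 section 2.1.1; the project's chacha20 module has its own version):

```rust
// One ChaCha20 quarter round: the add/xor/rotate dance from the analogy above,
// applied to four of the sixteen 32-bit words in the cipher state.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(16);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(12);
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(8);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(7);
}

fn main() {
    // Test vector from RFC 8439 section 2.1.1.
    let (mut a, mut b, mut c, mut d) = (0x11111111u32, 0x01020304u32, 0x9b8d6f43u32, 0x01234567u32);
    quarter_round(&mut a, &mut b, &mut c, &mut d);
    assert_eq!((a, b, c, d), (0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb));
    println!("quarter round matches RFC 8439");
}
```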

If we substitute our dinoxor for the ^ in ChaCha20::process_with_dinoxor and benchmark it to see how noticeable such a change would be, we're met with pretty surprising results.

test tests::bench_chacha_known_vector_dinoxor ... bench:       1,588.69 ns/iter (+/- 10.42)
test tests::bench_chacha_known_vector_eor     ... bench:         392.46 ns/iter (+/- 3.10)

Using dinoxor only slows down our encryption/decryption process by a factor of ~4!

In the future I'd like to research how to optimize the dinoxor algorithm even further and experiment with fucking up other perfectly sound processor instructions. Hope you have some fun with it.

dg4l <3