GoogleCTF2022 - eldar writeups

2022-07-03 4885 words 23 minutes

Contents

Note

Use the table of contents on the right to navigate to the section that you are interested in.

Preface

Another year another exotic VM reversing challenge from Google CTF. In my opinion, this year’s challenge is way more difficult and complex than last year, but it is also a lot more interesting as well. There are actually so many little delicate details in this challenge that I don’t think I can recall every single step that I did, but hopefully I can explain how everything works and my solution to solving them.

Introduction

Challenge Info

Given files: eldar, serial.c, Makefile.
Description: We found this app on a recent Ubuntu system, but the serial is missing from serial.c. Can you figure out a valid serial? Just put it into serial.c and run the app with make && make run.
Category: Reverse engineering
Summary: A VM reverse engineering challenge. This VM is created using ELF relocation metadata and symbol metadata. The flag is divided into 2 parts and each of them goes through a series of arithmetic operations and algorithms to reach the final check. Several z3 scripts are required to solve this challenge.

TL;DR

Analyze eldar file -> it’s just a simple file that takes a value serial from libserial.so that we can compile from serial.c, but the check is nowhere to be seen and instead the relocation section of the file is MASSIVE with tons of relocation metadata.
This paper and this talk provide many useful information about using ELF relocation metadata as a VM -> write a script to parse the relocation metadata out as VM instructions.
Analyze the VM code -> figure out that the first 16 bytes of serial and the last 12 bytes go through different calculations -> our goal is to make the output of the VM equals to 0.
Analyze the first part -> figure out that the first 16 bytes go through a shuffling algorithm to generate a 24 bytes buffer, each 2 bytes of serial correspond to 3 bytes of this buffer -> each byte of the buffer is multiplied by a constant and sum together several times with other negative constants -> the sum needs to be equal to 0 for the output to be 0.
Write a script to parse out the constants.
Write a z3 script to solve the 24-byte buffer -> bruteforce the first 16 bytes of serial using this result.
Analyze the second part -> figure out that the last 12 bytes of serial are treated somewhat the same as the 24-byte buffer, except that not only multiplications, the VM jumps to x86 shellcodes to do bitwise operations as well.
Write a script to parse out the multiplication constants from the VM code and the bitwise operations from the pieces of shellcodes.
Write a z3 script to solve the last 12 bytes of serial -> combine the 2 parts to get flag.

Analyzing the binary

The code in the binary itself is super simple. The binary is linked with libserial.so which we can build from the give source file serial.c. This lib’s only purpose is to provide the binary with a string serial in a buffer at address 0x404040, which is expected to be 29 bytes (28 bytes and a NULL byte). There is a value fail at address 0x404060 and if it’s equal to 0, our serial is the correct flag.

The problem here is that there is no other code in the binary, there is no function to check serial, no function to set fail. Instead, if we scroll down in IDA to the relocation section at 0x804000, we can see a huge region with an unsual amount of relocation metadata. These metadata are not processed by the binary itself, but instead by the system’s ELF Loader. So the suspicion here is that the binary actually uses the ELF Loader to perform checks on serial before executing itself, using those metadata.

By the sheer size of the metadata region, I suspected this to be a VM revering challenge (also because Google CTF loves this kind of weird machine challenges).

About the ELF relocation metadata machine

This technique of running a “weird” virtual machine using ELF relocation metadata and symbol metadata has been discovered and explained a long time a go in the paper “Weird Machines” in ELF: A Spotlight on the Underappreciated Metadata and the talk Programming Weird Machines with ELF Metadata (yes they were 10 years ago!). The idea is to use the ELF Loader as an interpreter to execute a weird VM with carefully crafted ELF relocation metadata as VM instructions and symbol metadata as registers.

In our binary eldar, those metadata is located in the section mapped at address 0x804000, which can be inspected in IDA. The symbol metadata table is located at 0x804064, while the relocation metadata is located at 0x80428A. I’d like to talk about relocation metadata first, because these are our VM instructions. Take a look at the structure of each relocation Elf64_Rela:

typedef struct {
    Elf64_Addr r_offset;
    Elf64_Xword r_info;
    Elf64_Sxword r_addend;
} Elf64_Rela;

#define ELF64_R_SYM(info)             ((info)>>32)
#define ELF64_R_TYPE(info)            ((Elf64_Word)(info))

By looking at a few of them, we can see that r_offset contains an address, r_info contains the type of relocation in the lower 32 bits and the index of a symbol in the higher 32 bits, r_addend contains a value, all of these fields are 64 bits in length. We can dump all of these metadata in a neat and pretty format using the command readelf -r eldar. By quickly looking through the relocation metadata, we can see that the binary uses only 4 types of relocation: R_X86_64_RELATIVE, R_X86_64_COPY, R_X86_64_64 and R_X86_64_32, so basically this VM only has 4 types of instruction. Since out binary doesn’t have PIE, the base address of it is considered to be 0. Let sym be the corresponding ELF64_R_SYM part of r_info, this is how each type of relocation works:

R_X86_64_RELATIVE : *r_offset = r_addend
R_X86_64_COPY : memcpy(r_offset, sym.value, sym.size)
R_X86_64_64 : *r_offset = sym.value + r_addend
R_X86_64_32 : *(DWORD*)r_offset = (DWORD)(sym.value + r_addend)

Note

IDA goes absolutely crazy and wrong when displaying the relocation metadata, I don’t really know why it happens, but it’s better to look at the output of readelf -r when inspecting those. Turns out IDA behaves this way because it tries to apply the relocations and because they are self-modifying, they will go all chaotic. Launching IDA with IDA_NORELOC=1 might fix it. Thanks @aaSSfxxx on Twitter for this info.

In this VM, symbols act as registers. By using readelf -s eldar, we can see that the VM uses 10 of them with the names from ^A to ^J:

Num:    Value          Size Type    Bind   Vis      Ndx Name
  0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
  1: 0000000000000000     8 OBJECT  LOCAL  DEFAULT    1 ^A
  2: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^B
  3: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^C
  4: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^D
  5: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^E
  6: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^F
  7: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^G
  8: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^H
  9: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^I
 10: 0000000000804000     8 OBJECT  LOCAL  DEFAULT    1 ^J

They are located at address 0x804064 as mentioned before, so by writing to this memory region, it can modify the values of its registers. Let’s take a look at the structure of each symbol metadata to understand it more:

typedef struct {
    Elf64_Word      st_name;
    unsigned char   st_info;
    unsigned char   st_other;
    Elf64_Half      st_shndx;
    Elf64_Addr      st_value;
    Elf64_Xword     st_size;
} Elf64_Sym;

This structure is 24 bytes in size. The first 64 bits contains st_name, st_info, st_other and st_shndx, these are not so important yet so I’ll explain them later when needed. Then the last 2 fields contain the value of the symbol (sym.value) and it’s size (sym.size). Therefore, the VM can use R_X86_64_RELATIVE to move an immediate value to a symbol or any place in memory, R_X86_64_64 or R_X86_64_32 to move the value (64 bits or 32 bits, respectively) of a symbol to another symbol or any place in memory, and R_X86_64_COPY to copy the data at the address in a symbol’s value to another place in memory, and by changing sym.size, it can control how many bytes to copy.

Note

Because the relocation metadata also reside in the same memory region, it can self-modify its “instructions” on the fly as well. This is actually the technique that this VM uses a lot.

After understanding how this ELF relocation metadata VM works, I proceeded to write a parser to parse out the instructions and format them in a way that is easy to read.

Writing a VM parser

Because of the self-modifying nature of this VM, a static code parser is not good enough. Instead, we need to write a parser that also has an emulator inside it, to emulate the execution of the VM at each instruction while parsing them. Luckily for us, there is no jump instruction in this VM, so we can simply emulate and parse the instructions linearly one by one. I used the serial input value of abcdefghijklmnopqrstuvwxyz{\} to easily recognize in the code where our input is used.

For the instructions, I decided to print them in the form of <dest> <- <data>. For the symbol registers. I name each of their 8 bytes as FLAG_X, VAL_X and LEN_X, with X being the name of the symbols from A to J. I also associate their exact address in the memory with them for pretty printing:

dynsym = 0x80406c
regs = {}

for i in range(1, 11):
    regs[dynsym + i*24] = "VAL_" + chr(ord('A') + i - 1)
    regs[dynsym + 8 + i*24] = "LEN_" + chr(ord('A') + i - 1)
    regs[dynsym - 8 + i*24] = "TYPE_" + chr(ord('A') + i - 1)

Below is the main part of the parser code in parse_opcode.py.

while pc < 0xa5a000:
    print("{:08x}: ".format(pc), end="")

    r_offset = read64(pc)
    r_type = read64(pc+8) & 0xffff
    r_sym = read64(pc+8) >> 32
    r_addend = read64(pc+16)

    print_offset = hex(r_offset)[2:]
    print_addend = hex(r_addend)[2:]
    if r_offset in regs:
        print_offset = regs[r_offset]
    if r_addend in regs:
        print_addend = "&" + regs[r_addend]

    sym_addr = dynsym + r_sym*24

    if r_type == 8:
        write64(r_offset, r_addend)
        print("{} \t<- {}".format(print_offset, print_addend))
    elif r_type == 5:
        sym_len = read64(sym_addr + 8)
        writen(r_offset, r_addend + readn(read64(sym_addr), sym_len), sym_len)
        print("{} \t<- [{} + {}]".format(print_offset, regs[sym_addr], print_addend))
    elif r_type == 1:
        write64(r_offset, r_addend + read64(sym_addr))
        print("{} \t<- {} + {}".format(print_offset, regs[sym_addr], print_addend))
    elif r_type == 0xa:
        write32(r_offset, r_addend + read64(sym_addr))
        print("{} \t<- DWORD({} + {})".format(print_offset, regs[sym_addr], print_addend))
    elif r_type == 0:
        break
    else:
        print("UNKNOWN R_TYPE")
        break

    pc += 24

The parser outputs a massive file with 100,000+ lines of instructions. Although this is understandable because the VM can only do assignments and additions, without any control flow instructions, so it will have to unroll every loop and also each mathematical operation takes many instructions to execute. This is a small part of the parsed code for example, reading out input at 0x404040:

...
00806372: VAL_B 	<- 404040
0080638a: LEN_B 	<- 1
008063a2: VAL_G 	<- [VAL_B + 0]
008063ba: VAL_B 	<- &VAL_G
008063d2: LEN_B 	<- 8
008063ea: 806412 	<- [VAL_B + 0]
00806402: VAL_D 	<- VAL_D + 804061
0080641a: 806412 	<- 0
00806432: 8040cd 	<- DWORD(VAL_A + 0)
0080644a: VAL_B 	<- 8042ba
00806462: VAL_F 	<- [VAL_B + 0]
0080647a: VAL_B 	<- &VAL_D
00806492: 8064ba 	<- [VAL_B + 0]
008064aa: VAL_E 	<- VAL_D + 61
008064c2: 8064ba 	<- 0
...

Note

There is a lot of 804000 <- 0 instructions in the code, this is basically just a NOP instruction.

Attempting to one-shot with Z3

As usual with VM flag checker reversing challenges, with the VM code emulator in hand, I decided to try injecting z3 variables to the input and let the emulator solve itself (just like what I did with Google CTF 2021 VM challenge). However, things are not as easy this time. There are some problems that prevents this approach to success:

The first algorithm in the VM code uses out input as index to memory, both on ther read side and on the write side. Unfortunately, z3 variables can’t be used as array index, and using z3 Array() to model the whole VM memory is not feasible at all.
The final part of the VM code actually jumps to x86 shellcode to execute them, then jump back to VM code (will explain later). A lot more work would be needed to parse and emulate those code as well.
There are way too many operations that z3 might not be able to solve by itself without manual simplification (not so sure about this one).

Therefore, in the end, I had to go back to the good old way of doing proper reversing on the VM code to understand its algorithm.

Debugging the VM

As I said, the VM code is quite complex, so statically reading the code and writing scripts are error-prone methods. I needed a way to verify if my assumptions are correct or not. To do this, the easiest way is to debug the VM code. However, this VM is special: relocation metadata are processed by the ELF Loader, not by the binary itself, so debugging the binary is useless because the VM already finishes executing the moment the binary starts to run.

The first debugging method that comes to mind is to debug the loader itself. We can see what loader the binary use with the command ldd eldar, in my case it’s /lib64/ld-linux-x86-64.so.2. Then we can run /lib64/ld-linux-x86-64.so.2 ./eldar to run the loader on the binary and debug it this way. However, the loader is a very big program that does multiple different things, not just processing relocation metadata. This method for me is purely pain and suffering to even try.

The method that I used is that I wrote a simple script to patch the binary and remove parts of relocation metadata. For example, if I want to debug the VM up to the instruction at address 0x956c42, I can just remove all metadata from address 0x956c5a onwards, then set a breakpoint at the binary’s main and inspect the memory of the VM region. This is indeed quite time-consuming, but it’s super easy to do.

data = open("eldar", "rb").read()

start = 0x956c5a - 0x800000
end = 0xa59d3a - 0x800000

new_data = data[:start] + b"\0"*(end-start) + data[end:]

open("eldar-test", "wb").write(new_data)

Now that we have all the tools ready to reverse the code, let me try to explain its algorithms part by part.

Analyzing the first shuffling algorithm

By searching for the address of out input bytes from 0x404040 to 0x40405b, I realized that serial is divided into 2 parts that get checked separately: the first 16 bytes and the last 12 bytes. The first 16 bytes go through 2 stages of calculations and this part will explain the first.

The first algorithm to be applied on the first 16 bytes starts at address 0x804ae2 and ends at 0x97eb2a. Here I’ll try to explain a small bit of code at the beginning of it, then the rest of the code is analyzed using the same method (most of them are just unrolled loops, so they have very similar patterns).

At the beginning, we can see that the code initializes values from 0x00 to 0xff to an int64 array starting at 0x8042ba. Then comes the main part:

00806372: VAL_B 	<- 404040
0080638a: LEN_B 	<- 1
008063a2: VAL_G 	<- [VAL_B + 0]
008063ba: VAL_B 	<- &VAL_G
008063d2: LEN_B 	<- 8
008063ea: 806412 	<- [VAL_B + 0]
00806402: VAL_D 	<- VAL_D + 804061
0080641a: 806412 	<- 0
00806432: 8040cd 	<- DWORD(VAL_A + 0)
0080644a: VAL_B 	<- 8042ba
00806462: VAL_F 	<- [VAL_B + 0]
0080647a: VAL_B 	<- &VAL_D
00806492: 8064ba 	<- [VAL_B + 0]
008064aa: VAL_E 	<- VAL_D + 61
008064c2: 8064ba 	<- 0
008064da: VAL_B 	<- &VAL_E
008064f2: 80651a 	<- [VAL_B + 0]
0080650a: VAL_E 	<- VAL_E + c2
00806522: 80651a 	<- 0
0080653a: 806562 	<- [VAL_B + 0]
00806552: VAL_E 	<- VAL_E + 184
0080656a: 806562 	<- 0
00806582: VAL_E 	<- VAL_E + 8042ba
0080659a: 8042ba 	<- [VAL_E + 0]
008065b2: 8065ca 	<- [VAL_B + 0]
008065ca: 8045c2 	<- VAL_F + 0
008065e2: VAL_B 	<- 8042c2
008065fa: 806622 	<- [VAL_B + 0]
00806612: VAL_D 	<- VAL_D + 1
0080662a: 806622 	<- 0
...

Our serial is at address 0x404040 and contains the value abcdefghijklmnopqrstuvwxyz{|}, so this code starts out by reading the first byte of our input to VAL_G. Because VAL_G initial value is 0x804000, it becomes 0x804061. Then the value of VAL_G is written to address 0x806412. If we take a look at the addresses of the instructions, 0x806412 is actually an address in the nearby code, that’s why we can see the VAL_G’s value 0x804061 appears on the instruction at address 0x806402. The code actually modifies the r_addend of the next instruction’s metadata to load the value of VAL_G into VAL_D. Then it uses a DWORD assignment to clear out the upper bytes of VAL_D, leaving only 0x61, which is the first byte of our serial. Next, VAL_D + 0x61 is loaded into VAL_E and VAL_E is added to itself 2 more times, which in the end means VAL_E = serial[0] * 8. It’s then added to 0x8042ba, which is the beginning of the int64 array initialized earlier. All these works just simply equivalent to getting the address of int64_array[serial[0]], because each element of the array is 8 bytes in size, hence the multiplication by 8 is used to retrieve the exact address. The final few instructions basically swap the value of int64_array[serial[0]] with int64_array[0].

The code then continues to process address 0x404041, then back to 0x404040 and then 0x404041 again. By searching for the value 404040 in the parsed code, we can see that this “loop” is executed 128 times, some calculations are done after that, and then it proceeds to do the same with 0x404042 and 0x404043. Finally, these loops will end with 0x40404e and 0x40404f. Using the same analyzing method as above, I figured out that this part of the code uses this swapping algorithm to generate a 24-byte buffer at address 0x804aca, with each 2 bytes of the serial generate 3 bytes of this buffer. The algorithm is rewritten in python in logic_1.py:

serial = b"abcdefghijklmnopqrstuvwxyz{|}"
wtf = []

def swap(arr, a, b):
    tmp = arr[a]
    arr[a] = arr[b]
    arr[b] = tmp

for i in range(0x8):
    x = 0
    mem = list(range(0x100))
    tmp = [0] * 3
    for j in range(128):
        x = (x + serial[2*i]) & 0xff
        swap(mem, 2*j, x)
        x = (x + mem[2*j+1]) & 0xff

        x = (x + serial[2*i+1]) & 0xff
        swap(mem, 2*j+1, x)
        x = (x + mem[(2*j+2) & 0xff]) & 0xff

    x = 0
    for j in range(3):
        x = (x + mem[j]) & 0xff
        swap(mem, j, x)
        tmp[j] += mem[(mem[j] + mem[x]) & 0xff]
    
    wtf += tmp

I am not an algorithm person or a math guy, so I don’t really know if this algorithm is reversible or not. However, because of the nature of the 2 bytes to 3 bytes mapping, if we know the value of the 24-byte buffer, we can simply bruteforce the first 16 bytes of serial 2 bytes at once, without reversing this algorithm.

Therefore, the goal here is to calculate the correct value for this 24-byte buffer, let’s proceed to see how this buffer is checked.

Analyzing the 24-byte buffer checking algorithm

This part starts at address 0x97eb72 and ends at 0xa29092.

0097eb72: VAL_G 	<- fffffffffffa00b1
0097eb8a: VAL_B 	<- 804aca
0097eba2: LEN_B 	<- 1
0097ebba: VAL_E 	<- [VAL_B + 0]
0097ebd2: VAL_F 	<- 0
0097ebea: VAL_B 	<- &VAL_F
0097ec02: LEN_B 	<- 8
0097ec1a: 97ec42 	<- [VAL_B + 0]
0097ec32: VAL_F 	<- VAL_F + 0
0097ec4a: 97ec42 	<- 0
0097ec62: VAL_B 	<- &VAL_E
0097ec7a: 97eca2 	<- [VAL_B + 0]
0097ec92: VAL_F 	<- VAL_F + e2
0097ecaa: 97eca2 	<- 0
0097ecc2: VAL_B 	<- &VAL_F
0097ecda: 97ed02 	<- [VAL_B + 0]
0097ecf2: VAL_F 	<- VAL_F + e2
0097ed0a: 97ed02 	<- 0
0097ed22: 97ed4a 	<- [VAL_B + 0]
0097ed3a: VAL_F 	<- VAL_F + 1c4
0097ed52: 97ed4a 	<- 0
0097ed6a: 97ed92 	<- [VAL_B + 0]
0097ed82: VAL_F 	<- VAL_F + 388
0097ed9a: 97ed92 	<- 0
0097edb2: VAL_B 	<- &VAL_E
0097edca: 97edf2 	<- [VAL_B + 0]
0097ede2: VAL_F 	<- VAL_F + e2
0097edfa: 97edf2 	<- 0
0097ee12: VAL_B 	<- &VAL_F
0097ee2a: 97ee52 	<- [VAL_B + 0]
0097ee42: VAL_F 	<- VAL_F + 7f2
0097ee5a: 97ee52 	<- 0
0097ee72: VAL_B 	<- &VAL_E
0097ee8a: 97eeb2 	<- [VAL_B + 0]
0097eea2: VAL_F 	<- VAL_F + e2
0097eeba: 97eeb2 	<- 0
0097eed2: VAL_B 	<- &VAL_F
0097eeea: 97ef12 	<- [VAL_B + 0]
0097ef02: VAL_F 	<- VAL_F + 10c6
0097ef1a: 97ef12 	<- 0
0097ef32: 97ef5a 	<- [VAL_B + 0]
0097ef4a: VAL_F 	<- VAL_F + 218c
0097ef62: 97ef5a 	<- 0
0097ef7a: VAL_B 	<- &VAL_E
0097ef92: 97efba 	<- [VAL_B + 0]
0097efaa: VAL_F 	<- VAL_F + e2
0097efc2: 97efba 	<- 0
0097efda: VAL_B 	<- &VAL_F
0097eff2: 97f01a 	<- [VAL_B + 0]
0097f00a: VAL_F 	<- VAL_F + 43fa
0097f022: 97f01a 	<- 0
0097f03a: VAL_B 	<- &VAL_E
0097f052: 97f07a 	<- [VAL_B + 0]
0097f06a: VAL_F 	<- VAL_F + e2
0097f082: 97f07a 	<- 0
0097f09a: VAL_B 	<- &VAL_F
0097f0b2: 97f0da 	<- [VAL_B + 0]
0097f0ca: VAL_G 	<- VAL_G + 88d6
...

Quite a lengthy bit of code, but it’s actually super simple. Firstly, a negative int64 is loaded into VAL_G, then 1 byte is loaded from address 0x804aca (which is the start of the 24-byte buffer) to VAL_F. VAL_F is then continuously added to itself until it reaches the value of 0x88d6, then it’s added to VAL_G. This is basically just a multiplication of the first byte of the 24-byte buffer with a constant and adding the result to VAL_G. We can simply get the multiplication constant by doing 0x88d6 / 0xe2 = 155. This multiplication pattern repeats itself 24 times for all the bytes, each with a different constant, then VAL_G is finally added to VAL_D. After that, VAL_G is loaded with a new negative int64 and the whole big process repeats again, also for 24 times. In the end, we want VAL_D to be 0, so this is basically just a system of 24 equations that must equal to 0, with 24 variables to solve. Suppose the 24-byte buffer is called v, constants are aX, all the equations are in the form of:

a0 * v[0] + a1 * v[1] + ... + a23 * v[23] == 0xffffffffffxxxxxx

I wrote a script to parse out all the constants (parse_consts_1.py), then plug all the equations into z3 to solve. It took a while for z3 to solve, but it’s definitely solvable. The z3 script is solve_1.py:

from z3 import *

s = Solver()

V_LEN = 24
v = [0] * V_LEN
for i in range(V_LEN):
    v[i] = BitVec("v_{}".format(i), 64)
    s.add(And(v[i] >= 0x00, v[i] <= 0xff))

mults = [
    [155, 190, 208, 123, 214, 146, 211, 85, 87, 251, 122, 60, 237, 27, 63, 207, 33, 138, 51, 103, 58, 68, 120, 111],
    [254, 21, 157, 204, 56, 134, 79, 115, 9, 91, 46, 12, 34, 45, 200, 131, 184, 180, 5, 100, 61, 59, 52, 249],
    ...
    [58, 101, 2, 202, 6, 250, 222, 87, 212, 151, 179, 72, 235, 228, 61, 240, 179, 164, 175, 192, 41, 62, 50, 58]
]

consts = [0xfffffffffffa00b1, 0xfffffffffffa608a, 0xfffffffffffa0a1e, 0xfffffffffffa7510, 0xfffffffffff9b125, 0xfffffffffff9e59c, 0xfffffffffff85543, 0xfffffffffff99fda, 0xfffffffffff967ac, 0xfffffffffffa22b9, 0xfffffffffffa49e3, 0xfffffffffff86f81, 0xfffffffffffb7d48, 0xfffffffffff8a759, 0xfffffffffff88a1d, 0xfffffffffff9f131, 0xfffffffffffa59be, 0xfffffffffff78985, 0xfffffffffff9924c, 0xfffffffffff70587, 0xfffffffffff9ac6e, 0xfffffffffff92853, 0xfffffffffffa0636, 0xfffffffffff8dda8]
sum = 0

for i in range(24):
    x = consts[i]
    for j in range(24):
        x = (x + v[j] * mults[i][j]) & 0xffffffffffffffff
    sum = sum + (x & 0xffffff)

s.add(sum == 0)

sol_count = 0
solutions = []

print("Checking")

while sol_count < 5 and s.check() == sat:
    model = s.model()
    result = [model[v[i]].as_long() for i in range(V_LEN)]
    print(result)
    solutions.append(result)
    sol_count += 1

    print("sol_count=", sol_count)

    cond = True
    for i in range(1, V_LEN):
        cond = And(cond, v[i] == result[i])

    s.add(Not(cond))

Then, with the 24-byte buffer ready, I wrote another script to bruteforce the first 16 bytes of the serial (brute_1.py):

wtf = [134, 72, 8, 237, 30, 49, 89, 229, 232, 232, 228, 17, 242, 81, 243, 1, 225, 114, 46, 224, 109, 91, 103, 182]

def swap(arr, a, b):
    tmp = arr[a]
    arr[a] = arr[b]
    arr[b] = tmp

def one_round(a, b):
    x = 0
    mem = list(range(0x100))
    tmp = [0] * 3
    for j in range(128):
        x = (x + a) & 0xff
        swap(mem, 2*j, x)
        x = (x + mem[2*j+1]) & 0xff

        x = (x + b) & 0xff
        swap(mem, 2*j+1, x)
        x = (x + mem[(2*j+2) & 0xff]) & 0xff

    x = 0
    for j in range(3):
        x = (x + mem[j]) & 0xff
        swap(mem, j, x)
        tmp[j] += mem[(mem[j] + mem[x]) & 0xff]
    
    return tmp

for i in range(0, len(wtf), 3):
    for a in range(0x21, 0x7f):
        for b in range(0x21, 0x7f):
            tmp = one_round(a, b)
            if tmp[0] == wtf[i] and tmp[1] == wtf[i+1] and tmp[2] == wtf[i+2]:
                print(chr(a) + chr(b), end='')

And that’s the first part of the flag solved: CTF{H0p3_y0u_l1k

Analyzing the last 12 bytes of serial checking algorithm

This part starts at address 0xa29152 to the end. We are immediately greeted with a very bizarre and unusual pattern right at the start:

00a29152: a290c2 	<- 5500404040c7c748
00a2916a: a290ca 	<- c7e87d8948e58948
00a29182: a290d2 	<- 45c700000000fc45
00a2919a: a290da 	<- 8b3beb00000000f8
00a291b2: a290e2 	<- 458b48d06348fc45
00a291ca: a290ea 	<- 3c00b60fd00148e8
00a291e2: a290f2 	<- 6348fc458b147620
00a291fa: a290fa 	<- d00148e8458b48d0
00a29212: a29102 	<- b807767e3c00b60f
00a2922a: a2910a 	<- b805eb00000001
00a29242: a29112 	<- 4583f84509000000
00a2925a: a2911a 	<- bf7e1bfc7d8301fc
00a29272: a29122 	<- 8b48d06348fc458b
00a2928a: a2912a 	<- b60fd00148e845
00a292a2: a29132 	<- 458bf84509c0b60f
00a292ba: a2913a 	<- c3c35d9848f8
00a292d2: VAL_C 	<- a290c2
00a292ea: TYPE_C 	<- 1000a0000001a
00a29302: VAL_E 	<- VAL_C
00a2931a: VAL_B 	<- a59caa
00a29332: LEN_B 	<- 90
00a2934a: a290c2 	<- [VAL_B + 0]

A bunch of seemingly gibberish data is copied to an empty memory region. Then the start address of that region is assigned to VAL_C. The next 2 instructions are the weird ones: TYPE_C is actually changed to something different, and its value is immediately moved to VAL_E, which doesn’t really make any sense. There are 78 occurences of this pattern, and by debugging them, it doesn’t seem like VAL_C is assigned to VAL_E at all. Also this new type of C is not recognizable by readelf, as it says <OS specific>: 10.

So what’s happening here? Relying on intuition alone, I suspected that these gibberish data are actually x86 shellcode, and the change of TYPE_C somehow allows for running these shellcodes within the relocation VM. I made a little change to the instructions parsing script to dump out all these “shellcodes”, and throw the dump into IDA. And indeed, these are x86 code!

Note

Turns out, the 0xa in 0x1000a0000001a of new TYPE_C means that C now has IFUNC type. When invoking a relocation with type R_X86_64_64 on an IFUNC symbol, the function where its value is pointing at will be executed, and the return value will be saved to r_offset, as described here. Thanks @NopNopGoose on Twitter for this info.

...
elif r_type == 1:
    if read64(0x8040ac) == 0x1000a0000001a and r_sym == 3:
        shell_len = read64(pc+24*2+16)
        shell_addr = read64(0x8040b4)
        shellcode.write(bytes(mem[shell_addr:shell_addr+shell_len]))
        write64(r_offset, 0)
        print("{} \t<- SHELLCODE".format(print_offset))
    else:
        write64(r_offset, r_addend + read64(sym_addr))
        print("{} \t<- {} + {}".format(print_offset, regs[sym_addr], print_addend))
...

There are 78 of them in total. The first one is used to check if all the bytes in serial are printable and if the length is equal to 28. The last one is used to check if VAL_D is equal to 0 or not. The rest are simply one bitwise arithmetic instruction on VAL_E with another constant (including and, or, xor, shl, shr, rol and ror). Because the format of those x86 bitwise arithmetic instructions are very similar, i can write a script to parse them out.

All the other codes in this section follow the similar pattern to the constant multiplication used in the last section. It means that to check the last 12 bytes of serial, in each equation, the VM uses constant multiplication on some of the bytes and bitwise shellcode on the others. The goal is still to make VAL_D equals to 0. The script parse_consts_2.py is written to dump all of the constants out. There are 17 equations in total, here is an example:

76
104
AND 159
ROL 86
OR 71
SHR 0
84
3
SHL 2
OR 184
30
233

The lines that don’t contain a keyword simply mean multiplication. With these constant values all parsed out, I could write another z3 script to solve for the 2nd part of serial (solve_2.py). The result is 3_3LF_m4g1c}.

Combining the 2 parts, the complete value of serial is our flag: CTF{H0p3_y0u_l1k3_3LF_m4g1c}

Appendix

parse_opcode.py - the script to parse and pretty print VM instructions
patch.py - the script to remove some part of the relocation metadata to help debugging
parse_consts_2.py and parse_consts_2.py - the scripts to parse multiplication (and bitwise) constants for the 1st and 2nd part of the flag
logic_1.py and logic_2.py - the VM code checking logic for each part of the flag, rewritten in python
solve_1.py and solve_2.py - the z3 scripts to solve the 24-byte buffer and the 2nd part of the flag
brute_1.py - the script to brute fore the 1st part of the flag from the 24-byte buffer