
Unpacking packed executables

2025-04-26 08:00:00

This post is part of a series on retrowin32.

Demoscene programs are sometimes packed with an executable compressor like UPX. An EXE packer ingests an EXE and emits a smaller one that does the same thing. This is useful for meeting demo competition file size limits. (Here are a few 4kb demos from last week. 4kb, including the music! Just the text of this blog post is nearly twice that!)

At a high level, a packed executable is a tiny program that, when run, uncompresses the real program into memory and then runs it. From the perspective of an emulator, packed executables are the same as regular executables; they are all just programs that eventually make some system calls.

But from the perspective of debugging what has gone wrong when running a packed executable, all of the interesting executable state — including even which system functions it calls — is only something that shows up after the executable has unpacked itself. For this reason it is useful to be able to unpack a packed executable, reconstructing the original program, which you can then feed into a program analyzer like Ghidra. (Tools like UPX even provide unpacking themselves, much like a zip utility also provides unzip.)

But what if your program isn't packed with UPX? Or (what I have encountered more) what if it's packed with an old enough version of UPX that current UPX will no longer unpack it?

An emulator can treat the unpacking logic as a black box and just execute it. If you pause the program right after the unpacking finishes, you can grab the in-memory unpacked state and dump it back into a new exe. This exe then looks like an ordinary program, ready for analysis.

I recently implemented just this idea in retrowin32 and successfully dumped at least one packed exe. This post talks about how it works.

Problem one: finding main

The first step is figuring out the execution point to dump from. Ideally you'd grab the memory state after the executable has unpacked but before any of its main code runs, but where exactly is that?

A packed exe looks something like:

entry_point:
  ; ...some loop to unpack things here
  jmp some_address  ; jmp to the unpacked code

If you load a packed exe into a tool like Ghidra, the jmp looks like it's jumping to a garbage address, because the code on the other side of that jump only exists after the unpacking logic runs. I think you might be able to automatically spot it because of that.

For now, since I'm deep into analyzing these executables in the first place, suppose I just manually identify the some_address address that we want to break on.

The next problem is that you can't just set an ordinary breakpoint at that point. A software breakpoint works by patching over the target with an int3 instruction, but if we set one of those at startup it gets overwritten by the unpacking loop. So instead, I can set a breakpoint at the last instruction of the unpacking code (the jmp in the above example) and then single step once (to follow the jmp).
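To make that concrete, here's a minimal sketch of the idea against a made-up emulator interface (the trait and method names are stand-ins, not retrowin32's actual API):

/// The operations we assume the emulator exposes (hypothetical
/// names; retrowin32's real interface differs).
trait Emulator {
    fn add_breakpoint(&mut self, addr: u32);
    fn run(&mut self);         // run until a breakpoint is hit
    fn single_step(&mut self); // execute exactly one instruction
    fn eip(&self) -> u32;      // current instruction pointer
}

/// Run a packed exe up through its final jmp/ret, returning the
/// unpacked program's real entry point.
fn run_until_unpacked(emu: &mut dyn Emulator, last_unpack_instr: u32) -> u32 {
    // This breakpoint survives: the unpacking loop writes over the
    // packed payload, not over its own code.
    emu.add_breakpoint(last_unpack_instr);
    emu.run();
    // Step once to follow the jmp (or ret) into the unpacked code.
    emu.single_step();
    emu.eip()
}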

This will also work with another type of packer I've seen, which generates code like:

entry_point:
  ; ...some loop to unpack things here,
  ; including:
  push some_address
  ; ...more asm
  ret  ; jmps to some_address

It's easier to set a breakpoint on the ret than it is to find which address it happens to jump to.

From that state I can then create a .exe file by dumping the current memory state, with the exe's entry_point properly set, and everything works... except for one important piece.
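As a taste of what "entry_point properly set" means in practice, here's a hedged sketch of patching a dumped image's header. The offsets come from the PE format (AddressOfEntryPoint lives 16 bytes into the optional header); real code would validate the headers first:

/// Overwrite a PE image's AddressOfEntryPoint with `new_entry_rva`.
/// Simplified: assumes `image` begins with valid DOS and PE headers
/// and skips all bounds and signature checking.
fn set_entry_point(image: &mut [u8], new_entry_rva: u32) {
    // e_lfanew, at offset 0x3c in the DOS header, points at "PE\0\0".
    let pe_off = u32::from_le_bytes(image[0x3c..0x40].try_into().unwrap()) as usize;
    // PE signature (4 bytes) + file header (20 bytes) = start of the
    // optional header; AddressOfEntryPoint is 16 bytes into it.
    let entry = pe_off + 4 + 20 + 16;
    image[entry..entry + 4].copy_from_slice(&new_entry_rva.to_le_bytes());
}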

Problem two: reconstructing imports

To minimize the file size, a packed executable only declares dependencies on a minimal set of system functions. The unpacking process decompresses a list of all the real dependencies of the underlying executable, and it dynamically loads those dependencies so that they can be called once the underlying executable starts.

For the purposes of offline static analysis, we need these dynamic dependencies resolved. We want to load our executable into Ghidra and have it understand which system functions it calls.

To understand how to manage this, you need to understand a little bit about how regular programs resolve imports. I wrote about this earlier, including some pictures; I'll summarize the relevant parts here.

A regular program that calls system functions like GetStartupInfoA() will call via the Import Address Table ("IAT"). In assembly this looks like:

call [imp_GetStartupInfoA]  ; where imp_GetStartupInfoA is a fixed address

That is, each call site says "call the address found at this fixed address, a point within the IAT". The executable loader reads a list of named imports (called the Import Directory Table ("IDT")) and fills in the IAT with the resolved addresses.

A packed executable's IDT is nearly empty because all the interesting work happens at unpack time. But the underlying executable that the packer started with had its own IAT; the packer fills it in as part of its unpacking. To reverse this, in our unpacked executable we need to reconstruct our own IDT that points at the IAT such that Ghidra believes it.
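For reference, each IDT entry is a fixed 20-byte record, IMAGE_IMPORT_DESCRIPTOR in the PE headers; reconstructing the IDT means emitting one of these per imported DLL, plus the name strings and thunk arrays they point at. In Rust terms the layout is:

/// One Import Directory Table entry (IMAGE_IMPORT_DESCRIPTOR).
/// The table is terminated by an all-zero entry.
#[repr(C)]
struct ImportDirectoryEntry {
    /// RVA of the import lookup table, which names each function.
    original_first_thunk: u32,
    time_date_stamp: u32,
    forwarder_chain: u32,
    /// RVA of the DLL's name, e.g. "ddraw.dll".
    name: u32,
    /// RVA of the IAT that the loader fills with resolved addresses.
    first_thunk: u32,
}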

Where is that IAT? We don't know. From the emulator's perspective, the unpacking code does a bunch of grinding, eventually with a series of calls like:

hmodule = LoadLibrary("ddraw.dll")
func = GetProcAddress(hmodule, "DirectDrawCreate")
... stash func in the IAT of the in-memory unpacked executable

Unfortunately it's not obvious how to follow that "stash" operation. (In writing this post, I had the idea that maybe we could observe all memory writes?) But we can observe these calls to GetProcAddress and record which addresses we vended out, and then at dumping time we can scan the program's memory to find where those values ended up. (This is similar to how Scylla, an exe dumping tool, attempts to solve this problem. But Scylla doesn't get to cheat by using an emulator.)
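A sketch of that scan (the types here are made up for illustration; the real retrowin32 code is organized differently): record every address GetProcAddress vends, then look for those exact 4-byte values in memory.

use std::collections::HashMap;

/// Addresses vended by GetProcAddress, mapped back to (dll, function)
/// so the reconstructed IDT can name them.
type Vended = HashMap<u32, (String, String)>;

/// Scan memory for vended addresses; each hit is (we assume) a slot
/// in the unpacked executable's IAT.
fn find_iat_slots(mem: &[u8], vended: &Vended) -> Vec<(u32, (String, String))> {
    let mut slots = Vec::new();
    // IAT entries are 4-byte values; assume they are 4-byte aligned.
    for (i, chunk) in mem.chunks_exact(4).enumerate() {
        let value = u32::from_le_bytes(chunk.try_into().unwrap());
        if let Some(import) = vended.get(&value) {
            slots.push(((i * 4) as u32, import.clone()));
        }
    }
    slots
}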

We expect each value to show up in memory exactly once, in the unpacked executable's IAT. (What if a value happens to be found in memory more than once due to accidentally colliding with some unrelated sequence of bytes? This hasn't come up yet, but one idea I had is that I could emulate the unpack sequence twice with different memory layouts such that the GetProcAddress calls return different values, and then compare the two for overlap.)

Once we have the IAT address, we can construct an IDT that tells the loader which functions correspond to which addresses, and stuff that new IDT into our executable as additional data. With a few minor adjustments to the exe headers, I now have an unpacked exe that llvm-objdump can understand, and which Ghidra renders complete with system function calls like:

00428c37 6a 05         PUSH     0x5
00428c39 50            PUSH     EAX
00428c3a ff 15 b0      CALL     dword ptr [->USER32.DLL::ShowWindow]
         20 43 00
00428c40 ff 76 10      PUSH     dword ptr [ESI + 0x10]
00428c43 ff 15 ac      CALL     dword ptr [->USER32.DLL::UpdateWindow]
         20 43 00

Rust trait object layout

2025-03-11 08:00:00

Rust makes a distinction between values and references that you generally learn to work with while using the language. Despite having used the language for quite a long time, this week I learned an interesting new corner of how that distinction applies to trait objects. Maybe it will surprise you too!

Background: dynamically sized types

When you have some x: usize you can say x is the usize itself, while some y: &usize = &x is a reference to that value. y's concrete value is a pointer.

Similarly, the type [u8; 40] means 40 bytes itself. If you put it in a struct or pass it to a function, you're moving around 40 bytes.

Finally, the type [u8] means a series of bytes of (compile-time) unknown size. You don't interact with these often because you can't usually put these in a struct or pass them to a function because of their unknown size. Instead you typically work with references to these as &[u8], which concretely are a pointer and a length. (cheats.rs has nice pictures of this.)
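You can see the pointer-plus-length pair reflected in the size of the reference itself (sizes here assume a 64-bit target):

use std::mem::size_of;

fn main() {
    assert_eq!(size_of::<&usize>(), 8); // one bare pointer
    assert_eq!(size_of::<&[u8]>(), 16); // pointer + length
}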

But you still do sometimes see [u8] as a type without a reference, in types like Box<[u8]>. And further, you can wrap a dynamically sized type in a struct as the last member, making that struct dynamically sized as well:

struct Test<X: ?Sized> {
    inner: X,
}

// the type Test<[u8]> is now also dynamically sized

Background: trait objects

The type File represents an open file. Concretely, it's a struct with some private fields.

The trait Read represents things that can be read from, with a .read() method. The trait implies a trait object type dyn Read, which is the type of things that implement the trait. File implements Read, so I will use File and Read for the following examples.

Concretely, the layout of a trait object dyn Read is the same as the underlying value it's wrapping, e.g. a File (spec ref). (This is possibly only useful to know as compiler trivia; even cheats.rs doesn't document this!) Much like [u8], because their concrete size is not known at compile time, you don't typically interact with these directly but instead via references.

The type &dyn Read is a reference to a trait object. Concretely it's a pointer to the object and a pointer to a static vtable of the methods for that type that implement the trait. (More pictures from cheats.rs.) Also like [u8], you might more often use Box<dyn Read>, which holds ownership over the underlying Read-able type.

(It was useful for my understanding to contrast these with C++ objects and vtables. In C++, the vtable pointer is always embedded in the struct itself. In Rust, the struct never contains a vtable pointer. Instead the reference to the trait object is two pointers, to the value and the vtable.)
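You can check this with size_of; Greet and Point below are made-up stand-ins, and the sizes again assume a 64-bit target:

use std::mem::size_of;

trait Greet {
    fn hi(&self);
}

struct Point {
    x: i32,
    y: i32,
}

impl Greet for Point {
    fn hi(&self) {
        println!("({}, {})", self.x, self.y);
    }
}

fn main() {
    // Implementing a trait doesn't grow the struct: no embedded
    // vtable pointer, unlike C++.
    assert_eq!(size_of::<Point>(), 8);
    assert_eq!(size_of::<&Point>(), 8);
    // The vtable pointer rides along in the reference instead.
    assert_eq!(size_of::<&dyn Greet>(), 16);
}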

Background: coercion

Though it's relatively rare in Rust, there are a few places where one type will silently convert to another. One you may have used without thinking is using a &mut T in a place that needs a &T.

When using trait objects there is another coercion:

let f: File = ...;
let r: &dyn Read = &f;

Here, &f is &File, but the compiler converts it to &dyn Read.

Finally, the surprise

Did you know that there is another trait-related coercion involving generics? Consider:

let f: File = ...;
let b: BufReader<File> = BufReader::new(f);
let r: &BufReader<dyn Read> = &b;  // !!! legal

Here, the File inside the BufReader<> was able to coerce to a trait object. Concretely, r here is like a reference to a trait object, in that it is a pair of a pointer to the BufReader along with a pointer to a Read vtable. (Poking at the compiler, it uses the same Read vtable as you would get from a plain File.)

The underlying spec reference for why this is allowed, [coerce.unsized.composite], is pretty involved! But at a hand-wavy level it's allowed because BufReader<dyn Read> is a dynamically sized type where the last field is the place where the dyn Read is used. Note that, for example, you cannot have a similar coercion to two traits &SomeType<dyn A, dyn B>, because the reference can only carry a single vtable.

(How is this useful? It came up in some code I was reviewing from someone new to Rust; I'm not sure.)

(Bonus question: The above doesn't compile if you substitute Box or Rc for BufReader. Why not? Something about the CoerceUnsized impls on those? I don't know the answer.)

Unsized coercion

This does compile though:

let f: File = ...;
let b: Box<File> = Box::new(f);
let r: Box<dyn Read> = b;  // consumes b

This is due to some magic traits implemented by Box (and also Rc, etc.).
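For the curious, the wiring looks roughly like the example in the CoerceUnsized documentation. It's nightly-only machinery; stable code never writes these impls, it only benefits from the ones on Box, Rc, and friends:

#![feature(coerce_unsized, unsize)]
use std::marker::Unsize;
use std::ops::CoerceUnsized;

struct MyBox<T: ?Sized>(*const T);

// With this impl, MyBox<File> coerces to MyBox<dyn Read>, analogous
// to the impl the standard library provides for Box.
impl<T: ?Sized + Unsize<U>, U: ?Sized> CoerceUnsized<MyBox<U>> for MyBox<T> {}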

I believe this behavior is the ultimate reason for these corners of support in the compiler. You want to sometimes be able to implicitly convert between a struct and a trait object, and you also sometimes want to be able to wrap things of unknown size (in e.g. Box or Rc), and then you want those two features to combine.

PS: Playground link, if you'd like to poke at it yourself.

Medium data and small data

2025-03-04 08:00:00

Two related threads, both views on limits to the useful size of data.

Medium data

I have recently been working with WebAssembly and Win32, which both use 32-bit pointers and are limited to 4gb of memory. Meanwhile, modern computers usually use 64-bit pointers. 64-bit pointers let you address more than 4gb of memory. (They also have other important uses, including pointer tricks around memory mapping like ASLR.)

But these larger pointers also cost memory. Fancy VMs like v8 employ "pointer compression" to try to reduce the memory cost. As always, you end up trading off CPU and memory. Is Memory64 actually worth using? talks about this tradeoff in the context of WebAssembly's relatively recent extension to support 64-bit pointers. The costs of the i386 to x86-64 upgrade dives into this in the context of x86, where it's difficult to disentangle the separate instruction set changes that accompanied x86-64 from the increased pointer size.
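The core idea of pointer compression, sketched very roughly (v8's actual scheme is more sophisticated): keep objects in one region and store 32-bit offsets from its base instead of full 64-bit pointers.

/// A toy compressed pointer: a 32-bit offset into an arena, costing
/// half the memory of a full 64-bit pointer.
#[derive(Clone, Copy)]
struct CompressedPtr(u32);

struct Arena {
    buf: Vec<u8>,
}

impl Arena {
    /// "Decompression" adds the offset back to the base; that extra
    /// arithmetic is the CPU cost traded for the memory savings.
    fn deref(&self, p: CompressedPtr) -> &u8 {
        &self.buf[p.0 as usize]
    }
}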

I provide these links as background for an interesting observation I heard about why a 4gb limit turns out to be "big enough" much of the time: to the extent a program needs to work with data sets larger than 4gb, the data is large enough that you end up working with it differently anyway.

In other words, the simple small data that programs typically work with, like command-line flags or configuration files, comfortably fits within a 4gb limit. And in contrast, the kind of bigger data that crosses the 4gb limit fundamentally will have different access patterns — typically smaller views onto the larger space — because it is too large to traverse quickly.

Imagine a program that generates 4gb of data or loads it from disk or network. Even if the program grinds through a hundred megabytes per second, it still takes over 30 seconds to work through 4gb. This is enough time to probably require a different user interface such as a progress bar. Such a program likely will work with the data in a streaming fashion, paging in/out smaller blocks of the data via some file API that manages blocks of the >4gb data.

A standard example application that will make use of a ton of memory is a database. But a database is expressly designed around managing the larger size of the data, using indexing data structures so that a given query knows exactly which subsets of the larger data to access. Otherwise, large queries like table scans work with the data in a streaming fashion as in the previous paragraph.

Another canonical "needs a lot of memory" application is an image editor, which operates on a lot of pixels. I worked on one of those! To make the software grind through pixels fast, you go to some lengths to avoid individually traversing all your pixels. To get the pixels onto the screen quickly, you instead load data piecewise into the GPU and let the GPU handle the rest. Or you write specialized "shader" GPU code. Both are again specialized APIs that work with the larger data indirectly.

How about audio and video? These involve large amounts of data, but also have a time dimension, where most operations only work with a portion of the data near a particular timestamp.

In all, a 32-bit address space for the program's normal data structures coupled with some alternative mechanism for poking at the larger data indirectly has surprisingly ended up working out almost as well as a larger flat address space.

A lack of imagination

It's interesting to contrast this observation with a joke from the DOS era: "640kb ought to be enough for anybody". 640kb was a memory limit back then, and the phrase was thrown around ironically to remark that reasonable programs actually need more. According to Gates (to whom the quote was apocryphally, falsely, attributed), 640kb was already understood at the time to be a painful limit.

In contrast, we've been easily fitting most programs in 4gb for the last 30 years — from the 1990s through today, where browsers still limit web pages to 4gb.

Why has this arbitrary threshold held up? You could argue it's a lack of imagination: maybe we have sized our programs to the limits we had? But 64-bit has been around for quite a while, long enough that there ought to be better examples of different kinds of programs that truly make use of it.

One answer is to observe that many of the limits above are related to human limits. Humans won't wait 30s for a program to load its data; humans can't listen to 4 minutes of audio simultaneously; humans don't write >4gb of configuration files... or to put it together, humans don't consume 4gb of data in one gulp.

And that observation links to an interesting related phenomenon, which I sometimes call:

Small data

Data visualization is the problem of presenting data in a form where your brain can ingest it. What if you have a lot of data? We use the tools of statistics, to summarize collections of data into smaller representations that capture some of its essence.

Imagine building a stock chart, a simple line chart of a value over time. Even though you have data for every second of the day, when charting the data over a larger timeline, it's not useful to draw all of that data. Instead we will summarize, with either an average price per time window, or something that attempts to reduce each time window's data to a few numbers like a violin plot or OHLC.
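That reduction is simple to express; here's a sketch with made-up types, using fixed-size windows:

/// Open/high/low/close summary of one time window.
struct Ohlc {
    open: f64,
    high: f64,
    low: f64,
    close: f64,
}

/// Reduce per-second prices to one OHLC record per `window` samples
/// (`window` must be nonzero), so a chart draws a handful of marks
/// instead of millions of points.
fn downsample(prices: &[f64], window: usize) -> Vec<Ohlc> {
    prices
        .chunks(window) // chunks() never yields an empty slice
        .map(|w| Ohlc {
            open: w[0],
            high: w.iter().cloned().fold(f64::MIN, f64::max),
            low: w.iter().cloned().fold(f64::MAX, f64::min),
            close: w[w.len() - 1],
        })
        .collect()
}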

Why do we do this? Visually, there's only so much you can usefully see. Even with a higher resolution display, shoving more detail into smaller pixels does not convey usefully more information. In fact, it's often a better chart when it has fewer visual marks that still tell the same story. Much like the 4gb limit, the bandwidth of information into your brain sets an upper bound on the amount of useful data that can go into a chart.

From this you can derive an interesting sort of principle about data visualization software: your data visualization does not need to be especially fast at supporting a lot of marks, because you won't ever need to display many of them. For example, libraries like d3 are fine to do relatively low-performance DOM traversal for rendering.

This is kind of a subtle point, so let me be clear: it's of course useful and important to be able to work with large data sets, and data visualization software often will provide critical API for processing them. The point is that once you are at the level of putting actual objects on the screen, you should have already reduced the data mathematically to a relatively easy small set of data, so you don't really need speedy handling of screen objects.

Retorts

But wait, you say, this is a dumb argument — it really is more convenient to just have 64-bit pointers everywhere and not worry about any limits. And wouldn't a charting API where you can hand a million points to a scatter plot be more useful than one where you can't?

I think both of those are preferable — in the absence of performance constraints. I prefer those in the same way I'd prefer, in a hypothetical future where performance is completely free, making a database with no need for indexes and running big AI matmuls without specialized GPU code or hardware. But at least in today's software, even 64-bit pointers are costly enough that implementers go to great lengths to reduce their cost.

So don't take this post as advocating that these lower limits are somehow just or correct. Rather, I aimed only to observe why it is that the memory limits from the 90s, or the visualization limits from the pen and paper days, have not been as constraining as you might first expect.