Analyze Python Binaries
Introduction #
This article is a list of methods that help to reverse engineer and understand compiled Python binaries.
Background #
I also have another article where I described how to compile a Python binary manually using cython and g++. Now it is time to reverse engineer such binaries and learn about their behaviour.
I was a CTF challenge setter for 2 CTFs, and I gave the players 1 + 3 “PyCompiled” binaries to solve. These challenges got far fewer solves than the others, and I suspect the players did not know the internals of a PyCompiled binary. That was nearly a year ago.
Recently, I stumbled upon a Python binary that was not compiled with “my” method but with Nuitka, whose method and output are very similar and [may] offer more features than mine.
In this article, I will use the word “PyCompiled” binary to refer to a binary that is compiled using my method.
Analysis #
One of the best ways to analyze a “PyCompiled” binary is to analyze the behaviour of a cythonized binary.
In this article, we will use the following Makefile to build the binary whose behaviour we analyze or deduce.
Again, I won’t rely much on the generated source file a.c to make my predictions.
INCLUDE = -I /usr/include/python3.11
LIBDIRS = -L /usr/lib/python3.11/config-3.11-x86_64-linux-gnu
LIBS = -lm -lpython3.11 -g # Let's work with debug (for now)
CFLAGS = -O2 $(INCLUDE) $(LIBS) $(LIBDIRS)
CYTHONFLAGS = -3 --embed
SOURCE = main.py
DESTINATION = PyCompileRE
TMPFILE = a.c
CC = gcc
CYTHON = cython
build $(DESTINATION): $(SOURCE)
	@echo "Transpiling source ..."
	@$(CYTHON) $(CYTHONFLAGS) $(SOURCE) -o $(TMPFILE)
	@echo "Compiling $(TMPFILE) ..."
	@$(CC) $(INCLUDE) $(TMPFILE) $(LIBS) -o $(DESTINATION)
	@rm $(TMPFILE)
If we want a “PyCompiled” binary of the following file:
# main.py
print("Hello, World")
We can just run:
$ pip install cython # get cython
$ sudo apt-get install build-essential python3-dev # for Python.h
$ make # to build the file
Note that we will use Linux throughout this tutorial, but most of it also applies to Windows.
Note that the -g flag embeds the debug information into the .debug sections, using the DWARF format. Most people would say this is because of Linux, but I would argue that it is due to the nature of the compiler: on Windows, one can also get a debuggable binary by using gcc, although we should then stick to gdb to debug it. That won’t matter much here anyway.
Environment Variables #
The Python (CPython) interpreter responds to a lot of environment variables.
PYTHONVERBOSE is one variable that our binary definitely seems to respond to.
$ PYTHONVERBOSE=1 ./PyCompileRE
import _frozen_importlib # frozen
import _imp # builtin
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
import '_io' # <class '_frozen_importlib.BuiltinImporter'>
import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
...
Even when we don’t have the source code, we may still be able to get this verbose output. It is useful for detecting whether the binary attempts to load any other (suspicious) modules.
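The verbose output can be reduced to a plain list of module names. The following is a minimal sketch, assuming the binary honours PYTHONVERBOSE as shown above and writes the `import ...` lines to stderr. The helper names `parse_verbose_imports` and `list_imports` are hypothetical and are not part of any library.

```python
import os
import re
import subprocess

# Matches lines such as:
#   import _imp # builtin
#   import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
IMPORT_LINE = re.compile(r"^import '?([\w.]+)'? # (.*)$")


def parse_verbose_imports(text):
    """Return (module, origin) pairs found in PYTHONVERBOSE output."""
    found = []
    for line in text.splitlines():
        match = IMPORT_LINE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def list_imports(binary, timeout=10):
    """Run `binary` with PYTHONVERBOSE=1 and parse what it imports.

    Assumption: the program exits on its own within `timeout` seconds
    and does not need any input on stdin.
    """
    env = dict(os.environ, PYTHONVERBOSE="1")
    result = subprocess.run([binary], env=env, stdin=subprocess.DEVNULL,
                            capture_output=True, text=True, timeout=timeout)
    return parse_verbose_imports(result.stderr)
```

Calling `list_imports("./PyCompileRE")` would then give a list that can be compared against a set of expected modules, so that anything unexpected stands out.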
For example, if we have:
import numpy as np
input('Wait!')
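We can also use the memory map of the program, since most of numpy is compiled: its extension modules are shared objects, so they show up as file-backed mappings. This would not be possible if the binary loads an uncompiled (pure Python) module, because such a module is not mapped as a shared object.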
$ cat /proc/`pidof PyCompileRE`/maps
7f2d3708f000-7f2d37096000 r--p 00000000 00:19 984232 /.../_generator.so
...
7f2d37159000-7f2d3715b000 rw-p 00000000 00:00 0
7f2d3715b000-7f2d3715e000 r--p 00000000 00:19 984241 /.../_sfc64.cpython-311-x86_64-linux-gnu.so
...
7f2d3716a000-7f2d3716d000 r--p 00000000 00:19 984236 /.../_pcg64.cpython-311-x86_64-linux-gnu.so
...
7f2d37183000-7f2d37186000 r--p 00000000 00:19 984238 /.../_philox.cpython-311-x86_64-linux-gnu.so
...
7f2d37198000-7f2d3719b000 r--p 00000000 00:19 984234 /.../_mt19937.cpython-311-x86_64-linux-gnu.so
...
7f2d371af000-7f2d371b3000 r--p 00000000 00:19 984211 /.../_bounded_integers.cpython-311-x86_64-linux-gnu.so
...
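The same information can be pulled out programmatically. This is a rough sketch, assuming the usual Linux layout of a maps line (address range, permissions, offset, device, inode, optional path). The function names `mapped_files`, `extension_modules` and `read_maps` are hypothetical helpers written for this article.

```python
import re

# One line of /proc/<pid>/maps looks like:
#   address-range perms offset dev inode [pathname]
MAPS_LINE = re.compile(
    r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$"
)


def mapped_files(maps_text):
    """Return the sorted, de-duplicated file paths found in a maps dump."""
    paths = set()
    for line in maps_text.splitlines():
        match = MAPS_LINE.match(line.strip())
        # Anonymous mappings and pseudo-paths like [stack] are skipped.
        if match and match.group(1).startswith("/"):
            paths.add(match.group(1))
    return sorted(paths)


def extension_modules(maps_text):
    """Keep only shared objects, e.g. compiled extension modules."""
    return [p for p in mapped_files(maps_text)
            if ".so" in p.rsplit("/", 1)[-1]]


def read_maps(pid):
    """Read the memory map of a running process (Linux only)."""
    with open(f"/proc/{pid}/maps") as handle:
        return handle.read()
```

With the pid of the running binary, `extension_modules(read_maps(pid))` lists every shared object the process has loaded, which includes compiled modules such as the numpy ones above.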
Also, if a program keeps running, we can hit Ctrl+C and obtain a neat stack trace, which often contains some vital information. So, I made the following script for Linux:
import logging
import re
import signal
import subprocess
import time
from pprint import pprint

import numpy as np

logging.basicConfig(level=logging.INFO)

code = {}  # maps a line number to the source text shown in the traceback


def run_process(command, delay):
    logging.debug('Starting process %s ...', command)
    # No shell=True: the signal must reach the binary itself, not a shell.
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, text=True)
    stderr = ''  # defined up front so the except branch can always log it
    try:
        time.sleep(delay)
        process.send_signal(signal.SIGINT)  # same effect as hitting Ctrl+C
        _, stderr = process.communicate(timeout=5)
        # Line 0 is the "Traceback ..." header; lines 1-2 are the first frame.
        line, info = stderr.split('\n')[1:3]
        info = info.strip()
        line = re.search(r'File ".*", line (\d+)', line).group(1)
        code[line] = info
        logging.debug('Done with %s ...', command)
        logging.info('Grabbed: %s:%s', line, info)
    except Exception as err:
        logging.error('Failed to grab code [%s]: %s', err, stderr)
    finally:
        if process.poll() is None:
            process.terminate()


if __name__ == "__main__":
    command_to_run = ['./PyCompileRE']
    # Interrupt the program after 0.0, 0.1, ..., 0.9 seconds.
    for i in np.arange(0, 1, .1):
        run_process(command_to_run, i)
    pprint(code)
This script extracts parts of the code, one interrupted line per run. But if the developer intentionally added stray newlines, the traceback would give out only blank lines and the approach would not be fruitful.
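The collected fragments can be stitched into a partial listing. The sketch below assumes a dictionary shaped like the `code` dict of the script above (line number mapped to source text). The name `render_listing` and the `<not captured>` placeholder are my own choices for illustration.

```python
def render_listing(code):
    """Build a partial source listing from {line_number: source_text}.

    Keys may be strings (as captured from the traceback) or ints.
    Lines that were never observed are shown as a placeholder.
    """
    if not code:
        return ""
    known = {int(number): text for number, text in code.items()}
    last = max(known)
    width = len(str(last))  # pad line numbers to a common width
    rows = []
    for number in range(1, last + 1):
        text = known.get(number, "<not captured>")
        rows.append(f"{number:>{width}} | {text}")
    return "\n".join(rows)
```

Printing `render_listing(code)` after the loop makes the gaps visible, which helps to decide which delays are worth retrying.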
I am trying to explore libpython3.XX binaries with debug symbols to solve this problem.
More content (research) coming soon!