
LLM4Decompile: Decompiling Binary Code with Large Language Models

(2403.05286)
Published Mar 8, 2024 in cs.PL and cs.CL

Abstract

Decompilation aims to restore compiled code to human-readable source code, but it struggles to recover details such as variable names and program structure. LLMs show promise for programming tasks, motivating their application to decompilation. However, no open-source LLM for decompilation has been released. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs, ranging from 1B to 33B parameters, pre-trained on 4 billion tokens of C source code and the corresponding assembly code. These open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers the re-compilability and re-executability of decompiled code. The benchmark emphasizes evaluating decompilation models from the perspective of program semantics. Experiments show that LLM4Decompile accurately decompiles 21% of the assembly code, a 50% improvement over GPT-4. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile

Figure: pipeline for evaluating the decompilation process.

Overview

  • LLM4Decompile introduces an open-source Large Language Model (LLM) pre-trained on C source code and assembly instructions, aimed at improving decompilation processes.

  • Decompile-Eval, a novel benchmark for evaluating decompiled code based on re-compilability and re-executability, is also introduced, focusing on the practical aspects of the decompilation output.

  • The LLM4Decompile models, ranging from 1B to 33B parameters, significantly outperform existing tools, demonstrating improved understanding of both the syntax and the semantics of code.

  • The research advances the application of LLMs in decompilation and reverse engineering, suggesting future extensions to other programming languages and more complex decompilation tasks.

Decompiling Binary Code with LLMs: Introducing LLM4Decompile

Introduction to Decompilation and LLMs

Decompilation, the process of translating binary or bytecode back into human-readable source code, poses significant challenges, particularly in preserving details like variable names and structural elements such as loops. Meanwhile, advances in LLMs for programming tasks suggest their potential utility in decompilation. As a pioneering effort, we present LLM4Decompile, the first open-source LLM specifically designed for decompilation, pre-trained on a substantial dataset of C source code and corresponding assembly instructions. Additionally, we introduce Decompile-Eval, a novel benchmark that evaluates decompiled code on re-compilability and re-executability, crucial indicators of successful decompilation that were previously overlooked.
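To make the information loss concrete, the following minimal sketch compiles a small C function to assembly (assuming GCC is installed and on PATH): the emitted output keeps the function's symbol name but discards local variable names and lowers the for-loop into labels and jumps, which is exactly the detail a decompiler must try to reconstruct.

    # Compile a small C function to assembly with GCC to see what
    # information (names, types, structure) survives compilation.
    # Assumes `gcc` is available on PATH.
    import os
    import subprocess
    import tempfile

    C_SOURCE = """
    int sum_to_n(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++)
            total += i;
        return total;
    }
    """

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "func.c")
        asm = os.path.join(tmp, "func.s")
        with open(src, "w") as f:
            f.write(C_SOURCE)
        # -S stops after compilation and emits assembly instead of an object file.
        subprocess.run(["gcc", "-S", "-O0", src, "-o", asm], check=True)
        with open(asm) as f:
            print(f.read())  # `total` and `i` are gone; the loop is now jumps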

Key Challenges in Decompilation

Traditional decompilation tools often struggle to generate code that matches the original source in readability and structure. This is due to the inherent difficulty of reversing compilation, which discards information such as variable names and high-level control flow. Despite some success with Transformer-based models in addressing these issues, their limited size and lack of public availability have constrained their effectiveness and broader adoption. Furthermore, the absence of a standard benchmark for evaluating decompilation has impeded coherent progress in the field.

Introducing LLM4Decompile and Decompile-Eval

To remedy these limitations, we release LLM4Decompile, a suite of pre-trained LLMs ranging from 1B to 33B parameters and tailored for decompilation. The models are trained on 4 billion tokens of C source code paired with the corresponding assembly compiled at various optimization levels. Alongside the models, we propose Decompile-Eval, the first benchmark focused on the re-compilability and re-executability of decompiled code, pioneering a more relevant evaluation framework for decompilation that prioritizes program semantics.
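The released pipeline lives in the repository; as an illustrative sketch of how such pairs might be built (the file layout and helper name below are assumptions, not the paper's exact code), each C sample can be compiled at several optimization levels and paired with its original source:

    # Hypothetical sketch of building (assembly, source) training pairs:
    # compile each C sample at -O0 through -O3 and pair the emitted
    # assembly with the original source. Assumes `gcc` is on PATH.
    import os
    import subprocess
    import tempfile

    OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

    def make_pairs(c_source: str) -> list[tuple[str, str]]:
        """Return one (assembly, source) pair per optimization level."""
        pairs = []
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "sample.c")
            with open(src, "w") as f:
                f.write(c_source)
            for opt in OPT_LEVELS:
                asm = os.path.join(tmp, f"sample{opt}.s")
                subprocess.run(["gcc", "-S", opt, src, "-o", asm], check=True)
                with open(asm) as f:
                    pairs.append((f.read(), c_source))
        return pairs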

Evaluation and Results

Our models demonstrate a significant improvement over existing decompilation approaches: the 6B LLM4Decompile achieves 87% re-compilability and 21% re-executability on Decompile-Eval. These figures indicate that the model captures both the syntax and the semantics of the code, significantly surpassing GPT-4 on both measures.
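The two metrics are natural to operationalize: a prediction is re-compilable if the compiler accepts it, and re-executable if the compiled binary passes the original test assertions. A minimal sketch follows, assuming GCC is available and that each benchmark case supplies a test harness whose process exits non-zero on a failed assertion; the function names are illustrative:

    import os
    import subprocess
    import tempfile

    def recompiles(decompiled_c: str) -> bool:
        """True if the decompiled code compiles to an object file."""
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "pred.c")
            with open(src, "w") as f:
                f.write(decompiled_c)
            result = subprocess.run(["gcc", "-c", src, "-o", os.devnull],
                                    capture_output=True)
            return result.returncode == 0

    def reexecutes(decompiled_c: str, test_main: str) -> bool:
        """True if the code links against the test harness and the binary
        exits 0, i.e. every assertion in the original tests passes."""
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "prog.c")
            exe = os.path.join(tmp, "prog")
            with open(src, "w") as f:
                f.write(decompiled_c + "\n" + test_main)
            if subprocess.run(["gcc", src, "-o", exe],
                              capture_output=True).returncode != 0:
                return False
            try:
                run = subprocess.run([exe], capture_output=True, timeout=10)
            except subprocess.TimeoutExpired:
                return False
            return run.returncode == 0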

Methodology

We compile C code into assembly using GCC at several optimization levels and fine-tune the DeepSeek-Coder model on the resulting assembly-source pairs. Our evaluation on Decompile-Eval assesses both the syntactic integrity and the semantic accuracy of the decompiled code. Our experiments show that direct sequence-to-sequence prediction, translating assembly straight into source code, enhances the model's decompilation capabilities more than alternative training strategies.
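A minimal sketch of that sequence-to-sequence objective using the Hugging Face transformers API is below; the checkpoint name and prompt wording are illustrative assumptions, not the paper's exact format. The key detail is masking the assembly prompt so the loss is computed only on the C source tokens:

    # Sketch of the fine-tuning objective: assembly in, C source out,
    # with the loss restricted to the target (source) tokens.
    # Checkpoint and prompt text are assumptions for illustration.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    CKPT = "deepseek-ai/deepseek-coder-1.3b-base"  # assumed base model
    tokenizer = AutoTokenizer.from_pretrained(CKPT)
    model = AutoModelForCausalLM.from_pretrained(CKPT)

    def training_loss(assembly: str, source: str) -> torch.Tensor:
        prompt = f"# Decompile this assembly to C:\n{assembly}\n# C source:\n"
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        target_ids = tokenizer(source, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
        return model(input_ids=input_ids, labels=labels).loss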

Theoretical and Practical Implications

Our research establishes a foundation for the application of LLMs in decompilation, significantly advancing the state-of-the-art. The introduction of Decompile-Eval as a benchmark directs future research towards more accurately assessing the practical utility of decompiled code. On a broader level, this work illuminates the path for applying large-scale LLMs to complex reverse-engineering tasks, potentially transforming practices in software maintenance, security analysis, and intellectual property evaluation.

Future Directions

The current scope is limited to C language and x86 architecture, focusing on decompiling single functions without considering external dependencies and cross-references. Future work could extend LLM4Decompile's methodology to other programming languages and architectural platforms, and address the complexities of decompiling entire software applications. This would encompass developing models that can accurately interpret and reconstruct the high-level constructs of complex software systems.

Conclusion

LLM4Decompile represents the forefront of leveraging LLMs for decompilation, addressing both the syntactic and semantic challenges inherent to the task. The novel benchmark Decompile-Eval sets a new standard for evaluating decompilation tools, focusing on the practical usability of decompiled code. This work not only advances decompilation capability but also opens new avenues for future research in applying LLMs to reverse engineering and code analysis tasks.
