
Learn&Fuzz: Machine Learning for Input Fuzzing (1701.07232v1)

Published 25 Jan 2017 in cs.AI, cs.CR, cs.LG, cs.PL, and cs.SE

Abstract: Fuzzing consists of repeatedly testing an application with modified, or fuzzed, inputs with the goal of finding security vulnerabilities in input-parsing code. In this paper, we show how to automate the generation of an input grammar suitable for input fuzzing using sample inputs and neural-network-based statistical machine-learning techniques. We present a detailed case study with a complex input format, namely PDF, and a large complex security-critical parser for this format, namely, the PDF parser embedded in Microsoft's new Edge browser. We discuss (and measure) the tension between conflicting learning and fuzzing goals: learning wants to capture the structure of well-formed inputs, while fuzzing wants to break that structure in order to cover unexpected code paths and find bugs. We also present a new algorithm for this learn&fuzz challenge which uses a learnt input probability distribution to intelligently guide where to fuzz inputs.

Citations (357)

Summary

  • The paper introduces a novel machine learning approach to synthesize input grammars, reducing manual efforts in fuzzing.
  • It employs RNN-based sequence-to-sequence models on PDF data to capture detailed structures while probing unexpected code paths.
  • Experimental results show a trade-off between modes, where 'Sample' achieves higher code coverage and 'SampleSpace' yields improved pass rates.

An Analytical Overview of "Learn&Fuzz: Machine Learning for Input Fuzzing"

Introduction to Input Fuzzing

Fuzzing has emerged as a quintessential process in identifying security vulnerabilities within software, primarily within input-parsing modules. The paper introduces a machine learning approach aimed at automating the generation of input grammars for grammar-based fuzzing — a domain where, traditionally, manually written grammars have dictated the efficacy of the fuzzing process.

Methodological Framework

The paper focuses on synthesizing input grammars from sample inputs using neural-network-based machine learning techniques, principally recurrent neural networks (RNNs). The authors explore this within the scope of the Portable Document Format (PDF), utilizing the PDF parser integrated into Microsoft's Edge browser as a case study.
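The model is trained on the character sequences of individual PDF data objects rather than whole files. As a rough illustration of that preprocessing step, the sketch below splits a PDF body into its `obj ... endobj` blocks; the regex-based splitter and the `extract_pdf_objects` helper are simplified stand-ins, not the paper's actual pipeline.

```python
import re

def extract_pdf_objects(raw: bytes):
    """Split a PDF body into its data objects (a simplified sketch).

    The paper trains its seq2seq model on the character sequences of
    PDF data objects; here a regex captures the text between each
    'N M obj' header and its 'endobj' terminator.
    """
    text = raw.decode("latin-1")  # PDF bodies are byte-oriented
    pattern = re.compile(r"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
    return [m.group(1).strip() for m in pattern.finditer(text)]

sample = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Count 0 /Kids [] >>
endobj"""

objs = extract_pdf_objects(sample)
print(len(objs))  # 2
print(objs[0])    # << /Type /Catalog /Pages 2 0 R >>
```

Each extracted object then becomes one training sequence for the character-level model.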

Learning and Fuzzing Objectives

A paramount tension identified in the paper is between the objectives of learning, which must capture the intricate structure of well-formed inputs, and fuzzing, which seeks to explore unexpected code paths by breaking that structure. The paper reconciles these objectives with its "learn&fuzz" algorithm, which leverages a learnt input probability distribution to intelligently guide where inputs are fuzzed.
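The guiding idea can be sketched as follows: sample each character from the learnt distribution, but occasionally, at positions where the model is most confident, emit an unlikely character instead, so that the structure is broken at well-chosen spots. This is a hedged simplification of the paper's approach; the `toy_dist` function, the `t_fuzz`/`p_fuzz` parameters, and the newline end-marker are all illustrative assumptions standing in for the trained RNN and the paper's actual algorithm details.

```python
import random

def sample_fuzz(next_char_dist, max_len=64, t_fuzz=0.9, p_fuzz=0.1):
    """Distribution-guided fuzzing sketch.

    `next_char_dist(prefix)` stands in for the RNN: it returns a dict
    mapping candidate next characters to probabilities. Where the
    model's top probability exceeds `t_fuzz`, we occasionally (with
    probability `p_fuzz`) emit the *least* likely character to break
    the learnt structure at a high-confidence position.
    """
    out = []
    while len(out) < max_len:
        dist = next_char_dist("".join(out))
        chars, probs = zip(*dist.items())
        if max(probs) > t_fuzz and random.random() < p_fuzz:
            c = min(dist, key=dist.get)   # fuzz: least likely char
        else:
            c = random.choices(chars, weights=probs)[0]
        out.append(c)
        if c == "\n":                     # toy end-of-object marker
            break
    return "".join(out)

# Toy stand-in distribution: strongly prefers 'a', ends after 5 chars.
def toy_dist(prefix):
    if len(prefix) >= 5:
        return {"\n": 1.0}
    return {"a": 0.95, "b": 0.04, "z": 0.01}

random.seed(0)
s = sample_fuzz(toy_dist)
```

The intuition is that high-confidence positions correspond to structural "anchors" of the format, so corrupting them exercises error-handling paths that random mutation rarely reaches.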

Experimental Design and Outcomes

The empirical component of the paper employs sequence-to-sequence RNN models to statistically learn the structure of PDF data objects. By implementing both Sample and SampleSpace generation modes, the research evaluates various strategies for generating new PDF inputs. These strategies aim to maximize parser-code coverage while maintaining robustness in testing unexpected error-handling pathways.

Critically, the experiments show that the Sample mode trained for 40 epochs yields superior code coverage, albeit slightly lower pass rates. The SampleSpace mode, while achieving higher pass rates, does not reach the same extent of code exploration. This highlights a fundamental trade-off between generating well-formed inputs and deviating from that structure to probe potential vulnerabilities more thoroughly.
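The contrast between the two modes can be sketched roughly as follows: Sample draws every character from the learnt distribution (more diversity, more malformed outputs), while SampleSpace samples only at whitespace boundaries and otherwise takes the most likely character (better-formed but less diverse outputs). The `toy_rnn` function and the newline end-marker below are illustrative assumptions in place of the trained model; the real modes operate on the RNN's per-character output distribution.

```python
import random

def generate(next_char_dist, mode="Sample", max_len=64):
    """Sketch of the two generation strategies described in the paper.

    'Sample' samples every character from the learnt distribution;
    'SampleSpace' samples only after whitespace and otherwise picks
    the argmax, keeping individual tokens well-formed.
    """
    out = []
    while len(out) < max_len:
        dist = next_char_dist("".join(out))
        if mode == "Sample" or (out and out[-1].isspace()):
            chars, probs = zip(*dist.items())
            c = random.choices(chars, weights=probs)[0]
        else:
            c = max(dist, key=dist.get)   # argmax inside tokens
        out.append(c)
        if c == "\n":                     # toy end-of-object marker
            break
    return "".join(out)

# Toy stand-in for the RNN's next-character distribution.
def toy_rnn(prefix):
    if len(prefix) >= 7:
        return {"\n": 1.0}
    if prefix.endswith(" "):
        return {"x": 0.5, "y": 0.5}
    return {"a": 0.9, " ": 0.1}

s_space = generate(toy_rnn, mode="SampleSpace")
s_full = generate(toy_rnn, mode="Sample")
```

With the toy distribution, SampleSpace deterministically follows the highest-probability path, mirroring how it produces more well-formed (higher pass-rate) but less diverse outputs than Sample.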

Implications and Future Directions

From a practical standpoint, the "learn&fuzz" framework, demonstrated through the complex input format of PDF objects, provides a compelling case for the application of machine learning models in security testing. This methodology could streamline the notoriously manual process of grammar specification, making fuzzing a more automated and thus scalable process.

Theoretically, this paper opens up interesting avenues for extending neural-network-based learning to more hierarchical input formats and suggests potential synergies with logic-based inferential algorithms. The authors call attention to reinforcement learning as a prospective enhancement to guide and optimize learning models with real-time coverage feedback loops, potentially driving innovations in automated security testing regimes.

Conclusion

The paper lays a foundational framework blending machine learning for generating and fuzzing input grammars, offering a statistically automated approach to identify security vulnerabilities. Future research may harness these insights to develop more sophisticated models for other complex input formats, thereby widening the scope and impact of automated fuzzing technologies in the domain of cybersecurity.