Constrained Decoding for Fill-in-the-Middle Code Language Models via Efficient Left and Right Quotienting of Context-Sensitive Grammars (2402.17988v2)
Abstract: LLMs are powerful tools for program synthesis and advanced auto-completion, but come with no guarantee that their output code is syntactically correct. This paper contributes an incremental parser that allows early rejection of syntactically incorrect code, as well as efficient detection of complete programs for fill-in-the-middle (FIM) tasks. We extend the Earley parsing algorithm to allow for left and right quotients of context-free grammars, and develop methods to handle quotienting of several context-sensitive features present in the grammars of many common programming languages. The result of these contributions is an efficient, general, and well-grounded method for left and right quotient parsing. To validate our theoretical contributions -- and the effectiveness of certain design decisions -- we evaluate our method on the particularly difficult case of FIM completion for Python 3, with syntax-correctness constraints. Our results demonstrate that constrained generation can significantly reduce the incidence of syntax errors in recommended code.
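To make the idea of left and right quotienting concrete, the following is a minimal sketch on a toy language, not the paper's Earley-based parser. It assumes the "grammar" is only balanced parentheses, with every other character ignored. The names `ToyQuotientChecker`, `depth_profile`, and `filter_tokens` are hypothetical and chosen for illustration. Given a fixed prefix and suffix, the checker decides whether a partial middle can still be extended to a valid completion (early rejection) and whether the middle is already complete.

```python
def depth_profile(text):
    """Return (net, lowest) parenthesis depth change over text.

    Characters other than '(' and ')' are ignored (toy assumption).
    """
    net = 0
    lowest = 0
    for ch in text:
        if ch == "(":
            net += 1
        elif ch == ")":
            net -= 1
            lowest = min(lowest, net)
    return net, lowest


class ToyQuotientChecker:
    """Checks middles M such that prefix + M + suffix is balanced."""

    def __init__(self, prefix, suffix):
        p_net, p_low = depth_profile(prefix)
        s_net, s_low = depth_profile(suffix)
        self.start = p_net   # open parens left behind by the prefix
        self.need = -s_net   # depth the suffix expects on entry
        # The prefix must never close too much, and the suffix must be
        # closable when entered at depth `need`.
        self.context_ok = p_low >= 0 and self.need + s_low >= 0

    def viable(self, middle):
        """True if some extension of `middle` can still complete the program."""
        _, low = depth_profile(middle)
        return self.context_ok and self.start + low >= 0

    def complete(self, middle):
        """True if prefix + middle + suffix is balanced as it stands."""
        net, _ = depth_profile(middle)
        return self.viable(middle) and self.start + net == self.need


def filter_tokens(checker, middle, candidates):
    """Keep only candidate tokens that leave the middle viable."""
    return [tok for tok in candidates if checker.viable(middle + tok)]


checker = ToyQuotientChecker(prefix="print(max(", suffix="))")
print(checker.complete("a, b"))   # middle already fits between prefix and suffix
print(checker.viable("a)))"))     # closes more than the prefix opened
print(filter_tokens(checker, "a", [")))", ", b", "("]))
```

In a decoding loop, `filter_tokens` would play the role of the syntax constraint applied to the model's candidate tokens at each step, and `complete` would signal when generation may stop. A real implementation for Python 3 would also have to handle context-sensitive features such as indentation, which this sketch ignores.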