Constrained Decoding for Fill-in-the-Middle Code Language Models via Efficient Left and Right Quotienting of Context-Sensitive Grammars (2402.17988v2)

Published 28 Feb 2024 in cs.PL, cs.LG, and cs.SE

Abstract: LLMs are powerful tools for program synthesis and advanced auto-completion, but come with no guarantee that their output code is syntactically correct. This paper contributes an incremental parser that allows early rejection of syntactically incorrect code, as well as efficient detection of complete programs for fill-in-the-middle (FIM) tasks. We extend the Earley parsing algorithm to allow for left and right quotients of context-free grammars, and develop methods to handle quotienting of several context-sensitive features present in the grammars of many common programming languages. The result of these contributions is an efficient, general, and well-grounded method for left and right quotient parsing. To validate our theoretical contributions -- and the effectiveness of certain design decisions -- we evaluate our method on the particularly difficult case of FIM completion for Python 3, with syntax-correctness constraints. Our results demonstrate that constrained generation can significantly reduce the incidence of syntax errors in recommended code.
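The core mechanism the abstract describes, extending an Earley chart one token at a time, rejecting a candidate as soon as a state set becomes empty, and detecting when a complete program has been formed, can be sketched for a toy expression grammar. This is an illustrative reconstruction, not the paper's implementation: the paper targets full Python 3 with quotiented context-sensitive features, while the grammar and the names `is_viable_prefix` and `is_complete` below are hypothetical.

```python
# Toy grammar (not the paper's Python 3 grammar):
#   E -> E + T | T        T -> n | ( E )
GRAMMAR = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["n"], ["(", "E", ")"]],
}
START = "E"

def earley_sets(tokens):
    """Build Earley state sets incrementally.

    Each state is (head, body, dot, origin). The grammar above has no
    epsilon productions, so set i is frozen once we move past it.
    """
    sets = [set() for _ in range(len(tokens) + 1)]
    for body in GRAMMAR[START]:
        sets[0].add((START, tuple(body), 0, 0))
    for i in range(len(tokens) + 1):
        agenda = list(sets[i])
        while agenda:
            head, body, dot, origin = agenda.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in GRAMMAR:  # predict a nonterminal
                    for prod in GRAMMAR[sym]:
                        st = (sym, tuple(prod), 0, i)
                        if st not in sets[i]:
                            sets[i].add(st)
                            agenda.append(st)
                elif i < len(tokens) and tokens[i] == sym:  # scan a terminal
                    sets[i + 1].add((head, body, dot + 1, origin))
            else:  # complete: advance every state waiting on `head`
                for h2, b2, d2, o2 in list(sets[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        st = (h2, b2, d2 + 1, o2)
                        if st not in sets[i]:
                            sets[i].add(st)
                            agenda.append(st)
    return sets

def is_viable_prefix(tokens):
    # Earley's viable-prefix property: the last state set is empty
    # exactly when no continuation of the prefix can be syntactically valid,
    # which enables the early rejection the abstract describes.
    return bool(earley_sets(tokens)[len(tokens)])

def is_complete(tokens):
    # A completed start production spanning the whole input means the
    # token sequence is already a full program of the grammar.
    return any(h == START and d == len(b) and o == 0
               for (h, b, d, o) in earley_sets(tokens)[len(tokens)])
```

During constrained generation, a decoder would call `is_viable_prefix` on each candidate token extension and mask out tokens that make it return `False`; `is_complete` signals that a fill-in-the-middle span can legally stop.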
