
LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs (2404.10304v2)

Published 16 Apr 2024 in cs.SE and cs.LG

Abstract: Detecting tricky bugs in plausible programs, those that pass existing test suites yet still contain bugs, remains a significant challenge in software testing. To address this problem, we propose TrickCatcher, an LLM-powered approach to generating test cases for uncovering bugs in plausible programs. TrickCatcher operates in three stages: First, it uses an LLM to generate program variants based on the program under test (PUT) and its specification. Second, it employs an LLM to construct an input generator from the specification for producing test inputs. Finally, these inputs are executed on both the PUT and its program variants to detect inconsistencies in their outputs. We evaluate TrickCatcher on two datasets, TrickyBugs and EvalPlus, which include 366 human-written and 151 AI-generated plausible programs with tricky bugs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80x, 2.65x, and 1.66x those of the state-of-the-art baselines, respectively. Code and data used are available at https://github.com/RinCloud/TrickCatcher.
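The three-stage workflow described in the abstract lends itself to a short illustration of the final, differential-execution stage. The sketch below is not TrickCatcher's actual implementation; the function names (run_program, detect_inconsistencies, generate_input) and the stdin/stdout program convention are assumptions made for illustration. It only shows the core idea: run LLM-generated inputs on both the program under test (PUT) and its LLM-generated variants, and flag any input on which their outputs disagree.

```python
# Minimal sketch of TrickCatcher's third stage (differential execution),
# assuming earlier stages already produced program variants and an input
# generator. All names and the stdin/stdout convention are illustrative.
import subprocess
from typing import Callable, List


def run_program(source_path: str, test_input: str, timeout: float = 5.0) -> str:
    """Run one candidate program on one input and capture its stdout."""
    result = subprocess.run(
        ["python", source_path],
        input=test_input,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()


def detect_inconsistencies(
    put_path: str,                      # program under test (PUT)
    variant_paths: List[str],           # LLM-generated program variants
    generate_input: Callable[[], str],  # LLM-constructed input generator
    num_inputs: int = 100,
) -> List[str]:
    """Return inputs on which the PUT disagrees with at least one variant."""
    suspicious = []
    for _ in range(num_inputs):
        test_input = generate_input()
        put_output = run_program(put_path, test_input)
        variant_outputs = [run_program(v, test_input) for v in variant_paths]
        # Any disagreement marks the input as a candidate bug-revealing test.
        if any(out != put_output for out in variant_outputs):
            suspicious.append(test_input)
    return suspicious
```

In this sketch a disagreement only shows that the PUT and a variant cannot both be correct on that input; the flagged inputs are candidate bug-revealing test cases rather than definitive verdicts.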

References (39)
  1. [n. d.]. EvalPlus Pre-Generated LLM Code Samples. https://github.com/evalplus/evalplus/releases/tag/v0.1.0
  2. [n. d.]. TrickyBugs. https://github.com/RinCloud/TrickyBugs
  3. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software 86, 8 (2013), 1978–2001.
  4. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering 41, 5 (2014), 507–525.
  5. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  6. TOGA: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering. 2130–2141.
  7. Jon Edvardsson. 1999. A survey on automatic test data generation. In Proceedings of the 2nd Conference on Computer Science and Engineering. 21–28.
  8. Robert B Evans and Alberto Savoia. 2007. Differential testing: a new approach to change detection. In The 6th Joint Meeting on European software engineering conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers. 549–552.
  9. Large Language Models for Software Engineering: Survey and Open Problems. arXiv:2310.03533 [cs.SE]
  10. Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. 416–419.
  11. PAL: Program-aided language models. In International Conference on Machine Learning. PMLR, 10764–10799.
  12. Automatic generation of oracles for exceptional behaviors. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 213–224.
  13. Investigating and Detecting Silent Bugs in PyTorch Programs. ([n. d.]).
  14. An empirical study on fine-tuning large language models of code for automated program repair. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1162–1174.
  15. Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE Computer Society, 14–26.
  16. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
  17. TrickyBugs: A Dataset of Corner-case Bugs in Plausible Programs. In Proceedings of the 21st International Conference on Mining Software Repositories (MSR 2024). https://doi.org/10.1145/3643991.3644870
  18. Who Judges the Judge: An Empirical Study on Online Judge Tests. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 334–346. https://doi.org/10.1145/3597926.3598060
  19. Towards More Realistic Evaluation for Neural Test Oracle Generation. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 589–600. https://doi.org/10.1145/3597926.3598080
  20. Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for Python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 168–172.
  21. Phil McMinn. 2004. Search-based software test data generation: A survey. Software Testing, Verification and Reliability 14, 2 (2004), 105–156.
  22. Phil McMinn. 2011. Search-based software testing: Past, present and future. In 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops. IEEE, 153–163.
  23. What do we know about defect detection methods? [Software testing]. IEEE Software 23, 3 (2006), 82–90.
  24. Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM. arXiv preprint arXiv:2402.00097 (2024).
  25. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering 50, 1 (2024), 85–105. https://doi.org/10.1109/TSE.2023.3334955
  26. Silent bugs in deep learning frameworks: An empirical study of Keras and TensorFlow. Empirical Software Engineering 29, 1 (2024), 10.
  27. @tComment: Testing Javadoc comments to detect comment-code inconsistencies. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation. IEEE, 260–269.
  28. Unit test case generation with transformers and focal context. arXiv preprint arXiv:2009.05617 (2020).
  29. Generating accurate assert statements for unit test cases using pretrained transformers. In Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test. 54–64.
  30. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). arXiv preprint arXiv:2206.10498 (2022).
  31. Software testing with large language model: Survey, landscape, and vision. IEEE Transactions on Software Engineering (2024).
  32. On learning meaningful assert statements for unit test cases. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1398–1409.
  33. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  34. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
  35. ChatUniTest: a ChatGPT-based automated unit test generation tool. arXiv preprint arXiv:2305.04764 (2023).
  36. Automated conformance testing for JavaScript engines via deep compiler fuzzing. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 435–450.
  37. arXiv:2305.04207 [cs.SE]
  38. Michal Zalewski. 2015. American Fuzzy Lop (AFL). lcamtuf.coredump.cx/afl/
  39. C2S: translating natural language comments to formal program specifications. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 25–37.
Citations (14)
