
RE#: High Performance Derivative-Based Regex Matching with Intersection, Complement and Lookarounds (2407.20479v1)

Published 30 Jul 2024 in cs.FL

Abstract: We present RE#, a tool and theory for regular expression matching that is built on symbolic derivatives, does not use backtracking, and, in addition to the classical operators, supports complement, intersection and lookarounds. We develop the theory formally and show that the main matching algorithm has input-linear complexity, both in theory and experimentally. A thorough evaluation on popular benchmarks shows that RE# is over 71% faster than the next fastest regex engine in Rust on the baseline, and outperforms all state-of-the-art engines on extensions of the benchmarks, often by several orders of magnitude.
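
To make the idea of derivative-based matching with intersection and complement concrete, here is a minimal sketch of a classical Brzozowski-derivative matcher in Python. It is not the RE# algorithm: it has no symbolic character sets, no lookarounds, no term simplification and no caching of derivatives as automaton states, so it does not have the input-linear behaviour the abstract describes. All class and function names (`Re`, `And`, `Not`, `deriv`, `matches`, and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class Re:
    """Base class for regex terms (hypothetical names, illustration only)."""


@dataclass(frozen=True)
class Empty(Re):          # matches no string at all
    pass


@dataclass(frozen=True)
class Eps(Re):            # matches only the empty string
    pass


@dataclass(frozen=True)
class Chr(Re):            # matches one specific character
    c: str


@dataclass(frozen=True)
class AnyChar(Re):        # matches any single character
    pass


@dataclass(frozen=True)
class Cat(Re):            # concatenation
    l: Re
    r: Re


@dataclass(frozen=True)
class Alt(Re):            # union
    l: Re
    r: Re


@dataclass(frozen=True)
class And(Re):            # intersection
    l: Re
    r: Re


@dataclass(frozen=True)
class Not(Re):            # complement
    r: Re


@dataclass(frozen=True)
class Star(Re):           # Kleene star
    r: Re


def nullable(r: Re) -> bool:
    """True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr, AnyChar)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.l) and nullable(r.r)
    if isinstance(r, Alt):
        return nullable(r.l) or nullable(r.r)
    if isinstance(r, Not):
        return not nullable(r.r)
    raise TypeError(f"unknown term: {r!r}")


def deriv(r: Re, ch: str) -> Re:
    """Derivative of r by ch: what must still match after consuming ch."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == ch else Empty()
    if isinstance(r, AnyChar):
        return Eps()
    if isinstance(r, Cat):
        left = Cat(deriv(r.l, ch), r.r)
        return Alt(left, deriv(r.r, ch)) if nullable(r.l) else left
    if isinstance(r, Alt):
        return Alt(deriv(r.l, ch), deriv(r.r, ch))
    if isinstance(r, And):    # intersection distributes over derivatives
        return And(deriv(r.l, ch), deriv(r.r, ch))
    if isinstance(r, Not):    # so does complement
        return Not(deriv(r.r, ch))
    if isinstance(r, Star):
        return Cat(deriv(r.r, ch), r)
    raise TypeError(f"unknown term: {r!r}")


def matches(r: Re, s: str) -> bool:
    """Whole-string match: one derivative per character, no backtracking."""
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)


# Example: strings that contain "a" AND do NOT contain "bb".
anything = Star(AnyChar())
has_a = Cat(anything, Cat(Chr("a"), anything))
has_bb = Cat(anything, Cat(Cat(Chr("b"), Chr("b")), anything))
pattern = And(has_a, Not(has_bb))

print(matches(pattern, "abab"))  # True
print(matches(pattern, "abba"))  # False
print(matches(pattern, "bcb"))   # False
```

The point of the sketch is that intersection and complement need no special machinery in a derivative-based setting: the derivative simply passes through `And` and `Not`. Because this version never simplifies terms, they grow with every character consumed; a practical engine would normalize and cache them.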

