RE#: High Performance Derivative-Based Regex Matching with Intersection, Complement and Lookarounds (2407.20479v1)
Published 30 Jul 2024 in cs.FL
Abstract: We present RE#, a tool and theory for regular expression matching that is built on symbolic derivatives, does not use backtracking, and, in addition to the classical operators, supports complement, intersection, and lookarounds. We develop the theory formally and show, both in theory and experimentally, that the main matching algorithm has input-linear complexity. A thorough evaluation on popular benchmarks shows that RE# is over 71% faster than the next fastest regex engine in Rust on the baseline, and that it outperforms all state-of-the-art engines on extensions of the benchmarks, often by several orders of magnitude.
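To make the idea of derivative-based matching with intersection and complement concrete, the following is a minimal sketch of a classical Brzozowski-style matcher. It is not the RE# algorithm: it works on concrete characters rather than symbolic character sets, omits lookarounds, and performs no term simplification, so it carries no linear-time guarantee. All names (`Empty`, `Eps`, `Chr`, `Cat`, `Alt`, `And`, `Not`, `Star`, `nullable`, `deriv`, `matches`, `lit`, `contains`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass

# Regex terms as immutable dataclasses (illustrative names, not RE#'s API).
@dataclass(frozen=True)
class Empty:            # matches no string
    pass

@dataclass(frozen=True)
class Eps:              # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr:              # matches a single character
    c: str

@dataclass(frozen=True)
class Cat:              # concatenation
    l: object
    r: object

@dataclass(frozen=True)
class Alt:              # union
    l: object
    r: object

@dataclass(frozen=True)
class And:              # intersection
    l: object
    r: object

@dataclass(frozen=True)
class Not:              # complement
    r: object

@dataclass(frozen=True)
class Star:             # Kleene star
    r: object

def nullable(r):
    """True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.l) and nullable(r.r)
    if isinstance(r, Alt):
        return nullable(r.l) or nullable(r.r)
    if isinstance(r, Not):
        return not nullable(r.r)
    raise TypeError(r)

def deriv(r, a):
    """Derivative of r with respect to character a (no simplification)."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == a else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, And):
        return And(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, Not):
        return Not(deriv(r.r, a))
    if isinstance(r, Star):
        return Cat(deriv(r.r, a), r)
    if isinstance(r, Cat):
        left = Cat(deriv(r.l, a), r.r)
        return Alt(left, deriv(r.r, a)) if nullable(r.l) else left
    raise TypeError(r)

def matches(r, s):
    """Whole-string match: derive once per character, never backtrack."""
    for a in s:
        r = deriv(r, a)
    return nullable(r)

ANY = Not(Empty())      # complement of the empty language: every string

def lit(w):
    """Regex matching exactly the literal string w."""
    r = Eps()
    for c in reversed(w):
        r = Cat(Chr(c), r)
    return r

def contains(w):
    """Regex matching any string that contains w as a substring."""
    return Cat(ANY, Cat(lit(w), ANY))

# Strings that contain "ab" but do not contain "bb".
pattern = And(contains("ab"), Not(contains("bb")))
```

Here `matches(pattern, "aab")` holds, while `matches(pattern, "abb")` and `matches(pattern, "ba")` do not. Because intersection and complement are handled by pushing the derivative through the operator, no automaton product or explicit complementation step is needed in this sketch. A practical engine would also normalize terms to keep the set of reachable derivatives finite.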