RegexPSPACE: Regex LLM Benchmark
RegexPSPACE proposes the first benchmark built on PSPACE-complete regular expression problems (equivalence decision and minimization) to evaluate the spatial complexity limits of LLMs. The authors construct a large-scale dataset of over one million regex instances and evaluate 6 LLMs and 5 LRMs, uncovering common failure patterns such as verbosity and repetition. This work provides a new evaluation framework for quantitatively analyzing the advanced reasoning capabilities and spatial computational limits of LLMs and LRMs.

RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems
Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. To build the benchmark, we perform a double-exponential space exploration with a sound filtering process, constructing a labeled dataset of over a million regex instances. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE.
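To make the RegexEQ task concrete, the sketch below tests whether two regexes denote the same language by exhaustively comparing full matches on all strings up to a bounded length. This is an illustrative approximation only, not the paper's method: a sound decision procedure would compare minimal DFAs, which is exactly what makes the problem PSPACE-complete. The function name `bounded_equiv` and the alphabet/length bounds are illustrative assumptions.

```python
import re
from itertools import product

def bounded_equiv(r1, r2, alphabet="ab", max_len=6):
    """Bounded approximation of regex equivalence (RegexEQ).

    Compares full-match behavior of r1 and r2 on every string over
    `alphabet` up to length `max_len`. This can refute equivalence
    (by returning a witness string) but cannot prove it in general;
    an exact decision requires automata-based methods.
    """
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False, s  # witness distinguishing r1 from r2
    return True, None  # indistinguishable up to the bound

# (a|b)* and (a*b*)* denote the same language over {a, b}
print(bounded_equiv(r"(a|b)*", r"(a*b*)*"))  # (True, None)
# a*b* rejects "ba", which (a|b)* accepts
print(bounded_equiv(r"(a|b)*", r"a*b*"))     # (False, 'ba')
```

The bounded check illustrates why the problem is hard: the naive search space grows exponentially in string length, while exact equivalence checking via subset construction can blow up doubly exponentially in regex size.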



