This program is tentative and subject to change.
To generate valid test inputs for a system, one needs a \emph{specification} of its input language—typically a \emph{context-free grammar} that describes the input syntax. But where does such a grammar come from? In recent years, the field of \emph{input grammar mining} has emerged, with creative approaches to extract input grammars from inputs, code, or both. But how good are these approaches? In particular: how \emph{accurate} are the grammars they mine?
In this study, we systematically \emph{evaluate} grammar miners with respect to these questions. Notably, we find that the evaluations previously conducted by the respective authors—producing a set of inputs from a golden grammar and having them checked by the mined grammar, or vice versa—are insufficient, as they have a strong bias towards short, possibly unrealistic inputs. We therefore also measure the \emph{diversity} of the mined grammars using \emph{$k$-path coverage} with varying depths~$k$, determining how many \emph{combinations} of grammar elements are actually represented.
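The two directions of the evaluation described above can be sketched as follows. This is a minimal illustration of the metrics, not the authors' code; the function names and the trivial acceptance predicate are assumptions for the example.

```python
# Sketch of the two evaluation metrics: generate inputs from one grammar,
# check them against the other. Names and example data are illustrative.

def precision(mined_samples, golden_accepts):
    """Fraction of inputs produced from the mined grammar
    that the golden grammar accepts."""
    return sum(map(golden_accepts, mined_samples)) / len(mined_samples)

def recall(golden_samples, mined_accepts):
    """Fraction of inputs produced from the golden grammar
    that the mined grammar accepts."""
    return sum(map(mined_accepts, golden_samples)) / len(golden_samples)

# Toy "language": strings of even length (stand-in for a real parser)
golden = lambda s: len(s) % 2 == 0
print(precision(["ab", "abc", "abcd", "a"], golden))  # 0.5
```

Because both metrics are computed over a *sample* of generated inputs, short inputs dominate unless the generation strategy is explicitly steered towards deeper derivations.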
Ideally, a mined grammar should have perfect precision and recall regardless of the depth $k$. However, our results show that for all approaches presented so far, precision and recall can drop significantly compared to reported results when increasing~$k$ and thus checking for ``deeper'' diversity, especially for complex input languages such as Lisp, JSON, or Tiny-C. For instance, the Tiny-C grammar mined by Arvada achieves a precision of 75% when considering $k$-paths with $k = 1$ (the originally reported precision was 73%), but this drops to 46% for $k = 5$. White-box approaches based on program analysis, such as Mimid and Stalagmite, are more stable with varying depth $k$, but can be challenged by complex parsers such as mjs. Raising the bar for evaluation, our study shows that there is still room for improvement in grammar mining.
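To make the $k$-path notion concrete, here is a simplified sketch: a $k$-path is taken to be a chain of $k$ nonterminals in which each successor occurs in some expansion of its predecessor. This is a reduced version of the metric (the full definition covers terminals and derivation contexts as well), and the toy JSON-like grammar is hypothetical, not taken from the study.

```python
# Simplified k-path enumeration: collect all chains of k nonterminals
# connected by expansion edges in a context-free grammar.

def k_paths(grammar, k):
    """Return all nonterminal chains of length k following expansion edges."""
    paths = set()

    def walk(path):
        if len(path) == k:
            paths.add(tuple(path))
            return
        for expansion in grammar.get(path[-1], []):
            for sym in expansion:
                if sym in grammar:          # follow nonterminals only
                    walk(path + [sym])

    for nonterminal in grammar:
        walk([nonterminal])
    return paths

# Toy JSON-like grammar (hypothetical, for illustration only)
GRAMMAR = {
    "<value>":  [["<object>"], ["<array>"], ["<number>"]],
    "<object>": [["{", "<value>", "}"]],
    "<array>":  [["[", "<value>", "]"]],
    "<number>": [["0"]],
}

print(sorted(k_paths(GRAMMAR, 2)))
```

For $k = 1$ this yields one path per nonterminal; raising $k$ demands that the mined grammar reproduce ever longer nesting chains (e.g. `<value>` inside `<array>` inside `<object>`), which is exactly where the precision and recall drops reported above show up.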
Thu 16 Apr (displayed time zone: Brasilia, Distrito Federal, Brazil)
14:00 - 15:30 | Testing and Analysis 12 (Research Track at Oceania II) | Chair(s): Sam Malek (University of California at Irvine)

14:00 (15m talk) | Generator Solving for Symbolic Execution | Research Track

14:15 (15m talk) | How Good are Input Grammar Miners? An Empirical Study | Research Track | Leon Bettscheider (CISPA Helmholtz Center for Information Security), Andreas Zeller (CISPA Helmholtz Center for Information Security)

14:30 (15m talk) | LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation | Research Track | Gwihwan Go (Tsinghua University), Quan Zhang (East China Normal University), Chijin Zhou (East China Normal University), Zhao Wei (Tencent), Yu Jiang (Tsinghua University)

14:45 (15m talk) | Breaking Single-Tester Limits: Multi-Agent LLMs for Multi-User Feature Testing | Research Track | Sidong Feng (Monash University), Changhao Du (Jilin University), Huaxiao Liu (Jilin University), Qingnan Wang (Jilin University), Zhengwei Lv (ByteDance), Mengfei Wang (ByteDance), Chunyang Chen (TU Munich)

15:00 (15m talk) | Testing Deep Learning Libraries via Neurosymbolic Constraint Learning | Research Track | M M Abid Naziri (North Carolina State University), Shinhae Kim (Cornell University), Feiran Qin (North Carolina State University), Saikat Dutta (Cornell University), Marcelo d'Amorim (North Carolina State University)

15:15 (15m talk) | MioHint: LLM-Assisted Request Mutation for Whitebox REST API Testing | Research Track | Jia Li (The Chinese University of Hong Kong), Jiacheng Shen (Duke Kunshan University), Yuxin Su (Sun Yat-sen University), Michael Lyu (The Chinese University of Hong Kong)