A Multi-Modal Robotic Framework for Physical GUI Interaction Testing
Wed 15 Apr 2026 15:30 - 16:00 at Catering and Exhibition Hall (Europa I to IV) - Doctoral Symposium Poster Session (Wednesday)
Automated testing of mobile applications remains challenging when the goal is to reproduce realistic user interactions and verify GUI behavior without relying on invasive mechanisms such as instrumentation or ADB-level control. Existing approaches such as DLD and R-DLD detect data loss caused by activity restarts, but their dependence on system-level access limits their applicability across real devices and heterogeneous environments.
This doctoral research proposes RFNIT, a multi-modal, non-invasive robotic framework for physical GUI interaction testing on smartphones. RFNIT enables end-to-end automation that perceives the device state through external sensing and performs physical interactions that closely approximate human behavior while preserving reproducibility and precision.
The framework integrates a vision-based perception module, a reasoning agent powered by large language models (LLMs), and a robotic actuator capable of executing taps, swipes, and other gestures. RFNIT collects multimodal signals—external camera feeds, bounding-box detections, and OCR-extracted text—to identify GUI elements and determine subsequent test actions.
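As a rough illustration of how bounding-box detections and OCR-extracted text might be combined into GUI elements, the following Python sketch attaches each OCR word to the detection that contains its center and then picks a tap target. It is not RFNIT's implementation: every name here (`Detection`, `OcrWord`, `GuiElement`, `fuse`, `choose_tap_target`) is hypothetical, the target selection is a plain text match standing in for the LLM-based reasoning agent, and the mapping from camera pixels to robot coordinates is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in camera-image pixels (assumed convention)
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    label: str  # e.g. "button", "text_field", "keyboard"
    box: Box


@dataclass
class OcrWord:
    text: str
    box: Box


@dataclass
class GuiElement:
    label: str
    box: Box
    text: str


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(box: Box, point: Tuple[float, float]) -> bool:
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def fuse(detections: List[Detection], words: List[OcrWord]) -> List[GuiElement]:
    """Attach to each detection the OCR words whose centers fall inside it."""
    elements = []
    for det in detections:
        inside = [w for w in words if contains(det.box, center(w.box))]
        # Approximate reading order: top-to-bottom, then left-to-right.
        # This is a simplification; real OCR output may need line grouping.
        inside.sort(key=lambda w: (w.box[1], w.box[0]))
        text = " ".join(w.text for w in inside)
        elements.append(GuiElement(det.label, det.box, text))
    return elements


def choose_tap_target(
    elements: List[GuiElement], wanted_text: str
) -> Optional[Tuple[float, float]]:
    """Stand-in for the reasoning agent: return the center of the first
    button whose text contains wanted_text (case-insensitive)."""
    for el in elements:
        if el.label == "button" and wanted_text.lower() in el.text.lower():
            return center(el.box)
    return None


detections = [
    Detection("text_field", (100, 100, 500, 160)),
    Detection("button", (100, 200, 300, 260)),
]
words = [
    OcrWord("Email", (110, 115, 180, 145)),
    OcrWord("Sign", (120, 215, 170, 245)),
    OcrWord("in", (180, 215, 200, 245)),
]
elements = fuse(detections, words)
target = choose_tap_target(elements, "sign in")
```

In this toy input, `target` is the pixel center of the "Sign in" button. A physical setup would still need a calibration step to convert that point into actuator coordinates before executing the tap.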
Preliminary results show that RFNIT can successfully recognize buttons, text fields, and virtual keyboards from external images, achieving over 85% visual detection accuracy. Future work will focus on improving robustness under varying lighting conditions, enhancing repeatability across long testing sessions, and expanding coverage of complex GUI states.