Element-Aware Fine-Tuning of Vision-Language Models for Cost-Efficient GUI Testing in an Industrial Setting
User Interface (UI) testing is crucial for quality assurance of industrial mobile applications, yet it remains labor-intensive and difficult to automate effectively. Recent advances in Vision-Language Models (VLMs) offer a promising way to automate GUI testing by mapping natural language instructions to pixels, significantly reducing the manual effort required to write test scripts and even to design test cases. Although numerous VLMs have been proposed and evaluated for GUI testing, they often fail to meet two critical industrial requirements: (1) effectiveness and reliability when handling complex, multi-step workflows in industrial applications, and (2) efficiency and cost-effectiveness in the large-scale, high-frequency testing environments typical of industrial settings. To address these requirements, in this paper we report our experiences in developing and deploying \toolname{}, a three-stage approach that enables VLMs to explicitly detect and reason over discrete GUI elements, thereby overcoming the limitations of pixel-based reasoning and improving both efficiency and effectiveness. In the first stage, \toolname{} integrates a lightweight UI-element detector, OmniParser, to decompose UI screenshots into structured element representations with semantic annotations and spatial relationships. In the second stage, \toolname{} fine-tunes a VLM to reason about natural language instructions over the detected UI elements, enabling efficient small models to outperform expensive large models. Comprehensive evaluations on public benchmarks and deployment at WeChat show that \toolname{} consistently achieves higher accuracy and efficiency than state-of-the-art VLMs.
Specifically, \toolname{} enables a fine-tuned Qwen2.5-VL-3B model to outperform a 72B model with 75% less training data, validating the effectiveness of incorporating domain knowledge into VLM-based GUI testing. We summarize three major lessons learned from developing and deploying \toolname{}.
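
To make the element-aware idea concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes the detector yields a list of elements, each with an id, a type, a text label, and a pixel bounding box. The model is then asked to pick an element id rather than raw coordinates, and that id is mapped back to a click point. The names `UIElement`, `build_prompt`, and `click_point`, as well as the prompt wording, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UIElement:
    """Hypothetical structured record for one detected GUI element."""
    element_id: int
    kind: str                          # e.g. "button", "text_field"
    label: str                         # semantic annotation for the element
    bbox: Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels


def build_prompt(instruction: str, elements: List[UIElement]) -> str:
    """Serialize the instruction and detected elements into a text prompt.

    The prompt format is an assumption; the source does not specify one.
    """
    lines = [
        f'[{e.element_id}] {e.kind} "{e.label}" at {e.bbox}'
        for e in elements
    ]
    return (
        f"Instruction: {instruction}\n"
        "Elements:\n" + "\n".join(lines) + "\n"
        "Answer with the id of the element to act on."
    )


def click_point(elements: List[UIElement], chosen_id: int) -> Tuple[int, int]:
    """Map a chosen element id back to the center of its bounding box."""
    for e in elements:
        if e.element_id == chosen_id:
            left, top, right, bottom = e.bbox
            return ((left + right) // 2, (top + bottom) // 2)
    raise KeyError(f"no element with id {chosen_id}")


# Illustrative data only; not taken from the paper or any benchmark.
elements = [
    UIElement(0, "button", "Send", (300, 900, 420, 960)),
    UIElement(1, "text_field", "Message", (20, 900, 280, 960)),
]
prompt = build_prompt("Tap the send button", elements)
```

Under this assumed setup the model's output space shrinks from every pixel on the screen to a handful of element ids, which is one plausible reading of why reasoning over discrete elements could help smaller models.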

