Prioritizing Test Smells: An Empirical Evaluation of Quality Metrics and Developer Perceptions
Test smells, suboptimal patterns in test code, impair software maintainability and reliability, especially in resource-constrained open-source Python projects. While detection tools such as PyNose identify Python-specific test smells, prioritizing them for refactoring remains a challenge because test-specific prioritization frameworks are lacking. This study proposes a metric-driven approach that integrates Change Proneness (CP) and Fault Proneness (FP) metrics, computed via Spearman’s rank correlation, to quantify maintenance and reliability risks across 15 test smells in 52 open-source Python projects. Complementing this, a survey of 45 developers captures subjective severity perceptions. Applying Martin Fowler’s Technical Debt Quadrant, we classify the smells into four categories based on empirical risk and developer perception, which supports prioritization. Of the 15 analyzed smells, Conditional Test Logic, Duplicate Assert, Obscure In-Line Setup, and Redundant Assertion fall into the highest-priority category for refactoring: they combine high empirical risk with strong developer agreement. This integrated framework advances test smell prioritization by combining data-driven analysis with practitioner perspectives, facilitating efficient refactoring decisions and improved test suite quality.
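As a rough illustration of the two ingredients named above, the sketch below computes Spearman's rank correlation between per-file smell counts and per-file change counts, then places a smell in one of four cells from its empirical risk and its surveyed severity. It is a minimal sketch under stated assumptions, not the study's implementation: the function names, the cut-off values (0.3 for correlation, 3.0 on an assumed 1 to 5 severity scale), and the cell labels are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def quadrant(risk, perceived, risk_cut=0.3, perceived_cut=3.0):
    """Place a smell in one of four cells (hypothetical cut-offs and labels)."""
    risk_label = "high risk" if risk >= risk_cut else "low risk"
    view_label = (
        "high perceived severity"
        if perceived >= perceived_cut
        else "low perceived severity"
    )
    return f"{risk_label}, {view_label}"


# Hypothetical data: occurrences of one smell and number of changes per test file.
smell_counts = [0, 2, 1, 5, 3]
change_counts = [1, 4, 2, 9, 5]
cp = spearman(smell_counts, change_counts)
cell = quadrant(cp, perceived=4.2)
```

In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written correlation; the pure-Python version is used here only to keep the sketch free of dependencies.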