An Empirical Study of Waterfall-style Multi-Agent Workflows for Class-Level Code Generation
Modern software systems demand code that is not only functional but also maintainable and well-structured. While Large Language Models (LLMs) show potential in automating such development, most studies evaluate isolated, single-agent generation at the function level. This paper investigates how process structure and role specialization influence multi-agent LLM workflows for class-level code generation. We simulate a Waterfall-style software development life cycle spanning Requirement, Design, Implementation, and Testing using three LLMs (GPT-4o-mini, DeepSeek-Chat, and Claude-3.5-Haiku) on 100 Python tasks from the ClassEval benchmark. Our results reveal that multi-agent workflows restructure, rather than uniformly improve, performance. Waterfall-style collaboration yields cleaner and more maintainable code, yet it lowers functional correctness for two of the three models (–37.8% for GPT-4o-mini, –39.8% for DeepSeek-Chat), with Claude-3.5-Haiku the exception (+9.5%). Crucially, process constraints shift the nature of failures, reducing structural issues such as missing code but amplifying semantic and validation errors. Among the activities, Testing has the greatest influence, strengthening verification but introducing reasoning faults, whereas Requirement and Design have a limited effect. Overall, this study provides the first empirical evidence that software process structure fundamentally shapes how LLMs reason and fail, highlighting trade-offs between disciplined workflow control and adaptive problem-solving in multi-agent code generation.
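To make the workflow shape concrete, the following is a minimal sketch of a sequential four-stage pipeline in which each role receives the artifacts produced by the earlier roles. Only the stage names come from the text above. The function `run_waterfall`, the `llm` callable signature, the prompt layout, and the stand-in `echo_llm` are hypothetical illustrations, not the study's actual implementation.

```python
from typing import Callable, Dict, List

# Stage names are taken from the abstract; everything else is an assumption.
STAGES: List[str] = ["Requirement", "Design", "Implementation", "Testing"]


def run_waterfall(
    task: str,
    llm: Callable[[str, str], str],
    stages: List[str] = STAGES,
) -> Dict[str, str]:
    """Run each stage once, in order, passing all earlier artifacts forward."""
    artifacts: Dict[str, str] = {}
    for stage in stages:
        # Concatenate the outputs of all previous stages as labelled sections.
        context = "\n\n".join(
            f"[{name}]\n{text}" for name, text in artifacts.items()
        )
        prompt = f"Task:\n{task}\n\nPrior artifacts:\n{context or '(none)'}"
        # Hypothetical interface: llm(role, prompt) returns that role's artifact.
        artifacts[stage] = llm(stage, prompt)
    return artifacts


def echo_llm(role: str, prompt: str) -> str:
    """Stand-in for a real model call; reports how many artifacts it was given."""
    seen = sum(line.startswith("[") for line in prompt.splitlines())
    return f"{role} output ({seen} prior artifacts)"


result = run_waterfall("Implement a BankAccount class", echo_llm)
for stage, artifact in result.items():
    print(f"{stage}: {artifact}")
```

Under these assumptions, passing a shorter `stages` list is one simple way to leave an activity out and compare the remaining pipeline against the full one.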