PROMISE 2026
Sun 5 Jul 2026 Montreal, Canada
co-located with FSE 2026

Modern software systems demand code that is not only functional but also maintainable and well-structured. While Large Language Models (LLMs) show potential in automating such development, most studies evaluate isolated, single-agent generation at the function level. This paper investigates how process structure and role specialization influence multi-agent LLM workflows for class-level code generation. We simulate a Waterfall-style software development life cycle spanning Requirement, Design, Implementation, and Testing using three LLMs (GPT-4o-mini, DeepSeek-Chat, and Claude-3.5-Haiku) on 100 Python tasks from the ClassEval benchmark. Our results reveal that multi-agent workflows restructure, rather than uniformly improve, performance. Waterfall-style collaboration yields cleaner and more maintainable code, yet often lowers functional correctness (–37.8% for GPT-4o-mini, –39.8% for DeepSeek-Chat), except for Claude-3.5-Haiku (+9.5%). Crucially, process constraints shift the nature of failures, reducing structural issues like missing code but amplifying semantic and validation errors. Among activities, Testing has the greatest influence, strengthening verification but introducing reasoning faults, whereas Requirement and Design have a limited effect. Overall, this study provides the first empirical evidence that software process structure fundamentally shapes how LLMs reason and fail, highlighting trade-offs between disciplined workflow control and adaptive problem-solving in multi-agent code generation.