- The paper introduces CIAO, a workflow that uses GPT-5 to extract system-level architecture documentation from GitHub repositories.
- It employs an eight-section, standards-aligned template to ensure traceability between code artifacts and architectural views.
- An empirical evaluation with 22 expert developers found high perceived accuracy for component mapping and substantial perceived value for integration.
Automated System-Level Software Architecture Documentation via CIAO Workflow
Introduction and Motivation
The absence or insufficiency of system-level architectural documentation remains a critical bottleneck for software comprehension, onboarding, and long-term maintenance. Code-centric development, time pressure, and architectural drift frequently undermine documentation quality, leading to architectural erosion and debt. Prevailing standards such as ISO/IEC/IEEE 42010 and SEI’s Views and Beyond stipulate that architectural documentation should systematically address stakeholder concerns. While LLM-driven documentation generation has matured, its application has largely been restricted to artifact-centric or local descriptions, not end-to-end system-level architectural narratives. This paper presents CIAO (Code In Architecture Out), an automated workflow leveraging GPT-5 for extracting and synthesizing high-level architectural documentation from GitHub repositories, following an expert-validated, standards-oriented template aligned with ISO/IEC/IEEE 42010, SEI Views and Beyond, and the C4 model (2604.08293).
Standards-Aligned Template and Workflow Design
CIAO structures system-level documentation into eight sections, each targeting specific architectural concerns identified via expert-driven iterative refinement. The template covers:
- System Overview: Outlines system scope and purpose, mapping system-of-interest concepts.
- Architectural Context: Characterizes external dependencies, actors, APIs, and integration points (C4 Level 1).
- Containers: Details runtime deployable units, their interfaces, and responsibilities (C4 Level 2).
- Components: Documents internal architectural decomposition and subsystem relationships (C4 Level 3).
- Code-Level Mapping: Provides direct traceability between architectural abstractions and source artifacts (C4 Level 4).
- Cross-Cutting Concerns: Summarizes quality-related concerns spanning multiple architectural layers.
- Quality Attributes and Rationale: Articulates performance, maintainability, scalability, and security reasoning extracted from design choices.
- Deployment: Specifies operational infrastructure, artifact allocation, and execution boundaries.
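The eight sections above can be written down as a small data structure, which makes the C4 correspondence explicit. This is a minimal sketch only: the names `TemplateSection`, `CIAO_TEMPLATE`, and `sections_for_c4_level` are hypothetical and are not taken from the paper, and the `concern` strings are condensed from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TemplateSection:
    """One section of the documentation template (hypothetical representation)."""
    title: str
    concern: str
    c4_level: Optional[int] = None  # None where the list above names no C4 level


# Section titles and C4 levels follow the list above; wording is abridged.
CIAO_TEMPLATE = (
    TemplateSection("System Overview", "system scope and purpose"),
    TemplateSection("Architectural Context", "external dependencies, actors, APIs, integration points", 1),
    TemplateSection("Containers", "runtime deployable units, interfaces, responsibilities", 2),
    TemplateSection("Components", "internal decomposition and subsystem relationships", 3),
    TemplateSection("Code-Level Mapping", "traceability between abstractions and source artifacts", 4),
    TemplateSection("Cross-Cutting Concerns", "quality-related concerns spanning layers"),
    TemplateSection("Quality Attributes and Rationale", "performance, maintainability, scalability, security reasoning"),
    TemplateSection("Deployment", "operational infrastructure, artifact allocation, execution boundaries"),
)


def sections_for_c4_level(level: int) -> List[str]:
    """Return the titles of the sections mapped to the given C4 level."""
    return [s.title for s in CIAO_TEMPLATE if s.c4_level == level]
```

Keeping the template as data rather than prose means a generator can iterate over it section by section, which is one plausible way to support the section-specific generation described next.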
Role-based prompt engineering (profile and instruction), task decomposition for section-specific generation, and evidence-based, few-shot scaffolding are used to promote architectural correctness and reduce hallucinations. The workflow accepts a flattened repository as input, generates section prompts in parallel, assembles the resulting sections into a single document, and renders diagrams via PlantUML.
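The flatten, prompt, and assemble steps described above might be sketched roughly as follows. Everything here is an assumption made for illustration: the file-marker format, the prompt wording, the suffix filter, and the helper names (`flatten_repository`, `build_prompt`, `generate_document`, `fake_llm`) are hypothetical, and the model call is replaced by a stub so the example runs without any API access.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

SECTIONS = [
    "System Overview", "Architectural Context", "Containers", "Components",
    "Code-Level Mapping", "Cross-Cutting Concerns",
    "Quality Attributes and Rationale", "Deployment",
]


def flatten_repository(root: Path, suffixes=(".py", ".md")) -> str:
    """Concatenate matching files into one text, each preceded by a path marker."""
    parts = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"=== FILE: {rel} ===\n{text}")
    return "\n\n".join(parts)


def build_prompt(section: str, flattened: str) -> str:
    """Role-based prompt: a profile line, an instruction line, then the evidence."""
    return (
        "Profile: you are a software architect documenting a system.\n"
        f"Instruction: write only the '{section}' section and cite file paths as evidence.\n"
        f"Repository:\n{flattened}"
    )


def generate_document(root: Path, llm: Callable[[str], str]) -> str:
    """Generate every section concurrently and assemble them in template order."""
    flattened = flatten_repository(root)
    prompts = [build_prompt(s, flattened) for s in SECTIONS]
    with ThreadPoolExecutor(max_workers=4) as pool:
        bodies = list(pool.map(llm, prompts))  # map() preserves input order
    return "\n\n".join(f"{title}\n{body}" for title, body in zip(SECTIONS, bodies))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: the real client is swapped in here)."""
    return f"[stub output for a prompt of {len(prompt)} characters]"
```

Calling `generate_document(Path("path/to/repo"), fake_llm)` returns one text with the eight section titles in template order. Because each section has its own prompt, a weak section can be regenerated without touching the others.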
Figure 1: CIAO end-to-end workflow, encompassing repository flattening, prompted section generation, assembly, and visual diagram synthesis.
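For the diagram step, PlantUML consumes plain text, so a generator only has to emit a source string. The sketch below is an assumption about how component relationships could be serialized, not the paper's implementation; `to_plantuml` and the component names in the usage note are hypothetical. Rejecting relations that point at undeclared components is one simple guard against misleading diagrams.

```python
from typing import Iterable, List, Tuple


def to_plantuml(components: List[str],
                relations: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize components and labelled relations as PlantUML component-diagram source."""
    lines = ["@startuml"]
    alias = {}
    for index, name in enumerate(components):
        alias[name] = f"c{index}"  # short alias so names may contain spaces
        lines.append(f'component "{name}" as c{index}')
    for source, target, label in relations:
        if source not in alias or target not in alias:
            # Fail loudly instead of drawing an edge to an undeclared element.
            raise ValueError(f"relation references unknown component: {source} -> {target}")
        lines.append(f"{alias[source]} --> {alias[target]} : {label}")
    lines.append("@enduml")
    return "\n".join(lines)
```

For example, `to_plantuml(["API", "Database"], [("API", "Database", "reads")])` yields a four-line diagram source that the PlantUML tool can then render.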
Empirical Evaluation: Methodology and Results
Twenty-two developers with advanced architectural expertise participated, each evaluating CIAO documentation for repositories ranging from 81 to 238,951 LOC across heterogeneous technology stacks. Perceptions were captured using structured Likert-scale items and qualitative open-ended queries, addressing value (RQ1), comprehensibility (RQ2), accuracy (RQ3), limitations (RQ4), and generation efficiency (RQ5).
Figure 2: Likert-scale ratings for perceived value (RQ1), blue=agree/strongly agree, yellow=neutral, red=disagree/strongly disagree.
Key Quantitative Findings
- Perceived Value (RQ1): 72.7% of participants indicated that the documentation was valuable for integration, especially the diagrams and component-level descriptions.
- Comprehensibility (RQ2): 68.2% found the documentation clear and sufficiently detailed; architectural terminology was consistently accurate.
- Accuracy (RQ3): Highest accuracy observed for code-grounded sections (Components and Containers, 86.4% strongly agree/agree). Weakest ratings for interpretive, high-level views (Overview, Context).
- Limitations (RQ4): Main deficiencies were diagram errors (truncation, misleading representations), deployment inaccuracies, section inconsistencies, and occasional missing information. Suggestions included improved diagram synthesis, reduction of verbosity, and human-in-the-loop refinement.
- Efficiency (RQ5): Mean generation time ≈ 3 minutes; mean API cost ≈ $1.19 per repository.
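The percentages above can be related back to respondent counts with simple arithmetic. The sketch below assumes that every one of the 22 participants answered each item and that each percentage is the share choosing agree or strongly agree; the summary does not report raw counts, so the counts it recovers are inferred, not quoted.

```python
RESPONDENTS = 22  # number of participating developers stated in the text


def likert_share(count: int, total: int = RESPONDENTS) -> float:
    """Percentage of respondents, rounded to one decimal place."""
    return round(100 * count / total, 1)


def implied_counts(percent: float, total: int = RESPONDENTS) -> list:
    """All respondent counts that round to the reported percentage."""
    return [k for k in range(total + 1) if likert_share(k, total) == percent]


for label, percent in [("RQ1 value", 72.7),
                       ("RQ2 comprehensibility", 68.2),
                       ("RQ3 components/containers", 86.4)]:
    print(label, percent, implied_counts(percent))
```

Under these assumptions the three figures correspond to 16, 15, and 19 of 22 respondents, which also shows how coarse the scale is: a single respondent shifts a percentage by about 4.5 points.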
Figure 3: Likert-scale scores for comprehensibility (RQ2) across clarity, terminology, redundancy, and detail.
Figure 4: Ratings for accuracy and correspondence (RQ3): general architectural consistency and reliability.
Figure 5: Section-level accuracy ratings (RQ3): highest for code and component mapping; lowest for interpretive diagrams.
Practical and Theoretical Implications
CIAO demonstrates that structured, standards-aligned prompt engineering combined with task decomposition enables LLMs to generate introductory yet actionable system-level architectural documentation for repositories of widely varying size. Explicit mappings between architectural concepts and code artifacts support traceability, a key requirement for regulatory compliance and team onboarding. Practitioners in regulated domains (e.g., safety-critical systems) and those experiencing architectural drift showed particular interest in automated, traceable documentation.
From a theoretical standpoint, the results support extending LLM-based documentation from artifact-centric descriptions to holistic system-level architecture recovery, bridging the gap between reverse engineering and architectural synthesis. The principal limitations lie in diagram synthesis, high-level context modeling, and deployment representation. They expose the constraints of purely textual, prompt-driven generation and point to the need for hybrid approaches that incorporate static or dynamic code analysis or retrieval-augmented generation.
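As one illustration of what a static-analysis complement could look like, the sketch below extracts the modules a Python file imports using the standard `ast` module. It is a hypothetical example of evidence that could be supplied alongside a prompt, not a technique the paper reports using; `imported_modules` is an invented name, and relative imports without a module name are deliberately skipped.

```python
import ast
from typing import List


def imported_modules(source: str) -> List[str]:
    """Return the sorted, de-duplicated module names imported by Python source code."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a, b.c" contributes "a" and "b.c"
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from pkg.mod import x" contributes "pkg.mod";
            # "from . import x" has no module name and is skipped
            found.add(node.module)
    return sorted(found)
```

A dependency list derived this way comes from parsing rather than from model inference, so it could serve as a cross-check on the relationships a generated Components section claims.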
Future Directions
Improving diagram reliability and the semantic modeling of infrastructure is the main avenue for future work, potentially requiring modular integration with static analysis frameworks and architectural artifact retrieval. Expanded evaluation in large-scale industrial contexts and with human-in-the-loop workflows will be essential for adoption and quality improvements. Further research should address edge cases in architectural drift, configuration management, and evolution tracking within LLM-generated system-level documentation.
Conclusion
CIAO operationalizes LLM-driven, standards-aligned architectural documentation synthesis and has been evaluated empirically on real-world repositories of widely varying size. Results indicate that practitioners perceive the output as valuable, accurate, and actionable, with manageable operational cost and time. Theoretical implications underline the expansion of LLM capabilities from code artifact summarization to systematic architectural recovery. The major limitations (diagram fidelity, deployment specification, and context abstraction) point toward future integration of hybrid analysis methods and collaborative refinement for higher adoption and reliability.