Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images

Published 10 Apr 2026 in cs.CV, cs.GR, and cs.LG | (2604.09260v1)

Abstract: Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.


Authors (3)

Summary

The paper introduces a pairwise alignment loss added to YOLOv8’s training, significantly improving the grid regularity of facade detections.
It demonstrates that calibrated alignment weights yield enhanced SVD-based regularity with minimal mAP@0.5 degradation, validating the method on the CMP facade dataset.
The approach provides structurally coherent outputs for downstream neuro-symbolic modeling, though its dependency on multiple elements and scale sensitivity pose limitations.

Structurally Informed Facade Parsing via Pairwise Alignment in Object Detection

Problem Formulation and Motivation

Accurate recovery of facade structure from monocular imagery is foundational for urban procedural modeling, digital twins, and content generation workflows. Prevailing neural detectors, exemplified by YOLOv8, yield high recall and plausible local predictions but disregard the global regularities typical of architectural facades, such as repeated windows aligned on implicit grids. This lack of structural coherence leads to downstream instability and ambiguity when such detections are consumed by inverse procedural modeling pipelines. Prior research has relied either on generative grammars or on post-processing repairs, but no efficient, end-to-end solution exists that encourages geometric consistency throughout the detection pipeline, especially in the presence of occlusions, perspective distortions, and annotation noise.

Methodology

This work introduces a lightweight architectural prior directly into the training objective of YOLOv8, augmenting the standard detection loss with a pairwise alignment regularizer. The method operates exclusively at training time and requires no modification to the inference pipeline or network architecture. The alignment loss is formulated to bias bounding boxes of the same semantic class towards being grid-aligned along both axes. It is implemented by examining pairs of positive bounding boxes, selected by their pixel-wise proximity and lack of overlap, and penalizing coordinate discrepancies below a tunable threshold $T$. The contribution of this geometric loss is calibrated by a weight parameter $W$, allowing explicit control of the trade-off between classical detection accuracy and emergent structural regularity.
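The pairwise regularizer described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the paper's implementation: the function names (`boxes_overlap`, `alignment_penalty`, `total_loss`), the choice to compare all four edge coordinates, and the averaging over penalized terms are assumptions made here, and a real training loop would express the same logic with differentiable tensor operations inside the detector's loss.

```python
from itertools import combinations


def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def alignment_penalty(boxes, classes, T):
    """Mean near-miss offset over same-class, non-overlapping box pairs.

    An edge-coordinate gap d is counted only when 0 < d < T (assumed
    reading of the threshold): such edges are treated as "almost aligned"
    and pulled together, while larger gaps are taken to belong to
    different grid rows or columns and are ignored.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(boxes)), 2):
        if classes[i] != classes[j] or boxes_overlap(boxes[i], boxes[j]):
            continue
        for k in range(4):  # x1, y1, x2, y2
            d = abs(boxes[i][k] - boxes[j][k])
            if 0.0 < d < T:
                total += d
                count += 1
    return total / count if count else 0.0


def total_loss(detection_loss, boxes, classes, T, W):
    """Detection loss plus the W-weighted alignment term."""
    return detection_loss + W * alignment_penalty(boxes, classes, T)


# Three windows in a row; the middle one sits 3 px too low.
boxes = [(10, 20, 50, 80), (70, 23, 110, 83), (130, 20, 170, 80)]
classes = [0, 0, 0]
penalty = alignment_penalty(boxes, classes, T=10)
```

In this toy row the horizontal gaps (60 px and 120 px) exceed the threshold and are ignored, so only the 3 px vertical offsets of the middle window contribute. Setting `W` to zero recovers the unmodified detection loss.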

Experimental Evaluation

Evaluations are carried out on the CMP facade dataset, which provides pixel-level annotation for diverse European building facades. The key metrics for analysis are mAP@0.5 (to measure instance-level detection fidelity) and an SVD-based regularity score (which quantifies how well predicted layouts approximate low-rank structures, reflecting repetitive grid-like organization). The results demonstrate that for moderate values of the alignment loss weight $W$, SVD-based regularity improves significantly, indicating that the method yields predictions with strong geometric coherence. Importantly, these structural gains are achieved with minimal degradation to mAP, up to a critical point where over-regularization (excessively high $W$ or low $T$) induces a sharp decline in detection rates. Qualitative inspection confirms these findings: the model robustly recovers partially occluded windows and removes systematic alignment errors arising from perspective and annotation imperfections, while occasionally missing small or singleton elements under aggressive regularization.
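The summary does not spell out how the SVD-based regularity score is computed, so the following is one plausible reading rather than the paper's metric. It rasterizes the predicted boxes into a binary occupancy grid and reports the share of the grid's squared Frobenius norm captured by its largest singular value; a perfect grid of identical windows is a rank-1 pattern and scores close to 1. To stay dependency-free, the top singular value is estimated by power iteration instead of a full SVD. The names `rasterize` and `rank1_energy`, the cell size, and the example coordinates are all hypothetical.

```python
import math


def rasterize(boxes, width, height, cell):
    """Binary occupancy grid: 1.0 where a cell centre lies inside a box."""
    rows, cols = height // cell, width // cell
    grid = [[0.0] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in boxes:
        for r in range(rows):
            cy = (r + 0.5) * cell
            if not (y1 <= cy < y2):
                continue
            for c in range(cols):
                if x1 <= (c + 0.5) * cell < x2:
                    grid[r][c] = 1.0
    return grid


def rank1_energy(grid, iters=100):
    """sigma_1^2 / ||grid||_F^2, with sigma_1^2 from power iteration."""
    frob_sq = sum(x * x for row in grid for x in row)
    if frob_sq == 0.0:
        return 0.0
    n_rows, n_cols = len(grid), len(grid[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-norm, non-negative start
    sigma1_sq = 0.0
    for _ in range(iters):
        u = [sum(a * b for a, b in zip(row, v)) for row in grid]  # grid @ v
        w = [sum(grid[r][c] * u[r] for r in range(n_rows))
             for c in range(n_cols)]                              # grid.T @ u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        sigma1_sq = norm  # ||A^T A v|| for unit v tends to sigma_1^2
    return sigma1_sq / frob_sq


# A 2 x 3 grid of identical windows, and the same layout with one window
# shifted 15 px downwards.
perfect = [(x, y, x + 40, y + 60) for y in (20, 100) for x in (10, 70, 130)]
jittered = [(70, 35, 110, 95) if b == (70, 20, 110, 80) else b for b in perfect]

score_perfect = rank1_energy(rasterize(perfect, 180, 180, 10))
score_jittered = rank1_energy(rasterize(jittered, 180, 180, 10))
```

Under this reading, shifting a single window out of its row lowers the score, which is the kind of misalignment the regularity metric is meant to expose.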

Implications and Limitations

The principal implication is that detector-side architectural priors, imposed via loss-based geometric regularization, can lead to facade parses that are not only locally plausible but also structurally consistent and ready for further procedural manipulation. This substantially increases the utility of detector outputs for downstream symbolic modeling, as grid-regularized detections closely match the assumptions of inverse procedural modeling (IPM) systems (e.g., FaçAID, Pro-DG).

There are, however, empirically grounded limitations. The approach is fundamentally dependent on the presence of multiple candidate elements, and its effectiveness is constrained in facades with high semantic variability or severe perspective effects. Anchoring the threshold $T$ in absolute pixel coordinates introduces scale sensitivity, making cross-dataset and resolution-variant applications less straightforward.
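The scale sensitivity can be made concrete with a small arithmetic sketch. It assumes the reading used in the Methodology section, where an offset is penalized only if it is smaller than $T$ pixels; the helper name `is_penalized` and the numbers are made up for illustration.

```python
def is_penalized(offset_fraction, image_width, T):
    """Whether a misalignment, given as a fraction of image width,
    falls under an absolute pixel threshold T at this resolution."""
    offset_px = offset_fraction * image_width
    return 0.0 < offset_px < T


T = 10.0          # threshold in pixels (hypothetical value)
offset = 0.012    # the same misalignment: 1.2% of the image width

low_res = is_penalized(offset, 640, T)    # 7.68 px, under the threshold
high_res = is_penalized(offset, 1280, T)  # 15.36 px, over the threshold
```

The same physical misalignment is corrected at one resolution and ignored at the other, which is why a fixed pixel threshold does not transfer directly across datasets or input sizes.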

Theoretical and Practical Directions

The study opens avenues for end-to-end urban scene understanding pipelines where detection and procedural abstraction are tightly coupled via consistency priors. Extending the alignment loss to three-dimensional contexts would better accommodate oblique imagery and unrectified scenes and could facilitate direct 3D layout estimation. More generally, the approach embodies a broader shift toward integrating symbolic priors (regularity, symmetry, repetition) into deep visual representations for improved structure-aware perception.

Conclusion

In summary, this paper presents a detector-level structural regularization framework that improves the architectural plausibility of facade parses in object detectors. By incorporating a pairwise alignment loss in the YOLOv8 pipeline, the method achieves quantifiable gains in grid regularity while preserving real-time inference and competitive accuracy. This makes it particularly suitable as a preprocessing step for neuro-symbolic modeling systems and highlights the benefit of embedding geometric priors into end-to-end learning architectures. Future work will address scale sensitivity, expand to 3D applications, and further integrate with procedural completion methods for urban scene understanding.
