- The paper introduces an energy-efficient UAV system that uses a two-stage audio processing pipeline to detect and localize victim sounds in challenging environments.
- It employs a circular microphone array with MAE-based anomaly detection and TDoA/DoA methods to achieve robust performance across diverse scenarios.
- Significant energy savings are demonstrated by activating high-power multichannel processing only upon detecting anomalies, extending UAV mission endurance.
Sky-Ear: An Energy-Efficient UAV System for Victim Sound Detection and Localization
Introduction
"Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System" (2604.12455) addresses the limitations of vision-centric UAV-based search-and-rescue (SAR) by introducing a robust, energy-efficient system that uses acoustic sensing to detect and localize victims in challenging environments. The design centers on a circular microphone array and a two-stage (Sentinel/Responder) hierarchical audio-processing pipeline, which together support reliability and power conservation during extended SAR operations in adverse settings such as deserts and forests.
Figure 1: Architecture of the "Sky-Ear" UAV-enabled victim sound detection and localization system highlighting the two-stage audio processing and array configuration.
System Architecture
Hardware and Signal Acquisition
Sky-Ear mounts a compact M-element circular microphone array on a UAV, with one central microphone and the remaining microphones distributed around the periphery. The system continuously records M-channel audio into a rolling buffer, which supports low-latency, event-driven processing while keeping onboard storage demands low.
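The rolling buffer can be pictured as a fixed-length queue of M-channel frames in which the oldest frame is discarded whenever a new one arrives. The following is a minimal sketch under that assumption; the class name `RollingAudioBuffer` and its methods are hypothetical and are not taken from the paper.

```python
from collections import deque


class RollingAudioBuffer:
    """Hypothetical fixed-length buffer holding the most recent M-channel frames."""

    def __init__(self, num_channels, max_frames):
        self.num_channels = num_channels
        # A deque with maxlen drops the oldest frame automatically when full.
        self._frames = deque(maxlen=max_frames)

    def push(self, frame):
        """Append one frame containing one sample per microphone."""
        if len(frame) != self.num_channels:
            raise ValueError("frame must contain one sample per microphone")
        self._frames.append(tuple(frame))

    def channel(self, index):
        """Return the buffered samples of a single microphone, oldest first."""
        return [frame[index] for frame in self._frames]

    def snapshot(self):
        """Return all buffered frames, for example to hand to a later stage."""
        return list(self._frames)


# Usage: a 3-microphone buffer that keeps only the last 4 frames.
buffer = RollingAudioBuffer(num_channels=3, max_frames=4)
for t in range(6):
    buffer.push((t, t + 10, t + 20))
central_channel = buffer.channel(0)  # samples of the microphone at index 0
```

Because storage is bounded by `max_frames`, memory use stays constant however long the flight lasts.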
Two-Stage Audio Processing Pipeline
- Sentinel Stage: Operates on a single central microphone (a0), continuously monitoring the audio stream. An anomaly detection mechanism based on a Masked Autoencoder (MAE) analyzes Mel-spectrograms for non-background events, parameterized by masking ratios (ρ) optimized for scenario-specific noise statistics.
- Responder Stage: Activated only upon Sentinel anomaly detection. Utilizes multichannel inputs A for fine-grained time-difference-of-arrival (TDoA) and direction-of-arrival (DoA) estimation to localize potential victims. The system synthesizes multi-observation results from the UAV's trajectory, enabling continuous and precise geolocation through geometric optimization.
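The gating between the two stages described above can be sketched as a simple loop: the cheap single-channel check runs on every window, and the multichannel step runs only when that check fires. This is an illustrative sketch, not the authors' implementation. The function names are hypothetical, and a mean-energy score stands in for the MAE-based detector.

```python
def mean_energy(samples):
    """Toy stand-in for the Sentinel score (the paper uses an MAE instead)."""
    return sum(s * s for s in samples) / len(samples)


def run_pipeline(windows, sentinel_score, responder, threshold):
    """Run the Sentinel on every window and the Responder only on anomalies.

    Each window is a list of channels; window[0] is assumed to be the
    central microphone. Returns per-window results (None when the Responder
    was not invoked) and the number of Responder invocations.
    """
    results = []
    responder_calls = 0
    for window in windows:
        if sentinel_score(window[0]) > threshold:
            responder_calls += 1
            results.append(responder(window))
        else:
            results.append(None)
    return results, responder_calls


# Usage: one quiet window and one loud window, each with two channels.
windows = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, -1.0], [0.5, 0.5]],
]
# The lambda is a placeholder for the multichannel localization step.
results, calls = run_pipeline(windows, mean_energy, lambda w: len(w), threshold=0.1)
```

Only the second window exceeds the threshold, so the placeholder Responder runs once.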
MAE-Based Anomaly Detection and Fine-Tuning
The core innovation in the Sentinel stage is the use of an MAE on Mel-spectrogram features to identify anomalies (potential victim sounds). The MAE reconstructs masked sections of the input spectrogram, and the reconstruction error is compared against a learned threshold; a Top-K patch scoring mechanism limits the extent to which environmental outliers mask genuine anomalies. Multiple MAE models are fine-tuned for desert and forest backgrounds, with the pretraining corpus covering varying sound pressure levels and UAV ego-noise.
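One plausible reading of Top-K patch scoring is to average the K largest per-patch reconstruction errors, so that a short, localized event is not diluted by the many well-reconstructed background patches. The sketch below follows that reading. The exact scoring rule, the function names, and the example values are assumptions rather than details from the paper.

```python
def top_k_anomaly_score(patch_errors, k):
    """Mean of the k largest per-patch reconstruction errors (assumed rule)."""
    if k <= 0:
        raise ValueError("k must be positive")
    worst = sorted(patch_errors, reverse=True)[:k]
    return sum(worst) / len(worst)


def is_anomalous(patch_errors, k, threshold):
    """Flag a spectrogram when its Top-K score exceeds the threshold."""
    return top_k_anomaly_score(patch_errors, k) > threshold


# Usage with made-up errors: two patches reconstruct poorly, the rest well.
errors = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1]
score = top_k_anomaly_score(errors, k=2)        # averages 0.9 and 0.8
flagged = is_anomalous(errors, k=2, threshold=0.5)
```

A plain mean over all six patches would be much lower than the Top-2 score, which is the dilution effect this kind of scoring is meant to avoid.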
Figure 2: Anomaly detection accuracy as a function of masking ratio ρ for MAE across desert and forest scenarios. Maximal accuracies are indicated.
Empirical results show that low masking ratios (ρ≈0.10) give the best MAE discriminability, which is attributed to the decoder's access to richer context and the resulting improvement in generalization to subtle anomalies. Forest environments yield lower detection accuracy than deserts, reflecting the greater acoustic complexity and propagation path loss in vegetated environments.
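To make the masking ratio concrete: with ρ = 0.10, roughly one in ten spectrogram patches is hidden from the model and must be reconstructed from the rest. A minimal sketch of selecting those patches is shown below. Uniform random selection and the function name are assumptions, since the summary does not specify the masking strategy.

```python
import random


def choose_masked_patches(num_patches, rho, seed=None):
    """Pick which patch indices to mask for a masking ratio rho (assumed uniform)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    # Mask at least one patch so there is always something to reconstruct.
    num_masked = max(1, round(rho * num_patches))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), num_masked))


# Usage: 200 patches at rho = 0.10 leaves 180 patches visible as context.
masked = choose_masked_patches(num_patches=200, rho=0.10, seed=0)
```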
Multichannel Continuous Localization
Upon anomaly detection, the Responder stage leverages TDoA across the array to calculate DoA per event. The formulation uses multichannel cross-correlation and geometric solutions for 3D vector estimation; multiple observations along the UAV path are fused for robust localization via weighted projection intersection.
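As an illustration of how TDoA values map to a direction, the sketch below uses a far-field (plane-wave) model, in which a peripheral microphone at position p_i hears the wavefront earlier than the central microphone by (p_i · u)/c, where u is the unit vector toward the source. For three or more microphones evenly spaced on a circle, the least-squares solution for the horizontal part of u has a closed form. The far-field model, the speed of sound, the assumption that the source lies below the array plane, and all names are assumptions made for this sketch; the paper's own formulation may differ.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air


def mic_positions(num_peripheral, radius):
    """Evenly spaced peripheral microphones on a circle around the central one."""
    step = 2.0 * math.pi / num_peripheral
    return [(radius * math.cos(i * step), radius * math.sin(i * step))
            for i in range(num_peripheral)]


def doa_from_tdoa(tdoas, positions, radius, c=SPEED_OF_SOUND):
    """Estimate azimuth and depression angle (radians) from TDoAs.

    tdoas[i] is the arrival time at peripheral mic i minus the arrival time
    at the central mic. Requires at least 3 evenly spaced microphones.
    """
    n = len(positions)
    if n < 3:
        raise ValueError("need at least 3 peripheral microphones")
    # Closed-form least squares: sum(p p^T) = (n r^2 / 2) I for a uniform ring.
    scale = -2.0 * c / (n * radius ** 2)
    ux = scale * sum(t * x for t, (x, _) in zip(tdoas, positions))
    uy = scale * sum(t * y for t, (_, y) in zip(tdoas, positions))
    horizontal = min(1.0, math.hypot(ux, uy))  # clamp against noisy inputs
    azimuth = math.atan2(uy, ux)
    depression = math.acos(horizontal)  # angle below the array plane
    return azimuth, depression


# Usage: synthesize noise-free TDoAs for a known direction and recover it.
positions = mic_positions(6, 0.1)
true_az, true_dep = 0.7, 0.5
ux_true = math.cos(true_az) * math.cos(true_dep)
uy_true = math.sin(true_az) * math.cos(true_dep)
tdoas = [-(x * ux_true + y * uy_true) / SPEED_OF_SOUND for x, y in positions]
est_az, est_dep = doa_from_tdoa(tdoas, positions, 0.1)
```

A planar array cannot tell whether the source is above or below its plane; the sketch resolves this by assuming the victim is below the UAV.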
Figure 3: Continuous localization results during UAV flight in desert and forest scenarios, depicting error convergence as the UAV approaches the victim.
The continuous localization method converges rapidly as the UAV moves toward the sound source, with tighter error bounds in open (desert) terrain, where path loss is more predictable and multipath and interference are lower than in forests. The flights also include "silent" segments in which the Sentinel stage remains active but the Responder is not invoked, which is evidence of effective energy management.
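The fusion of several bearings taken along the flight path can be illustrated with a weighted least-squares intersection of lines: the estimate is the point that minimizes the weighted sum of squared perpendicular distances to each bearing line. The sketch below works in the 2D ground plane for brevity. The choice of cost function, the weights, and the names are assumptions, and the paper's "weighted projection intersection" may be formulated differently.

```python
import math


def fuse_bearings(observations):
    """Weighted least-squares intersection of 2D bearing lines.

    observations: iterable of ((px, py), (dx, dy), weight), where (px, py)
    is the UAV ground position and (dx, dy) points toward the source.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (dx, dy), w in observations:
        norm = math.hypot(dx, dy)
        dx, dy = dx / norm, dy / norm
        # Projector onto the direction perpendicular to the bearing: I - d d^T.
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += w * m11
        a12 += w * m12
        a22 += w * m22
        b1 += w * (m11 * px + m12 * py)
        b2 += w * (m12 * px + m22 * py)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearings are (nearly) parallel; no unique intersection")
    # Solve the 2x2 normal equations A x = b by Cramer's rule.
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


# Usage: two noise-free bearings toward a source at (10, 5).
observations = [
    ((0.0, 0.0), (10.0, 5.0), 1.0),
    ((20.0, 0.0), (-10.0, 5.0), 2.0),
]
x_est, y_est = fuse_bearings(observations)
```

Each additional non-parallel bearing adds a constraint, which is consistent with the error shrinking as more observations accumulate along the trajectory.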
Experimental Protocol and Results
A comprehensive acoustic dataset underpins the evaluation, comprising UAV ego-noise, authentic environmental ambience, and a substantial corpus of victim signals (children, adult males) across a wide range of SNRs. Model evaluation uses simulated flyovers at varying altitudes and lateral displacements, with realistic sound propagation enforced through scenario-dependent path loss exponents (α=2 for desert, α=2.5 for forest).
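The effect of the two exponents can be illustrated with the standard log-distance path loss model, in which the level drops by 10·α·log10(d/d0) dB relative to a reference distance d0. Whether the paper uses exactly this form is an assumption. The function name, the 1 m reference distance, and the 80 dB source level are hypothetical values chosen for illustration.

```python
import math

# Exponents as stated in the evaluation setup.
PATH_LOSS_EXPONENT = {"desert": 2.0, "forest": 2.5}


def received_level_db(source_level_db, distance_m, scenario, ref_distance_m=1.0):
    """Level at distance_m given the level at ref_distance_m (log-distance model)."""
    if distance_m <= 0 or ref_distance_m <= 0:
        raise ValueError("distances must be positive")
    alpha = PATH_LOSS_EXPONENT[scenario]
    return source_level_db - 10.0 * alpha * math.log10(distance_m / ref_distance_m)


# Usage: a hypothetical 80 dB source (measured at 1 m) heard from 10 m away.
desert_level = received_level_db(80.0, 10.0, "desert")  # loses 20 dB
forest_level = received_level_db(80.0, 10.0, "forest")  # loses 25 dB
```

Under this model every tenfold increase in distance costs 20 dB in the desert setting and 25 dB in the forest setting, so the same victim sound reaches the array at a lower SNR in the forest.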
Key quantitative outcomes:
- Detection Accuracy: Peak values are achieved at low masking ratios (ρ≈0.10), with desert scenarios yielding higher accuracy due to lower acoustic clutter.
- Localization Error: Sharp reduction as UAV approaches victim; errors stabilize near ground truth when Responder events are frequent and trajectories densely sample the search grid.
- Energy Efficiency: Demonstrated by a significant reduction in high-power, multi-microphone operation time; for over 90% of the mission, only low-cost monitoring is active.
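The energy argument can be made concrete with a simple duty-cycle calculation: average power is the time-weighted mix of the low-power monitoring mode and the high-power multichannel mode. The power figures below are placeholders chosen for illustration, not measurements from the paper; only the bound that multichannel processing is active for under 10% of the mission comes from the results above.

```python
def average_power_w(sentinel_w, responder_w, responder_fraction):
    """Time-weighted average power for a two-mode duty cycle.

    sentinel_w: power while only single-channel monitoring runs.
    responder_w: total power while multichannel processing runs.
    responder_fraction: share of mission time spent in the high-power mode.
    """
    if not 0.0 <= responder_fraction <= 1.0:
        raise ValueError("responder_fraction must lie in [0, 1]")
    return (1.0 - responder_fraction) * sentinel_w + responder_fraction * responder_w


def saving_vs_always_on(sentinel_w, responder_w, responder_fraction):
    """Fractional saving relative to running multichannel processing all the time."""
    average = average_power_w(sentinel_w, responder_w, responder_fraction)
    return 1.0 - average / responder_w


# Usage with placeholder powers: 1 W monitoring, 5 W multichannel, 10% duty.
avg = average_power_w(1.0, 5.0, 0.10)
saving = saving_vs_always_on(1.0, 5.0, 0.10)
```

With these placeholder values the average draw is 1.4 W against 5 W for always-on processing, a reduction of 72%; the actual figure depends on the real power ratio between the two modes.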
Theoretical and Practical Implications
The integration of transformer-based MAE for real-time anomaly detection in resource-constrained UAV platforms sets a precedent for scalable, adaptive SAR acoustic monitoring. Sky-Ear's multi-observation data fusion approach provides generalizable methods for robust continuous localization under realistic environmental attenuation models.
Practically, this framework is extensible to heterogeneous SAR fleets, cooperative UAV swarms, and urban disaster scenes where line-of-sight (LoS) and visual sensing are ineffective. Theoretically, the demonstrated synergy between unsupervised feature reconstruction and Top-K scoring mechanisms points to a valuable direction for anomaly detection under unknown, dynamically varying backgrounds.
Future Directions
Future research trajectories may include online MAE adaptation to evolving noise distributions, distributed multi-agent localization under network constraints, and integration with multi-modal sensing (RF, thermal, visual). Application domains extend to wildlife monitoring, military reconnaissance, and disaster site acoustic mapping, all demanding robust, low-power, high-precision geolocation under adversarial or obfuscated conditions.
Conclusion
Sky-Ear substantiates the efficacy of MAE-based anomaly detection and multi-channel, energy-efficient audio localization for UAV-enabled SAR missions. The architecture achieves high detection and localization performance via judicious division of computational labor and adaptive signal processing, underscoring its applicability to real-world SAR deployments where robustness and operational endurance are paramount.