- The paper introduces a dual-layer memory architecture that persistently tracks object state changes, enhancing change awareness for blind and low vision users.
- It employs hierarchical pose filtering, 3D backprojection, and VLM inference, achieving F1=83.1% with low latency (<1.42 sec) in real-world revisits.
- Practical evaluations demonstrate improved safety and task completion for BLV individuals, highlighting the system’s potential for scalable assistive technology.
StateScribe: A Memory-Augmented Vision System for Accessible Change Awareness Across Revisits
Motivation and Context
Environmental changes, including object appearance or removal, spatial rearrangements, or content updates, present significant accessibility and safety challenges for blind and low vision (BLV) individuals. Existing visual assistive tools—ranging from human-powered services to AI-driven applications—provide only ephemeral, session-local support, lacking persistent memory to surface changes across revisits to real-world spaces. The StateScribe system explicitly targets this gap, proposing a persistent, scalable, and low-latency architecture enabling longitudinal, structured awareness of meaningful changes during repeated real-world encounters.
System Architecture and Technical Approach
StateScribe introduces a dual-layer memory architecture on commodity smartphones, integrating Episodic Scene Memory (ESM) and Object-Centric Temporal Memory (OTM). ESM is a bounded, windowed repository of recent scene captures (RGB-D frames, pose, and embeddings) supporting pose-conditioned rapid retrieval. OTM persistently tracks key objects, archiving a chronological series of snapshots (status, visual/text embeddings, 3D bounding box) per object when a substantive state change is detected.
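The two memory layers can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the class and field names (`SceneCapture`, `ObjectSnapshot`, `DualLayerMemory`, `esm_capacity`) are hypothetical, the snapshot omits the visual/text embeddings for brevity, and "substantive state change" is approximated here by a simple change in the status string.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Tuple


@dataclass
class SceneCapture:
    """One ESM entry: a timestamped capture with pose and an embedding."""
    timestamp: float
    pose: Tuple[float, float, float, float]  # hypothetical encoding: x, y, z, yaw
    embedding: Tuple[float, ...]             # hypothetical image embedding


@dataclass
class ObjectSnapshot:
    """One OTM entry: an object's state when a change was recorded."""
    timestamp: float
    status: str
    bbox3d: Tuple[float, float, float, float, float, float]  # min xyz, max xyz


class DualLayerMemory:
    def __init__(self, esm_capacity: int = 300) -> None:
        # ESM: bounded window; the oldest capture is evicted once it is full.
        self.esm: Deque[SceneCapture] = deque(maxlen=esm_capacity)
        # OTM: per-object chronological snapshot lists, kept persistently.
        self.otm: Dict[str, List[ObjectSnapshot]] = {}

    def add_capture(self, capture: SceneCapture) -> None:
        self.esm.append(capture)

    def record_state(self, object_id: str, snapshot: ObjectSnapshot) -> bool:
        """Archive a snapshot only if the status differs from the latest one.

        Assumption: status inequality stands in for the change test.
        Returns True when a new snapshot was stored.
        """
        history = self.otm.setdefault(object_id, [])
        if history and history[-1].status == snapshot.status:
            return False
        history.append(snapshot)
        return True
```

In this sketch the ESM size stays constant regardless of how long the user explores, while the OTM grows only when an object's recorded status actually changes.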
During each visit, as users explore with their phone, StateScribe streams sensor data into ESM while continuously performing deduplication, segmentation, and distributed storage. During revisits, change detection operates through a hierarchical process: (1) filter ESM frames by pose similarity and bidirectional visibility overlap, leveraging 3D backprojection for precise region-of-interest identification; (2) cluster candidates temporally (via DBSCAN); (3) select the optimal reference; (4) invoke a VLM on reference–current pairs, extracting object-level change metadata (type, confidence, bounding boxes, description); and (5) back-project detected changes into 3D to resolve object identity and location, updating OTM via 3D IoU and multimodal embedding similarity.
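Three geometric building blocks of this pipeline (pose filtering, backprojection, and 3D IoU) can be sketched as follows. The sketch assumes a standard pinhole camera model and axis-aligned boxes; the function names, the yaw-only orientation, and the thresholds `max_dist` and `max_yaw_deg` are hypothetical choices for illustration and are not taken from the paper. The backprojected point is in the camera frame; moving it into a shared world frame would additionally require the device pose.

```python
import math
from typing import Tuple

Point3 = Tuple[float, float, float]
Box3 = Tuple[float, float, float, float, float, float]  # min xyz, max xyz


def pose_is_similar(pos_a: Point3, yaw_a: float, pos_b: Point3, yaw_b: float,
                    max_dist: float = 1.0, max_yaw_deg: float = 30.0) -> bool:
    """Keep a stored frame only if it was taken from a nearby, similar view.

    Thresholds are illustrative assumptions; yaw is in degrees.
    """
    dist = math.dist(pos_a, pos_b)
    # Wrap the heading difference into [-180, 180) before comparing.
    dyaw = abs((yaw_a - yaw_b + 180.0) % 360.0 - 180.0)
    return dist <= max_dist and dyaw <= max_yaw_deg


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def iou_3d(a: Box3, b: Box3) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)
```

A detected change whose backprojected box overlaps a stored object's box with high IoU would be treated as the same object, which is where the embedding similarity mentioned above could act as a second check.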
Live scene descriptions are generated in parallel by a single VLM with aggressive redundancy suppression (an embedding-based similarity filter, such as cosine similarity over text embeddings). A priority-aware delivery buffer governs whether users hear live, change, or Q&A responses, ensuring timely and contextually relevant output.
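The combination of redundancy suppression and prioritized delivery can be sketched with a small heap-based buffer. Everything specific here is an assumption made for illustration: the priority order (Q&A first, then change, then live), the 0.9 similarity threshold, and the names `DeliveryBuffer`, `push`, and `pop` do not come from the paper.

```python
import heapq
import math
from typing import List, Optional, Sequence, Tuple

# Assumed ordering (smaller is more urgent); the source does not specify it.
PRIORITY = {"qa": 0, "change": 1, "live": 2}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DeliveryBuffer:
    def __init__(self, similarity_threshold: float = 0.9) -> None:
        self.threshold = similarity_threshold  # hypothetical cutoff
        self._heap: List[Tuple[int, int, str, str]] = []
        self._counter = 0  # tie-breaker keeps FIFO order within a priority
        self._last_live: Optional[Sequence[float]] = None

    def push(self, kind: str, text: str,
             embedding: Optional[Sequence[float]] = None) -> bool:
        """Queue a message; drop live descriptions that repeat the last one."""
        if kind == "live" and embedding is not None:
            if (self._last_live is not None and
                    cosine_similarity(self._last_live, embedding) >= self.threshold):
                return False  # suppressed as redundant
            self._last_live = embedding
        heapq.heappush(self._heap, (PRIORITY[kind], self._counter, kind, text))
        self._counter += 1
        return True

    def pop(self) -> Optional[Tuple[str, str]]:
        """Return the most urgent queued (kind, text), or None when empty."""
        if not self._heap:
            return None
        _, _, kind, text = heapq.heappop(self._heap)
        return kind, text
```

With this ordering, a change announcement queued after a live description is still spoken first, and a near-duplicate live description never enters the queue.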
Empirical Evaluation
Across 11 recorded revisits per environment (291 annotated changes in three scenes), StateScribe achieves F1 = 83.1%, with recall 84.9% and precision 81.3%, substantially outperforming the Live VLM baseline (F1 = 40.1%) and an offline video analysis baseline (F1 = 27.4%). Localization errors remain low in both synthetic and real-user settings: the mean clock-direction error is 0.24 hours and the mean distance error is 0.68 feet, both markedly better than the baselines. These results indicate effective mitigation of view-dependent artifacts and suppression of detection noise.
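As a quick consistency check on the reported figures, the short sketch below recomputes F1 from the stated precision and recall and converts the localization errors into more familiar units. The helper names are hypothetical; the conversions rely only on the facts that a clock face spans 360 degrees over 12 hours and that one foot is 0.3048 meters.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def clock_hours_to_degrees(hours: float) -> float:
    """One clock hour corresponds to 360 / 12 = 30 degrees."""
    return hours * 30.0


def feet_to_meters(feet: float) -> float:
    return feet * 0.3048


if __name__ == "__main__":
    print(f"F1 = {f1_score(0.813, 0.849):.1%}")
    print(f"direction error = {clock_hours_to_degrees(0.24):.1f} degrees")
    print(f"distance error = {feet_to_meters(0.68):.2f} m")
```

The recomputed F1 rounds to the reported 83.1%, and the mean errors correspond to roughly 7.2 degrees of bearing and about 0.21 meters of distance.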
StateScribe demonstrates low-latency operation (mean <1.42 seconds per change detection) and memory efficiency (<55 MB OTM footprint for 110 revisits), without degradation in retrieval or update performance as storage grows—a direct consequence of the windowed ESM and sparse OTM design. VLM inference dominates processing time, but system-level contributions to latency and IO are minimal due to selective object-centric archiving.
User Study
A user study with nine BLV participants across three real-world locations confirmed that StateScribe substantially improves change awareness (task completion rates >82% across scenarios). Subjective Likert ratings corroborate accuracy (mean 6.0/7), trust (5.7/7), and willingness to deploy in daily life (5.6/7). Participants specifically valued the memory-based, temporally anchored change announcements—capabilities missing from existing assistive technologies.
Methodological Contributions
The paper introduces a conservative, evidence-grounded change-detection pipeline based on hierarchical pose/visibility filtering, VLM-in-the-loop analysis, and 3D back-projection for spatially resolved updates. Conservative prompting and post-processing minimize false positives, which are weighted heavily because they pose safety risks, and the priority schema for description output ensures critical information is surfaced promptly without overwhelming the user.
The OTM structure yields an efficient, persistent store for object-centric state transitions, supporting both backward- and forward-linked temporal queries across arbitrary time intervals, and enabling downstream conversational Q&A regarding 'what changed, where, and when'.
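Temporal queries over a per-object history can be sketched with binary search on sorted timestamps. This is an assumption-laden illustration rather than the paper's data layout: the `Snapshot` tuple shape and the function names `state_at`, `next_change_after`, and `changes_between` are hypothetical, and the history is assumed to be sorted by timestamp.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

# Hypothetical snapshot shape: (timestamp, status, spoken location)
Snapshot = Tuple[float, str, str]


def state_at(history: List[Snapshot], t: float) -> Optional[Snapshot]:
    """Backward query: the latest snapshot recorded at or before time t."""
    times = [s[0] for s in history]
    i = bisect_right(times, t)
    return history[i - 1] if i else None


def next_change_after(history: List[Snapshot], t: float) -> Optional[Snapshot]:
    """Forward query: the first snapshot recorded strictly after time t."""
    times = [s[0] for s in history]
    i = bisect_right(times, t)
    return history[i] if i < len(history) else None


def changes_between(history: List[Snapshot], start: float,
                    end: float) -> List[Snapshot]:
    """All snapshots with start <= timestamp <= end, in chronological order."""
    times = [s[0] for s in history]
    return history[bisect_left(times, start):bisect_right(times, end)]
```

A question such as "what changed since my last visit" would then map to `changes_between` with the previous visit's timestamp as `start`, with each returned snapshot supplying the what, where, and when.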
Limitations and Future Directions
StateScribe’s architecture is optimized for object-level and static attribute changes at a 1 FPS acquisition rate, so it may miss rapid or activity-centric events. Handling fast motion or inferring changes from other modalities (e.g., acoustic, haptic) would require additional sensors and model capacity, which the paper identifies as future work toward real-world robustness. Furthermore, revisits are currently selected manually, which limits scalability; integrating GPS, WiFi, or other localization signals for automatic clustering would enhance usability.
The system currently treats information delivery as primarily user-initiated or purely location-cursor driven. Integrating richer real-time context signals (hand-object interactions, task cues, user preferences) could make information delivery more adaptive, allowing for intent-aware or activity-aware assistive companions. Expansion towards holistic, structured scene/space memory, supporting not just object state transitions but also higher-order semantic relationships and global map understanding, remains open, with potential leverage from community-contributed spatial data and long-term egocentric video memory modules (Fan et al., 2024; Hu et al., 28 May 2025; Zhu et al., 4 Jun 2025).
Practical and Theoretical Implications
Practically, StateScribe sets a new standard for persistent environmental monitoring assistance—demonstrating that lightweight, memory-augmented VLM systems can operate scalably in the field on commodity devices, providing immediate, safety-critical change awareness for BLV users. The system signals a paradigm shift from per-session, stateless assistive tools towards persistent, context-aware, AI-powered companions.
Theoretically, StateScribe’s modular, object-centric memory abstraction, hierarchical pose/visibility-based retrieval, and integration of VLMs for grounded change detection advance the broader field of embodied AI memory systems. The architecture is portable to embodied agents and robotics for persistent scene understanding and could directly inform neuroscience-inspired AI models with episodic and semantic memory separation for continual learning and adaptation.
Conclusion
StateScribe (2604.23749) represents a substantive advance in accessible, memory-augmented real-world vision systems. By fusing windowed episodic and persistent object-centric memory with real-time VLM capabilities and conservative evidence integration, it establishes high-precision, low-latency longitudinal change awareness for BLV individuals during real-world revisits. Numerical results demonstrate its superiority over VLM-only baselines, both in technical accuracy and user experience. The dual-layer design principles, pipeline modularity, and memory–retrieval efficiency will inform future work on scalable, context-adaptive AI companions capable of supporting broader memory functions and complex, evolving user interaction paradigms.