Preparation Meets Opportunity: Enhancing Data Preprocessing for ML Training With Seneca

Published 24 Sep 2025 in cs.OS, cs.AI, and cs.LG | (2511.13724v1)

Abstract: Input data preprocessing is a common bottleneck when concurrently training multimedia ML models in modern systems. To alleviate these bottlenecks and reduce the training time for concurrent jobs, we present Seneca, a data loading system that optimizes cache partitioning and data sampling for the data storage and ingestion (DSI) pipeline. The design of Seneca contains two key techniques. First, Seneca uses a performance model for the data pipeline to optimally partition the cache for three different forms of data (encoded, decoded, and augmented). Second, Seneca opportunistically serves cached data over uncached ones during random batch sampling so that concurrent jobs benefit from each other. We implement Seneca by modifying PyTorch and demonstrate its effectiveness by comparing it against several state-of-the-art caching systems for DNN training. Seneca reduces the makespan by 45.23% compared to PyTorch and increases data processing throughput by up to 3.45x compared to the next best dataloader.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces Seneca, a system that enhances data preprocessing by using model-driven cache partitioning and opportunistic data sampling.
It demonstrates a 45.23% reduction in makespan and up to a 3.45x increase in data processing throughput across diverse hardware setups.
By partitioning cache into encoded, decoded, and augmented forms, Seneca efficiently addresses ML training bottlenecks and prevents overfitting.

Efficient Data Preprocessing for ML Training with Seneca

Introduction

Seneca is a data loading system that addresses bottlenecks associated with input data preprocessing in ML training tasks. The system significantly enhances data storage and ingestion (DSI) pipeline efficiency by introducing two core techniques: a performance-driven cache partitioning strategy and opportunistic data sampling.

Seneca optimizes DSI throughput by strategically partitioning the cache into three forms: encoded, decoded, and augmented. By doing so, it mitigates common bottlenecks that arise during the training of multimedia models across multiple concurrent jobs. The implementation of Seneca has been evaluated against state-of-the-art caching systems, showcasing its superior cache utilization and throughput enhancements.

Challenges in Data Preprocessing

The DSI pipeline, a critical component of ML training, involves fetching, decoding, transforming, and loading data into GPUs. Inefficiencies in data preprocessing can bottleneck GPU training performance—a problem exacerbated by the growing disparity between CPU and GPU compute capabilities (Figure 1).

A particular challenge lies in choosing the optimal form of data to cache. Encoded data is storage-efficient but requires extensive preprocessing, while decoded and augmented forms reduce computational overhead at the cost of cache memory. Making trade-offs to achieve the best throughput under different hardware and dataset constraints remains an open problem, which Seneca aims to solve.

Seneca Design

Model-Driven Partitioning (MDP)

MDP employs a high-level performance model to partition cache space optimally. This method predicts DSI throughput based on the training node's capabilities, such as CPU and GPU throughput, as well as the potential bandwidth of cache and storage services. By determining the slowest pipeline component, MDP effectively allocates cache to maximize throughput.

Opportunistic Data Sampling (ODS)

ODS enhances cache efficiency by ensuring increased cache hit rates. It replaces uncached data samples with cached ones, maintaining randomness and pseudo-random sequence requirements crucial for ML model training. This sampling approach prevents overfitting due to redundant data usage, especially in concurrent job scenarios where multiple models train over the same dataset.

Figure 2: The DSI pipeline model.

Implementation and Evaluation

Seneca is implemented by extending the PyTorch dataloader framework, with caching managed using Redis. The system was rigorously evaluated across different hardware configurations, showing a remarkable improvement in reducing the makespan by 45.23% over PyTorch and increasing data processing throughput by up to 3.45 times compared to the next best dataloader.

Figure 3: The model training time for 12 image classification jobs (50 epochs each) for 5 different models on the AWS server.

Performance Validation

Seneca's performance model was validated against empirical data, achieving a high correlation, as depicted in Figure 4. This validation confirms the model's aptitude for accurately predicting throughput under various configurations, demonstrating that intelligent cache partitioning and opportunistic data sampling can alleviate bottlenecks in ML training.

Figure 4: Throughput for training 2 jobs concurrently on different hardware platforms. Seneca performs well across a wide range of system configurations.

Conclusion

Seneca brings substantial improvements to the ML training process, particularly in environments where data-intensive preprocessing tasks constitute a major bottleneck. Through its dual-strategy approach of model-driven cache partitioning and opportunistic caching, Seneca maximizes resource efficiency and accelerates model convergence. The system sets a precedent for further research into dynamic caching strategies tailored to the diverse computational needs of modern machine learning models.

Markdown Report Issue