Data Science and Digital Systems: The 3Ds of Machine Learning Systems Design

Published 26 Mar 2019 in cs.LG and cs.AI | (1903.11241v1)

Abstract: Machine learning solutions, in particular those based on deep learning methods, form an underpinning of the current revolution in "artificial intelligence" that has dominated popular press headlines and is having a significant influence on the wider tech agenda. Here we give an overview of the 3Ds of ML systems design: Data, Design and Deployment. By considering the 3Ds we can move towards "data first" design.

Authors (1)

Neil D. Lawrence

Citations (6)

Summary

The paper highlights decomposition: breaking complex tasks into manageable sub-tasks suitable for machine learning automation.
The paper emphasizes a data-first approach, focusing on rigorous cleaning and quality assurance to improve model performance.
The paper advocates continuous deployment with real-time testing and monitoring to ensure system adaptability and resilience.

Overview of the "3Ds" in Machine Learning Systems Design

The paper "Data Science and Digital Systems: The 3Ds of Machine Learning Systems Design" outlines a comprehensive framework for designing machine learning systems, focusing on three critical components: Decomposition, Data, and Deployment. These components are collectively referred to as the "3Ds" of machine learning systems design. The paper emphasizes practical considerations and challenges encountered in developing machine learning systems, particularly in the context of deploying deep learning methods that significantly influence modern artificial intelligence.

Decomposition

Decomposition refers to the problem of breaking down complex tasks into components that can be effectively automated using machine learning models. This involves identifying sub-tasks within a larger decision-making process that are amenable to mathematical representation and computational automation. The paper stresses the importance of careful task selection and decomposition to align with model limitations and ensure efficiency in automation.

The paper introduces the concept of "pigeonholing," where a complex task is separated into manageable sub-tasks suitable for machine learning algorithms. This process may require significant preprocessing, feature engineering, and problem reformulation to align tasks with model capabilities. The paper notes that companies like Facebook rely heavily on feature engineering to address similar challenges, employing dedicated pipelines like FBLearner for task-specific modeling.

Data

Data is emphasized as a crucial factor in the success of machine learning systems, described as the "other half of the equation" alongside the models themselves. The paper addresses the often-overlooked work of data cleaning, management, and quality assurance, which directly affects model performance and reliability in deployment.

The paper highlights a growing "data crisis" analogous to the historical "software crisis," characterized by the increasing complexity and cost of handling large, often low-quality data. To address this, the paper proposes adopting a "data first" paradigm, emphasizing the importance of data quality, readiness, and management over traditional software-centered approaches. The introduction of Data Readiness Levels outlines a structured framework for assessing and improving data quality, akin to Technology Readiness Levels used in engineering disciplines.
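As a rough illustration of what a readiness assessment could look like in code, the sketch below places a small dataset into one of three bands. The band meanings (C for accessibility, B for faithfulness, A for appropriateness to a task) are assumed from the separate Data Readiness Levels proposal and are not spelled out in this summary. The function name `assess_readiness`, the list-of-dicts record layout, and the missing-value threshold are all hypothetical choices made for the example.

```python
from enum import Enum
from typing import Optional, Sequence


class Band(Enum):
    NOT_READY = "not yet accessible"
    C = "accessible"
    B = "faithful"
    A = "appropriate for task"


def assess_readiness(
    rows: Optional[Sequence[dict]],
    required_fields: Sequence[str],
    max_missing_fraction: float = 0.05,  # illustrative threshold, not from the paper
) -> Band:
    """Return the highest band the data reaches under these illustrative checks."""
    # Band C: the data can be loaded at all and is non-empty.
    if not rows:
        return Band.NOT_READY

    # Band B: missing values are rare enough across all observed cells.
    cells = [value for row in rows for value in row.values()]
    missing = sum(value is None for value in cells)
    if not cells or missing / len(cells) > max_missing_fraction:
        return Band.C

    # Band A: every row carries the fields this particular task needs.
    for row in rows:
        if any(row.get(field) is None for field in required_fields):
            return Band.B
    return Band.A


rows = [{"age": 30, "label": 1}, {"age": 41, "label": 0}]
print(assess_readiness(rows, ["age", "label"]))   # Band.A
print(assess_readiness(rows, ["age", "income"]))  # Band.B
```

The same dataset can sit in different bands for different tasks, which is why the required fields are an argument rather than a property of the data.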

Deployment

Deployment of machine learning models necessitates continuous monitoring and updating to adapt to changing data distributions and preserve system performance. The paper critiques the lack of standardized practices and tools for managing machine learning deployments, calling for a shift toward "continuous regression testing" or "progression testing" to maintain system reliability over time.
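One simple way to picture this kind of ongoing test is a check that compares recent live accuracy against the accuracy measured at release time. The sketch below is an assumption about how such a check might be written, not a method described in the paper; the class name `ProgressionTest`, the sliding-window size, and the tolerance are hypothetical.

```python
from collections import deque
from typing import Optional


class ProgressionTest:
    """Flags a deployed model whose recent accuracy drops below its baseline."""

    def __init__(self, baseline_accuracy: float, tolerance: float = 0.05,
                 window: int = 100) -> None:
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, label) -> None:
        """Store whether a live prediction matched its later-observed label."""
        self.outcomes.append(prediction == label)

    def recent_accuracy(self) -> Optional[float]:
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def passing(self) -> bool:
        accuracy = self.recent_accuracy()
        if accuracy is None:
            return True  # nothing observed yet, so nothing to fail
        return accuracy >= self.baseline_accuracy - self.tolerance


test = ProgressionTest(baseline_accuracy=0.9, window=4)
for prediction, label in [(1, 1), (0, 0), (1, 0), (1, 0)]:
    test.record(prediction, label)
print(test.recent_accuracy(), test.passing())  # 0.5 False
```

Unlike a conventional regression test, this check runs continuously against live data, so it can fail even when the code has not changed.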

To achieve effective deployment, the paper suggests adopting streaming architectures like Apache Kafka that support asynchronous operation and data persistence. This infrastructure facilitates real-time monitoring, data visualization, and quality control, ensuring systems remain responsive and adaptable to evolving environmental conditions. Notably, the paper advocates for "data as a service" frameworks, where system architecture revolves around data management and quality rather than traditional software service deployment.
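The two properties named above, asynchronous operation and data persistence, can be pictured with a small in-memory stand-in for a streaming log. This sketch deliberately uses no Kafka client API; the class `EventLog` and its methods are hypothetical and only mimic the idea of an append-only record sequence that each consumer reads at its own offset.

```python
from typing import Any, Dict, List


class EventLog:
    """Append-only log; each named consumer tracks its own read position."""

    def __init__(self) -> None:
        self._records: List[Any] = []
        self._offsets: Dict[str, int] = {}

    def append(self, record: Any) -> int:
        """Persist a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[Any]:
        """Return the next unread records for this consumer and advance it."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for reading in (0.1, 0.4, 0.9):
    log.append({"reading": reading})

# A model and a quality monitor consume the same data independently.
print(log.poll("model", max_records=2))  # first two records
print(log.poll("monitor"))               # all three records, replayed from the start
```

Because records stay in the log after being read, a monitoring or visualization component added later can replay the full history without the producer knowing it exists.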

Conclusion

The paper argues for a significant reevaluation of current practices in machine learning systems design, emphasizing the need for a data-centric approach that integrates Decomposition, Data management, and Deployment strategies. By adopting a "data first" paradigm and redefining how machine learning models are integrated into operational systems, the paper envisions a more robust and adaptable framework that can address contemporary challenges in automated decision-making and machine learning deployment.
