"We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning (2403.16795v1)

Published 25 Mar 2024 in cs.HC

Abstract: Organizations rely on machine learning engineers (MLEs) to deploy models and maintain ML pipelines in production. Due to models' extensive reliance on fresh data, the operationalization of machine learning, or MLOps, requires MLEs to have proficiency in data science and engineering. When considered holistically, the job seems staggering -- how do MLEs do MLOps, and what are their unaddressed challenges? To address these questions, we conducted semi-structured ethnographic interviews with 18 MLEs working on various applications, including chatbots, autonomous vehicles, and finance. We find that MLEs engage in a workflow of (i) data preparation, (ii) experimentation, (iii) evaluation throughout a multi-staged deployment, and (iv) continual monitoring and response. Throughout this workflow, MLEs collaborate extensively with data scientists, product stakeholders, and one another, supplementing routine verbal exchanges with communication tools ranging from Slack to organization-wide ticketing and reporting systems. We introduce the 3Vs of MLOps: velocity, visibility, and versioning -- three virtues of successful ML deployments that MLEs learn to balance and grow as they mature. Finally, we discuss design implications and opportunities for future work.


Summary

  • The paper reveals that operationalizing ML in production requires iterative, human-centered workflows to manage data, experiments, and deployment.
  • It identifies four key stages—data preparation, experimentation, evaluation/deployment, and monitoring—to systematically address production challenges.
  • The study emphasizes the critical roles of velocity, visibility, and versioning to enhance model performance and reliability in real-world settings.

How Engineers Operationalize Machine Learning

The paper "We Have No Idea How Models will Behave in Production until Production: How Engineers Operationalize Machine Learning" explores the operational challenges faced by Machine Learning Engineers (MLEs) as they deploy and maintain ML models in various production settings. Through ethnographic interviews, the authors elucidate the iterative, collaborative, and complex workflows that characterize MLOps practices. This essay summarizes the key findings and implications for practitioners and researchers in the field.

MLOps Workflow

The paper identifies a common workflow among MLEs that includes four primary stages: data preparation, experimentation, evaluation and deployment, and monitoring and response. Each stage entails significant human-centered practices not typically automated in production systems.

Data Preparation

Data preparation is primarily handled by automated pipelines, often managed by dedicated teams of data engineers. However, MLEs remain responsible for ensuring data quality at scale, especially regarding labeling tasks and addressing feedback delays in acquiring ground-truth labels.

Figure 1: Color-coded Overview of Transcripts
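
To make the label-feedback problem concrete, below is a minimal Python sketch (not from the paper; the record format, field names, and one-day window are hypothetical) of a check that flags predictions whose ground-truth labels have not arrived within the expected feedback window:

```python
from datetime import datetime, timedelta

# Hypothetical prediction log: each record may receive its ground-truth label
# only after some feedback delay (hours or days after the prediction is served).
predictions = [
    {"id": "p1", "predicted_at": datetime(2024, 3, 1, 9, 0), "label": 1},
    {"id": "p2", "predicted_at": datetime(2024, 3, 1, 9, 5), "label": None},
    {"id": "p3", "predicted_at": datetime(2024, 3, 3, 14, 0), "label": None},
]

def stale_label_report(records, now, max_feedback_delay=timedelta(days=1)):
    """Flag predictions still unlabeled past the expected feedback window."""
    stale = [
        r["id"]
        for r in records
        if r["label"] is None and now - r["predicted_at"] > max_feedback_delay
    ]
    labeled_fraction = sum(r["label"] is not None for r in records) / len(records)
    return stale, labeled_fraction

stale_ids, labeled_fraction = stale_label_report(predictions, now=datetime(2024, 3, 4))
print(f"stale predictions: {stale_ids}, labeled fraction: {labeled_fraction:.0%}")
```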

Experimentation

Experimentation involves iterating on both data-driven and model-driven changes to improve ML performance. MLEs prioritize innovations in feature engineering, often involving collaboration with domain experts and other stakeholders. Despite advancements in AutoML, practitioners prefer maintaining manual control over experiment selection to ensure thoroughness and accuracy.
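
As a rough illustration of how data-driven and model-driven changes can be iterated on side by side (a hedged sketch on synthetic data, not the authors' tooling; the feature sets and configurations are hypothetical):

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature table; real experiments would pull fresh data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Data-driven changes (feature subsets) crossed with model-driven changes (regularization).
feature_sets = {"baseline": list(range(5)), "extended": list(range(10))}
model_configs = {"weak_reg": 1.0, "strong_reg": 0.1}

results = []
for (feat_name, cols), (cfg_name, c) in product(feature_sets.items(), model_configs.items()):
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train[:, cols], y_train)
    results.append(
        {"features": feat_name, "config": cfg_name, "val_acc": model.score(X_val[:, cols], y_val)}
    )

# Engineers in the study preferred reviewing such comparisons manually over
# delegating the choice of experiment entirely to AutoML.
for r in sorted(results, key=lambda r: -r["val_acc"]):
    print(r)
```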

Evaluation and Deployment

Evaluation and deployment are intertwined processes featuring multi-stage model promotions, where experimental changes are pushed to increasing fractions of users. MLEs use dynamic validation datasets and product-specific metrics for evaluation, highlighting the need for manual oversight throughout deployment stages.
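
A minimal sketch of how multi-stage promotion might be structured, assuming hypothetical stage names, traffic fractions, and threshold, with an evaluate() stub standing in for a real dynamic validation dataset:

```python
# Stage names, traffic fractions, the 0.90 threshold, and evaluate() are all
# hypothetical placeholders, not values reported in the paper.
STAGES = [("shadow", 0.0), ("canary", 0.01), ("partial", 0.10), ("full", 1.0)]

def evaluate(model_version: str, stage: str) -> float:
    """Stub: would compute a product-specific metric on a dynamic validation dataset."""
    return 0.93  # placeholder

def promote(model_version: str, threshold: float = 0.90) -> str:
    reached = "none"
    for stage, traffic_fraction in STAGES:
        metric = evaluate(model_version, stage)
        if metric < threshold:
            # Halt the rollout; in practice an engineer reviews before retrying or rolling back.
            print(f"{model_version}: halted at {stage} ({traffic_fraction:.0%} traffic), metric={metric:.3f}")
            return reached
        reached = stage
        print(f"{model_version}: promoted to {stage} ({traffic_fraction:.0%} traffic)")
    return reached

promote("model-v42")
```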

Monitoring and Response

Monitoring involves supervising pipeline performance and responding to production failures. On-call rotations and service-level objectives are common practices enabling MLEs to manage pipeline reliability. A significant challenge arises from managing false-positive alerts and the accumulation of "pipeline jungles" due to patching known model bugs with hard-coded rules.
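
One simple way to dampen false-positive alerts, sketched below with hypothetical SLO values (a rough illustration, not a practice prescribed by the paper), is to page the on-call engineer only after several consecutive breaches:

```python
from collections import deque

SLO_FRESHNESS_MINUTES = 60          # e.g. "features must be at most one hour old"
CONSECUTIVE_BREACHES_TO_PAGE = 3    # require repeated breaches before paging

recent_breaches = deque(maxlen=CONSECUTIVE_BREACHES_TO_PAGE)

def check_freshness(observed_lag_minutes: float) -> bool:
    """Record one freshness measurement and decide whether to page the on-call engineer."""
    recent_breaches.append(observed_lag_minutes > SLO_FRESHNESS_MINUTES)
    should_page = len(recent_breaches) == recent_breaches.maxlen and all(recent_breaches)
    if should_page:
        print(f"PAGE: feature lag of {observed_lag_minutes} min breaches the SLO repeatedly")
    return should_page

for lag in [45, 70, 80, 95]:        # simulated measurements, in minutes
    check_freshness(lag)
```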

Three Vs of MLOps

The authors introduce the "Three Vs of MLOps"—Velocity, Visibility, and Versioning—as critical virtues underpinning successful ML deployments:

  • Velocity: High experimentation and debugging speed are essential. However, unchecked velocity can lead to technical debt, requiring strategic collaboration and validation processes.
  • Visibility: Comprehensive pipeline observability facilitates collaboration between MLEs and stakeholders, enhancing debugging precision and prioritizing meaningful metrics.
  • Versioning: Effective versioning of models, data, and code supports reproducibility and collaboration, reducing cognitive load for engineers working with complex production systems (a minimal sketch follows this list).
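
As one lightweight illustration of versioning (a sketch under assumed conventions; the hashing scheme, identifiers, and config fields are hypothetical rather than anything prescribed by the paper), a deterministic version id can be derived from the model artifact, the data snapshot, and the configuration:

```python
import hashlib
import json

def artifact_version(model_bytes: bytes, data_snapshot_id: str, config: dict) -> str:
    """Derive a deterministic version id from model weights, data snapshot, and config."""
    h = hashlib.sha256()
    h.update(model_bytes)
    h.update(data_snapshot_id.encode())
    h.update(json.dumps(config, sort_keys=True).encode())
    return h.hexdigest()[:12]

# Any change to the serialized model, the data snapshot, or the configuration
# yields a new, reproducible version string (all identifiers here are hypothetical).
version = artifact_version(
    model_bytes=b"...serialized model weights...",
    data_snapshot_id="features_2024-03-01",
    config={"model": "logreg", "C": 0.1, "features": "extended"},
)
print(f"model version: {version}")
```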

Implications for Tooling and Future Work

The findings highlight opportunities for tools that bolster human-centered practices in MLOps, for example by helping manage retraining cadence, label quality, dynamic validation datasets, and staged deployment. Such tools should also prioritize the three Vs to help practitioners navigate the complex landscape of production ML.

Future research can explore interactions between MLEs and other stakeholders, examine security and fairness concerns in production ML systems, and study automated workflows within the ML lifecycle.

Conclusion

The paper underscores the multifaceted nature of MLOps, illustrating how engineers navigate technical and collaborative hurdles in operationalizing ML. By emphasizing the interplay of Velocity, Visibility, and Versioning, the authors provide a framework for refining MLOps practices, paving the way for improved tooling and methodologies in deploying reliable ML systems.
