From 951c24f34506f2f4efc772d41742ca88bdad77d5 Mon Sep 17 00:00:00 2001
From: Vijay Janapa Reddi
Date: Sun, 12 Jan 2025 08:34:08 -0500
Subject: [PATCH] Zishen's fixes

Co-authored-by: zishenwan@gmail.com
---
 contents/core/workflow/workflow.qmd | 67 +++--------------------------
 1 file changed, 6 insertions(+), 61 deletions(-)

diff --git a/contents/core/workflow/workflow.qmd b/contents/core/workflow/workflow.qmd
index f81c160b..db98269f 100644
--- a/contents/core/workflow/workflow.qmd
+++ b/contents/core/workflow/workflow.qmd
@@ -40,9 +40,9 @@ The machine learning lifecycle is a systematic, interconnected process that guid
 
 ![The ML lifecycle.](images/png/ML_life_cycle.png){#fig-ml-lifecycle}
 
-The data then enters the preparation stage, where it is transformed into machine learning-ready datasets through processes such as splitting and versioning. These datasets are used in the model training stage, where machine learning algorithms are applied to create predictive models. The resulting models are rigorously tested in the model evaluation stage, where performance metrics, such as key performance indicators (KPIs), are computed to assess reliability and effectiveness. The validated models move to the ML system validation phase, where they are verified for deployment readiness. Once validated, these models are integrated into production systems during the ML system deployment stage, ensuring alignment with operational requirements. The final stage tracks the performance of deployed systems in real time, enabling continuous adaptation to new data and evolving conditions.
+The data then enters the preparation stage, where it is transformed into machine learning-ready datasets through processes such as splitting and versioning. These datasets are used in the model training stage, where machine learning algorithms are applied to create predictive models. The resulting models are rigorously tested in the model evaluation stage, where performance metrics, such as key performance indicators (KPIs), are computed to assess reliability and effectiveness. The validated models move to the ML system validation phase, where they are verified for deployment readiness. Once validated, these models are integrated into production systems during the ML system deployment stage, ensuring alignment with operational requirements. The final stage tracks the performance of deployed systems in real time, enabling continuous adaptation to new data and evolving conditions.
 
-This general lifecycle forms the backbone of machine learning systems, with each stage contributing to the creation, validation, and maintenance of scalable and efficient solutions. While the lifecycle provides a detailed view of the interconnected processes in machine learning systems, it can be distilled into a simplified framework for practical implementation.
+This general lifecycle forms the backbone of machine learning systems, with each stage contributing to the creation, validation, and maintenance of scalable and efficient solutions. While the lifecycle provides a detailed view of the interconnected processes in machine learning systems, it can be distilled into a simplified framework for practical implementation.
 
 Each stage aligns with one of the following overarching categories:
 
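To make the "splitting and versioning" step mentioned in the hunk above concrete, here is a minimal Python sketch. The record identifiers, split fractions, and hash-based version tag are illustrative assumptions, not anything prescribed by the chapter or the DR project.

```python
import hashlib
import json
import random

def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically, then split records into train/val/test."""
    shuffled = sorted(records)              # canonical order before shuffling
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps splits reproducible
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }

def dataset_version(splits):
    """Hash the split contents so any change produces a new dataset version tag."""
    canonical = json.dumps(splits, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

if __name__ == "__main__":
    image_ids = [f"retina_{i:05d}.png" for i in range(1000)]  # placeholder IDs
    splits = split_dataset(image_ids)
    print({k: len(v) for k, v in splits.items()}, dataset_version(splits))
```

A content hash like this gives every training run a stable reference to the exact data it saw, which is what makes later comparisons across experiments meaningful.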
@@ -235,69 +235,15 @@ Feedback loops are another critical aspect of this lifecycle perspective. Insigh
 
 The scaling of data collection introduces emergent behaviors that must be managed holistically. While individual clinics may function well in isolation, the simultaneous operation of multiple clinics can lead to system-wide patterns like network congestion or storage bottlenecks. These behaviors reinforce the importance of considering data collection as a system-level challenge rather than a discrete, isolated task.
 
-### Lifecycle Implications
-
-The success of ML systems depends on how effectively data collection integrates with the entire lifecycle. Decisions made in this stage affect not only the quality of the initial model but also the system's ability to evolve and adapt. For instance, data distribution shifts or changes in imaging equipment over time require the system to handle new inputs without compromising performance.
-
-In the DR project, embedding lifecycle thinking into data management strategies ensured the system remained robust and scalable as it expanded to new clinics and regions. By proactively addressing variability and quality during data collection, the team minimized the need for costly downstream adjustments, aligning the system with long-term goals and operational realities.
-
-## Model Development and Training
-
-Model development and training form the core of machine learning systems, yet this stage presents unique challenges that extend far beyond selecting algorithms and tuning hyperparameters. It involves designing architectures suited to the problem, optimizing for computational efficiency, and iterating on models to balance performance with deployability. In high-stakes domains like healthcare, the stakes are particularly high, as every design decision impacts clinical outcomes.
-
-For DR detection, the model needed to achieve expert-level accuracy while handling the high resolution and variability of retinal images. Using a deep neural network trained on their meticulously labeled dataset, the team achieved an F-score of 0.95, slightly exceeding the median score of the consulted ophthalmologists (0.91). This outcome highlights the effectiveness of state-of-the-art methods, such as transfer learning, and the importance of interdisciplinary collaboration between data scientists and medical experts to refine features and interpret model outputs.
-
-### Model Requirements and System Impact
-
-The requirements for model development emerge not only from the specific learning task but also from broader system constraints. In the DR project, the model needed high sensitivity and specificity to detect different stages of retinopathy. However, achieving this purely from an ML perspective was not sufficient. The system had to meet operational constraints, including running on limited hardware in rural clinics, producing results quickly enough to fit into clinical workflows, and being interpretable enough for healthcare providers to trust its outputs.
-
-These requirements shaped decisions during model development. While state-of-the-art accuracy might favor the largest and most complex models, such approaches were infeasible given hardware and workflow constraints. The team focused on designing architectures that balanced accuracy with efficiency, exploring lightweight models that could perform well on constrained devices.
-
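The F-score and the sensitivity and specificity requirements discussed in the paragraphs above can be grounded in their standard confusion-matrix definitions. The sketch below uses made-up counts purely for illustration; the numbers are not from the DR study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity (recall), specificity, precision, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate: diseased eyes correctly flagged
    specificity = tn / (tn + fp)   # true negative rate: healthy eyes correctly cleared
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": round(sensitivity, 3), "specificity": round(specificity, 3),
            "precision": round(precision, 3), "f1": round(f1, 3)}

# Illustrative counts for a referable-DR screening scenario, not real study data.
print(binary_metrics(tp=190, fp=15, fn=10, tn=785))
```

The F-score trades off precision against sensitivity in a single number, which is why it is a common headline metric when both missed cases and false alarms carry clinical cost.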
-Approaching model development through a systems perspective reveals its connections to every other stage of the ML lifecycle. As we saw in @fig-ml-lifecycle-feedback, these connections are bidirectional---model evaluation feeds back "Validation Issues" to trigger retraining, while "Deployment Constraints" from production systems can necessitate fundamental changes to model architecture. In the DR project, these feedback loops manifested when model evaluation revealed performance issues with certain image qualities, triggering model retraining, while deployment in resource-constrained clinics required architectural changes to reduce model size.
-
-This balancing act influenced every part of the system lifecycle. Decisions about model architecture affected data preprocessing, shaped the training infrastructure, and determined deployment strategies. For example, choosing to use an ensemble of smaller models instead of a single large model altered data batching during training, required changes to inference pipelines, and introduced complexities in how model updates were managed in production.
-
-### Model Development Workflow
-
-The model development workflow reflects the complex interplay between data, compute resources, and human expertise. In the DR project, this process began with data exploration and feature engineering, where data scientists collaborated with ophthalmologists to identify image characteristics indicative of retinopathy.
-
-This initial stage required tools capable of handling large medical images and facilitating experimentation with preprocessing techniques. The team needed an environment that supported collaboration, visualization, and rapid iteration while managing the sheer scale of high-resolution data.
-
-As the project advanced to model design and training, computational demands escalated. Training deep learning models on high-resolution images required extensive GPU resources and sophisticated infrastructure. The team implemented distributed training systems that could scale across multiple machines while managing large datasets, tracking experiments, and ensuring reproducibility.
-
-Model development was inherently iterative, with each cycle—adjusting DNN architectures, refining hyperparameters, or incorporating new data—producing extensive metadata, including checkpoints, validation results, and performance metrics. Managing this information across the team required robust tools for experiment tracking and version control to ensure that progress remained organized and reproducible.
-
-### Scale and Distribution Challenges
-
-As ML systems scale in both data volume and model complexity, the challenges of model development grow exponentially. The DR project's evolution from prototype models to production-ready systems highlights these hurdles. Expanding datasets, more sophisticated models, and concurrent experiments demanded increasingly powerful computational resources and meticulous organization.
-
-Distributed training became essential to meet these demands. While it significantly reduced training time, it introduced complexities in data synchronization, gradient aggregation, and fault tolerance. The team relied on advanced frameworks to optimize GPU clusters, manage network latency, and address hardware failures, ensuring training processes remained efficient and reliable.
-
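The paragraphs above refer to gradient aggregation and synchronization in distributed training. The toy sketch below shows synchronous data-parallel averaging on a one-parameter least-squares model; it is a conceptual stand-in under simplified assumptions, not the GPU-cluster frameworks the text alludes to.

```python
from typing import List

def local_gradients(worker_shards: List[List[float]], weight: float) -> List[List[float]]:
    """Each worker computes per-sample gradients on its own data shard."""
    grads = []
    for shard in worker_shards:
        # Toy model: prediction = weight * x, target y = x, loss = 0.5 * (weight*x - y)^2
        grads.append([(weight * x - x) * x for x in shard])
    return grads

def all_reduce_mean(per_worker_grads: List[List[float]]) -> float:
    """Synchronous aggregation: average gradients across all workers before updating."""
    flat = [g for worker in per_worker_grads for g in worker]
    return sum(flat) / len(flat)

weight = 3.0
shards = [[1.0, 2.0], [3.0, 4.0]]  # training data partitioned across two workers
for step in range(5):
    grads = local_gradients(shards, weight)
    weight -= 0.05 * all_reduce_mean(grads)  # one synchronized update per step
print(round(weight, 3))  # moves from 3.0 toward the optimum at 1.0
```

The averaging step is the point where data synchronization, network latency, and worker failures all surface, which is why the surrounding text treats fault tolerance as inseparable from distributed training.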
-The need for continuous experimentation and improvement compounded these challenges. Over time, the team managed an expanding repository of model versions, training datasets, and experimental results. This growth required scalable systems for tracking experiments, versioning models, and analyzing results to maintain consistency and focus across the project.
-
-### Systems Thinking
-
-Approaching model development through a systems perspective reveals its connections to every other stage of the ML lifecycle. Decisions about model architecture ripple through the system, influencing preprocessing requirements, deployment strategies, and clinical workflows. For instance, adopting a complex model might improve accuracy but increase memory usage, complicating deployment in resource-constrained environments.
-
-Feedback loops are inherent to this stage. Insights from deployment inform adjustments to models, while performance on test sets guides future data collection and annotation. Understanding these cycles is critical for iterative improvement and long-term success.
-
-Scaling model development introduces emergent behaviors, such as bottlenecks in shared resources or unexpected interactions between multiple training experiments. Addressing these behaviors requires robust planning and the ability to anticipate system-wide patterns that might arise from local changes.
+In the following chapters, we will step through each of the major stages of the lifecycle shown in @fig-ml-lifecycle-feedback. We will consider several key questions like what influences data source selection, how feedback loops can be systematically incorporated, and how emergent behaviors can be anticipated and managed holistically.
 
-The boundaries between model development and other lifecycle stages often blur. Feature engineering overlaps with data preparation, while optimization for inference spans both development and deployment. Navigating these overlaps effectively requires careful coordination and clear interface definitions.
+In addition, by adopting a systems thinking approach, we emphasize the iterative and interconnected nature of the ML lifecycle. How do choices in data collection and preparation ripple through the entire pipeline? What mechanisms ensure that monitoring insights and performance evaluations effectively inform improvements at earlier stages? And how can governance frameworks and infrastructure design evolve to meet the challenges of scaling while maintaining fairness and efficiency? These questions will guide our exploration of the lifecycle, offering a foundation for designing robust and adaptive ML systems.
 
 ### Lifecycle Implications
 
-Model development is not an isolated task; it exists within the broader ML lifecycle. Decisions made here influence data preparation strategies, training infrastructure, and deployment feasibility. The iterative nature of this stage ensures that insights gained feed back into data collection and system optimization, reinforcing the interconnectedness of the lifecycle.
-
-In subsequent chapters, we will explore key questions that arise during model development:
-
-- How can scalable training infrastructures be designed for large-scale ML models?
-- What frameworks and tools help manage the complexity of distributed training?
-- How can model reproducibility and version control be ensured in evolving projects?
-- What trade-offs must be made to balance accuracy with operational constraints?
-- How can continual learning and updates be handled in production systems?
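As a companion to the discussion of tracking experiments and versioning models above, here is a minimal append-only experiment log in Python. The record fields, the file name runs.jsonl, and the sample values are assumptions for illustration only.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class ExperimentRecord:
    """Minimal metadata needed to reproduce and compare a training run."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    model_arch: str = "small_cnn"      # illustrative architecture tag
    dataset_version: str = "unknown"   # e.g. the hash from the splitting sketch earlier
    hyperparams: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

def log_run(record: ExperimentRecord, path: str = "runs.jsonl") -> None:
    """Append-only log keeps every run comparable and auditable."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")

log_run(ExperimentRecord(
    dataset_version="3f2a9c1d0b4e",
    hyperparams={"lr": 1e-4, "batch_size": 16},
    metrics={"val_f1": 0.93},
))
```

Linking each run to a dataset version and hyperparameter set is the basic mechanism that keeps an expanding repository of models and results navigable.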
+The success of ML systems depends on how effectively data collection integrates with the entire lifecycle. Decisions made in this stage affect not only the quality of the initial model but also the system's ability to evolve and adapt. For instance, data distribution shifts or changes in imaging equipment over time require the system to handle new inputs without compromising performance.
 
-These questions highlight how model development sits at the core of ML systems, with decisions in this stage resonating throughout the entire lifecycle.
+In the DR project, embedding lifecycle thinking into data management strategies ensured the system remained robust and scalable as it expanded to new clinics and regions. By proactively addressing variability and quality during data collection, the team minimized the need for costly downstream adjustments, aligning the system with long-term goals and operational realities.
 
 ## Model Development and Training
 
@@ -387,7 +333,6 @@ Distribution challenges extended beyond infrastructure variability. The team nee
 
 Despite achieving high performance metrics during development, the DR system faced unexpected challenges in real-world deployment. As illustrated in @fig-ml-lifecycle-feedback, these challenges create multiple feedback paths---"Deployment Constraints" flowing back to model training to trigger optimizations, while "Performance Insights" from monitoring could necessitate new data collection. For example, when the system struggled with images from older camera models, this triggered both model optimizations and targeted data collection to improve performance under these conditions.
-
 
 Another critical scaling challenge was training and supporting end-users. Clinicians and staff needed to understand how to operate the system, interpret its outputs, and provide feedback. The team developed comprehensive training programs and support channels to facilitate this transition, recognizing that user trust and proficiency were essential for system adoption.
 
 ### Ensuring Robustness and Reliability
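The monitoring feedback described above, where poor performance on images from older camera models triggers new data collection, can be sketched as a per-slice error monitor. The slice names, threshold, and minimum sample count below are hypothetical choices for illustration.

```python
from collections import defaultdict

class SliceMonitor:
    """Track per-slice error rates in production and flag slices that degrade."""

    def __init__(self, threshold: float = 0.10, min_count: int = 50):
        self.threshold = threshold   # maximum tolerated error rate per slice
        self.min_count = min_count   # require enough samples before flagging
        self.errors = defaultdict(int)
        self.counts = defaultdict(int)

    def record(self, slice_key: str, correct: bool) -> None:
        self.counts[slice_key] += 1
        self.errors[slice_key] += 0 if correct else 1

    def flagged_slices(self):
        """Slices whose error rate exceeds the threshold; in a real deployment these
        would feed back into targeted data collection or retraining."""
        return [k for k in self.counts
                if self.counts[k] >= self.min_count
                and self.errors[k] / self.counts[k] > self.threshold]

monitor = SliceMonitor()
for _ in range(100):
    monitor.record("camera_model_A", correct=True)
for i in range(100):
    monitor.record("camera_model_B_legacy", correct=(i % 4 != 0))  # roughly 25% errors
print(monitor.flagged_slices())  # ['camera_model_B_legacy']
```

Keying the monitor on metadata such as camera model is what turns a vague "performance dropped" signal into an actionable request for more data from a specific slice.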