Production MLOps in 2026 is defined by a single principle: treat ML artifacts — data, features, models, and configurations — with the same rigor that software engineering applies to code. The teams that ship ML reliably build pipelines with five discrete stages (data ingestion, training, validation, deployment, monitoring), version every artifact at every stage, and enforce automated quality gates between them. The most consequential architectural decision is not which orchestrator to choose but whether to invest in reproducibility infrastructure — experiment tracking, data versioning, and model registries — before scaling past a handful of models. Teams that skip this step invariably hit a wall where no one can explain why a production model behaves differently from the one that passed evaluation.
The MLOps Landscape in 2026
MLOps has matured from a buzzword into a well-defined engineering discipline. The core problem it solves has not changed: most ML models that perform well in notebooks never reach production, and those that do frequently degrade without anyone noticing. What has changed is the tooling ecosystem and the organizational patterns around it.
Three forces are reshaping MLOps architecture in 2026:
- Foundation model integration: Pipelines must now handle both traditional ML models (trained from scratch on proprietary data) and fine-tuned or prompt-engineered foundation models. This duality demands pipeline architectures that support fundamentally different training, evaluation, and deployment patterns within the same infrastructure.
- Feature platform convergence: Feature stores have evolved from standalone tools into integrated feature platforms that unify offline and online serving, handle feature computation, and enforce feature-level governance. The feature platform is now a core pipeline component, not an optional add-on.
- Regulatory pressure: EU AI Act enforcement, expanding US state-level regulations, and sector-specific compliance requirements mean that model lineage, data provenance, and audit trails are no longer optional. Pipelines must produce compliance artifacts as a first-class output.
"The gap between deploying one ML model and managing fifty is the same gap between writing a script and running a distributed system. MLOps is the discipline that closes that gap."
— Chip Huyen, Designing Machine Learning Systems
The enterprise ML strategy guide covers the organizational and business-level decisions that precede pipeline architecture. This article focuses on the engineering: how to design, build, and operate the pipelines themselves.
Anatomy of a Production ML Pipeline
A production ML pipeline is a directed acyclic graph (DAG) with five primary stages. Each stage produces versioned artifacts, and automated quality gates control the transitions between them.
Stage 1: Data Ingestion and Validation
The pipeline begins where the data pipeline ends. Raw data flows in from source systems — databases, event streams, APIs, file stores — and is validated against schema and statistical expectations before proceeding.
- Schema validation: Enforce column types, value ranges, and nullability constraints. Tools like Great Expectations and Pandera define these as code, version-controlled alongside the pipeline.
- Distribution drift detection: Compare incoming data distributions against reference baselines. Flag batches where feature distributions shift beyond configurable thresholds (KL divergence, PSI, Wasserstein distance).
- Freshness checks: Verify that data timestamps fall within expected windows. Stale data entering a training pipeline produces models that are already outdated at deployment.
Data that fails validation halts the pipeline and triggers alerts. This is the first quality gate — and the one most teams skip, leading to silent model degradation downstream.
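The checks that Great Expectations or Pandera encode can be illustrated in plain Python. The sketch below (schema fields and sample rows are illustrative, not from any specific system) shows a batch-level gate that returns violations instead of raising, so the pipeline can log a full report before halting:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnSpec:
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None

def validate_batch(rows, schema):
    """Return a list of violations; an empty list means the batch passes the gate."""
    violations = []
    for i, row in enumerate(rows):
        for col, spec in schema.items():
            value = row.get(col)
            if value is None:
                if not spec.nullable:
                    violations.append(f"row {i}: {col} is null")
                continue
            if not isinstance(value, spec.dtype):
                violations.append(f"row {i}: {col} is not {spec.dtype.__name__}")
                continue
            if spec.min_value is not None and value < spec.min_value:
                violations.append(f"row {i}: {col}={value} below {spec.min_value}")
            if spec.max_value is not None and value > spec.max_value:
                violations.append(f"row {i}: {col}={value} above {spec.max_value}")
    return violations

# A toy transaction schema: amounts must be non-negative floats, country is required
schema = {
    "amount": ColumnSpec(float, min_value=0.0),
    "country": ColumnSpec(str),
}
assert validate_batch([{"amount": 12.5, "country": "DE"}], schema) == []
assert len(validate_batch([{"amount": -3.0, "country": None}], schema)) == 2
```

The key design choice, which the real tools share, is that the schema lives in version-controlled code next to the pipeline, so a schema change is itself a reviewable diff.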
Stage 2: Feature Engineering and Training
Validated data flows into feature engineering, where raw inputs are transformed into the numerical representations that models consume. In mature organizations, this stage reads from a feature store rather than computing features inline, ensuring consistency between training and serving.
Training itself is parameterized and reproducible:
- Hyperparameters, model architecture, and training configuration are stored as versioned config files — never hardcoded in training scripts.
- Training runs are logged to an experiment tracker with full metadata: dataset version, feature set version, code commit, environment specification, and all metrics.
- Compute is provisioned dynamically. GPU instances spin up for training and terminate when complete. Spot/preemptible instances reduce costs by 60-80% for fault-tolerant training workloads.
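Keeping configuration out of the training script is straightforward; the sketch below (field names and file format are illustrative) loads a versioned config file and hashes its exact bytes, so the experiment tracker can pin precisely which config produced a run:

```python
import dataclasses
import hashlib
import json
import tempfile

@dataclasses.dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float
    epochs: int
    model_arch: str

def load_config(path):
    """Parse a versioned config file and return (config, content_hash)."""
    with open(path) as f:
        raw = f.read()
    cfg = TrainingConfig(**json.loads(raw))
    # Hash the exact bytes so the run log can pin the config version
    return cfg, hashlib.sha256(raw.encode()).hexdigest()[:12]

# Demo: write a config file, then load it the way the training script would
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"learning_rate": 0.001, "epochs": 50, "model_arch": "xgboost"}, f)
    path = f.name

cfg, config_hash = load_config(path)
assert cfg.epochs == 50
assert len(config_hash) == 12
```

A frozen dataclass also makes accidental in-script mutation of hyperparameters a hard error rather than a silent reproducibility leak.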
Stage 3: Model Validation and Testing
A trained model is a candidate, not a release. The validation stage applies automated quality gates that determine whether the candidate proceeds to deployment:
| Validation Type | What It Checks | Gate Criteria |
|---|---|---|
| Performance metrics | Accuracy, precision, recall, F1, AUC on holdout set | Must exceed current production model or minimum threshold |
| Slice analysis | Performance across demographic/segment subgroups | No subgroup performance below fairness threshold |
| Regression tests | Known edge cases and failure modes | Must pass all regression test cases |
| Latency benchmarks | Inference time at p50, p95, p99 percentiles | Must meet SLA requirements for target deployment |
| Resource profiling | Memory footprint, GPU utilization, batch throughput | Must fit within allocated infrastructure budget |
| Adversarial testing | Robustness to perturbed or out-of-distribution inputs | Graceful degradation, no catastrophic failures |
Models that pass all gates are promoted to the model registry with a "staging" label. Models that fail are logged with detailed failure reports for debugging. This is the second quality gate, and the most important one for production reliability.
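A promotion gate can be sketched as a pure function over metric dictionaries. In this illustrative version (metric names and floor values are assumptions) every metric is higher-is-better; latency and resource gates would simply invert the comparisons:

```python
def gate_decision(candidate, production, floors):
    """Promote a candidate only if it clears every absolute floor AND does not
    regress against the current production model."""
    failures = []
    for metric, floor in floors.items():
        value = candidate.get(metric)
        if value is None or value < floor:
            failures.append(f"{metric}={value} below floor {floor}")
    for metric, prod_value in production.items():
        if candidate.get(metric, 0.0) < prod_value:
            failures.append(f"{metric} regresses vs production ({prod_value})")
    return ("promote-to-staging" if not failures else "reject", failures)

status, report = gate_decision(
    candidate={"auc": 0.91, "recall": 0.84},
    production={"auc": 0.89, "recall": 0.82},
    floors={"auc": 0.85, "recall": 0.75},
)
assert status == "promote-to-staging" and report == []
```

Returning the failure report alongside the decision is what makes the "logged with detailed failure reports" half of the gate cheap to implement.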
Stage 4: Deployment
Deployment moves a validated model from the registry to a serving environment. The deployment stage handles model packaging (containerization, serialization), infrastructure provisioning, traffic routing, and rollback configuration. We cover deployment patterns in detail below.
Stage 5: Monitoring and Feedback
The pipeline does not end at deployment. Model monitoring closes the loop by tracking production behavior and triggering retraining when performance degrades. Monitoring covers:
- Prediction drift: Statistical shifts in model output distributions that indicate environmental change.
- Feature drift: Changes in input feature distributions that may precede prediction quality drops.
- Performance metrics: When ground truth labels become available (often delayed), compare actual vs. predicted outcomes.
- Operational metrics: Latency, throughput, error rates, resource utilization — the same observability you apply to any production service.
When monitoring detects drift or performance degradation beyond configured thresholds, it triggers an automated retraining run — feeding new data back to Stage 1 and completing the feedback loop.
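One of the drift statistics mentioned above, the Population Stability Index (PSI), is small enough to sketch in full. Bin edges come from the reference distribution, and empty buckets are floored so the log term stays finite:

```python
import math

def population_stability_index(reference, live, bins=10):
    """PSI between a reference sample and a live sample of one numeric feature."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Floor empty buckets at a tiny value so log() never sees zero
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = fractions(reference), fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference = [float(x) for x in range(100)]
assert population_stability_index(reference, reference) < 1e-9               # no drift
assert population_stability_index(reference, [x + 50 for x in reference]) > 0.25  # drift
```

A conventional rule of thumb (tune per feature): PSI below 0.1 is stable, 0.1 to 0.25 warrants investigation, and above 0.25 triggers the retraining alert.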
CI/CD for Machine Learning
CI/CD for ML borrows the principles of software CI/CD but differs in critical ways. Understanding these differences prevents teams from force-fitting software CI/CD patterns where they do not apply.
| Dimension | Software CI/CD | ML CI/CD |
|---|---|---|
| What is versioned | Code | Code + data + model + config + features |
| What triggers a build | Code commit | Code commit, data change, schedule, drift alert |
| Build duration | Minutes | Minutes to hours (training step) |
| Test determinism | Deterministic (same code → same result) | Stochastic (same code + data can yield a different model) |
| Artifact size | Megabytes (container image) | Gigabytes (model weights + container) |
| Rollback | Deploy previous container version | Load previous model version + verify feature compatibility |
| Environment | CPU-based build agents | GPU-equipped build agents for training and validation |
The Three-Pipeline Pattern
Production ML teams typically maintain three interconnected CI/CD pipelines rather than one:
- Code pipeline: Triggered by code commits. Runs unit tests on data transformations, feature engineering logic, model training code, and serving code. Validates that the pipeline definition itself is correct — before any data is processed or models are trained. This pipeline runs in minutes on standard CI infrastructure. Apply the same CI/CD optimization patterns you use for software builds.
- Training pipeline: Triggered by code pipeline success, data changes, or scheduled intervals. Executes the full training DAG: data ingestion, feature engineering, model training, and model validation. Produces a validated model artifact in the registry. This pipeline runs in minutes to hours and requires GPU-equipped infrastructure.
- Deployment pipeline: Triggered by a model being promoted in the registry (manually or automatically after validation). Packages the model, provisions or updates serving infrastructure, configures traffic routing, and runs smoke tests against the deployed endpoint. This pipeline must support rollback within seconds.
Branch Strategy for ML Repositories
ML repositories benefit from a trunk-based development model with short-lived feature branches. However, the repository structure itself differs from typical software projects:
```
ml-project/
├── pipelines/              # Pipeline definitions (DAGs)
│   ├── training.py
│   ├── evaluation.py
│   └── deployment.py
├── src/
│   ├── features/           # Feature engineering code
│   ├── models/             # Model architecture definitions
│   ├── data/               # Data loading and validation
│   └── serving/            # Inference server code
├── configs/
│   ├── training/           # Hyperparameter configs (versioned)
│   ├── serving/            # Serving configs (batch size, concurrency)
│   └── infrastructure/     # IaC definitions
├── tests/
│   ├── unit/               # Fast tests for all src/ modules
│   ├── integration/        # Pipeline integration tests
│   └── model/              # Model quality regression tests
├── notebooks/              # Exploration (never imported by pipeline)
└── dvc.yaml                # Data versioning config
```
The critical rule: notebooks are for exploration only. Any logic that enters the pipeline must be refactored into tested, importable Python modules in src/. Notebooks that are called by pipelines are the single most common source of unreproducible ML systems.
Experiment Tracking & Model Registry
Experiment tracking and model registries solve the same fundamental problem: without them, no one can answer "why does this model behave differently than the one from last month?" They are distinct but complementary systems.
Experiment Tracking
An experiment tracker records the full context of every training run: parameters, metrics, code version, data version, environment, and artifacts. It answers questions like:
- Which hyperparameter combination produced the best validation score?
- What dataset version was used for the model currently in production?
- How did performance change when we added the new feature last Tuesday?
The dominant tools are MLflow Tracking, Weights & Biases (W&B), and Neptune. All three offer comparable core functionality — the differentiator is ecosystem integration and team preference. For teams already invested in the AWS ML ecosystem, SageMaker Experiments provides native integration.
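The core data model these tools share is simple enough to sketch. This toy in-memory tracker (real tools persist the records and add UIs, search, and artifact storage) shows the metadata every run must carry to answer the questions above:

```python
class ExperimentTracker:
    """Toy in-memory tracker; MLflow/W&B record the same fields per run."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, code_commit, data_version):
        run = {
            "params": params,
            "metrics": metrics,
            "code_commit": code_commit,    # git SHA of the training code
            "data_version": data_version,  # DVC hash / dataset snapshot id
        }
        self.runs.append(run)
        return run

    def best_run(self, metric):
        """Which run produced the best score for a given metric?"""
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.01}, {"auc": 0.88}, "a1b2c3d", "ds-v2")
tracker.log_run({"lr": 0.001}, {"auc": 0.91}, "a1b2c3d", "ds-v3")
assert tracker.best_run("auc")["params"] == {"lr": 0.001}
```

If a question cannot be answered from these four fields, the missing context (environment spec, feature set version) belongs in the run record too.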
Model Registry
A model registry is the handoff point between training and deployment. It stores model artifacts alongside metadata and manages lifecycle stages (development, staging, production, archived). Key capabilities:
- Versioning: Every model is an immutable, versioned artifact. You can always roll back to any previous version.
- Lineage: Links each model version to the experiment run, dataset version, and code commit that produced it.
- Stage management: Models move through lifecycle stages with approvals and automated gates. A model in "staging" has passed automated validation. A model in "production" has been approved for live traffic.
- Serving metadata: Input/output schemas, preprocessing requirements, resource profiles, and SLA specifications travel with the model — not stored separately in wiki pages that go stale.
"If you cannot reproduce a model from its registry entry — same data, same code, same result — your registry is a file server with extra steps."
— adapted from Google's ML Engineering best practices
Orchestration Tools Comparison
Orchestration is the control plane that executes your pipeline DAGs: scheduling tasks, managing dependencies, handling retries, and provisioning compute. The choice of orchestrator is not as consequential as the quality of the pipeline it runs, but it does constrain your operational patterns.
| Feature | Kubeflow Pipelines | Apache Airflow | Prefect | Dagster |
|---|---|---|---|---|
| ML-native | Yes — built for ML | No — general purpose | No — general purpose | Partial — data-aware |
| Kubernetes required | Yes | No (but common) | No | No |
| GPU scheduling | Native | Via K8s executor | Via infrastructure blocks | Via resource tags |
| Pipeline definition | Python SDK / YAML | Python (DAG files) | Python (decorated functions) | Python (assets + ops) |
| Caching / memoization | Built-in step caching | Manual implementation | Built-in result caching | Built-in asset memoization |
| Experiment tracking | Integrated (metadata store) | None — external required | None — external required | Asset observation built-in |
| Learning curve | Steep (K8s + KFP SDK) | Moderate | Low | Moderate (asset model) |
| Managed offering | Google Vertex AI Pipelines | MWAA, Astronomer, Cloud Composer | Prefect Cloud | Dagster Cloud |
| Best for | K8s-native ML teams | Existing Airflow shops | Small-mid ML teams | Data-centric ML teams |
Selection Guidance
Choose Kubeflow Pipelines if your organization already runs Kubernetes and your ML workloads need tight integration with K8s-native GPU scheduling and distributed training. It offers the most ML-specific features but demands significant Kubernetes expertise.
Choose Airflow if your data engineering team already uses it and you want to integrate ML pipelines into existing data workflows. Airflow is not ML-native, but its ecosystem is unmatched, and most ML tooling provides Airflow operators.
Choose Prefect if you want the fastest path from notebook-based experiments to orchestrated pipelines. Its decorator-based API requires minimal refactoring of existing training code, and Prefect Cloud eliminates operational overhead.
Choose Dagster if your ML pipelines are heavily data-dependent and you want the orchestrator to understand data assets natively. Dagster's asset-based model aligns well with feature engineering and data transformation workflows.
Data & Model Versioning
Versioning in ML extends far beyond git. Code versioning is necessary but insufficient — you must also version datasets, feature definitions, model weights, and pipeline configurations. Without this, reproducibility is impossible.
Data Versioning with DVC
DVC (Data Version Control) extends git to handle large files and datasets. It stores lightweight pointer files in git while the actual data lives in remote storage (S3, GCS, Azure Blob). This gives you git-like semantics — branches, tags, diffs — for multi-gigabyte datasets.
```bash
# Track a training dataset
dvc add data/training/v3/
git add data/training/v3.dvc .gitignore
git commit -m "Add training dataset v3 - includes Q1 2026 labels"
dvc push

# Reproduce a specific training run
git checkout v2.1.0   # Checks out code + DVC pointers
dvc pull              # Downloads the exact dataset used for v2.1.0
python pipelines/training.py
```
DVC pipelines (dvc.yaml) define reproducible DAGs that track dependencies between data, code, parameters, and outputs. Running dvc repro re-executes only stages whose inputs have changed — a critical optimization when training datasets take hours to process.
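A minimal dvc.yaml sketch makes the dependency tracking concrete (stage names, paths, and parameter keys here are illustrative). With this definition, dvc repro skips the train stage unless the features, the code, or one of the listed parameters changed:

```yaml
stages:
  featurize:
    cmd: python src/features/build.py
    deps:
      - data/training/v3
      - src/features/build.py
    outs:
      - data/features

  train:
    cmd: python pipelines/training.py
    deps:
      - data/features
      - src/models
    params:
      - configs/training/params.yaml:
          - learning_rate
          - epochs
    outs:
      - models/model.pkl
```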
Model Versioning with MLflow
MLflow's model registry provides model versioning with lifecycle management. Each registered model has an ordered list of versions, and each version links back to the experiment run that produced it:
```python
import mlflow

# Log model during training
with mlflow.start_run() as run:
    mlflow.log_params({"learning_rate": 0.001, "epochs": 50})
    # ... training code ...
    mlflow.sklearn.log_model(model, "model", registered_model_name="fraud-detector")

# Promote to production (after validation passes)
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector",
    version=12,
    stage="Production",
)
```
The Versioning Matrix
Every production model must be traceable through a complete versioning chain:
- Code version: Git commit hash of the training pipeline and feature engineering code
- Data version: DVC commit or dataset hash identifying the exact training data
- Feature version: Feature store snapshot or feature definition version
- Config version: Hyperparameter and training configuration file version
- Environment version: Docker image tag or conda environment spec hash
- Model version: Registry version number linking to all of the above
If any link in this chain is missing, you cannot reproduce the model. And a model you cannot reproduce is a model you cannot debug, audit, or improve with confidence.
Deployment Patterns for ML Models
Deploying ML models carries more risk than deploying code because model behavior is harder to predict from tests alone. A model can pass every validation check and still behave unexpectedly on production traffic patterns. Deployment patterns mitigate this risk by controlling exposure.
| Pattern | How It Works | Risk Level | Best For |
|---|---|---|---|
| Shadow deployment | New model runs on production traffic but results are discarded; only logged for comparison | Zero (no user impact) | High-stakes models, first-time deployments |
| Canary deployment | Route 1-5% of traffic to new model, monitor metrics, gradually increase | Very low | Models with measurable real-time metrics |
| Blue-green deployment | Maintain two identical environments; switch traffic atomically from old to new | Low (instant rollback) | Models requiring zero-downtime updates |
| A/B testing | Route percentage of users to new model, run statistical significance test on business metrics | Controlled | Measuring business impact of model changes |
| Multi-armed bandit | Dynamically shift traffic toward the better-performing model variant using an exploration-exploitation algorithm | Controlled | Continuous optimization, many model variants |
Shadow Deployments in Practice
Shadow deployment is the most underutilized pattern in ML. The new model receives a copy of production requests and generates predictions in parallel, but only the existing model's predictions are served to users. You compare the new model's outputs against the existing model (and, when available, ground truth) to validate behavior on real traffic before any user exposure.
The implementation requires forking the inference request at the serving layer:
```python
import asyncio

# Simplified shadow deployment logic
async def predict(request):
    # Primary model serves the response
    primary_result = await primary_model.predict(request)
    # Shadow model runs async — does not affect latency
    asyncio.create_task(shadow_predict_and_log(request, primary_result))
    return primary_result

async def shadow_predict_and_log(request, primary_result):
    shadow_result = await shadow_model.predict(request)
    metrics.log_comparison(primary_result, shadow_result)
```
Shadow deployments add compute cost (you are running two models) but eliminate deployment risk. For models where errors have significant consequences — fraud detection, medical diagnosis, content moderation — shadow deployment should be mandatory before any live traffic exposure.
Canary Deployment with Automated Rollback
Canary deployments gradually shift traffic from the current model to the new model while monitoring key metrics. Automated rollback triggers if metrics degrade beyond acceptable thresholds:
- Deploy new model to a separate endpoint
- Route 1% of traffic to the new endpoint
- Monitor prediction distribution, latency, error rates for a configurable window (e.g., 30 minutes)
- If metrics are healthy, increase to 5%, then 25%, then 50%, then 100%
- If any metric breaches its threshold, instantly route all traffic back to the previous model
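The rollout loop above can be sketched as a small controller. In this illustrative version, set_traffic_percent and metrics_healthy are injected callables; in practice they would wrap your service mesh API and metrics backend, and the step sizes and window are assumptions to tune:

```python
import time

CANARY_STEPS = [1, 5, 25, 50, 100]  # percent of traffic to the new model

def run_canary(set_traffic_percent, metrics_healthy, observe_seconds=1800):
    """Walk the canary through increasing traffic shares; roll back to 0%
    on the first unhealthy observation window."""
    for percent in CANARY_STEPS:
        set_traffic_percent(percent)
        time.sleep(observe_seconds)      # observation window
        if not metrics_healthy():
            set_traffic_percent(0)       # instant rollback to the previous model
            return "rolled-back"
    return "promoted"

# Dry run with stubs: metrics degrade on the third observation window
history, checks = [], iter([True, True, False])
result = run_canary(history.append, lambda: next(checks), observe_seconds=0)
assert result == "rolled-back" and history == [1, 5, 25, 0]
```

Injecting the traffic and metrics hooks keeps the promotion logic itself trivially testable, which matters for a component whose failure mode is routing bad predictions to users.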
This pattern integrates directly into the infrastructure scaling patterns. Use service mesh traffic splitting (Istio, Linkerd) or load balancer weighted routing to control the traffic distribution without application code changes.
Infrastructure as Code for ML
ML infrastructure involves compute types (GPUs, TPUs) and resource patterns (bursty training, steady-state serving) that differ fundamentally from web application infrastructure. Infrastructure as Code (IaC) for ML must handle these differences.
Training Infrastructure
Training workloads are bursty and GPU-intensive. IaC definitions should:
- Provision GPU instances on demand using auto-scaling groups or managed training services
- Configure spot/preemptible instances with checkpointing so training resumes after interruption
- Set cost limits — training jobs that exceed budget are halted before they drain your cloud account
- Define storage volumes that are large enough for datasets and checkpoints, provisioned with the IOPS to feed data to GPUs without bottlenecking
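The checkpointing that makes spot instances viable reduces to two operations: an atomic save and an idempotent resume. A minimal sketch (real trainers serialize weights with framework-specific tools; JSON stands in here):

```python
import json
import os
import tempfile

def save_checkpoint(path, epoch, state):
    """Write atomically so a spot interruption mid-write cannot corrupt the file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, path)  # atomic on POSIX filesystems

def resume_from(path):
    """Return (next_epoch, state); (0, None) when starting fresh."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["epoch"] + 1, ckpt["state"]

ckpt_path = os.path.join(tempfile.mkdtemp(), "train.ckpt")
assert resume_from(ckpt_path) == (0, None)        # fresh start
save_checkpoint(ckpt_path, epoch=3, state={"loss": 0.42})
assert resume_from(ckpt_path) == (4, {"loss": 0.42})  # resume after preemption
```

The training loop then always starts from resume_from(path) rather than epoch 0, so a preempted job restarted by the orchestrator simply continues.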
For cost optimization, separate training infrastructure from serving infrastructure completely. Training instances should not remain running between training jobs.
Serving Infrastructure
Serving workloads are steady-state with latency requirements. IaC definitions should:
- Auto-scale based on request queue depth rather than CPU utilization — GPU inference workloads have different scaling characteristics than CPU-bound web services
- Define health checks that include model inference validation, not just HTTP readiness
- Configure rolling updates that respect in-flight requests during model swaps
- Provision model caching (local SSD or NVMe) to avoid downloading multi-gigabyte models from object storage on every instance launch
Terraform Example: ML Serving Infrastructure
```hcl
# Simplified Terraform for model serving on EKS
resource "kubernetes_deployment" "model_serving" {
  metadata {
    name = "fraud-detector-serving"
    labels = {
      app     = "fraud-detector"
      version = var.model_version
    }
  }

  spec {
    replicas = var.min_replicas

    template {
      spec {
        container {
          name  = "model-server"
          image = "${var.ecr_repo}:model-${var.model_version}"

          resources {
            requests = {
              "nvidia.com/gpu" = "1"
              memory           = "8Gi"
            }
            limits = {
              "nvidia.com/gpu" = "1"
              memory           = "16Gi"
            }
          }

          liveness_probe {
            http_get {
              path = "/health"
              port = 8080
            }
          }

          readiness_probe {
            http_get {
              path = "/predict" # Validates model is loaded
              port = 8080
            }
          }
        }
      }
    }
  }
}

resource "kubernetes_horizontal_pod_autoscaler" "model_hpa" {
  metadata {
    name = "fraud-detector-hpa"
  }

  spec {
    min_replicas = var.min_replicas
    max_replicas = var.max_replicas

    scale_target_ref {
      kind = "Deployment"
      name = "fraud-detector-serving"
    }

    metric {
      type = "Pods"
      pods {
        metric {
          name = "inference_queue_depth"
        }
        target {
          type          = "AverageValue"
          average_value = "5"
        }
      }
    }
  }
}
```
This pattern, combined with GitOps (ArgoCD or Flux), ensures that serving infrastructure is declarative, version-controlled, and auditable. Model deployments become pull requests — reviewable, reversible, and traceable. The same principles apply whether you deploy on AWS, GCP, or Azure.
Common MLOps Anti-Patterns
Across dozens of ML teams, the same anti-patterns surface repeatedly. Each one seems reasonable in the moment but creates compounding technical debt.
1. The Notebook Pipeline
Running production training by executing Jupyter notebooks in sequence — often manually. Notebooks mix exploration code, dead cells, hardcoded paths, and implicit state. They cannot be meaningfully unit tested, code reviewed, or monitored. Refactor training logic into testable Python modules and orchestrate them with a proper pipeline tool.
2. The Manual Model Promotion
A data scientist trains a model, sends the weights to an engineer via Slack, who deploys it by hand. No validation gates, no lineage, no rollback plan. This works for the first three models and becomes a liability by the tenth. Automate the promotion path through a registry with defined quality gates.
3. The Monolithic Pipeline
A single pipeline definition that handles data ingestion, feature engineering, training, evaluation, and deployment for all models. Changes to one model's training logic risk breaking other models. Decompose into modular, per-model pipelines that share common infrastructure but are independently deployable.
4. Training-Serving Skew
Feature engineering logic that differs between training and serving — different code paths, different libraries, different rounding behavior. This produces models that perform well in evaluation but poorly in production. The solution is a shared feature computation layer (feature store or shared library) used by both training and serving paths.
5. The Unmonitored Model
Deploying a model without monitoring for data drift, prediction drift, or performance degradation. The model silently degrades over weeks or months, and no one notices until a downstream metric collapses. Every deployed model needs a monitoring configuration defined at deployment time, not added as an afterthought. See the monitoring and observability guide for implementation details.
6. The GPU Hoarder
Keeping GPU instances running 24/7 for training workloads that run a few hours per week. At $2-8/hour per GPU instance, idle GPUs cost thousands per month for no return. Use ephemeral compute that provisions on pipeline trigger and terminates on completion.
"The most expensive ML infrastructure mistake is not choosing the wrong tool — it is keeping resources running when no one is using them."
— common wisdom in platform engineering teams
Frequently Asked Questions
What is the minimum viable MLOps setup for a team running 2-3 models in production?
At minimum, you need version control for code and data (Git + DVC), an experiment tracker (MLflow or W&B), a model registry (MLflow or cloud-native), a simple orchestrator (Prefect or Airflow), and basic monitoring (prediction logging + drift detection). Skip the complexity of Kubeflow or custom platforms — start with managed services and migrate when the managed offering becomes a constraint. The goal is reproducibility and automated validation, not architectural completeness.
How does CI/CD for ML differ from traditional software CI/CD?
ML CI/CD must version and test four artifacts (code, data, model, config) instead of one (code). Builds are non-deterministic — identical inputs may produce slightly different models due to stochastic training. Build times are measured in hours rather than minutes because of training. Tests include statistical quality checks, not just pass/fail assertions. And the deployment artifact (model weights) is often gigabytes rather than megabytes. Teams need GPU-equipped CI infrastructure for training and validation steps.
When should I use shadow deployment versus canary deployment for ML models?
Use shadow deployment when you are deploying a model for the first time, when model errors have significant consequences (financial, safety, legal), or when you lack real-time performance metrics and need offline analysis to compare models. Use canary deployment when you have a baseline model already in production, you can measure model quality in real-time, and you want to minimize the time to full rollout. Many teams use shadow first, then switch to canary for subsequent model updates once they have established monitoring baselines.
How do I handle training-serving skew in practice?
The most reliable solution is a feature store that computes and serves features through a single code path for both training and inference. If a feature store is not yet in your stack, at minimum extract all feature engineering into a shared Python library imported by both the training pipeline and the serving application. Write integration tests that send identical raw inputs through both paths and assert that the computed features match within acceptable numerical tolerance. Never duplicate feature logic across training and serving codebases.
What orchestration tool should I choose if I am starting from scratch in 2026?
For most teams starting fresh, Dagster or Prefect offers the best balance of ML-friendliness and operational simplicity. Dagster if your pipelines are data-centric and you want the orchestrator to understand data lineage natively; Prefect if you want the lowest friction path from scripts to orchestrated pipelines. Choose Kubeflow only if you already run Kubernetes and need its ML-specific features (distributed training, GPU scheduling, pipeline metadata). Choose Airflow only if your organization already operates it and the data engineering team provides a managed instance.