Best practices in Enterprise Data Science governance in Azure

James Nguyen
12 min read · Feb 19, 2021

Now that your enterprise data science is on Azure, you can take advantage of cloud agility and a set of great tools to improve development productivity and simplify ML operationalization.

However, even on a great platform with built-in support for modern ML practices, there are still practical challenges to address due to the unique nature of data science.

Some practical challenges are:

  • It is difficult to measure and track the progress of data science projects: they are experimental in nature with open-ended objectives, and many will never make it to production. During the exploratory phase, data scientists may produce various interim artifacts that they do not consider worth pushing to the team repo.
  • Often only a very limited number of people, such as a single data scientist, are responsible for the core solution, and there is not a lot of motivation to interact with others.
  • In organizations with multiple data science teams and projects, teams often work in isolation without knowing about each other’s work. Teams sometimes redevelop work that was already created by another team. Even when a team wants to reuse another team’s work, it is difficult because each team has its own way of packaging models, organizing experiments and dependencies, and deploying and operationalizing. This results in a lack of reuse and sharing across teams and projects.
  • Operationalizing an ML model requires both ML engineering skills (algorithms, frameworks) and skills in the application and infrastructure platform on which the ML solution is deployed.

This post offers the following best practices for governing enterprise data science to address these challenges:

  • Organize work and resources
  • Apply reproducibility practices
  • Apply source code version control and model management
  • Manage and monitor data
  • Standardize model packaging and deployment methods
  • Implement MLOps
  • Democratize Model Discovery and Consumption

1. Organize work and resources

A team or department practicing data science runs one or more experiments/projects, producing modeling artifacts, code and operational solutions. The team/department needs compute resources to run modeling or inferencing work, and may need multiple environments depending on the release strategy.

Translated to Azure ML, a typical way to organize this is to assign a separate workspace to each team/department, under which training and inference compute resources, a model repo and data stores can be assigned. Under this workspace, the team may set up one or more experiments corresponding to different projects.

Mapping between functional organization and Azure AML

Each team/department should be set up with data stores that conform to the enterprise’s data access policy. Typically, the data used for ML experiments should sit in a separate zone, copied from the operational environment.

Datasets should be set up as snapshot copies and used for experiments. Each experiment should be linked to a dataset or a dataset version.

Compute can be training compute such as an AML compute instance (single user), an AML compute cluster (job-type cluster), attached compute such as Databricks, or inference compute such as ACI and AKS. Appropriate authorization and roles should be set up so that the right people are in charge of compute resource setup, and security hardening can be applied if required by company policy.
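
As a hedged sketch (workspace config, cluster name and VM size are illustrative), provisioning a small autoscaling training cluster in a team’s workspace with the Azure ML Python SDK could look like this:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # the team's workspace (config.json downloaded from the portal)

# A small autoscaling training cluster; size and limits are illustrative
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    min_nodes=0,
    max_nodes=4,
    idle_seconds_before_scaledown=1200,
)
cluster = ComputeTarget.create(ws, "cpu-cluster", config)
cluster.wait_for_completion(show_output=True)
```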

Environment policy

It is a good practice to create a separate workspace for each release environment (Dev, Test, Prod) to isolate data, access authorization and services. The components and configurations of each environment can also differ from one another.

For example, in QA you may want to introduce additional Azure services to perform end-to-end testing of the ML model’s integration in a complete functional workflow, such as Azure Data Factory to schedule ML pipeline runs, which was not the focus of core ML development in Dev. In Production, there may be a need for model telemetry collection for monitoring and analysis.

2. Apply reproducibility practice.

Unlike a regular software program, where the output is mostly deterministic given the source code and input, a data science experiment requires the specific data input, experiment environment, hyperparameters, source code and other configurations to produce the same result. These settings can be complex to set up and very difficult to reproduce for anyone who is not the original author.

Reproducibility is also very important for evaluating an ML model, because it allows anyone to examine all the data inputs and settings that produced the result.

So effort and best practices need to be applied to ensure, and preferably simplify, reproducibility of the experiment.

First, the input data used for the experiment needs to be captured and versioned, and its association with the experiment run needs to be recorded.

Second, the environment in which the experiment is conducted needs to be captured. Modern frameworks such as Azure ML and MLflow allow automatic capture of the ML runtime environment and can even recreate it automatically from the environment file.

With Azure ML, it is a best practice to wrap complex ML experiments and trainings as an Azure ML Pipeline and publish it. By doing this, a client can rerun the entire experiment with a simple REST API call. Inside the pipeline, all the steps needed to acquire data from a designated dataset/version, build the ML environment and provision the required compute capacity can be handled automatically.
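
A minimal sketch of this pattern with the Azure ML SDK (step name, script, pipeline name and compute target are illustrative):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# A single training step; a real pipeline would also have data-prep and validation steps
train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="src",
    compute_target="cpu-cluster",
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[train_step])
published = pipeline.publish(name="churn-training-pipeline",
                             description="End-to-end training run")

# Any authorized client can now rerun the whole experiment through this REST endpoint
print(published.endpoint)
```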

Apply operational readiness at different levels:

  • Rerunnable experiment: code captured, environment captured
  • Reproducible experiment: dataset captured, configuration/hyperparameters captured
  • Accommodates change: modularized code & pipelines

3. Apply source code version control and model management.

Use of source code version control in data science.

The use of a source code version control system in data science experiments is not as common as in software development, mainly because of the experimental nature of data science. When a project first starts, data scientists prefer a personal notebook environment to quickly explore different hypotheses and solutions. However, as the exploration phase converges on a concrete direction, it is important to apply modularization and version control to facilitate collaboration with the broader team.

A simple yet effective strategy is to follow the feature-branch strategy from software development. A data scientist may create a feature branch to work on general exploratory analysis or try a particular modeling direction in a notebook-style development. When the work is ready, it is time to modularize the code (create modules, classes and runnable Python files, as opposed to keeping everything in a notebook, which is difficult to operationalize and collaborate on) and commit to a team branch.

Model Management

The model is at the heart of data science. When a model produced by a data science experiment is used by business applications and downstream processes, it is important to keep track of information about it.

Depending on the actual business scenario, the information needed by a model user can include the following:

  • Technology details of the model, such as the ML framework and deployment environment
  • Links to the experiment and the data the model was trained on
  • The model’s performance benchmarks, such as accuracy and recall on a particular test dataset
  • Model version
  • Model files & data

An Azure ML model repository or an MLflow model registry can be used to host ML models. These technologies provide APIs to register, retrieve, track and package models efficiently.

Model tagging is the main method for storing a model’s metadata. It is important to define a standard metadata template and use it consistently across teams. Having a consistent metadata template facilitates model discovery, comparison and analysis.

Example of model metadata in Azure ML’s model repository
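
As an illustrative sketch (model name, file path and tag keys are assumptions, not the exact template in the screenshot), registering a model with a standard metadata template via the Azure ML SDK might look like:

```python
from azureml.core import Workspace, Model

ws = Workspace.from_config()

model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",      # artifact produced by the training run
    model_name="churn-model",
    tags={                               # team-wide metadata template (illustrative keys)
        "framework": "scikit-learn",
        "dataset": "churn-training-data:3",
        "experiment": "churn-prediction",
        "accuracy": "0.91",
        "recall": "0.84",
    },
)
print(model.name, model.version)
```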

4. Manage and Monitor Data

An ML model’s behavior is strongly influenced by the observations it was trained on. The same algorithm may produce different models when input datasets differ. In production, if the distribution of actual data differs significantly from the distribution of the train and test data, the model may produce undesirable results such as bias or substantially worse performance.

Capture input data snapshot and link dataset to experiment

As input data is so influential to a model’s behavior, it should be captured as a snapshot from the operational environment, with version control applied to changes. The ML training experiment should then refer to the dataset and version that it uses.

Azure ML comes with Dataset objects, dataset versioning, profiling and metadata that can be used to keep track of a dataset’s details.
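
A minimal sketch of snapshotting and versioning a dataset with the Azure ML SDK (datastore path and dataset name are illustrative):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Register a snapshot copied from the operational store; re-registering with
# create_new_version=True adds a new version instead of overwriting the old one.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "snapshots/churn/2021-02.csv"))
dataset = dataset.register(workspace=ws,
                           name="churn-training-data",
                           create_new_version=True,
                           description="Monthly snapshot from the operational store")

# The training experiment can then record exactly which version it used
print(dataset.name, dataset.version)
```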

Profile data

When running ML training, it is a good practice to document the profile of input datasets, including train, test and validation. The data profile normally includes general statistics about the distribution of every feature and label. Depending on the problem domain, technology and dataset, there can also be domain-specific statistics, for example in computer vision, time series or NLP.

Dataset profiling can be a small custom step in the ML training pipeline whose statistical outputs are logged, if the operation is simple, or a dedicated run that produces a detailed analysis of the dataset, as in an Azure ML dataset profiling run.
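
As a simple sketch of the “small custom step” variant (the input file is illustrative), basic distribution statistics can be logged to the run like this:

```python
import pandas as pd
from azureml.core import Run

run = Run.get_context()            # resolves to the current run inside a pipeline step
df = pd.read_csv("train.csv")      # illustrative input; in practice a mounted dataset

# Log simple distribution statistics for every numeric feature
for column, stats in df.describe().items():
    run.log_row("feature_profile", feature=column,
                mean=stats["mean"], std=stats["std"],
                minimum=stats["min"], maximum=stats["max"])
```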

Monitor data drift

When the input datasets change substantially, whether in training or in inferencing, it may result in substantial changes in the downstream applications and processes that use the model. It is therefore useful to be proactive in dealing with changes by applying drift monitoring.

Dataset versioning and lineage are useful for monitoring data drift. You can establish a baseline profile from a baseline dataset version and compare it with subsequent versions. Thresholds can be established to determine when to trigger a drift alert, based on critical assumptions about the data distribution that data scientists made during modeling. A periodic manual review can provide a more comprehensive analysis of data drift.

Azure ML provides a data drift monitoring capability in preview.

For models in production, drift monitoring can be implemented as part of model monitoring with App Insights. Key statistics that come from critical assumptions about the modeling data distribution can be computed and collected, and alerts can be triggered if there is a substantial deviation.
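
One common drift statistic is the population stability index (PSI); a self-contained sketch with synthetic data (the 0.2 threshold is a widely used rule of thumb, not an Azure default):

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between a baseline sample and a current sample of one feature."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)   # avoid log(0)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

baseline_scores = np.random.normal(0.0, 1.0, 10_000)     # e.g. the baseline dataset version
production_scores = np.random.normal(0.3, 1.0, 10_000)   # e.g. recent production data

psi = population_stability_index(baseline_scores, production_scores)
if psi > 0.2:   # treat as a substantial deviation and raise an alert
    print(f"Data drift alert: PSI = {psi:.3f}")
```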

5. Standardize model packaging and deployment methods.

A model needs to be packaged to be deployed to an environment different from the training/testing environment.

If the model is packaged in a proprietary way, it is difficult for software engineers to deploy and work with it.

Using a standard packaging method such as the Azure ML deployment framework or MLflow makes it easy to deploy and score your model.

A few tips for packaging the model:

  • Accompany the packaged model with an environment file that specifies all library dependencies the model requires.
  • Include dependent source files and data files.
  • For a REST API, include a swagger decorator with sample data to demonstrate how to call the model.
  • MLflow’s packaging method makes it easy to test the packaged model locally without actually deploying it. MLflow also supports different model packaging flavors such as python, spark, tensorflow and pyspark. Use the python flavor if you need to customize how the model is loaded (see the sketch below).
An example of an MLflow-packaged model
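
A minimal sketch of the python (pyfunc) flavor with a custom loader (class name, artifact path and conda file are illustrative):

```python
import mlflow.pyfunc

class ChurnModel(mlflow.pyfunc.PythonModel):
    """Custom wrapper that controls how the underlying model is loaded and scored."""

    def load_context(self, context):
        import joblib
        # "model_file" is the artifact key registered in save_model below
        self.model = joblib.load(context.artifacts["model_file"])

    def predict(self, context, model_input):
        return self.model.predict(model_input)

mlflow.pyfunc.save_model(
    path="churn_model_package",
    python_model=ChurnModel(),
    artifacts={"model_file": "outputs/model.pkl"},
    conda_env="conda.yml",          # all library dependencies with pinned versions
)

# Load and score locally without deploying anywhere, e.g. loaded.predict(sample_df)
loaded = mlflow.pyfunc.load_model("churn_model_package")
```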

One can go further than just packaging by creating scoring pipelines that take in a data stream, score it and return the result. Here are some methods to create scoring pipelines:

  • Real-time, event-triggered scoring pipeline: take advantage of the Event Grid mechanism in Azure, where data arriving at Storage, Event Hubs, etc. triggers execution of scoring with an Azure Function.
  • Batch scoring with the AML ParallelRunStep (PRS): large-volume, parallel scoring in which multiple workers run in parallel, instantiate the model package, and process and score the data.
  • Batch scoring with a Databricks/Spark Pandas UDF: as part of a Databricks data transformation pipeline, data can be scored using Pandas UDF techniques. The model can be instantiated with an iterator inside the UDF to reduce the frequency of potentially expensive initialization operations (see the sketch after this list).
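
A sketch of the iterator-style Pandas UDF (Spark 3+; the model URI and column names are illustrative, and `df` is assumed to be an existing Spark DataFrame with those columns):

```python
from typing import Iterator, Tuple

import mlflow.pyfunc
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def score_udf(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # Expensive initialization runs once per task, not once per batch
    model = mlflow.pyfunc.load_model("models:/churn-model/Production")
    for age, balance in batches:
        features = pd.DataFrame({"age": age, "balance": balance})
        yield pd.Series(model.predict(features))

# As part of a Databricks transformation pipeline:
scored = df.withColumn("score", score_udf("age", "balance"))
```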

6. Implement MLOps.

MLOps flow

Implement continuous integration (CI)

The first step in implementing MLOps is to automate code testing, model training and testing across environments when changes are delivered. The change may come from a code/configuration update, new training data becoming available or a timed event. The idea of CI is to minimize the manual effort needed to validate the entire pipeline so it can run repeatedly and reliably. Here are the steps to implement CI:

  1. Set up a CI/CD tool: At the heart of MLOps is a CI/CD tool such as Azure DevOps (ADO). The CI/CD tool connects to code/model repos and the various ML development services such as Azure ML and Databricks to monitor for changes, then orchestrates and executes tasks such as code validation, model training and model deployment across the multiple technologies used in ML development. The CI/CD tool needs to support common ML technology interfaces in the Azure environment, such as executing Python scripts, Azure ML pipelines, Databricks notebook runs, the Azure CLI and the Databricks CLI. It is important for ML engineers to be familiar with how to configure the CI/CD tool to work with these different ML technologies, some of which require tasks implemented by third parties.
  2. Structure ML development to be MLOps ready: MLOps requires the ML pipeline to be repeatedly runnable in an environment other than the data scientist’s own. An MLOps-ready pipeline should include the following (see the sketch after this list):
  • Parameterization of configurations and inputs that may change at each run, such as the learning rate, hyperparameters and input dataset
  • No environment-specific dependencies, such as a path to a local directory on the current machine or a model file in the current directory
  • A library dependency configuration that specifies all needed libraries with their required versions
  • A specification of the runtime environment, such as an AML compute cluster or Databricks cluster, in a property file or as parameters

3. Establish a stable structure for the elements of the pipeline while allowing the detailed implementation to evolve. For example, you can design a pipeline with predefined preprocessing, training and validation steps; the implementation of each step may change over time, but you rarely need to change the composition of the steps.

4. Make ML training modules runnable and deployable with well-defined inputs and outputs so that they can be rerun repeatedly. This is unlike the exploratory training environment, where the entire logic is contained in a notebook with multiple commands and may require code changes to rerun with a different configuration or in a different environment.

5. Modularization: code is structured into modules with well-defined interfaces to support parallel teamwork and rapid evolution.
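
A sketch of what an MLOps-ready, parameterized training script might look like (column names and defaults are illustrative):

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Everything that can change between runs or environments arrives as a parameter,
# never as a hard-coded local path or constant.
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str, required=True)
parser.add_argument("--output-dir", type=str, default="outputs")
parser.add_argument("--regularization", type=float, default=1.0)
parser.add_argument("--max-iter", type=int, default=200)
args = parser.parse_args()

df = pd.read_csv(args.input_data)
X, y = df.drop(columns=["label"]), df["label"]

model = LogisticRegression(C=args.regularization, max_iter=args.max_iter).fit(X, y)

os.makedirs(args.output_dir, exist_ok=True)
joblib.dump(model, os.path.join(args.output_dir, "model.pkl"))
```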

Implement continuous deployment (CD)

The result of CI could be a newly produced model that triggers deployment/redeployment to the currently operating environment. The subsequent step is normally to validate whether the new model is worth deploying by testing it on a test dataset and comparing its performance against a threshold, either a predefined one or the performance of the currently deployed model. A manual approval step can be introduced here in addition to automated testing and performance comparison. You can set up this step at the end of CI or at the beginning of CD. A common technique is to tag the model with performance data and other metadata so that it can be compared with later versions.
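
A hedged sketch of such a gate using model tags in the Azure ML registry (model name, tag key and metric value are illustrative):

```python
from azureml.core import Workspace, Model

ws = Workspace.from_config()
new_accuracy = 0.91   # metric produced by the evaluation step (illustrative value)

try:
    current = Model(ws, name="churn-model")                   # latest registered version
    current_accuracy = float(current.tags.get("accuracy", 0))
except Exception:
    current_accuracy = 0.0                                    # nothing registered yet

if new_accuracy <= current_accuracy:
    raise SystemExit("New model does not beat the current one; stopping the release.")

# Register the winner with its performance tags for the next comparison
Model.register(workspace=ws,
               model_path="outputs/model.pkl",
               model_name="churn-model",
               tags={"accuracy": str(new_accuracy)})
```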

When the model is good enough, a deployment to the QA environment should be done before deploying to production. Here, the DevOps tool can automate the steps of packaging and deploying the model using a framework API such as the Azure ML deployment API or the MLflow API. This stage also needs to be carefully built because it can be complex.

This step requires several elements:

  • A packaging module that packages the model and its dependencies and produces a package that follows the MLflow or AML packaging standard (see the packaging section above).
  • A deployment module that builds the deployment environment from a specification and then deploys the packaged model. This can be done with the AML framework when deploying to an AML inference environment, or with ADO to build the required Databricks cluster and copy the packaged code when deploying to Databricks.
  • Optionally, a smoke test to make sure that the deployed model works correctly (see the sketch after this list).
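
A minimal sketch of the deployment and smoke-test steps with the Azure ML SDK, targeting ACI for QA (service name, entry script and sample payload are illustrative):

```python
from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="churn-model")    # latest registered version

env = Environment.from_conda_specification("churn-env", "conda.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)
deploy_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

service = Model.deploy(ws, "churn-service-qa", [model],
                       inference_config, deploy_config)
service.wait_for_deployment(show_output=True)

# Smoke test: call the service with a small sample before promoting further
print(service.run('{"data": [[42, 1800.0]]}'))
```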

When deploying to production, depending on what the target environment supports, an advanced deployment strategy such as canary deployment can be used to gracefully and safely transition from the old model to the new one.

Reference implementations: microsoft/MLOpsPython: MLOps using Azure ML Services and Azure DevOps (github.com)

7. Democratize Model Discovery and Consumption

In a multi-team data science environment, great models can end up buried within the environment and knowledge of the team that developed them. Only that team knows how to deploy and use the model. If the team used proprietary methods and technologies to deploy, it is difficult even for them to adapt their deployment method to a different business or technology scenario.

To promote reuse and sharing of knowledge, it is a good practice to make these assets available for discovery and to simplify the process of deploying and consuming them.

A shared workspace can be used to host models, code repos, pipelines and demo datasets and, with the power of ADO, give any data science or business team the ability to quickly deploy the template to their own environment and use it with minimum effort. To make it more friendly, a web service can be developed to host a solution/model gallery, model search capabilities and a UI for one-click deployment.

Teams’ solutions can easily be promoted from their workspaces to the gallery if they all follow the standardized conventions described above.

Conceptual model for the gallery
AI gallery implemented in Microsoft Azure
Model discovery service
One click deployment
