Data Science on Azure with Metaflow
In this post, we are going to introduce Metaflow - data scientists’ new best friend. It makes infrastructure management easy, allowing data scientists to focus on delivering value. Metaflow recently added support for Azure as a backend, and we took it for a spin. Here are our impressions.
Data is becoming an integral part of businesses across a variety of industries. Consequently, data scientists are hired to make use of the data. They combine methods such as statistics and artificial intelligence (AI) with programming to extract insights from data. Data science is a quickly evolving field, meaning that development practices can vary greatly between different companies and even between individual data scientists within the same organization.
While the practices and tools may be disparate, most data science applications do have some common requirements for the development stack. These items are as follows:
Data - discovery and access
Compute - data science is compute-heavy and needs lots of computing capacity
Orchestration - organized execution of the multiple units of data processing steps
Versioning - it should be possible to track projects and their iterations
Deployment - moving the application to the production environment to deliver actual value
Modeling - quality and performance optimization of the models
This could be called the full stack of data science. Each layer of the stack is necessary to make it possible for data scientists to do their jobs effectively. To enable all the layers of the stack, both infrastructure and tools that make use of it are needed.
There exists a plethora of tools that focus on some areas of the stack. These may be paid services or open-source software. Additionally, each cloud vendor has its own set of tools. These tools are usually quite comprehensive, but at the same time, they are tightly integrated into the cloud platform.
Tight vendor lock-in might be counterproductive, especially for fast-moving organizations. Besides, you never know when your favorite proprietary tool will be given the end-of-life treatment. Development practices can be bound directly to the cloud vendor’s way of doing things, making it hard to change providers later on.
Metaflow is an open-source data science framework that was initially developed by Netflix. Its development is now continued by Outerbounds. Metaflow provides a platform that allows data scientists to focus on delivering end-to-end data science workflows, from experiments to production. Metaflow’s features cover the whole data science stack and it can be adapted to different kinds of use cases.
Metaflow projects are composed of flows that are modeled as steps of a directed acyclic graph. Graphs are a very robust way of defining workflows as they are easy to visualize, test and optimize. The data, arguments and parameters that are loaded by a flow are versioned by Metaflow. Additionally, all intermediate and output data can be compressed and archived. Metaflow orchestrates the execution of the flows, runs steps of the flow in parallel when possible, and tracks flow versions. While flows can be executed locally, sometimes additional tracking, computing capacity and storage are required. This is where Metaflow’s cloud backends step in.
Metaflow integrates with AWS and now also with Azure. This makes it possible to utilize cloud computing capacity to run the flows at scale. In addition to the compute layer, also data storage, user interface, as well as versioning service, can be run in the cloud. This is beneficial for larger data science teams since all the runs and data are stored in a centralized location, providing a boost to collaboration and experiment tracking.
As the final step, the data science project can be deployed with Metaflow to the cloud for production use. The project could be, for example, a machine learning model that is used by other applications. After deployment, Metaflow’s tracking features can be used to track the model’s performance when new versions of the flow are developed.
Metaflow on Azure: Comparison to Azure Machine Learning
Metaflow was integrated previously only with AWS, but now it also comes with Azure support. This is beneficial for the many organizations that already use Azure as their main cloud. Metaflow's Azure integration makes it possible to keep all your precious data inside a single cloud system for data science tasks. Azure’s competing proprietary offering is Azure Machine Learning (AML). What are the advantages and disadvantages of Metaflow when it is compared to Azure Machine Learning?
The first advantage of Metaflow is that you have more freedom. Metaflow is open source while AML requires that you use many Azure-specific Python libraries when developing your data science models. Besides vendor lock-in, the AML Python SDK v2 is still in preview and is hence not recommended for production workloads. AML runs “pipelines” that are similar to Metaflow’s flows. However, Metaflow is easier to operate and the flows are simpler to write compared to AML. The programmatic workflow creation experience in Metaflow is very pleasant, and it allows data scientists to focus on their job without cluttering the code with infrastructure details.
AML Python SDK can be used to create managed compute instances and clusters programmatically. Metaflow currently uses Terraform templates to create Azure compute clusters. Metaflow’s Azure integration uses Azure Kubernetes Service (AKS) to enable the cloud computation layer and scaling. After the initial resource creation, AKS can be practically invisible to the end user. Importantly, with Metaflow the development experience remains almost identical when switching between local and cloud backends.
Metaflow’s Azure integration is still in the early stages. From this perspective, AML provides much more mature integration to various Azure services and a comprehensive visual designer, Azure ML Studio, that runs in your browser. While AML browser-based cloud environment is great, the local development experience is not as robust as in Metaflow. Metaflow provides also a user interface, where it is possible to browse flow executions, although it is not yet integrated into Azure features such as Active Directory.
In summary, Metaflow provides a very attractive and comprehensive data science offering for organizations that are using Azure as their cloud. And it’s open source!
When it's time for you to level up your data science architecture and properly leverage your data, we'll be happy to chat!