Many organizations have adopted machine learning (ML) in a piecemeal fashion, building or buying ad hoc models, algorithms, tools, or services to accomplish specific goals. This approach was necessary as companies learned about the capabilities of ML and as the technology matured, but it has also created a hodgepodge of siloed, manual, and nonstandardized processes and components within organizations. The result can be inefficient, cumbersome services that fail to deliver on their promised value or that stall innovation entirely.
As businesses look to scale ML applications across the enterprise, they need to better automate and standardize tools, processes, and workflows. They need to build and deploy ML models quickly, spending less time manually training and monitoring models and more time on value-driving, revenue-generating innovation. Developers need access to the data that will power their ML models, to work across lines of business, and to collaborate transparently on the same tech stack. In other words, businesses need to adopt best practices for machine learning operations (MLOps): a set of software development practices that keep ML models running effectively and with agility.
The main function of MLOps is to automate the more repeatable steps in the ML workflows of data scientists and ML engineers, from model development and training to model deployment and operation (model serving). Automating these steps creates agility for businesses and better experiences for users and end customers, increasing the speed, power, and reliability of ML. These automated processes can also mitigate risk and free developers from rote tasks, allowing them to spend more time on innovation. This all contributes to the bottom line: a 2021 global study by McKinsey found that companies that successfully scale AI can add as much as 20 percent to their earnings before interest and taxes (EBIT).
“It’s not uncommon for companies with sophisticated ML capabilities to incubate different ML tools in individual pockets of the business,” says Vincent David, senior director for machine learning at Capital One. “But often you start seeing parallels: ML systems doing similar things, but with a slightly different twist. The companies that are figuring out how to make the most of their investments in ML are unifying and supercharging their best ML capabilities to create standardized, foundational tools and platforms that everyone can use, and ultimately create differentiated value in the market.”
In practice, MLOps requires close collaboration between data scientists, ML engineers, and site reliability engineers (SREs) to ensure consistent reproducibility, monitoring, and maintenance of ML models. Over the last several years, Capital One has developed MLOps best practices that apply across industries: balancing user needs, adopting a common, cloud-based technology stack and foundational platforms, leveraging open-source tools, and ensuring the right level of accessibility and governance for both data and models.
Understand different users’ different needs
ML applications generally have two main types of users—technical experts (data scientists and ML engineers) and nontechnical experts (business analysts)—and it’s important to strike a balance between their different needs. Technical experts often prefer complete freedom to use all tools available to build models for their intended use cases. Nontechnical experts, on the other hand, need user-friendly tools that enable them to access the data they need to create value in their own workflows.
To build consistent processes and workflows while satisfying both groups, David recommends meeting with the application design team and subject matter experts across a breadth of use cases. “We look at specific cases to understand the issues, so users get what they need to benefit their work, specifically, but also the company generally,” he says. “The key is figuring out how to create the right capabilities while balancing the various stakeholder and business needs within the enterprise.”
Adopt a common technology stack
Collaboration among development teams—critical for successful MLOps—can be difficult and time-consuming if these teams are not using the same technology stack. A unified tech stack allows developers to standardize, reusing components, features, and tools across models like Lego bricks. “That makes it easier to combine related capabilities so developers don’t waste time switching from one model or system to another,” says David.
A cloud-native stack—built to take advantage of the cloud model of distributed computing—allows developers to self-service infrastructure on demand, continually leveraging new capabilities and introducing new services. Capital One’s decision to go all-in on the public cloud has had a notable impact on developer efficiency and speed. Code releases to production now happen much more rapidly, and ML platforms and models are reusable across the broader enterprise.
Save time with open-source ML tools
Open-source ML tools (code and programs freely available for anyone to use and adapt) are core ingredients in creating a strong cloud foundation and unified tech stack. Using existing open-source tools means the business does not need to devote precious technical resources to reinventing the wheel, which quickens the pace at which teams can build and deploy models.
To complement its use of open-source tools and packages, David says, Capital One also develops and releases its own tools. For example, to manage streams of dynamic data too large to manually monitor, Capital One built an open-source data profiling tool that uses ML to detect and protect sensitive data like bank account and credit card numbers. Additionally, Capital One recently released the open-source library rubicon-ml, which helps capture and store model training and execution information in a repeatable and searchable way. Releasing its own tools as open source ensures that Capital One builds ML capabilities that are flexible and repurposable (by others, as well as across its own businesses) and allows the company to connect with and contribute to the open-source community.
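As a rough illustration of what capturing training information "in a repeatable and searchable way" can look like, here is a minimal sketch using only the Python standard library. It does not use or reproduce rubicon-ml's API; the function names `log_run` and `find_runs`, the JSON Lines file layout, and the record fields are all hypothetical choices made for this example.

```python
import json
import time
from pathlib import Path


def log_run(path, model_name, params, metrics):
    """Append one training run to a JSON Lines file (hypothetical layout)."""
    record = {
        "model": model_name,
        "params": params,      # hyperparameters used for this run
        "metrics": metrics,    # evaluation results for this run
        "logged_at": time.time(),
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def find_runs(path, **param_filters):
    """Return every logged run whose params match all the given filters."""
    runs = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if all(record["params"].get(k) == v for k, v in param_filters.items()):
                runs.append(record)
    return runs
```

Because every run is appended as a self-describing record, a team could later ask, for instance, `find_runs("runs.jsonl", max_depth=3)` to retrieve all runs trained with that setting. A dedicated library would add much more, such as storage backends and a user interface.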
Enable data accessibility while prioritizing governance
A typical ML system includes a production environment (processing data in real time) and an analytical environment (a store of data with which users can work). For many organizations, the lag time between these environments is a significant pain point. When data scientists and engineers need access to near-real-time data from the production environment, it’s important to set up appropriate controls on that access.
ML developers thus need to ensure integration and access to both environments without compromising governance integrity. “In an ideal world, the organization would establish a seamless integration between production data stores and analytical environments that can enforce all the controls and governance frameworks that the data scientists, engineers, and other stakeholders involved in the model governance process need,” says David.
Governing and managing the ML models themselves is equally important. As a model learns and as input data changes, its behavior tends to drift; traditionally, engineers have had to monitor and correct for that drift manually. MLOps practices, by contrast, help automate the management and training of models and workflows. An organization adopting MLOps could determine for each ML use case what needs to be monitored, how often, and how much drift to allow before retraining is required. It can then configure tools to automatically detect triggers and retrain models at an appropriate cadence.
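The idea of a per-use-case drift threshold that triggers retraining can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a description of Capital One's tooling: the drift measure chosen here is the population stability index (PSI), the 0.2 cutoff is a common rule of thumb rather than a figure from this article, and `DriftPolicy`, `psi`, and `check_and_retrain` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DriftPolicy:
    """Hypothetical per-use-case monitoring settings."""
    feature: str
    psi_threshold: float = 0.2  # assumed cutoff, a rule of thumb
    bins: int = 10


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population stability index of `current` against `baseline`."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if baseline is constant

    def shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids taking log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def check_and_retrain(policy: DriftPolicy,
                      baseline: Sequence[float],
                      current: Sequence[float],
                      retrain: Callable[[], None]) -> bool:
    """Call `retrain` when measured drift exceeds the policy's threshold."""
    if psi(baseline, current, policy.bins) > policy.psi_threshold:
        retrain()
        return True
    return False
```

In a real pipeline, a scheduler would run a check like this at the cadence chosen for each use case, and `retrain` would start a training job rather than a local function. The monitored quantity could just as well be a model-quality metric instead of an input feature.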
In the early days of ML, companies took pride in their ability to develop new and bespoke solutions for different parts of the business. But now companies seeking to scale ML in a well-governed, nimble way have to account for continuous updates to data sources, ML models, features, pipelines, and many other aspects of the ML model lifecycle. With its potential to offer standardized, reproducible, and adaptable processes across large-scale ML environments, MLOps could unlock the future of enterprise machine learning.
This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.