The Data Scientist

ML Platform for Business

 

Abstract

This article outlines a framework for developing an enterprise-grade ML platform, focusing on the key capabilities required to accelerate and improve machine learning implementation. It provides a high-level overview of the necessary components, serving as a strategic playbook for ML practitioners, including researchers, architects, engineers, and product managers. The goal is to guide the construction of an ML platform that directly supports organizational goals and objectives.

Introduction

Many organizations mistakenly believe that building the ML model alone is sufficient for successful machine learning implementation. However, the truth is that developing the model is the easier aspect, while the real challenge lies in establishing an infrastructure that enables the model to integrate seamlessly with other system components of an enterprise to address real-world issues. In practice, deploying ML at scale requires a robust ML platform, but what precisely do we mean by the term “ML platform”? Let’s attempt to provide a definition.

ML platform: A set of tools & technologies that can facilitate the development, deployment, and serving of ML models.

ML platform requirements vary across companies, and no single solution fits all. The structure of an ML platform differs greatly between small or medium-sized companies and large technology companies. The importance of machine learning within an organization also affects how complex and advanced its ML platform needs to be. To offer general guidance, I propose a simplified representation with three stages (as shown in Figure 1). It emphasizes that once an organization reaches a specific stage in its ML journey, a Generic ML platform becomes essential for scaling further.

While the core functional domains (development, deployment, serving) are the same for Generic and Specialized ML platforms, a Specialized ML platform adds advanced capabilities, e.g., embedding models, diffusion models, and vector DBs, to support use cases like ChatGPT and DALL-E 2. Attempting to cover the capabilities of both the Generic and Specialized ML platforms in a single framework would not do justice to either. Hence this white paper focuses on the Generic ML platform and centers on the following aspects –

  1. Emphasize the reasons why organizations require an ML platform and the business value it provides.
  2. Gain an understanding of user personas and identify target use cases.
  3. Develop the conceptual model of the ML platform.
  4. Establish the components and capabilities of the ML platform.
  5. Finally, we’ll explore how to start building an ML platform within an organization, along with key recommendations.

 

Why organizations require ML platform

Let’s start with a scenario that most organizations face when developing an ML model to solve one of their use cases. Suppose a team has developed a model. Now what?

  • How to make data continuously accessible for model training?
  • How to store and track the used features in a model?
  • How to evaluate models, both offline and online?
  • How can PMs, researchers, and SMEs validate the model’s accuracy (without a data scientist’s help)?
  • How to keep track of all the deployed models and their versions?
  • How can the model serve online and batch prediction?
  • How to integrate the outputs of models with various products, surfaces, and services within the organization?
  • How to apply additional business rules to the model’s output?
  • How to continually monitor and deploy changes to ML workflows?

 

It is crucial to address these questions early; otherwise, an organization may find itself developing numerous ML models of which only a few make it to production, and even fewer (if any) effectively solve real business problems. A scalable, reliable, and maintainable ML platform provides the answers needed to improve that success rate. The ML platform can be visualized as a collection of building blocks (as shown in Figure 2), where each block represents a service that addresses one or more of the above questions. The size of each block corresponds to the effort and complexity associated with that service.

 

As you can see, only a small fraction of a real-world ML platform is composed of ML code, as shown by the small black block in the middle. The coding effort is also gradually diminishing with the emergence of Large Language Models (LLMs). The required surrounding infrastructure, however, is vast and complex, and how well its parts interoperate defines the success of an ML platform. We will discuss this in detail below, but for now it is important to understand that an organization needs an ML platform in place before it becomes ambitious about executing ML projects successfully.

Business Value

Some of the key business values that an organization can derive from a well-designed ML platform are –

  • Address the “last mile” challenge: efficiently deploy and manage more models in production through standardized tools, technologies, and frameworks.
  • Achieve high consistency, repeatability, and scalability in ML effort: enhance model performance through rapid development iterations and scale workflows to accommodate complex model architecture.
  • Promote collaboration and knowledge sharing: broaden accessibility of ML to more data scientists, engineers, and other stakeholders to foster teamwork and accelerate ML innovations.

 

User Personas and Target Use Cases

Before we design the conceptual model for the ML platform, let’s first look at its user personas and target use cases.

User Personas:

Constructing an ML platform aims to improve accessibility across teams, not just for data scientists. Users include data scientists, ML engineers, SMEs, researchers, product managers, software developers and sometimes business leaders. Their collective contributions drive the success of ML projects.

These ML practitioners have varied skill sets, preferences, and requirements. Data scientists often excel with statistical tools and Jupyter notebooks but may be less comfortable with ML model version control. Software developers, on the other hand, are skilled in coding ML serving components but may have less expertise with statistical tools. Non-ML users such as SMEs and PMs are more interested in exploring the data and models through an intuitive UI, without writing SQL queries or opening a Jupyter notebook.

To achieve organizational success, it’s important to understand the user personas and meet their needs. The key is to leverage individual strengths and provide a supportive platform that accommodates preferred working methods for all. 

Target Use Cases:

Now, let’s shift our attention to the types of use cases in the real world. When I mention “use cases,” I am not referring to individual business problems. Instead, I will concentrate on how teams put their models into action so that they can be used by others and generate real value. There are two ways to put a model into action – batch inference or online inference. Ultimately, every business use case belongs to one of these categories. Let’s explore the details below.

Batch Inference

  • Predictions are generated on a batch of observations (thousands or millions).
  • Batch jobs are often triggered on a recurring schedule (e.g., hourly, daily).
  • Model predictions are stored in databases and made available to developers and end users.
  • The SLA for prediction latency is measured in hours (e.g., 12 hours).

 

Examples: customer segmentation, sales forecasting, demand forecasting, churn scores etc.

 

Online Inference

  • Predictions are generated on a single observation of data at run time. 
  • Can be generated at any time of the day on demand.
  • Model predictions are consumed in-moment.
  • The SLA for prediction latency is measured in milliseconds, typically a few hundred at most (e.g., 100 ms).

 

Examples: real-time fraud detection, personalized recommendations, chatbot responses, real-time sentiment analysis etc.
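
To make the contrast concrete, here is a minimal sketch of the two modes sharing one model. The function `predict_one` and its feature names (`recency`, `frequency`) are hypothetical stand-ins for a trained model, not part of any specific platform.

```python
from typing import Iterable


def predict_one(features: dict[str, float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return 0.5 * features["recency"] + 0.25 * features["frequency"]


def online_inference(request: dict[str, float]) -> float:
    """Online: score a single observation on demand; the caller uses it in-moment."""
    return predict_one(request)


def batch_inference(rows: Iterable[dict[str, float]]) -> list[float]:
    """Batch: score many observations in one scheduled run; results would be stored."""
    return [predict_one(row) for row in rows]


rows = [
    {"recency": 2.0, "frequency": 4.0},
    {"recency": 1.0, "frequency": 2.0},
]
batch_scores = batch_inference(rows)      # e.g., written to a prediction store
online_score = online_inference(rows[0])  # e.g., returned in an API response
```

The model logic is identical in both modes; what changes is the trigger (schedule versus request), the volume per call, and the latency budget.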

 

ML Platform – Conceptual Model

Until now, we have explored the importance of an ML platform, identified its potential users, and outlined the target use cases. We will now focus on designing the conceptual model of the platform, following a top-down approach:

  1. To begin with, we will define the high-level functional domains of machine learning.
  2. Next, we will identify critical components within the platform supporting each functional domain.
  3. Lastly, we will integrate all the identified components to create the conceptual model of the platform.

 

A. Define the functional domains

Illustrated in Figure 5, machine learning comprises three fundamental functional domains: Development, Deployment, and Serving. The ML platform we are designing must possess the capabilities to encompass all three domains effectively. Let’s delve into each of these functional domains.

 

Development

It refers to the process of creation and design of machine learning algorithms and models, utilizing data to extract valuable insights, make predictions, and address specific problems. 

The output of this process is a “Trained Model”.

Deployment

It refers to the process of making trained machine learning models available for production use so that they can take an input and return an output. 

The output of this process is “Model Endpoint (HTTP)”.

Serving

It refers to the process in which a deployed model, together with a serving infrastructure, generates actionable real-time or batch predictions for clients or applications. 

The output of this process is the “Prediction or Recommendation”.

B. Identify critical components

Now, let us examine the components within the ML platform that will bolster each functional domain. It is imperative that all the components within a specific functional domain collaborate harmoniously to ensure the successful delivery of the desired output.

Example: 

The components within the ‘Development’ functional area should collaborate cohesively to generate the ultimate output, which is a “Trained Model”. This trained model then serves as the input for the ‘Deployment’ functional area and so on, establishing a seamless flow between different functional domains as depicted in Figure 6.

 

C. Integrate all

This aspect can be challenging, as it requires building integrations that ensure the ML platform functions as a unified infrastructure. Although the diagram below (see Figure 7), which shows all the integrations, may appear complex, we will examine them individually in more detail. Before that, it is important to highlight a significant design principle known as “modular isolation.”

Modular isolation: We need to preserve modularity while designing the platform, which means separating each component to improve the reusability, functionality, and efficiency of the service. Each component of the ML platform should be well-defined so that it is easier to fix, improve, and integrate with other components or services.
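
As a minimal sketch of this principle, the serving component below depends only on a narrow interface rather than on a concrete store, so either side can be replaced or tested on its own. The class names and the `get_features` method are illustrative assumptions, not a prescribed API.

```python
from typing import Callable, Protocol


class FeatureStore(Protocol):
    """Hypothetical contract: anything that can return features for an entity."""

    def get_features(self, entity_id: str) -> dict[str, float]: ...


class InMemoryFeatureStore:
    """One interchangeable implementation; a real one might wrap a low-latency DB."""

    def __init__(self, rows: dict[str, dict[str, float]]) -> None:
        self._rows = rows

    def get_features(self, entity_id: str) -> dict[str, float]:
        return self._rows.get(entity_id, {})


class OnlineModelService:
    """Depends only on the FeatureStore contract, not on a concrete store."""

    def __init__(self, store: FeatureStore, model: Callable[[dict[str, float]], float]) -> None:
        self._store = store
        self._model = model

    def predict(self, entity_id: str) -> float:
        return self._model(self._store.get_features(entity_id))


service = OnlineModelService(
    InMemoryFeatureStore({"u1": {"x": 2.0}}),
    model=lambda features: features.get("x", 0.0) * 3,
)
prediction = service.predict("u1")
```

Swapping `InMemoryFeatureStore` for another implementation requires no change to `OnlineModelService`, which is the practical payoff of keeping component boundaries well-defined.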

 

The diagram focuses on 3 key workflows –

  • Model development & deployment: shown by solid arrows
    • This represents the process of development and deployment of an ML model to production.
    • The workflow is non-linear and iterative. For example, during Feature Engineering, if the available data is inadequate for generating features, it may be necessary to revisit the Data Collection process to gather new data. Likewise, during Model Training, if the existing features do not yield satisfactory results, generating new features may be required.
  • Prediction: shown by dashed arrows
    • This represents the process of serving the model, either in batch or online; also known as model inference.
    • This flow is triggered by a batch prediction job or by a client application shown above in circular shapes.
  • Data: shown by dotted arrows
    • This represents the process of data movement between components
    • The data (raw data, features, model predictions) eventually gets stored in storage solutions, e.g., Feature Stores (online & offline), Batch Prediction Store, Tracking Store, Metadata Store etc.

ML Platform – Components & Capabilities

Below is a list of the 27 key components shown in the diagram above (Figure 7). The 4 external system components are not part of the ML platform, but they are essential for a comprehensive view. The integrations between the ML platform and these external systems are crucial in an organizational setting.

 

The table presented below offers descriptions and capabilities of each component within the ML platform. The capabilities are expressed as high-level features and do not delve into specific details. This should give an execution team enough directional guidance on the capabilities they should focus on building in an ML platform.

Development

Data Collection
Description: The process of ingesting data from offline data sources and applying basic standardization and enrichment techniques.
Key Capabilities:
  • Data connectors to ingest data (high volume) from
    • data warehouse or data lake
    • Batch Prediction Store
  • Standardization and enrichment
    • column name, file format (Parquet, CSV)
    • anonymization of PII data (if any)

Data Staging Layer
Description: A data staging layer in the ML platform, where the standardized and enriched data are stored.
NOTE – Organizations typically maintain their own enterprise data lake or data warehouse, serving as a central repository for curated data. However, in practice, there arises a need to create a “Data Staging Layer” within the ML platform to consolidate and store training data from diverse sources, including streaming and batch data.
Key Capabilities:
  • Data sync
    • should be kept updated with source(s) to avoid data staleness
  • Store and access
    • object storage (JSON, CSV)
    • high-throughput writes
    • DB, tables on top of data storage
    • SQL-like interface for querying/exploring
    • access for “Feature Engineering”
  • Security & Compliance
    • GDPR (maintain, purge)
    • security (role-based access control)

Feature Engineering
Description: The process of transforming data to create features for ML. This step involves cleaning, reducing, and transforming the raw data into meaningful ML features for use in model training.
NOTE – Data Collection and Feature Engineering are time-consuming processes and may take up to 60-70% of the effort of data scientists and ML engineers. It is also typical for this work to be repeated by different teams within an organization who use the same data to build ML models for different solutions.
Key Capabilities:
  • Automated feature engineering
  • Manual feature engineering
  • Handling “cut-off time”
    • filter out any data after “cut-off time”
  • Primitives (transformation & aggregation)
    • in-built
    • custom
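
The “cut-off time” capability can be illustrated with a small sketch: features for a training example are aggregated only from events at or before the cut-off, so nothing from after that moment leaks into training. The event fields (`user`, `ts`, `amount`) and feature names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def build_features(events: list[dict], cutoff: datetime) -> dict[str, dict[str, float]]:
    """Aggregate per-user features using only events at or before the cut-off time."""
    features: dict[str, dict[str, float]] = defaultdict(
        lambda: {"purchase_count": 0, "total_amount": 0.0}
    )
    for event in events:
        if event["ts"] > cutoff:
            continue  # skip data after the cut-off to avoid leakage
        row = features[event["user"]]
        row["purchase_count"] += 1
        row["total_amount"] += event["amount"]
    return dict(features)


events = [
    {"user": "u1", "ts": datetime(2022, 9, 1), "amount": 20.0},
    {"user": "u1", "ts": datetime(2022, 9, 20), "amount": 5.0},
    {"user": "u1", "ts": datetime(2022, 10, 5), "amount": 99.0},  # after the cut-off
]
training_features = build_features(events, cutoff=datetime(2022, 9, 30))
```

Count and sum here play the role of aggregation primitives; a platform would typically offer a library of such primitives plus a way to register custom ones.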

Offline Feature Store (for Training)
Description: A repository service that securely stores, updates, retrieves, and shares ML features.
NOTE – Consistent features are needed between different parts of an organization, and between training and inference for any given ML model. Hence, reusability of this service is extremely important for cost reduction, as it eliminates duplicate feature engineering effort and storage overhead.
Key Capabilities:
  • Store and access
    • DB, tables on top of feature storage
    • SQL-like interface for querying/exploring
    • provide access for “Model Training”
  • Catalog
    • tags and indexes feature groups (logical groupings of ML features)
    • discoverable UI to browse features (group name, tags, version, creation time, last update time)
  • Lineage tracking
    • source data used (from the Data Staging Layer)
    • processing code
    • usage in models and endpoints
  • Feature consistency with “Online Feature Store”
    • sync the latest feature snapshot to “Online Feature Store” in batches
  • Historical data access
    • access to historical feature values
    • re-create features at specific points of time in the past
  • Security & Compliance
    • data encryption
    • data access based on roles
    • table-level access controls
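
A rough sketch of historical data access follows, assuming each feature keeps a timestamped history. The `FeatureHistory` class and its methods are illustrative only; a production store would sit on top of a database rather than in-memory lists.

```python
from bisect import bisect_right
from typing import Optional


class FeatureHistory:
    """Timestamped values of one feature for one entity (hypothetical structure)."""

    def __init__(self) -> None:
        self._timestamps: list[int] = []
        self._values: list[float] = []

    def write(self, timestamp: int, value: float) -> None:
        # Keep both lists sorted by timestamp, even for late-arriving writes.
        idx = bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(idx, timestamp)
        self._values.insert(idx, value)

    def as_of(self, timestamp: int) -> Optional[float]:
        """Value as it was known at `timestamp` (point-in-time lookup)."""
        idx = bisect_right(self._timestamps, timestamp)
        return self._values[idx - 1] if idx else None

    def latest(self) -> Optional[float]:
        """Latest snapshot, i.e., what would be synced to the online store."""
        return self._values[-1] if self._values else None


history = FeatureHistory()
history.write(10, 1.0)
history.write(30, 3.0)
history.write(20, 2.0)  # late-arriving write
value_at_25 = history.as_of(25)
```

Here `as_of` corresponds to re-creating a feature at a point in the past, while `latest` is the snapshot an offline store would push to the online store in batches.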

Model Training
Description: The learning process of an ML model is called training. This is the stage where the ML algorithm is trained by feeding it the feature datasets.
Key Capabilities:
  • Accessibility
    • access to “Offline Feature Store”
  • Algorithm availability
    • in-built
    • custom (bring your own algorithm)
  • Notebook interface for coding
    • Jupyter notebook type interface
  • Model validation
    • with validation and test data
  • Publish to Model Registry
    • publish trained model details along with evaluation metrics

Trained Model
Description: A trained model is a model that has been developed but has not undergone the evaluation process yet.
NOTE – A model is represented by its artifacts (parameters, model definition etc.) stored in a specific file format, e.g., *.tar.gz, or in a folder.
Key Capabilities:
  • Portability
    • transferable from one system to another
  • Configuration
    • all the dependencies for the model should be packaged into one file/folder

Model Evaluation
Description: The process of assessing the performance and effectiveness of a machine learning model by humans (SMEs, PMs).
Key Capabilities:
  • Human-centric model evaluation (UI)
    • evaluate model output for given inputs
    • human feedback data collection
  • Publish to Model Registry
    • publish evaluated model details along with evaluation metrics

Evaluated Model
Description: An evaluated model is a model that has undergone the evaluation process after training.
NOTE – A model is represented by its artifacts (parameters, model definition etc.) stored in a specific file format, e.g., *.tar.gz, or in a folder.
Key Capabilities:
  • Portability
    • transferable from one system to another

Deployment

Model Registry
Description: A centralized repository used to store, manage, and track models and their associated metadata, e.g., name, version, performance metrics, training data and other relevant information.
NOTE – It serves a function analogous to version control systems (e.g., Git, SVN) in traditional software.
Key Capabilities:
  • Development and deployment integration
    • integrate and store trained models
    • integrate and store evaluated models
    • integrate and store deployed models
    • make models available for deployment
  • Store, manage and track models
    • version
    • training dataset version
    • model owner
    • model source code version
    • evaluation metrics
    • environment details (stg, prod)
    • version change history
    • tags (if any)
    • comments (if any)
  • Model Catalog (UI)
    • discoverable UI to browse models
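
The registry’s core bookkeeping can be sketched in a few lines. This in-memory version is only an illustration under assumed names (`ModelVersion`, `ModelRegistry`, and the stage labels); a real registry would persist entries and store artifact locations as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelVersion:
    name: str
    version: int
    owner: str
    metrics: dict[str, float]
    stage: str = "trained"  # hypothetical stages: trained, evaluated, stg, prod


class ModelRegistry:
    """In-memory sketch: tracks versions, owners, metrics, and stage per model."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ModelVersion]] = {}

    def register(self, name: str, owner: str, metrics: dict[str, float]) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, owner, metrics)
        versions.append(entry)
        return entry

    def promote(self, name: str, version: int, stage: str) -> ModelVersion:
        entry = self._versions[name][version - 1]  # versions are 1-based
        entry.stage = stage
        return entry

    def latest(self, name: str, stage: Optional[str] = None) -> Optional[ModelVersion]:
        candidates = [
            v for v in self._versions.get(name, []) if stage is None or v.stage == stage
        ]
        return candidates[-1] if candidates else None


registry = ModelRegistry()
registry.register("churn", owner="team-a", metrics={"auc": 0.81})
registry.register("churn", owner="team-a", metrics={"auc": 0.84})
registry.promote("churn", version=1, stage="prod")
```

With this shape, the deployment process can ask for the latest version in a given stage, and the catalog UI can list every version with its owner and metrics.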

Endpoint Config
Description: A configuration file that stores the information needed to deploy an ML model, e.g., model name, variant, variant weight, compute instance type, no. of instances etc.
Key Capabilities:
  • Accessibility
    • to “Model Deployment” process

Model Deployment
Description: The process of making a trained machine learning model available and operational for use in real-world applications, where it can take in an input and return an output. The output of the deployment process is either an online or a batch HTTPS endpoint.
Key Capabilities:
  • Multiple model deployment frameworks
    • instance based
    • serverless
    • containerized, e.g., Docker
  • Stage/prod deployments
    • ability to deploy model in stage or prod
    • ability to move model from stage to prod
  • Tests before deploying
    • ability to insert API and smoke tests before deploying
  • Publish to Model Registry
    • publish deployed model details

Deployed Model
Description: A deployed model is the model that is finally put into production through the model deployment process.
NOTE – A model is represented by its artifacts (parameters, model definition etc.) stored in a specific file format, e.g., *.tar.gz, or in a folder.
Key Capabilities:
  • Portability
    • transferable from one system to another
  • Performance
    • a model should retain its quantitative performance with growing needs in production
  • Configuration
    • all the dependencies for the model should be packaged into one file/folder

Model Endpoint(s)
Description: The HTTPS endpoint that services call to make an inference (get a prediction) from the ML model.
NOTE – The endpoint is an interface to interact with models that are deployed for online or batch inference. While the online deployment expects one request at a time, the batch deployment can process a batch of requests at a time, i.e., the input can be an Excel/CSV file with multiple records and the response can also be an Excel/CSV file.
Key Capabilities:
  • Deployment support (behind one endpoint)
    • single model
    • multiple models
    • multiple variants of the same model
    • combination of models and variants
  • A/B testing
    • route requests between models
    • collect relevant metrics
    • statistical analysis of collected metrics
    • decision making based on results, i.e., deploy the new version, roll back to the previous version, or iterate further
  • Auto scaling
    • manual, schedule, demand, predictive
  • Load balancing
    • efficient distribution of traffic
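
As an illustration of how an endpoint config and A/B routing fit together, the sketch below models a config with two variants and routes requests in proportion to the variant weights. The field names, the instance type string `cpu.medium`, and the 90/10 split are assumptions made for the example, not values from any specific cloud service.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    model_name: str
    variant_name: str
    weight: float          # share of traffic this variant should receive
    instance_type: str     # hypothetical compute instance label
    instance_count: int


@dataclass(frozen=True)
class EndpointConfig:
    endpoint_name: str
    variants: tuple[Variant, ...]


def pick_variant(config: EndpointConfig, rng: random.Random) -> Variant:
    """Route one request to a variant with probability proportional to its weight."""
    weights = [variant.weight for variant in config.variants]
    return rng.choices(config.variants, weights=weights, k=1)[0]


config = EndpointConfig(
    endpoint_name="recs-endpoint",
    variants=(
        Variant("recs", "current", 0.9, "cpu.medium", 2),
        Variant("recs", "candidate", 0.1, "cpu.medium", 1),
    ),
)

rng = random.Random(7)  # seeded only to make the example repeatable
counts = {"current": 0, "candidate": 0}
for _ in range(10_000):
    counts[pick_variant(config, rng).variant_name] += 1
```

The per-variant counts are the starting point for the metric collection and statistical analysis steps listed under A/B testing; changing the weights in the config shifts traffic without redeploying the models.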

Serving

Online Model Service
Description: A service that serves predictions for new input data. It acts as a bridge between the model endpoint and external clients that require online/real-time predictions.
NOTE – It works internally with multiple micro-services to process an incoming observation and return the prediction.
Key Capabilities:
  • Custom API endpoint
    • create HTTPS endpoints that “Client” and “Customer Engagement Platform” can call
  • A/B testing
    • ability to invoke appropriate “Model Endpoints” based on the context of the observation in the request from “Clients” or “Customer Engagement Platform”
  • Read from “Online Feature Store” (fast read)
    • for online inference
  • Read from “Metadata Store” (fast read)
    • to enrich model output
  • Apply “Serving Rules”
    • to simplify model prediction and fallback behavior (business rules – if any)
  • Publish to “Model Monitoring”
    • publish model-specific metrics to a monitoring service
  • Read/write to “Tracking Store”
    • read user interactions: click, render, view, opt-in/opt-out
    • write predictions from the model
  • Read/write to “Model Prediction Caching”
    • pre-compute and store slow-changing predictions to the caching service
    • read from caching to return prediction
  • Security and access control
    • enforce authentication and authorization of the incoming API calls (observations)
  • Auto scaling
    • manual, schedule, demand, predictive
  • Load balancing
    • efficient distribution of traffic
  • Logging
    • bad data error
    • traffic pattern from clients (with API keys)

Tracking Store
Description: A storage solution that tracks ML recommendations and user interactions (click, view) and makes them available for consumption.
Key Capabilities:
  • Integrate streaming ingestion
    • user actions, i.e., click, view, opt-in/opt-out etc. from “Event API Endpoint”
    • model predictions from “Online Model Service”
  • Store
    • aggregate and store in DB, tables
  • Fast read (low latency in ms)
    • fast read by “Online Model Service”

Serving Rules
Description: A set of serving rules (depending on the use case) that need to be validated before the prediction gets served to the client.
Example – Do not show the same recommendation to a user more than once in the last 24 hrs.
Key Capabilities:
  • Store & serve
    • serving rules configuration per model
  • Fast read (low latency in ms)
    • access to “Online Model Service”
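
The 24-hour example rule can be sketched as a simple filter applied to the model output before it is returned. The function name, the shape of the shown-items log (which in this architecture would come from the Tracking Store), and the item IDs are hypothetical.

```python
from datetime import datetime, timedelta


def apply_repeat_rule(
    candidates: list[str],
    shown_log: list[tuple[str, datetime]],
    now: datetime,
    window: timedelta = timedelta(hours=24),
) -> list[str]:
    """Drop candidates already shown to this user within the window."""
    recently_shown = {item for item, shown_at in shown_log if now - shown_at < window}
    return [item for item in candidates if item not in recently_shown]


now = datetime(2022, 10, 8, 12, 0)
shown_log = [
    ("app_a", now - timedelta(hours=2)),   # shown recently: filtered out
    ("app_b", now - timedelta(hours=30)),  # shown long ago: allowed again
]
served = apply_repeat_rule(["app_a", "app_b", "app_c"], shown_log, now)
```

Keeping the window as a parameter matches the idea of a per-model serving rules configuration: the same rule can run with different settings for different models.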

Online Feature Store (for Inference)
Description: The low-latency, high-availability store for features that enables real-time lookup of records.
NOTE – Only the latest values are stored in the online store.
Key Capabilities:
  • Streaming ingestion
    • integrate inference features from “Event API Endpoint”
  • Store & serve
    • aggregate and store in DB, tables
    • accessible for model inference
  • Fast read (low latency in ms)
    • fast read by “Online Model Service”
  • Feature consistency with “Offline Feature Store”
    • sync the pre-aggregated features to “Offline Feature Store” in micro batches

Model Monitoring
Description: Capabilities used to detect and measure issues that arise with machine learning models.
Key Capabilities:
  • Model performance
    • measure model performance with evaluation metrics (accuracy, recall, precision, F1, …)
  • Data drift
    • shift in the statistical properties of the feature data
  • Data quality
    • cardinality shifts, data type mismatch, missing feature data
  • Alerts
    • alert mechanism for threshold changes in any of the above parameters
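
One way to quantify data drift is the Population Stability Index (PSI), sketched below. PSI is a swapped-in example of a drift metric, not something the platform description prescribes; the bin count, the small floor used to avoid taking the log of zero, and the alert threshold are all assumptions to be tuned per feature.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            idx = int((value - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values
        # A small floor keeps empty bins from producing log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    expected_shares, actual_shares = shares(expected), shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )


baseline = [i / 100 for i in range(100)]   # training-time feature values
recent = [value + 0.5 for value in baseline]  # serving-time values, shifted

ALERT_THRESHOLD = 0.25  # hypothetical threshold; choose per feature
drift_score = psi(baseline, recent)
drift_alert = drift_score > ALERT_THRESHOLD
```

The same pattern (compute a metric, compare against a configured threshold, raise an alert) applies to the model performance and data quality checks listed above.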

Metadata Store
Description: A storage solution that fetches and stores all the metadata that the ML server needs to enrich the output of the ML model.
Example – The output of an ML model may be an app name, but you need more than just the app name to make a recommendation to the client, i.e., app description, hero image banner, pricing, deep link (if any) etc., which are the metadata about an app.
Key Capabilities:
  • Webhook integration with “Metadata Source”
    • read from metadata store every time metadata is added, changed, or deleted
  • Store
    • in DB, tables
    • batch, micro batch writes
  • Quick lookup
    • fast read by “Online Model Service” (in ms)

Model Prediction Caching
Description: In the context of model serving, Model Prediction Caching is a service that processes and stores pre-computed predictions for frequently requested observations.
NOTE – Caching provides the benefits of faster response time and cost savings (by reducing the calls to model endpoints).
Key Capabilities:
  • Process
    • process pre-computed predictions by invoking online model endpoint
  • Cache
    • store pre-computed predictions
  • Fast read (low latency in ms)
    • fast read by “Online Model Service”
  • TTL (time to live)
    • periodic refresh or delete of predictions
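
A minimal sketch of a prediction cache with TTL follows. The `PredictionCache` class is hypothetical, and the injectable clock exists only so the expiry behavior can be demonstrated without waiting; a production setup would more likely rely on a dedicated caching service with built-in expiry.

```python
import time
from typing import Callable, Optional


class PredictionCache:
    """Stores pre-computed predictions and expires them after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, float]] = {}  # key -> (stored_at, prediction)

    def put(self, key: str, prediction: float) -> None:
        self._entries[key] = (self._clock(), prediction)

    def get(self, key: str) -> Optional[float]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, prediction = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: caller should recompute
            return None
        return prediction


fake_now = [0.0]  # mutable fake clock for the demonstration
cache = PredictionCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.put("user-1", 0.9)
fresh = cache.get("user-1")    # served from cache
fake_now[0] = 61.0
expired = cache.get("user-1")  # TTL exceeded, so the entry is dropped
```

On a miss, the Online Model Service would call the model endpoint and write the result back, which is where the cost saving from fewer endpoint calls comes from.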

Batch Prediction Job
Description: A scheduled job that runs on a large dataset of features and invokes the batch model endpoint to get batch predictions all at once.
Key Capabilities:
  • Data ingestion
    • in batch, e.g., daily from external storage
  • Prepare features
    • process ingested data to create features
  • Batch API call
    • ability to trigger batch API calls to a batch model endpoint (HTTPS endpoint)
    • API throttling, e.g., splitting 1M records into 10 API calls of 100k records each
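
The throttling example (1M records sent as 10 calls of 100k each) can be sketched as a chunking loop. The `call_endpoint` parameter is a hypothetical stand-in for the HTTPS call to the batch model endpoint.

```python
from typing import Callable, Iterator, Sequence


def chunk(records: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def run_batch_job(
    records: Sequence,
    call_endpoint: Callable[[Sequence], list],
    batch_size: int = 100_000,
) -> list:
    """Send records to the batch endpoint in fixed-size calls and collect the results."""
    predictions: list = []
    for batch in chunk(records, batch_size):
        predictions.extend(call_endpoint(batch))
    return predictions


call_sizes: list[int] = []


def fake_endpoint(batch: Sequence) -> list:
    """Stand-in for the real endpoint: records the call size, returns dummy scores."""
    call_sizes.append(len(batch))
    return [0.0] * len(batch)


results = run_batch_job(range(1_000_000), fake_endpoint)
```

In a real job, a pause or concurrency limit between calls would usually be added so the endpoint is not overwhelmed; the batch size would come from the endpoint’s documented limits.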

Batch Prediction Store
Description: A storage service that holds the results of batch predictions received from the model endpoint.
NOTE – Ideally there should be a storage service in the enterprise data lake or data warehouse, which can eliminate the need for a separate Batch Prediction Store in the ML platform.
Key Capabilities:
  • Store
    • object storage (JSON, CSV)
    • high-throughput writes
  • Integrations to other systems
    • data warehouse or data lake for analytics
    • customer engagement platform for audience segmentations
    • offline feature store for training, which can sync with online feature store too

Real-time Data Ingestion

Event API Endpoint
Description: API endpoint for clients to send streaming data using the HTTPS protocol.
Key Capabilities:
  • Streaming ingestion
    • ability to ingest real-time signals from “Clients”
    • aggregate and feed signals in real-time to “Online Feature Store”
    • aggregate and feed user actions in real-time to “Tracking Store”
    • feed to a storage layer for triggering feature consistency with “Offline Feature Store”
  • Security and access control
    • enforce authentication and authorization
  • Auto scaling
    • manual, schedule, demand, predictive
  • Load balancing
    • efficient distribution of traffic
  • Logging
    • bad data error
    • traffic pattern

External Systems

Client
Description: App, surface, or service that the end user directly interacts with.
Key Capabilities:
  • Integration with “Customer Engagement Platform”
    • for getting prediction/recommendations
  • Integration with “Online Model Service”
    • for directly fetching model prediction/recommendations
  • Data publish (through SDK)
    • publish real-time events to “Event API Endpoint”
    • publish data to a data warehouse or data lake in batch/micro batch
  • ML transparency
    • explain why certain results are being generated/recommended
    • opt-in/opt-out option for end users

Data Lake / Warehouse
Description: An offline storage service where data (structured & unstructured) is collected from multiple sources, validated, transformed, and stored.
NOTE – A data warehouse stores structured data with a schema. A data lake can contain both structured and unstructured data.
Key Capabilities:
  • Integration with “Clients”
    • collect real-time and batch data
  • Integration with “Batch Prediction Job”
    • make data accessible for batch prediction
  • Integration with “Batch Prediction Store”
    • collect and curate batch predictions
  • Integration with “Data Collection”
    • make data accessible for data collection jobs

Customer Engagement Platform
Description: Application or tool that allows organizations to launch, schedule, coordinate, and monitor marketing campaigns to engage with customers across multiple channels, i.e., email, desktop, web, mobile.
NOTE – This can be an in-house or enterprise-level application with features such as CDP (Customer Data Platform), audience segmentation, campaign orchestration, A/B testing, analytics etc.
Key Capabilities:
  • Integration with “Online Model Service”
    • pass context (from Client or CDP) to “Online Model Service”
    • process the response from “Online Model Service” and pass it back to “Clients”
    • API throttling (if needed)

Metadata Source
Description: One or multiple systems that store metadata about objects, e.g., tutorial, app, video, template etc., that are recommended by the models to the clients.
Key Capabilities:
  • Webhook integration with “Metadata Store”
    • API call to “Metadata Store” whenever metadata is added, changed, or deleted

 

Final Thoughts

Initial steps for organizations

Building an ML platform is an evolutionary process shaped by organizational needs, skills, and tech investments. Cloud solutions such as Amazon SageMaker, Azure Machine Learning, Google Cloud’s ML services, and Databricks offer features to support model development, deployment, and serving. Organizations can build their ML platform upon these solutions while incorporating additional capabilities tailored to their unique use cases.

Start by collaborating with data scientists and ML engineers to understand their challenges in executing ML projects comprehensively. Prioritize a list of MVP features to address these challenges, utilizing existing capabilities, customization, and new developments as required. Crucially, ensure seamless integration of the Development, Deployment, and Serving layers with external systems and business applications through standardized APIs, while also considering scalability as a fundamental aspect during MVP development.

Key recommendations

  • MLOps – To meet the demand for faster time to market and to manage the engineering complexity, it is crucial to create AI engineering pipelines that are flexible and adaptable. The objective of MLOps is to rapidly develop, deploy, serve, and maintain ML models in the ML platform across different environments in the enterprise. Depending on its priorities, an organization can plan the following initiatives: Continuous Integration (CI), Continuous Deployment (CD), Continuous Monitoring (CM), and Continuous Training (CT).
  • AI Trust, Risk & Security Management – According to a Gartner report, by 2026, organizations that can operationalize AI transparency, trust and security will see their AI models achieve a 50% result improvement in terms of adoption, business goals and user acceptance. Organizations should have a task force or dedicated unit to manage AI privacy, compliance, security, and risk for improved AI business outcomes.
  • ML-specific Documentation – This is crucial for transparency and for the user experience of internal stakeholders, as it reduces the “black-boxness” around machine learning. The documentation should primarily focus on:
    • Overview and detailed information of the platform and its components
    • Models and associated details, e.g., data used, ML architecture, algorithm, predictions, validation metrics, non-ML routines used etc.

Author: Pritish Udgata
Date: 8 Oct 2022