Most data scientists have a background in math, statistics, or another quantitative field. While they are well trained in building high-performing models, they may be less versed in using cloud technologies to engineer a production-grade model. Data scientists frequently ask:
Once I create a model, how can I make it usable for other websites, apps, and services?
A: There are many steps and considerations when transitioning from a trained model to production. These include determining where your model will be hosted, how it will be called from other applications, latency requirements, and many more. See our model building checklist for a full list of considerations.
How does machine learning differ from statistical analysis?
A: There are many differences (and similarities) between machine learning and statistical analysis. Machine learning primarily deals with making predictions - that is, given a set of data inputs, can the model accurately predict some output? Statistical analysis primarily deals with correlation and relationships between variables, such as finding the variables most associated with some outcome. While the underlying theory and methods for machine learning and statistical analysis overlap considerably (logistic regression being one example), the objectives are usually different.
What are the tradeoffs between various programming languages for machine learning, such as R and Python?
A: Machine learning models can be built in many programming languages such as R, Python, SAS, Stata, and others. However, if your goal is to create a machine learning model in production, you need to consider the systems your model will need to integrate with. For this reason (and a few others), Python has become the go-to language for machine learning. Python makes it easier to integrate models with other software systems compared to the other languages. While you can certainly create a model in any of the above languages, if you are using the model in a cloud framework (AWS, GCP, etc.) or want to integrate it with a website or app, your life will be easier with Python.
What is the role of Big Data in Data Science and Machine Learning?
A: Big Data plays a vital (and sometimes overlooked) role in machine learning. It is common for the data you are working with to be so large that big data tools like Spark and Hadoop are necessary to process it. Cloud frameworks like AWS and GCP also have tools that can help process large amounts of data.
What's the difference between a data scientist and a data analyst?
A: Data analysts are usually responsible for creating tables, charts, and visualizations from data. They usually have an undergraduate degree in a STEM (science, technology, engineering, math) field, but this is not required. Data analysts have practical programming skills and can combine, manipulate, and analyze data effectively. Data scientists have the skills of a data analyst, but usually have stronger programming and mathematical skills. They perform the same tasks as data analysts, but also create statistical and machine learning models when needed.
Software engineers typically have a computer science background and strong coding skills. They are comfortable with building production-grade applications but may need to learn the mathematical side of data science and machine learning in order to create a well-performing model. Software engineers typically ask:
How much math do I need to know in order to be successful in machine learning?
A: It depends on how involved you will be in the ML side of things (ex: model training, refactoring model code to make it production-ready, or creating the model architecture itself). Courses in linear algebra and advanced calculus are extremely helpful in machine learning. Knowing the mathematical details of modeling algorithms will help you make better modeling decisions and weigh trade-offs.
What are the methods most typically used by data scientists when analyzing data?
A: It varies and largely depends on the type of machine learning used by your company. DataScienceCentral provides a nice list of 40 common techniques used by Data Scientists. This list is a good starting point, but if you want to specialize in a particular area of machine learning (e.g. computer vision or natural language processing), you’ll want to be familiar with those algorithms too.
What software engineering skills are most useful in machine learning?
A: Software engineering skills are very helpful in machine learning. Having strong coding skills (along with ML knowledge) will allow you to build high-performing models. Also, engineering skills will make it easier to put models in production and build scalable, flexible systems around your model.
How do machine learning projects differ from software engineering projects?
A: The biggest difference between machine learning and software engineering projects is the ability to guarantee deliverables and performance. For example, engineering projects typically have specific requirements (such as building a website with features X, Y, and Z), and it’s easy to tell when those requirements are fulfilled. Machine learning projects can be more vague and open-ended, since model performance usually cannot be guaranteed. For example, you could build a recommendation engine but cannot guarantee how well it will perform in production or how well it will perform in the future.
Product and project managers typically ask:
As a product manager, how do I know whether or not to use ML for a particular problem?
A: Don’t use ML when your problem: (1) can be solved by simple rules, (2) does not adapt to new data, (3) requires 100% accuracy, or (4) requires full interpretability. Use ML when your problem has existing examples of actual answers and: (1) handles very complex logic, (2) scales up fast, (3) requires specialized personalization, or (4) adapts in real time. If you are trying to create a formula involving too many variables to hand-code, ML may be the better choice. Ex: a search product can receive an unlimited variety of inputs, making it impossible to craft rules for every input.
Which areas of ML should product managers be involved in?
A: PMs should be heavily involved in defining: (1) what is the problem to solve, (2) what is the measurable goal, and (3) what do you want to predict. Data selection also has high PM involvement: which datasets to use (public, internal, custom) and for what purposes (training and tuning, measuring success, replacing flawed or outdated data). Areas with moderate PM involvement include: (1) data cleaning (ex: removing or fixing missing data), (2) data sampling (ex: choosing representative data; solving issues such as seasonality, trends, leakage, and biases), (3) unintended bias in the data, and (4) data labeling (ex: tagging/classifying data).
How can I measure the impact of machine learning on my product?
A: Methods like A/B testing and others are typically used to measure how ML models impact products and what lift (financial and otherwise) they provide. See our Model Evaluation section for more details.
How is managing an ML project different from managing a software project?
A: Many software development best practices can also be applied to ML projects. The key is to treat the model as one (key) piece of the overall software system. For example, if the model will be invoked as an API, then decouple the development of that API from the rest of the system. If you are using Agile (Scrum) as your software development methodology, the API should deliver predictions at the end of each sprint, just like any other software component. The quality of the predictions will improve as the sprints go on and the data science team enriches the model behind the API. See managing ML projects for best practices.
Once you have created a machine learning model and are ready to use it in the real world, you have a choice of how it will be used for inference: real-time (also called ‘online’) or batch (also called ‘offline’).
Real-time inference is necessary when your inputs are real-time events that cannot be listed at design time (ex: search terms). In many cases, this is not needed. For example, if you have an eCommerce site and want to recommend similar products on a particular product's detail page, you can generate and store predictions at design time for each SKU in your inventory. At serving time, you simply retrieve these stored predictions rather than running inference in real time. However, for the same use case, if you want to influence those predictions by leveraging the user's breadcrumb trail, then real-time inferencing becomes necessary.
Similarly, in many cases where you want to personalize the predictions for a given user, real time inferencing may be necessary. For example, say you are shopping on an online retail website. You place a few items in your cart and click ‘checkout’. At that moment, the items in your cart (along with other data the website may have about you) are sent to a recommendation model which returns a few items to the website to show as recommendations. In this scenario, data is sent to the model and predictions are returned instantly (or near instantly).
Real-time inference is used by online retail websites, social media apps, driverless cars, and many other applications.
In batch inference you don’t need the model predictions immediately. Here, predictions are run through the model in a group (a ‘batch’) and it might take minutes or hours for the model to return predictions for each observation in the batch.
For example, say you want to predict the price of houses that come on the market each day. Since you only need to do inference once per day, you don’t need to constantly send data to the model. Instead, you can collect all the data on the new houses that came on the market (neighborhood, year built, square footage, etc.) and send it to your model in one ‘batch’. The model then returns a price prediction for each house. In this scenario, it’s acceptable if the model takes an hour (or longer) to create the predictions because you don’t need them immediately.
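As a rough sketch of what that batch job could look like in Python (the model file, input file, and column names are all assumptions for illustration):

```python
import joblib
import pandas as pd

# Hypothetical artifacts: a model trained earlier and today's new listings.
model = joblib.load("models/house_price_model.joblib")
new_houses = pd.read_csv("data/new_listings_today.csv")

# Score the entire batch in one pass; latency is not a concern here.
features = new_houses[["neighborhood_id", "year_built", "square_footage"]]
new_houses["predicted_price"] = model.predict(features)

new_houses.to_csv("output/predicted_prices_today.csv", index=False)
```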
The choice between real-time and batch inference largely depends on whether you want to use unpredictable real-time events as inputs during your predictions and the need for personalized predictions. If your model is part of a user-facing app or website, you will likely need real-time inference because your predictions must be returned immediately to the user and they must be personalized.
You might be wondering… if real-time is faster than batch, why not always use real-time inference? The answer is that real-time inference is usually more expensive and requires more tooling than batch. When using real-time inference, you usually need a server to invoke the model and return the predictions to an app, website, or other service. You also need to worry about latency requirements - your model must return predictions in a timely manner. Keeping a server live is costlier than running a few batch prediction jobs every night, where you only need computing power for a few minutes or hours.
Use case assumption:
Summary of cost scenario:
We have 30 models running in production. 24 of those can run on c5.xlarge instances configured with Amazon Elastic Inference eia2.medium accelerators. We also have 6 deep learning models that need to run on ml.p2.xlarge instances.
Monthly cost of hosting 24 models running on ml.eia2.large:
6 Deep learning models running on ml.p2.xlarge
Total Cost Per Month = $5,178 + $4,116 + $5,442 + $1,600 + $258 = $16,594 / month
Reference: snippets from SageMaker documentation
You might have in-house example data repositories or use publicly available datasets. Typically, you pull the dataset or datasets into a single repository.
Before using a dataset to train a model, data scientists typically explore, analyze, and preprocess it. You can use a Jupyter notebook on an Amazon SageMaker notebook instance to do so.
You should inspect the data and clean it as needed (ex: if your data has a "country name" attribute with values "United States" and "US", you might want to edit the data to be consistent). You may also want to perform additional data transformations (ex: combine attributes).
SageMaker Processing enables you to run jobs that preprocess and postprocess data, perform feature engineering, and evaluate models on Amazon SageMaker easily and at scale. You can use the built-in data processing containers or bring your own containers, and submit custom jobs to run on managed infrastructure.
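As a minimal sketch of such a job using the SageMaker Python SDK's built-in scikit-learn container (the role ARN, S3 paths, and preprocess.py script are assumptions):

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Hypothetical execution role; on a notebook instance you could use sagemaker.get_execution_role().
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# preprocess.py is a hypothetical script that cleans and transforms the raw data.
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```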
Once the data is ready, store it in an S3 bucket.
Train the model
To train a model in SageMaker, you create a training job. The training job includes the following information:
You have the following options for a training algorithm:
You can create a training job with the Amazon SageMaker console or the API. After you create the training job, Amazon SageMaker launches the ML compute instances and uses the training code and the training dataset to train the model. Depending on the size of your training dataset and how quickly you need the results, you can use resources ranging from a single general-purpose instance to a distributed cluster of GPU instances.
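To make this concrete, here is a hedged sketch of a training job using the SageMaker Python SDK with the built-in XGBoost algorithm (the role ARN, bucket names, and hyperparameters are illustrative assumptions):

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Use the built-in XGBoost container as the training algorithm.
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",  # where SageMaker writes the trained model
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Launches the ML compute instances and runs the training job against the data in S3.
estimator.fit({"train": TrainingInput("s3://my-bucket/processed/train.csv",
                                      content_type="text/csv")})
```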
You can use SageMaker Debugger to inspect training parameters and data throughout the training process when working with the TensorFlow, PyTorch, and Apache MXNet learning frameworks. Debugger automatically detects and alerts users to commonly occurring errors such as parameter values getting too large or small.
Your instructions to AWS for training a model:
Evaluate the model
SageMaker saves the resulting model artifacts and other output in the S3 bucket you specified for that purpose. To evaluate your trained model, you use either the AWS SDK for Python (Boto) or the high-level Python library that Amazon SageMaker provides to send requests to the model for inferences. You use a Jupyter notebook in your Amazon SageMaker notebook instance to train and evaluate your model.
You can evaluate your model using historical data (offline testing) or live data (online testing):
Deploy the model
After you train your model, you can deploy it to get predictions in one of two ways:
Deploying a model using Amazon SageMaker hosting services is a three-step process:
You can deploy a model trained with Amazon SageMaker to your own deployment target. To do that, you need to know the algorithm-specific format of the model artifacts that were generated by model training.
You traditionally re-engineer a model before you integrate it with your application and deploy it. With Amazon SageMaker hosting services, you can deploy your model independently, decoupling it from your application code.
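As a sketch, the high-level Python SDK wraps those steps (creating the SageMaker model, the endpoint configuration, and the endpoint) in a single call; the endpoint name and instance type below are assumptions, continuing the training example above:

```python
# Deploys the trained model behind a real-time HTTPS endpoint. Under the hood this
# creates a SageMaker model, an endpoint configuration, and the endpoint itself.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="house-price-endpoint",  # hypothetical endpoint name
)
```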
Invoke the model
To get inferences from the model, client applications send requests to the Amazon SageMaker Runtime HTTPS endpoint. You can also send requests to this endpoint from your Jupyter notebook during testing. However, endpoints are scoped to an individual AWS account, and are not public. The URL does not contain the account ID, but Amazon SageMaker determines the account ID from the authentication token that is supplied by the caller. This means if the client application is not within the scope of your account, it cannot hit that endpoint. However, you can use Amazon API Gateway and AWS Lambda to set up and deploy a web service that you can call from such a client application.
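A minimal sketch of such a client-side call through the SageMaker Runtime API with boto3 (the endpoint name and CSV payload format are assumptions):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# CSV payload with the same feature order the model was trained on (hypothetical values).
payload = "3,1995,2150\n"

response = runtime.invoke_endpoint(
    EndpointName="house-price-endpoint",
    ContentType="text/csv",
    Body=payload,
)
print(response["Body"].read().decode("utf-8"))  # the model's prediction
```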
Generate ground truth
To increase a model's accuracy, you might choose to save the user's input data and ground truth, if available, as part of the training data. You can then retrain the model periodically with a larger, improved training dataset.
Update the model
You can modify an endpoint without taking models that are already deployed into production out of service. For example, you can add new model variants, update the ML Compute instance configurations of existing model variants, or change the distribution of traffic among model variants. To modify an endpoint, you provide a new endpoint configuration. Amazon SageMaker implements the changes without any downtime.
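A hedged sketch with boto3 (all names are hypothetical): create a new endpoint configuration, here splitting traffic between two model variants, then point the existing endpoint at it; SageMaker applies the change without downtime.

```python
import boto3

sm = boto3.client("sagemaker")

# New configuration that sends 80% of traffic to the current model and 20% to a challenger.
sm.create_endpoint_config(
    EndpointConfigName="house-price-config-v2",
    ProductionVariants=[
        {"VariantName": "current", "ModelName": "house-price-model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.8},
        {"VariantName": "challenger", "ModelName": "house-price-model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.2},
    ],
)

# Update the live endpoint in place; existing traffic continues to be served.
sm.update_endpoint(
    EndpointName="house-price-endpoint",
    EndpointConfigName="house-price-config-v2",
)
```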
Changing or deleting model artifacts or changing inference code after deploying a model produces unpredictable results. If you need to change or delete model artifacts or change inference code, modify the endpoint by providing a new endpoint configuration. Once you provide the new endpoint configuration, you can change or delete the model artifacts corresponding to the old endpoint configuration.
Monitor the model
Amazon SageMaker Model Monitor enables developers to set alerts for when there are deviations in the model quality, such as data drift and anomalies.
The steps to complete a machine learning project can range from finding data and building a model to having a completely automated end-to-end system that can train and deploy models every hour.
Stage 1: Building a model
The steps to building a good model follow the ML lifecycle:
We won’t go into detail on each step here - but if you’re looking for more detail, be sure to check out our article on the model-building process.
Stage 2: Deploying a model
Once you have a model ready for deployment, the next step is to deploy the model so other applications (websites, mobile apps, etc.) can use it. There are many ways to deploy a model, but most deployment methods are some type of service that accepts data calls and returns model predictions.
These deployments can range from a simple Flask app that takes input data, runs it through the model, and returns a prediction, to a large backend service involving many servers, containers, databases, polling queues, and more. How big or small to make your service depends on how many requests your model will get, how fast it needs to respond to those requests, and the engineering architecture of existing systems at your company.
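A minimal sketch of the simple end of that range, a Flask app wrapping a serialized model (the model path and expected JSON shape are assumptions):

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/model.joblib")  # hypothetical serialized model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[3, 1995, 2150]]}.
    payload = request.get_json()
    prediction = model.predict(payload["features"]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```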
A good first step is to start simple so you can see how your model behaves once it’s deployed and work through any issues that arise. It’s essential to work closely with the teams that will consume your model to work out exactly how it will interact with those consumers. For example, if your model will be used in a website, you’ll want to work closely with the front-end developers who will incorporate the model into the website.
Stage 3: Automated end-to-end ML pipeline
Having a deployed model is great, but what if you want to train and deploy your model every day, or every hour? If you deployed your model using a flask app running on EC2, you would need to manually retrain your model, copy the new model to the server, and re-launch your deployment service every time you want to update the model. This is obviously a time-consuming and manual process.
Instead of manually training and deploying your model, a better option is to create an automated end-to-end ML pipeline that you can run on any schedule. Then, all you need to do is schedule your pipeline to run at the frequency you need.
Automated pipelines are usually some form of linked containers, where each container represents one (or multiple) parts of the ML lifecycle. You might have a container that fetches raw data and passes that data to another container that processes the data into a model-ready format, and so on. These pipelines involve much more engineering effort than just deploying a model via SageMaker or Flask, but they provide more flexibility and automation for model deployment, especially scheduled deployments.
AWS has orchestration services like Step Functions that can coordinate and automate these pipelines.
Most machine learning (ML) projects fall into one of six main types of problems. This list is certainly not exhaustive but covers most of the common ML use cases.
Time Series Forecasting
Natural Language Processing
Classification and regression
More details about each type of ML problem are described below:
Recommendation systems are commonly used by websites and mobile apps to offer products to users (see examples below). These systems work by leveraging historical information (purchase history, viewing history, etc) about user behavior to recommend new items users might like.
The algorithms behind these recommendations come in two flavors: collaborative filtering and content-based filtering. Collaborative filtering assumes that users who have similar preferences in the past will have similar preferences in the future. For example, if one user has purchased items A, B, C, D in the past, and another user has purchased A, B, and C, then item D could be recommended to the second user, because user one liked item D in the past, and the two users have similar tastes.
Content-based filtering is similar but uses information about the products themselves. These recommendations are based on a user's likes, dislikes, and product attributes. For example, if a user has purchased music and computer products in the past, other music and computer products could be recommended to them in the future.
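As a rough sketch of item-based collaborative filtering on the toy purchase history above, items are scored for a user by their similarity to items the user has already bought:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item purchase matrix: rows are users, columns are items A, B, C, D.
purchases = np.array([
    [1, 1, 1, 1],   # user 1 bought A, B, C, and D
    [1, 1, 1, 0],   # user 2 bought A, B, and C
])

# Item-to-item similarity based on who bought each item.
item_similarity = cosine_similarity(purchases.T)

# Score items for user 2 by similarity to their purchases, ignoring items already bought.
user = purchases[1]
scores = item_similarity @ user
scores[user == 1] = -np.inf
print("Recommend item index:", int(np.argmax(scores)))  # index 3, i.e. item D
```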
Time Series Forecasting
Time series models aim to predict new data in the future. The classic example is trying to predict the value of a stock tomorrow based on that stock’s history and other data about the market. A key differentiator of these models is the time dimension - they use data that is specifically linked to moments in time.
Traditional methods for these types of problems include linear models like ARIMA (‘auto-regressive integrated moving average’) and its seasonal variant, SARIMA. Newer methods have successfully applied deep learning to these problems, including recurrent neural networks (RNNs), LSTMs, and DeepAR.
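A brief sketch of the traditional approach with statsmodels, fitting an ARIMA model to a synthetic price series and forecasting the next few steps:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily series standing in for a stock's price history.
rng = np.random.default_rng(0)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 250)))

# Fit ARIMA(p=1, d=1, q=1) and forecast the next 5 time steps.
fitted = ARIMA(prices, order=(1, 1, 1)).fit()
print(fitted.forecast(steps=5))
```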
Computer vision deals with computers recognizing images or parts of images. Some examples are identifying people, animals, or products in images, or object tracking. Facial recognition is one type of computer vision example, but companies also use computer vision to automatically find their logo in vast amounts of pictures, visually optimize product packaging, and much more.
Deep learning plays a big part in computer vision. Common approaches include convolutional neural networks (CNNs), pooling, transfer learning, and Gaussian mixture models (GMMs).
Natural Language Processing
Natural Language Processing (NLP) is focused on text data. While most data is structured (think spreadsheet data), text data is unstructured (think tweets or blog posts) and is usually more challenging to work with. Extracting meaningful insights from text is a complicated process, but fortunately the field of NLP has created many methods to tackle these problems.
Common methods in NLP are deep learning algorithms like LSTMs and transformer models, but also include latent Dirichlet allocation (LDA), classification algorithms, and conditional models like the linear-chain CRF. NLP models commonly use preprocessing methods like embeddings to more accurately represent words within text.
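A small sketch of one classical NLP method, topic discovery with LDA over a toy corpus (the documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team shipped the new model to production",
    "the stock market fell sharply on interest rate news",
    "engineers deployed the service behind an api",
    "investors worry about inflation and interest rates",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Discover two latent topics and print their top words.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"Topic {i}:", [terms[j] for j in topic.argsort()[-4:]])
```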
Classification and Regression
Classification aims to assign each data point to one of several groups. Usually each data point can belong to only one group. For example, you might use medical data about a patient to predict whether or not they have a disease. In this case you are trying to classify the patient into one of two groups (‘having disease’ or ‘not having disease’) using other attributes about the patient (height, weight, medical history, etc.). Classification can extend to many groups - for example, you could classify the temperature tomorrow as ‘low’, ‘medium’, or ‘high’, or classify potential voters as ‘democrat’, ‘republican’, ‘independent’, or ‘other’.
Some common methods for classification are logistic regression, support vector machines, random forests, and neural networks.
Regression deals with predicting values that can take on any number. For example, predicting the exact temperature of a substance or the price of a new house on the market. Common methods for regression are linear models, feed-forward neural networks, random forests, and gradient boosting methods like XGBoost.
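A short sketch of both tasks with scikit-learn, using toy built-in datasets purely for illustration:

```python
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: predict whether a tumor is malignant or benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Regression: predict a continuous house value.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("Regression R^2:", reg.score(X_test, y_test))
```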
Model evaluation is a critical step in building a well-performing ML model. A good evaluation will not only tell you how well your model can predict unseen data, but it can give you insight into what problems your model is facing (e.g. overfitting). A solid understanding of why your model is underperforming is the first step toward improving your model.
Model evaluation can happen at different stages of the ML lifecycle. Most traditional ML and statistics textbooks focus on how to evaluate your model during the model development stage. These methods include regularization, k-fold cross-validation, leverage points, and many others. While evaluation at this stage is important, model evaluation can (and should) also be performed at two other stages of the ML lifecycle: pre-deployment and post-deployment.
This article will cover common techniques for model evaluation at these three stages of the ML lifecycle.
Evaluation During Model Development
There are many methods to evaluate models during the model development phase - entire textbooks are written about this topic alone. For the sake of brevity, we will cover this type of model evaluation in another article - but you should be aware that many other methods and approaches exist.
When you have trained and tested your model, you’ll want to test it in a production environment. Once your model can make inferences in a production environment, you are ready to start this phase of testing. One of the most common and useful methods of testing this way is A/B testing.
A/B testing is a common approach to test software engineering changes such as new features - but it’s also very useful to test ML models in production. There are many different ways to carry out A/B testing, but we will focus on the most common here.
The most basic way to test your model’s impact on actual users is to have the model predictions working for some users, but not others. Then a direct comparison can be made between the two groups and the effect of your model can be measured. For example, say you have a model that recommends products to users based on what they’ve bought in the past and what they currently have in their cart. To implement A/B testing, you would randomly select users on your website into one of two groups:
(1) users who see the recommended products that come from the model
(2) users who get no model-based recommendations
Then, you can compare the lift in the number of items bought (or total sales) between the two groups.
To compare these groups you can use common statistical methods for comparing the means of multiple groups - methods such as the t-test, ANOVA, and others. If the tests are significant, you have a strong signal that your model has had an effect on traffic, sales, or whatever you want to measure. SageMaker can help with this - it has the ability to host multiple versions of a model in the same endpoint, making it easier to send traffic to different models and compare the results.
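A small sketch of that comparison with a two-sample t-test (the per-user spend numbers are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical per-user spend: group A saw model recommendations, group B did not.
group_a = np.array([32.0, 41.5, 27.3, 55.0, 38.9, 44.1, 30.2, 49.7])
group_b = np.array([28.1, 35.4, 22.9, 40.3, 31.0, 36.8, 25.5, 33.2])

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Spend in the recommendation group differs significantly from the control group.")
```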
You have now built a model, verified it through A/B testing and it is running in production. At this point you need to make sure your model stays “healthy” - i.e., model performance stays constant over time.
Often models running in production start to perform worse over time. This can happen for different reasons, but it’s extremely important to monitor your models’ performance in production so your team is aware if model performance dips below acceptable levels. One way to monitor your model’s health in production is to store the model predictions so they can be compared against actual results.
Going back to the recommendation system example, imagine we store the products that were recommended to each user. We could then establish a baseline against which to compare our model - for example, 50% of recommended products are bought by users. As our model runs in production, we can use this metric to monitor the model. If, over time, only 25% of recommended products are bought by users, the model is no longer performing as well as it used to, and we need to figure out why. One way to combat this issue is to retrain and redeploy the model frequently.
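A rough sketch of that kind of monitoring check (the log file, column names, and baseline value are assumptions):

```python
import pandas as pd

# Hypothetical log: one row per recommendation shown, with whether the item was bought.
log = pd.read_csv("logs/recommendation_outcomes.csv", parse_dates=["shown_at"])

BASELINE_RATE = 0.50  # fraction of recommended products bought when the model was validated

weekly_rate = log.set_index("shown_at")["was_bought"].resample("W").mean()
print(weekly_rate.tail())

if weekly_rate.iloc[-1] < 0.5 * BASELINE_RATE:
    print("Alert: recommendation purchase rate has fallen to half the baseline; investigate or retrain.")
```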
86% of data science decision makers across the Global 2000 believe machine learning impacts their industries today. However, many enterprises are concerned that only a fraction of their ML projects will have business impact. In some cases, investments made in ML projects are questioned and projects abandoned when the implementation does not match the vision (ref).
The ML industry is beginning to understand the need for more engineering discipline around ML. “Just as humans built buildings and bridges before there was civil engineering, humans are proceeding with the building of societal-scale ML systems. Just as early buildings and bridges sometimes fell to the ground, many of our early societal-scale ML systems are already exposing serious flaws. What we’re missing is an engineering discipline with its principles of analysis and design.” -- Prof. Michael Jordan, UC Berkeley
Identify the ML project’s stage and plan accordingly
Assess economic value
Build an economic model of the expected value from the project. Use it to provide context to inform project decisions, thus moving the focus from the ML technology to its impact on business. Doing so at the beginning of a project can dramatically change the direction and focus of the project.
Elicit key business drivers or constraints that the model must meet (such as, “must be at least as accurate as the current process,” or “must provide transparency into how decisions are being made”). These constraints become requirements for the ML system, risks to be managed, or decision criteria on whether the model is sufficiently good to proceed. Whether a model is sufficient to support the business case might be a higher bar than whether the model is a good model.
Assess the cost of errors. Given the speed and volume at which ML models operate, existing human intervention and oversight may be removed. What is the cost of the resultant errors? If there is a cost for each error, how much tolerance is there before the economic model ceases to be positive? If model drift occurs, the number of errors might increase. How serious a problem is that?
Quantify the cost of errors by assigning a dollar value to each of FP, FN, TP, and TN. Use it to change the model's behavior itself. If the costs of different kinds of errors (such as false negatives versus false positives) are widely different, that information can be used to train a model with more desirable outcomes (ref). Ex: differential error costs change the choice of ML model used to predict breast cancer, producing fewer false negatives (undesirable and expensive) at the cost of more false positives, while still producing a cheaper model overall.
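A brief sketch of that calculation, turning a confusion matrix into an expected dollar cost (the dollar values and labels below are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-outcome costs: a false negative (missed cancer) is far more
# expensive than a false positive (an unnecessary follow-up test).
COST = {"TN": 0.0, "FP": 50.0, "FN": 5000.0, "TP": 100.0}

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])   # actual labels
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 0])   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total_cost = tn * COST["TN"] + fp * COST["FP"] + fn * COST["FN"] + tp * COST["TP"]
print(f"Expected cost of this model's errors: ${total_cost:,.0f}")
```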
Assess data quality
A significant portion of the research component of ML projects is to assess the data quality and whether it’s appropriate for the problem.
Initial research is frequently performed on cleaned and possibly enriched data extracted from a data lake, for convenience and speed of access. The implicit assumption is that the data operated on in production will be the same, and can be provided quickly enough to act on. This assumption should be tested to ensure the ML model will work as expected.
The more data sources that are involved, the more disparate the data sources that are to be merged, and the more data transformation steps that are involved, the more complex the data quality challenge becomes.
Two ways to ensure that the model’s production performance is similar to its development performance: (1) compare statistics of the source input data to the data the ML model was actually trained on, and (2) validate the model against unclean data inputs.
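A minimal sketch of the first check, comparing summary statistics of the training data against recent production inputs (file paths, column handling, and the drift threshold are assumptions):

```python
import pandas as pd

# Hypothetical snapshots: the data the model was trained on and recent production input.
train = pd.read_csv("data/training_snapshot.csv")
prod = pd.read_csv("data/production_last_week.csv")

# Flag numeric columns whose mean has shifted by more than 10% of the training std.
for col in train.select_dtypes("number").columns:
    shift = abs(prod[col].mean() - train[col].mean())
    if shift > 0.1 * train[col].std():
        print(f"Possible drift in '{col}': "
              f"train mean {train[col].mean():.2f}, prod mean {prod[col].mean():.2f}")
```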
Data scientist: background in math, statistics, and advanced analytics. Provides statistical and ML specialty knowledge. Rigorous experimental design is critical, particularly for companies with large user bases or in highly regulated industries.
Engineers (data/application/infrastructure): background in programming with specialization in big data technologies. They perform data acquisition and ETL (extract, transform, load) and build data pipelines. Application engineers integrate the model into an application and use the inferences in the context of a business process.
Steering committee: business stakeholders and the financial owner of benefits and risks. They can bring in external specialists (HR/legal/PR) as needed to manage risks.
Use scorecards to report on progress
Project environment scorecard
Data quality scorecard
Move from research to production
The code in the researcher’s Jupyter notebook is generally not production quality. Reengineering the researcher’s code is frequently required to make this code a good fit for a production environment.
Unfortunately, the method to communicate the requirements to the development team is frequently by giving them the researcher’s Jupyter or Zeppelin notebook, or a set of Python or R scripts. If the development team redevelops and optimizes the code for production while the research team continues from their base notebook, you have the problem of versioning the code and identifying changes.
All usual software engineering and management practices must still be applied, including security, logging and monitoring, task management, end-to-end A/B testing, API versioning (if multiple versions of the model are used), and so on.