Model evaluation is a critical step in building a well-performing ML model. A good evaluation will not only tell you how well your model can predict unseen data, but it can give you insight into what problems your model is facing (e.g. overfitting). A solid understanding of why your model is underperforming is the first step toward improving your model.
Model evaluation can happen at different stages of the ML lifecycle. Most traditional ML and statistics textbooks focus on how to evaluate your model during the model development stage. These methods include regularization, k-fold cross validation, leverage points, and many others. While evaluation at this stage is important, model evaluation can (and should) also be performed at two other stages of the ML lifecycle: pre-deployment and post-deployment.
This article will cover common techniques for model evaluation at these three stages of the ML lifecycle.
Evaluation During Model Development
There are many methods to evaluate models during the model development phase - entire textbooks are written about this topic alone. For the sake of brevity, we will cover this type of model evaluation in another article - but you should be aware that many other methods and approaches exist.
Pre-Deployment Evaluation
Once you have trained and tested your model, you’ll want to evaluate it in a production environment. As soon as your model can make inferences in production, you are ready to start this phase of testing. One of the most common and useful methods at this stage is A/B testing.
A/B testing is a common approach to test software engineering changes such as new features - but it’s also very useful to test ML models in production. There are many different ways to carry out A/B testing, but we will focus on the most common here.
The most basic way to test your model’s impact on actual users is to serve the model’s predictions to some users but not others. Then a direct comparison can be made between the two groups, and the effect of your model can be measured. For example, say you have a model that recommends products to users based on what they’ve bought in the past and what they currently have in their cart. To implement A/B testing, you would randomly assign users on your website to one of two groups:
(1) users who see the recommended products that come from the model
(2) users who get no model-based recommendations
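One common way to implement the random split above is a deterministic hash-based assignment, so the same user always lands in the same group across visits. A minimal sketch (the salt, group names, and 50/50 split are illustrative assumptions, not from this article):

```python
import hashlib

def assign_group(user_id: str, salt: str = "recs-ab-test-v1") -> str:
    """Deterministically assign a user to the treatment group (sees
    model recommendations) or the control group (does not)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    # Interpret the hash as a uniform number in [0, 1) and split 50/50.
    bucket = int(digest, 16) / 16 ** len(digest)
    return "treatment" if bucket < 0.5 else "control"

# The same user always lands in the same group across visits.
print(assign_group("user-123"))
```

Changing the salt starts a fresh experiment with a new, independent assignment of users to groups.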
Then you can measure the lift in the number of items bought (or total sales) between the two groups.
To compare these groups you can use common statistical methods for comparing the means of multiple groups - methods such as the t-test, ANOVA, and others. If the tests are significant, you have a strong signal that your model has had an effect on traffic, sales, or whatever you want to measure. SageMaker can help with this - it has the ability to host multiple versions of a model in the same endpoint, making it easier to send traffic to different models and compare the results.
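As a sketch of the comparison step, here is Welch’s two-sample t-test implemented with only the standard library, using a normal approximation for the two-sided p-value (the per-session purchase counts are made-up illustrative data; in practice you would likely reach for scipy.stats.ttest_ind, which uses the exact t-distribution):

```python
import math
from statistics import mean, variance

def welch_t_test(a, b):
    """Welch's two-sample t-test. Returns (t, p), where p is a
    two-sided p-value from a normal approximation - reasonable
    for larger samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Two-sided p-value via the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Illustrative data: items bought per session in each group.
treatment = [5.1, 6.2, 5.8, 6.5, 5.9, 6.1, 5.7, 6.3]  # saw recommendations
control = [3.2, 2.9, 3.5, 3.1, 2.8, 3.3, 3.0, 3.4]    # no recommendations

t_stat, p_value = welch_t_test(treatment, control)
if p_value < 0.05:
    print(f"significant difference in means (t={t_stat:.2f}, p={p_value:.4f})")
```
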
Post-Deployment Evaluation
You have now built a model, verified it through A/B testing, and deployed it to production. At this point you need to make sure your model stays “healthy” - i.e., that model performance stays constant over time.
Models running in production often start to perform worse over time. This can happen for different reasons, but it’s extremely important to monitor your model’s performance in production so your team is aware if it dips below acceptable levels. One way to monitor your model’s health in production is to store its predictions so they can be compared against actual results.
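A minimal sketch of such prediction logging, assuming a JSON Lines file as the store (the record fields and file layout are illustrative assumptions; in production this would more likely be a database or data lake):

```python
import json
import os
import tempfile
import time

def log_prediction(user_id, recommended_items, path):
    """Append one prediction record (JSON Lines) so predictions can
    later be joined against actual purchase data."""
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "recommended": recommended_items,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Demo: log to a temporary file and read the record back.
log_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
log_prediction("user-123", ["sku-1", "sku-2"], log_path)
with open(log_path) as f:
    first = json.loads(f.readline())
```

The timestamp matters: when you later join predictions against purchases, you want to credit the model version that was live at prediction time.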
Going back to the product recommendation model, imagine we store the products that were recommended to each user. We could then establish a baseline to compare our model against - for example, 50% of recommended products are bought by users. As our model runs in production, we can use this metric to monitor it. If, over time, only 25% of recommended products are bought, that signals the model is no longer performing as well as it used to, and we need to figure out why. One way to combat this degradation is to retrain and redeploy the model frequently.
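The baseline check described above can be sketched as follows; the 50% baseline and 25% alert level come from the example in this article, while the record schema and sample data are illustrative assumptions:

```python
def recommendation_hit_rate(records):
    """Fraction of recommended products that users actually bought.
    Each record pairs the items recommended to one user with the
    items that user later purchased (illustrative schema)."""
    recommended = bought = 0
    for rec in records:
        recommended += len(rec["recommended"])
        bought += len(set(rec["recommended"]) & set(rec["purchased"]))
    return bought / recommended if recommended else 0.0

BASELINE = 0.50          # baseline hit rate from the article's example
ALERT_THRESHOLD = 0.25   # alert level from the article's example

# Illustrative logged data: 3 of the 6 recommended items were bought.
records = [
    {"recommended": ["sku-a", "sku-b", "sku-c", "sku-d"],
     "purchased": ["sku-a", "sku-b"]},
    {"recommended": ["sku-e", "sku-f"],
     "purchased": ["sku-e", "sku-x"]},
]

rate = recommendation_hit_rate(records)
print(f"hit rate {rate:.0%} (baseline {BASELINE:.0%})")
if rate < ALERT_THRESHOLD:
    print("ALERT: hit rate below threshold - investigate or retrain the model")
```

Running this check on a schedule (daily or weekly) over a rolling window of recent predictions is a simple way to catch the kind of drift the article describes.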