Once you have created a machine-learning model and are ready to use it in the real world, you must choose how it will be used for inference: in real time (also called 'online') or in batch (also called 'offline').
Real-time inference is necessary when your inputs are real-time events that cannot be enumerated at design time (e.g., search terms). In many cases, though, it is not needed. For example, if you have an eCommerce site and want to recommend similar products on each product's detail page, you can simply generate and store predictions at design time for every SKU in your inventory. At run time, you just retrieve those stored predictions rather than running inference on the spot. However, for the same use case, if you want to influence those predictions using the user's breadcrumb trail, real-time inference becomes necessary.
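The "precompute at design time, look up at run time" pattern can be sketched as follows. The catalog, `precompute_similar_items`, and the fake similarity logic are all hypothetical stand-ins for a real inventory and a real trained model:

```python
def precompute_similar_items(catalog):
    """Batch job run at design time: store top recommendations per SKU."""
    predictions = {}
    for sku in catalog:
        # A real system would call the trained model here; this toy
        # version just recommends the first three *other* SKUs.
        predictions[sku] = [s for s in catalog if s != sku][:3]
    return predictions

def recommend(sku, prediction_store):
    """Run-time path: a plain dictionary lookup -- no model call needed."""
    return prediction_store.get(sku, [])

store = precompute_similar_items(["A100", "B200", "C300", "D400"])
print(recommend("B200", store))  # instant lookup, no inference
```

Because the expensive step happens once, ahead of time, the run-time path is just a key-value read, which is cheap and fast regardless of how heavy the underlying model is.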
Similarly, in many cases where you want to personalize predictions for a given user, real-time inference may be necessary. For example, say you are shopping on an online retail website. You place a few items in your cart and click 'checkout'. At that moment, the items in your cart (along with other data the website may have about you) are sent to a recommendation model, which returns a few items for the website to show as recommendations. In this scenario, data is sent to the model and predictions are returned instantly (or near instantly).
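The checkout scenario above can be sketched as a single request/response cycle: the cart contents only exist at inference time, so the prediction must be computed on the spot. Here, a toy co-purchase table stands in for a real trained recommender, and the item names are invented for illustration:

```python
# Hypothetical co-purchase statistics standing in for a trained model.
CO_PURCHASED = {
    "tent": ["sleeping bag", "camp stove"],
    "sleeping bag": ["pillow"],
    "camp stove": ["fuel canister"],
}

def recommend_for_cart(cart_items):
    """Score the cart as it exists right now and return recommendations."""
    recs = []
    for item in cart_items:
        for suggestion in CO_PURCHASED.get(item, []):
            # Skip anything already in the cart or already suggested.
            if suggestion not in cart_items and suggestion not in recs:
                recs.append(suggestion)
    return recs

# At checkout, the cart is sent to the model and predictions come back
# immediately in the same request/response cycle.
print(recommend_for_cart(["tent", "camp stove"]))
```

In production this function would sit behind a live serving endpoint; the key point is that the input arrives with the request and the prediction must be returned before the response.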
Real-time inference powers online retail websites, social media apps, driverless cars, and many other applications.
With batch inference, you don't need the model's predictions immediately. Inputs are run through the model as a group (a 'batch'), and it might take minutes or hours for the model to return a prediction for each observation in the batch.
For example, say you want to predict the price of houses that come on the market each day. Since you only need to do inference once per day, you don't need to send data to the model constantly. Instead, you could collect all the data on the new houses that came on the market (neighborhood, year built, square footage, etc.) and send it to your model in one 'batch'. The model will then return a price prediction for each house. In this scenario, it's acceptable if the model takes an hour (or longer) to create the predictions because you don't need them immediately.
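A nightly batch job along these lines might look like the sketch below. To keep it self-contained, `predict_price` is a hypothetical linear stand-in for the real trained model, and the listings are made up:

```python
def predict_price(house):
    """Toy pricing model: base price plus a per-square-foot rate.
    A real job would load a fitted model from storage instead."""
    return 50_000 + 150 * house["sqft"]

def run_batch(houses):
    """Score every new listing in one pass; latency is not a concern."""
    return [{**h, "predicted_price": predict_price(h)} for h in houses]

# Collected over the day, scored once overnight.
new_listings = [
    {"address": "12 Oak St", "sqft": 1500},
    {"address": "7 Elm Ave", "sqft": 2200},
]
results = run_batch(new_listings)
for r in results:
    print(r["address"], r["predicted_price"])
```

The shape of the job matters more than the model: gather a day's worth of inputs, score them together on a schedule, and write the results somewhere downstream systems can read them.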
The choice between real-time and batch inference largely depends on two things: whether your predictions must incorporate unpredictable real-time events as inputs, and whether they need to be personalized. If your model is part of a user-facing app or website, you will likely need real-time inference, because predictions must be returned to the user immediately and are often personalized.
You might be wondering: if real-time is faster than batch, why not always use real-time inference? The answer is that real-time inference is usually more expensive and requires more tooling than batch. With real-time inference, you typically need a server to invoke the model and return predictions to an app, website, or other service. You also have to meet latency requirements: your model must return predictions in a timely manner. Keeping a server running around the clock is costlier than running a few batch prediction jobs every night, where you only need computing power for a few minutes or hours.