What happens when you deploy an ML model?

An introduction to performative prediction with visuals

Authors
Javier Sanguino Bautiste
Thomas Kehrenberg
Novi Quadrianto

Deployed model = Happy ML engineer?

You are a machine learning engineer and you work for a bank. You create a model to predict the default risk of applicants. Your model works very well on your test set. The company decides to deploy your model, making you worry about its performance in the real world. However, these fears seem unfounded: your model works well in the weeks following deployment. Everyone congratulates you, and you are satisfied because you did your work well.

But… after some time, you realize that the model is not performing as well anymore. Repeat applicants have changed their financial characteristics to increase their chances of getting a loan. These applicants have learned to “game” your model. New applicants no longer follow the training distribution. The mere deployment of your model has triggered a distribution shift in the data!

Once an ML model is deployed, it undoubtedly has an effect on the real world. This effect has been largely overlooked in machine learning, as model deployment is often the final step that the ML engineer is involved in. But if you want your model to work in practice, you need to take these effects into account.

Scenarios like the one described above have been formalized in the field of performative prediction [perdomo2020performative]. Performative prediction occurs when deploying a model causes a distribution shift in the data. Let \(\theta\) be the model parameters. We can then model the effect as a dependency of the data distribution on \(\theta\) through a distribution map, \(\mathcal{D}(\theta)\): a function from the set of model parameters to the space of data distributions. For a given model, \(\mathcal{D}(\theta)\) gives the data distribution induced by the model parameters \(\theta\). (We will often use \(\theta\) to refer to both the model parameters and the model itself, as is common in the literature.)
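To make the distribution map more tangible, here is a minimal sketch of what sampling from \(\mathcal{D}(\theta)\) could look like. The Gaussian mean-shift family and all names in the snippet (`sample_distribution_map`, `mu`, `epsilon`) are illustrative assumptions of ours, not part of the formal definition:

```python
import numpy as np

def sample_distribution_map(theta, n, rng, mu=1.0, epsilon=0.5):
    """Draw n samples from a toy distribution map D(theta).

    Here D(theta) is a Gaussian whose mean shifts linearly with the
    deployed parameter theta; epsilon controls how strongly the
    deployment perturbs the data.
    """
    return rng.normal(loc=mu + epsilon * theta, scale=1.0, size=n)

rng = np.random.default_rng(0)
z_no_model = sample_distribution_map(theta=0.0, n=1000, rng=rng)
z_deployed = sample_distribution_map(theta=2.0, n=1000, rng=rng)
print(z_no_model.mean(), z_deployed.mean())  # deployment moved the mean
```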

The following diagram illustrates this concept. On the left-hand side, we have the situation in the typical ML setup: There is a fixed data distribution \(\mathcal{D}\) on which we train our model \(\theta\). On the right-hand side, we have the performative prediction setup: The distribution map \(\mathcal{D}(\theta)\) defines a data distribution on which we train our model \(\theta\), but in turn, the model parameters affect the data distribution.

[Figure: the typical ML setup (left) vs. the performative prediction setup (right)]

In order to formalize this feedback loop between the model and the data distribution, we describe it as an iterative process, which is shown in the following diagram. Before the process begins, we have an initial data distribution \(\mathcal{D}_0\), on which the first model \(\theta^{(0)}\) is trained. Deploying this first model induces the distribution \(\mathcal{D}(\theta^{(0)})\). At each subsequent time step, we first train the model \(\theta^{(t+1)}\) on the data distribution \(\mathcal{D}(\theta^{(t)})\). Then the model is deployed, which causes a new distribution shift to \(\mathcal{D}(\theta^{(t+1)})\):

[Figure: Iterations in performative prediction]

The fact that the environment reacts to our model makes it difficult to evaluate the model's performance. In the traditional ML setup, we can measure the performance of a model on a fixed data distribution; in the performative prediction setup, we cannot, because each deployment changes the distribution.

How to measure the model's performance, then?

In order to measure performance at the different steps of performative prediction, one has to measure the risk of the model \(\theta^{(t+1)}\) on the data distribution \(\mathcal{D}(\theta^{(t)})\), but also the risk of the model \(\theta^{(t+1)}\) on the data distribution \(\mathcal{D}(\theta^{(t+1)})\). The decoupled performative risk allows us to do both. To avoid confusion between the different model weights, we introduce the notation \(\theta_M\) for the parameters of the model we want to evaluate and \(\theta_D\) for the parameters of the model that defines the data distribution. The decoupled performative risk is then simply the risk of a model \(\theta_M\) on the distribution \(\mathcal{D}(\theta_D)\):

\[ \mathcal{DPR}(\theta_D, \theta_M) := \mathbb{E}_{(x,y) \sim \mathcal{D}(\theta_D)} \big[\ell(\theta_M, x, y)\big] \]

where \(\ell(\theta, x, y)\) is the loss function of the model \(\theta\) on the data sample \((x,y)\).


Ultimately, however, we are mostly interested in the risk of the model on the distribution induced by its own parameters. This risk is referred to as the performative risk and is defined as:

\[ \mathcal{PR}(\theta) := \mathcal{DPR}(\theta, \theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}(\theta)} [\ell(\theta, x, y)] \]

In contrast to the risk in the traditional ML setup, the data distribution here is not fixed but depends on \(\theta\).
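As a sketch of how one might estimate both risks by Monte Carlo sampling, consider a toy distribution map where \(y \sim \text{Bernoulli}(0.7 - 0.3\,\theta_D)\) with a squared loss. This map and the function names are illustrative assumptions of ours (we reuse the same toy map in the sketches below):

```python
import numpy as np

def sample_D(theta_D, n, rng):
    """Toy distribution map: y ~ Bernoulli(0.7 - 0.3 * theta_D)."""
    p = np.clip(0.7 - 0.3 * theta_D, 0.0, 1.0)
    return rng.binomial(1, p, size=n)

def dpr(theta_D, theta_M, n=200_000, seed=0):
    """Monte Carlo estimate of DPR(theta_D, theta_M) with squared loss."""
    rng = np.random.default_rng(seed)
    y = sample_D(theta_D, n, rng)
    return np.mean((y - theta_M) ** 2)

def pr(theta, **kwargs):
    """Performative risk: the risk of theta on its own induced distribution."""
    return dpr(theta, theta, **kwargs)

print(dpr(0.5, 0.2), pr(0.5))  # a model on D(0.5), and a model on its own map
```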

Points of interest in performative prediction

One natural solution to the problem of performative prediction is deploying a model that, after it has affected the data distribution, does not require retraining, i.e., a model robust to the distribution shift it causes. This model is optimal for the distribution induced by itself:

\[ \theta_{ST} = \operatorname*{argmin}_{\theta} \mathbb{E}_{(x,y)\sim \mathcal{D}(\theta_{ST})}[\ell(\theta, x, y)] \]

However, this solution is not optimal for the closed-loop interaction between the data and the model, i.e., it is not, in general, the minimum of the performative risk. As the distribution will always adapt to the newly deployed model, the minimum of the performative risk is also a very relevant solution point:

\[ \theta_{OP} = \operatorname*{argmin}_{\theta} \mathcal{PR}(\theta) = \operatorname*{argmin}_{\theta} \mathbb{E}_{(x,y)\sim \mathcal{D}(\theta)}[\ell(\theta, x, y)] \]

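For the toy Bernoulli map from the earlier sketch, both points can be located numerically, and, notably, they do not coincide. The closed-form expectation below uses \(\mathbb{E}[(y-\theta_M)^2] = p(1-\theta_M)^2 + (1-p)\theta_M^2\) for \(y \sim \text{Bernoulli}(p)\); the grid search is just an illustrative way to find both points:

```python
import numpy as np

def p(theta):
    """Mean of the toy distribution map: y ~ Bernoulli(0.7 - 0.3 * theta)."""
    return 0.7 - 0.3 * theta

def dpr(theta_D, theta_M):
    """Exact DPR for squared loss under the toy Bernoulli map."""
    return p(theta_D) * (1 - theta_M) ** 2 + (1 - p(theta_D)) * theta_M ** 2

grid = np.linspace(0.0, 1.0, 100_001)

# Optimal point: minimizer of PR(theta) = DPR(theta, theta).
theta_OP = grid[np.argmin(dpr(grid, grid))]

# Stable point: fixed point of the best response. For squared loss, the
# best response to D(theta) is the mean p(theta), so we solve theta = p(theta).
theta_ST = grid[np.argmin(np.abs(grid - p(grid)))]

print(theta_ST, theta_OP)  # ~0.5385 vs ~0.5313: stable and optimal differ
```

The gap is small here but real: the stable point only best-responds to the distribution it currently faces, while the optimal point also accounts for how changing \(\theta\) moves \(p(\theta)\).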

How to reach them?

The performative prediction literature started out by focusing on how to find the stable solution, as it is more mathematically tractable. The first algorithms were based on retraining the model and can be summarized as:

  1. Get data samples from the distribution induced by \(\theta^{(t)}\): \( (x,y)\sim \mathcal{D}(\theta^{(t)})\)
  2. Train the model on those data samples: \(\theta^{(t)} \rightarrow \theta^{(t+1)}\)
  3. Deploy the model, causing a new distribution shift: \(\mathcal{D}(\theta^{(t)}) \rightarrow \mathcal{D}(\theta^{(t+1)})\)

If the model is fully optimized at each step, we call this algorithm Repeated Risk Minimization (RRM); if only one optimizer step is performed, we call it Repeated Gradient Descent (RGD); and if several gradient steps are performed, we call it k-greedy RGD.
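Here is a minimal sketch of RRM and RGD on the same toy Bernoulli map (our illustrative assumption, not a real deployment). Note that for the squared loss, fully retraining on a sample amounts to taking its mean:

```python
import numpy as np

def sample_D(theta, n, rng):
    """Toy distribution map: y ~ Bernoulli(0.7 - 0.3 * theta)."""
    return rng.binomial(1, np.clip(0.7 - 0.3 * theta, 0.0, 1.0), size=n)

def rrm(theta=0.0, steps=20, n=50_000, seed=0):
    """Repeated Risk Minimization: fully re-fit after every deployment."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        y = sample_D(theta, n, rng)  # 1. observe samples from D(theta^(t))
        theta = y.mean()             # 2. retrain: exact minimizer of squared loss
        # 3. deploying theta shifts the distribution, reflected in the next sample
    return theta

def rgd(theta=0.0, steps=300, n=50_000, lr=0.1, seed=0):
    """Repeated Gradient Descent: one gradient step per deployment."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        y = sample_D(theta, n, rng)
        theta -= lr * 2 * (theta - y.mean())  # gradient of the mean squared loss
    return theta

print(rrm(), rgd())  # both approach ~0.5385, the stable point, not ~0.5313
```

Both procedures converge to the stable point of this map, not to the performative optimum.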

Note that these algorithms do not use any information about the distribution map when training the model; they only use the data samples induced by the current model parameters. Therefore, although they are very easy to apply in practice (wait until the distribution shifts, observe new data samples, and retrain), they take many steps to converge and do not find the optimal solution.

The convergence guarantees of these algorithms rely on the convexity and smoothness of the loss function \(\ell(\theta, x, y)\) in \(\theta\), and on the sensitivity of the distribution map \(\mathcal{D}(\theta)\) to the model parameters \(\theta\): a small change in the parameters must induce only a small change in the distribution. (Concretely, for a \(\gamma\)-strongly convex and \(\beta\)-smooth loss, RRM converges to the stable point when the distribution map is \(\varepsilon\)-sensitive with \(\varepsilon < \gamma/\beta\) [perdomo2020performative].)

In these initial algorithms, the only guarantee regarding optimality is that the stable point lies close to the optimal point under certain conditions, which cannot be checked in practice.

Later, the literature started focusing on how to find the optimal solution directly. One of these algorithms, Performative Gradient Descent (PerfGD), relies on the performative gradient:

\[\nabla_{\theta} \mathcal{PR}(\theta) = \nabla_{\theta} \mathbb{E}_{(x,y) \sim \mathcal{D}(\theta)} [\ell(\theta, x, y)]~.\]

This gradient is difficult to calculate due to the dependency of the data distribution on the model parameters. (When finding the stable point, one only needs to calculate \(\mathbb{E}_{(x,y) \sim \mathcal{D}(\theta)} [ \nabla_{\theta} \ell(\theta, x, y)]\); that is why it is more mathematically tractable.) Two possibilities have been proposed to estimate it: REINFORCE-style estimators [izzo2021learn] and the reparametrization trick.
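To see why the dependence on \(\theta\) is the crux, assume (an assumption on our part, for illustration) that the distribution map admits a density \(p_{\theta}(x, y)\) that is differentiable in \(\theta\). The standard log-derivative (REINFORCE) identity then splits the performative gradient into the familiar term plus a correction for the shift itself:

\[ \nabla_{\theta} \mathcal{PR}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}(\theta)} \big[\nabla_{\theta} \ell(\theta, x, y)\big] + \mathbb{E}_{(x,y) \sim \mathcal{D}(\theta)} \big[\ell(\theta, x, y)\, \nabla_{\theta} \log p_{\theta}(x, y)\big] \]

The first term is exactly what the stable-point methods follow; the second term requires knowing, or estimating from observed shifts, how the density changes with \(\theta\), and that is what makes finding the optimal point harder.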


Conclusion

Now imagine you are a machine learning engineer working at the company of your dreams. You were hired some time ago, and you are excited because you have been working on the next big thing: the soon-to-be-released model that works amazingly well on your test sets. AGI is just around the corner! But… have you considered the effects that this model will have on the data distribution? If the internet is populated with text and images created by your model… you might not be able to train the next model. And it seems that retraining is not enough…



Now imagine that you are a policy maker in charge of measuring the impact of a new model. You are very worried about the possible outcomes of a specific model… Have you also considered that the model has power over the distribution of the data? Is this effect beneficial for the people using the model?

References