Reverse Concept Drift (RCD)

How to use the Reverse Concept Drift Algorithm

Reverse Concept Drift (RCD) belongs to the NannyML Cloud family of algorithms and can be accessed through a NannyML Cloud license or as a standalone algorithm.

How Reverse Concept Drift Works

Concept Drift

The Reverse Concept Drift (RCD) algorithm focuses on the impact of concept shift on the model's performance. This keeps the method simpler and its results easier to interpret. For the combined impact of all factors on the model's performance, we need to look no further than the actual realized performance.

Intuition

When we have concept drift, we know there is a new concept in the monitored data compared to what we have in our reference data. We can train a new machine learning model and learn the new concept in order to compare it with the existing one. But how do we make a meaningful comparison?

As mentioned, performance change is a combination of covariate shift and concept drift. We can factor out the impact of covariate shift, as well as its interaction with concept shift, by focusing on the reference dataset. How can we do that? We use the concept we learned on the monitored data to make predictions on the reference dataset and treat those predictions as ground truth. This allows us to estimate how the monitored model would perform on the reference data if the monitored data's concept applied there.

Implementation

The impact of concept drift on performance is calculated based on the following steps:

  1. Train a LightGBM model on a chunk of the monitored data.

  2. Use the learned concept to make predictions on the reference dataset.

  3. Estimate model performance on the reference data, assuming the monitored concept model's predictions are the ground truth. A key detail here is that we use the predicted scores, not the predicted labels. This makes the calculation more accurate at the cost of added complexity. The calculation uses CBPE in an inverse way, where the fractional results come from the y_true column rather than the y_pred_proba column.

  4. Subtract the model's actual performance on the reference data from the estimated performance. The result is the performance impact on the model that is due to concept shift alone. For comparison, the full performance impact under both concept shift and covariate shift is the model's performance on the chunk data minus its performance on the reference data. This is why those results are also labeled with the Performance Impact Estimation (PIE) acronym.
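As a rough illustration of steps 3 and 4, the sketch below treats the concept model's predicted scores on the reference rows as soft ground truth and computes an expected accuracy for the monitored model's predictions. It is not NannyML's implementation: the choice of accuracy as the metric, the function names, and the toy numbers are all assumptions made for this example. Steps 1 and 2 (training a concept model on a chunk and scoring the reference data) are assumed to have been done already.

```python
def expected_accuracy(y_pred, concept_scores):
    """Expected accuracy of hard predictions when P(y=1) is given by concept_scores."""
    if len(y_pred) != len(concept_scores):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for label, p in zip(y_pred, concept_scores):
        # A predicted 1 is correct with probability p, a predicted 0 with 1 - p.
        total += p if label == 1 else 1.0 - p
    return total / len(y_pred)


def realized_accuracy(y_pred, y_true):
    """Plain accuracy against the actual reference labels."""
    hits = sum(1 for label, truth in zip(y_pred, y_true) if label == truth)
    return hits / len(y_pred)


def concept_shift_impact(y_pred, y_true, concept_scores):
    """Estimated performance under the chunk's concept minus realized reference performance."""
    return expected_accuracy(y_pred, concept_scores) - realized_accuracy(y_pred, y_true)


# Toy reference data, made up purely for illustration.
ref_y_pred = [1, 1, 0, 0]               # monitored model's predicted labels on reference
ref_y_true = [1, 1, 0, 0]               # actual reference labels
concept_scores = [0.9, 0.4, 0.2, 0.6]   # concept model's P(y=1) on the same rows

impact = concept_shift_impact(ref_y_pred, ref_y_true, concept_scores)
print(f"Estimated impact of concept shift on accuracy: {impact:+.3f}")
```

With these toy values the estimated accuracy under the new concept is about 0.625 against a realized accuracy of 1.0, so the estimated impact is about -0.375. Because both terms are computed on the reference rows, covariate shift in the chunk does not enter the number.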

However, Reverse Concept Drift (RCD) also offers another approach. It substitutes steps 3 and 4 with the following step:

Assumptions and Limitations

RCD rests on some assumptions in order to give accurate results. Those assumptions are also its limitations. Let's see what they are:

1. The data available are large enough.

We need enough data to accurately train a density ratio estimation model and to properly multi-calibrate.

RCD will likely fail if there is covariate shift to previously unseen regions in the model input space. Mathematically, the support of the analysis data needs to be a subset of the support of the reference data. If it is not, density ratio estimation is theoretically undefined. Practically, if the reference data contains no observations from a region that appears in the analysis data, we can't account for that shift with a weighted calculation from reference data.
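One crude way to spot the unseen-region problem before trusting the results is to check, feature by feature, how much of the analysis data falls outside the range observed in the reference data. The sketch below is only a heuristic and is not part of RCD: the function name and the sample rows are hypothetical, it handles numeric features only, and a per-feature range check can miss unseen combinations of values that are each individually in range.

```python
def fraction_outside_reference_range(reference_rows, analysis_rows):
    """Per feature, the fraction of analysis rows outside the reference min/max."""
    report = {}
    for name in reference_rows[0].keys():
        ref_values = [row[name] for row in reference_rows]
        low, high = min(ref_values), max(ref_values)
        outside = sum(1 for row in analysis_rows if not low <= row[name] <= high)
        report[name] = outside / len(analysis_rows)
    return report


# Hypothetical rows, made up for illustration.
reference = [
    {"age": 25, "income": 30.0},
    {"age": 40, "income": 55.0},
    {"age": 60, "income": 80.0},
]
analysis = [
    {"age": 30, "income": 50.0},
    {"age": 72, "income": 60.0},  # age lies beyond anything seen in reference
]

report = fraction_outside_reference_range(reference, analysis)
for feature, fraction in report.items():
    print(f"{feature}: {fraction:.0%} of analysis rows outside reference range")
```

A non-zero fraction for any feature suggests the analysis data reaches regions the reference data does not cover, in which case the results should be read with caution.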

2. Our machine learning algorithm is able to capture the concept accurately.

We are using a LightGBM model to learn the concept. In cases where that algorithm cannot perform well enough, RCD will not provide accurate results.
