Exploring the Dynamics of Machine Learning Lifecycle: Understanding Data Drift and Concept Drift

Ali Issa
Feb 20, 2024



I’ve been reading about the machine learning project lifecycle. A typical ML project moves through the following four stages:

1️⃣ Scope: Define the scope of the project. Decide what to work on, what the inputs and outputs are, which metrics to track, and roughly what resources are needed. For example, a speech recognition system takes audio as input and returns a transcription as output, with accuracy and latency as the metrics.

2️⃣ Data: This is where we define, organize, and label the data. Consistent labeling is crucial, because inconsistent labels confuse the model during training.

3️⃣ Modeling: We select a model and train it on the labeled data, which means writing the training code and choosing the hyperparameters. In production, though, the focus shifts toward the hyperparameters and the data, because data drift or concept drift can make the live data distribution differ from the distribution of the training data. Error analysis then identifies where the model is weak, and the model is retrained to address those weaknesses.

4️⃣ Deployment: We deploy the model and monitor it for data drift and concept drift. Both can appear in production when the inputs the model sees, or the relationship between inputs and outputs, no longer match the training data.
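As a rough illustration of the monitoring step, here is a minimal sketch of a rolling error monitor. It assumes that ground-truth values eventually become available for production predictions, which is not always the case. The class name `ErrorMonitor`, the window size, and the tolerance factor are hypothetical choices for this example, not something prescribed by the lifecycle itself.

```python
from collections import deque


class ErrorMonitor:
    """Tracks the mean absolute error over the most recent predictions."""

    def __init__(self, baseline_error, window=100, tolerance=1.5):
        self.baseline_error = baseline_error  # error measured at training time
        self.tolerance = tolerance            # hypothetical alert factor
        self.errors = deque(maxlen=window)    # keeps only the last `window` errors

    def record(self, prediction, actual):
        self.errors.append(abs(prediction - actual))

    def rolling_error(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def drift_suspected(self):
        # Only raise a flag once the window is full, to avoid noisy early alerts.
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.rolling_error() > self.tolerance * self.baseline_error


monitor = ErrorMonitor(baseline_error=1.0, window=5, tolerance=1.5)

# Errors close to the baseline: no alert expected.
for actual in [11, 9, 11, 9, 11]:
    monitor.record(prediction=10, actual=actual)
healthy = monitor.drift_suspected()

# Errors three times the baseline: the window fills with large errors.
for actual in [13, 7, 13, 7, 13]:
    monitor.record(prediction=10, actual=actual)
degraded = monitor.drift_suspected()

print(healthy, degraded)
```

A rising error alone does not say which kind of drift occurred. It only signals that the model should be investigated and possibly retrained.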

𝐂𝐨𝐧𝐜𝐞𝐩𝐭 𝐝𝐫𝐢𝐟𝐭 𝐚𝐧𝐝 𝐃𝐚𝐭𝐚 𝐝𝐫𝐢𝐟𝐭

𝐂𝐨𝐧𝐜𝐞𝐩𝐭 𝐝𝐫𝐢𝐟𝐭

With concept drift, the distribution of the input stays the same, but the relationship between input and output changes. For example, suppose we have a model that predicts the price of a house from its size. During COVID-19, house prices changed. So in the mapping X->Y, the Y for a given X has changed, while X (the size of the house) is distributed as before.
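To make this concrete, here is a small simulation sketched under made-up assumptions: prices are a noisy multiple of size, and the hypothetical price per square meter jumps from 2000 to 2600. All numbers and function names are illustrative. A model fit on the old data keeps seeing the same sizes, yet its error grows because the mapping from size to price has moved.

```python
import random


def fit_slope(sizes, prices):
    """Least-squares slope for the model price = slope * size (no intercept)."""
    return sum(s * p for s, p in zip(sizes, prices)) / sum(s * s for s in sizes)


def mean_abs_error(slope, sizes, prices):
    return sum(abs(slope * s - p) for s, p in zip(sizes, prices)) / len(sizes)


rng = random.Random(0)

# X: house sizes drawn from the SAME distribution before and after the drift.
sizes_before = [rng.uniform(50, 200) for _ in range(500)]
sizes_after = [rng.uniform(50, 200) for _ in range(500)]

# Y: the price per square meter changes (hypothetical values), so X->Y changes.
prices_before = [2000 * s + rng.gauss(0, 5000) for s in sizes_before]
prices_after = [2600 * s + rng.gauss(0, 5000) for s in sizes_after]

# Train on pre-drift data only, then evaluate on both periods.
slope = fit_slope(sizes_before, prices_before)
mae_before = mean_abs_error(slope, sizes_before, prices_before)
mae_after = mean_abs_error(slope, sizes_after, prices_after)

print(f"learned price per m^2: {slope:.0f}")
print(f"MAE before drift: {mae_before:.0f}")
print(f"MAE after drift:  {mae_after:.0f}")
```

Looking only at the input sizes would not reveal this kind of drift. It shows up in the prediction error once the true prices are known.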

𝐃𝐚𝐭𝐚 𝐝𝐫𝐢𝐟𝐭

In contrast, with data drift, the distribution of the input X changes. Using the same example of a model that predicts house prices from house size, if people start building larger houses, the input distribution changes. This is called data drift: the mapping X->Y may stay the same while X itself shifts.
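Because data drift concerns only the inputs, it can be checked without waiting for labels. One common option, used here purely as an illustration, is the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the empirical distributions of two samples. The sketch below implements it with the standard library. The house-size distributions and the 0.1 alert threshold are hypothetical.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a_sorted + b_sorted:
        # Fraction of each sample that is <= x.
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


rng = random.Random(0)

# House sizes seen during training (hypothetical: mean 120, spread 30).
train_sizes = [rng.gauss(120, 30) for _ in range(1000)]

# Production inputs: one batch from the same distribution, one with larger houses.
prod_same = [rng.gauss(120, 30) for _ in range(1000)]
prod_larger = [rng.gauss(160, 30) for _ in range(1000)]

THRESHOLD = 0.1  # hypothetical alert level; tune it for the real data

stat_same = ks_statistic(train_sizes, prod_same)
stat_larger = ks_statistic(train_sizes, prod_larger)

print(f"same distribution: {stat_same:.3f}, drift flagged: {stat_same > THRESHOLD}")
print(f"larger houses:     {stat_larger:.3f}, drift flagged: {stat_larger > THRESHOLD}")
```

In practice a library routine such as `scipy.stats.ks_2samp` also returns a p-value, which gives a more principled decision rule than a fixed threshold.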
