Early Access: The content on this website is provided for informational purposes only in connection with pre-General Availability Qlik Products.
All content is subject to change and is provided without warranty.

Data leakage

Data leakage means that the data used to train a machine learning algorithm includes the information you are trying to predict. As a result, the model can perform better in training than it would in the real world, which gives a false sense of how well it performs. Learn how to identify and prevent data leakage to get reliable predictions.

Generally speaking, data leakage is caused by at least one of the following:

  • When one or more features in the training set can be used to derive the target variable you are trying to predict. For example, your target is a Sales field and one of your features is a Sales Tax field that is calculated from Sales.

  • When one or more features in the training set include information that would not be known at the time of prediction.
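
As a rough illustration of the first cause, a numeric feature that is calculated from the target will correlate almost perfectly with it. The sketch below is not part of AutoML; the column names, the sample values, the 8% tax rate, and the 0.99 threshold are all assumptions chosen for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def suspicious_features(columns, target, threshold=0.99):
    """Names of numeric features whose |correlation| with the target >= threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical data: the target is Sales, and sales_tax is calculated from it.
sales = [100.0, 250.0, 80.0, 400.0, 120.0]
features = {
    "sales_tax": [s * 0.08 for s in sales],        # derived from the target
    "store_size": [30.0, 45.0, 50.0, 20.0, 60.0],  # unrelated to how Sales is computed
}
flagged = suspicious_features(features, sales)
print(flagged)
```

A near-perfect correlation is only a prompt to look at how the column is produced. A legitimately strong predictor can also score high, so the final call is a business judgment.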

In the following table, the Stage column carries the same outcome as Stage (Binary), the column we want to predict. Including Stage in the training dataset would hand the model the answer, which leads to a misleadingly high score.

Table with the "leaky column" Stage that contains information about the target column Stage (Binary)

Total Employees | Annual Revenue (M$) | Lead Source       | Forecast Deal ($) | Stage           | Stage (Binary)
----------------|---------------------|-------------------|-------------------|-----------------|---------------
12078           | 2705                | Partner           | 369,000           | 6 - Closed/Lost | LOST
10076           | 1783                | Inside sales      | 71,000            | 6 - Closed/Won  | WON
8518            | 2114                | Inside sales      | 294,000           | 6 - Closed/Lost | LOST
3978            | 1159                | Sales rep         | 214,000           | 6 - Closed/Won  | WON
3517            | 2285                | Marketing promo   | 154,000           | 6 - Closed/Lost | LOST
3370            | 97                  | Customer referral | 41,000            | 6 - Closed/Won  | WON
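
One way to spot a column like Stage by hand is to check whether each of its values always appears with the same target value. The following is a minimal sketch using the rows from the table; the function name and the "must have repeated values" guard are assumptions made for this example, and it does not describe how AutoML detects leakage.

```python
from collections import defaultdict

def determines_target(feature_values, target_values):
    """True if the feature has repeated values and each value maps to one target value."""
    seen = defaultdict(set)
    for f, t in zip(feature_values, target_values):
        seen[f].add(t)
    # Columns where every value is unique (IDs, amounts) would pass trivially,
    # so require at least one repeated value before flagging.
    has_repeats = len(seen) < len(feature_values)
    return has_repeats and all(len(targets) == 1 for targets in seen.values())

# Columns taken from the table above.
target = ["LOST", "WON", "LOST", "WON", "LOST", "WON"]
columns = {
    "Lead Source": ["Partner", "Inside sales", "Inside sales",
                    "Sales rep", "Marketing promo", "Customer referral"],
    "Forecast Deal ($)": [369000, 71000, 294000, 214000, 154000, 41000],
    "Stage": ["6 - Closed/Lost", "6 - Closed/Won", "6 - Closed/Lost",
              "6 - Closed/Won", "6 - Closed/Lost", "6 - Closed/Won"],
}
leaky = [name for name, values in columns.items()
         if determines_target(values, target)]
print(leaky)
```

With only six rows this check is fragile. On a real dataset a column can map cleanly to the target by chance, so treat a hit as a candidate to review rather than proof of leakage.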

Target leakage

Target leakage is a form of data leakage. It occurs when feature data references the target data, so that the feature can be used to infer the value you are trying to predict. The references, or "leakages", can be direct or indirect.

With intelligent model optimization, AutoML identifies target leakage and prevents it from being introduced into your models. Features indicating target leakage are automatically detected and removed from model training. For more information about intelligent model optimization, see Intelligent model optimization.

Identifying data leakage

To identify data leakage, consider questions like "Will you have the same information for records at the time you want to make a prediction?" or "Will the record be the same in 30 days?". Remember that all data in your training dataset must be relevant to the time constraint in your business question.
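
The first question can be turned into a simple check if you know when each feature value becomes available. The sketch below assumes a hypothetical mapping from feature name to the date its value is first known for a record; the feature names and dates are invented for illustration.

```python
from datetime import date

def unavailable_at_prediction(feature_known_on, prediction_date):
    """Names of features whose values only become known after prediction_date."""
    return sorted(name for name, known_on in feature_known_on.items()
                  if known_on > prediction_date)

# Hypothetical sales opportunity: when is each field filled in?
known_on = {
    "lead_source": date(2024, 1, 5),
    "forecast_deal": date(2024, 1, 20),
    "close_date": date(2024, 3, 15),
    "commission_paid": date(2024, 4, 1),
}
late = unavailable_at_prediction(known_on, prediction_date=date(2024, 2, 1))
print(late)
```

Any feature returned here would not exist yet when the prediction is made, so it is a candidate to exclude from training.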

When you have trained a model, you can look for the following clues in the model metrics.

  • High scores: Is the score suspiciously high? For example, is the F1 score above 0.85 (85%)?

  • Feature importance: Is one feature a lot more important than everything else?

  • Holdout score: Is the holdout score much lower than the cross-validation score?
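
The three clues above can be combined into a quick screening function. This is a sketch only: scores are assumed to be on a 0 to 1 scale, the 0.85 cutoff mirrors the example in the list, and the dominance ratio and maximum gap are arbitrary assumptions that you would tune for your own data.

```python
def leakage_clues(f1, importances, cv_score, holdout_score,
                  high_score=0.85, dominance_ratio=3.0, max_gap=0.10):
    """Return the list of leakage warning signs found in the model metrics."""
    clues = []
    if f1 > high_score:
        clues.append("high score")
    ranked = sorted(importances.values(), reverse=True)
    # Top feature is at least dominance_ratio times as important as the runner-up.
    if len(ranked) >= 2 and ranked[0] > 0 and ranked[0] >= dominance_ratio * ranked[1]:
        clues.append("dominant feature")
    if cv_score - holdout_score > max_gap:
        clues.append("holdout gap")
    return clues

# Hypothetical metrics for a model trained with a leaky column.
clues = leakage_clues(
    f1=0.97,
    importances={"Stage": 0.82, "Lead Source": 0.07, "Annual Revenue": 0.06},
    cv_score=0.96,
    holdout_score=0.71,
)
print(clues)

# Hypothetical metrics for a model without obvious warning signs.
clean = leakage_clues(
    f1=0.72,
    importances={"a": 0.30, "b": 0.25, "c": 0.20},
    cv_score=0.73,
    holdout_score=0.70,
)
print(clean)
```

None of these signs proves leakage on its own. A well-posed problem can legitimately produce a high score, so use the result to decide which features to inspect.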

The table shows examples of common features that might cause data leakage.

Business use case                       | Target                         | Potentially leaky features
----------------------------------------|--------------------------------|---------------------------
Will a sales opportunity close?         | Close (Yes or No)              | Stage, close date, invoice details, commissions paid
Predict a future transaction amount     | Amount of the next transaction | Taxes, order details
Will a lead convert to an opportunity?  | Convert (Yes or No)            | Opportunity details, conversion date
Will a customer churn?                  | Churn (Yes or No)              | Churn reason, churn date, static customer tenure, customer temperature
Will an employee voluntarily terminate? | Terminate (Yes or No)          | Exit interview details, termination date, resignation letter information

Preventing data leakage

The best way to prevent data leakage is to use a structured framework to define a precise business question and to build a dataset that matches it. For more information, see Defining machine learning questions.

Tip: If you have identified a leaky column that should not be used in the model training, you can still keep it in the dataset. Just exclude this feature from the training data in your machine learning experiment.
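
Outside of the experiment interface, the same idea applies to any training script: leave the dataset untouched and drop the leaky column only when building the training inputs. The sketch below is generic Python, not a Qlik API; the function name and the row layout are assumptions for illustration.

```python
def split_features(rows, target, excluded):
    """Build training inputs from rows, skipping the target and any excluded columns."""
    skip = set(excluded) | {target}
    X = [{k: v for k, v in row.items() if k not in skip} for row in rows]
    y = [row[target] for row in rows]
    return X, y

# Two rows from the example table; the dataset still contains Stage.
rows = [
    {"Lead Source": "Partner", "Forecast Deal ($)": 369000,
     "Stage": "6 - Closed/Lost", "Stage (Binary)": "LOST"},
    {"Lead Source": "Inside sales", "Forecast Deal ($)": 71000,
     "Stage": "6 - Closed/Won", "Stage (Binary)": "WON"},
]
X, y = split_features(rows, target="Stage (Binary)", excluded=["Stage"])
print(X[0])
print(y)
```

Because the original rows are not modified, the excluded column remains available for reporting or for later analysis.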