Unlocking Insights: The Value of Missing Data in Data Science
Chapter 1: Understanding Missing Values
When I begin analyzing a new dataset, one of my first tasks is to assess the extent of missing values. Encountering a high number of NaN entries can be disheartening. It's quite common for some columns to contain over 50% missing data. The excitement surrounding new datasets often stems from the potential for more data to drive improvements. However, just like previous datasets, these new ones can be riddled with USELESS MISSING VALUES!
But hold on! Are these missing values truly without merit? Could they potentially offer additional insights?
In this brief article, I aim to shift your perspective on missing data. I will show how to read missing values differently and how to use them to improve your model's performance. One thing that separates a skilled data scientist from an amateur is how they approach missing values and what they learn from them.
Let me pose a question: What is your typical response when you encounter missing values? Many of us drop them right away and never look back, while others may spend weeks trying to impute them.
Eliminating or imputing missing values without first understanding their underlying causes has several drawbacks:
- Removing rows with missing values may cause us to overlook other non-missing data that could enhance our predictive models.
- Discarding columns with a high volume of missing entries also results in the loss of other valuable non-missing data, which is a wasteful approach.
- In some cases, missing values themselves might convey essential information that can improve our analysis or model performance.
- Finally, indiscriminately removing or imputing missing values can introduce biases into our models.
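The cost of dropping rows, the first drawback in the list above, can be measured before committing to it. The following is a minimal sketch using pandas; the DataFrame and its column names (`age`, `income`, `city`) are made-up toy data for illustration, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): three columns, each with a few gaps.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, np.nan, 45],
    "income": [52000, 61000, np.nan, 48000, 75000, np.nan],
    "city": ["Oslo", "Lima", "Pune", None, "Kyiv", "Reno"],
})

# Share of missing entries per column.
missing_share = df.isna().mean()

# Number of rows that survive a blanket dropna().
rows_kept = len(df.dropna())

# Observed (non-missing) cells that are thrown away with the dropped rows.
dropped_rows = df[df.isna().any(axis=1)]
observed_cells_lost = int(dropped_rows.notna().sum().sum())

print(missing_share)
print(f"rows kept: {rows_kept} of {len(df)}")
print(f"observed cells discarded: {observed_cells_lost}")
```

In this toy case no column is more than a third empty, yet a blanket `dropna()` keeps only one row out of six, which shows how quickly scattered gaps add up.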
To be clear, I don't oppose removing or imputing missing values, but it's vital to ask yourself a few key questions first. Here are three to consider before you delete or impute missing data.
What caused these missing values?
Consider a dataset, compiled by a medical research team, about individuals examined for a specific type of cancer. One column with many missing values holds biopsy results. It makes sense for individuals without cancer to have no biopsy result, but a high number of missing entries among diagnosed patients raises questions.
One plausible explanation is that the research team was unable to obtain biopsy results for certain patients. Alternatively, it could indicate that physicians chose not to order biopsies due to other compelling factors. For instance, a doctor may have been confident in a cancer diagnosis based on the results of other tests, or perhaps they were unsure about the diagnosis and opted against recommending a biopsy.
Ignoring or removing these missing values can lead to significant information loss. In this instance, missing values represent a segment of patients with either clear or ambiguous symptoms, while those with available biopsy results had enough symptoms to warrant further testing.
As you can see, missing values can provide vital information. Understanding the context behind a missing value is key to unlocking its potential insights.
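One common way to keep the information carried by a missing value is to record the missingness itself as a feature before any imputation. The sketch below assumes pandas and uses invented toy data; the names `biopsy_score`, `outcome`, and `biopsy_missing` are hypothetical and only loosely mirror the biopsy example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a biopsy measurement with gaps and a binary outcome.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "biopsy_score": [0.8, np.nan, 0.3, np.nan, np.nan, 0.6],
    "outcome": [1, 1, 0, 1, 1, 0],
})

# Indicator column: 1 where the biopsy result is absent, 0 where it is present.
patients["biopsy_missing"] = patients["biopsy_score"].isna().astype(int)

# Compare the average outcome between rows with and without a biopsy result.
outcome_by_missing = patients.groupby("biopsy_missing")["outcome"].mean()

print(patients)
print(outcome_by_missing)
```

If the outcome differs noticeably between the two groups, the indicator is worth passing to the model alongside the imputed or original column. Whether it actually helps depends on the data, so treat this as a check to run rather than a guaranteed improvement.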
Is there a correlation between missing data and other variables?
In some instances, missing values may correlate with specific features, suggesting that they are not randomly absent.
A straightforward example can be seen in patient information sheets, where missing values may frequently arise in questions about pregnancy. Men, for instance, might leave these questions unanswered, leading to a high number of missing entries in those columns. While some women may also skip these questions, the implications of missing values differ between male and female patients.
Additionally, there might be correlations between missing values and unique identifiers, such as patient IDs or order numbers. Understanding these correlations can help us decipher both the significance of the missing data and the underlying structure of these identifiers.
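A quick way to test whether missingness depends on another variable is to compute the missing rate within each group of that variable. The following sketch assumes pandas; the form data and the column names `sex` and `pregnant` are toy values made up to mirror the patient-sheet example.

```python
import pandas as pd

# Toy data (hypothetical): answers to a pregnancy question on a patient form.
forms = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "M", "F", "F", "M"],
    "pregnant": [None, "no", None, "yes", None, None, "no", None],
})

# Fraction of missing answers within each group.
missing_by_sex = forms["pregnant"].isna().groupby(forms["sex"]).mean()

# Raw counts of missing vs. answered per group, for a fuller picture.
counts = pd.crosstab(forms["sex"], forms["pregnant"].isna())

print(missing_by_sex)
print(counts)
```

A large gap between the group rates suggests the values are not missing at random, and that a gap means something different in each group.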
Is there potential bias in samples with missing values?
Another risk associated with the premature removal of samples with missing data is the introduction of bias into our models or analyses. For instance, consider a user database for an online dating application, where demographic questions such as age and race are optional. It’s reasonable to expect that certain groups, such as women or minorities, may leave these questions unanswered.
If we remove entries based solely on missing data, we risk skewing our model. The hypothesis here is that, once those entries are excluded, the remaining dataset over-represents certain demographics, for example white males.
This hypothetical scenario emphasizes the importance of considering inherent biases before discarding samples with missing values.
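One way to check for this kind of bias is to compare the makeup of the sample before and after dropping rows with missing values. The sketch below assumes pandas and uses invented toy data; the column names `group` and `age` and the labels `A` and `B` are placeholders, not real demographics.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): group B skips the optional age question more often.
users = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "age": [31, 27, 45, 38, np.nan, np.nan, 29, np.nan],
})

# Group shares in the full sample and after dropping rows with missing age.
share_before = users["group"].value_counts(normalize=True)
share_after = users.dropna(subset=["age"])["group"].value_counts(normalize=True)

# Side-by-side comparison; a group that disappears entirely gets a share of 0.
comparison = pd.DataFrame({"before": share_before, "after": share_after}).fillna(0.0)
comparison["shift"] = comparison["after"] - comparison["before"]

print(comparison)
```

Here the two groups start out evenly split, but after `dropna` group A makes up most of the remaining rows. A shift like that is a warning sign that a model trained on the reduced data may not serve group B well.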
Summary
Missing values can harbor important insights that not only enhance model performance but also address critical business questions.
Before you decide to remove or impute missing data, consider these three essential questions:
- What caused these missing values?
- Is there a correlation between missing data and other variables?
- Could there be bias in samples with missing values?
Chapter 2: Strategies for Handling Missing Data
In the video "Machine Learning - Dealing with Missing Data, Model Optimization & Parameter Tuning = Better Results," the speaker discusses effective strategies for managing missing data, model optimization techniques, and parameter tuning to yield superior outcomes.
The video "Prediction with Missing Values - GAEL VAROQUAUX, PhD" dives into advanced techniques for predicting outcomes in the presence of missing values, offering valuable insights and methodologies for data scientists.