Effective Techniques for Normalizing Data in Pandas
Chapter 1: Understanding Data Normalization
Data is easier to analyze when it is organized and presented uniformly, but real-world data rarely arrives that way. Normalization addresses this by bringing values onto a consistent, comparable scale. In machine learning, it is frequently used to convert features onto a standardized scale, because features can have vastly different value ranges and many algorithms (for example, those based on distances or gradient descent) are sensitive to those differences. Normalizing data is therefore a common preprocessing step before applying machine learning algorithms.
Let’s explore different methods to normalize a DataFrame in Pandas.
Section 1.1: Mean Normalization
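Mean normalization is one effective approach to feature scaling. The simplest way to normalize all columns in a Pandas DataFrame is to subtract each column's mean and divide by its standard deviation. (Strictly speaking, dividing by the standard deviation is z-score standardization; textbook mean normalization divides by the range, that is, the maximum minus the minimum.)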
import pandas as pd
df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})
print(df)
# Method 1
df_normalized = (df - df.mean()) / df.std()
print(df_normalized)
# Method 2
normalized_df = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
print(normalized_df)
Output: both methods print -1.0, 0.0, 1.0 in each column.
Explanation:
The two methods above normalize the same DataFrame in different ways. The first applies the mean() and std() methods to the whole DataFrame at once, relying on Pandas to align the per-column results with the columns. The second uses DataFrame.apply() with a lambda function that processes one column at a time. Both approaches yield identical results.
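To make the difference between the two scalings concrete, here is a minimal sketch that places the z-score formula used above next to range-based mean normalization. The helper names `zscore` and `mean_normalize` are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})


def zscore(frame: pd.DataFrame) -> pd.DataFrame:
    """Subtract the column mean, then divide by the sample standard deviation."""
    return (frame - frame.mean()) / frame.std()


def mean_normalize(frame: pd.DataFrame) -> pd.DataFrame:
    """Subtract the column mean, then divide by the column range (max - min)."""
    return (frame - frame.mean()) / (frame.max() - frame.min())


z = zscore(df)
m = mean_normalize(df)
print(z)  # each column: -1.0, 0.0, 1.0
print(m)  # each column: -0.5, 0.0, 0.5
```

Both versions center each column on 0; they differ only in the divisor, so the z-score result has unit standard deviation while the range-based result always spans a width of 1.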
Section 1.2: Min-Max Normalization
In contrast to the previous method, min-max normalization does not center the data on zero. Instead, it rescales each column linearly so that its minimum becomes 0 and its maximum becomes 1.
import pandas as pd
df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})
print(df)
normalized_df = (df - df.min()) / (df.max() - df.min())
print(normalized_df)
Output: each column becomes 0.0, 0.5, 1.0.
Explanation:
This method needs only the min() and max() methods: subtracting the column minimum shifts the smallest value to 0, and dividing by the range (maximum minus minimum) scales the largest value to 1. No mean or standard deviation is involved.
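The same idea extends to any target interval. The sketch below is one possible way to do this, not a standard Pandas function: `min_max_scale`, its `lower` and `upper` parameters, and the extra `Flat` column are all assumptions made for illustration. It also guards against constant columns, whose range is zero and would otherwise produce NaN.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame, lower: float = 0.0, upper: float = 1.0) -> pd.DataFrame:
    """Rescale each column linearly so its minimum maps to `lower` and its maximum to `upper`."""
    col_range = frame.max() - frame.min()
    # A constant column has zero range; use 1 instead so it maps to `lower` rather than NaN.
    col_range = col_range.replace(0, 1)
    unit = (frame - frame.min()) / col_range  # values in [0, 1]
    return unit * (upper - lower) + lower


df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600], "Flat": [7, 7, 7]})
scaled = min_max_scale(df, lower=-1.0, upper=1.0)
print(scaled)  # "No" and "Yes": -1.0, 0.0, 1.0; "Flat": -1.0 in every row
```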
Chapter 2: Utilizing Sklearn for Normalization
Normalization with Sklearn & MinMaxScaler
In the previous examples, we explored normalization using Pandas alone. Now, let’s examine how to perform normalization with scikit-learn (sklearn), which offers several ready-made scalers.
The MinMaxScaler class subtracts the minimum value of each feature and divides by its range (original maximum minus original minimum), which by default scales the feature values to the interval from 0 to 1.
import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})
print(df)
# Fit the scaler and rescale every column to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df)  # returns a NumPy array
# Rebuild a DataFrame, keeping the original column names and index
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
print(normalized_df)
Output: both columns are rescaled to 0.0, 0.5, 1.0.
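One practical reason to prefer a scaler object over the plain Pandas formula is that it remembers the minimum and maximum it learned, so the same transformation can be reused on other data. The following is a minimal sketch of that workflow; the `new` DataFrame is hypothetical data invented for this example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})
new = pd.DataFrame({"No": [1500, 4000], "Yes": [450, 300]})  # hypothetical unseen rows

scaler = MinMaxScaler()
scaler.fit(train)  # learns the per-column min and max from the training data only

train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
new_scaled = pd.DataFrame(scaler.transform(new), columns=new.columns, index=new.index)

print(train_scaled)
print(new_scaled)  # values outside the training range land outside [0, 1]

# inverse_transform maps scaled values back to the original units (as a NumPy array)
restored = scaler.inverse_transform(train_scaled)
print(restored)
```

Because the scaler is fitted on `train` only, the row with No = 4000 scales to 1.5 rather than being squeezed into the 0 to 1 interval.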
Normalization with Sklearn & StandardScaler
The previous program rescaled the values into a fixed range (normalization). Now, let’s standardize them so that each column has a mean of 0 and a standard deviation of 1. This process is called standardization, and scikit-learn implements it in the StandardScaler class.
In essence, a StandardScaler standardizes features by subtracting the mean and dividing by the standard deviation (scaling to unit variance).
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})
print(df)
scaler = StandardScaler()
# Build a new float DataFrame instead of writing floats into the integer columns
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(df_scaled)
Output: each column becomes approximately -1.2247, 0.0, 1.2247.
Explanation:
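In this example, StandardScaler standardizes each column so that it has a mean of 0 and unit variance. Note that StandardScaler is a class rather than a function: the scaler object is created first, and fit_transform() then learns each column's mean and standard deviation and applies the transformation in one step.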
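StandardScaler and the Pandas formula from Section 1.1 do not return the same numbers on this data. The reason is the divisor: StandardScaler uses the population standard deviation (ddof=0), while Pandas' std() defaults to the sample standard deviation (ddof=1). A short sketch to check this, with variable names chosen only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})

sk = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)
pd_sample = (df - df.mean()) / df.std()            # Pandas default, ddof=1
pd_population = (df - df.mean()) / df.std(ddof=0)  # population std, ddof=0

print(sk)
print(np.allclose(sk, pd_population))  # True
print(np.allclose(sk, pd_sample))      # False
```

On large datasets the two divisors are nearly equal, but on a three-row example like this one the gap is clearly visible.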
Section 2.1: Simple Column Transformations
Another simple column transformation in Pandas is to divide each column by its maximum value, so that the largest entry in every column becomes 1.
import pandas as pd
df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})
print(df)
# Method 1
df2 = df.apply(lambda x: x / x.max(), axis=0)
print(df2)
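# Method 2: normalize a single column (the original referenced a "Fee" column that does not exist in this DataFrame)
df["No"] = df["No"] / df["No"].max()
print(df)
Output: Method 1 prints No = 0.333, 0.667, 1.0 and Yes = 0.667, 0.833, 1.0 (rounded). Method 2 rescales only the No column and leaves Yes unchanged.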
Explanation:
The first method uses DataFrame.apply() with a lambda function to divide every column by its maximum. The second applies the same division to a single column. Both map values into the range from 0 to 1 only when the column contains no negative values.
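Real DataFrames often mix numeric and non-numeric columns, and dividing a text column by its maximum would fail. A minimal sketch of one way to restrict the division to numeric columns follows; the `Company` column is a hypothetical addition for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["A", "B", "C"],  # hypothetical label column, must stay untouched
    "No": [1000, 2000, 3000],
    "Yes": [400, 500, 600],
})

# Pick out the numeric columns and scale only those
numeric_cols = df.select_dtypes(include="number").columns
scaled = df.copy()
scaled[numeric_cols] = df[numeric_cols] / df[numeric_cols].max()
print(scaled)
```

Working on a copy keeps the original values available, which is convenient when the unscaled numbers are still needed later for reporting.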
Section 2.2: Using .astype() for Normalization
As in the prior example, each column is divided by a single reference value. Here the divisor is explicitly cast to float with the astype() method before the division.
import pandas as pd
import numpy as np
df = pd.DataFrame({"No": [1000, 2000, 3000], "Yes": [400, 500, 600]})
# Method 1
df2 = df / df.max().astype(np.float64)
print(df2)
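# Method 2: divide by the largest absolute value in each column
df2 = df / df.abs().max().astype(np.float64)
print(df2)
Output: both methods print No = 0.333, 0.667, 1.0 and Yes = 0.667, 0.833, 1.0 (rounded), because this data contains no negative values.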
Explanation:
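In both approaches, .astype(np.float64) explicitly casts the divisor to float, so the result is a floating-point DataFrame. The first approach divides by each column's maximum. The second divides by the largest absolute value in each column, which keeps every result between -1 and 1 even when a column contains negative values.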
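The two divisors only differ once negative values appear. The sketch below uses a hypothetical `Change` column to show this, and compares the result with scikit-learn's MaxAbsScaler, which performs the same max-absolute-value scaling.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# "Change" is a hypothetical column containing a negative value
df = pd.DataFrame({"Change": [-400, 100, 200], "Yes": [400, 500, 600]})

by_max = df / df.max()            # "Change" becomes -2.0, 0.5, 1.0 (escapes [-1, 1])
by_max_abs = df / df.abs().max()  # "Change" becomes -1.0, 0.25, 0.5

sk = pd.DataFrame(MaxAbsScaler().fit_transform(df), columns=df.columns, index=df.index)

print(by_max)
print(by_max_abs)
print(sk)  # matches by_max_abs
```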
Conclusion
In summary, this article has explored fundamental techniques for normalizing data in Pandas. I encourage you to delve into these methods further and apply them in your projects.