Mastering Unsupervised K-Means Clustering with Python Tools
Introduction to Clustering
Clustering involves grouping a collection of items such that those within the same group, or cluster, exhibit greater similarity to one another than to those in different clusters. This method is a fundamental exploratory data mining technique widely employed across various domains, including machine learning, image analysis, information retrieval, and bioinformatics. Here are some practical applications of clustering:
- Segmenting customers based on their purchasing habits or interests to craft tailored marketing strategies.
- Categorizing documents into distinct groups based on their content, tags, and topics.
- Analyzing outcomes in social and life sciences to uncover natural groupings and patterns.
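Since this guide centers on K-Means, it may help to see what the algorithm does before handing the work to a library. The following is a minimal pure-Python sketch of the standard K-Means iteration (Lloyd's algorithm), not PyCaret's or scikit-learn's implementation. The `kmeans` function, its parameters, and the toy points are all made up for illustration.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Converged: the centroids no longer move.
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups (made-up data, not the mice dataset).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The first three points should share one label and the last three the other. Which group ends up as cluster 0 depends on the random starting centroids, so the label numbers themselves carry no meaning.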
Types of Clustering Algorithms
[Figure: Overview of various clustering algorithms]
PyCaret Overview
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It works as an end-to-end model management tool intended to speed up experimentation: steps that would otherwise take many lines of code, such as preprocessing, training, and plotting, can usually be written as a handful of function calls.
To install PyCaret with pip, run:

    pip install pycaret
Dataset Preparation
In this guide, we will work with the Mice Protein Expression dataset from the UCI Machine Learning Repository. It contains the expression levels of 77 proteins that produced detectable signals in the nuclear fraction of the cortex, with a total of 1080 measurements per protein. Each measurement can be treated as an independent sample (mouse). The dataset is cited as follows:
Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. doi:10.1371/journal.pone.0129126
To load the dataset, use the following code:

    from pycaret.datasets import get_data
    dataset = get_data('mice')

[Figure: Visual representation of dataset shape]

Now, let's check the dataset's dimensions:

    dataset.shape

Output:

    (1080, 82)
We will partition the dataset into training and unseen data, using 95% for model training:

    data = dataset.sample(frac=0.95, random_state=786)
    data_unseen = dataset.drop(data.index)

    data.reset_index(drop=True, inplace=True)
    data_unseen.reset_index(drop=True, inplace=True)

    print('Data for Modeling: ' + str(data.shape))
    print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Output:

    Data for Modeling: (1026, 82)
    Unseen Data For Predictions: (54, 82)
Setting Up the Environment
The setup function in PyCaret establishes the modeling environment and constructs the transformation pipeline for deployment:

    from pycaret.clustering import *

    exp_clu101 = setup(data, normalize=True,
                       ignore_features=['MouseID'],
                       session_id=123)

[Figure: Information grid from the setup process]
Once the setup is complete, an information grid is displayed, presenting key details about the experiment, including the original and transformed dataset shapes, and the types of features.
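Passing normalize=True matters for K-Means because the algorithm relies on distances, so a feature with a large numeric range would otherwise dominate the result. As far as I know, PyCaret's default normalization is z-score scaling; treat that as an assumption and check the documentation for your version. The sketch below shows what z-score scaling does to each column, using a hypothetical `zscore_columns` helper and made-up values rather than PyCaret's own pipeline.

```python
import statistics


def zscore_columns(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation (divides by n, not n - 1).
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column (std of 0) is mapped to 0.0 to avoid dividing by zero.
        tuple((v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]


# Made-up values on very different scales (not taken from the mice dataset).
rows = [(0.5, 1200.0), (0.7, 1500.0), (0.9, 1800.0)]
scaled = zscore_columns(rows)
for row in scaled:
    print(row)
```

After scaling, both columns span the same range even though the raw values differ by three orders of magnitude, so neither one dominates a Euclidean distance.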
Training the Model
Training an unsupervised clustering model in PyCaret is straightforward using the create_model function:

    kmeans = create_model('kmeans')

[Figure: Output of KMeans model training]
The output confirms that a K-Means model has been successfully trained with a default of 4 clusters.
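To train a different algorithm or change the number of clusters, pass another model ID and set the num_clusters parameter. For example, a K-Modes model with 6 clusters:

    kmodes = create_model('kmodes', num_clusters=6)

[Figure: Output of KModes model training]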
For a complete list of available models, refer to the PyCaret documentation or utilize the models() function.
Model Inference
Now that the unsupervised machine learning model is trained, we can apply it to our dataset using the assign_model function:

    kmean_results = assign_model(kmeans)
    kmean_results.head()

[Figure: Displaying model results]
Model Analysis
The plot_model function is used for analyzing the clustering results:

    plot_model(kmeans, plot='cluster')

[Figure: Visualization of clusters]
The cluster plot projects the data onto two principal components and colors each point by its cluster label. Additional features can be displayed by hovering over data points.
To determine the optimal number of clusters, the elbow method can be visualized using:

    plot_model(kmeans, plot='elbow')

[Figure: Elbow plot for cluster number determination]
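The idea behind the elbow method is to fit K-Means for a range of k values and watch how the within-cluster sum of squared distances (often called inertia or distortion) falls. The "elbow" is the k after which extra clusters stop paying off. Here is a rough pure-Python sketch of that computation on made-up data with three obvious groups. The helper names and the deterministic farthest-point seeding are choices made for this example, not what PyCaret does internally.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: each new centroid is the point farthest from those chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squared distances."""
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its old centroid.
        updated = []
        for j, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


# Three made-up groups of four points each.
points = [
    (1.0, 1.0), (1.5, 1.0), (1.0, 1.5), (1.5, 1.5),
    (8.0, 1.0), (8.5, 1.0), (8.0, 1.5), (8.5, 1.5),
    (4.5, 8.0), (5.0, 8.0), (4.5, 8.5), (5.0, 8.5),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.2f}")
```

On this toy data the inertia drops sharply up to k=3 and only marginally afterwards, which is the bend the elbow plot makes visible. Real data rarely gives such a clean elbow, so it is worth cross-checking with another diagnostic.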
For further analysis, the silhouette method can be employed:

    plot_model(kmeans, plot='silhouette')

[Figure: Silhouette plot for cluster consistency validation]
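The silhouette plot shows, for every sample, how close it sits to the other members of its own cluster (cohesion) compared with the nearest neighboring cluster (separation). Values near 1 indicate a well-placed sample, while values near 0 or below suggest overlapping clusters or a questionable assignment.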
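For readers who want to see the arithmetic, the silhouette coefficient of a sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The snippet below is a small sketch of that formula in plain Python. The `silhouette_scores` function and the points are invented for this example, and it assumes at least two clusters are present.

```python
import math


def silhouette_scores(points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for every sample; needs >= 2 clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from this sample to all others, grouped by cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            scores.append(0.0)  # Usual convention for a single-member cluster.
            continue
        # a: cohesion, the mean distance to the rest of the sample's own cluster.
        a = sum(by_cluster[own]) / len(by_cluster[own])
        # b: separation, the mean distance to the nearest other cluster.
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return scores


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

good = silhouette_scores(points, [0, 0, 0, 1, 1, 1])
# Same points, but the third sample is deliberately put in the wrong cluster.
bad = silhouette_scores(points, [0, 0, 1, 1, 1, 1])

print([round(s, 3) for s in good])
print(round(bad[2], 3))
```

With the sensible labeling every score is close to 1, while the deliberately misassigned sample in the second call gets a negative score. That is the kind of bar to look for on the left side of a silhouette plot.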
The distribution of cluster sizes can be visualized using:

    plot_model(kmeans, plot='distribution')

[Figure: Distribution of samples across clusters]
To further explore the relationship between cluster labels and other features, you can modify the distribution plot:

    plot_model(kmeans, plot='distribution', feature='class')

[Figure: Cluster distribution based on class labels]
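The two distribution plots boil down to counting: how many samples fall into each cluster, and how a second column such as class is spread across those clusters. A quick sketch of the same tally with the standard library follows. The rows are made-up (cluster label, class) pairs, not actual results from the mice dataset.

```python
from collections import Counter, defaultdict

# Made-up (cluster label, class) pairs standing in for the assigned results.
rows = [
    ("Cluster 0", "c-CS-m"),
    ("Cluster 0", "c-CS-m"),
    ("Cluster 0", "t-CS-m"),
    ("Cluster 1", "c-SC-s"),
    ("Cluster 1", "c-SC-s"),
    ("Cluster 2", "t-SC-s"),
]

# Samples per cluster (what the plain distribution plot shows).
cluster_sizes = Counter(cluster for cluster, _ in rows)

# Class counts within each cluster (the plot with a feature argument).
class_by_cluster = defaultdict(Counter)
for cluster, cls in rows:
    class_by_cluster[cluster][cls] += 1

for cluster in sorted(cluster_sizes):
    print(cluster, cluster_sizes[cluster], dict(class_by_cluster[cluster]))
```

A cluster dominated by a single class hints that the clustering has picked up on that attribute, even though the class column was never used as a target.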
Inference on Unseen Data
The predict_model function allows for generating clustering labels for new, unseen data:

    unseen_predictions = predict_model(kmeans, data=data_unseen)
    unseen_predictions.head()

[Figure: Predictions for unseen data samples]
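Conceptually, predicting with a fitted K-Means model means passing the new rows through the same preprocessing that was applied during training and then assigning each row to its nearest centroid. The sketch below illustrates that idea with hypothetical stored statistics and centroids. None of the numbers come from the model trained above, and this is not PyCaret's internal code.

```python
import math

# Hypothetical values a fitted pipeline would have stored during training.
train_means = (10.0, 200.0)
train_stds = (2.0, 50.0)
centroids = [(-1.0, -1.0), (1.0, 1.0)]  # expressed in normalized feature space


def predict_cluster(point):
    """Normalize a raw point with the training statistics, then pick the nearest centroid."""
    scaled = tuple((v - m) / s for v, m, s in zip(point, train_means, train_stds))
    return min(range(len(centroids)), key=lambda j: math.dist(scaled, centroids[j]))


# Made-up unseen rows in the original (unscaled) units.
unseen = [(8.0, 150.0), (12.5, 260.0), (9.0, 140.0)]
predicted = [predict_cluster(p) for p in unseen]
print(predicted)  # prints [0, 1, 0]
```

The key detail is that the new rows are scaled with the means and standard deviations learned from the training data, not recomputed from the unseen rows themselves.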
Saving and Loading the Model
To save your trained model, you can use:

    save_model(kmeans, 'Final KMeans Model 25Nov2020')

[Figure: Confirmation of model saving]
When deploying, you can load the model with:

    import pandas as pd
    from pycaret.clustering import load_model, predict_model

    kmeans = load_model('Final KMeans Model 25Nov2020')

Then, use the model for inference on new datasets:

    data = pd.read_csv('...')
    inference_df = predict_model(kmeans, data=data)

[Figure: Process of loading model and generating inference]