Mastering Unsupervised K-Means Clustering with Python Tools

Introduction to Clustering

Clustering involves grouping a collection of items such that those within the same group, or cluster, exhibit greater similarity to one another than to those in different clusters. This method is a fundamental exploratory data mining technique widely employed across various domains, including machine learning, image analysis, information retrieval, and bioinformatics. Here are some practical applications of clustering:

Segmenting customers based on their purchasing habits or interests to craft tailored marketing strategies.
Categorizing documents into distinct groups based on their content, tags, and topics.
Analyzing outcomes in social and life sciences to uncover natural groupings and patterns.

Types of Clustering Algorithms

[Figure: Overview of various clustering algorithms]

PyCaret Overview

PyCaret is a user-friendly, open-source machine learning library in Python designed to streamline the machine learning process. It serves as an end-to-end tool for managing machine learning workflows, accelerating experimentation and improving productivity. Compared with other open-source machine learning libraries, PyCaret is a low-code alternative: a workflow that would otherwise take many lines of code can often be written in just a few.

To install the PyCaret library using pip, run:

    pip install pycaret

Dataset Preparation

In this guide, we will work with the Mice Protein Expression dataset from UCI. This dataset contains the expression levels of 77 proteins that produce detectable signals in the cortical nuclear fraction, comprising a total of 1080 measurements per protein. Each measurement can be viewed as an independent sample (mouse). The dataset is cited as follows:

Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. doi:10.1371/journal.pone.0129126

To load the dataset, use the following code:

    from pycaret.datasets import get_data
    dataset = get_data('mice')

[Figure: Visual representation of dataset shape]

Now, let's check the dataset's dimensions:

    dataset.shape

Output:

    (1080, 82)

We will partition the dataset into training and unseen data, using 95% for model training:

    data = dataset.sample(frac=0.95, random_state=786)
    data_unseen = dataset.drop(data.index)

    data.reset_index(drop=True, inplace=True)
    data_unseen.reset_index(drop=True, inplace=True)

    print('Data for Modeling: ' + str(data.shape))
    print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Output:

    Data for Modeling: (1026, 82)
    Unseen Data For Predictions: (54, 82)
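To make the arithmetic of the 95/5 split explicit, here is a minimal standard-library sketch of the same idea applied to row indices. The helper name holdout_split is hypothetical, and the sketch will not select the same rows as pandas does with random_state=786; it only shows why 1080 rows become 1026 and 54.

```python
import random


def holdout_split(n_rows, frac, seed):
    """Randomly split range(n_rows) into modeling and unseen index lists."""
    rng = random.Random(seed)
    n_model = round(n_rows * frac)  # number of rows kept for modeling
    modeling = sorted(rng.sample(range(n_rows), n_model))
    chosen = set(modeling)
    # Everything that was not sampled is held back as unseen data.
    unseen = [i for i in range(n_rows) if i not in chosen]
    return modeling, unseen


modeling_idx, unseen_idx = holdout_split(n_rows=1080, frac=0.95, seed=786)
print(len(modeling_idx), len(unseen_idx))  # 1026 54
```

The two index lists are disjoint and together cover every row, which is the same guarantee that dataset.drop(data.index) provides.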

Setting Up the Environment

The setup function in PyCaret establishes the modeling environment and constructs the transformation pipeline for deployment:

    from pycaret.clustering import *

    exp_clu101 = setup(data, normalize=True,
                       ignore_features=['MouseID'],
                       session_id=123)

[Figure: Information grid from the setup process]

Once the setup is complete, an information grid is displayed, presenting key details about the experiment, including the original and transformed dataset shapes, and the types of features.
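Passing normalize=True matters for K-Means because the algorithm relies on distances, so features on large scales would otherwise dominate. The sketch below assumes z-score scaling (mean 0, standard deviation 1 per column), which is the default normalization method as far as we know. The function zscore_columns and the toy values are hypothetical and are not part of PyCaret.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every column to mean 0 and (population) standard deviation 1."""
    scaled_cols = []
    for col in zip(*rows):  # iterate column by column
        mu = mean(col)
        sigma = pstdev(col)
        # A constant column has sigma == 0; map it to zeros instead of dividing by zero.
        scaled_cols.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]  # back to row layout


# Hypothetical toy rows: the second feature is 100 times larger than the first.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = zscore_columns(rows)
print(scaled)
```

After scaling, both columns hold identical values, so neither feature outweighs the other when distances are computed.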

Training the Model

Training an unsupervised clustering model in PyCaret is straightforward using the create_model function:

    kmeans = create_model('kmeans')

[Figure: Output of KMeans model training]

The output confirms that a K-Means model has been successfully trained with a default of 4 clusters.
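For readers curious about what happens inside a K-Means fit, the following is a minimal pure-Python sketch of the classic assign-and-update loop (Lloyd's algorithm). It is an illustration under simplifying assumptions (random initial centroids drawn from the data, a single run), not the implementation PyCaret or scikit-learn uses. The names kmeans_sketch and squared_distance and the toy points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Alternate assignment and centroid updates until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical toy data: two small groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans_sketch(points, k=2)
print(centroids)
print(labels)
```

Because the result depends on the starting centroids, production implementations typically run several initializations and keep the best one.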

To train a different algorithm or change the number of clusters, pass the model ID and the num_clusters parameter:

    kmodes = create_model('kmodes', num_clusters=6)

[Figure: Output of KModes model training]

For a complete list of available models, refer to the PyCaret documentation or utilize the models() function.

Model Inference

Now that the unsupervised machine learning model is trained, we can apply it to our dataset using the assign_model function:

    kmean_results = assign_model(kmeans)
    kmean_results.head()

[Figure: Displaying model results]

Model Analysis

The plot_model function is used for analyzing the clustering results:

    plot_model(kmeans, plot='cluster')

[Figure: Visualization of clusters]

The plot illustrates the cluster labels, and additional features can be displayed by hovering over data points.

To determine the optimal number of clusters, the elbow method can be visualized using:

    plot_model(kmeans, plot='elbow')

[Figure: Elbow plot for cluster number determination]
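The quantity behind an elbow plot is usually the within-cluster sum of squared distances (often called inertia or distortion). The sketch below assumes that definition and computes it for three hand-made partitions of a toy dataset. The function inertia, the points, and the partitions are hypothetical and chosen so the numbers are easy to verify by hand.

```python
from collections import defaultdict


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    return total


# Hypothetical 1-D data with two obvious groups.
points = [(1.0,), (2.0,), (3.0,), (11.0,), (12.0,), (13.0,)]
partitions = {
    1: [0, 0, 0, 0, 0, 0],  # everything in one cluster
    2: [0, 0, 0, 1, 1, 1],  # the two natural groups
    3: [0, 0, 0, 1, 1, 2],  # one group split further
}
scores = {k: inertia(points, labels) for k, labels in partitions.items()}
print(scores)  # {1: 154.0, 2: 4.0, 3: 2.5}
```

The drop from 154.0 to 4.0 is large, while the drop from 4.0 to 2.5 is small. That bend at k = 2 is the "elbow" the plot is meant to reveal.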

For further analysis, the silhouette method can be employed:

    plot_model(kmeans, plot='silhouette')

[Figure: Silhouette plot for cluster consistency validation]

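A silhouette plot shows, for each sample, how close it is to the other members of its own cluster (cohesion) compared with the nearest neighboring cluster (separation). Values close to 1 indicate compact, well-separated clusters.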
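As a rough illustration of the silhouette idea, the sketch below computes the standard per-sample silhouette coefficient, s = (b - a) / max(a, b), where a is the mean distance to the other members of the sample's own cluster and b is the smallest mean distance to any other cluster. The function silhouette_samples_sketch and the toy points are hypothetical, and the code assumes at least two clusters.

```python
import math
from collections import defaultdict


def silhouette_samples_sketch(points, labels):
    """Per-sample silhouette coefficients; requires at least two clusters."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion).
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster (separation).
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


# Hypothetical toy data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_samples_sketch(points, labels)
print([round(s, 3) for s in scores])
```

All four scores come out close to 0.9, which matches the intuition that these two clusters are compact and far from each other.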

The distribution of cluster sizes can be visualized using:

    plot_model(kmeans, plot='distribution')

[Figure: Distribution of samples across clusters]

To further explore the relationship between cluster labels and other features, you can modify the distribution plot:

    plot_model(kmeans, plot='distribution', feature='class')

[Figure: Cluster distribution based on class labels]
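The numbers behind both distribution plots are simple counts. The sketch below tallies cluster sizes and a cluster-by-class cross-tabulation using only the standard library. The six rows are hypothetical stand-ins for the cluster labels and the class column, not values taken from the mice dataset.

```python
from collections import Counter

# Hypothetical cluster assignments and class labels for six samples.
clusters = ["Cluster 0", "Cluster 1", "Cluster 0", "Cluster 2", "Cluster 1", "Cluster 0"]
classes = ["a", "b", "a", "c", "b", "b"]

# Samples per cluster: what the plain distribution plot displays.
cluster_sizes = Counter(clusters)

# Samples per (cluster, class) pair: what the plot shows once it is broken down by class.
cross_counts = Counter(zip(clusters, classes))

for cluster in sorted(cluster_sizes):
    by_class = {cls: n for (clu, cls), n in sorted(cross_counts.items()) if clu == cluster}
    print(cluster, cluster_sizes[cluster], by_class)
```

A cluster whose samples come almost entirely from one class suggests that the clustering has picked up structure related to that label.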

Inference on Unseen Data

The predict_model function allows for generating clustering labels for new, unseen data:

    unseen_predictions = predict_model(kmeans, data=data_unseen)
    unseen_predictions.head()

[Figure: Predictions for unseen data samples]
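Conceptually, labeling unseen data with a fitted K-Means model means assigning each new point to its nearest learned centroid. The sketch below shows only that step, with hypothetical centroids and points. It assumes the new points are already on the same scale as the training data, a transformation the real pipeline is expected to apply before prediction.

```python
def predict_cluster(point, centroids):
    """Return the index of the centroid closest to the given point."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )


# Hypothetical centroids learned during training.
centroids = [(1.0, 1.0), (8.0, 8.0)]

# Hypothetical unseen points, assumed to be scaled like the training data.
new_points = [(0.5, 1.5), (9.0, 7.0), (5.0, 3.0)]

labels = [predict_cluster(p, centroids) for p in new_points]
print(labels)  # [0, 1, 0]
```

Note that the centroids do not move during prediction. New data is labeled with the model as it was trained.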

Saving and Loading the Model

To save your trained model, you can use:

    save_model(kmeans, 'Final KMeans Model 25Nov2020')

[Figure: Confirmation of model saving]

When deploying, you can load the model with:

    from pycaret.clustering import load_model, predict_model
    kmeans = load_model('Final KMeans Model 25Nov2020')

Then, use the model for inference on new datasets:

    import pandas as pd
    data = pd.read_csv('...')
    inference_df = predict_model(kmeans, data=data)

[Figure: Process of loading model and generating inference]
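Saving and loading a model comes down to serializing an object to disk and reading it back. The sketch below shows that round trip with Python's built-in pickle module and a temporary directory. It is an assumption-laden illustration rather than a description of PyCaret internals: the dictionary toy_model and the file name toy_kmeans.pkl are hypothetical stand-ins for a real fitted pipeline.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: learned centroids plus a setting.
toy_model = {"centroids": [(1.0, 1.0), (8.0, 8.0)], "n_clusters": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_kmeans.pkl")

    # Save: write the object to disk as bytes.
    with open(path, "wb") as fh:
        pickle.dump(toy_model, fh)

    # Load: rebuild an equivalent object from the file.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == toy_model)  # True
```

As with any pickle-style format, only load model files from sources you trust, and load them with library versions compatible with the ones used for saving.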

Thank You for Engaging

I appreciate your interest in data science, machine learning, and PyCaret. To stay updated with my latest writings, feel free to follow me on Medium, LinkedIn, and Twitter.

Unsupervised Machine Learning - Flat Clustering with KMeans

This video provides a comprehensive guide on unsupervised machine learning, particularly focusing on flat clustering using KMeans in Python. It demonstrates how to implement KMeans with Scikit-learn, explaining the underlying concepts and practical steps involved.

K-Means Clustering Algorithm with Python Tutorial

This tutorial offers an in-depth exploration of the K-Means clustering algorithm using Python. It covers everything from the foundational principles to practical applications, showcasing how to effectively utilize K-Means in your projects.
