# Mastering Unsupervised K-Means Clustering with Python Tools


## Introduction to Clustering

Clustering involves grouping a collection of items such that those within the same group, or cluster, exhibit greater similarity to one another than to those in different clusters. This method is a fundamental exploratory data mining technique widely employed across various domains, including machine learning, image analysis, information retrieval, and bioinformatics. Here are some practical applications of clustering:

- Segmenting customers based on their purchasing habits or interests to craft tailored marketing strategies.
- Categorizing documents into distinct groups based on their content, tags, and topics.
- Analyzing outcomes in social and life sciences to uncover natural groupings and patterns.
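The definition above can be made concrete with a small sketch. The points and the two groups below are made up for illustration. The code uses only the standard library and compares the average distance inside a group with the average distance between groups.

```python
from itertools import combinations, product
from math import dist

# Hypothetical 2-D points, hand-picked so that two groups are obvious.
group_a = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
group_b = [(5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]


def mean_within(points):
    """Average distance between all pairs of points in one group."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def mean_between(left, right):
    """Average distance between points drawn from two different groups."""
    pairs = list(product(left, right))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


within = (mean_within(group_a) + mean_within(group_b)) / 2
between = mean_between(group_a, group_b)

# A useful clustering has small within-group and large between-group distances.
print(f"within: {within:.2f}, between: {between:.2f}")
```

A clustering algorithm has to find such groups on its own, without being told which point belongs where.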

### Types of Clustering Algorithms

[Figure: Overview of various clustering algorithms]

### PyCaret Overview

PyCaret is an open-source, low-code machine learning library for Python that streamlines the machine learning workflow. It manages an experiment end to end, from data preparation to model deployment, which shortens the experimentation cycle. Compared with assembling the same steps by hand from lower-level libraries, PyCaret can replace many lines of code with a few function calls.

To install PyCaret with pip, run:

```shell
pip install pycaret
```

### Dataset Preparation

In this guide, we will work with the Mice Protein Expression dataset from the UCI Machine Learning Repository. The dataset contains the expression levels of 77 proteins that produced detectable signals in the nuclear fraction of the cortex, with 1080 measurements per protein. Each measurement can be treated as an independent sample (mouse). The dataset is cited as follows:

Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. doi:10.1371/journal.pone.0129126

To load the dataset, use the following code:

```python
from pycaret.datasets import get_data

dataset = get_data('mice')
```

[Figure: Visual representation of dataset shape]

Now, let's check the dataset's dimensions:

```python
dataset.shape
```

Output:

```
(1080, 82)
```

We hold out 5% of the rows as unseen data and use the remaining 95% for modeling:

```python
# Sample 95% of the rows for modeling; the remaining rows stay unseen
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
```

Output:

```
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
```
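The same 95/5 holdout can be sketched without pandas. The snippet below works on row indices only and is an illustration of the idea; it does not reproduce the exact rows that pandas selects for random_state=786.

```python
import random

n_rows = 1080             # number of rows in the mice dataset
rng = random.Random(786)  # fixed seed so the split is repeatable

# Draw 95% of the row indices without replacement for modeling.
n_model = round(n_rows * 0.95)
model_idx = set(rng.sample(range(n_rows), n_model))

# Everything that was not drawn is held back as unseen data.
unseen_idx = set(range(n_rows)) - model_idx

print(len(model_idx), len(unseen_idx))
```

The two index sets are disjoint and together cover every row, which is what makes the held-out part a fair stand-in for new data.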

### Setting Up the Environment

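The setup function initializes the modeling environment and builds the transformation pipeline that is reused later for modeling and deployment. Here, normalize=True rescales the numeric features, ignore_features excludes the MouseID identifier column, and session_id fixes the random seed for reproducibility:

```python
from pycaret.clustering import *

exp_clu101 = setup(data, normalize=True,
                   ignore_features=['MouseID'],
                   session_id=123)
```

[Figure: Information grid from the setup process]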

Once the setup is complete, an information grid is displayed, presenting key details about the experiment, including the original and transformed dataset shapes, and the types of features.
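To see what normalize=True does to a feature, here is a minimal sketch of z-score scaling, which PyCaret documents as its default normalization method. The values are made up, and the snippet uses the standard library rather than PyCaret's own pipeline.

```python
from statistics import fmean, pstdev

# Made-up expression values for one protein column (illustrative only).
values = [0.50, 0.75, 0.25, 1.00, 0.50]

mean = fmean(values)
std = pstdev(values)  # population standard deviation

# z-score: subtract the mean, then divide by the standard deviation.
z_scores = [(v - mean) / std for v in values]

print([round(z, 3) for z in z_scores])
```

After scaling, every feature has mean 0 and standard deviation 1. This matters for K-Means because it relies on Euclidean distance, so an unscaled feature with a large range would dominate the result.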

### Training the Model

Training an unsupervised clustering model in PyCaret is straightforward using the create_model function:

```python
kmeans = create_model('kmeans')
```

[Figure: Output of KMeans model training]

The output confirms that a K-Means model has been successfully trained with a default of 4 clusters.
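For readers who want to see what K-Means does internally, the following is a minimal sketch of the textbook procedure (Lloyd's algorithm) in plain Python. It is not PyCaret's implementation, which delegates to scikit-learn and uses a smarter initialization. The toy points and the starting centroids are assumptions chosen for illustration.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return labels, centroids


# Toy data: two hand-made blobs; the starting centroids are arbitrary guesses.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

The result depends on the starting centroids, which is one reason to fix the random seed (session_id in setup) when results need to be reproducible.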

Other algorithms and settings are available through the same function. For example, to train a K-Modes model with six clusters:

```python
kmodes = create_model('kmodes', num_clusters=6)
```

[Figure: Output of KModes model training]

For a complete list of available models, refer to the PyCaret documentation or utilize the models() function.

### Model Inference

Now that the unsupervised machine learning model is trained, we can apply it to our dataset using the assign_model function:

```python
kmean_results = assign_model(kmeans)
kmean_results.head()
```

[Figure: Displaying model results]

### Model Analysis

The plot_model function is used for analyzing the clustering results:

```python
plot_model(kmeans, plot='cluster')
```

[Figure: Visualization of clusters]

The plot illustrates the cluster labels, and additional features can be displayed by hovering over data points.

To help choose the number of clusters, plot the elbow curve:

```python
plot_model(kmeans, plot='elbow')
```

[Figure: Elbow plot for cluster number determination]
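The idea behind the elbow method can be sketched in a few lines: run K-Means for several values of k and record the within-cluster sum of squares (WCSS) each time. The sketch below uses made-up one-dimensional data and a simplified K-Means with a deterministic start; it is an illustration under those assumptions, not the routine PyCaret calls.

```python
def wcss_1d(values, k, n_iter=50):
    """Within-cluster sum of squares after a simple 1-D k-means run."""
    ordered = sorted(values)
    # Deterministic start: spread the k centroids evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster (if not empty).
        centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
    return sum((v - centroids[j]) ** 2 for j, c in enumerate(clusters) for v in c)


# Made-up 1-D data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.1, 8.9]
scores = {k: wcss_1d(values, k) for k in range(1, 5)}
for k, score in scores.items():
    print(k, round(score, 3))
```

On this toy data the WCSS falls sharply up to k = 3 and barely improves afterwards, and that bend is the "elbow" to look for in the plot.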

For further analysis, the silhouette method can be employed:

```python
plot_model(kmeans, plot='silhouette')
```

[Figure: Silhouette plot for cluster consistency validation]

The silhouette plot shows how well each sample fits its assigned cluster (cohesion) compared with the nearest neighboring cluster (separation).
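As a rough guide to reading the silhouette plot, here is a minimal sketch of the standard silhouette coefficient, s = (b - a) / max(a, b), where a is the mean distance to points in the same cluster and b is the mean distance to the closest other cluster. The points and labels are made up, and the function is written for clarity rather than speed.

```python
from math import dist
from statistics import fmean


def silhouette_scores(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every point."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        same = [
            dist(p, q)
            for j, (q, lab) in enumerate(zip(points, labels))
            if lab == own and j != i
        ]
        if not same:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        a = fmean(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            fmean([dist(p, q) for q, lab in zip(points, labels) if lab == other])
            for other in set(labels)
            if other != own
        )
        scores.append((b - a) / max(a, b))
    return scores


# Made-up points forming two tight, well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
scores = silhouette_scores(points, labels)
print([round(s, 3) for s in scores])
```

Values close to 1 indicate a sample that sits well inside its cluster, values near 0 indicate a sample on the border between two clusters, and negative values suggest a likely misassignment.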

The distribution of cluster sizes can be visualized using:

```python
plot_model(kmeans, plot='distribution')
```

[Figure: Distribution of samples across clusters]

To explore how the cluster labels relate to another column, pass that column through the feature parameter:

```python
plot_model(kmeans, plot='distribution', feature='class')
```

[Figure: Cluster distribution based on class labels]
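Both distribution plots are, in essence, counts of samples. The sketch below reproduces those counts with the standard library on a handful of made-up rows; the cluster labels and class names are illustrative stand-ins for two columns of the labelled results.

```python
from collections import Counter

# Hypothetical (cluster label, class) pairs; the values are made up.
rows = [
    ("Cluster 0", "c-CS-m"), ("Cluster 0", "c-CS-m"), ("Cluster 0", "t-CS-m"),
    ("Cluster 1", "c-SC-s"), ("Cluster 1", "c-SC-s"),
    ("Cluster 2", "t-SC-m"),
]

# Samples per cluster: what the plain distribution plot summarizes.
cluster_sizes = Counter(cluster for cluster, _ in rows)

# Samples per (cluster, class) pair: the view with feature='class'.
by_class = Counter(rows)

print(cluster_sizes)
print(by_class[("Cluster 0", "c-CS-m")])
```

If one class falls almost entirely into one cluster, the clustering has recovered a grouping that matches a known label, even though the label was never used for training.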

### Inference on Unseen Data

The predict_model function allows for generating clustering labels for new, unseen data:

```python
unseen_predictions = predict_model(kmeans, data=data_unseen)
unseen_predictions.head()
```

[Figure: Predictions for unseen data samples]
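For a K-Means model, labelling a new sample comes down to finding the nearest centroid. The following is a minimal sketch of that step, with made-up centroids and samples; it assumes the new samples have already been transformed in the same way as the training data.

```python
from math import dist

# Hypothetical centroids learned during training (one per cluster).
centroids = [(1.0, 1.0), (8.0, 8.0), (1.0, 8.0)]


def predict_cluster(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))


# Made-up unseen samples, already on the same scale as the training data.
unseen = [(0.5, 1.5), (7.0, 9.0), (2.0, 7.0)]
predictions = [predict_cluster(p, centroids) for p in unseen]
print(predictions)
```

The centroids are not updated during prediction, so new data is labelled without changing the trained model.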

### Saving and Loading the Model

To save your trained model, you can use:

```python
save_model(kmeans, 'Final KMeans Model 25Nov2020')
```

[Figure: Confirmation of model saving]

When deploying, you can load the model with:

```python
from pycaret.clustering import load_model, predict_model

kmeans = load_model('Final KMeans Model 25Nov2020')
```

Then, use the model for inference on new datasets:

```python
import pandas as pd

data = pd.read_csv('...')
inference_df = predict_model(kmeans, data=data)
```

[Figure: Process of loading model and generating inference]
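Saving and loading boils down to serializing a Python object to disk and reading it back. The sketch below shows that round trip with the standard pickle module on a stand-in object. The dictionary and the file name are hypothetical, and PyCaret's own save_model also stores the preprocessing pipeline alongside the model.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model: just some learned centroids (made up).
model = {"centroids": [(1.0, 1.0), (8.0, 8.0)], "normalize": True}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_kmeans.pkl"  # hypothetical file name

    # Save: serialize the object to disk.
    with path.open("wb") as f:
        pickle.dump(model, f)

    # Load: read it back, for example in a separate deployment process.
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Because a pickle file can execute code when it is loaded, only load model files that come from a source you trust.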

