afyonkarahisarkitapfuari.com

Mastering Unsupervised K-Means Clustering with Python Tools


Introduction to Clustering

Clustering involves grouping a collection of items such that those within the same group, or cluster, exhibit greater similarity to one another than to those in different clusters. This method is a fundamental exploratory data mining technique widely employed across various domains, including machine learning, image analysis, information retrieval, and bioinformatics. Here are some practical applications of clustering:

  • Segmenting customers based on their purchasing habits or interests to craft tailored marketing strategies.
  • Categorizing documents into distinct groups based on their content, tags, and topics.
  • Analyzing outcomes in social and life sciences to uncover natural groupings and patterns.

Types of Clustering Algorithms

[Figure: Overview of various clustering algorithms]

PyCaret Overview

PyCaret is an open-source, low-code machine learning library for Python that manages the machine learning workflow from data preparation to model deployment. Workflows that would take many lines of code with other libraries can often be written in just a few, which speeds up experimentation.

To install PyCaret with pip, run:

    pip install pycaret

Dataset Preparation

In this guide, we will work with the Mice Protein Expression dataset from the UCI Machine Learning Repository. The dataset contains the expression levels of 77 proteins that produced detectable signals in the nuclear fraction of the cortex, with a total of 1080 measurements per protein. Each measurement can be treated as an independent sample (mouse). The dataset is cited as follows:

Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. https://doi.org/10.1371/journal.pone.0129126

To load the dataset, use the following code:

    from pycaret.datasets import get_data
    dataset = get_data('mice')

[Figure: Visual representation of dataset shape]

Now, let's check the dataset's dimensions:

    dataset.shape

Output:

    (1080, 82)

We will hold out 5% of the rows as unseen data and use the remaining 95% for model training:

    data = dataset.sample(frac=0.95, random_state=786)
    data_unseen = dataset.drop(data.index)

    data.reset_index(drop=True, inplace=True)
    data_unseen.reset_index(drop=True, inplace=True)

    print('Data for Modeling: ' + str(data.shape))
    print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Output:

    Data for Modeling: (1026, 82)
    Unseen Data For Predictions: (54, 82)
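The arithmetic behind those shapes can be checked without pandas: 95% of 1080 rows is 1026, which leaves 54 rows held out. The following is a minimal sketch using only the standard library. The `holdout_split` helper is hypothetical, and it does not reproduce the exact rows that pandas selects for `random_state=786`; it only mirrors the idea of a random, non-overlapping split.

```python
import random

def holdout_split(n_rows, frac, seed):
    """Return (modeling_idx, unseen_idx) for a random hold-out split."""
    rng = random.Random(seed)
    n_model = round(n_rows * frac)
    # Draw the modeling rows without replacement
    modeling = sorted(rng.sample(range(n_rows), n_model))
    chosen = set(modeling)
    # Everything that was not drawn becomes the unseen set
    unseen = [i for i in range(n_rows) if i not in chosen]
    return modeling, unseen

modeling_idx, unseen_idx = holdout_split(1080, 0.95, seed=786)
print(len(modeling_idx), len(unseen_idx))
```

The two index lists never overlap, which is the property that makes the held-out rows a fair stand-in for new data.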

Setting Up the Environment

The setup function in PyCaret establishes the modeling environment and constructs the transformation pipeline for deployment:

    from pycaret.clustering import *

    exp_clu101 = setup(data, normalize=True,
                       ignore_features=['MouseID'],
                       session_id=123)

[Figure: Information grid from the setup process]

Once the setup is complete, an information grid is displayed, presenting key details about the experiment, including the original and transformed dataset shapes, and the types of features.
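Passing normalize=True matters for K-Means because the algorithm is distance-based, so features on larger scales would otherwise dominate. The sketch below assumes the z-score method (subtract the mean, divide by the standard deviation), which to my knowledge is PyCaret's default; treat that as an assumption. The `zscore` helper and the protein values are made up for illustration.

```python
from statistics import fmean, pstdev

def zscore(column):
    """Scale a list of numbers to zero mean and unit (population) variance."""
    mu = fmean(column)
    sigma = pstdev(column, mu)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Made-up expression levels for two proteins on very different scales
protein_a = [0.50, 0.52, 0.43, 0.61, 0.44]
protein_b = [2.3, 2.8, 1.9, 3.1, 2.4]

scaled_a = zscore(protein_a)
scaled_b = zscore(protein_b)
print([round(v, 2) for v in scaled_a])
print([round(v, 2) for v in scaled_b])
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the distance calculation.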

Training the Model

Training an unsupervised clustering model in PyCaret is straightforward using the create_model function:

    kmeans = create_model('kmeans')

[Figure: Output of KMeans model training]

The output confirms that a K-Means model has been successfully trained with a default of 4 clusters.
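To see what a K-Means fit does conceptually, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not PyCaret's or scikit-learn's implementation: the `kmeans` function is hypothetical, it takes the starting centroids explicitly so the result is deterministic, and it skips refinements such as k-means++ initialisation and multiple restarts.

```python
import math

def kmeans(points, init_centroids, n_iter=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged: centroids stopped moving
        centroids = updated
    return centroids, labels

# Toy data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The loop alternates between assigning points and moving centroids until nothing changes, which is the behaviour the num_clusters setting controls in a library implementation.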

Other algorithms are trained the same way. For example, to train a K-Modes model with six clusters:

    kmodes = create_model('kmodes', num_clusters=6)

[Figure: Output of KModes model training]

For a complete list of available models, refer to the PyCaret documentation or utilize the models() function.

Model Inference

Now that the unsupervised machine learning model is trained, we can assign cluster labels to the training data using the assign_model function:

    kmean_results = assign_model(kmeans)
    kmean_results.head()

[Figure: Displaying model results]

Model Analysis

The plot_model function is used for analyzing the clustering results:

    plot_model(kmeans, plot='cluster')

[Figure: Visualization of clusters]

The plot illustrates the cluster labels, and additional features can be displayed by hovering over data points.

To determine the optimal number of clusters, the elbow method can be visualized using:

    plot_model(kmeans, plot='elbow')

[Figure: Elbow plot for cluster number determination]
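The idea behind the elbow plot can be sketched by fitting K-Means for several values of k and recording the within-cluster sum of squares (often called inertia) each time. The example below is a rough illustration on one-dimensional toy data; `kmeans_1d` and `inertia` are hypothetical helpers with a simple deterministic initialisation, not what plot_model runs internally.

```python
def kmeans_1d(values, k, n_iter=50):
    """Tiny 1-D k-means with a deterministic, evenly spread initialisation."""
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in values:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        # Empty clusters keep their previous centroid
        updated = [sum(m) / len(m) if m else centroids[c] for c, m in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def inertia(values, centroids):
    """Within-cluster sum of squared distances to the nearest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in values)

values = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2, 8.9, 9.0, 9.1]  # three obvious groups
inertias = [inertia(values, kmeans_1d(values, k)) for k in range(1, 5)]
for k, wss in enumerate(inertias, start=1):
    print(k, round(wss, 2))
```

On this toy data the inertia falls sharply up to k = 3 and barely improves afterwards; that bend is the "elbow" the plot is meant to reveal.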

For further analysis, the silhouette method can be employed:

    plot_model(kmeans, plot='silhouette')

[Figure: Silhouette plot for cluster consistency validation]

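The silhouette plot shows how cohesive each cluster is and how well it is separated from the others: values close to 1 indicate samples that sit well inside their own cluster, while values near 0 or below indicate samples that lie between clusters or may be misassigned.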
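For a single sample, the silhouette coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. A minimal sketch in plain Python follows; the `silhouette_scores` helper is hypothetical, and scoring singleton clusters as 0 is a convention chosen here.

```python
import math

def silhouette_scores(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Group the distances from this point to every other point by cluster
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = by_cluster.pop(lab, [])
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster in total
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b))
    return scores

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
good = silhouette_scores(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1, 0, 1])
good_mean = sum(good) / len(good)
bad_mean = sum(bad) / len(bad)
print(round(good_mean, 3), round(bad_mean, 3))
```

The labeling that matches the two visible groups scores close to 1, while the alternating labeling scores much lower, which is the contrast a silhouette plot makes visible per cluster.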

The distribution of cluster sizes can be visualized using:

    plot_model(kmeans, plot='distribution')

[Figure: Distribution of samples across clusters]

To further explore the relationship between cluster labels and other features, pass a feature name to the distribution plot:

    plot_model(kmeans, plot='distribution', feature='class')

[Figure: Cluster distribution based on class labels]
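The same comparison can be made numerically by counting how often each cluster label co-occurs with each class. The sketch below uses only the standard library; the cluster assignments are made up for illustration, and the class names follow the dataset's naming pattern.

```python
from collections import Counter

# Made-up cluster labels and class values for six samples
clusters = ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 1", "Cluster 1", "Cluster 2"]
classes = ["c-CS-m", "c-CS-m", "t-CS-s", "t-CS-s", "c-CS-m", "t-SC-s"]

# Count each (cluster, class) pair, i.e. a flat cross-tabulation
crosstab = Counter(zip(clusters, classes))
for (cluster, cls), count in sorted(crosstab.items()):
    print(cluster, cls, count)
```

A cluster dominated by a single class suggests that the unsupervised grouping has recovered structure related to that class, even though the class was never used for training.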

Inference on Unseen Data

The predict_model function generates cluster labels for new, unseen data:

    unseen_predictions = predict_model(kmeans, data=data_unseen)
    unseen_predictions.head()

[Figure: Predictions for unseen data samples]
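For K-Means, labeling a new sample amounts to finding the nearest fitted centroid. The following sketch shows only that step; the `predict` helper and the centroid values are hypothetical, and a real pipeline would first apply the same preprocessing (such as normalization) that was used during training.

```python
import math

def predict(centroids, new_points):
    """Label each new point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in new_points
    ]

centroids = [(1.0, 1.0), (8.0, 8.0)]            # hypothetical fitted centroids
unseen = [(0.5, 1.5), (7.0, 9.0), (4.0, 4.0)]   # made-up new samples
unseen_labels = predict(centroids, unseen)
print(unseen_labels)  # [0, 1, 0]
```

Note that the centroids are not updated here: prediction reuses the fitted model as-is, which is why the unseen rows had to be held out before training.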

Saving and Loading the Model

To save your trained model, you can use:

    save_model(kmeans, 'Final KMeans Model 25Nov2020')

[Figure: Confirmation of model saving]

When deploying, you can load the model with:

    from pycaret.clustering import load_model, predict_model
    kmeans = load_model('Final KMeans Model 25Nov2020')

Then, use the model for inference on new datasets:

    import pandas as pd

    data = pd.read_csv('...')
    inference_df = predict_model(kmeans, data=data)

[Figure: Process of loading model and generating inference]
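Saving and loading a model boils down to serializing a fitted object to disk and restoring it later. The sketch below illustrates that round trip with the standard pickle module and a stand-in dictionary; it is an assumption-laden illustration and not a description of the file format PyCaret writes.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model: just the pieces needed to label new points
model = {"centroids": [(1.0, 1.0), (8.0, 8.0)], "n_clusters": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "kmeans_sketch.pkl"   # hypothetical file name
    path.write_bytes(pickle.dumps(model))    # save
    restored = pickle.loads(path.read_bytes())  # load

print(restored == model)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.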


Unsupervised Machine Learning - Flat Clustering with KMeans

This video provides a comprehensive guide on unsupervised machine learning, particularly focusing on flat clustering using KMeans in Python. It demonstrates how to implement KMeans with Scikit-learn, explaining the underlying concepts and practical steps involved.

K-Means Clustering Algorithm with Python Tutorial

This tutorial offers an in-depth exploration of the K-Means clustering algorithm using Python. It covers everything from the foundational principles to practical applications, showcasing how to effectively utilize K-Means in your projects.
