
Maksym Zavalniuk

Physics into programming: Using physical phenomena to cluster data

Introduction

This post explores the intersection of physics and programming: specifically, how a physical phenomenon can be used to cluster data. The idea was suggested by a teacher at my university. Our new clustering method, called Surface Tension Clustering, aims to address some limitations of existing clustering algorithms by leveraging the physical analogy of surface tension in liquids. It identifies cluster boundary points based on an imbalance of attractive forces, much as surface tension acts on molecules at the surface of a liquid.

surface tension observed in liquid

But let's first talk about what clustering is and what methods exist.

General Information on Clustering

Clustering is an unsupervised learning method, meaning it does not rely on pre-labeled data to train a model. The main goal is to find natural groupings within the data. There are several main approaches to clustering:

  • Centroid-based methods: Clusters are represented by a central point (centroid). Example: K-means clustering.
  • Hierarchical methods: Clusters are formed into a hierarchy, either by merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive).
  • Density-based methods: Clusters are defined as dense regions in the data space. Example: DBSCAN.

Popular Clustering Methods

K-means Clustering

K-means is one of the most widely used clustering algorithms. It partitions data into K clusters, where each data point belongs to the cluster with the nearest mean. Here’s a simple implementation in Python:

from sklearn.cluster import KMeans
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create KMeans instance with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Get cluster centers
centers = kmeans.cluster_centers_

# Predict the cluster for new samples
predictions = kmeans.predict([[0, 0], [4, 4]])

Mean Shift Clustering

Mean shift is a sliding-window-based algorithm that attempts to find dense areas of data points. It automatically determines the number of clusters.
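As a quick sketch of mean shift in practice, here is scikit-learn's `MeanShift` applied to the same toy points as the other examples (the `quantile` value passed to `estimate_bandwidth` is an illustrative choice, not a recommendation):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Sample data (same toy points as in the other examples)
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Estimate the sliding-window radius (bandwidth) from the data itself
bandwidth = estimate_bandwidth(X, quantile=0.5)

# Mean shift discovers the number of clusters on its own
ms = MeanShift(bandwidth=bandwidth).fit(X)
labels = ms.labels_
n_clusters = len(np.unique(labels))
```

Note that no cluster count is supplied anywhere: the bandwidth alone controls how many dense regions the sliding windows converge to.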

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm that groups together points that are closely packed together, marking points that lie alone in low-density regions as outliers.

from sklearn.cluster import DBSCAN
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create DBSCAN instance
dbscan = DBSCAN(eps=1, min_samples=2).fit(X)

# Get labels
labels = dbscan.labels_

Gaussian Mixture Models (GMM)

GMM is a probabilistic model that assumes all data points are generated from a mixture of several Gaussian distributions with unknown parameters.

from sklearn.mixture import GaussianMixture
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create GMM instance
gmm = GaussianMixture(n_components=2).fit(X)

# Predict cluster for new samples
predictions = gmm.predict([[0, 0], [4, 4]])

Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters either by merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive).
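For completeness, here is the agglomerative (bottom-up) variant with scikit-learn, again on the same toy points; Ward linkage is just one of several linkage options:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Sample data (same toy points as in the other examples)
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Agglomerative clustering merges the closest clusters bottom-up,
# stopping when n_clusters remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
labels = agg.labels_
```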

Introducing a New Clustering Method: Surface Tension Clustering

Concept and Algorithm

The Surface Tension Clustering method is inspired by the behavior of molecules in liquids, where surface molecules experience an inward force due to the lack of neighboring molecules above them. In this analogy, boundary points of clusters are determined by the imbalance of vector forces from surrounding points.

Here’s a step-by-step outline of the algorithm:

  1. Distance Calculation: Compute the distances between all pairs of points.
  2. Neighborhood Identification: For each point, identify the nearest neighbors (typically 4-6 points).
  3. Center Calculation: Determine the center of mass for the neighboring points.
  4. Force Calculation: Calculate the resultant force acting on each point. If the force is significant, the point is considered a boundary point.
  5. Cluster Formation: Use the boundary points to form clusters and identify the support points, which are the closest points between different clusters.
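Steps 1–4 can be sketched in a few lines of NumPy. This is a simplified illustration of the idea rather than the actual implementation; `k` and `threshold` are example parameters, and the normalization choice is mine:

```python
import numpy as np

def boundary_points(X, k=4, threshold=0.5):
    """Flag boundary points via the neighbor-force imbalance
    (a simplified sketch of steps 1-4)."""
    # Step 1: pairwise distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    boundary = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        # Step 2: k nearest neighbors (index 0 in the sort is the point itself)
        nbrs = np.argsort(D[i])[1:k + 1]
        # Step 3: center of mass of the neighborhood
        center = X[nbrs].mean(axis=0)
        # Step 4: the resultant "force" pulls the point toward that center;
        # normalize by the mean neighbor distance to make the threshold scale-free
        force = np.linalg.norm(center - X[i]) / D[i, nbrs].mean()
        boundary[i] = force > threshold
    return boundary

# Interior points of a dense blob feel balanced pulls; edge points do not
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
mask = boundary_points(grid, k=4)
```

On this 3×3 grid the center point is surrounded symmetrically, so its resultant force is zero, while the corners feel a strong net pull inward and are flagged as boundary points.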

How did I improve it?

Before applying the main clustering algorithm, I first divide the points into smaller clusters using k-means and then search for the boundary points within each of them. I also wrote a graphical interface using PyQt, which I demonstrate below.
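In code, the pre-partitioning idea looks roughly like this: split the data with k-means, then run the neighbor-imbalance test separately inside each sub-cluster. This is a self-contained sketch, not the actual implementation; `n_sub`, `k`, and `threshold` are illustrative parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

def boundary_within_subclusters(X, n_sub=2, k=4, threshold=0.5):
    """Pre-partition with k-means, then flag boundary points
    inside each sub-cluster (a sketch of the improvement)."""
    labels = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit_predict(X)
    boundary = np.zeros(len(X), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        sub = X[idx]
        # Pairwise distances within this sub-cluster only
        D = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        for j, i in enumerate(idx):
            kk = min(k, len(sub) - 1)
            if kk == 0:
                continue  # a singleton sub-cluster has no neighbors
            nbrs = np.argsort(D[j])[1:kk + 1]
            center = sub[nbrs].mean(axis=0)
            force = np.linalg.norm(center - sub[j]) / D[j, nbrs].mean()
            boundary[i] = force > threshold
    return boundary

# Two well-separated 3x3 grids: k-means splits them, and the imbalance
# test then flags the edge points of each grid independently
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
X = np.vstack([grid, grid + [10.0, 0.0]])
mask = boundary_within_subclusters(X, n_sub=2, k=4)
```

Restricting the neighbor search to one sub-cluster at a time keeps points from a distant cluster from distorting the force balance.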

Demonstration

Here you can see the program's interface, where I have generated random points; different colours represent the different clusters within which the boundary points will be searched. You can also adjust the sensitivity of the algorithm here.

Random 2d

Research can also be carried out in 3D space:

3D dimension

Now I'll show you a more visual example of how the algorithm works. In the picture, you can see that the red dots are the boundary points; if you connect them with a line, you can imagine the grey dots sitting inside a drop of water, i.e. belonging to the same cluster.

Worked example

Conclusion

Clustering is a powerful tool for uncovering hidden structures in data. Traditional methods like K-means and DBSCAN are widely used, but the new Surface Tension Clustering method offers a promising alternative in terms of efficiency and adaptability. By leveraging the physical analogy of surface tension, it addresses some of the limitations of existing methods. You can find the GitHub repository here.

Thank you for reading this far, I'd love to read your thoughts in the comments!
