Clustering Algorithms

Understanding Clustering:

Clustering is an unsupervised learning technique that groups similar data points together. It helps reveal patterns and structure in the data without requiring any prior labels.

  • Used for market segmentation, social network analysis, and image compression.
  • Helps in understanding data distribution and identifying anomalies.
  • Commonly used clustering algorithms include K-Means, Hierarchical Clustering, and DBSCAN.

K-Means Clustering

How K-Means Works:

K-Means is a popular clustering algorithm that partitions data into K clusters, where each data point belongs to the cluster with the nearest mean.

  • Initialize K centroids randomly.
  • Assign each data point to the nearest centroid.
  • Recalculate the centroids as the mean of all points in the cluster.
  • Repeat until centroids do not change significantly.

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;
import java.util.ArrayList;
import java.util.List;

class KMeansExample {
    public static void main(String[] args) {
        // Each DoublePoint wraps one observation as a double[] vector.
        List<DoublePoint> points = new ArrayList<>();
        points.add(new DoublePoint(new double[]{1.0, 2.0}));
        points.add(new DoublePoint(new double[]{2.0, 1.0}));
        points.add(new DoublePoint(new double[]{4.0, 5.0}));

        // Partition the points into K = 2 clusters using k-means++ seeding.
        KMeansPlusPlusClusterer<DoublePoint> clusterer = new KMeansPlusPlusClusterer<>(2);
        List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);

        System.out.println("Number of clusters formed: " + clusters.size());
    }
}
        

Benefits of K-Means:

K-Means is efficient and simple to implement. However, the number of clusters K must be chosen beforehand, results depend on the initial centroid placement (which the k-means++ seeding used above mitigates), and the mean-based centroid updates are sensitive to outliers.

Console Output:

Number of clusters formed: 2
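
The library call hides the loop. For illustration, here is a minimal from-scratch sketch of the four steps listed above on the same three points; the fixed initialization and exact-equality stopping test are simplifications of what a real implementation would do.

import java.util.Arrays;

class KMeansFromScratch {
    public static void main(String[] args) {
        double[][] points = {{1.0, 2.0}, {2.0, 1.0}, {4.0, 5.0}};
        // Step 1: initialize K = 2 centroids (here simply the first two points).
        double[][] centroids = {{1.0, 2.0}, {2.0, 1.0}};
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 100; iter++) {
            // Step 2: assign each point to the nearest centroid.
            for (int i = 0; i < points.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = Math.hypot(points[i][0] - centroids[c][0],
                                          points[i][1] - centroids[c][1]);
                    if (d < best) { best = d; assignment[i] = c; }
                }
            }
            // Step 3: recompute each centroid as the mean of its points.
            double[][] next = new double[centroids.length][2];
            int[] count = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                next[assignment[i]][0] += points[i][0];
                next[assignment[i]][1] += points[i][1];
                count[assignment[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    next[c][0] /= count[c];
                    next[c][1] /= count[c];
                } else {
                    next[c] = centroids[c];  // keep an empty cluster's centroid
                }
            }
            // Step 4: stop once the centroids no longer move.
            if (Arrays.deepEquals(next, centroids)) break;
            centroids = next;
        }
        // Prints: Assignments: [1, 1, 0]
        System.out.println("Assignments: " + Arrays.toString(assignment));
    }
}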

Hierarchical Clustering

Understanding Hierarchical Clustering:

Hierarchical clustering creates a tree of clusters called a dendrogram. It can be agglomerative (bottom-up) or divisive (top-down).

  • Agglomerative starts with each data point as a single cluster and merges them iteratively.
  • Divisive starts with all data points in one cluster and splits them iteratively.
  • Useful for visualizing data and understanding its structure.

// Apache Commons Math does not ship a hierarchical clusterer, so this
// sketch uses the Smile library instead (Smile 2.x API assumed).
import smile.clustering.HierarchicalClustering;
import smile.clustering.linkage.CompleteLinkage;
import java.util.Arrays;

class HierarchicalExample {
    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0},
            {2.0, 1.0},
            {4.0, 5.0}
        };

        // Agglomerative clustering with complete linkage builds the dendrogram.
        HierarchicalClustering hc = HierarchicalClustering.fit(CompleteLinkage.of(data));

        // Cut the dendrogram so that two clusters remain.
        int[] labels = hc.partition(2);
        long k = Arrays.stream(labels).distinct().count();

        System.out.println("Number of clusters formed: " + k);
    }
}
        

Advantages of Hierarchical Clustering:

It does not require specifying the number of clusters in advance and provides a clear visualization through dendrograms. However, it is computationally intensive for large datasets, since the agglomerative procedure repeatedly scans pairwise cluster distances.

Console Output:

Number of clusters formed: 2
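
To make the agglomerative procedure concrete, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters (single linkage: the distance between two clusters is the distance between their closest members) until two remain.

import java.util.ArrayList;
import java.util.List;

class AgglomerativeSketch {
    public static void main(String[] args) {
        double[][] data = {{1.0, 2.0}, {2.0, 1.0}, {4.0, 5.0}};

        // Start with each point as its own cluster (bottom-up).
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : data) {
            List<double[]> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);
        }

        // Merge the two closest clusters until two clusters remain.
        while (clusters.size() > 2) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    // Single linkage: closest pair of members across the two clusters.
                    for (double[] a : clusters.get(i)) {
                        for (double[] b : clusters.get(j)) {
                            double d = Math.hypot(a[0] - b[0], a[1] - b[1]);
                            if (d < best) { best = d; bi = i; bj = j; }
                        }
                    }
                }
            }
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        // Prints: Number of clusters formed: 2
        System.out.println("Number of clusters formed: " + clusters.size());
    }
}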

DBSCAN Clustering

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points that are closely packed and marks points in low-density regions as outliers.

  • Does not require specifying the number of clusters.
  • Can find arbitrarily shaped clusters.
  • Robust to noise and outliers.

import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.DBSCANClusterer;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import java.util.ArrayList;
import java.util.List;

class DBSCANExample {
    public static void main(String[] args) {
        // Two dense groups of three points each; DBSCAN should find both.
        List<DoublePoint> points = new ArrayList<>();
        points.add(new DoublePoint(new double[]{1.0, 2.0}));
        points.add(new DoublePoint(new double[]{2.0, 1.0}));
        points.add(new DoublePoint(new double[]{1.5, 1.5}));
        points.add(new DoublePoint(new double[]{4.0, 5.0}));
        points.add(new DoublePoint(new double[]{5.0, 4.0}));
        points.add(new DoublePoint(new double[]{4.5, 4.5}));

        // eps = 1.5 is the neighborhood radius; minPts = 2 is the number of
        // neighbors a point needs to count as a core point.
        DBSCANClusterer<DoublePoint> clusterer = new DBSCANClusterer<>(1.5, 2);
        List<Cluster<DoublePoint>> clusters = clusterer.cluster(points);

        System.out.println("Number of clusters formed: " + clusters.size());
    }
}
        

Why Use DBSCAN?

DBSCAN is effective in handling noise and discovering clusters of varying shapes and sizes, making it suitable for spatial data analysis.

Console Output:

Number of clusters formed: 2
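
DBSCAN's behavior is driven entirely by its two parameters. The sketch below shows the core-point test at the heart of the algorithm: a point is a core point when at least minPts other points lie within distance eps of it (the point itself is not counted in this sketch).

class CorePointCheck {
    // A point is a "core point" when at least minPts other points
    // lie within distance eps of it.
    static boolean isCorePoint(double[] p, double[][] data, double eps, int minPts) {
        int neighbors = 0;
        for (double[] q : data) {
            if (q != p && Math.hypot(p[0] - q[0], p[1] - q[1]) <= eps) {
                neighbors++;
            }
        }
        return neighbors >= minPts;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0}, {2.0, 1.0}, {1.5, 1.5},
            {4.0, 5.0}, {5.0, 4.0}, {4.5, 4.5}
        };
        // (1.0, 2.0) has two neighbors within eps = 1.5, so it is a core point.
        System.out.println(isCorePoint(data[0], data, 1.5, 2));  // prints: true
    }
}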

Mean Shift Clustering

Understanding Mean Shift:

Mean Shift is a non-parametric clustering technique that does not require specifying the number of clusters. It works by iteratively shifting each candidate point to the mean of the points within a surrounding region (set by the bandwidth parameter) until it settles on a density mode.

  • Tracks the mode of the data distribution.
  • Effective for finding the center of clusters.
  • Can handle clusters of varying shapes and sizes.

// Mean shift is not part of Apache Commons Math, so this is a minimal
// from-scratch sketch using a flat kernel and Euclidean distance.
import java.util.ArrayList;
import java.util.List;

class MeanShiftExample {
    // Shift a point to the mean of its neighbors within bandwidth h,
    // repeating until it stops moving (i.e. it has reached a mode).
    static double[] shiftToMode(double[] p, double[][] data, double h) {
        while (true) {
            double sx = 0, sy = 0; int n = 0;
            for (double[] q : data) {
                if (Math.hypot(p[0] - q[0], p[1] - q[1]) <= h) {
                    sx += q[0]; sy += q[1]; n++;
                }
            }
            if (n == 0) return p;  // no neighbors: nothing to shift toward
            double[] next = {sx / n, sy / n};
            if (Math.hypot(next[0] - p[0], next[1] - p[1]) < 1e-9) return next;
            p = next;
        }
    }

    public static void main(String[] args) {
        double[][] data = {{1.0, 2.0}, {2.0, 1.0}, {4.0, 5.0}};
        double bandwidth = 2.0;

        // Shift every point to its density mode; distinct modes are the clusters.
        List<double[]> modes = new ArrayList<>();
        for (double[] p : data) {
            double[] m = shiftToMode(p.clone(), data, bandwidth);
            boolean seen = modes.stream()
                .anyMatch(o -> Math.hypot(o[0] - m[0], o[1] - m[1]) < bandwidth / 2);
            if (!seen) modes.add(m);
        }
        System.out.println("Number of clusters formed: " + modes.size());
    }
}
        

Advantages of Mean Shift:

Mean Shift is versatile and can adapt to the data's inherent structure. However, it can be computationally expensive and sensitive to bandwidth selection.

Console Output:

Number of clusters formed: 2

Gaussian Mixture Models (GMM)

What is GMM?

A Gaussian Mixture Model (GMM) is a probabilistic model that represents the data as a mixture of normally distributed subpopulations within an overall population. It is a soft clustering technique: each point is assigned a probability of belonging to each cluster rather than a single hard label.

  • Models the data as a mixture of several Gaussian distributions.
  • Each Gaussian represents a cluster.
  • Can handle complex cluster shapes and overlap.

import org.apache.commons.math3.distribution.MixtureMultivariateNormalDistribution;
import org.apache.commons.math3.distribution.fitting.MultivariateNormalMixtureExpectationMaximization;

class GMMExample {
    public static void main(String[] args) {
        // Two well-separated groups; EM needs a few points per component
        // to estimate non-singular covariance matrices.
        double[][] data = {
            {1.0, 2.0}, {2.0, 1.0}, {1.5, 1.8},
            {4.0, 5.0}, {5.0, 4.0}, {4.5, 5.2}
        };

        // Fit a two-component Gaussian mixture with expectation-maximization,
        // starting from a simple data-driven initial estimate.
        MultivariateNormalMixtureExpectationMaximization em =
            new MultivariateNormalMixtureExpectationMaximization(data);
        em.fit(MultivariateNormalMixtureExpectationMaximization.estimate(data, 2));
        MixtureMultivariateNormalDistribution fitted = em.getFittedModel();

        System.out.println("Number of clusters formed: " + fitted.getComponents().size());
    }
}
        

Benefits of GMM:

GMM is flexible in modeling complex data distributions and provides a probabilistic clustering approach, allowing for soft assignments of data points to clusters.

Console Output:

Number of clusters formed: 2
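
The soft assignment itself is just Bayes' rule over the fitted components: the responsibility of component k for point x is its weighted density divided by the total weighted density. The sketch below assumes the fitted model from the example above.

import org.apache.commons.math3.distribution.MultivariateNormalDistribution;
import org.apache.commons.math3.util.Pair;
import java.util.List;

class SoftAssignment {
    // Posterior probability of each component for one point x:
    // p(k | x) = weight_k * N(x; mean_k, cov_k) / sum_j weight_j * N(x; mean_j, cov_j)
    static double[] responsibilities(
            List<Pair<Double, MultivariateNormalDistribution>> components, double[] x) {
        double[] r = new double[components.size()];
        double total = 0.0;
        for (int k = 0; k < r.length; k++) {
            Pair<Double, MultivariateNormalDistribution> c = components.get(k);
            r[k] = c.getFirst() * c.getSecond().density(x);
            total += r[k];
        }
        for (int k = 0; k < r.length; k++) {
            r[k] /= total;  // normalize so the probabilities sum to 1
        }
        return r;
    }
}

With the model fitted in the GMM example, responsibilities(fitted.getComponents(), new double[]{1.2, 1.9}) would return a probability per component, close to 1.0 for the component covering the first group and close to 0.0 for the other.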
