Clustering Algorithms

Understanding Clustering:

Clustering is an unsupervised learning technique that groups similar data points together. It helps reveal patterns and structure in the data without requiring any prior labels.

  • Used for market segmentation, social network analysis, and image compression.
  • Helps in understanding data distribution and identifying anomalies.
  • Commonly used clustering algorithms include K-Means, Hierarchical Clustering, and DBSCAN.

K-Means Clustering

How K-Means Works:

K-Means is a popular clustering algorithm that partitions data into K clusters, where each data point belongs to the cluster with the nearest mean.

  • Initialize K centroids randomly.
  • Assign each data point to the nearest centroid.
  • Recalculate the centroids as the mean of all points in the cluster.
  • Repeat until centroids do not change significantly.

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;
import java.util.ArrayList;
import java.util.List;

class KMeansExample {
    public static void main(String[] args) {
        // Each DoublePoint wraps one observation as a double[] vector.
        List<DoublePoint> points = new ArrayList<>();
        points.add(new DoublePoint(new double[]{1.0, 2.0}));
        points.add(new DoublePoint(new double[]{2.0, 1.0}));
        points.add(new DoublePoint(new double[]{4.0, 5.0}));

        // Partition the points into K = 2 clusters using k-means++ seeding.
        KMeansPlusPlusClusterer<DoublePoint> clusterer = new KMeansPlusPlusClusterer<>(2);
        List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);

        System.out.println("Number of clusters formed: " + clusters.size());
    }
}
        

Benefits of K-Means:

K-Means is efficient and simple to implement. However, the number of clusters K must be chosen beforehand, results depend on the initial centroid placement (which the k-means++ seeding used above mitigates), and the mean-based centroid updates are sensitive to outliers.

Console Output:

Number of clusters formed: 2
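
The library call hides the loop. For illustration, here is a minimal from-scratch sketch of the four steps listed above on the same three points; the fixed initialization and exact-equality stopping test are simplifications of what a real implementation would do.

import java.util.Arrays;

class KMeansFromScratch {
    public static void main(String[] args) {
        double[][] points = {{1.0, 2.0}, {2.0, 1.0}, {4.0, 5.0}};
        // Step 1: initialize K = 2 centroids (here simply the first two points).
        double[][] centroids = {{1.0, 2.0}, {2.0, 1.0}};
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 100; iter++) {
            // Step 2: assign each point to the nearest centroid.
            for (int i = 0; i < points.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = Math.hypot(points[i][0] - centroids[c][0],
                                          points[i][1] - centroids[c][1]);
                    if (d < best) { best = d; assignment[i] = c; }
                }
            }
            // Step 3: recompute each centroid as the mean of its points.
            double[][] next = new double[centroids.length][2];
            int[] count = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                next[assignment[i]][0] += points[i][0];
                next[assignment[i]][1] += points[i][1];
                count[assignment[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    next[c][0] /= count[c];
                    next[c][1] /= count[c];
                } else {
                    next[c] = centroids[c];  // keep an empty cluster's centroid
                }
            }
            // Step 4: stop once the centroids no longer move.
            if (Arrays.deepEquals(next, centroids)) break;
            centroids = next;
        }
        // Prints: Assignments: [1, 1, 0]
        System.out.println("Assignments: " + Arrays.toString(assignment));
    }
}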

Hierarchical Clustering

Understanding Hierarchical Clustering:

Hierarchical clustering creates a tree of clusters called a dendrogram. It can be agglomerative (bottom-up) or divisive (top-down).

  • Agglomerative starts with each data point as a single cluster and merges them iteratively.
  • Divisive starts with all data points in one cluster and splits them iteratively.
  • Useful for visualizing data and understanding its structure.

// Apache Commons Math does not ship a hierarchical clusterer, so this
// sketch uses the Smile library instead (Smile 2.x API assumed).
import smile.clustering.HierarchicalClustering;
import smile.clustering.linkage.CompleteLinkage;
import java.util.Arrays;

class HierarchicalExample {
    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0},
            {2.0, 1.0},
            {4.0, 5.0}
        };

        // Agglomerative clustering with complete linkage builds the dendrogram.
        HierarchicalClustering hc = HierarchicalClustering.fit(CompleteLinkage.of(data));

        // Cut the dendrogram so that two clusters remain.
        int[] labels = hc.partition(2);
        long k = Arrays.stream(labels).distinct().count();

        System.out.println("Number of clusters formed: " + k);
    }
}
        

Advantages of Hierarchical Clustering:

It does not require specifying the number of clusters in advance and provides a clear visualization through dendrograms. However, it is computationally intensive for large datasets, since the agglomerative procedure repeatedly scans pairwise cluster distances.

Console Output:

Number of clusters formed: 2
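
To make the agglomerative procedure concrete, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters (single linkage: the distance between two clusters is the distance between their closest members) until two remain.

import java.util.ArrayList;
import java.util.List;

class AgglomerativeSketch {
    public static void main(String[] args) {
        double[][] data = {{1.0, 2.0}, {2.0, 1.0}, {4.0, 5.0}};

        // Start with each point as its own cluster (bottom-up).
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : data) {
            List<double[]> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);
        }

        // Merge the two closest clusters until two clusters remain.
        while (clusters.size() > 2) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    // Single linkage: closest pair of members across the two clusters.
                    for (double[] a : clusters.get(i)) {
                        for (double[] b : clusters.get(j)) {
                            double d = Math.hypot(a[0] - b[0], a[1] - b[1]);
                            if (d < best) { best = d; bi = i; bj = j; }
                        }
                    }
                }
            }
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        // Prints: Number of clusters formed: 2
        System.out.println("Number of clusters formed: " + clusters.size());
    }
}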

DBSCAN Clustering

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points that are closely packed and marks points in low-density regions as outliers.

  • Does not require specifying the number of clusters.
  • Can find arbitrarily shaped clusters.
  • Robust to noise and outliers.

import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.DBSCANClusterer;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import java.util.ArrayList;
import java.util.List;

class DBSCANExample {
    public static void main(String[] args) {
        // Two dense groups of three points each; DBSCAN should find both.
        List<DoublePoint> points = new ArrayList<>();
        points.add(new DoublePoint(new double[]{1.0, 2.0}));
        points.add(new DoublePoint(new double[]{2.0, 1.0}));
        points.add(new DoublePoint(new double[]{1.5, 1.5}));
        points.add(new DoublePoint(new double[]{4.0, 5.0}));
        points.add(new DoublePoint(new double[]{5.0, 4.0}));
        points.add(new DoublePoint(new double[]{4.5, 4.5}));

        // eps = 1.5 is the neighborhood radius; minPts = 2 is the number of
        // neighbors a point needs to count as a core point.
        DBSCANClusterer<DoublePoint> clusterer = new DBSCANClusterer<>(1.5, 2);
        List<Cluster<DoublePoint>> clusters = clusterer.cluster(points);

        System.out.println("Number of clusters formed: " + clusters.size());
    }
}
        

Why Use DBSCAN?

DBSCAN is effective in handling noise and discovering clusters of varying shapes and sizes, making it suitable for spatial data analysis.

Console Output:

Number of clusters formed: 2
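
DBSCAN's behavior is driven entirely by its two parameters. The sketch below shows the core-point test at the heart of the algorithm: a point is a core point when at least minPts other points lie within distance eps of it (the point itself is not counted in this sketch).

class CorePointCheck {
    // A point is a "core point" when at least minPts other points
    // lie within distance eps of it.
    static boolean isCorePoint(double[] p, double[][] data, double eps, int minPts) {
        int neighbors = 0;
        for (double[] q : data) {
            if (q != p && Math.hypot(p[0] - q[0], p[1] - q[1]) <= eps) {
                neighbors++;
            }
        }
        return neighbors >= minPts;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0}, {2.0, 1.0}, {1.5, 1.5},
            {4.0, 5.0}, {5.0, 4.0}, {4.5, 4.5}
        };
        // (1.0, 2.0) has two neighbors within eps = 1.5, so it is a core point.
        System.out.println(isCorePoint(data[0], data, 1.5, 2));  // prints: true
    }
}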

Mean Shift Clustering

Understanding Mean Shift:

Mean Shift is a non-parametric clustering technique that does not require specifying the number of clusters. It works by iteratively shifting each candidate point to the mean of the points within a surrounding region (set by the bandwidth parameter) until it settles on a density mode.

  • Tracks the mode of the data distribution.
  • Effective for finding the center of clusters.
  • Can handle clusters of varying shapes and sizes.

// Mean shift is not part of Apache Commons Math, so this is a minimal
// from-scratch sketch using a flat kernel and Euclidean distance.
import java.util.ArrayList;
import java.util.List;

class MeanShiftExample {
    // Shift a point to the mean of its neighbors within bandwidth h,
    // repeating until it stops moving (i.e. it has reached a mode).
    static double[] shiftToMode(double[] p, double[][] data, double h) {
        while (true) {
            double sx = 0, sy = 0; int n = 0;
            for (double[] q : data) {
                if (Math.hypot(p[0] - q[0], p[1] - q[1]) <= h) {
                    sx += q[0]; sy += q[1]; n++;
                }
            }
            if (n == 0) return p;  // no neighbors: nothing to shift toward
            double[] next = {sx / n, sy / n};
            if (Math.hypot(next[0] - p[0], next[1] - p[1]) < 1e-9) return next;
            p = next;
        }
    }

    public static void main(String[] args) {
        double[][] data = {{1.0, 2.0}, {2.0, 1.0}, {4.0, 5.0}};
        double bandwidth = 2.0;

        // Shift every point to its density mode; distinct modes are the clusters.
        List<double[]> modes = new ArrayList<>();
        for (double[] p : data) {
            double[] m = shiftToMode(p.clone(), data, bandwidth);
            boolean seen = modes.stream()
                .anyMatch(o -> Math.hypot(o[0] - m[0], o[1] - m[1]) < bandwidth / 2);
            if (!seen) modes.add(m);
        }
        System.out.println("Number of clusters formed: " + modes.size());
    }
}
        

Advantages of Mean Shift:

Mean Shift is versatile and can adapt to the data's inherent structure. However, it can be computationally expensive and sensitive to bandwidth selection.

Console Output:

Number of clusters formed: 2

Gaussian Mixture Models (GMM)

What is GMM?

A Gaussian Mixture Model (GMM) is a probabilistic model that represents the data as a mixture of normally distributed subpopulations within an overall population. It is a soft clustering technique: each point is assigned a probability of belonging to each cluster rather than a single hard label.

  • Models the data as a mixture of several Gaussian distributions.
  • Each Gaussian represents a cluster.
  • Can handle complex cluster shapes and overlap.

import org.apache.commons.math3.distribution.MixtureMultivariateNormalDistribution;
import org.apache.commons.math3.distribution.fitting.MultivariateNormalMixtureExpectationMaximization;

class GMMExample {
    public static void main(String[] args) {
        // Two well-separated groups; EM needs a few points per component
        // to estimate non-singular covariance matrices.
        double[][] data = {
            {1.0, 2.0}, {2.0, 1.0}, {1.5, 1.8},
            {4.0, 5.0}, {5.0, 4.0}, {4.5, 5.2}
        };

        // Fit a two-component Gaussian mixture with expectation-maximization,
        // starting from a simple data-driven initial estimate.
        MultivariateNormalMixtureExpectationMaximization em =
            new MultivariateNormalMixtureExpectationMaximization(data);
        em.fit(MultivariateNormalMixtureExpectationMaximization.estimate(data, 2));
        MixtureMultivariateNormalDistribution fitted = em.getFittedModel();

        System.out.println("Number of clusters formed: " + fitted.getComponents().size());
    }
}
        

Benefits of GMM:

GMM is flexible in modeling complex data distributions and provides a probabilistic clustering approach, allowing for soft assignments of data points to clusters.

Console Output:

Number of clusters formed: 2
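
The soft assignment itself is just Bayes' rule over the fitted components: the responsibility of component k for point x is its weighted density divided by the total weighted density. The sketch below assumes the fitted model from the example above.

import org.apache.commons.math3.distribution.MultivariateNormalDistribution;
import org.apache.commons.math3.util.Pair;
import java.util.List;

class SoftAssignment {
    // Posterior probability of each component for one point x:
    // p(k | x) = weight_k * N(x; mean_k, cov_k) / sum_j weight_j * N(x; mean_j, cov_j)
    static double[] responsibilities(
            List<Pair<Double, MultivariateNormalDistribution>> components, double[] x) {
        double[] r = new double[components.size()];
        double total = 0.0;
        for (int k = 0; k < r.length; k++) {
            Pair<Double, MultivariateNormalDistribution> c = components.get(k);
            r[k] = c.getFirst() * c.getSecond().density(x);
            total += r[k];
        }
        for (int k = 0; k < r.length; k++) {
            r[k] /= total;  // normalize so the probabilities sum to 1
        }
        return r;
    }
}

With the model fitted in the GMM example, responsibilities(fitted.getComponents(), new double[]{1.2, 1.9}) would return a probability per component, close to 1.0 for the component covering the first group and close to 0.0 for the other.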
