Agglomerative clustering, a popular hierarchical clustering method, is guided by several key hyperparameters that influence how clusters are formed and the final structure of the clustering solution. Understanding these parameters is crucial for effectively applying the algorithm to various datasets.
The main hyperparameters in Agglomerative clustering are `n_clusters`, `affinity` (the distance measure), and the `linkage` type. Beyond these core settings, others like `distance_threshold` also play a significant role in advanced scenarios.
Core Hyperparameters Explained
Let's delve into the primary hyperparameters that define the behavior of Agglomerative Clustering:
1. n_clusters
This hyperparameter determines the number of clusters to form. Agglomerative clustering builds a hierarchy of clusters, and `n_clusters` essentially tells the algorithm where to cut this hierarchy to obtain a flat partitioning of the data.
- How it works: If you set `n_clusters` to a specific integer (e.g., 3 or 5), the algorithm stops merging clusters once it reaches that number of distinct clusters.
- Practical Considerations:
  - Often, the optimal number of clusters is unknown beforehand. Techniques like the elbow method or silhouette score can help in determining a suitable value.
  - If `n_clusters` is set to `None`, the algorithm performs a full hierarchical clustering, merging all data points into a single cluster; in that case you must set `distance_threshold` instead.
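As a minimal sketch (assuming scikit-learn's `AgglomerativeClustering` and a small synthetic dataset), cutting the hierarchy at a fixed number of clusters looks like this:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in centers])

# Cut the hierarchy so that exactly 3 flat clusters remain
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)

print(len(set(labels)))  # 3 distinct cluster labels
```

The same `labels` array is what you would feed into evaluation metrics such as the silhouette score when comparing candidate values of `n_clusters`.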
2. affinity (Distance Measure)
The `affinity` parameter defines the metric used to calculate the distance between individual data points. (In recent scikit-learn releases this parameter is named `metric`; `affinity` was deprecated in version 1.2 and removed in 1.4.) This distance measure is fundamental to how similarity or dissimilarity is quantified, directly impacting which points are considered close enough to be merged.
- Common Distance Metrics:
  - `Euclidean` (L2 norm): The most common choice, representing the straight-line distance between two points in Euclidean space. Ideal for continuous, numeric data where geometric distance is meaningful.
  - `Manhattan` (L1 norm): Also known as city-block distance, it measures the sum of the absolute differences of the coordinates. Useful when the difference in individual dimensions is more important than the overall diagonal distance, or when dealing with high-dimensional data that might have outliers.
  - `Cosine`: Measures the cosine of the angle between two vectors. It's particularly effective for text analysis or other high-dimensional data where the orientation of the vectors (pattern similarity) is more important than their magnitude (absolute values).
  - `L1` and `L2`: These correspond to Manhattan and Euclidean distances, respectively.
- Impact: The choice of `affinity` significantly influences the shape and density of the clusters. For instance, Euclidean distance tends to find spherical clusters, while cosine similarity might group documents with similar topics regardless of their length.
- Constraint: The `affinity` must be compatible with the chosen `linkage` method. For example, 'ward' linkage only works with Euclidean distance.
3. linkage Type
The `linkage` parameter determines how the distance between two clusters is calculated during the merging process. This is critical for defining the cluster formation strategy.
- Common Linkage Methods:
  - `Ward`: Minimizes the variance of the clusters being merged. It tends to produce compact, spherical clusters of roughly equal size. Only compatible with Euclidean distance.
  - `Complete` (Maximum Linkage): Considers the maximum distance between any two points in the two clusters. This method aims to find compact, well-separated clusters and is sensitive to outliers.
  - `Average` (Average Linkage): Calculates the average distance between all pairs of points across the two clusters. It tends to produce more balanced clusters than single linkage and is less susceptible to noise than complete linkage.
  - `Single` (Minimum Linkage): Considers the minimum distance between any two points in the two clusters. It's prone to "chaining," where clusters can grow by adding single points, forming long, elongated clusters. Useful for detecting non-convex shapes.
- Practical Insights:
- Ward is often a good default choice when clusters are expected to be relatively balanced and spherical.
- Complete linkage can be useful when you want very distinct, separated clusters.
- Single linkage excels at identifying irregularly shaped clusters but can be sensitive to noise.
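The trade-offs above can be sketched by fitting the same data with each linkage (a toy comparison, assuming scikit-learn): on clean, well-separated blobs all four methods agree, and their differences only emerge on noisier or elongated data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three well-separated blobs of 25 points each
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [6, 0], [3, 6]])
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(25, 2)) for c in centers])

results = {}
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    # Sorted cluster sizes; on well-separated blobs every linkage recovers 25/25/25
    results[linkage] = sorted(np.bincount(labels).tolist())

print(results)
```

Rerunning this with overlapping blobs or a noise point between clusters is a quick way to see single linkage start chaining while ward stays balanced.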
Summary of Key Hyperparameters
| Hyperparameter | Description | Common Values/Options | Impact on Clustering |
|---|---|---|---|
| `n_clusters` | Desired number of clusters to form. | Integer (e.g., 2, 3, 5), `None` | Determines the "cut" point in the hierarchy. |
| `affinity` | Metric used to calculate the distance between data points. | `Euclidean`, `Manhattan`, `Cosine`, `L1`, `L2` | Defines how similarity is measured; impacts cluster shape. |
| `linkage` | Method used to calculate the distance between two clusters. | `Ward`, `Complete`, `Average`, `Single` | Dictates how clusters merge and their resulting structure. |
| `distance_threshold` | The maximum distance at which clusters may still merge (alternative to `n_clusters` when `n_clusters` is `None`). | Float | Determines cluster formation based on a distance cutoff. |
| `compute_full_tree` | Whether to compute the full hierarchy or stop early. | `auto`, `True`, `False` | Controls computational efficiency, especially with `n_clusters`. |
Additional Hyperparameters and Considerations
distance_threshold
This hyperparameter offers an alternative way to control the clustering process when `n_clusters` is set to `None`. Instead of specifying the number of clusters, you define a maximum distance threshold for merging: clusters continue to merge as long as the linkage distance between them is below this threshold.
- Use Case: Useful when you have a natural notion of "closeness" and want clusters formed by points within a certain proximity, rather than a predetermined count.
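A minimal sketch (assuming scikit-learn): with `n_clusters=None` the cut is defined by a distance cutoff instead, and the resulting number of clusters is reported afterwards via the fitted `n_clusters_` attribute.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three tight, well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in centers])

# No fixed cluster count; merging stops once every remaining pair of
# clusters is farther apart (in ward linkage distance) than the threshold.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0)
model.fit(X)

print(model.n_clusters_)  # 3 for this threshold on these blobs
```

Raising the threshold merges more aggressively (fewer clusters); lowering it leaves more, smaller clusters, without ever fixing the count in advance.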
compute_full_tree
This parameter dictates whether the entire hierarchical tree is computed or whether the process stops early once the `n_clusters` or `distance_threshold` criterion is met.
- Impact: Setting it to `False` (or leaving it at `auto` when the number of clusters is small relative to the number of samples) can save computational time and memory, especially for very large datasets, if you only need a specific number of clusters. Note that the full tree must be computed whenever `distance_threshold` is used.
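A quick sketch of this trade-off (assuming scikit-learn): with a fixed `n_clusters`, stopping early yields the same flat partition while skipping the top of the tree; the adjusted Rand score comparison below is just an illustrative way to confirm the partitions match.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))

# Stop merging as soon as 10 clusters remain, instead of building
# the complete tree all the way down to a single root.
early = AgglomerativeClustering(n_clusters=10, compute_full_tree=False).fit_predict(X)
full = AgglomerativeClustering(n_clusters=10, compute_full_tree=True).fit_predict(X)

# Identical partitions score 1.0 regardless of label numbering
print(adjusted_rand_score(early, full))
```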
Practical Advice for Hyperparameter Tuning
- Dendrograms: For smaller datasets, visualizing the dendrogram can be incredibly insightful for choosing `n_clusters` or `distance_threshold`, as it graphically represents the merging process and distances.
- Evaluation Metrics: Utilize metrics like the Silhouette Score or Davies-Bouldin index to quantitatively compare different hyperparameter combinations if ground-truth labels are unavailable.
- Domain Knowledge: Always leverage domain expertise to guide your choices. For instance, if you're clustering documents, cosine similarity might be a more natural `affinity` choice.
- Iterative Approach: Start with common defaults (e.g., Euclidean for `affinity`, Ward for `linkage`) and iteratively experiment with changes, observing their impact.
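The dendrogram and silhouette advice above can be sketched with SciPy's hierarchy tools plus scikit-learn's silhouette score (both assumed available; `scipy.cluster.hierarchy.dendrogram(Z)` would render the tree in a plotting environment):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in centers])

# Linkage matrix encoding the full merge history;
# dendrogram(Z) would draw it for visual inspection of merge distances.
Z = linkage(X, method="ward")

# Compare candidate cluster counts quantitatively via the silhouette score
scores = {}
for k in (2, 3, 4, 5):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 maximizes the silhouette on these three blobs
```

The same loop works with `AgglomerativeClustering` directly; SciPy is used here only because its linkage matrix is what the dendrogram plot consumes.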
Understanding and appropriately tuning these hyperparameters allows data scientists to effectively sculpt the clustering structure to reveal meaningful patterns and insights within their data.