Preface
In the previous post we studied the classic K-Means clustering algorithm. K-Means has drawbacks, though: you must specify K in advance, it is sensitive to outliers, and it can only handle spherical clusters. That is why many other clustering algorithms exist, each suited to different scenarios.
In this post we cover four common clustering algorithms: hierarchical clustering, DBSCAN, Mean Shift, and Gaussian Mixture Models (GMM). For each one we explain the principle, the pros and cons, and the typical use cases, with runnable code. By the end you will know which clustering algorithm fits which scenario.
At the end of the previous post on K-Means clustering we left a practice exercise; let's now walk through the answer together.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# 1. Generate simulated data with 4 clusters
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.5, random_state=42)
# 2. Elbow method to find K
k_range = range(1, 11)
sse = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(8, 4))
plt.plot(k_range, sse, 'o-', linewidth=2)
plt.xlabel('K (number of clusters)')
plt.ylabel('SSE (within-cluster sum of squares)')
plt.title('Elbow Method on 4 Blobs')
plt.grid(True, alpha=0.3)
plt.show()
# 3. Silhouette score (undefined for K=1, so start from K=2)
sil_scores = []
for k in k_range[1:]:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    sil_scores.append(silhouette_score(X, labels))
plt.figure(figsize=(8, 4))
plt.plot(k_range[1:], sil_scores, 'o-', linewidth=2)
plt.xlabel('K (number of clusters)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score on 4 Blobs')
plt.grid(True, alpha=0.3)
plt.show()
# 4. Pick the best K and visualize the result
best_k = k_range[1:][np.argmax(sil_scores)]
print(f"K with the highest silhouette score: {best_k}")
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(f'K-Means Clustering (K={best_k})')
plt.legend()
plt.show()
print(f"SSE: {kmeans.inertia_:.4f}")
print(f"Silhouette score: {silhouette_score(X, labels):.4f}")
Result analysis:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
X = iris.data
y_true = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Try K = 2, 3, 4
for k in [2, 3, 4]:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    ari = adjusted_rand_score(y_true, labels)
    print(f"K={k}: silhouette={sil:.4f}, ARI={ari:.4f}")

    # Visualize (first two features, standardized)
    plt.figure(figsize=(8, 6))
    plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c='red', marker='X', s=200)
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.title(f'K-Means Clustering on Iris (K={k})')
    plt.show()
Typical results:
K=2: silhouette=0.4701, ARI=0.5631
K=3: silhouette=0.4500, ARI=0.6589
K=4: silhouette=0.4102, ARI=0.6307
Conclusion: ARI peaks at K=3, which matches the true number of classes. Note that the silhouette score is actually highest at K=2, because two of the three Iris species overlap heavily in feature space; internal metrics alone can be misleading.
Hierarchical clustering builds clusters level by level, in two flavors: agglomerative (bottom-up: every point starts as its own cluster, and the closest pair of clusters is merged repeatedly) and divisive (top-down: all points start in one cluster, which is split repeatedly).
Steps of agglomerative hierarchical clustering: treat each sample as its own cluster; compute the distance between every pair of clusters; merge the two closest clusters; repeat until only one cluster remains (or the desired number of clusters is reached).
How is the distance between two clusters measured? Three common linkage criteria:

| Linkage | Cluster distance |
| --- | --- |
| Single linkage | Distance between the closest pair of points, one from each cluster |
| Complete linkage | Distance between the farthest pair of points, one from each cluster |
| Average linkage | Average distance over all cross-cluster point pairs |
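As a quick, hedged illustration of how much the linkage choice matters (the dataset and parameters here are my own picks, not part of the original exercise), the three criteria can be compared side by side with scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

# Two interleaved moons: a shape where linkage choice matters a lot
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

for link in ('single', 'complete', 'average'):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    # Cluster sizes hint at whether the two moons were separated
    print(link, np.bincount(labels))
```

Single linkage tends to "chain" along each moon and separate them, while complete and average linkage usually cut across the moons.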
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Generate data
X, y_true = make_blobs(n_samples=50, centers=3, random_state=42)
# Agglomerative hierarchical clustering
agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)
# Visualize the result
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Hierarchical Clustering (Average Linkage, K=3)')
plt.show()
# Plot the dendrogram
plt.figure(figsize=(10, 4))
Z = linkage(X, method='average')
dendrogram(Z)
plt.title('Dendrogram (Hierarchical Clustering)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
The dendrogram shows clearly how the clusters are merged level by level; cutting the tree at a chosen height yields the corresponding number of clusters.
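Cutting the dendrogram can be done in code with SciPy's `fcluster` (the height threshold below is just an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
Z = linkage(X, method='average')

# Cut at a fixed height: merges above this distance are undone
labels_by_height = fcluster(Z, t=5.0, criterion='distance')
print("clusters at height 5.0:", len(np.unique(labels_by_height)))

# Or ask directly for (at most) a given number of clusters
labels_k3 = fcluster(Z, t=3, criterion='maxclust')
print("clusters with maxclust=3:", len(np.unique(labels_k3)))
```

This is the programmatic equivalent of drawing a horizontal line across the dendrogram and counting the branches it crosses.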
Typical use cases:
DBSCAN is a density-based clustering algorithm. It does not require K to be specified in advance, it can discover clusters of arbitrary shape, and it automatically flags noise points. These are its biggest advantages over K-Means.
Core idea: a "cluster" is a region of densely connected points; low-density regions separate the clusters, and points lying in low-density regions are noise.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_blobs
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Example 1: moon-shaped data (non-convex; K-Means cannot handle it)
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)
# DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)
# Visualize
plt.figure(figsize=(8, 6))
unique_labels = set(labels)
colors = [plt.cm.viridis(each / (len(unique_labels) - 1)) for each in unique_labels]
for label, color in zip(unique_labels, colors):
    if label == -1:
        color = (0, 0, 0, 1)  # noise points in black
    mask = labels == label
    plt.scatter(X[mask, 0], X[mask, 1], color=color,
                label=f'Cluster {label}' if label != -1 else 'Noise')
plt.title('DBSCAN on Moons Data')
plt.legend()
plt.show()
print(f"Number of clusters found: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Number of noise points: {sum(labels == -1)}")
DBSCAN separates the two moons cleanly, whereas K-Means fails on this example.
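A common heuristic for choosing `eps` (a rule of thumb, not part of the original example) is the k-distance curve: sort every point's distance to its `min_samples`-th nearest neighbor and look for the "knee" of the curve:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

k = 5  # match min_samples
# n_neighbors=k+1 because each point is its own nearest neighbor (distance 0)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])  # distance to the k-th true neighbor, sorted

# The knee of this sorted curve suggests eps; quantiles give a rough feel
print("median / 90th pct k-distance:", np.quantile(k_dist, [0.5, 0.9]))
```

Plotting `k_dist` and reading off the distance where the curve bends sharply gives a defensible starting value for `eps`.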
Typical use cases:
Mean Shift is a clustering algorithm based on gradient ascent on the density surface: each point is repeatedly shifted toward the mean of the points inside a window around it (the bandwidth) until it converges to a density peak, and points converging to the same peak form one cluster.
Core idea: clusters are the modes (peaks) of the data density; find the modes and assign each point to the mode it converges to.
Pros: no need to specify K; a single main parameter (bandwidth).
Cons: computationally expensive on large datasets; results are very sensitive to the bandwidth choice.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift
# Generate data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
# Mean Shift clustering
ms = MeanShift(bandwidth=None)  # bandwidth estimated automatically
ms.fit(X)
labels = ms.labels_
centers = ms.cluster_centers_
print(f"Number of clusters found automatically: {len(np.unique(labels))}")
# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.title('Mean Shift Clustering')
plt.legend()
plt.show()
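With `bandwidth=None`, scikit-learn calls `estimate_bandwidth` internally. You can also call it yourself and vary its `quantile` parameter (the values below are arbitrary examples) to see how bandwidth drives the number of clusters found:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Smaller quantile -> smaller bandwidth -> more, finer-grained clusters
for q in (0.1, 0.2, 0.5):
    bw = estimate_bandwidth(X, quantile=q, random_state=42)
    ms = MeanShift(bandwidth=bw).fit(X)
    print(f"quantile={q}: bandwidth={bw:.2f}, clusters={len(np.unique(ms.labels_))}")
```

This makes the bandwidth sensitivity mentioned above concrete: the same data yields different cluster counts as the window widens.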
Typical use cases:
The Gaussian Mixture Model (GMM) assumes that all the data is generated by a mixture of several Gaussian distributions, with each Gaussian corresponding to one cluster. We estimate the parameters of each Gaussian (mean and covariance) and then compute the probability that each sample belongs to each component.
In one sentence: GMM assumes each cluster follows a Gaussian distribution and uses the EM algorithm to estimate the parameters.
GMM is fitted iteratively with the EM algorithm: the E-step computes the posterior probability of each sample belonging to each component; the M-step re-estimates the means, covariances, and mixing weights from those probabilities; the two steps repeat until convergence.

| Assignment type | Meaning | Example |
| --- | --- | --- |
| Hard clustering | Each sample is assigned to exactly one cluster | K-Means |
| Soft clustering (probabilistic) | Each sample gets a probability of belonging to each cluster | GMM |

GMM outputs probabilities: you can take the most probable component as the cluster assignment, or keep the full probabilities for downstream analysis.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
# Generate data
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)
# GMM clustering
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
means = gmm.means_
probs = gmm.predict_proba(X)
print(f"Mixing weights: {gmm.weights_}")
print(f"Means:\n{means}")
# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
plt.scatter(means[:, 0], means[:, 1], c='red', marker='X', s=200, label='Means')
plt.title('Gaussian Mixture Model Clustering (K=3)')
plt.legend()
plt.show()
# Print the probabilities for the first sample
print(f"\nProbability of the first sample belonging to each cluster: {probs[0]}")
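A practical perk of GMM's probabilistic formulation (shown here as a sketch on the same kind of synthetic data) is that information criteria such as BIC can choose the number of components for you:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit GMMs with 1..6 components and keep each model's BIC (lower is better)
bics = [GaussianMixture(n_components=k, random_state=42).fit(X).bic(X)
        for k in range(1, 7)]
best_k = int(np.argmin(bics)) + 1
print(f"Best number of components by BIC: {best_k}")
```

BIC penalizes extra parameters, so it rewards the simplest mixture that still explains the data well; AIC (`gmm.aic(X)`) works the same way with a lighter penalty.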
Typical use cases:
| Algorithm | Needs K in advance | Cluster shape | Noise handling | Notes |
| --- | --- | --- | --- | --- |
| K-Means | Yes | Spherical | Sensitive to outliers | Fast; good first choice |
| Hierarchical | No (cut the dendrogram) | Depends on linkage | None built in | Dendrogram reveals structure |
| DBSCAN | No | Arbitrary | Flags noise as -1 | Requires tuning eps and min_samples |
| Mean Shift | No (set via bandwidth) | Arbitrary | None built in | Slow on large data; bandwidth-sensitive |
| GMM | Yes | Elliptical | None built in | Soft (probabilistic) assignments |
Quick selection guide:
We compare K-Means, DBSCAN, hierarchical clustering, and GMM on moon-shaped data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
# Generate moon-shaped data
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)
# Compare four algorithms
algorithms = [
('K-Means (K=2)', KMeans(n_clusters=2, random_state=42, n_init=10)),
('DBSCAN', DBSCAN(eps=0.3, min_samples=5)),
('Hierarchical (K=2)', AgglomerativeClustering(n_clusters=2, linkage='average')),
('GMM (K=2)', GaussianMixture(n_components=2, random_state=42))
]
plt.figure(figsize=(16, 10))
for i, (name, algo) in enumerate(algorithms):
    labels = algo.fit_predict(X)
    plt.subplot(2, 2, i + 1)
    unique_labels = set(labels)
    for label in unique_labels:
        if label == -1:
            color = (0, 0, 0, 1)  # noise in black
        else:
            color = (plt.cm.viridis(label / (len(unique_labels) - 1))
                     if len(unique_labels) > 1 else (0.5, 0.5, 0.5, 1))
        mask = labels == label
        plt.scatter(X[mask, 0], X[mask, 1], color=color, alpha=0.7)
    plt.title(name)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
plt.tight_layout()
plt.show()
Running this code, you will clearly see:
Now it's your turn to practice. Complete the following tasks:
Use make_blobs to generate data with 3 clusters and compare three clustering algorithms.
Use make_moons to generate moon-shaped data and try the algorithms on it.
We covered four common clustering algorithms, each suited to different scenarios:
Remember this: there is no single best clustering algorithm, only the algorithm that best fits your scenario. Try K-Means first; if it does not work, choose another algorithm based on the characteristics of your data.
Reminder: run the code in this post and complete the exercises. Real understanding comes from practice.
| Algorithm | scikit-learn import |
| --- | --- |
| Hierarchical clustering | from sklearn.cluster import AgglomerativeClustering |
| DBSCAN | from sklearn.cluster import DBSCAN |
| Mean Shift | from sklearn.cluster import MeanShift |
| GMM | from sklearn.mixture import GaussianMixture |
Key concepts:
Selection mnemonic: