19. 支持向量机实现人像分类#

19.1. 介绍#

支持向量机是一个非常优秀的算法。本次挑战中，我们将使用 scikit-learn 提供的支持向量机方法来完成人脸图像分类任务。

19.2. 知识点#

图像数据预处理
支持向量机分类

首先，我们通过 scikit-learn 提供的 fetch_lfw_people 下载人脸数据集，数据集最初出自 Labeled Faces in the Wild 项目。

from sklearn.datasets import fetch_lfw_people

# 加载数据集
faces = fetch_lfw_people(min_faces_per_person=60)
faces.target_names, faces.images.shape

(array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
        'Gerhard Schroeder', 'Hugo Chavez', 'Junichiro Koizumi',
        'Tony Blair'], dtype='<U17'),
 (1348, 62, 47))

可以看的，我们仅使用了 8 位名人的人像数据，总共包含 1348 个样本。其中，每张人像照片的尺寸为 \(62*47\) 像素。总结 faces 属性有：

属性	描述
`faces.images`	62x47 矩阵，记录人脸图像中的像素值
`faces.data`	将 images 对应的 62x47 矩阵转换为行向量
`faces.target_names`	8 位人像姓名
`faces.target`	8 位人像依次编号

下面，我们先使用 Matplotlib 绘图来预览这些数据。

Exercise 19.1

挑战：预览数据集前 5 张人像图片，以 1 行 5 列子图呈现。

规定：每张图片横轴上显示该张图片对应人像姓名。

from matplotlib import pyplot as plt

%matplotlib inline

## 代码开始 ### (≈4 行代码)

## 代码结束 ###

Solution to Exercise 19.1

from matplotlib import pyplot as plt
%matplotlib inline

### 代码开始 ### (≈4 行代码)
fig, axes = plt.subplots(1, 5, figsize=(12, 6))
for i, image in enumerate(faces.images[:5]):
    axes[i].imshow(image)
    axes[i].set_xlabel(faces.target_names[faces.target[i]])
### 代码结束 ###

期望输出

由于图片本身为 2 维数组，需要处理之后才能用于训练模型。所以，下面我们使用 faces.data 数据，其已经将每一个人像对于的二维数组展平成 1 维。

faces.data.shape

可以看的，faces.data 的形状为 \((1348, 2914)\) ，即代表有 1348 个样本，每一个样本对应了 2914 个特征。而这 2914 个特征即是将人像图片 \(62*47 = 2914\) 展平之后的向量。

接下来，按照惯例需要对数据集进行切分，将其分为训练集和测试集。不过，这里值的注意的是，由于样本只有 1348 个，而每个样本对应的特征则为 2914。机器学习建模过程中，我们要避免特征远大于样本数量的情形，这样训练出来的模型一般表现都会非常糟糕。

所以，这里我们需要对数据特征进行「降维」，实际上就是减少数据的特征量。这里使用到 PCA 降维方法，该方法会在后续实验中详细介绍，这里不做讲解。

from sklearn.decomposition import PCA

# 直接运行，将数据特征缩减为 150 个
pca = PCA(n_components=150, whiten=True, random_state=42)
pca_data = pca.fit_transform(faces.data)
pca_data.shape

可以看的，数据形状又先前的 \((1348, 2914)\) 变为 \((1348, 150)\) 。

下面，就可以根据降维后的数据切分训练集和测试集了。

Exercise 19.2

挑战：使用 train_test_split() 将数据集切分为 80%（训练集）和 20%（测试集）两部分。

规定：训练集特征，测试集特征，训练集目标，测试集目标分别为：X_train, X_test, y_train, y_test，随机数种子定为 42。

from sklearn.model_selection import train_test_split

## 代码开始 ### (≈1 行代码)

## 代码结束 ###

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Solution to Exercise 19.2

from sklearn.model_selection import train_test_split

### 代码开始 ### (≈1 行代码)
X_train, X_test, y_train, y_test = train_test_split(
    pca_data, faces.target, test_size=0.2, random_state=42)
### 代码结束 ###

X_train.shape, X_test.shape, y_train.shape, y_test.shape

期望输出

((1078, 150), (270, 150), (1078,), (270,))

接下来，我们使用 SVM 算法建模，并使用上面划分的数据训练和测试模型。

Exercise 19.3

挑战：使用 scikit-learn 提供的支持向量机分类方法完成建模，并得到模型在测试集上的准确度结果。

规定：支持向量机分类器参数 C=10, gamma=0.001，其余为默认参数。

## 代码开始 ### (≈4 行代码)
model = None
## 代码结束 ###

Solution to Exercise 19.3

### 代码开始 ### (≈4 行代码)
from sklearn.svm import SVC

model = SVC(C=10, gamma=0.001)
model.fit(X_train, y_train)
model.score(X_test, y_test)
### 代码结束 ###

期望输出

最终准确度 \(>0.8\) 即可。

如果你觉得这些内容对你有帮助，可以通过微信赞赏码或者 Buy Me a Coffee 请我喝杯咖啡。