1. Opening: From Classification to Localization
Yesterday you studied image segmentation, which assigns a semantic label to every pixel. Many practical tasks don't need that level of detail; often we only need to know which objects are in an image and where they are.
Object detection = classification + localization
For every object of interest in the image, output:
Class label
Bounding box: usually represented as [x, y, width, height] or [x1, y1, x2, y2]
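The two box formats are easy to convert between; a small sketch (the helper names are mine, not from any library):

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to corner format [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def xyxy_to_xywh(box):
    """Convert corner format [x1, y1, x2, y2] back to [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

print(xywh_to_xyxy([100, 150, 50, 80]))  # → [100, 150, 150, 230]
```

Mixing the two conventions up is a classic source of off-by-one detection bugs, so it pays to pick one and convert explicitly at the boundaries.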
2. The Object Detection Task and Its Challenges
2.1 Input and Output
Input: an image (of arbitrary size)
Output: a set of detections, each containing:

```python
{
    'class': 'cat',
    'bbox': [100, 150, 200, 300],  # x1, y1, x2, y2
    'confidence': 0.95
}
```
2.2 Core Challenges
Variable object count: an image may contain 0, 1, or many objects
Variable scale: objects look large up close and small far away, so multi-scale handling is required
Occlusion: objects may be only partially visible
Real-time constraints: many applications need real-time performance (e.g. autonomous driving, video surveillance)
Precise localization: boxes must tightly enclose each object
3. The Classic Object Detection Pipeline
3.1 Traditional approach: sliding window + hand-crafted features
Before deep learning, object detection typically used a sliding-window strategy:

```
for each position:
    for each scale:
        extract the window region
        run a classifier to decide whether it contains an object
```

```python
import cv2

# Initialize the HOG descriptor with OpenCV's built-in pedestrian detector (SVM)
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread('street.jpg')
(rects, weights) = hog.detectMultiScale(img, winStride=(4, 4),
                                        padding=(8, 8), scale=1.05)

for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('HOG+SVM Pedestrian Detection', img)
cv2.waitKey(0)
```
Drawbacks:
Heavy computation (every position and every scale must be scanned)
Hand-crafted features generalize poorly
Hard to scale to many classes
4. The Deep Learning Revolution
In 2014, R-CNN opened a new era of deep-learning-based object detection. From then on, the field evolved along two routes: two-stage detectors and one-stage detectors.
5. Two-Stage Detectors
5.1 R-CNN (2014): the pioneer
Steps:
Use Selective Search to generate about 2,000 region proposals
Warp each proposal to a fixed size and run it through a CNN to extract features
Classify each proposal with an SVM and refine its bounding box with a regressor

```python
# Pseudocode sketch
def rcnn_demo(image):
    proposals = selective_search(image)  # ~2000 candidate boxes
    features = []
    for box in proposals:
        crop = image[box]
        crop_resized = resize(crop, (224, 224))
        feat = cnn.extract_features(crop_resized)  # e.g. AlexNet
        features.append(feat)
    classes = svm_classifier.predict(features)
    boxes = bbox_regressor.refine(proposals, features)
    return non_max_suppression(classes, boxes)
```
Drawbacks: extremely slow (about 2,000 CNN forward passes per image) and a complex, multi-stage training procedure
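The pipeline's final step, non-maximum suppression (NMS), keeps only the best box among overlapping duplicates. It is used by nearly every detector in this post but never spelled out, so here is a minimal NumPy sketch of greedy, single-class NMS:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it too much, and repeat. boxes are (N, 4) arrays in [x1, y1, x2, y2]
    format; returns the indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the top box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Keep only boxes whose overlap with the chosen box is acceptable
        order = rest[iou <= iou_thresh]
    return keep
```

In practice, NMS is run per class after classification, which is why detectors can output two different objects in nearly the same location.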
5.2 Fast R-CNN (2015): sharing convolutional computation
Key improvement: run the CNN once over the whole image, then extract a fixed-size feature for each region of interest with RoI Pooling:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fast R-CNN core: RoI Pooling
class RoIPooling(nn.Module):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size

    def forward(self, feature_map, rois):
        # rois: list of (batch_idx, x1, y1, x2, y2) in feature-map coordinates
        outputs = []
        for (b, x1, y1, x2, y2) in rois:
            # Crop the corresponding region from the feature map,
            # then pool it to a fixed output size
            region = feature_map[b:b + 1, :, y1:y2, x1:x2]
            pooled = F.adaptive_max_pool2d(region, self.output_size)
            outputs.append(pooled)
        return torch.cat(outputs)
```
Speed: more than 10x faster than R-CNN, but it still depends on an external proposal generator (e.g. Selective Search)
5.3 Faster R-CNN (2015): the Region Proposal Network (RPN)
Revolutionary innovation: replace Selective Search with a small convolutional network, the RPN, that learns to generate proposals directly from the shared feature map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPN(nn.Module):
    """Region Proposal Network"""
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, 1)  # object / background
        self.reg = nn.Conv2d(512, num_anchors * 4, 1)  # bounding-box regression

    def forward(self, x):
        x = F.relu(self.conv(x))
        scores = self.cls(x)  # (N, 2*A, H, W)
        deltas = self.reg(x)  # (N, 4*A, H, W)
        return scores, deltas
```

```
Input image → conv backbone (VGG/ResNet) → feature map
                                        ↙          ↘
                     RPN (generates proposals)   RoI Pooling
                               ↓                     ↓
                           proposals      classification + regression
```
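The num_anchors in the RPN comes from a fixed set of anchor boxes tiled at every feature-map location. A sketch of generating the anchor shapes for one location, using the Faster R-CNN paper's 3 scales x 3 aspect ratios (the helper name and exact values here are illustrative):

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes centered at the origin, as [x1, y1, x2, y2].
    Each anchor has area scale**2 and width/height equal to the given ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # widen or narrow while keeping the area s**2
            h = s / np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
print(anchors.shape)  # (9, 4): 3 scales x 3 ratios = 9 anchors per location
```

At inference, these 9 shapes are shifted to every feature-map position, and the RPN's reg head predicts offsets that deform the best-matching anchor into the final proposal.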
Milestone: the first truly end-to-end detection framework, reaching near real-time speed (5-17 fps)

6. One-Stage Detectors
6.1 YOLO (2016): You Only Look Once
Core idea: treat detection as a regression problem and predict bounding boxes and class probabilities directly.
YOLO workflow:
Divide the image into an S×S grid
Each grid cell predicts B bounding boxes and C class probabilities
Each bounding box prediction is (x, y, w, h, confidence)

```python
# YOLO output tensor shape: (S, S, B*5 + C)
# e.g. S=7, B=2, C=20 → 7×7×30
```
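In YOLOv1, (x, y) are offsets inside the responsible grid cell and (w, h) are fractions of the whole image, so the raw predictions must be decoded back to pixel coordinates. A sketch (the function name is mine):

```python
def decode_yolo_box(pred, row, col, S, img_w, img_h):
    """Decode one (x, y, w, h) prediction made by grid cell (row, col).
    x, y in [0, 1] are offsets within the cell; w, h in [0, 1] are fractions
    of the full image. Returns a center-format box [cx, cy, w, h] in pixels."""
    x, y, w, h = pred
    cx = (col + x) / S * img_w   # cell index + offset -> absolute center
    cy = (row + y) / S * img_h
    return [cx, cy, w * img_w, h * img_h]

# A box centered in cell (3, 3) of a 7x7 grid on a 448x448 image
print(decode_yolo_box([0.5, 0.5, 0.2, 0.3], 3, 3, 7, 448, 448))
```

This cell-relative parameterization is what lets a single forward pass emit all boxes at once, at the cost of each cell being responsible for at most B objects.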
The YOLOv1 loss combines localization error, confidence error, and classification error.
Pros: extremely fast (45 fps), well suited to real-time applications
Cons: weak on small objects; localization is less accurate than Faster R-CNN
6.2 SSD (2016): multi-scale detection
Key innovation: attach detection heads to feature maps at several scales, so high-resolution maps handle small objects and coarse maps handle large ones:

```python
import torch
import torch.nn as nn

# Sketch: VGG16, make_extra_layers and source_layers are assumed defined elsewhere
class SSD(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.num_classes = num_classes
        self.base = VGG16()
        self.extra_layers = self.make_extra_layers()
        self.loc_layers = nn.ModuleList([
            nn.Conv2d(512, 4 * 4, 3, padding=1),    # for conv4_3
            nn.Conv2d(1024, 6 * 4, 3, padding=1),   # for conv7
            # ... more layers
        ])
        self.conf_layers = nn.ModuleList([
            nn.Conv2d(512, 4 * num_classes, 3, padding=1),
            nn.Conv2d(1024, 6 * num_classes, 3, padding=1),
            # ...
        ])

    def forward(self, x):
        sources, loc, conf = [], [], []
        # Run the base network, collecting feature maps at several scales
        for layer in self.base:
            x = layer(x)
            if layer in self.source_layers:
                sources.append(x)
        for extra in self.extra_layers:
            x = extra(x)
            sources.append(x)
        # Predict locations and class scores from each feature map
        for (x, l, c) in zip(sources, self.loc_layers, self.conf_layers):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())
        return loc, conf
```
6.3 RetinaNet (2017): Focal Loss for class imbalance
The core problem of one-stage detectors: an extreme imbalance between positive and negative samples (background regions vastly outnumber objects).
RetinaNet introduced Focal Loss, which dynamically reweights easy and hard samples so that the model focuses on the hard ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, preds, targets):
        bce_loss = F.binary_cross_entropy_with_logits(preds, targets, reduction='none')
        pt = torch.exp(-bce_loss)  # probability assigned to the true class
        focal_loss = self.alpha * (1 - pt) ** self.gamma * bce_loss
        return focal_loss.mean()
```
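To see why the (1 - pt)**gamma factor helps, compare plain cross-entropy with its focal-weighted version for an easy versus a hard example (a standalone numeric sketch; the probability values are illustrative):

```python
import math

# pt is the probability the model assigns to the true class;
# the focal term (1 - pt)**gamma shrinks the loss on easy examples.
gamma = 2.0
for pt in (0.95, 0.6, 0.1):  # easy -> hard
    ce = -math.log(pt)                # plain cross-entropy
    focal = (1 - pt) ** gamma * ce    # focal-weighted loss
    print(f'pt={pt}: CE={ce:.4f}, focal={focal:.4f}')
```

A confident correct prediction (pt = 0.95) has its loss scaled by 0.05² = 0.0025, while a hard example (pt = 0.1) keeps 81% of its loss, so the thousands of easy background anchors no longer dominate the gradient.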
RetinaNet surpassed every one-stage detector of its time in both speed and accuracy.

7. Evaluation Metrics
7.1 IoU (Intersection over Union)
Measures the overlap between a predicted box and the ground-truth box.
A prediction with IoU > 0.5 is conventionally counted as a correct localization.
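IoU is the area of the intersection divided by the area of the union; a minimal implementation for [x1, y1, x2, y2] boxes:

```python
def compute_iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if the boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(compute_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```

IoU ranges from 0 (no overlap) to 1 (identical boxes), which makes it scale-invariant: a 5-pixel error matters much more on a small box than on a large one.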
7.2 Precision and Recall
In detection, a prediction counts as a true positive (TP) when it matches a ground-truth box above the IoU threshold; unmatched predictions are false positives (FP), and missed ground-truth objects are false negatives (FN).
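Precision (TP / (TP + FP)) measures how many of the detections are correct; recall (TP / (TP + FN)) measures how many of the objects were found. A toy numeric sketch, assuming detections have already been matched to ground truth:

```python
# 10 detections: 7 matched a ground-truth box (TP), 3 did not (FP);
# the dataset contains 9 ground-truth objects, so 2 were missed (FN)
tp, fp, num_gt = 7, 3, 9
precision = tp / (tp + fp)   # fraction of detections that are correct
recall = tp / num_gt         # fraction of objects that were found
print(precision, recall)
```

The two trade off against each other via the confidence threshold: a higher threshold raises precision but lowers recall, which is exactly the curve that AP summarizes.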
7.3 Average Precision (AP)
For each class, AP is the area under the Precision-Recall curve.
7.4 mean Average Precision (mAP)
The mean of the per-class AP values; the most widely used overall metric in object detection.

```python
import numpy as np

# Pseudocode: computing mAP
def compute_map(detections, groundtruths, iou_thresh=0.5):
    # detections: per-class lists of predicted boxes with scores
    # groundtruths: per-class lists of ground-truth boxes
    aps = []
    for class_id in classes:
        tp, fp = [], []
        num_gt = sum(len(gt[class_id]) for gt in groundtruths)
        for det in sorted(detections[class_id], key=lambda x: x.score, reverse=True):
            iou_max, best_gt = 0, None
            for gt in groundtruths[class_id]:
                iou = compute_iou(det.bbox, gt.bbox)
                if iou > iou_max:
                    iou_max = iou
                    best_gt = gt
            if best_gt is not None and iou_max >= iou_thresh and not best_gt.matched:
                tp.append(1)
                fp.append(0)
                best_gt.matched = True  # each ground-truth box can match at most once
            else:
                tp.append(0)
                fp.append(1)
        # Build the precision-recall curve
        tp_cum = np.cumsum(tp)
        fp_cum = np.cumsum(fp)
        precisions = tp_cum / (tp_cum + fp_cum + 1e-6)
        recalls = tp_cum / num_gt
        ap = np.trapz(precisions, recalls)  # area under the curve
        aps.append(ap)
    return np.mean(aps)
```
8. Common Datasets

| Dataset | Notes |
|---|---|
| PASCAL VOC | 20 classes; the classic benchmark (VOC 2007 / 2012) |
| MS COCO | 80 classes; today's standard detection benchmark |
| Open Images | very large-scale dataset released by Google |
| ImageNet Detection | the detection track of the ILSVRC challenge |
| KITTI | autonomous-driving scenes (cars, pedestrians, cyclists) |
9. Hands-On: Detection with Pre-trained Models
9.1 Loading YOLOv4 with OpenCV DNN

```python
import cv2
import numpy as np

# Load the YOLO network
net = cv2.dnn.readNet('yolov4.weights', 'yolov4.cfg')
layer_names = net.getLayerNames()
output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]

# Load class names
with open('coco.names', 'r') as f:
    classes = [line.strip() for line in f.readlines()]

# Read the image
img = cv2.imread('street.jpg')
height, width, _ = img.shape

# Preprocess
blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(output_layers)

# Parse the outputs
boxes, confidences, class_ids = [], [], []
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            # YOLO outputs normalized center coordinates and sizes
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

# Non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
for i in np.array(indices).flatten():  # index shape differs across OpenCV versions
    x, y, w, h = boxes[i]
    label = classes[class_ids[i]]
    confidence = confidences[i]
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(img, f'{label} {confidence:.2f}', (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

cv2.imshow('YOLOv4 Detection', img)
cv2.waitKey(0)
```
9.2 Loading a pre-trained Faster R-CNN in PyTorch

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Load a pre-trained model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Use the GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)

# Class names (the 91 COCO categories, 'N/A' marking unused ids)
COCO_CLASSES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A',
    'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
    'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork',
    'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli',
    'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
    'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet', 'N/A',
    'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
    'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase',
    'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Load the image
img = Image.open('street.jpg').convert('RGB')
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img).unsqueeze(0).to(device)

# Inference
with torch.no_grad():
    predictions = model(img_tensor)

# Visualization
fig, ax = plt.subplots(1, figsize=(12, 8))
ax.imshow(img)
for box, label, score in zip(predictions[0]['boxes'],
                             predictions[0]['labels'],
                             predictions[0]['scores']):
    if score > 0.5:
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                 linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1, f'{COCO_CLASSES[label]}: {score:.2f}',
                color='white', fontsize=8,
                bbox=dict(facecolor='red', alpha=0.5))
plt.axis('off')
plt.show()
```
Summary: Localize, Recognize, Understand
Object detection is one of the core tasks in computer vision, bridging image classification and image segmentation. It lets machines know not only "what" is in an image but also "where" it is.
From R-CNN to YOLOv8, object detection has advanced at remarkable speed in just a few years and has become a core technology in autonomous driving, security, healthcare, retail, and beyond. Today you learned its basic principles, the major algorithms, and hands-on skills, laying a solid foundation for deeper study and application.