1. Opening: From Classification to Localization
Yesterday you studied image segmentation, which assigns a semantic label to every pixel. Many practical tasks don't need that level of detail; often we only need to know which objects are in an image and where they are.
Object detection = classification + localization
For every object of interest in the image, output:
Class label
Bounding box: usually represented as [x, y, width, height] or [x1, y1, x2, y2]
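The two box formats are easy to convert between; a small sketch (the helper names are mine, not from any library):

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to corner format [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def xyxy_to_xywh(box):
    """Convert corner format [x1, y1, x2, y2] back to [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

print(xywh_to_xyxy([100, 150, 50, 80]))  # → [100, 150, 150, 230]
```

Mixing the two conventions up is a classic source of off-by-one detection bugs, so it pays to pick one and convert explicitly at the boundaries.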
2. The Object Detection Task and Its Challenges
2.1 Input and Output
Input: an image (of arbitrary size)
Output: a set of detections, each containing:

```python
{
    'class': 'cat',
    'bbox': [100, 150, 200, 300],  # x1, y1, x2, y2
    'confidence': 0.95
}
```
2.2 Core Challenges
Variable object count: an image may contain 0, 1, or many objects
Variable scale: objects look large up close and small far away, so multi-scale handling is required
Occlusion: objects may be only partially visible
Real-time constraints: many applications need real-time performance (e.g. autonomous driving, video surveillance)
Precise localization: boxes must tightly enclose each object
3. The Classic Object Detection Pipeline
3.1 Traditional approach: sliding window + hand-crafted features
Before deep learning, object detection typically used a sliding-window strategy:

```
for each position:
    for each scale:
        extract the window region
        run a classifier to decide whether it contains an object
```

```python
import cv2

# Initialize the HOG descriptor with OpenCV's built-in pedestrian detector (SVM)
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread('street.jpg')
(rects, weights) = hog.detectMultiScale(img, winStride=(4, 4),
                                        padding=(8, 8), scale=1.05)

for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('HOG+SVM Pedestrian Detection', img)
cv2.waitKey(0)
```
Drawbacks:
Heavy computation (every position and every scale must be scanned)
Hand-crafted features generalize poorly
Hard to scale to many classes
4. The Deep Learning Revolution
In 2014, R-CNN opened a new era of deep-learning-based object detection. From then on, the field evolved along two routes: two-stage detectors and one-stage detectors.
5. Two-Stage Detectors
5.1 R-CNN (2014): the pioneer
Steps:
Use Selective Search to generate about 2,000 region proposals
Warp each proposal to a fixed size and run it through a CNN to extract features
Classify each proposal with an SVM and refine its bounding box with a regressor

```python
# Pseudocode sketch
def rcnn_demo(image):
    proposals = selective_search(image)  # ~2000 candidate boxes
    features = []
    for box in proposals:
        crop = image[box]
        crop_resized = resize(crop, (224, 224))
        feat = cnn.extract_features(crop_resized)  # e.g. AlexNet
        features.append(feat)
    classes = svm_classifier.predict(features)
    boxes = bbox_regressor.refine(proposals, features)
    return non_max_suppression(classes, boxes)
```
Drawbacks: extremely slow (about 2,000 CNN forward passes per image) and a complex, multi-stage training procedure
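The pipeline's final step, non-maximum suppression (NMS), keeps only the best box among overlapping duplicates. It is used by nearly every detector in this post but never spelled out, so here is a minimal NumPy sketch of greedy, single-class NMS:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it too much, and repeat. boxes are (N, 4) arrays in [x1, y1, x2, y2]
    format; returns the indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the top box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Keep only boxes whose overlap with the chosen box is acceptable
        order = rest[iou <= iou_thresh]
    return keep
```

In practice, NMS is run per class after classification, which is why detectors can output two different objects in nearly the same location.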
5.2 Fast R-CNN (2015): sharing convolutional computation
Key improvement: run the CNN once over the whole image, then extract a fixed-size feature for each region of interest with RoI Pooling:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fast R-CNN core: RoI Pooling
class RoIPooling(nn.Module):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size

    def forward(self, feature_map, rois):
        # rois: list of (batch_idx, x1, y1, x2, y2) in feature-map coordinates
        outputs = []
        for (b, x1, y1, x2, y2) in rois:
            # Crop the corresponding region from the feature map,
            # then pool it to a fixed output size
            region = feature_map[b:b + 1, :, y1:y2, x1:x2]
            pooled = F.adaptive_max_pool2d(region, self.output_size)
            outputs.append(pooled)
        return torch.cat(outputs)
```
Speed: more than 10x faster than R-CNN, but it still depends on an external proposal generator (e.g. Selective Search)
5.3 Faster R-CNN (2015): the Region Proposal Network (RPN)
Revolutionary innovation: replace Selective Search with a small convolutional network, the RPN, that learns to generate proposals directly from the shared feature map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPN(nn.Module):
    """Region Proposal Network"""
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, 1)  # object / background
        self.reg = nn.Conv2d(512, num_anchors * 4, 1)  # bounding-box regression

    def forward(self, x):
        x = F.relu(self.conv(x))
        scores = self.cls(x)  # (N, 2*A, H, W)
        deltas = self.reg(x)  # (N, 4*A, H, W)
        return scores, deltas
```

```
Input image → conv backbone (VGG/ResNet) → feature map
                                        ↙          ↘
                     RPN (generates proposals)   RoI Pooling
                               ↓                     ↓
                           proposals      classification + regression
```
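The num_anchors in the RPN comes from a fixed set of anchor boxes tiled at every feature-map location. A sketch of generating the anchor shapes for one location, using the Faster R-CNN paper's 3 scales x 3 aspect ratios (the helper name and exact values here are illustrative):

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes centered at the origin, as [x1, y1, x2, y2].
    Each anchor has area scale**2 and width/height equal to the given ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # widen or narrow while keeping the area s**2
            h = s / np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
print(anchors.shape)  # (9, 4): 3 scales x 3 ratios = 9 anchors per location
```

At inference, these 9 shapes are shifted to every feature-map position, and the RPN's reg head predicts offsets that deform the best-matching anchor into the final proposal.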
Milestone: the first truly end-to-end detection framework, reaching near real-time speed (5-17 fps)

6. One-Stage Detectors
6.1 YOLO (2016): You Only Look Once
Core idea: treat detection as a regression problem and predict bounding boxes and class probabilities directly.
YOLO workflow:
Divide the image into an S×S grid
Each grid cell predicts B bounding boxes and C class probabilities
Each bounding box prediction is (x, y, w, h, confidence)

```python
# YOLO output tensor shape: (S, S, B*5 + C)
# e.g. S=7, B=2, C=20 → 7×7×30
```
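In YOLOv1, (x, y) are offsets inside the responsible grid cell and (w, h) are fractions of the whole image, so the raw predictions must be decoded back to pixel coordinates. A sketch (the function name is mine):

```python
def decode_yolo_box(pred, row, col, S, img_w, img_h):
    """Decode one (x, y, w, h) prediction made by grid cell (row, col).
    x, y in [0, 1] are offsets within the cell; w, h in [0, 1] are fractions
    of the full image. Returns a center-format box [cx, cy, w, h] in pixels."""
    x, y, w, h = pred
    cx = (col + x) / S * img_w   # cell index + offset -> absolute center
    cy = (row + y) / S * img_h
    return [cx, cy, w * img_w, h * img_h]

# A box centered in cell (3, 3) of a 7x7 grid on a 448x448 image
print(decode_yolo_box([0.5, 0.5, 0.2, 0.3], 3, 3, 7, 448, 448))
```

This cell-relative parameterization is what lets a single forward pass emit all boxes at once, at the cost of each cell being responsible for at most B objects.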
The YOLOv1 loss combines localization error, confidence error, and classification error.
Pros: extremely fast (45 fps), well suited to real-time applications
Cons: weak on small objects; localization is less accurate than Faster R-CNN
6.2 SSD (2016): multi-scale detection
Key innovation: attach detection heads to feature maps at several scales, so high-resolution maps handle small objects and coarse maps handle large ones:

```python
import torch
import torch.nn as nn

# Sketch: VGG16, make_extra_layers and source_layers are assumed defined elsewhere
class SSD(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.num_classes = num_classes
        self.base = VGG16()
        self.extra_layers = self.make_extra_layers()
        self.loc_layers = nn.ModuleList([
            nn.Conv2d(512, 4 * 4, 3, padding=1),    # for conv4_3
            nn.Conv2d(1024, 6 * 4, 3, padding=1),   # for conv7
            # ... more layers
        ])
        self.conf_layers = nn.ModuleList([
            nn.Conv2d(512, 4 * num_classes, 3, padding=1),
            nn.Conv2d(1024, 6 * num_classes, 3, padding=1),
            # ...
        ])

    def forward(self, x):
        sources, loc, conf = [], [], []
        # Run the base network, collecting feature maps at several scales
        for layer in self.base:
            x = layer(x)
            if layer in self.source_layers:
                sources.append(x)
        for extra in self.extra_layers:
            x = extra(x)
            sources.append(x)
        # Predict locations and class scores from each feature map
        for (x, l, c) in zip(sources, self.loc_layers, self.conf_layers):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())
        return loc, conf
```
6.3 RetinaNet (2017): Focal Loss for class imbalance
The core problem of one-stage detectors: an extreme imbalance between positive and negative samples (background regions vastly outnumber objects).
RetinaNet introduced Focal Loss, which dynamically reweights easy and hard samples so that the model focuses on the hard ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, preds, targets):
        bce_loss = F.binary_cross_entropy_with_logits(preds, targets, reduction='none')
        pt = torch.exp(-bce_loss)  # probability assigned to the true class
        focal_loss = self.alpha * (1 - pt) ** self.gamma * bce_loss
        return focal_loss.mean()
```
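To see why the (1 - pt)**gamma factor helps, compare plain cross-entropy with its focal-weighted version for an easy versus a hard example (a standalone numeric sketch; the probability values are illustrative):

```python
import math

# pt is the probability the model assigns to the true class;
# the focal term (1 - pt)**gamma shrinks the loss on easy examples.
gamma = 2.0
for pt in (0.95, 0.6, 0.1):  # easy -> hard
    ce = -math.log(pt)                # plain cross-entropy
    focal = (1 - pt) ** gamma * ce    # focal-weighted loss
    print(f'pt={pt}: CE={ce:.4f}, focal={focal:.4f}')
```

A confident correct prediction (pt = 0.95) has its loss scaled by 0.05² = 0.0025, while a hard example (pt = 0.1) keeps 81% of its loss, so the thousands of easy background anchors no longer dominate the gradient.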
RetinaNet surpassed every one-stage detector of its time in both speed and accuracy.

7. Evaluation Metrics
7.1 IoU (Intersection over Union)
Measures the overlap between a predicted box and the ground-truth box.
A prediction with IoU > 0.5 is conventionally counted as a correct localization.
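IoU is the area of the intersection divided by the area of the union; a minimal implementation for [x1, y1, x2, y2] boxes:

```python
def compute_iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if the boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(compute_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```

IoU ranges from 0 (no overlap) to 1 (identical boxes), which makes it scale-invariant: a 5-pixel error matters much more on a small box than on a large one.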
7.2 Precision and Recall
In detection, a prediction counts as a true positive (TP) when it matches a ground-truth box above the IoU threshold; unmatched predictions are false positives (FP), and missed ground-truth objects are false negatives (FN).
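Precision (TP / (TP + FP)) measures how many of the detections are correct; recall (TP / (TP + FN)) measures how many of the objects were found. A toy numeric sketch, assuming detections have already been matched to ground truth:

```python
# 10 detections: 7 matched a ground-truth box (TP), 3 did not (FP);
# the dataset contains 9 ground-truth objects, so 2 were missed (FN)
tp, fp, num_gt = 7, 3, 9
precision = tp / (tp + fp)   # fraction of detections that are correct
recall = tp / num_gt         # fraction of objects that were found
print(precision, recall)
```

The two trade off against each other via the confidence threshold: a higher threshold raises precision but lowers recall, which is exactly the curve that AP summarizes.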
7.3 Average Precision (AP)
For each class, AP is the area under the Precision-Recall curve.
7.4 mean Average Precision (mAP)
The mean of the per-class AP values; the most widely used overall metric in object detection.

```python
import numpy as np

# Pseudocode: computing mAP
def compute_map(detections, groundtruths, iou_thresh=0.5):
    # detections: per-class lists of predicted boxes with scores
    # groundtruths: per-class lists of ground-truth boxes
    aps = []
    for class_id in classes:
        tp, fp = [], []
        num_gt = sum(len(gt[class_id]) for gt in groundtruths)
        for det in sorted(detections[class_id], key=lambda x: x.score, reverse=True):
            iou_max, best_gt = 0, None
            for gt in groundtruths[class_id]:
                iou = compute_iou(det.bbox, gt.bbox)
                if iou > iou_max:
                    iou_max = iou
                    best_gt = gt
            if best_gt is not None and iou_max >= iou_thresh and not best_gt.matched:
                tp.append(1)
                fp.append(0)
                best_gt.matched = True  # each ground-truth box can match at most once
            else:
                tp.append(0)
                fp.append(1)
        # Build the precision-recall curve
        tp_cum = np.cumsum(tp)
        fp_cum = np.cumsum(fp)
        precisions = tp_cum / (tp_cum + fp_cum + 1e-6)
        recalls = tp_cum / num_gt
        ap = np.trapz(precisions, recalls)  # area under the curve
        aps.append(ap)
    return np.mean(aps)
```
8. Common Datasets

| Dataset | Notes |
|---|---|
| PASCAL VOC | 20 classes; the classic benchmark (VOC 2007 / 2012) |
| MS COCO | 80 classes; today's standard detection benchmark |
| Open Images | very large-scale dataset released by Google |
| ImageNet Detection | the detection track of the ILSVRC challenge |
| KITTI | autonomous-driving scenes (cars, pedestrians, cyclists) |
9. Hands-On: Detection with Pre-trained Models
9.1 Loading YOLOv4 with OpenCV DNN

```python
import cv2
import numpy as np

# Load the YOLO network
net = cv2.dnn.readNet('yolov4.weights', 'yolov4.cfg')
layer_names = net.getLayerNames()
output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]

# Load class names
with open('coco.names', 'r') as f:
    classes = [line.strip() for line in f.readlines()]

# Read the image
img = cv2.imread('street.jpg')
height, width, _ = img.shape

# Preprocess
blob = cv2.dnn.blobFromImage(img, 1/255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(output_layers)

# Parse the outputs
boxes, confidences, class_ids = [], [], []
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            # YOLO outputs normalized center coordinates and sizes
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

# Non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
for i in np.array(indices).flatten():  # index shape differs across OpenCV versions
    x, y, w, h = boxes[i]
    label = classes[class_ids[i]]
    confidence = confidences[i]
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(img, f'{label} {confidence:.2f}', (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

cv2.imshow('YOLOv4 Detection', img)
cv2.waitKey(0)
```
9.2 Loading a pre-trained Faster R-CNN in PyTorch

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Load a pre-trained model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Use the GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)

# Class names (the 91 COCO categories, 'N/A' marking unused ids)
COCO_CLASSES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A',
    'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
    'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork',
    'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli',
    'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
    'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet', 'N/A',
    'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
    'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase',
    'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Load the image
img = Image.open('street.jpg').convert('RGB')
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img).unsqueeze(0).to(device)

# Inference
with torch.no_grad():
    predictions = model(img_tensor)

# Visualization
fig, ax = plt.subplots(1, figsize=(12, 8))
ax.imshow(img)
for box, label, score in zip(predictions[0]['boxes'],
                             predictions[0]['labels'],
                             predictions[0]['scores']):
    if score > 0.5:
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                 linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1, f'{COCO_CLASSES[label]}: {score:.2f}',
                color='white', fontsize=8,
                bbox=dict(facecolor='red', alpha=0.5))
plt.axis('off')
plt.show()
```
Summary: Localize, Recognize, Understand
Object detection is one of the core tasks in computer vision, bridging image classification and image segmentation. It lets machines know not only "what" is in an image but also "where" it is.
From R-CNN to YOLOv8, object detection has advanced at remarkable speed in just a few years and has become a core technology in autonomous driving, security, healthcare, retail, and beyond. Today you learned its basic principles, the major algorithms, and hands-on skills, laying a solid foundation for deeper study and application.