Introduction

Get bounding boxes around each object.

2 Tasks

  • Regression: Find the position (x,y) and size (w,h) of each bounding box.
  • Classification: Classify each box as a known class (c1,c2,c3,…).

Datasets

🏷️ Bounding Boxes Labeling Formats

  • JSON
    • COCO
    • CreateML
  • XML
    • Pascal VOC
  • TXT
    • YOLO Darknet
    • YOLO v3 Keras
    • YOLO v4 PyTorch
    • Scaled-YOLOv4
    • YOLO v5 PyTorch
  • CSV
    • TensorFlow Object Detection
    • RetinaNet Keras
    • Multiclass Classification
  • Others
    • OpenAI CLIP Classification
    • TensorFlow TFRecord (binary format)

YOLO labeling format

  • One .txt file per image.
  • If an image contains no objects, no .txt file is required.
  • One row per bounding box (class_id center_x center_y width height).
  • XYWH numbers must be normalized from 0 to 1.
  • Class numbers are zero-indexed (start from 0).
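
The rules above can be sketched as a small conversion helper. This is a minimal illustration, assuming the source annotation is a pixel-space box given by its corners; the function name `to_yolo_row` and the example image size are hypothetical, not part of any YOLO tooling.

```python
def to_yolo_row(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box into one YOLO label row.

    The row is "class_id center_x center_y width height", with the four
    box numbers normalized to the 0..1 range by the image size.
    """
    center_x = (x_min + x_max) / 2 / img_w
    center_y = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {center_x:.6f} {center_y:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: class 0, box from (100, 200) to (300, 400) in a 640x480 image
row = to_yolo_row(0, 100, 200, 300, 400, img_w=640, img_h=480)
print(row)  # 0 0.312500 0.625000 0.312500 0.416667
```

Each image's .txt file would then contain one such row per bounding box.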

Source:

Labeling Tools

Models

  • Region-based (Sparse Prediction) (two-stage): First determine the regions of interest (boxes), then classify the object.
  • Single-shot (Dense Prediction) (one-stage): Solve the two tasks together.

| Name | Description | Date | Type | Grid size | Anchors |
|---|---|---|---|---|---|
| R-CNN | | Nov 2013 | Region-based | | |
| Fast R-CNN | | Apr 2015 | Region-based | | |
| Faster R-CNN | | Jun 2015 | Region-based | | |
| YOLO v1 | You Only Look Once | Jun 2015 | Single-shot | 7x7 | |
| SSD | Single Shot Detector | Dec 2015 | Single-shot | | |
| FPN | Feature Pyramid Network | Dec 2016 | Single-shot | | |
| YOLO v2 | Better, Faster, Stronger | Dec 2016 | Single-shot | | |
| Mask R-CNN | | Mar 2017 | Region-based | | |
| RetinaNet | Focal Loss | Aug 2017 | Single-shot | | |
| PANet | Path Aggregation Network | Mar 2018 | Single-shot | | |
| YOLO v3 | An Incremental Improvement | Apr 2018 | Single-shot | 13x13, 26x26, 52x52 | 3 |
| EfficientDet | Based on EfficientNet | Nov 2019 | Single-shot | | |
| YOLO v4 | Optimal Speed and Accuracy | Apr 2020 | Single-shot | | |
| PP-YOLO | PaddlePaddle YOLO | Jul 2020 | Single-shot | | |
| YOLO v5 | No official version | Oct 2020 | Single-shot | 20x20, 40x40, 80x80 | 3 |

Models not based on anchor boxes

  • CornerNet
  • CenterNet
  • MatrixNet
  • FCOS
  • RepPoints

Model output = Fixed number of anchor boxes

Each anchor box consists of:

  • P: Probability of the box
    • Needs to be between [0,1]
    • Final P = sigmoid(P)
  • X & Y: Position of the box
    • It’s the position of the center of the box
    • Needs to be between [0,1]
    • Final X = sigmoid(X)
    • Final Y = sigmoid(Y)
  • W & H: Size of the box
    • Needs to be positive
    • Final W = eᵂ
    • Final H = eᴴ
  • Probability of each Class
    • One hot encoded vector
    • 80 classes by default in YOLO
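
The activations listed above can be sketched in plain Python. This is a simplified illustration that follows the formulas in these notes only; the value ordering (P, X, Y, W, H, then class scores) and the function name `decode_box` are assumptions, and real detectors additionally apply grid offsets and anchor scaling, which are omitted here.

```python
import math


def sigmoid(v):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(raw):
    """Decode one raw anchor box prediction.

    raw = [P, X, Y, W, H, class scores...] as unbounded numbers
    (this ordering is an assumption made for the sketch).
    """
    p, x, y, w, h, *class_scores = raw
    best_class = max(range(len(class_scores)), key=lambda i: class_scores[i])
    return {
        "p": sigmoid(p),     # box probability, in (0, 1)
        "x": sigmoid(x),     # center position, in (0, 1)
        "y": sigmoid(y),
        "w": math.exp(w),    # size, always positive
        "h": math.exp(h),
        "class_id": best_class,
    }


# Hypothetical raw prediction with 3 classes instead of 80, to keep it short
box = decode_box([0.0, 0.0, 2.0, 0.0, -1.0, 0.1, 2.5, -0.3])
print(box)
```

The sigmoid keeps P, X and Y inside the required range, while the exponential guarantees a positive width and height whatever the raw value is.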

Post-processing (Only at inference time)

Choose these 2 thresholds:

  1. Probability of the box threshold
  2. NMS (Non Maximum Suppression): Set an IoU threshold between boxes
    • Soft-NMS: Useful when two objects of the same class are very close together (e.g. one horse behind another horse)
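
The two thresholds can be sketched as a greedy NMS in plain Python. This is a minimal illustration of the classic (hard) NMS, not Soft-NMS; the default threshold values, the corner box format (x_min, y_min, x_max, y_max) and the example boxes are assumptions chosen for the sketch.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_threshold=0.25, iou_threshold=0.45):
    """Return the indices of the boxes kept after both thresholds."""
    # Threshold 1: drop boxes whose probability is too low
    candidates = [i for i, s in enumerate(scores) if s >= conf_threshold]
    # Visit the remaining boxes from most to least confident
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Threshold 2: keep a box only if it does not overlap a kept box too much
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical boxes: 0 and 1 overlap heavily, 2 is far away, 3 has a low score
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

In practice NMS is usually run per class, so that overlapping boxes of different classes do not suppress each other.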

Ground truth label

We place each ground truth box in the nearest anchor box according to the grid.
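
One common way to do this assignment can be sketched as follows. It is only an illustration under stated assumptions: the grid cell is the one containing the box center, and "nearest" anchor is taken to mean the anchor whose shape has the highest IoU with the ground truth box. The function name `assign_to_grid` and the anchor shapes are hypothetical, and specific YOLO versions use their own matching rules.

```python
def assign_to_grid(center_x, center_y, width, height, grid_size, anchor_shapes):
    """Pick the grid cell and anchor index for one normalized ground truth box."""
    # The cell is the one that contains the box center
    col = min(int(center_x * grid_size), grid_size - 1)
    row = min(int(center_y * grid_size), grid_size - 1)

    def shape_iou(anchor):
        # IoU of two boxes sharing the same center, so only sizes matter
        anchor_w, anchor_h = anchor
        inter = min(width, anchor_w) * min(height, anchor_h)
        return inter / (width * height + anchor_w * anchor_h - inter)

    best_anchor = max(range(len(anchor_shapes)),
                      key=lambda i: shape_iou(anchor_shapes[i]))
    return row, col, best_anchor


# Hypothetical normalized anchor shapes (width, height)
anchors = [(0.1, 0.1), (0.3, 0.6), (0.8, 0.4)]
cell = assign_to_grid(0.52, 0.31, 0.25, 0.5, grid_size=20, anchor_shapes=anchors)
print(cell)  # (6, 10, 1)
```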

YOLOv5

| Model | Number of anchor boxes | Anchor box size | Final shape |
|---|---|---|---|
| YOLOv5 | (3x20x20) + (3x40x40) + (3x80x80) = 25200 | X+Y+W+H+P+80 classes = 85 | 25200 x 85 |
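
The arithmetic behind these numbers can be checked with a few lines of Python. The sketch assumes a 640x640 input and downsampling strides of 8, 16 and 32, which is what produces the 80x80, 40x40 and 20x20 grids; the function name `count_predictions` is hypothetical.

```python
def count_predictions(image_size=640, strides=(8, 16, 32),
                      anchors_per_cell=3, num_classes=80):
    """Compute grid sizes, total anchor boxes and values per box."""
    grids = [image_size // s for s in strides]            # one grid per stride
    num_boxes = sum(anchors_per_cell * g * g for g in grids)
    values_per_box = 4 + 1 + num_classes                   # X, Y, W, H + P + classes
    return grids, num_boxes, values_per_box


grids, num_boxes, values_per_box = count_predictions()
print(grids, num_boxes, values_per_box)  # [80, 40, 20] 25200 85
```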

| Model | size (pixels) | mAPval 0.5:0.95 | mAPval 0.5 | Speed CPU b1 (ms) | params (M) | FLOPs @640 (B) |
|---|---|---|---|---|---|---|
| YOLOv5n | 640x640 | 28.4 | 46.0 | 45 | 1.9 | 4.5 |
| YOLOv5s | 640x640 | 37.2 | 56.0 | 98 | 7.2 | 16.5 |
| YOLOv5m | 640x640 | 45.2 | 63.9 | 224 | 21.2 | 49.0 |
| YOLOv5l | 640x640 | 48.8 | 67.2 | 430 | 46.5 | 109.1 |
| YOLOv5x | 640x640 | 50.7 | 68.9 | 766 | 86.7 | 205.7 |
| YOLOv5n6 | 1280x1280 | 34.0 | 50.7 | 153 | 3.2 | 4.6 |
| YOLOv5s6 | 1280x1280 | 44.5 | 63.0 | 385 | 16.8 | 12.6 |
| YOLOv5m6 | 1280x1280 | 51.0 | 69.0 | 887 | 35.7 | 50.0 |
| YOLOv5l6 | 1280x1280 | 53.6 | 71.6 | 1784 | 76.8 | 111.4 |
| YOLOv5x6 | 1280x1280 | 54.7 | 72.4 | 3136 | 140.7 | 209.8 |

import torch

# Load the pretrained YOLOv5 nano model from the Ultralytics hub repository
model = torch.hub.load('ultralytics/yolov5', 'yolov5n', pretrained=True)

x = torch.rand(1, 3, 640, 640)  # dummy batch: one 3-channel 640x640 image
y = model(x)

# y:   ____compressed pred_____     ______uncompressed pred______
#    ( torch.Size([1, 25200, 85]) , (torch.Size([1, 3, 80, 80, 85]),   <-- grid=80x80, #anchors=3, x,y,w,h,p + 80 classes = 85
#                                    torch.Size([1, 3, 40, 40, 85]),   <-- grid=40x40, #anchors=3, x,y,w,h,p + 80 classes = 85
#                                    torch.Size([1, 3, 20, 20, 85])) ) <-- grid=20x20, #anchors=3, x,y,w,h,p + 80 classes = 85
#
# Total number of boxes predicted = (3x80x80) + (3x40x40) + (3x20x20) = 25200 boxes

# Then apply NMS (Non Maximum Suppression) to the 25200 boxes

Detect unknown classes

Metric: mAP (mean Average Precision)

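  • Average Precision (AP) is the area under the precision-recall curve of one class; mean Average Precision (mAP) is the mean of the AP over all classes.
  • The F1 score finds the optimal confidence threshold on the precision-recall curve.
  • In object detection the threshold is the IoU threshold: a predicted box counts as correct when its IoU with a ground truth box is above it (e.g. 0.5 for mAP 0.5).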
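
The metric can be sketched in plain Python. This is a simplified, non-interpolated version for illustration, assuming each detection has already been marked as a true or false positive using the IoU threshold; benchmark tools such as the COCO evaluator use interpolated variants, and the class names and flags below are made up.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class.

    is_true_positive: one flag per detection, sorted by descending confidence.
    num_ground_truth: number of ground truth boxes of this class.
    """
    true_positives = 0
    ap = 0.0
    previous_recall = 0.0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
            precision = true_positives / rank
            recall = true_positives / num_ground_truth
            # Add the rectangle under the curve for this recall step
            ap += precision * (recall - previous_recall)
            previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the plain mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical results for two classes with 2 ground truth boxes each
ap_horse = average_precision([True, False, True], num_ground_truth=2)
ap_dog = average_precision([True, True], num_ground_truth=2)
map_value = mean_average_precision([ap_horse, ap_dog])
print(ap_horse, ap_dog, map_value)
```

Repeating the computation for IoU thresholds from 0.5 to 0.95 and averaging the results gives the mAP 0.5:0.95 figure reported in model comparisons.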

Source: Roboflow

Get more classes from classification datasets!

References