🔲 Object detection
Introduction
Get bounding boxes around each object.
2 Tasks
- Regression: Find the position (x, y) and size (w, h) of each bounding box.
- Classification: Classify each box as a known class (c1, c2, c3, …).
Datasets
- COCO_TINY: 200 images
- COCO_SAMPLE
- PASCAL_2007
- PASCAL_2012
- Roboflow public datasets
🏷️ Bounding Boxes Labeling Formats
- JSON
- COCO
- CreateML
- XML
- Pascal VOC
- TXT
- YOLO Darknet
- YOLO v3 Keras
- YOLO v4 PyTorch
- Scaled-YOLOv4
- YOLO v5 PyTorch
- CSV
- Tensorflow Object Detection
- RetinaNet Keras
- Multiclass Classification
- Others
- OpenAI CLIP Classification
- Tensorflow TFRecord (binary format)
YOLO labeling format
- One `.txt` file per image.
- If there are no objects in the image, no `.txt` file is required.
- One row per bounding box (`class_id center_x center_y width height`).
- XYWH numbers must be normalized from 0 to 1.
- Class numbers are zero-indexed (start from 0).
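As a minimal sketch of reading this format, the snippet below converts one YOLO label row back to pixel-space corner coordinates (the image size and label values are made up for illustration):

```python
def yolo_to_xyxy(line, img_w, img_h):
    """Parse 'class_id cx cy w h' (normalized) into (class_id, x1, y1, x2, y2) in pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    # The stored (cx, cy) is the box center, so shift by half the size to get corners
    return int(class_id), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

# A box centered in a 640x480 image, covering half of each dimension:
print(yolo_to_xyxy("0 0.5 0.5 0.5 0.5", 640, 480))
# -> (0, 160.0, 120.0, 480.0, 360.0)
```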
Labeling Tools
- labelImg
- https://blog.roboflow.com/labelimg/
- Computer Vision Annotation Tool (CVAT)
- https://blog.roboflow.com/cvat/
- Roboflow Annotate
Models
- Region-based (Sparse Prediction) (two-stage): First determine the regions of interest (boxes), then classify the object.
- Single-shot (Dense Prediction) (one-stage): Solve the two tasks together.
Name | Description | Date | Type | Grid size | Anchors |
---|---|---|---|---|---|
R-CNN |  | Nov 2013 | Region-based |  |  |
Fast R-CNN |  | Apr 2015 | Region-based |  |  |
Faster R-CNN |  | Jun 2015 | Region-based |  |  |
YOLO v1 | You Only Look Once | Jun 2015 | Single-shot | 7x7 |  |
SSD | Single Shot Detector | Dec 2015 | Single-shot |  |  |
FPN | Feature Pyramid Network | Dec 2016 | Single-shot |  |  |
YOLO v2 | Better, Faster, Stronger | Dec 2016 | Single-shot |  |  |
Mask R-CNN |  | Mar 2017 | Region-based |  |  |
RetinaNet | Focal Loss | Aug 2017 | Single-shot |  |  |
PANet | Path Aggregation Network | Mar 2018 | Single-shot |  |  |
YOLO v3 | An Incremental Improvement | Apr 2018 | Single-shot | 13x13, 26x26, 52x52 | 3 |
EfficientDet | Based on EfficientNet | Nov 2019 | Single-shot |  |  |
YOLO v4 | Optimal Speed and Accuracy | Apr 2020 | Single-shot |  |  |
PP-YOLO | PaddlePaddle YOLO | Jul 2020 | Single-shot |  |  |
YOLO v5 | No official paper | Oct 2020 | Single-shot | 20x20, 40x40, 80x80 | 3 |
Models not based on anchor boxes
- CornerNet
- CenterNet
- MatrixNet
- FCOS
- RepPoints
Model output = Fixed number of anchor boxes
Each anchor box consists of:
- P: Probability of the box
- Needs to be between [0,1]
- Final P = sigmoid(P)
- X & Y: Position of the box
- It’s the position of the center of the box
- Needs to be between [0,1]
- Final X = sigmoid(X)
- Final Y = sigmoid(Y)
- W & H: Size of the box
- Needs to be positive
- Final W = eᵂ
- Final H = eá´´
- Probability of each Class
- One-hot encoded vector
- 80 classes by default in YOLO
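The activations above can be sketched with plain Python: sigmoid squashes the probability and center offsets into [0, 1], while the exponential guarantees positive sizes (the raw values below are illustrative, not taken from a real model):

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# Raw network outputs for one anchor box (illustrative values)
raw_p, raw_x, raw_y, raw_w, raw_h = 0.0, 0.0, 0.0, 0.0, 0.0

p = sigmoid(raw_p)                       # box probability, squashed into [0, 1]
x, y = sigmoid(raw_x), sigmoid(raw_y)    # box center, squashed into [0, 1]
w, h = math.exp(raw_w), math.exp(raw_h)  # box size, guaranteed positive

print(p, x, y, w, h)  # -> 0.5 0.5 0.5 1.0 1.0
```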
Post-processing (Only at inference time)
Choose these 2 thresholds:
- Probability of the box threshold
- NMS (Non Maximum Suppression): Set an IoU threshold between boxes
- Soft-NMS: For when two objects of the same class are very close together (e.g. one horse behind another horse)
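A minimal sketch of how these two thresholds interact in greedy NMS (boxes as (x1, y1, x2, y2) corners; the threshold values are common defaults, not prescribed by the text):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, prob_thresh=0.25, iou_thresh=0.45):
    """Drop low-probability boxes, then greedily keep the best box and
    suppress any remaining box that overlaps it above the IoU threshold."""
    order = sorted((i for i, s in enumerate(scores) if s >= prob_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: box 1 overlaps box 0 too much
```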
Ground truth label
We assign each ground truth box to the nearest anchor box according to the grid.
YOLOv5
 | Number of anchor boxes | Values per anchor box | Final shape |
---|---|---|---|
YOLOv5 | (3x20x20) + (3x40x40) + (3x80x80) = 25200 | X+Y+W+H+P+80classes = 85 | 25200 x 85 |
Model | size (pixels) | mAP val 0.5:0.95 | mAP val 0.5 | Speed CPU b1 (ms) | params (M) | FLOPs @640 (B) |
---|---|---|---|---|---|---|
YOLOv5n | 640x640 | 28.4 | 46.0 | 45 | 1.9 | 4.5 |
YOLOv5s | 640x640 | 37.2 | 56.0 | 98 | 7.2 | 16.5 |
YOLOv5m | 640x640 | 45.2 | 63.9 | 224 | 21.2 | 49.0 |
YOLOv5l | 640x640 | 48.8 | 67.2 | 430 | 46.5 | 109.1 |
YOLOv5x | 640x640 | 50.7 | 68.9 | 766 | 86.7 | 205.7 |
 |  |  |  |  |  |  |
YOLOv5n6 | 1280x1280 | 34.0 | 50.7 | 153 | 3.2 | 4.6 |
YOLOv5s6 | 1280x1280 | 44.5 | 63.0 | 385 | 16.8 | 12.6 |
YOLOv5m6 | 1280x1280 | 51.0 | 69.0 | 887 | 35.7 | 50.0 |
YOLOv5l6 | 1280x1280 | 53.6 | 71.6 | 1784 | 76.8 | 111.4 |
YOLOv5x6 | 1280x1280 | 54.7 | 72.4 | 3136 | 140.7 | 209.8 |
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5n', pretrained=True)
x = torch.rand(1, 3, 640, 640)
y = model(x)
# y: ___concatenated pred____   ________per-scale raw pred________
# ( torch.Size([1, 25200, 85]) , (torch.Size([1, 3, 80, 80, 85]),   <-- grid=80x80, #anchors=3, x+y+w+h+p + 80 classes = 85
#                                 torch.Size([1, 3, 40, 40, 85]),   <-- grid=40x40, #anchors=3, x+y+w+h+p + 80 classes = 85
#                                 torch.Size([1, 3, 20, 20, 85])) ) <-- grid=20x20, #anchors=3, x+y+w+h+p + 80 classes = 85
#
# Number of total boxes predicted = (3x80x80) + (3x40x40) + (3x20x20) = 25200 boxes
# THEN APPLY NMS (Non Max Suppression)
Detect unknown classes
- Paper: VOS: Learning What You Don’t Know by Virtual Outlier Synthesis
- Repo: This is the source code accompanying the paper
Metric: mAP (mean Average Precision)
- Average Precision (AP) is the area under the precision-recall curve; mAP is the mean of AP across classes.
- F1 finds the optimal confidence threshold on the precision-recall curve.
- In object detection, a prediction counts as correct according to the IoU threshold.
Source: Roboflow
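The "area under the precision-recall curve" idea can be sketched with a step-wise integration over a handful of made-up detections (the detections, scores, and ground-truth count below are invented for illustration; real evaluators like COCO's interpolate the curve more carefully):

```python
# Each detection is (confidence, is_true_positive); assume 3 ground-truth objects.
detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
n_ground_truth = 3

tp = fp = 0
ap, prev_recall = 0.0, 0.0
# Sweep the confidence threshold from high to low, accumulating the
# rectangle under the precision-recall curve at each new recall level.
for conf, is_tp in sorted(detections, reverse=True):
    tp += is_tp
    fp += not is_tp
    precision = tp / (tp + fp)
    recall = tp / n_ground_truth
    ap += precision * (recall - prev_recall)
    prev_recall = recall

print(round(ap, 3))  # -> 0.917
```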
Get more classes from classification datasets!
- Paper: Detecting Twenty-thousand Classes using Image-level Supervision
- Github code
- Tweet by Ivan Prado
References
- Theory
- Andrew Ng videos
- Decoding: State Of The Art Object Detection
- YOLOv4
- https://blog.roboflow.com/a-thorough-breakdown-of-yolov4/
- Detectron2
- Practical Projects
- Roboflow video project: Detect rabbits
- Counting-Fish
- IceVision
- Video: Tensorflow: Object Detection in 5 Hours
- Video: Pytorch: YOLOv3 from scratch
- Paperspace blog: YOLOv3 from scratch in PyTorch