Hello, and welcome to week four of

our Visual Perception for Autonomous Driving Course.

This week, we'll dive into object detection,

a core requirement for

any self-driving car perception stack.

Object detection is a top-level module

required to identify the locations of vehicles,

pedestrians, signs, and lights,

so our car knows where and how it can drive.

This week, we will be going in depth to

define the 2D object detection problem,

how to apply ConvNets to tackle it,

and why the problem of 2D object detection is difficult.

We will then explain how object detection is

used as input for the 2D object tracking problem,

and how to perform the tracking task in

the context of self-driving cars.

This lesson will formulate

the object detection problem in 2D.

You will also learn how to

evaluate 2D object detectors using

standard performance metrics established

by researchers in the field. Let's get started.

The history of 2D object detection begins in 2001 when

Paul Viola and Michael Jones invented

a very efficient algorithm for face detection.

This algorithm, now called the Viola-Jones

Object Detection Framework,

was the first object detection framework

to provide reliable,

real-time 2D object detections from a simple webcam.

The next big breakthrough in object detection

happened four years later when Navneet Dalal

and Bill Triggs formulated

the histogram of oriented gradient feature descriptor.

Their algorithm applied to

the 2D pedestrian detection problem,

outperformed all other proposed methods at the time.

The Dalal-Triggs algorithm remained at the top of

the charts until 2012 when Alex Krizhevsky,

Ilya Sutskever and Geoffrey Hinton from

the Computer Science Department here

at the University of Toronto,

shook the computer vision world with

their convolutional neural network dubbed AlexNet.

AlexNet won the ImageNet Large

Scale Visual Recognition Challenge in

2012 with a wide margin

over the algorithm that took second place.

In 2012, it was

the only deep learning-based entry in the challenge.

But since then, all winning

entries in this competition have been based on

convolutional neural networks, with recent entries

surpassing the human recognition rate

of 95 percent.

This performance extended from

2D object recognition to 2D object detection

with current day detectors being almost

exclusively based on convolutional neural networks.

Before we go through how to use

ConvNets for object detection in self-driving cars,

let's formulate the general 2D object detection problem.

Given a 2D image as input,

we are required to estimate the location defined by

a bounding box and the class of all objects in the scene.

Usually, we choose classes

that are relevant to our application.

For self-driving cars, we usually are

most interested in object classes that are dynamic,

that is, ones that move through the scene.

These include vehicles and

their subclasses, pedestrians, and cyclists.

The problem of 2D object detection is not trivial.

The extent of the objects we need to estimate

is not always fully observed in the image space.

As an example, background objects are

usually occluded by foreground objects.

This requires any algorithm we use to be able to

hallucinate the extent of

objects to properly detect them.

Furthermore, objects that are near

the edge of the image are usually truncated.

This phenomenon creates huge variability

in the bounding box sizes,

where the size of the estimated bounding box

depends on how truncated the object is.

Another issue faced by

2D object detection algorithms is that of scale.

Objects tend to appear very

small as they go further away from our sensor.

Our algorithm is expected to determine

the class of these objects at variable scales.

Finally, our algorithm should also be

able to handle illumination changes.

This is especially important in

the context of self-driving cars,

where images can be affected by

whole image illumination variations

from bright sun to night driving,

and partial variations due to reflections,

shadows, precipitation, and other nuisance effects.

Now that we've intuitively

understood what object detection is,

let us formalize the problem mathematically.

Object detection can be defined

as a function estimation problem.

Given an input image x,

we want to find the function f of

x and Theta that produces

an output vector that

includes the coordinates of the top-left corner of the box,

xmin and ymin, and

the coordinates of the lower right corner of the box,

xmax and ymax, and the class scores Sclass1 to Sclassk.

Sclassi specifies how confident

our algorithm is that the object belongs to the class i,

and i ranges from one to k,

where k is the number of classes of interest.
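
Written out, the function just described might look as follows. This is only a notational sketch: the semicolon in f(x; Theta) and the row-vector layout of the output are assumptions made here for readability, not something fixed by the lecture.

```latex
f(x; \theta) = \left[ x_{\min},\; y_{\min},\; x_{\max},\; y_{\max},\; S_{\text{class}_1},\; \dots,\; S_{\text{class}_k} \right]
```

Here theta stands for the parameters of the function, and the output is one such vector per detected object.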

Can you think of any way to estimate this function?

Convolutional neural networks,

which we described last week, are

an excellent tool for estimating this kind of function.

For object detection,

the input data is defined on a 2D grid,

and as such, we use ConvNets as

our chosen function estimators to perform this task.

We will discuss how to perform

2D object detection with ConvNets in the next lesson.

But first, we need to figure out how to

measure the performance of our algorithm.

Given the output of a 2D object detector, shown in red,

we want to be able to compare how

well it fits the true output,

usually labeled by humans.

We call the true output our ground truth bounding box.

The first step of our evaluation process is to compare

our detector localization output to

the ground truth boxes

via the Intersection-Over-Union metric,

or IOU for short.

IOU is defined as the area of the intersection of

two polygons divided by the area of their union.
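
To make the IOU definition concrete, here is a minimal Python sketch. It assumes axis-aligned boxes given as (xmin, ymin, xmax, ymax) tuples rather than general polygons, and the function name `iou` is just an illustrative choice.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    # Width and height of the overlap rectangle, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    # Guard against degenerate boxes with zero area.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A box compared with itself gives an IOU of one, and two boxes that do not overlap give an IOU of zero.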

However, calculating the intersection-over-union does

not take into consideration the class scores.

To account for class scores,

we define true positives.

True positives are output bounding boxes that have an IOU

greater than a predefined threshold

with any ground truth bounding box.

In addition, the class of those output boxes

should also match the class

of their corresponding ground truth.

That means that the 2D detector should give

the highest class score to the correct class,

and that score should be greater than a score threshold.

On the other hand, false positives are

the output boxes that have

a score greater than the score threshold,

but an IOU less than

the IOU threshold with all ground truth bounding boxes.

This can be easily computed as

the total number of detected objects

after the application of

the score threshold minus the number of true positives.

The final base quantity we would like to

estimate is the number of false negatives.

False negatives are the ground truth

bounding boxes that have

no detections associated with them through IOU.
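
The three base quantities can be counted with a short Python sketch. It assumes a single class, axis-aligned boxes as (xmin, ymin, xmax, ymax) tuples, and detections given as (box, score) pairs; the helper names `iou` and `count_tp_fp_fn` are hypothetical. It follows the definitions above literally, so it does not penalize duplicate detections of the same object, which a full benchmark implementation would typically handle.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(detections, ground_truths, score_thr, iou_thr):
    """Count true positives, false positives, and false negatives for one class."""
    # Step 1: keep only detections whose score exceeds the score threshold.
    kept = [box for box, score in detections if score > score_thr]

    true_positives = 0
    matched_gt = set()  # indices of ground truth boxes that have an associated detection
    for box in kept:
        hits = [i for i, gt in enumerate(ground_truths) if iou(box, gt) > iou_thr]
        if hits:
            true_positives += 1
            matched_gt.update(hits)

    # Step 2: everything kept by the score threshold that is not a true positive.
    false_positives = len(kept) - true_positives
    # Step 3: ground truth boxes left without any associated detection.
    false_negatives = len(ground_truths) - len(matched_gt)
    return true_positives, false_positives, false_negatives


if __name__ == "__main__":
    # Made-up boxes purely for illustration.
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    dets = [((0, 0, 10, 9), 0.95), ((50, 50, 60, 60), 0.8), ((20, 20, 30, 30), 0.4)]
    print(count_tp_fp_fn(dets, gts, score_thr=0.5, iou_thr=0.7))
```

In the made-up usage example, the third detection overlaps its ground truth box perfectly but is discarded by the score threshold, so that ground truth box counts as a false negative.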

Once we have determined the true positives,

false positives, and false negatives,

we can determine the precision and recall of

our 2D object detector according to the following.

The precision is the number of true positives

divided by the sum of

the true positives and the false positives.

The recall on the other hand is the number of

true positives divided by

the total number of ground truth objects,

which is equal to the number of true positives

added to the number of false negatives.
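
In symbols, the two definitions just given can be written as below, where TP, FP, and FN are shorthand introduced here for the counts of true positives, false positives, and false negatives.

```latex
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN}
```

The denominator of the recall, TP + FN, is the total number of ground truth objects.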

Once we determine the precision and recall,

we can vary the object class score threshold

to get a precision-recall curve,

and finally, we determine

the average precision as

the area under the precision-recall curve.

The area under the curve can be

computed using numerical integration,

but is usually approximated using an average of

the precision values at

11 recall points ranging from zero to one.
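
One common way to do this 11-point approximation can be sketched in Python as follows. The lecture does not spell out how the precision at each recall point is chosen, so this sketch assumes the interpolation convention used in PASCAL VOC-style evaluation: at each recall level, take the highest precision achieved at that recall or any higher recall. The function name `average_precision_11pt` is hypothetical.

```python
def average_precision_11pt(recalls, precisions):
    """Approximate AP by averaging interpolated precision at recall 0.0, 0.1, ..., 1.0.

    recalls and precisions are parallel lists of points on the precision-recall curve.
    """
    recall_levels = [i / 10 for i in range(11)]
    total = 0.0
    for level in recall_levels:
        # Precisions of all curve points at this recall level or beyond
        # (small tolerance for floating-point comparison).
        candidates = [p for r, p in zip(recalls, precisions) if r >= level - 1e-9]
        # Assumed convention: interpolated precision is the best of those, or 0 if none exist.
        total += max(candidates) if candidates else 0.0
    return total / len(recall_levels)


if __name__ == "__main__":
    # Illustrative curve points only, not a real detector's results.
    recalls = [0.0, 0.5, 0.75, 1.0]
    precisions = [1.0, 1.0, 0.6, 0.0]
    print(average_precision_11pt(recalls, precisions))
```

With only a handful of curve points the result is coarse; sweeping the score threshold in small steps gives a denser curve and a more faithful estimate.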

I know these are quite a few concepts

to understand the first time through.

But don't worry, as you'll

soon get a chance to work through

a step-by-step practice notebook on how to code

all of these methods in Python in the assessments.

Let's work through an example on

how to assess the performance of

a 2D object detection network using the learned metrics.

We are interested in detecting only cars in a road scene.

That means that we have a single class of interest,

and therefore only one set of scores to consider.

We are given ground truth bounding boxes of cars

labeled by human beings and shown in green.

We process our image with a ConvNet to

get the detection output bounding boxes, shown in red.

You can notice that the network mistakenly

detects the front of a large truck as a car.

Looking at the scores,

we see that our ConvNet gave

this misdetection quite a high score of being a car.

Let's now evaluate the performance of

our ConvNet using average precision.

The first step is to take all of

our estimated bounding boxes and

sort them according to object class score.

We then proceed to compute the IOU between

each predicted box and

the corresponding ground truth box.

If a box does not intersect any ground-truth boxes,

its IOU is set to zero.

First, we set a class score threshold, let's say 0.9.

This threshold means that we

only trust our network prediction,

if it returns a score that is greater than 0.9,

and we eliminate any bounding boxes

with a score less than 0.9.

Next, we set an IOU threshold,

we'll use 0.7 in this case and proceed to

eliminate any remaining predictions

with an IOU less than 0.7.

In this case, both the remaining predictions

have an IOU greater than 0.7,

and so we don't eliminate any.

We can now compute the number of

true positives as the number of

remaining bounding boxes after

the application of both the score and the IOU thresholds,

which in this case is two.

The number of false positives is zero in this case,

since all boxes remaining after the application of

the score threshold also

remain after the application of the IOU threshold.

Finally, the false negatives are the

boxes in the ground truth that have no detections

associated with them after the application of

both the score and the IOU thresholds.

In this case, the number of false negatives is two.

The precision of our neural network

is computed as the number of

true positives divided by

their sum with the number of false positives.

In this case, we don't have false positives.

So the precision is 2 over 2 equal to 1.

To compute the recall,

we divide the number of true positives by

the number of ground truth bounding boxes,

which is equal to the number of

true positives summed with

the number of false negatives.

The recall in this case is 2 over 4.

The detector in this case is

a high-precision, low-recall detector.

This means that the detector

misses some objects in the scene,

but when it does detect an object,

it makes very few mistakes in

category classification and bounding box location.

Let's see how the performance of

our detector changes when we decrease

the score threshold from 0.9 to 0.7.

All bounding boxes have a score greater than 0.7,

so we do not eliminate

any of them through score thresholding.

However, when we examine the IOU of the remaining boxes,

we can see that two of them have an IOU less than 0.7.

By eliminating these two boxes,

we get three true positives.

To compute the number of false positives,

we need to look at how many detections

remained after the application of the score threshold,

but before the application of the IOU threshold.

In this case, the number of false positives is two.

Finally, we take a look at the number of

ground truth bounding boxes that have

remained without an associated detection

after the application of

both the IOU and

score thresholds to get

one as the number of false negatives.

Notice that the precision has dropped from 1 to 0.6 after

decreasing the score threshold,

while the recall has increased from 0.5 to 0.75.
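
The two operating points from this example can be reproduced with a small Python sketch. The individual scores and IOU values below are invented for illustration, since the transcript does not list them; they are chosen only so that the counts match the example (four ground truth cars, five detections, an IOU threshold of 0.7). The name `evaluate` is hypothetical.

```python
# Hypothetical (score, IOU with best ground truth box) pairs for the five detections.
# These numbers are made up; only the resulting counts follow the lecture example.
DETECTIONS = [(0.95, 0.90), (0.92, 0.80), (0.85, 0.75), (0.80, 0.50), (0.75, 0.00)]
NUM_GROUND_TRUTH = 4
IOU_THRESHOLD = 0.7


def evaluate(score_threshold):
    """Return (TP, FP, FN, precision, recall) at the given score threshold."""
    # Apply the score threshold first, then the IOU threshold.
    kept_ious = [iou for score, iou in DETECTIONS if score > score_threshold]
    tp = sum(1 for iou in kept_ious if iou > IOU_THRESHOLD)
    fp = len(kept_ious) - tp
    # Assumes each true positive corresponds to a different ground truth box.
    fn = NUM_GROUND_TRUTH - tp
    # Convention assumed here: precision is 1.0 when nothing is detected.
    precision = tp / (tp + fp) if kept_ious else 1.0
    recall = tp / NUM_GROUND_TRUTH
    return tp, fp, fn, precision, recall


if __name__ == "__main__":
    for threshold in (0.9, 0.7):
        print(threshold, evaluate(threshold))
```

At a score threshold of 0.9 this gives two true positives, no false positives, and two false negatives; at 0.7 it gives three true positives, two false positives, and one false negative, matching the precision and recall values quoted above.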

We can conclude that the effect of

lowering the score threshold is

detecting more objects in the scene at

the expense of less accurate detection results.

If we continue this process and decrease

the score threshold in decrements of 0.01,

we arrive at the following table.

We then proceed to plot the precision-recall curve,

using the precision values on the y-axis

and the recall values on the x-axis.

Note that we also add the point with

precision one and recall zero as the first point in the plot,

and precision zero and recall one as the final point in the plot.

This allows us to approximate

the average precision by calculating

the area under the P-R curve using

11 recall points between zero and one,

at 0.1 recall increments.

Computing this average produces an AP of

0.75 for our car detector.

The value of the average precision of

the detector can be thought of as an average of

performance over all score thresholds

allowing objective comparison of

the performance of detectors without having to

consider the exact score threshold

that generated those detections.

In this video, you learned how to formulate

the 2D object detection problem and how to

evaluate a 2D object detector's performance

using the average precision performance metric.

Next lesson, you will learn how to use ConvNets as

2D object detectors for self-driving cars. See you then.