Darknet - Yolo Layer

Input Shape

The convolutional layer before a yolo layer should have filters=n*(4+1+classes), where n is the number of prior anchors used in the following yolo layer (namely the size of mask) and classes is the number of classes.

[convolutional]
size=1
stride=1
pad=1
filters=75
activation=linear
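For example, filters=75 above corresponds to n = 3 anchors (e.g. mask = 0,1,2 as in the cfg snippet further below) and classes = 20, since 3*(4+1+20) = 75.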

The shape of the input tensor is $(b, n*(4+1+classes), h, w)$. More specifically, it is the concatenation of $n$ individual $(4+1+classes, h, w)$ tensors per image. It is actually a 1-D array, but you can think of it as a $(b, n, 4+1+classes, h, w)$ tensor: the innermost dimensions are $w$ and $h$, and above them sits the $(4+1+classes)$ dimension. So for every one of the $b$ images and $n$ anchors we have a $(4+1+classes)$-element prediction vector at each location, and consecutive elements of this vector are separated by a stride of l.h*l.w in memory.

static int entry_index(layer l, int batch, int location, int entry)
{
    int n =   location / (l.w*l.h);  // which prior anchor
    int loc = location % (l.w*l.h);  // spatial offset inside the h*w grid
    // skip previous batches, then previous anchors, then previous entries, then move to loc
    return batch*l.outputs + n*l.w*l.h*(4+l.classes+1) + entry*l.w*l.h + loc;
}

location should be a value between 0 and l.n*l.h*l.w-1; it encodes both the prior anchor index n and the spatial location loc. entry should be between 0 and 4+1+classes-1 and indexes the third dimension (the prediction vector).
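As a usage sketch (assuming a populated layer l, batch index b, anchor index n and grid cell (i, j), as in the forward pass), the objectness score of one prediction can be read like this:

int obj_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4);  // entry 4 is the objectness
float objectness = l.output[obj_index];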

Prior Anchor Boxes

In the yolo layer, the network predicts offsets from the width and height of prior (anchor) bounding boxes. Three options, mask, num and anchors, define which prior anchor boxes are used.

In the cfg file:

[net]
width=416
height=416

[yolo]
mask = 0,1,2
num=9
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326

In the C source file:

// https://github.com/pjreddie/darknet/blob/master/src/parser.c#L306-L342
int total = option_find_int(options, "num", 1);
int num = total;
char *a = option_find_str(options, "mask", 0);  // char *a = "0,1,2";
int *mask = parse_yolo_mask(a, &num);  // int *mask = {0, 1, 2};

num in the cfg file (total in the source) is the total number of prior anchors available to the whole network. mask in the cfg file gives the indices of the prior anchors used by the current yolo layer, so we can define many anchors and use only a few of them in each yolo layer. anchors in the cfg file lists all num available anchors as $(p_w, p_h)$ pairs. The anchor sizes $(p_w, p_h)$ are actual pixel values on the network input image, in this case $(416, 416)$. So $(10, 13)$ is a prior anchor 10 pixels wide and 13 pixels high on the $(416, 416)$ resized input image.

# Example
#yolo_layer0
[yolo]
mask = 0,1
num=3
anchors = 10,13,  16,30,  33,23

#yolo_layer1
[yolo]
mask = 1,2
num = 3
anchors = 10,13,  16,30,  33,23

#yolo_layer2
[yolo]
num=2

In the example above, yolo_layer0 uses anchors $(10, 13)$ and $(16, 30)$, yolo_layer1 uses anchors $(16, 30)$ and $(33, 23)$, and yolo_layer2 uses no prior anchors from the list. In fact, yolo_layer2 falls back to 2 default $(0.5, 0.5)$ anchors.
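As a small sketch of how this lookup happens in the source (with n as the anchor index within the current layer, as in the code later on): l.biases stores all num anchor pairs, and l.mask[n] maps the layer-local index to the global one.

float pw = l.biases[2*l.mask[n]];      // anchor width, in input-image pixels
float ph = l.biases[2*l.mask[n] + 1];  // anchor height, in input-image pixels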

Gradients of Objectness Prediction

The yolo layer predicts l.n*l.h*l.w bounding boxes per image (l.n is the length of *mask, namely the number of prior anchors used in the current yolo layer). Each predicted bounding box has one objectness score, which gives its $Pr(Object)$. We want the objectness to be 1 for all positive samples and 0 for all negative ones.

[yolo]
ignore_thresh = .5
truth_thresh = .9

Two kinds of predictions are considered positive:

  1. Among all num prior anchors centered in the same cell as the GT Bbox (ground truth bounding box), the anchor whose shape is most similar to the GT Bbox is the only anchor responsible for that GT Bbox. In other words, at most one best prior anchor is allocated to each GT Bbox. If that best anchor is not used in the current yolo layer (its index is not in the layer's *mask), no anchor in this layer is allocated to that GT Bbox.

  2. For each of the l.n*l.h*l.w predictions, if the highest IoU between the prediction and all ground truth bounding boxes is greater than truth_thresh, that prediction becomes responsible for the GT Bbox that gives the highest IoU.

Additionally, the yolo layer sets truth_thresh = 1 by default. Since IoU is always less than or equal to 1, the second situation never happens. So the yolo layer penalizes at most 1 (of l.n*l.h*l.w) prediction for each GT Bbox, pushing its objectness toward 1.

There is also an ignore_thresh that defines the negative (background) samples: if the highest IoU between a prediction and all GT Bboxes is less than or equal to ignore_thresh, that prediction is treated as negative and its objectness is penalized for not being 0.
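Putting these rules together, the objectness gradient written into l.delta (writing $\hat{o}$ for the predicted objectness) can be summarized as:

$$
\Delta_{obj} =
\begin{cases}
1 - \hat{o} & \text{positive: responsible for a GT Bbox} \\
0 & \text{ignored: best IoU} > \text{ignore\_thresh} \\
0 - \hat{o} & \text{negative: best IoU} \le \text{ignore\_thresh}
\end{cases}
$$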

Gradients of Class Probability

*output is the yolo layer's input *state.input, i.e. the output of the last convolutional layer, namely the prediction tensor. *delta is the gradient of the yolo layer. index gives the position of the first class probability $Pr(Class_0|Object)$ for a certain batch b, anchor n and location (w, h); remember we have b images, n anchors at each position and w*h locations. class is the ground truth class and classes is the number of classes. stride is always l.w*l.h, and *avg_cat is used for statistics, to compute the average class probability.

void delta_yolo_class(float *output, float *delta, int index, int class, int classes, int stride, float *avg_cat)
{
    int n;
    if (delta[index]){  // if some anchor is responsible for more than one GT
        delta[index + stride*class] = 1 - output[index + stride*class];
        if(avg_cat) *avg_cat += output[index + stride*class];
        return;
    }
    for(n = 0; n < classes; ++n){  // common situation
        // penalize Pr(Classi|Object) for all classes
        delta[index + stride*n] = ((n == class)?1 : 0) - output[index + stride*n];
        if(n == class && avg_cat) *avg_cat += output[index + stride*n];
    }
}

Given the index of $Pr(Class_0|Object)$, delta_yolo_class penalizes $Pr(Class_i|Object)$ for every $Class_i$: it wants $Pr(Class_{i=gt}|Object)$ to be 1 and the others to be 0. If some lucky anchor is responsible for more than one ground truth box, those GT boxes may or may not contain the same class; in that case the function only overwrites the gradient for the additional ground truth class and leaves the rest alone. For example, with 20 classes, if some lucky anchor is responsible for 2 GT boxes of different classes (say dog and cat) in some naughty image, it penalizes $Pr(Class_{i=dog}|Object)$ and $Pr(Class_{i=cat}|Object)$ for not being 1 and penalizes the other classes for not being 0. The excerpt below, from the forward pass of the yolo layer, shows how the objectness and class deltas are assigned for each prediction:

int obj_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4);  // index of the objectness score
avg_anyobj += l.output[obj_index];  // sum the objectness over all predicted boxes
l.delta[obj_index] = 0 - l.output[obj_index];  // common situation: low IoU, treat as negative
if (best_iou > l.ignore_thresh) {  // best_iou > ignore_thresh -> ignored, don't penalize the objectness
    l.delta[obj_index] = 0;
}
if (best_iou > l.truth_thresh) {  // never happens when l.truth_thresh = 1
    l.delta[obj_index] = 1 - l.output[obj_index];  // positive: push objectness toward 1

    int class_id = state.truth[best_t*(4 + 1) + b*l.truths + 4];  // get the class_id of the GT box
    if (l.map) class_id = l.map[class_id];
    int class_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4 + 1);
    delta_yolo_class(l.output, l.delta, class_index, class_id, l.classes, l.w*l.h, 0);
    box truth = float_to_box_stride(state.truth + best_t*(4 + 1) + b*l.truths, 1);
    delta_yolo_box(truth, l.output, l.biases, l.mask[n], box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.delta, (2-truth.w*truth.h), l.w*l.h);
}

Gradient of Box Prediction

The network predicts 4 coordinates for each bounding box, $t_x, t_y, t_w, t_h$.

$\sigma(t_x)$ and $\sigma(t_y)$ are the box center position relative to the cell. $t_w$ and $t_h$ predict how much larger or smaller the bounding box is than the prior anchor. For example, if $t_w > 0$, then $\mathrm{e}^{t_w} > 1$ and therefore $b_w > p_w$.

$$
\begin{align}
b_x & = \sigma(t_x) + c_x \\
b_y & = \sigma(t_y) + c_y \\
b_w & = p_w\mathrm{e}^{t_w} \\
b_h & = p_h\mathrm{e}^{t_h} \\
\end{align}
$$

$b_x$ and $b_y$ are the distance, measured in cells, from the top left corner of the current $(l.w, l.h)$ feature map. And since $p_w$ and $p_h$ are actual pixel values, $b_w$ and $b_h$ are actual pixel values on the resized network input image. To get a normalized prediction, $b_x$ and $b_y$ are divided by the feature map size lw and lh, and $b_w$ and $b_h$ are divided by the network input size w and h.

box get_yolo_box(float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, int stride)
{
    box b;
    b.x = (i + x[index + 0*stride]) / lw;  // (c_x + sigma(t_x)) / lw
    b.y = (j + x[index + 1*stride]) / lh;  // (c_y + sigma(t_y)) / lh
    b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;  // p_w * e^{t_w} / w
    b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;  // p_h * e^{t_h} / h
    return b;
}
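A brief usage sketch, mirroring how the forward pass calls it (l, b, n, i, j and state are assumed to be in scope): decode one prediction into a normalized box.

int box_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 0);  // entry 0 is sigma(t_x)
box pred = get_yolo_box(l.output, l.biases, l.mask[n], box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.w*l.h);
// pred.x, pred.y, pred.w, pred.h are normalized to [0, 1] with respect to the network input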

get_yolo_box converts the prediction $\sigma(t_x), \sigma(t_y), t_w, t_h$ into a normalized box struct instance. Conversely, to compute the gradients of the bounding box prediction, we convert the already normalized ground truth box truth back to $\sigma(\hat{t}_x), \sigma(\hat{t}_y), \hat{t}_w, \hat{t}_h$.
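Inverting the decoding equations above for a normalized ground truth box $(\hat{b}_x, \hat{b}_y, \hat{b}_w, \hat{b}_h)$ gives the targets computed in the code below:

$$
\begin{align}
\sigma(\hat{t}_x) & = \hat{b}_x \cdot l_w - c_x \\
\sigma(\hat{t}_y) & = \hat{b}_y \cdot l_h - c_y \\
\hat{t}_w & = \log(\hat{b}_w \cdot w / p_w) \\
\hat{t}_h & = \log(\hat{b}_h \cdot h / p_h) \\
\end{align}
$$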

float delta_yolo_box(box truth, float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, float *delta, float scale, int stride)
{
    box pred = get_yolo_box(x, biases, n, index, i, j, lw, lh, w, h, stride);
    float iou = box_iou(pred, truth);

    float tx = (truth.x*lw - i);
    float ty = (truth.y*lh - j);
    float tw = log(truth.w*w / biases[2*n]);
    float th = log(truth.h*h / biases[2*n + 1]);

    delta[index + 0*stride] = scale * (tx - x[index + 0*stride]);
    delta[index + 1*stride] = scale * (ty - x[index + 1*stride]);
    delta[index + 2*stride] = scale * (tw - x[index + 2*stride]);
    delta[index + 3*stride] = scale * (th - x[index + 3*stride]);
    return iou;
}

So using delta_yolo_box, we can convert the normalized bounding box label to $\sigma(\hat{t}_x), \sigma(\hat{t}_y), \hat{t}_w, \hat{t}_h$ and then subtract $\sigma(t_x), \sigma(t_y), t_w, t_h$ to get the gradients. But if we only did the subtraction, large bounding boxes would dominate the gradient simply because of their size. To compensate, the gradients are multiplied by scale, which magnifies the gradients of relatively small GT bounding boxes. scale is generally set to 2-truth.w*truth.h, so gradients of small GT bounding boxes are magnified by a factor close to 2 while gradients of large GT bounding boxes are magnified by a factor close to 1.
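For example, a GT box covering 10% of the image width and height (truth.w = truth.h = 0.1) gets scale = 2 - 0.01 = 1.99, while a box covering the whole image (truth.w = truth.h = 1) gets scale = 2 - 1 = 1.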