Darknet - Detection Layer

All GPU implementations have been ignored
Source Code
Written by ivanpp for study purposes

make_detection_layer()

detection_layer make_detection_layer(int batch, int inputs, int n, int side, int classes, int coords, int rescore)
{
    detection_layer l = {0};
    l.type = DETECTION;
    // Get the params
    l.n = n;  // num of bbox per cell: B = 2
    l.batch = batch;  // num of image per batch
    l.inputs = inputs;  // length of the score vector per image
    l.classes = classes;  // num of classes that can be detected: C = 20
    l.coords = coords;  // num of coords for each bbox: coords = 4
    l.rescore = rescore;
    l.side = side;  // num of cell per side: S = 7
    l.w = side;  // num of cell per row
    l.h = side;  // num of cell per column
    assert(side*side*((1 + l.coords)*l.n + l.classes) == inputs);
    // Allocate memory
    l.cost = calloc(1, sizeof(float));  // store the loss (scalar value)
    l.outputs = l.inputs;
    l.truths = l.side*l.side*(1+l.coords+l.classes);
    l.output = calloc(batch*l.outputs, sizeof(float));  // store output of upper FC layer
    l.delta = calloc(batch*l.outputs, sizeof(float));  // store derivatives for backprop
    // Assign forward and backward function
    l.forward = forward_detection_layer;
    l.backward = backward_detection_layer;

    fprintf(stderr, "Detection Layer\n");
    srand(0);

    return l;
}

1. Get the params
2. Allocate memory and get the starting addresses simultaneously

   l.cost for the loss of the current batch (scalar value).

   l.output for the prediction tensor from the upper layer (copied from net.input, namely the l.output of the upper layer). $l_{fc}.output \xrightarrow{\text{forward\_network()}} net.input \xrightarrow{\text{forward\_detection\_layer()}} l_{detection}.output$

   l.delta for the derivatives of the prediction tensor for backprop.
3. Assign forward and backward function

Prediction tensor is stored as $[Pr(Class_i|Obj), Pr(Obj), coords]$ for each image.

First $[S \times S \times C]$ stores the $Pr(Class_i|Obj)$ for each location(so called cell).

The next $[S \times S \times B]$ memory stores the confidence for each cell and each predicted box. Consider it as a $(S \times S, B)$ matrix.

The last $[S \times S \times 4B]$ memory stores the coordinates for each cell and each predicted box. Consider it as a $(S \times S, B, 4)$ matrix, and coordinates are stored as $(x, y, w, h)$.
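The three regions described above can be turned into plain index arithmetic. The following is a minimal sketch that reuses the offset expressions from forward_detection_layer(); the `det_shape` struct and the helper names are hypothetical and not part of Darknet.

```c
#include <assert.h>

/* Hypothetical container for the layer geometry (not a Darknet type). */
typedef struct {
    int side;     /* S: cells per side */
    int n;        /* B: boxes per cell */
    int classes;  /* C: number of classes */
    int coords;   /* values per box, 4 for (x, y, w, h) */
} det_shape;

/* Floats per image: S*S*(C + B + B*coords). */
int pred_size(det_shape s)
{
    return s.side * s.side * (s.classes + s.n * (1 + s.coords));
}

/* Offset of Pr(Class_0|Obj) for cell i; the other classes follow. */
int class_offset(det_shape s, int i)
{
    assert(i >= 0 && i < s.side * s.side);
    return i * s.classes;
}

/* Offset of the confidence of box j in cell i. */
int conf_offset(det_shape s, int i, int j)
{
    assert(j >= 0 && j < s.n);
    return s.side * s.side * s.classes + i * s.n + j;
}

/* Offset of the x coordinate of box j in cell i; y, w, h follow. */
int box_offset(det_shape s, int i, int j)
{
    assert(j >= 0 && j < s.n);
    return s.side * s.side * (s.classes + s.n) + (i * s.n + j) * s.coords;
}
```

With S = 7, B = 2, C = 20 and coords = 4, one image holds 1470 floats: class probabilities in [0, 980), confidences in [980, 1078) and coordinates in [1078, 1470).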

forward_detection_layer()

The truth tensor takes $[S \times S \times (1+C+coords)]$ floats of memory per image. Consider it as a $(S \times S, 1+C+coords)$ matrix. For each location in $(S \times S)$, we have a 1-d tensor of length $(1+C+coords)$.

The first element of the 1-d tensor indicates whether or not this cell contains an object.

The next C elements of this 1-d tensor give the class of the object in the current cell, one binary value per class.

The last 4 elements are the coordinates, stored as $(x,y,w,h)$.

So we have $[Confidence_1, Class_{20}, Coordinates_4]$ for each location. $Confidence_1$ and $Class_{20}$ are binary values.
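To make the truth layout concrete, here is a small sketch that fills in the row of one cell. The helper names `truth_row_len` and `set_truth_cell` are hypothetical; in Darknet the truth tensor is filled by the data loader, which is not shown in these notes.

```c
#include <assert.h>
#include <string.h>

/* Length of one cell's truth row: flag + class values + coords. */
int truth_row_len(int classes, int coords)
{
    return 1 + classes + coords;
}

/* Hypothetical helper: mark cell i of ONE image's truth tensor as holding
 * an object of class cls with box (x, y, w, h). Assumes coords = 4. */
void set_truth_cell(float *truth, int i, int classes, int cls,
                    float x, float y, float w, float h)
{
    int len = truth_row_len(classes, 4);
    float *row = truth + i * len;
    assert(cls >= 0 && cls < classes);
    memset(row, 0, len * sizeof(float));
    row[0] = 1.0f;              /* 'is object' flag */
    row[1 + cls] = 1.0f;        /* binary class indicator */
    row[1 + classes + 0] = x;   /* coordinates come last */
    row[1 + classes + 1] = y;
    row[1 + classes + 2] = w;
    row[1 + classes + 3] = h;
}
```

For S = 7 and C = 20 each row is 25 floats long, so the truth tensor of one image holds 49 * 25 = 1225 floats.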

As we discussed above, we have $[Pr(Class_i|Obj), Pr(Obj), coords]$ for prediction tensor. That gives us $[Classes_{20},Confidence_{B},Coordinates_{4B}]$ for each location.

Loss function is defined as:
$$
\begin{align}
loss=
&\lambda_{noobj} \displaystyle\sum_{i=0}^{S^2} \sum_{j=0}^{B}1_{ij}^{noobj}(C_i- \hat C_i)^2\tag{1}\\
+&\displaystyle\sum_{i=0}^{S^2}\sum_{c\in{classes}}1_{i}^{obj}(p_i(c) - \hat p_i(c))^2\tag{2}\\
+&\displaystyle\sum_{i=0}^{S^2}\sum_{j=0}^{B}1_{ij}^{obj}(C_i - \hat C_i)^2\tag{3}\\
+&\lambda_{coord}\displaystyle\sum_{i=0}^{S^2}\sum_{j=0}^{B}1_{ij}^{obj}[(x_i - \hat x_i)^2+(y_i - \hat y_i)^2]\tag{4}\\
+&\lambda_{coord}\displaystyle\sum_{i=0}^{S^2}\sum_{j=0}^{B}1_{ij}^{obj}[(\sqrt{w_i} - \sqrt{\hat w_i})^2+(\sqrt{h_i} - \sqrt{\hat h_i})^2]\tag{5}\\
\end{align}
$$

For each location:

1. Penalize any value of $Confidence_B$ that is not equal to 0. $(1)$

For each location that contains an object:

1. Penalize any class probability in $Class_{20}$ that is not equal to the ground truth (0/1). $(2)$
2. Find the best box, the one that is responsible for the prediction.
3. For the best box only, take back the penalty from $(1)$ and instead penalize its confidence for not being equal to 1 (or not equal to $1 \times IOU_{pred}^{truth}$, according to the paper). $(3)$
4. Penalize the localization error of the best box. $(4)(5)$
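
The "penalize, then take back" bookkeeping for the confidences can be sketched for a single cell as follows. The function name is hypothetical, and the scale values used in the note after the code are example values only.

```c
/* Confidence part of the loss for one cell that contains an object.
 * conf: the n predicted confidences of the cell
 * best: index of the box responsible for the object */
float cell_confidence_loss(const float *conf, int n, int best,
                           float noobject_scale, float object_scale)
{
    float loss = 0.0f;
    int j;
    /* Term (1): every box is first treated as background (target 0). */
    for (j = 0; j < n; ++j) {
        loss += noobject_scale * conf[j] * conf[j];
    }
    /* Term (3): take the background penalty back for the best box... */
    loss -= noobject_scale * conf[best] * conf[best];
    /* ...and penalize its distance from the target 1 instead. */
    loss += object_scale * (1.0f - conf[best]) * (1.0f - conf[best]);
    return loss;
}
```

For confidences (0.2, 0.8), best box 1, a no-object scale of 0.5 and an object scale of 1, only two contributions remain: 0.5 * 0.2^2 = 0.02 for the background box and (1 - 0.8)^2 = 0.04 for the responsible box, 0.06 in total.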

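Here is the code. Note that the values accumulated into *(l.cost) inside the loops are overwritten at the end of the function, where the cost is set to the squared magnitude of l.delta: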

void forward_detection_layer(const detection_layer l, network net)
{
    int locations = l.side*l.side;  // Channel stride: 49
    int i,j;
    memcpy(l.output, net.input, l.outputs*l.batch*sizeof(float));
    //if(l.reorg) reorg(l.output, l.w*l.h, size*l.n, l.batch, 1);
    int b;
    if (l.softmax){  // if l.softmax, compress Pr(Class) part of the l.output
        for(b = 0; b < l.batch; ++b){
            int index = b*l.inputs;
            for (i = 0; i < locations; ++i) {
                int offset = i*l.classes;
                softmax(l.output + index + offset, l.classes, 1, 1,
                        l.output + index + offset);
            }
        }
    }
    if(net.train){  // if training
        float avg_iou = 0;
        float avg_cat = 0;
        float avg_allcat = 0;
        float avg_obj = 0;
        float avg_anyobj = 0;
        int count = 0;
        *(l.cost) = 0;  // initialize loss(scalar)
        int size = l.inputs * l.batch;
        memset(l.delta, 0, size * sizeof(float));
        for (b = 0; b < l.batch; ++b){
            int index = b*l.inputs;
            for (i = 0; i < locations; ++i) {
                // *net.truth stores the ground truth labels
                // with the shape of (batch, locations, (flag, classes, coords)).
                // At most 1 object for each cell
                int truth_index = (b*locations + i)*(1+l.coords+l.classes);
                int is_obj = net.truth[truth_index];  // 'is object' flag
                // length of the prediction tensor is (locations*(n*(1+coords)+classes))
                // stored as (locations*classes, locations*n, locations*n*coords)
                // namely, (locations*prob_class, locations*n*prob_obj, locations*n*coords)
                for (j = 0; j < l.n; ++j) {  // gradient and loss of 'misclassifying background as an object'
                    int p_index = index + locations*l.classes + i*l.n + j;  // index of the confidence
                    l.delta[p_index] = l.noobject_scale*(0 - l.output[p_index]);
                    *(l.cost) += l.noobject_scale*pow(l.output[p_index], 2);
                    avg_anyobj += l.output[p_index];
                }

                int best_index = -1;
                float best_iou = 0;
                float best_rmse = 20;

                if (!is_obj){
                    continue;
                }

                int class_index = index + i*l.classes;
                for(j = 0; j < l.classes; ++j) {  // gradient and loss of 'misclassify an object as another', namely Pr(Class) part
                    l.delta[class_index+j] = l.class_scale * (net.truth[truth_index+1+j] - l.output[class_index+j]);
                    *(l.cost) += l.class_scale * pow(net.truth[truth_index+1+j] - l.output[class_index+j], 2);
                    if(net.truth[truth_index + 1 + j]) avg_cat += l.output[class_index+j];
                    avg_allcat += l.output[class_index+j];  // sum of prob_class for all categories
                }

                box truth = float_to_box(net.truth + truth_index + 1 + l.classes, 1);
                truth.x /= l.side;
                truth.y /= l.side;

                for(j = 0; j < l.n; ++j){  // for each box in current cell
                    int box_index = index + locations*(l.classes + l.n) + (i*l.n + j) * l.coords;
                    box out = float_to_box(l.output + box_index, 1);
                    out.x /= l.side;
                    out.y /= l.side;

                    if (l.sqrt){  // if l.sqrt, out.w, out.h represent sqrt(w), sqrt(h)
                        out.w = out.w*out.w;
                        out.h = out.h*out.h;
                    }

                    float iou  = box_iou(out, truth);
                    //iou = 0;
                    float rmse = box_rmse(out, truth);
                    // If any iou so far is greater than 0, keep the largest one in best_iou
                    // and let best_index give the index of that box.
                    // If every iou so far is 0,
                    // best_rmse keeps the smallest rmse below the default value (20)
                    // and best_index gives the index of that box.
                    if(best_iou > 0 || iou > 0){  // find the best bbox
                        if(iou > best_iou){
                            best_iou = iou;
                            best_index = j;
                        }
                    }else{  // no positive iou yet, fall back to the best rmse
                        if(rmse < best_rmse){
                            best_rmse = rmse;
                            best_index = j;
                        }
                    }  // best_index stays -1 only if no iou is positive and no rmse is below 20
                }

                if(l.forced){  // forced mode: 2 boxes, 0 for big objects, 1 for small ones
                    if(truth.w*truth.h < .1){
                        best_index = 1;
                    }else{
                        best_index = 0;
                    }
                }
                if(l.random && *(net.seen) < 64000){
                    best_index = rand()%l.n;
                }

                // get the index of the best box and corresponding gt_box
                int box_index = index + locations*(l.classes + l.n) + (i*l.n + best_index) * l.coords;
                int tbox_index = truth_index + 1 + l.classes;

                box out = float_to_box(l.output + box_index, 1);
                out.x /= l.side;
                out.y /= l.side;
                if (l.sqrt) {  // restore w and h before computing the iou
                    out.w = out.w*out.w;
                    out.h = out.h*out.h;
                }
                float iou  = box_iou(out, truth);

                //printf("%d,", best_index);
                // index of the confidence of the best box
                int p_index = index + locations*l.classes + i*l.n + best_index;
                // gradient and loss of 'confidence of box that responsible for any object'
                *(l.cost) -= l.noobject_scale * pow(l.output[p_index], 2);  // compensate
                *(l.cost) += l.object_scale * pow(1-l.output[p_index], 2);
                avg_obj += l.output[p_index];
                l.delta[p_index] = l.object_scale * (1.-l.output[p_index]);

                if(l.rescore){  // rescore mode
                    l.delta[p_index] = l.object_scale * (iou - l.output[p_index]);
                }

                // gradient and loss of 'localization error'
                l.delta[box_index+0] = l.coord_scale*(net.truth[tbox_index + 0] - l.output[box_index + 0]);
                l.delta[box_index+1] = l.coord_scale*(net.truth[tbox_index + 1] - l.output[box_index + 1]);
                l.delta[box_index+2] = l.coord_scale*(net.truth[tbox_index + 2] - l.output[box_index + 2]);
                l.delta[box_index+3] = l.coord_scale*(net.truth[tbox_index + 3] - l.output[box_index + 3]);
                if(l.sqrt){  // if l.sqrt, coords gradients should be lmbda*(sqrt(truth)-sqrt(pred))
                    l.delta[box_index+2] = l.coord_scale*(sqrt(net.truth[tbox_index + 2]) - l.output[box_index + 2]);
                    l.delta[box_index+3] = l.coord_scale*(sqrt(net.truth[tbox_index + 3]) - l.output[box_index + 3]);
                }

                *(l.cost) += pow(1-iou, 2);
                avg_iou += iou;
                ++count;
            }
        }

        *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2);


        printf("Detection Avg IOU: %f, Pos Cat: %f, All Cat: %f, Pos Obj: %f, Any Obj: %f, count: %d\n", avg_iou/count, avg_cat/count, avg_allcat/(count*l.classes), avg_obj/count, avg_anyobj/(l.batch*locations*l.n), count);
        //if(l.reorg) reorg(l.delta, l.w*l.h, size*l.n, l.batch, 0);
    }
}

Optional parameters:

l.softmax = 0

If l.softmax is True, apply a softmax to the $Pr(Class_{i}|Obj)$ part of the prediction tensor, cell by cell, so that the class scores of each cell lie in $(0,1)$ and sum to 1.

```
if (l.softmax){
    for(b = 0; b < l.batch; ++b){
        int index = b*l.inputs;
        for (i = 0; i < locations; ++i) {
            int offset = i*l.classes;
            softmax(l.output + index + offset, l.classes, 1, 1,l.output + index + offset);
        }
    }
}
```

l.forced = 0

If l.forced is True, the net forces $box_0$ to take charge of the relatively big objects and $box_1$ of the small ones (truth.w*truth.h < .1), regardless of the iou or the rmse.
```
if(l.forced){
    if(truth.w*truth.h < .1){
        best_index = 1;
    }else{
        best_index = 0;
    }
}
```
l.random = 0

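If l.random is True and the net has seen fewer than 64000 images, the responsible box is picked at random with rand()%l.n, regardless of the iou or the rmse.
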
l.sqrt = 1

If l.sqrt is True, the prediction tensor gives the coordinates as $(\hat x,\hat y,\sqrt{\hat w},\sqrt{\hat h})$.

So we should restore $\sqrt {\hat w}$ and $\sqrt {\hat h}$ to $\hat w$ and $\hat h$ before we compute the iou or the rmse with the ground truth boxes.
```
if (l.sqrt) {
    out.w = out.w*out.w;
    out.h = out.h*out.h;
}
```
And since our optimization objectives are $\sqrt w$ and $\sqrt h$ while the truth tensor holds the values of $w$ and $h$, we should take the square root of the truth values before computing the derivatives.
```
if(l.sqrt){
    l.delta[box_index+2] = l.coord_scale*(sqrt(net.truth[tbox_index + 2]) - l.output[box_index + 2]);
    l.delta[box_index+3] = l.coord_scale*(sqrt(net.truth[tbox_index + 3]) - l.output[box_index + 3]);
}
```
l.rescore = 1

If l.rescore is True, the optimization objective of the confidence of the responsible box is $Pr(Object)\times {IOU}_{pred}^{truth}$, that is, the iou between the predicted box and the ground truth box.
```
if(l.rescore){
    l.delta[p_index] = l.object_scale * (iou - l.output[p_index]);
}
```
And if l.rescore is False, the optimization objective of the confidence is just $Pr(Object)$: 1 for boxes that are responsible for an object and 0 for all other boxes.
```
l.delta[p_index] = l.noobject_scale*(0 - l.output[p_index]);  // background
l.delta[p_index] = l.object_scale * (1.-l.output[p_index]);   // object
```
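
The three cases (background, responsible box without rescore, responsible box with rescore) can be collected into one hypothetical helper. This is only a sketch of the logic above, not a function that exists in Darknet.

```c
/* Target-minus-prediction signal for one box confidence.
 * responsible: nonzero if this box was chosen for the object in its cell
 * rescore:     nonzero to use the iou as target instead of 1
 * iou:         iou between the box and the ground truth (rescore only) */
float confidence_delta(float conf, int responsible, int rescore, float iou,
                       float object_scale, float noobject_scale)
{
    float target;
    if (!responsible) {
        return noobject_scale * (0.0f - conf);   /* background: target 0 */
    }
    target = rescore ? iou : 1.0f;               /* object: target 1 or iou */
    return object_scale * (target - conf);
}
```

With a predicted confidence of 0.6 and an iou of 0.7, the responsible box is pushed up by 0.4 without rescore but only by 0.1 with rescore (object scale 1).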

backward_detection_layer()

Backward of the detection layer is very simple:

void backward_detection_layer(const detection_layer l, network net)
{
    axpy_cpu(l.batch*l.inputs, 1, l.delta, 1, net.delta, 1);
}

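Add *l.delta (already computed in the forward pass) to *net.delta, namely the *l.delta of the previous layer. With a scale factor of 1, axpy_cpu() accumulates into the destination rather than overwriting it.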
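
For readers unfamiliar with the BLAS-style "axpy" operation, the sketch below shows what the call is assumed to compute (y = y + alpha * x). `axpy_sketch` is a hypothetical name that mirrors the argument order used in the call above.

```c
/* y[i*incy] += alpha * x[i*incx] for i in [0, n). */
void axpy_sketch(int n, float alpha, const float *x, int incx,
                 float *y, int incy)
{
    int i;
    for (i = 0; i < n; ++i) {
        y[i * incy] += alpha * x[i * incx];
    }
}
```

Because the operation adds to y, whatever is already stored in the destination buffer is kept and the new gradient is summed on top of it.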

get_detection_detections()

void get_detection_detections(layer l, int w, int h, float thresh, detection *dets)
{
    int i,j,n;
    float *predictions = l.output;
    //int per_cell = 5*num+classes;
    for (i = 0; i < l.side*l.side; ++i){
        int row = i / l.side;
        int col = i % l.side;
        for(n = 0; n < l.n; ++n){
            int index = i*l.n + n;
            int p_index = l.side*l.side*l.classes + i*l.n + n;
            float scale = predictions[p_index];
            int box_index = l.side*l.side*(l.classes + l.n) + (i*l.n + n)*4;
            box b;
            b.x = (predictions[box_index + 0] + col) / l.side * w;
            b.y = (predictions[box_index + 1] + row) / l.side * h;
            b.w = pow(predictions[box_index + 2], (l.sqrt?2:1)) * w;
            b.h = pow(predictions[box_index + 3], (l.sqrt?2:1)) * h;
            dets[index].bbox = b;
            dets[index].objectness = scale;
            for(j = 0; j < l.classes; ++j){
                int class_index = i*l.classes;
                float prob = scale*predictions[class_index+j];
                dets[index].prob[j] = (prob > thresh) ? prob : 0;
            }
        }
    }
}
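
The coordinate decoding in this function can be tried in isolation. The sketch below repeats the same arithmetic for a single box; `decode_cell_box` is a hypothetical name and `box` is a local stand-in for Darknet's type.

```c
/* Stand-in for Darknet's box: center (x, y), width w, height h. */
typedef struct { float x, y, w, h; } box;

/* Convert one predicted box from cell-relative values to image units.
 * p:     pointer to the 4 raw values (x, y, w, h) of the box
 * cell:  index of the cell in [0, side*side)
 * img_w, img_h: size of the image the box should be expressed in */
box decode_cell_box(const float *p, int cell, int side,
                    int img_w, int img_h, int use_sqrt)
{
    int row = cell / side;
    int col = cell % side;
    box b;
    /* x and y are offsets inside the cell; add the cell position first. */
    b.x = (p[0] + col) / side * img_w;
    b.y = (p[1] + row) / side * img_h;
    /* w and h are relative to the whole image, optionally stored as sqrt. */
    b.w = (use_sqrt ? p[2] * p[2] : p[2]) * img_w;
    b.h = (use_sqrt ? p[3] * p[3] : p[3]) * img_h;
    return b;
}
```

For the center cell of a 7 x 7 grid (cell 24, row 3, column 3), raw values (0.5, 0.5, 0.5, 0.25) with the sqrt option and a 448 x 448 image decode to a box centered at (224, 224) with width 112 and height 28.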