During the semester break, I chose to explore the Alps instead of enjoying the 18 hours of daylight in Scandinavia. It was exam season at TUM when I visited Munich. I sat in front of the Parabola Slide in the CIT building and joined the welcome meeting via Zoom. Following their advice, I immediately started my house hunt in Munich. After more than 100 applications, I finally got a cozy place in Schwabing.
Apart from getting a place to stay, this "30-day (NOT) free trial" gave me an idea of what my life would be like for the following year. I also got to know the most delicious dishes in the Mensa well in advance.
I also travelled to Switzerland🇨🇭, where "The Notorious B.A.H.N" (not a German rapper) drove me to my bff. It is a beautiful but expensive heaven.
I went back to Sweden in August with the feeling that something had gone wrong. The Swedish visa extension I had applied for in April 2023 went totally silent. I tried to contact my case officer and had some really bad experiences. I decided to start my "Plan B". I contacted TUM and the German embassy, and they gave me all the information I needed to start the German visa application.
I got a Termin in October in Stockholm. Although a lot of documents were needed, the whole process only took one month.
This was the most difficult, most stressful part of my first semester of exchange study. From October, the start of TUM's winter semester, till the day I got my passport back from the German embassy, the stress piled up. It was my partner's and friends' support that paved my way to Munich.
Finally, I arrived in Munich again at the end of November. Like every other settler here, I had to anmeldung my address, get a bank account, find a family doctor... Everything made me happy because I was finally here.
I lost my position in the Praktikum because I could not attend in person, but for other courses I was not lagging behind too much (thanks to TUM-Live). Oktoberfest passed me by, but we prepared for and celebrated Christmas together.
As a bigger university (of ~50k students), TUM has more types of courses, with finer granularity. Most lecture-based courses are worth 6 or 8 credits, seminars 4 credits, and Praktikums 10 credits. Lecture-based courses are examined by written exams; seminars are evaluated by a presentation and/or report. Some Praktikums are lab courses, evaluated by lab reports and presentations; others are project courses, evaluated by a final project.
Exams at my home university last 4 hours, and getting 80% of the points gives you the highest grade (5.0). At TUM, you only have 90-120 minutes to finish a lot of questions; this is called an "Überhangklausur" (overhang exam). But don't panic: to get the highest grade (1.0), you don't need full points or a fixed percentage of points. You ONLY have to beat your peers and land in the top x% of hundreds of students (the grading distribution or bar is defined by the examiner).
One important thing to know: once you pass an exam, you can never attend any of its retakes. For someone who cares about grades, getting a 4.0 (the lowest passing grade) is the biggest nightmare. Many students fail an exam on purpose when they find it's not worth passing this time😱.
As a fan of FCB for more than 15 years, of course I am happy to move to Munich. Now I live only about 1300 kilometers away from my favorite team - FC Barcelona❤️💙!
I boulder 2-3 times a week at Boulderwelt, but Fysiken Klätterlabbet Centrum will always be my favorite bouldering gym.
The adventure is a continuous and differentiable function; I'm still exploring and optimizing it, and I know which direction to go.
Instant NGP speeds up the training of the original NeRF by 1000x, while still using a neural network to implicitly store the scene. What is the magic in it?
The trainable multiresolution hash encoding permits the use of a smaller neural network without sacrificing quality, and remains general. Several techniques are used to make the encoding work better on modern GPUs.
The trainable features are arranged into $L=16$ levels of hash tables, each mapped to one resolution of a virtual 3D voxel grid. Given a 3D location $(x,y,z)$, on each level of the voxel grid we interpolate a feature vector from the feature vectors of its 8 integer corners (4 corners in 2D, as shown in the picture). All feature vectors ($F$-dimensional) of these integer corners are stored in a static data structure, i.e. a hash table of size $T$. So for each location of interest, at each of the $L$ levels, we look up the hash table 8 times and interpolate to get a feature vector of size $F$. Then we concatenate the feature vectors of all levels with an auxiliary feature vector (which can be anything!) of size $E$. Finally, we get a feature vector of size $(L\times F + E)$.
Note that this process can be done efficiently in parallel. For all pixels we try to render at a time, we load one level of the hash table into the GPU cache, do the hash, look up the feature vectors of all these pixels, then interpolate. Then we move on to the next level and do the same thing. Finally, all the interpolated features are concatenated, along with the auxiliary input, and become the input of the neural network.
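The per-level lookup described above is easy to sketch in code. Below is a toy pure-Python version for the 2D case (4 corners per level, bilinear interpolation); the level count, table size, growth factor, and hash primes here are illustrative assumptions, not the paper's exact hyperparameters:

```python
import math
import random

# A toy sketch of the multiresolution hash encoding in 2D.
L, T, F = 4, 2**10, 2                  # levels, table size, features per entry
PRIMES = (1, 2654435761)               # per-dimension primes of the spatial hash

def h(ix, iy):
    """Hash an integer grid corner into an index of the level's table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % T

def encode_level(x, y, table, res):
    """Bilinearly interpolate the F-dim features at (x, y) on one grid level."""
    gx, gy = x * res, y * res
    x0, y0 = math.floor(gx), math.floor(gy)
    wx, wy = gx - x0, gy - y0
    out = [0.0] * F
    for dx, dy, w in ((0, 0, (1 - wx) * (1 - wy)), (1, 0, wx * (1 - wy)),
                      (0, 1, (1 - wx) * wy),       (1, 1, wx * wy)):
        feat = table[h(x0 + dx, y0 + dy)]   # one hash lookup per corner
        for f in range(F):
            out[f] += w * feat[f]
    return out

random.seed(0)
tables = [[[random.uniform(-1e-4, 1e-4) for _ in range(F)] for _ in range(T)]
          for _ in range(L)]
res = [16 * 2**l for l in range(L)]        # resolution grows per level
enc = sum((encode_level(0.3, 0.7, tables[l], res[l]) for l in range(L)), [])
print(len(enc))  # L*F = 8 interpolated features, before appending the aux input
```

Concatenating the per-level outputs (plus the auxiliary input of size $E$) yields the final encoding that feeds the small MLP.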
The efficiency of a static hash table is better than dynamic structures like trees, and it is more general. When the resolution of a certain level is larger than the hash table size, there will be hash collisions, but they are automatically resolved by the multiresolution structure and the interpolation. (The chance that 2 different locations get the same final feature vector input is near zero.)
The proposed hash encoding is already highly efficient, and several techniques are tailored to improve it even more.
The hash table entries are stored in half precision, and mixed-precision training is used. That enables faster training and faster inference.
As mentioned before, the hash tables are evaluated level by level, so at any given time only some levels of the hash tables reside in caches, where they are reused over and over again.
More importantly, the use of the multiresolution hash encoding makes it possible to use a smaller neural network without sacrificing quality.
Instant NGP uses a highly optimized fully-fused MLP, which is 5-10x faster than a TensorFlow implementation (e.g. the one in the original NeRF).
By using a relatively small neural network and making good use of the GPU, instant NGP gets its neural-network part close to voxel-lookup speed.
Voxel-based methods store the scene in 3D voxels, like storing image data in 2D pixels. To know the attributes of a given position (3D coordinates), a simple look-up is enough. Methods like Plenoxels use voxels in place of a neural network to significantly speed up the pipeline. But storing a high-resolution scene needs excessive memory, and that amount of storage makes a simple look-up not so simple anymore. When training, huge amounts of voxel data need to be transferred into memory and cache repeatedly; these memory operations bound the speed (though it is still relatively fast).
Theoretically, voxel-based methods can be faster when we have more memory and cache. Neural-network-based methods trade memory footprint for compute; they can be faster if we make the computation more effective.
For a standard neural network with a fixed batch size, the compute cost is $O(M)$ and the memory cost is $O(M^2)$, where $M$ is the number of neurons per layer. For bigger neural networks it is wise to focus on optimizing computation, but for smaller ones, memory is the most important thing.
They made their neural network so small that the whole network fits into the on-chip memory of the GPU. When evaluating the network (imagine ray marching and querying thousands of values at the same time), each thread block can run the whole network independently, using the weights and biases stored in on-chip memory.
The authors are from NVIDIA; they know their hardware well and they know CUDA well, so they implemented instant NGP in CUDA and integrated it with the fully-fused MLPs of the tiny-cuda-nn framework. With a carefully tailored neural network and good use of the NVIDIA GPU, a 5-10x speedup is achieved compared with the TensorFlow version.
Overall, instant NGP takes 10-100x fewer steps than naïve dense stepping, which means 10-100x fewer queries of the neural network.
Typically, larger scenes have more empty regions, and coarser details are less noticeable. An exponential step size is used so that the computation scales gracefully with scene size.
A multi-scale occupancy grid is maintained to indicate where the space is empty. For empty space we don't have to run the neural network, hence computation is saved. (A little extra memory, but much less computation.)
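The empty-space skipping can be sketched as follows. This is a toy pure-Python setup with a hand-made occupancy pattern in a unit cube (the real implementation maintains a multi-scale bitfield on the GPU); only samples that land in occupied cells would trigger a network query:

```python
# Toy sketch: skip empty space with a binary occupancy grid during ray marching.
G = 64                                    # occupancy grid resolution (assumed)
occ = [[[ (24 <= i < 40) and (24 <= j < 40) and (24 <= k < 40)
          for k in range(G)] for j in range(G)] for i in range(G)]

def march(origin, direction, n_steps=256):
    """Return the sample points along the ray that need a network query."""
    queries = []
    for s in range(n_steps):
        t = s / (n_steps - 1)
        p = [o + t * d for o, d in zip(origin, direction)]
        cell = [min(G - 1, max(0, int(c * G))) for c in p]   # grid cell of p
        if occ[cell[0]][cell[1]][cell[2]]:   # occupied -> query the network
            queries.append(p)
    return queries

q = march((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
print(f"{len(q)} of 256 samples need a network query")
```

With the occupancy box covering only the center of the cube, most of the 256 dense samples are skipped for free.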
Here is a recursive function generating Fibonacci numbers.
int fib (int z) {
    int r;
    if (z == 0)
        r = 0;
    else if (z == 1)
        r = 1;
    else
        r = fib(z-1) + fib(z-2);
    return r;
}
The first two items of the Fibonacci sequence are 0 and 1, so it is very natural to write the code like this. But it is under-optimized, in a "Fibonacci way".
Now let's do the WCET (worst-case execution time) analysis, assuming each declaration, comparison, assignment, and return takes 1 time unit, each ALU operation (add/sub) takes 4, and each function call takes 2 (these unit costs appear in the annotated snippets below).
Let $f(z)$ be the WCET of fib(z), so $f(0)$ is the WCET of executing fib(0). For different values of z, the code being executed is different, with a different length and a different WCET.
Now divide the code snippet into 3 paths and analyze them separately:
// path a (z=0): 4
int fib (int z) {
    int r;          // declaration (1)
    if (z == 0)     // compare (1)
        r = 0;      // assignment (1)
    return r;       // return (1)
}
For path a, the WCET of fib(0) is 4, that is $f(0)=4$.
// path b (z=1): 5
int fib (int z) {
    int r;           // declaration (1)
    if (z == 0)      // compare (1)
        ;
    else if (z == 1) // compare (1)
        r = 1;       // assignment (1)
    return r;        // return (1)
}
For path b, $f(1)=5$, because one more comparison is executed in the if-else chain.
// path c (z>=2): 21+f(z-1)+f(z-2)
int fib (int z) {
    int r;           // declaration (1)
    if (z == 0)      // compare (1)
        ;
    else if (z == 1) // compare (1)
        ;
    else
        r = fib(z-1) + fib(z-2); // 3*4 + 2*2 + 1 + f(z-1) + f(z-2)
    return r;        // return (1)
}
The line r = fib(z-1) + fib(z-2); takes 3 ALU operations (add/sub), 2 function calls, and 1 assignment, and it contains the cost of the 2 recursive fib calls. So $f(z)=21+f(z-1)+f(z-2)$ for path c. You may also have noticed that, as in path b, one more comparison is executed than in path a.
For an if-else chain, the structure/placement matters, especially when it is called recursively:
if (cond1)
    func1(); // executes after cond1
else if (cond2)
    func1(); // executes after cond1, cond2
else if (cond3)
    func1(); // executes after cond1, cond2, cond3
else
    func1(); // executes after cond1, cond2, cond3
Let me use some real numbers. If we want to compute the WCET of fib(5), writing $a=f(0)=4$, $b=f(1)=5$, and $c=21$ for the per-call overhead of path c, we'll have:
$$
\begin{align*}
f(2)&=f(1)+f(0)+c=a+b+c \\
f(3)&=f(2)+f(1)+c=a+2b+2c \\
f(4)&=f(3)+f(2)+c=2a+3b+4c \\
f(5)&=f(4)+f(3)+c=3a+5b+7c
\end{align*}
$$
The WCET of fib(5) is composed of some path a, more path b, and much more path c. Try to look at it vertically, and you may understand why I said this is under-optimized in a Fibonacci way.
The weight of path a grows in a Fibonacci way; the weight of path b grows in the same fashion, but one step ahead (one step in Fibonacci...). And path c? It also grows in a Fibonacci fashion, but each time 1 is added to the weight (for the execution of path c itself).
For this natural version of the code, the WCET of fib(5) is $3\times 4+5\times 5+7\times 21=184$.
int fib (int z) {
    int r;
    if (z > 1)
        r = fib(z-1) + fib(z-2);
    else if (z == 1)
        r = 1;
    else
        r = 0;
    return r;
}
If we swap the positions of path c and path a, it saves one comparison for path c and adds one to path a, which makes the WCET of the new fib function $3\times 5 + 5\times 5 + 7\times 20=180$. The difference is 4 time units of comparison.
For a computation of fib(10), the difference will be 54 time units.
For a computation of fib(20), the difference will be 6764 time units...
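The numbers above are easy to check mechanically. Here is a small Python sketch of my own (with the unit costs read off the annotated snippets: 4 or 5 for the base-case paths, and 21 or 20 of per-call overhead for the recursive path) that evaluates both variants of the recurrence:

```python
from functools import lru_cache

def wcet(n, base0, base1, rec_overhead):
    """WCET of fib(n): base cases cost base0/base1; the recursive path adds
    rec_overhead on top of the WCETs of the two recursive calls."""
    @lru_cache(maxsize=None)
    def f(z):
        if z == 0:
            return base0
        if z == 1:
            return base1
        return rec_overhead + f(z - 1) + f(z - 2)
    return f(n)

# Original ordering: path a = 4, path b = 5, path c overhead = 21
print(wcet(5, 4, 5, 21))                        # 184
# Reordered (z > 1 tested first): path a = 5, path b = 5, path c overhead = 20
print(wcet(5, 5, 5, 20))                        # 180
print(wcet(10, 4, 5, 21) - wcet(10, 5, 5, 20))  # 54
print(wcet(20, 4, 5, 21) - wcet(20, 5, 5, 20))  # 6764
```

The saving grows in a Fibonacci fashion too, since the weight of path c itself does.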
GEMM stands for general matrix multiply; it is a "level 3" routine of BLAS (Basic Linear Algebra Subprograms), a specification for common linear algebra operations. GEMM is also widely used in areas like computer vision and machine learning.
The formula of GEMM is:
$C=\alpha AB+\beta C$
where $A$, $B$, and $C$ are matrices, and $\alpha$ and $\beta$ are scalar constants.
Here is a very neat GEMM implementation written in C, from the well-known neural network framework darknet.
void gemm_cpu(int TA, int TB, int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float BETA,
        float *C, int ldc)
{
    int i, j;
    for(i = 0; i < M; ++i){
        for(j = 0; j < N; ++j){
            C[i*ldc + j] *= BETA;
        }
    }
    if(!TA && !TB)
        gemm_nn(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
    else if(TA && !TB)
        gemm_tn(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
    else if(!TA && TB)
        gemm_nt(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
    else
        gemm_tt(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
}
By default, matrices $A$ and $B$ are not transposed (TA=0 && TB=0), which means:
*A is a 1-d array which stores an $(M, K)$ matrix
*B is a 1-d array which stores a $(K, N)$ matrix
*C is a 1-d array which stores an $(M, N)$ matrix, and it will be used to store the final result
$C=\beta C$ is computed first for better efficiency, and then the $C=\alpha AB+C$ part is done.
Matrices $A$, $B$, and $C$ are all stored in row-major order, which means elements of the same row are stored consecutively in memory. (This doesn't mean all elements of the matrix are stored consecutively in memory.)
Elements of the matrix used in the gemm function are not necessarily stored consecutively in memory? A little counter-intuitive, right? To explain this, I need to introduce the leading dimension (arguments lda, ldb, and ldc).
Actually, the elements of a matrix are stored consecutively in memory, but when multiplying matrices, sometimes we want to use part of an existing matrix as the input/output, not all of it.
Suppose we have a $(6, 8)$ matrix $Q$ in our memory (row-major order), and we want to do the matrix multiply on part of it, a $(3, 4)$ matrix $q$.
Apparently, the elements of matrix $q$ are not stored consecutively in memory. Instead of copying the data first and then doing the gemm, we can do the gemm directly if we use the right parameters *A, M, K, and most importantly, lda. In this example:
TA=0 means matrix $Q$, and of course matrix $q$, is row-major, i.e. not transposed.
lda=8 means the leading dimension (the number of columns in this case) of the matrix stored in memory is $8$, which is the dimension of matrix $Q$.
K=4 means the dimension (the number of columns in this case) of the matrix used for the gemm is $4$, which is the dimension of matrix $q$.
M=3 means the number of rows is $3$ for matrix $q$.
A=Q+10 means the first element of matrix $q$ is the 11th element of matrix $Q$; the starting address and offset are given together.
These are all we need for one input/output of the gemm function. And if you're familiar with numpy, here's an example in Python:
import numpy as np
# 1-d array Q, to get the idea how it is stored in memory
Q = np.arange(6 * 8)
print(Q)
# 2-d array QQ, how we understand the matrix, with 2-d shape information
QQ = Q.reshape(6, 8)
print(QQ)
# to help you understand the C explanation above
lda = QQ.shape[1] # 8
K = 4
M = 3
offset = 10
# these are all we need to get the q, or to use it directly in gemm function
q = QQ[offset//lda: offset//lda+M, offset%lda: offset%lda+K]
# q = QQ[1:4, 2:6]
print(q)
After the easy part $C=\beta C$ is done, $\alpha AB$ is computed. The storage order of matrices $A$ and $B$ must be considered and taken care of.
Matrices can be stored in row-major or column-major order. Row-major order is used for C-style arrays; that means, by default, elements of the same row are stored consecutively. But in some cases (an example below shows a situation that benefits from it) we need to store matrices in column-major order. Storing a matrix in column-major order under a row-major convention is equivalent to storing the transpose of the original matrix in memory.
Now you may see why we need the int TA and int TB parameters in our gemm function. In our simple example, TA=0 means matrix $A$ is stored in row-major order, and TA!=0 means matrix $A$ is stored in column-major order; you could also say that the transpose of $A$, which is $A^T$, is stored in memory.
// if (TA == 0 && TB == 0)
void gemm_nn(int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float *C, int ldc)
{
    int i, j, k;
    #pragma omp parallel for
    for(i = 0; i < M; ++i){
        for(k = 0; k < K; ++k){
            register float A_PART = ALPHA*A[i*lda+k];
            for(j = 0; j < N; ++j){
                C[i*ldc+j] += A_PART*B[k*ldb+j];
            }
        }
    }
}
When TA==0 && TB==0, the snippet above is used to compute $C=C+\alpha AB$.
// if (TA == 0 && TB != 0)
void gemm_nt(int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float *C, int ldc)
{
    int i, j, k;
    #pragma omp parallel for
    for(i = 0; i < M; ++i){
        for(j = 0; j < N; ++j){
            register float sum = 0;
            for(k = 0; k < K; ++k){
                sum += ALPHA*A[i*lda+k]*B[j*ldb + k];
            }
            C[i*ldc+j] += sum;
        }
    }
}
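To make the flat row-major indexing with lda/ldb/ldc concrete, here is a small pure-Python port of gemm_nn (my own illustrative sketch, not darknet code), multiplying a $(2, 3)$ by a $(3, 2)$ matrix, both stored as flat row-major arrays:

```python
def gemm_nn(M, N, K, ALPHA, A, lda, B, ldb, C, ldc):
    """C += ALPHA * A @ B, with all matrices as flat row-major lists."""
    for i in range(M):
        for k in range(K):
            a_part = ALPHA * A[i * lda + k]
            for j in range(N):
                C[i * ldc + j] += a_part * B[k * ldb + j]

A = [1, 2, 3,
     4, 5, 6]          # (2, 3) matrix, lda = 3
B = [1, 0,
     0, 1,
     1, 1]             # (3, 2) matrix, ldb = 2
C = [0, 0,
     0, 0]             # (2, 2) result, ldc = 2
gemm_nn(2, 2, 3, 1.0, A, 3, B, 2, C, 2)
print(C)  # [4.0, 5.0, 10.0, 11.0]
```

To operate on a submatrix, you would pass the full matrix's column count as lda and slice the flat list at the right offset, just like the pointer arithmetic in the C version.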
cvgear 0.1.0 was released on 20 May, 2020.
CVGear means Computer Vision Gear. It is under the MIT License and contains computer vision gears for good use.
TorchNestedLoader
TorchNestedLoader allows you to save/load between different modules that share the same logical structure.
Suppose we have a SimpleNet:
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels=3,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False
        )
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(
            in_channels=32,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False
        )
        self.bn2 = nn.BatchNorm2d(32)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        return x
simplenet = SimpleNet()
The structure of SimpleNet is:
SimpleNet(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
And we have a more "wrapped" version WrappedSimpleNet:
import torch.nn as nn

class Conv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        norm = kwargs.pop("norm", None)
        super().__init__(*args, **kwargs)
        self.norm = norm

    def forward(self, x):
        x = super().forward(x)
        if self.norm is not None:
            x = self.norm(x)
        return x

class WrappedSimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = Conv2d(
            in_channels=3,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False,
            norm=nn.BatchNorm2d(32)
        )
        self.conv1 = Conv2d(
            in_channels=32,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False,
            norm=nn.BatchNorm2d(32)
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.conv1(x)
        return x
wrappedsimplenet = WrappedSimpleNet()
The structure of WrappedSimpleNet is:
WrappedSimpleNet(
  (stem): Conv2d(
    3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
    (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (conv1): Conv2d(
    32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
    (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
The logical structure of SimpleNet and WrappedSimpleNet is exactly the same, but they differ in submodule names and tree structure. So you cannot easily save/load the state_dict between these two modules using the .state_dict() and .load_state_dict() methods.
But with TorchNestedLoader, you can save/load a nested_dict between these two modules easily:
from cvgear.framework.torch import TorchNestedLoader
simplenetloader = TorchNestedLoader(simplenet)
wrappedsimplenetloader = TorchNestedLoader(wrappedsimplenet)
# save as nested_dict
nested_dict = simplenetloader.nested_dict()
# load nested_dict
wrappedsimplenetloader.load_nested_dict(nested_dict)
Imagine you have just implemented a state-of-the-art model in torch and want to test it. Training the model from scratch would be time-consuming. Have you ever downloaded a pre-trained model from the Internet and then found it painful to load into your model manually?
Use TorchNestedLoader as your gear!
DarknetParser
Everyone loves darknet. It is very fast and in the public domain.
The configuration file of a darknet network is often long and tedious (due to its sequential structure) and hard to read through. With DarknetParser, the network configuration file can be parsed easily, and you get a clear sense of the network structure from the information it displays.
Without darknet installed:
from cvgear.framework.darknet import DarknetParser, build_darknet_parser
# create a DarknetParser instance, then load network configuration
darknet53 = DarknetParser("darknet53")
darknet53.load_darknet_cfg("path/to/darknet53.cfg")
# or build a DarknetParser from network configuration file directly
darknet53 = build_darknet_parser("path/to/darknet53.cfg")
print(darknet53)
Crystal clear!
DarknetNestedLoader
Save/load torch modules...
Parse darknet networks...
What about saving/loading darknet networks?
Even more: save/load between a darknet network and a torch.nn.Module! DarknetNestedLoader is made for saving/loading a darknet network (DarknetParser) as binary weights (a .weights file) or as a nested_dict.
from cvgear.framework.darknet import DarknetNestedLoader, build_darknet_nested_loader
# create a DarknetNestedLoader instance with DarknetParser, then load from binary weights file
darknet53loader = DarknetNestedLoader(darknet53)
darknet53loader.load_darknet_weights("path/to/darknet53.weights")
# or build a DarknetNestedLoader from network configuration file and binary weights file
darknet53loader = build_darknet_nested_loader("path/to/darknet53.cfg", "path/to/darknet53.weights")
# save weights to nested_dict
nested_dict = darknet53loader.nested_dict()
# load nested_dict to a torch.nn.Module with TorchNestedLoader
# ...
- DarknetParser describes a darknet network (as torch.nn.Module describes a torch module)
- DarknetNestedLoader can save/load a darknet network as a nested_dict or a binary file
- TorchNestedLoader can save/load a torch module as a nested_dict
- With DarknetNestedLoader and TorchNestedLoader, you can convert between darknet weights and torch weights easily.
That is little cvgear 0.1.0.
More gears are coming up...
Happy inauguration!🎉🎉🎉
New research starts with understanding, reproducing and verifying previous results in the literature. Detectron2 made the process easy for computer vision tasks.
This post covers the #installation, #demo and #training of detectron2 on Windows.
update:
2020/07/08
Learning detectron2 starts with installation.
REM "Create a conda environment named 'detectron2' with the latest version of Python 3.7.x"
conda create --name detectron2 python=3.7
REM "Activate the conda environment for 'detectron2'"
conda activate detectron2
Note: all required python packages will be installed in this environment (including detectron2 itself). Make sure to activate the environment with conda activate detectron2 before you do anything with detectron2. Deactivate the environment with conda deactivate to go back to your previous working environment.
The latest version of detectron2 requires pycocotools >= 2.0.1. Install it with pip install pycocotools>=2.0.1 on Linux.
But on Windows, you should first download pycocotools-2.0.1.tar.gz from PyPI.
Unzip it, then edit pycocotools-2.0.1\setup.py:
replace extra_compile_args=['-Wno-cpp', '-Wno-unused-function', '-std=c99'] with extra_compile_args={'gcc': ['/Qstd=c99']},
Back in the command prompt, install pycocotools into the site-packages of the current environment (detectron2):
cd pycocotools-2.0.1
python setup.py build_ext install
If it works, you should see the message Finished processing dependencies for pycocotools==2.0.1, and then you can delete the pycocotools directory if you like:
cd ..
RMDIR /S pycocotools-2.0.1
Check your CUDA version first:
nvcc --version
It should be ≥ 9.2 (that is 9.2, 10.0, or 10.1). Go to https://pytorch.org/get-started/locally/, select your CUDA version, and copy the command (e.g. for CUDA 10.1 it should be):
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
The official version doesn't support Windows currently. To build and use it successfully on Windows, you should edit some files: File 1, File 2, File 3, File 4, File 5, File 6.
The repository ivanpp/detectron2 contains the latest version of official detectron2 with the Windows patches mentioned above. So the easy way is to clone and build it:
git clone https://github.com/ivanpp/detectron2.git
cd detectron2
pip install -e .
Or use the official version:
git clone https://github.com/facebookresearch/detectron2.git
Then edit the files mentioned above and build it:
cd detectron2
pip install -e .
Note: it may take a while to build all the .cu and .cpp files, be patient!
Check the installation:
python -m detectron2.utils.collect_env
The result should look like:
Make sure the NVCC version of detectron2 matches the NVCC version of PyTorch. If not, you may have chosen the wrong version in Step 2.
Choose a model from the model zoo, set the input config file, and specify the corresponding MODEL.WEIGHTS for it.
python demo/demo.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
--input datasets/coco/unlabeled2017/000000000361.jpg ^
--output output.jpg ^
--opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x/137260431/model_final_a54504.pkl
Note:
- Model weights with the detectron2:// prefix are resolved by Detectron2Handler; see detectron2/detectron2/checkpoint/catalog.py for details.
- The downloaded model is cached at %USERPROFILE%/.torch/fvcore_cache if the $FVCORE_CACHE environment variable is not set (on Linux, the default cache directory is ~/.torch/fvcore_cache); see fvcore/fvcore/common/file_io.py for details.
- You can also download the model file manually, then pass --opts MODEL.WEIGHTS PATH/TO/model_final_a54504.pkl.
All the config files are made for 8-GPU training. To reproduce the results on 1 GPU, there are changes to be made. For example, to reproduce the result of configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml, you can edit the corresponding .yaml file (mask_rcnn_R_50_FPN_1x.yaml or Base-RCNN-FPN.yaml) or overwrite the training parameters on the command line.
Inconvenient but once-and-for-all way:
Edit configs\Base-RCNN-FPN.yaml:
SOLVER:
  IMS_PER_BATCH: 2
  BASE_LR: 0.0025
  STEPS: (480000, 640000)
  MAX_ITER: 720000
Train the model:
python tools/train_net.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
OUTPUT_DIR output/mask_rcnn_r50_fpn_1x
Convenient way:
Simply overwrite it through command line, no need to edit any file:
python tools/train_net.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 ^
SOLVER.MAX_ITER 720000 SOLVER.STEPS (480000,640000) ^
OUTPUT_DIR output/mask_rcnn_r50_fpn_1x
All the checkpoints and the final model will be stored in the OUTPUT_DIR we defined, output/mask_rcnn_r50_fpn_1x, along with the tensorboard event file, the log file, and more. A comprehensive model config file is generated automatically (output/mask_rcnn_r50_fpn_1x/config.yaml).
Training may shut down sometimes, manually or accidentally. To resume training, simply run:
python tools/train_net.py ^
--config-file output/mask_rcnn_r50_fpn_1x/config.yaml ^
--resume
The training will be resumed from the last checkpoint automatically; there is no need to specify the checkpoint unless you need to for some reason.
Use tensorboard to visualize the training progress during or after training:
tensorboard --logdir output
Detectron2 evaluates the final model after training finishes. To evaluate the performance of any checkpoint:
python tools/train_net.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
--eval-only MODEL.WEIGHTS /path/to/checkpoint_file
Conda cheat sheet
Conda documentation
Conda installs packages from the default channel if no channel is specified. Use conda config --show channels to see the current channel list.
Subject to the GFW, downloading can be very slow in mainland China. The once-and-for-all solution is breaking the wall or leaving the mainland, both hard to achieve.
You can circumvent the GFW through a proxy or use a domestic channel as an alternative. I prefer the former.
Suppose you have a local socks5 proxy listening on port 1080; simply modify the .condarc:
proxy_servers:
  http: socks5://127.0.0.1:1080
  https: socks5://127.0.0.1:1080
Two ways to edit the channel list:
Create .condarc in %UserProfile%/.conda, follow the YAML syntax, and override the channel list configuration like:
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
Channels are organized from highest to lowest priority.
REM "Add to the top of the channel list"
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
OR
REM "Add to the bottom of the channel list"
conda config --append channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --append channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
Use the -c or --channel flag to add an additional channel to search when installing:
conda install numpy --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
And use the --override-channels flag to skip the channel list in .condarc:
conda install numpy --override-channels --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
REM "Search packages and display detailed information"
conda search PKGNAME --info
REM "Specify version"
conda install PKGNAME==3.14
REM "Specify channel"
conda install --channel conda-forge
conda install -c conda-forge
REM "Specify environment"
conda install PKGNAME --name ENVNAME
conda install PKGNAME -n ENVNAME
REM "Install from local directory"
conda install PATH/TO/PKGNAME.tar.bz2 --offline
conda update PKGNAME --name ENVNAME
conda remove PKGNAME --name ENVNAME
Conda allows you to create environments containing different packages and even different python version that will not interact with other environments.
Conda will install package to base
environment as default if no environment is specified when installing.
Here are some convenient commands to manage conda environments:
REM "Current environment is highlighted with an asterisk(*)"
conda info --envs
Or just go to %UserProfile%/.conda/environments.txt
REM "Create environment with specific python version"
conda create --name ENVNAME python=VERSION
REM "Create environment to specific path"
conda create --prefix PATH/TO/ENVNAME
REM "Create environment from .yaml file"
conda env create --file PATH/TO/environment.yaml
REM "Create environment from .txt file"
conda create --name ENVNAME --file PATH/TO/spec-file.txt
REM "Create environment from existing environment"
conda create --name NEWNAME --clone OLDNAME
conda activate ENVNAME
conda deactivate
conda list --name ENVNAME
conda install --name ENVNAME pip
conda activate ENVNAME
pip <pip_subcommand>
REM "example: pip install PKGNAME -f LINK/LOCAL_PATH"
REM "Export to .yaml file"
conda activate ENVNAME
conda env export > PATH/TO/environment.yaml
REM "Export spec list as txt file"
conda list --name ENVNAME --explicit > PATH/TO/spec-file.txt
conda remove --name ENVNAME --all
Happy Chinese New Year🎉! It's the seventh day of the first lunar month and the end of the holiday. So I decided to do something meaningful and get rid of some bad shit.
The ringleader of China's increasingly closed Internet, WeChat, has kidnapped my family and friends and forces me to use its so-called social software (spyware, actually). And I'm not asking for network neutrality, because we're far, far away from it. That spyware does something worse: it scans my disk for private data, runs content review to decide what I can see, and filters what I say without telling me👿.
Since that imp enslaves me and I have no escape, I decided to let my Telegram enslave that imp, and then talk to my Telegram as an equal, freely, without getting my hands dirty.
Libraries to use:
It's important to know that EWS is still in alpha, so it's unstable and changes rapidly; that's why the versions installed below are pinned.
Talk to @BotFather to create your bot: send /newbot, then set its name (WeChat Slave) and its username (panda_wechat_bot).
Send /setprivacy and set the status to Disable.
Send /setjoingroups and set the status to Enable.
Optional:
Set bot's profile photo: /setuserpic
Set bot's description: /setdescription
Set bot's about text: /setabouttext
Set commands helper: /setcommands
Commands helper:
help - Show commands list.
link - Link a remote chat to a group.
chat - Generate a chat head.
info - Display information of the current Telegram chat.
update_info - Update the group name and profile picture.
unlink_all - Unlink all remote chats from a group.
extra - Access additional features from Slave Channels.
Ask @BotFather for your bot's token with /token and record it, e.g. 123456789:EXAMPLEOF5BOTTOEKN5TOACCESS5HTTPAPI.
Ask @get_id_bot for your Chat ID and record it, e.g. 716124421.
sudo apt update
sudo apt install -y python3 python3-pip python3-pil python3-setuptools python3-numpy python3-yaml python3-requests
sudo apt install -y ffmpeg libmagic-dev libwebp-dev screen
pip3 install imageio==2.4.0
pip3 install ehforwarderbot==2.0.0b13
pip3 install efb-telegram-master==2.0.0b18
pip3 install efb-wechat-slave==2.0.0a16
mkdir -p ~/.ehforwarderbot/profiles/default
vim ~/.ehforwarderbot/profiles/default/config.yaml
Set the master and slave:
master_channel: "blueset.telegram"
slave_channels:
- "blueset.wechat"
mkdir -p ~/.ehforwarderbot/profiles/default/blueset.telegram
vim ~/.ehforwarderbot/profiles/default/blueset.telegram/config.yaml
Set token to the bot token recorded before, to access the bot.
And set admins to the Chat ID recorded before, so that only you can access it.
token: "123456789:EXAMPLEof5BOTtoken5toaccess5HTTPAPI"
admins:
- 716124421
screen ehforwarderbot
Post cover image from Self-Censorship in China Continues, Extends to Mobile Apps
Note: This post is based on AlexeyAB/darknet version, the procedure of pjreddie/darknet version may differ slightly (have not tried, maybe identical).
8 steps to build your own deep learning lego module in Darknet:
Define LAYER_TYPE
Add LAYER_TYPE for your custom layer in layer.h
typedef enum {
    // ...
    CUSTOM
} LAYER_TYPE;
Define layer string
Add layer string for your custom layer in parser.c
LAYER_TYPE string_to_layer_type(char * type)
{
// ...
if (strcmp(type, "[custom]")==0) return CUSTOM;
}
Then Darknet will be able to recognize your custom layer in the cfg file:
[net]
#...
[custom]
#...
Implement your custom layer in custom_layer.c and custom_layer.h. It should contain at least these 4 functions:
layer make_custom_layer(int batch, int w, int h, .....);
void forward_custom_layer(const layer l, network_state state);
void backward_custom_layer(const layer l, network_state state);
void resize_custom_layer(layer *l, int w, int h);
(optional) If you want to train it with GPU, implement these:
#ifdef GPU
void forward_custom_layer_gpu(const layer l, network_state state);
void backward_custom_layer_gpu(const layer l, network_state state);
#endif
In parser.c, include the header of your custom layer (to use make_custom_layer()):
#include "custom_layer.h"
Implement the parse function:
layer parse_custom(list *options, size_params params)
{
int param1 = option_find_int(options, "param1", 1);
//...
layer l = make_custom_layer(params.batch, params.w, params.h, param1, ...);
l.param2 = option_find_float(options, "param2", .1);
//...
return l;
}
Add your parse function in parse_network_cfg_custom()
:
network parse_network_cfg_custom(char *filename, int batch)
{
//...
while(n){
//...
LAYER_TYPE lt = string_to_layer_type(s->type);
if(lt == CONVOLUTIONAL){
l = parse_convolutional(options, params);
}else if(lt == CUSTOM){
l = parse_custom(options, params);
}
}
//...
return net;
}
In network.c, include the header of your custom layer (to use resize_custom_layer()):
#include "custom_layer.h"
Modify int resize_network(network *net, int w, int h)
function:
int resize_network(network *net, int w, int h)
{
//...
for (i = 0; i < net->n; ++i){
layer l = net->layers[i];
if(l.type == CONVOLUTIONAL){
resize_convolutional_layer(&l, w, h);
}else if(l.type == CUSTOM){
resize_custom_layer(&l, w, h);
}
}
//...
}
[optional] If your custom layer is used to produce results (like YOLO, REGION or DETECTION):
Implement custom_num_detections() and get_custom_detections() in custom_layer.c, then modify 2 functions in network.c (to count the detections and fetch the detections):
int num_detections(network *net, float thresh)
{
int i;
int s = 0;
for (i = 0; i < net->n; ++i) {
layer l = net->layers[i];
if (l.type == CUSTOM) {
s += custom_num_detections(l, thresh);
}
//...
}
return s;
}
void fill_network_boxes(network *net, int w, int h, float thresh, float hier, int *map, int relative, detection *dets, int letter)
{
int prev_classes = -1;
int j;
for (j = 0; j < net->n; ++j) {
layer l = net->layers[j];
if (l.type == CUSTOM){
int count = get_custom_detections(...);
//...
}
//...
}
}
Add custom_layer.c and custom_layer.h to your Visual Studio solution build/darknet.sln, or add custom_layer.o to your Makefile.
Rebuild your project
Post cover image from Lego Store | Copenhagen
This blog post records some thoughts from my customization of the Rime input method 😄
Before doing this, one thing must be clear: what is customization?
Customization means tailoring things to your own needs.
Rime has a huge number of customizable components, but a component being customizable is no reason to customize it. Fundamentally, it should be driven by my own habits: customizing @ivanpp's own Rime.
This post was written in Traditional Chinese characters, because in the process of learning and using this software I came to feel the cultural significance of Traditional Chinese. Given my environment, though, it's not convenient for me to use Traditional characters frequently, so this small gesture is my way of showing respect!
For the color scheme I went with the default ps4 preset; I'm also very happy with the default font and size, so rather than customizing for customization's sake, I simply picked this scheme.
This is likewise configured in the weasel.custom.yaml file, which is why I write about it here; honestly, I don't know why the customization of this feature is placed there.
Very simple: to match my needs, I use the default English input in bash, cmd, Atom, MSVS and similar software.
Rime supports a great many input schemas; I kept only the three I need:
Just like calling up the console with ~ in Counter-Strike 1.6, I use Control_L + ~ to bring up Rime's switcher menu. I removed the default F4 binding because I find it hard to remember and it often causes hotkey conflicts.
Option 1 is pinned to the schema currently in use, and option 2 is pinned to the mode-switching menu. In fact, choosing either 1 or 2 takes you to the mode-switching menu. It makes sense: when you are already in the schema you would select, rather than performing a meaningless no-op, option 1 becomes a concrete mode-switching button. In actual use this feels very comfortable.
I rearranged the mode-switching menu itself as follows. Although full-width punctuation already looks deprecated, I never go through the menu to switch between Chinese/Western text or Chinese/Western punctuation (I just press Shift_L), so those entries are actually used less and deserve a lower position. As a result, I get a convenient Traditional/Simplified toggle, reachable by either 1 2 or 2 1, with no hotkey conflicts to worry about, while 1 1 and 2 2 become true no-ops. In fact, when you've called up the menu and realize you don't know what you wanted, you need a no-op or a cancel. From my own use, when my mind is active I quickly hit 1 1 as a no-op and keep typing, while when I'm thinking or sluggish I tend to press Esc for the same effect.
I use the following Chinese/English switching setup to handle different situations:
Control_L is set to commit_code. When I'm in Chinese mode, about to type a passage of English, and only notice after the first word, pressing Enter and then Shift_L to switch modes is tedious; Control_L immediately commits the English I've already typed to the screen and switches to English mode, saving quite some effort.
Also, when there is no pending input, Control_L can serve as a Chinese/English toggle too, but my little finger is already glued to Shift_L, so I doubt I'll use it for that much.
Shift_L is set to inline_ascii. With no pending input it toggles Chinese/English, which I use a lot. Here it has another use: amid a lot of Chinese I need to insert a phrase 'greater than 1' word long, and once again I forgot to switch to English mode, or deliberately didn't. Then all I have to do is press Shift_L after typing the first letter, finish the whole phrase, or sentence, or email address, and press Enter. More effective!
I often type email addresses and file names, so _ and @ must not commit the text immediately; it has to be possible to finish the whole address or file name.
Of course, if I remember to switch to Western mode beforehand, that works fine too. Same for emails:
Convenient~
For '朙月拼音·简化字' (Luna Pinyin · Simplified), the schema I use most, I made some customizations and extensions to make my own use more convenient.
My most-used schema is '朙月拼音·简化字' (below, 'the Simplified schema'), and what I do most is coding. Based on these two facts, I banned Chinese punctuation from the Simplified schema, thoroughly! Program errors caused by Chinese punctuation drive me crazy, and a full set of Western punctuation also works fine in daily chat. At least it feels comfortable to me.😃
I added the macOS-style paging keys [ ] while keeping all the defaults, including the Emacs-style ones. In practice I still use - + most often.
Pinyin is the schema I'm most fluent in, and I rarely type rare characters, so I don't need stroke input for reverse lookup. So I defined ~ as the custom-phrase prefix:
~f for common emoji, 😋
~m for math symbols, ±
~ar for arrows, ↑
and of course plenty more! I even buried a few easter eggs for myself 😂
The main thing was extending the English vocabulary, to satisfy my own (frequent 😄) mixed Chinese-English input.
It's stuffed with plenty of 'personal extras':
Shift + Control + 1/2/3/4/5 maps to the five menu options, where 1 means: next input schema.
For me, though, pressing Control_L + ~ and then the specific number is actually quicker (palm plus two fingers).
Use Control + Delete or Shift + Delete to delete a wrong word from the dictionary.
Files to back up: default.custom.yaml, weasel.custom.yaml, luna_pinyin_simp.custom.yaml, plus the punctuation definition file ivanpp_punc.yaml, the dictionary definition file ivanpp_dict.extended.dict.yaml, and all dictionary files it uses.
You can also back up a dictionary snapshot periodically: luna_pinyin.userdb.txt, located in the user folder.
RIME also provides a GUI for backing up and merging dictionary snapshots and for exporting/importing the text code table.
So... in a few months, I'll export the text code table and see which words I use most! 😋
The convolutional layer before a yolo layer should have filters=n*(4+1+classes), where n is the number of prior anchors used in the following yolo layer, namely sizeof(mask), and classes is the number of classes.
[convolutional]
size=1
stride=1
pad=1
filters=75
activation=linear
The shape of the input tensor is $(b, n*(4+1+classes), h, w)$. More specifically, it's the concatenation of $n$ individual $(4+1+classes, h, w)$ tensors per image. It's actually a 1-D array, but imagine it as a $(b, n, 4+1+classes, h, w)$ tensor: the innermost dimensions are $w$ and $h$ (with $w$ varying fastest), and the next dimension is $(4+1+classes)$. So for all b images and all n anchors, we have a $(4+1+classes)$ prediction vector at each location, and the stride between that prediction's elements is l.h*l.w.
static int entry_index(layer l, int batch, int location, int entry)
{
int n = location / (l.w*l.h);
int loc = location % (l.w*l.h);
return batch*l.outputs + n*l.w*l.h*(4+l.classes+1) + entry*l.w*l.h + loc;
}
location should have a value between 0 and l.n*l.h*l.w-1; it gives both the prior-anchor index n and the spatial location loc. entry should be between 0 and 4+1+classes-1; it gives the index along the third dimension (the prediction vector).
In the yolo layer, the net predicts offsets from the bounding-box prior width and height. Three options, mask, num and anchors, configure the prior anchor boxes.
In cfg file:
[net]
width=416
height=416
[yolo]
mask = 0,1,2
num=9
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
In c source file:
// https://github.com/pjreddie/darknet/blob/master/src/parser.c#L306-L342
int total = option_find_int(options, "num", 1);
int num = total;
char *a = option_find_str(options, "mask", 0); // char *a = "0,1,2";
int *mask = parse_yolo_mask(a, &num); // int *mask = {0, 1, 2};
num in the cfg file (total in the source) is the total number of prior anchors available to the entire network. mask in the cfg file gives the indices of the prior anchors used in the current yolo layer, so we can define lots of anchors and use only a few of them per yolo layer. anchors in the cfg file gives all num available anchors as $(p_w, p_h)$ pairs. The anchor sizes $(p_w, p_h)$ are actual pixel values on the network's input image, in this case $(416, 416)$. So $(10, 13)$ is a prior anchor 10 pixels wide and 13 pixels high on the resized $(416, 416)$ input image.
# Example
#yolo_layer0
[yolo]
mask = 0,1
num=3
anchors = 10,13, 16,30, 33,23
#yolo_layer1
[yolo]
mask = 1,2
num = 3
anchors = 10,13, 16,30, 33,23
#yolo_layer2
[yolo]
num=2
In the example above, yolo_layer0 uses anchors $(10, 13), (16, 30)$, yolo_layer1 uses anchors $(16, 30), (33, 23)$, and yolo_layer2 does not use prior anchors; in fact, yolo_layer2 falls back to 2 $(0.5, 0.5)$ anchors by default.
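The cfg parsing above can be sketched in Python (function names here are mine, not darknet's):

```python
# Sketch: parse the "mask" and "anchors" strings from a [yolo] section.
def layer_anchors(mask_str, anchors_str):
    mask = [int(v) for v in mask_str.split(",")]
    vals = [float(v) for v in anchors_str.split(",")]
    pairs = list(zip(vals[0::2], vals[1::2]))  # (p_w, p_h) pairs
    return [pairs[i] for i in mask]            # only the anchors used in this layer

# yolo_layer0 from the example above: mask = 0,1
print(layer_anchors("0,1", "10,13, 16,30, 33,23"))  # [(10.0, 13.0), (16.0, 30.0)]
```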
The yolo layer predicts l.n*l.h*l.w bounding boxes per image (l.n is the length of *mask, namely the number of prior anchors used in the current yolo layer). Each predicted bounding box has one objectness score, which gives its $Pr(Object)$. What we want is an objectness of 1 for all positive samples and 0 for all negative ones.
[yolo]
ignore_thresh = .5
truth_thresh = .9
Two kinds of predictions are considered positive:
Among all num prior anchors centered in the same cell as the GT Bbox (ground-truth bounding box), the anchor whose shape is most similar to the GT Bbox becomes the only anchor responsible for it. In other words, at most one best prior anchor is allocated to each GT Bbox in the current yolo layer. And if this best anchor is not used in the current yolo layer (its index is not in the layer's *mask), no anchor is allocated for that GT Bbox.
Among all l.n*l.h*l.w predictions, if the highest IoU between a prediction and all ground-truth bounding boxes is greater than truth_thresh, that prediction becomes responsible for the GT Bbox giving the highest IoU.
Additionally, the yolo layer sets truth_thresh = 1 by default. Since IoU is always less than or equal to 1, the second situation never happens. So the yolo layer penalizes at most 1 (of l.n*l.h*l.w) prediction per GT Bbox, penalizing its objectness for not being 1.
There is also an ignore_thresh for the negative (background) definition: if the highest IoU between a prediction and all GT Bboxes is less than or equal to ignore_thresh, that prediction is assigned as negative, and its objectness score is penalized for not being 0.
*output is the input *state.input of the last convolutional layer, namely the prediction tensor, and *delta is the gradient of the yolo layer. index gives the index of the first class probability $Pr(Class_0|Object)$ for a certain batch b, a certain anchor n and a certain position w, h. Remember we have b images, n anchors per position and w*h locations. class is the ground-truth class and classes gives the number of classes. stride is always l.w*l.h, and *avg_cat is for statistics, to compute the average class probability.
void delta_yolo_class(float *output, float *delta, int index, int class, int classes, int stride, float *avg_cat)
{
int n;
if (delta[index]){ // if some anchor is responsible for more than one GT
delta[index + stride*class] = 1 - output[index + stride*class];
if(avg_cat) *avg_cat += output[index + stride*class];
return;
}
for(n = 0; n < classes; ++n){ // common situation
// penalize Pr(Classi|Object) for all classes
delta[index + stride*n] = ((n == class)?1 : 0) - output[index + stride*n];
if(n == class && avg_cat) *avg_cat += output[index + stride*n];
}
}
Given the index of $Pr(Class_0|Object)$, delta_yolo_class penalizes $Pr(Class_i|Object)$ for every $Class_i$: it wants $Pr(Class_{i=gt}|Object)$ to be 1 and the others to be 0. And if some lucky anchor is responsible for more than one ground-truth box, those GT boxes may or may not contain the same class; it simply overwrites the gradient for the other ground truth's class probability and leaves the rest alone. For example, with 20 classes, if some lucky anchor is responsible for 2 different classes (say a dog and a cat) in some naughty image, it penalizes $Pr(Class_{i=dog}|Object)$ and $Pr(Class_{i=cat}|Object)$ for not being 1 and all the others for not being 0.
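The same logic as a short Python sketch (stride collapsed to 1 and names are mine; the C version works on the strided flat array):

```python
# Sketch of delta_yolo_class for one anchor's class probabilities.
def delta_class(output, delta, gt_class):
    if delta[0]:  # gradients already written: anchor responsible for another GT
        delta[gt_class] = 1 - output[gt_class]  # overwrite only the new GT class
        return delta
    for i in range(len(output)):  # common case: penalize every class probability
        delta[i] = (1 if i == gt_class else 0) - output[i]
    return delta

probs = [0.2, 0.5, 0.1]                       # toy Pr(Class_i|Object), 3 classes
delta = delta_class(probs, [0.0] * 3, gt_class=1)
print(delta)                                  # [-0.2, 0.5, -0.1]
delta = delta_class(probs, delta, gt_class=2) # same anchor, a second GT class
print(delta)                                  # [-0.2, 0.5, 0.9]
```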
int obj_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4); // index of objectiveness
avg_anyobj += l.output[obj_index]; // sum the objectiveness for all pred box
l.delta[obj_index] = 0 - l.output[obj_index]; // common situation, low iou
if (best_iou > l.ignore_thresh) { // best_iou > ignore_thresh -> ignored, don't penalize the objectness
l.delta[obj_index] = 0;
}
if (best_iou > l.truth_thresh) { // never gonna happen when l.truth_thresh = 1
l.delta[obj_index] = 1 - l.output[obj_index];
int class_id = state.truth[best_t*(4 + 1) + b*l.truths + 4]; // get the class_id of the GT box
if (l.map) class_id = l.map[class_id];
int class_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4 + 1);
delta_yolo_class(l.output, l.delta, class_index, class_id, l.classes, l.w*l.h, 0);
box truth = float_to_box_stride(state.truth + best_t*(4 + 1) + b*l.truths, 1);
delta_yolo_box(truth, l.output, l.biases, l.mask[n], box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.delta, (2-truth.w*truth.h), l.w*l.h);
}
The network predicts 4 coordinates for each bounding box, $t_x, t_y, t_w, t_h$.
$\sigma(t_x)$ and $\sigma(t_y)$ are the box-center position relative to the cell. $t_w$ and $t_h$ predict how much greater or smaller the bounding box is than the prior anchor. For example, if $t_w > 0$, then $\mathrm{e}^{t_w} > 1$, and we will have $b_w > p_w$.
$$
\begin{align}
b_x & = \sigma(t_x) + c_x \\
b_y & = \sigma(t_y) + c_y \\
b_w & = p_w\mathrm{e}^{t_w} \\
b_h & = p_h\mathrm{e}^{t_h} \\
\end{align}
$$
$b_x$ and $b_y$ are the pixel distances from the top-left corner of the current feature map $(l.w, l.h)$. And since $p_w$ and $p_h$ are actual pixel values, $b_w$ and $b_h$ are actual pixel values on the resized network input image. To get a normalized prediction, $b_x$ and $b_y$ should be divided by the current feature-map size lw and lh; similarly, $b_w$ and $b_h$ should be divided by the resized network input size w and h.
box get_yolo_box(float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, int stride)
{
box b;
b.x = (i + x[index + 0*stride]) / lw;
b.y = (j + x[index + 1*stride]) / lh;
b.w = exp(x[index + 2*stride]) * biases[2*n] / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
return b;
}
get_yolo_box converts the prediction $\sigma(t_x), \sigma(t_y), t_w, t_h$ to a normalized box struct instance. Inversely, to compute the gradients of the bounding-box prediction, we should convert the already-normalized ground-truth label box truth back to $\sigma(\hat{t}_x), \sigma(\hat{t}_y), \hat{t}_w, \hat{t}_h$.
float delta_yolo_box(box truth, float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, float *delta, float scale, int stride)
{
box pred = get_yolo_box(x, biases, n, index, i, j, lw, lh, w, h, stride);
float iou = box_iou(pred, truth);
float tx = (truth.x*lw - i);
float ty = (truth.y*lh - j);
float tw = log(truth.w*w / biases[2*n]);
float th = log(truth.h*h / biases[2*n + 1]);
delta[index + 0*stride] = scale * (tx - x[index + 0*stride]);
delta[index + 1*stride] = scale * (ty - x[index + 1*stride]);
delta[index + 2*stride] = scale * (tw - x[index + 2*stride]);
delta[index + 3*stride] = scale * (th - x[index + 3*stride]);
return iou;
}
So using delta_yolo_box, we can convert a normalized bounding-box label to $\sigma(\hat{t}_x), \sigma(\hat{t}_y), \hat{t}_w, \hat{t}_h$ and then subtract $\sigma(t_x), \sigma(t_y), t_w, t_h$ to get the gradients. But if we only did the subtraction, large bounding boxes would exploit their size to dominate the gradient. To compensate, we multiply the gradients by scale to magnify the gradients of relatively small GT bounding boxes. Setting scale to 2-truth.w*truth.h does this: gradients of small GT bounding boxes are magnified by a factor close to 2, while gradients of big GT bounding boxes are magnified by a factor close to 1.
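Numerically, the size-compensation factor behaves like this (Python sketch; the inputs are the normalized GT width and height):

```python
# Sketch: box-size compensation factor used in delta_yolo_box.
def box_scale(truth_w, truth_h):  # truth_w, truth_h normalized to [0, 1]
    return 2 - truth_w * truth_h

print(box_scale(0.1, 0.1))  # small box -> ~1.99, gradient magnified almost 2x
print(box_scale(0.9, 0.9))  # big box   -> ~1.19, close to 1
```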
doc: torch.Tensor — PyTorch master documentation
ref: How to create a tensor on GPU as default - PyTorch Forums
torch.Tensor is an alias for the default tensor type (torch.FloatTensor), as the documentation says.
import torch
import torch.nn as nn
import torch.nn.functional as F
Torch defines 8 CPU tensor types and 8 GPU tensor types.
The default tensor type is torch.FloatTensor, which is a CPU tensor and has a dtype of torch.float32.
print(torch.get_default_dtype()) # To get the default Tensor dtype(torch.float32)
And this (torch.FloatTensor) makes tensors be created on the CPU if no device is specified.
To make tensors be created on the GPU by default:
torch.set_default_tensor_type('torch.cuda.FloatTensor')
After this, all tensors will be created on the selected GPU device, and still has a dtype of torch.float32
by default.
a = torch.tensor([1.])
print(a.dtype)
print(a.device)
One more example:
torch.set_default_tensor_type('torch.cuda.DoubleTensor')
This makes tensors be created on the GPU by default, with a dtype of torch.float64.
Use torch.device to get a torch.device object.
Get the CPU device
cpu = torch.device('cpu') # Current CPU device
cpu1 = torch.device('cpu:0')
They're exactly the same, since there is no multiple-CPU mode.
Get the GPU device
# Current GPU device
cuda = torch.device('cuda')
cuda = torch.device('cuda', None)
# GPU 0
cuda0 = torch.device('cuda:0')
cuda0 = torch.device('cuda', 0)
# GPU 1
cuda1 = torch.device('cuda:1')
cuda1 = torch.device('cuda', 1)
The current CPU device will always be 'cpu:0', but the current GPU device depends on the currently selected device.
So, if the currently selected device is GPU 0, cuda refers to GPU 0. But when we change the selected device to GPU 1 (if you have one...😂), cuda becomes GPU 1.
Create Tensors on device
Get the index of currently selected device:
print(torch.cuda.current_device())
Let's suppose it's 0
, now we can
# Create a tensor on CPU, given a torch.device object or a string
a = torch.tensor([1.], device=cpu)
a = torch.tensor([1.], device='cpu')
# Create a tensor on currently selected GPU, which is GPU 0 now
b = torch.tensor([1.], device=cuda)
b = torch.tensor([1.], device='cuda')
# Create a tensor on specific GPU
c = torch.tensor([1.], device=cuda1)
c = torch.tensor([1.], device='cuda:1')
With one GPU, we only care whether a tensor is on the CPU or the GPU. No need to worry about the currently selected device, since there is only 1 GPU that can be selected :joy:.
torch.Tensor.cuda() returns a copy of this torch.Tensor object in CUDA memory on the specified device, copying to the currently selected device if no device argument is given.
cuda = torch.device('cuda')
cuda0 = torch.device('cuda:0')
tensor = torch.randn(2, 2)
# To currently selected GPU device or specific device(both 'cuda:0' in this situation)
tensor = tensor.cuda()
tensor = tensor.cuda(cuda0)
Inversely, use torch.Tensor.cpu() to get a copy in CPU memory.
tensor = torch.randn(2, 2)
# CPU -> GPU
tensor = tensor.cuda()
# GPU -> CPU
tensor = tensor.cpu()
torch.Tensor.to()
performs Tensor dtype and/or device conversion. It returns a copy of the desired Tensor.
cuda0 = torch.device('cuda:0')
cpu = torch.device('cpu')
tensor = torch.randn(2, 2)
# to float64
tensor = tensor.to(torch.float64)
# to float 32, using torch.Tensor.type()
tensor = tensor.type(torch.float32)
# to GPU
tensor = tensor.to(cuda0)
# to CPU
tensor = tensor.to(cpu)
So torch.Tensor.to(device, dtype) can be considered a combination of torch.Tensor.cuda(device), torch.Tensor.cpu() and torch.Tensor.type(dtype).
Once a data tensor is allocated (to CPU or GPU), we can operate on it irrespective of the selected device, and the results are always placed on the same device as the tensor.
Furthermore, if we do operations between 2 or more tensors, they must all be on the same device: the operation takes place there, and the result is placed there too.
torch.nn.Parameter is a kind of Tensor that is to be considered a module parameter, and Parameters are Tensor subclasses. Correspondingly, the torch.nn module provides the torch.nn.Module.cuda() and torch.nn.Module.cpu() methods for easily transferring tensors (parameters) between CPU and GPU, and also the torch.nn.Module.to() method to do the transfer/cast.
class Model(nn.Module):
def __init__(self):
super(Model, self).__init__()
self.conv1 = nn.Conv2d(1, 20, 5)
self.conv2 = nn.Conv2d(20, 20, 5)
def forward(self, x):
x = F.relu(self.conv1(x))
return F.relu(self.conv2(x))
model = Model()
# list contains parameters of model.conv1(weight and bias)
param_list_conv1 = list(model.conv1.parameters())
print(param_list_conv1[0].device)
# CPU -> GPU(.cuda() method)
model.cuda()
print(param_list_conv1[0].device)
# GPU -> CPU(.cpu() method)
model.cpu()
print(param_list_conv1[0].device)
# CPU -> GPU(.to() method)
cuda0 = torch.device('cuda:0')
model.to(cuda0)
print(param_list_conv1[0].device)
After allocating data and model to the GPU, we can use the GPU to accelerate our training process.
With multiple GPUs, you should care about the currently selected device. Use the context manager torch.cuda.device() to manually control which GPU a tensor is created on, which also makes the code clearer.
cuda = torch.device('cuda')
# Create tensor a,b,c on device cuda:0
with torch.cuda.device(0):
a = torch.tensor([1., 2.], device=cuda)
b = torch.tensor([1., 2.]).cuda()
c = torch.tensor([1., 2.]).to(cuda)
# Create tensor d,e,f on device cuda:1
with torch.cuda.device(1):
d = torch.tensor([1., 2.], device=cuda)
e = torch.tensor([1., 2.]).cuda()
f = torch.tensor([1., 2.]).to(cuda)
doc: J. CUDA Environment Variables :: CUDA Toolkit Documentation
ref: CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Developer Blog
ref: 2Pac – Can't C Me Lyrics | Genius Lyrics
Let's suppose (or dream) that you have 4 GPUs and want to use three of them to train your model while keeping the remaining one free to play with. Set CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application (the model-training process) sees.
There are many ways to achieve that; here are 2 of them:
Set the environment variable in your python script (not recommended)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'
This method is not recommended because it's not flexible. Use it only when this is your usual setup.
Set the environment variable when you run the python script (recommended)
CUDA_VISIBLE_DEVICES=1,2,3 python train.py
Use this if you just want to play around.
And after that,
The blind stares of a million pairs of eyes
Lookin' hard but won't realize
That they will never see the 'GPU0'!
ref: Optional: Data Parallelism — PyTorch Tutorials
doc: torch.nn — PyTorch master documentation
Torch will only use one GPU by default. Simply use torch.nn.DataParallel to run your model in parallel over multiple GPUs along the batch dimension.
model = nn.DataParallel(model)
ref: When to set pin_memory to true? - vision - PyTorch Forums
ref: How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Developer Blog
torch.utils.data.DataLoader accepts a parameter pin_memory; if True, the tensors will be copied into CUDA pinned memory.
#https://github.com/pytorch/examples/blob/master/imagenet/main.py#L211-L223
train_loader = torch.utils.data.DataLoader(
train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
num_workers=args.workers, pin_memory=True, sampler=train_sampler)
By default, GPU operations are asynchronous, which allows more computation to be executed in parallel. But copying data between CPU and GPU, or between GPUs, is synchronous by default, e.g. torch.Tensor.to(), torch.Tensor.cuda() and torch.nn.Module.to(). These functions accept a non_blocking argument (previously named async). When non_blocking is set, the copy is performed asynchronously with respect to the host if possible, e.g. when moving CPU tensors with pinned memory to CUDA devices.
#https://github.com/pytorch/examples/blob/master/imagenet/main.py#L270-L272
input = input.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)
These methods provide a larger bandwidth between the host (CPU) and the device (GPU) and improve data-transfer performance.
# At the beginning of the script
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# When loading data
image, label = image.to(device), label.to(device)
# Create the model
model = Model().to(device)
# https://pytorch.org/docs/stable/notes/cuda.html#device-agnostic-code
import argparse
import torch
parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
args.device = torch.device('cuda')
else:
args.device = torch.device('cpu')
# When loading the data
for i, x in enumerate(train_loader):
x = x.to(args.device)
# When creating the model
model = Model().to(args.device)
Actually, it's a brief conclusion. So in practice, we should:
Post cover image from Quick Guide for setting up PyTorch with Window in 2 mins
CSE455: Computer Vision - Spring 2018
I saw this course on pjreddie's GitHub page and found it interesting.👍
It is an undergraduate course offered by the School of Computer Science and Engineering at the University of Washington. I did the assignments out of personal interest.😋
My solutions to the assignments include the code to finish the homework plus the extra things to get the credits.
Use the -std=c99 flag to tell the compiler to use C99. Still, I think it's cooler to do the declarations outside the loop:
int i, j, k;
for (i = 0; i < im.c; ++i){
for (j = 0; j < im.h; ++j){
for (k = 0; k < im.w; ++k){
/*body*/
}
}
}
Use if(expression) for single-line things and if (expression){ for multiple lines. And always use if(1) or if(0) to enable/disable a code snippet.
if(!sum) return;
if (a == LOGISTIC){
d.data[i][j] *= x * (1 - x);
} else if (a == RELU){
d.data[i][j] *= x > 0 ? 1 : 0;
} else if (a == LRELU){
d.data[i][j] *= x > 0 ? 1 : 0.1;
}
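The same element-wise gradients as a Python sketch (names are mine; the homework's C version operates on the matrix struct):

```python
# Sketch: d/dx of the activations above, applied to the upstream gradient d.
def grad(activation, x, d):
    if activation == "logistic":  # note: x is the logistic *output* here
        return d * x * (1 - x)
    if activation == "relu":
        return d * (1 if x > 0 else 0)
    if activation == "lrelu":
        return d * (1 if x > 0 else 0.1)
    raise ValueError(activation)

print(grad("logistic", 0.5, 1.0))  # 0.25
print(grad("lrelu", -2.0, 1.0))    # 0.1
```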
if(0){
/*disabled body*/
} else
{
/*enabled body*/
}
So I can search for if(0) to locate the snippet and flip the switch quickly.
Prefer ++i when I have a choice.
Makefile: TODO, should write a gist for it.
Compile with OpenCV (using MinGW).
Structs with pointers inside
When we define a struct with at least one pointer in it:
typedef struct matrix{
int rows, cols;
double **data;
int shallow;
} matrix;
We should write a function to allocate and initialize memory for it, for safety and convenience:
matrix make_matrix(int rows, int cols)
{
matrix m;
m.rows = rows;
m.cols = cols;
m.shallow = 0;
m.data = calloc(m.rows, sizeof(double *));
int i;
for(i = 0; i < m.rows; ++i) m.data[i] = calloc(m.cols, sizeof(double));
return m;
}
And also a function to free the memory:
void free_matrix(matrix m)
{
if (m.data) {
int i;
if (!m.shallow) for(i = 0; i < m.rows; ++i) free(m.data[i]);
free(m.data);
}
}
Remember to call it manually to free the memory and avoid a ⚠️segmentation fault.
And also a function for deep copy (if necessary):
matrix copy_matrix(matrix m)
{
int i,j;
matrix c = make_matrix(m.rows, m.cols);
for(i = 0; i < m.rows; ++i){
for(j = 0; j < m.cols; ++j){
c.data[i][j] = m.data[i][j];
}
}
return c;
}
Never use a struct with a pointer inside it as an intermediate variable in an expression
In ./vision-hw4/src/classifier.c, I used to write things like this:
// THIS IS TOTALLY WRONG!
matrix backward_layer(layer *l, matrix delta)
{
// back propagation through the activation
gradient_matrix(l->out, l->activation, delta);
// calculate dL/dw and save it in l->dw
free_matrix(l->dw);
matrix dw = matrix_mult_matrix(transpose_matrix(l->in), delta);
l->dw = dw;
// calculate dL/dx and return it.
matrix dx = matrix_mult_matrix(delta, transpose_matrix(l->w));
return dx;
}
It is totally wrong because the intermediate struct variables transpose_matrix(l->in) and transpose_matrix(l->w) will never ever be freed. This stupid Python-like convenient writing will fill your memory up with intermediate garbage until it runs out, and finally throw a ⚠️segmentation fault.
The right way to do this is:
matrix backward_layer(layer *l, matrix delta)
{
// back propagation through the activation
gradient_matrix(l->out, l->activation, delta);
// calculate dL/dw and save it in l->dw
free_matrix(l->dw);
matrix inT = transpose_matrix(l->in);
matrix dw = matrix_mult_matrix(inT, delta);
free_matrix(inT);
l->dw = dw;
// calculate dL/dx and return it.
matrix wT = transpose_matrix(l->w);
matrix dx = matrix_mult_matrix(delta, wT);
free_matrix(wT);
return dx;
}
String things could cause fatal mistakes
After finishing my code in ./vision-hw4, I trained the model on my Windows laptop and it worked well. But when I tried to do the same thing on Linux, the training procedure just crashed, giving me 0% training and test accuracy.
After debugging, I found that I had accidentally changed the line ending of the file mnist.labels from LF to CRLF, which is the default on Windows.
This converts all \n (line break on Linux) to \r\n (line break on Windows). So num0\n becomes num0\r\n in mnist.labels, and so does the rest.
Now see the char *fgetl(FILE *fp) function in ./src/data.c. This function parses labels from the text file and stores them for the training and test phases.
char *fgetl(FILE *fp)
{
if(feof(fp)) return 0;
size_t size = 512;
char *line = malloc(size*sizeof(char));
if(!fgets(line, size, fp)){
free(line);
return 0;
}
size_t curr = strlen(line);
while((line[curr-1] != '\n') && !feof(fp)){
if(curr == size-1){
size *= 2;
line = realloc(line, size*sizeof(char));
if(!line) {
fprintf(stderr, "malloc failed %ld\n", size);
exit(0);
}
}
size_t readsize = size-curr;
if(readsize > INT_MAX) readsize = INT_MAX-1;
fgets(&line[curr], readsize, fp);
curr = strlen(line);
}
if(line[curr-1] == '\n') line[curr-1] = '\0';
return line;
}
And most importantly, this function looks for \n as the marker of a line ending. So the label num0 becomes num0\r, and so do all the other labels.
At the training phase, all the training samples will be considered negative, and so will the test samples. Surprisingly but reasonably, I got 0% for both training and test accuracy.
Remember: use LF as the default option.
More Extra Credit of vision-hw2 (spherical coordinates)
All GPU implementations have been ignored
Written by ivanpp for fun, contact me: ding@ivanpp.me
make_convolutional_layer
convolutional_layer make_convolutional_layer(int batch, int h, int w, int c, int n, int groups, int size, int stride, int padding, ACTIVATION activation, int batch_normalize, int binary, int xnor, int adam)
{
int i;
// create a convolutional_layer(layer) type variable l, initialize all struct members to 0.
convolutional_layer l = {0};
l.type = CONVOLUTIONAL;
// Get the params
l.groups = groups; // optional: weight sharing across 'groups' channels
l.h = h; // input height
l.w = w; // input width
l.c = c; // input channels
l.n = n; // num of filters
l.binary = binary; // optional: ?
l.xnor = xnor; // optional: ?
l.batch = batch; // num of image per batch
l.stride = stride; // stride of the conv operation
l.size = size; // kernel size of filters
l.pad = padding; // padding of the conv operation
l.batch_normalize = batch_normalize; // optional: bn after conv
// Allocate memory (for conv weight and conv weight_update)
l.weights = calloc(c/groups*n*size*size, sizeof(float)); // stored as (n*(c/groups)*size*size)
l.weight_updates = calloc(c/groups*n*size*size, sizeof(float));
l.biases = calloc(n, sizeof(float));
l.bias_updates = calloc(n, sizeof(float));
l.nweights = c/groups*n*size*size; // num of params for l.weights
l.nbiases = n; // num of params for l.biases
// Initialize weights to random_uniform
float scale = sqrt(2./(size*size*c/l.groups));
for(i = 0; i < l.nweights; ++i) l.weights[i] = scale*rand_normal();
// Allocate memory (for forward and backward)
int out_w = convolutional_out_width(l); // compute output width
int out_h = convolutional_out_height(l); // compute output height
l.out_h = out_h;
l.out_w = out_w;
l.out_c = n; // output channel should be num of filter, n
l.outputs = l.out_h * l.out_w * l.out_c;
l.inputs = l.w * l.h * l.c;
l.output = calloc(l.batch*l.outputs, sizeof(float)); // for conv output(forward pass)
l.delta = calloc(l.batch*l.outputs, sizeof(float)); // for prev layer's gradient(backward pass)
// Assign forward, backward and update function
l.forward = forward_convolutional_layer;
l.backward = backward_convolutional_layer;
l.update = update_convolutional_layer;
if(binary){
l.binary_weights = calloc(l.nweights, sizeof(float));
l.cweights = calloc(l.nweights, sizeof(char));
l.scales = calloc(n, sizeof(float));
}
if(xnor){
l.binary_weights = calloc(l.nweights, sizeof(float));
l.binary_input = calloc(l.inputs*l.batch, sizeof(float));
}
if(batch_normalize){
l.scales = calloc(n, sizeof(float));
l.scale_updates = calloc(n, sizeof(float));
for(i = 0; i < n; ++i){
l.scales[i] = 1;
}
l.mean = calloc(n, sizeof(float));
l.variance = calloc(n, sizeof(float));
l.mean_delta = calloc(n, sizeof(float));
l.variance_delta = calloc(n, sizeof(float));
l.rolling_mean = calloc(n, sizeof(float));
l.rolling_variance = calloc(n, sizeof(float));
l.x = calloc(l.batch*l.outputs, sizeof(float));
l.x_norm = calloc(l.batch*l.outputs, sizeof(float));
}
if(adam){
l.m = calloc(l.nweights, sizeof(float));
l.v = calloc(l.nweights, sizeof(float));
l.bias_m = calloc(n, sizeof(float));
l.scale_m = calloc(n, sizeof(float));
l.bias_v = calloc(n, sizeof(float));
l.scale_v = calloc(n, sizeof(float));
}
l.workspace_size = get_workspace_size(l);
l.activation = activation; // which activation to use
fprintf(stderr, "conv %5d %2d x%2d /%2d %4d x%4d x%4d -> %4d x%4d x%4d %5.3f BFLOPs\n", n, size, size, stride, w, h, c, l.out_w, l.out_h, l.out_c, (2.0 * l.n * l.size*l.size*l.c/l.groups * l.out_h*l.out_w)/1000000000.);
return l;
}
Describe Sth please
Optional params:
Optional params | Forward | Backward | Update | Usage | Defined in |
---|---|---|---|---|---|
l.groups | Y | Y | N | | [convolutional] |
l.binary | TODO | TODO | TODO | | [convolutional] |
l.xnor | TODO | TODO | TODO | | [convolutional] |
l.batch_normalize | Y | Y | N | Regularization | [convolutional] |
adam | | | | Optimization Algorithm | [net] |
Batch normalization and Adam will not be covered in this blog.
Image(or image like) input *net.input (given by the net) has the size of $[batch\times c\times h\times w]$, consider it as a $(batch, c, h, w)$ matrix.
Conv filter *l.weights has the size of $[n\times \frac{c}{groups}\times\ size\times size]$, consider it as a $(n, \frac{c}{groups}, size, size)$ matrix.
Conv bias *l.biases has the size of $[n]$, namely 1 float bias for 1 filter.
Conv output *l.output has the size of $[batch\times n\times out_h\times out_w]$, consider it as a $(batch, n, out_h, out_w)$ matrix.
Conv workspace *l.workspace should have the size of $[\frac{c}{groups}\times out_h\times out_w\times size\times size]$. But the actual size of *net.workspace (the workspace shared by all conv/deconv/local layers in a net) is chosen to suit the layer that needs the most workspace memory. All conv/deconv/local layers share the net's workspace, so most of them use only part of it.
Use default value for all optional params:
net.adam = 0;
l.groups = 1;
l.batch_normalize = 0;
l.binary = 0;
l.xnor = 0;
We can get the minimal implementation:
// minimal implementation of conv forward
void forward_convolutional_layer_min(convolutional_layer l, network net)
{
int i;
fill_cpu(l.outputs*l.batch, 0, l.output, 1);
int m = l.n;
int k = l.size*l.size*l.c;
int n = l.out_w*l.out_h;
for(i = 0; i < l.batch; ++i){
float *a = l.weights;
float *b = net.workspace;
float *c = l.output + i*n*m;
im2col_cpu(net.input + i*l.c*l.h*l.w, l.c, l.h, l.w, l.size, l.stride, l.pad, b);
gemm(0,0,m,n,k,1,a,k,b,n,1,c,n);
}
add_bias(l.output, l.biases, l.batch, l.n, l.out_h*l.out_w);
activate_array(l.output, l.outputs*l.batch, l.activation);
}
Now *net.input remains the same, has the size of $[batch\times c\times h\times\ w]$. For one single image in current batch, it has the size of $[c\times h\times w]$. And conv filter matrix has the shape of $(n,c,size,size)$.
Instead of using for loops to do conv operations at each input location using all the filters, we use im2col and then just do matrix multiplication.
// src/im2col.c
void im2col_cpu(float* data_im,
int channels, int height, int width,
int ksize, int stride, int pad, float* data_col)
{
int c,h,w;
int height_col = (height + 2*pad - ksize) / stride + 1; // height after reconstruct
int width_col = (width + 2*pad - ksize) / stride + 1; // width after reconstruct
int channels_col = channels * ksize * ksize; // filter(channels, ksize, ksize)
for (c = 0; c < channels_col; ++c) { // flatten the filter
int w_offset = c % ksize; // from which column
int h_offset = (c / ksize) % ksize; // from which row
int c_im = c / ksize / ksize; // from which channel
for (h = 0; h < height_col; ++h) { // iterate the reconstructed img(out_h*out_w)
for (w = 0; w < width_col; ++w) {
// mapping reconstructed img to padded img(which row)
int im_row = h_offset + h * stride;
// mapping reconstructed img to padded img(which col)
int im_col = w_offset + w * stride;
// index of the data_col(reconstructed img)
int col_index = (c * height_col + h) * width_col + w;
data_col[col_index] = im2col_get_pixel(data_im, height, width, channels,
im_row, im_col, c_im, pad); // mapping pixel by pixel
}
}
}
}
im2col_cpu() accepts 2 pointers as input. The input pointer (a.k.a. image data pointer) points to the start address of the image input, net.input + i*l.c*l.h*l.w. The output pointer (a.k.a. col data pointer) points to the start address of the workspace, b = net.workspace.
im2col_cpu() reconstructs image data $(c,h,w)$ into col data $(c\times size\times size, out_h\times out_w)$.
gemm() stands for General Matrix Multiplication. So the weight matrix $(n, c\times size\times size)$ multiplies the col data matrix $(c\times size\times size, out_h\times out_w)$, and we finally get the output matrix $(n, out_h, out_w)$ for one single image. Note that the pointer *l.output already points to the right place: float *c = l.output + i*n*m.
And for batch images, we will get the $(batch, n, out_h, out_w)$ output for *l.output.
Use add_bias() to add bias to *l.output and use activate_array() to pass through some chosen activation function. Conv forward done!
Just a review:
Each image has been divided into l.groups groups, or more specifically, grouped by channels. So each group of an image has the shape $(\frac{c}{groups},h,w)$. The size of the filters is $[n\times \frac{c}{groups}\times size\times size]$; in other words, there are $n$ filters of shape $(\frac{c}{groups}, size, size)$.
We don't use all the $n\times (\frac{c}{groups}, size, size)$ kernels to do the conv operation with all $groups\times (\frac{c}{groups},h,w)$ partial-channel images; we group the filters as well as the image (actually the image channels) first. The conv kernels have also been divided into l.groups groups, so each filter group has $\frac{n}{groups}$ filters of shape $(\frac{c}{groups},size,size)$.
The image (channel) groups and filter groups are in one-to-one correspondence. $Group_j$ filters are only responsible for $Group_j$ of the image, like a sort of conv pair.
int i, j;
int m = l.n/l.groups; // num of filters
int k = l.size*l.size*l.c/l.groups; // len of filter
int n = l.out_w*l.out_h; // len of output per output channel
for(i = 0; i < l.batch; ++i){
for(j = 0; j < l.groups; ++j){
float *a = l.weights + j*l.nweights/l.groups;
float *b = net.workspace;
float *c = l.output + (i*l.groups + j)*n*m;
// use im2col_cpu() to reconstruct input for each (input, weight) pair
im2col_cpu(net.input + (i*l.groups + j)*l.c/l.groups*l.h*l.w,
l.c/l.groups, l.h, l.w, l.size, l.stride, l.pad, b);
// conv operation(actually matrix multiplication) for one pair
gemm(0,0,m,n,k,1,a,k,b,n,1,c,n);
}
}
*a has the start address of $Group_j$ filters
*b has the start address of the workspace
*c has the start address of the output for $Group_j$ of $image_i$
net.input + (i*l.groups + j)*l.c/l.groups*l.h*l.w gives the address for $Group_j$ of $image_i$
Using im2col_cpu(), each $(\frac{c}{groups},h,w)$ partial-channel image will get the 'partial-channel col data' of shape $(\frac{c}{groups}\times size\times size, out_h\times out_w)$. Along with its $(\frac{n}{groups},\frac{c}{groups}\times size\times size)$ filter pair, do the matrix multiplication, and the outcome will be $(\frac{n}{groups},out_h,out_w)$.
Concatenating $groups \times (\frac{n}{groups},out_h,out_w)$ output, we will get $(n,out_h,out_w)$ output for one image as usual. The output shape remains the same, regardless of using this group thing.
Just a review:
E.G. No fuckin examples because it is stupid.
*l.delta has the same size as *l.output, as it will store the gradients w.r.t. the output of the current conv layer.
What forward_convolutional_layer() should compute for each input $x$ is $y=x\ast W$, and what it actually does is to compute $y=W\times x_{col}$.
So for backward_convolutional_layer(), it computes:
$\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y}\cdot \frac{\partial y}{\partial W}=\frac{\partial L}{\partial y} \times {x_{col}}^T$
$\frac{\partial L}{\partial x_{col}}=\frac{\partial L}{\partial y}\cdot \frac{\partial y}{\partial x_{col}}=W^T\times \frac{\partial L}{\partial y}$
void backward_convolutional_layer(convolutional_layer l, network net)
{
int i, j;
int m = l.n;
int n = l.size*l.size*l.c;
int k = l.out_w*l.out_h;
// gradients pass through activation function
gradient_array(l.output, l.outputs*l.batch, l.activation, l.delta);
if(l.batch_normalize){
backward_batchnorm_layer(l, net);
} else {
backward_bias(l.bias_updates, l.delta, l.batch, l.n, k);
}
for(i = 0; i < l.batch; ++i){
for(j = 0; j < l.groups; ++j){
float *a = l.delta + (i*l.groups + j)*m*k;
float *b = net.workspace;
float *c = l.weight_updates + j*l.nweights/l.groups;
float *im = net.input+(i*l.groups + j)*l.c/l.groups*l.h*l.w;
im2col_cpu(im, l.c/l.groups, l.h, l.w,
l.size, l.stride, l.pad, b);
// compute gradients w.r.t. weights
gemm(0,1,m,n,k,1,a,k,b,k,1,c,n);
if(net.delta){ // if gradient descent continues(not the first layer)
a = l.weights + j*l.nweights/l.groups;
b = l.delta + (i*l.groups + j)*m*k;
c = net.workspace;
// compute gradients w.r.t. the reconstructed inputs(x_col)
gemm(1,0,n,k,m,1,a,n,b,k,0,c,k);
// reconstruct the im_col using col2im_cpu, restore the structure,
// and get the gradients w.r.t. the inputs
col2im_cpu(net.workspace, l.c/l.groups, l.h, l.w, l.size, l.stride,
l.pad, net.delta + (i*l.groups + j)*l.c/l.groups*l.h*l.w);
}
}
}
}
$X$ has the shape of $(batch,c,h,w)$, and $X_{col}$ has the shape of $(batch, c\times size\times size, out_h, out_w)$.
For $Group_j$ in $Image_i$, $x_{col}^{ij}$ should have the shape of $(\frac{c}{groups}\times size\times size,out_h\times out_w)$.
$W$ has the shape of $(n,\frac{c}{groups},size,size)$. And what is responsible for $x_{col}^{ij}$, $w^{ij}$ has the shape of $(\frac{n}{groups},\frac{c}{groups}\times size\times size)$.
$\frac{\partial L}{\partial Y}$ has the shape of $(batch,n,out_h,out_w)$. What we need at a time is $\frac{\partial L}{\partial y^{ij}}$, has the shape of $(\frac{n}{groups},out_h\times out_w)$.
Given $\frac{\partial L}{\partial y^{ij}}$ and $x_{col}^{ij}$, call gemm(TA=0, TB=1, ...), we will get $\frac{\partial L}{\partial w^{ij}}=\frac{\partial L}{\partial y^{ij}}\times {x_{col}^{ij}}^T$, which has the shape of $(\frac{n}{groups}, \frac{c}{groups}\times size\times size)$. And will finally get the $\frac{\partial L}{\partial W}$, which will be stored in the memory block started from *l.weight_updates, of the size $(n,\frac{c}{groups},size,size)$.
Also, given $w^{ij}$ and $\frac{\partial L}{\partial y^{ij}}$, call gemm(TA=1, TB=0), and we will get $\frac{\partial L}{\partial x_{col}^{ij}}={w^{ij}}^T\times \frac{\partial L}{\partial y^{ij}}$, which has the shape of $(\frac{c}{groups}\times size\times size, out_h\times out_w)$. And we will finally get $\frac{\partial L}{\partial X_{col}}$ of the shape $(batch, c\times size\times size, out_h, out_w)$, stored starting from *net.workspace; $X_{col}$ will be overwritten.
Using col2im_cpu(), $\frac{\partial L}{\partial X_{col}}$ will be reconstructed to $\frac{\partial L}{\partial X}$, stored in the memory block that starts from *net.delta, of the size $(batch,c,h,w)$.
Just a review (again...):
void update_convolutional_layer(convolutional_layer l, update_args a)
{
float learning_rate = a.learning_rate*l.learning_rate_scale;
float momentum = a.momentum;
float decay = a.decay;
int batch = a.batch;
axpy_cpu(l.n, learning_rate/batch, l.bias_updates, 1, l.biases, 1);
scal_cpu(l.n, momentum, l.bias_updates, 1);
if(l.scales){
axpy_cpu(l.n, learning_rate/batch, l.scale_updates, 1, l.scales, 1);
scal_cpu(l.n, momentum, l.scale_updates, 1);
}
axpy_cpu(l.nweights, -decay*batch, l.weights, 1, l.weight_updates, 1);
axpy_cpu(l.nweights, learning_rate/batch, l.weight_updates, 1, l.weights, 1);
scal_cpu(l.nweights, momentum, l.weight_updates, 1);
}
*l.weight_updates has the same size as *l.weights, and *l.bias_updates has the same size as *l.biases.