During the semester break, I chose to explore the Alps instead of enjoying the 18 hours of daylight in Scandinavia. It was exam season at TUM when I visited Munich. I sat in front of the Parabola Slide in the CIT building and joined the welcome meeting via Zoom. Following their advice, I immediately started my house hunt in Munich. After more than 100 applications, I finally got a cozy place in Schwabing.
Apart from getting a place to stay, this "30-day (NOT) free trial" gave me an idea of what my life would be like for the following year. I also got to know the most delicious dishes in the Mensa well in advance.
I also travelled to Switzerland🇨🇭, where "The Notorious B.A.H.N" (not a German rapper) drove me to my bff. It is a beautiful but expensive heaven.
I went back to Sweden in August with the feeling that something had gone wrong. The Swedish visa extension I had applied for in April 2023 went totally silent. I tried to contact my case officer and had some really bad experiences. I decided to start my "Plan B". I contacted TUM and the German embassy, and they gave me all the information I needed to start the German visa application.
I got a Termin in October in Stockholm. Although a lot of documents were needed, the whole process only took one month.
This was the most difficult, most stressful part of my first semester of exchange study. From October, the start of TUM's winter semester, till the day I got my passport back from the German embassy, the stress piled up. It was my partner's and friends' support that paved my way to Munich.
Finally, I arrived in Munich again at the end of November. Like every other settler here, I had to anmeldung my address, get a bank account, find a family doctor... Everything made me happy because I was finally here.
I lost my position in the Praktikum because I could not attend in person, but for other courses I was not lagging behind too much (thanks to TUM-Live). Oktoberfest passed me by, but we prepared for and celebrated Christmas together.
As a bigger university (of ~50k students), TUM has more types of courses, with finer granularity. Most lecture-based courses are worth 6 or 8 credits, seminars 4 credits, and Praktikums 10 credits. Lecture-based courses are examined by written exams; seminars are evaluated by a presentation and/or report. Some Praktikums are lab courses, evaluated by lab reports and presentations; others are project courses, evaluated by a final project.
Exams at my home university last 4 hours, and getting 80% of the points gives you the highest grade (5.0). At TUM, you only have 90-120 minutes to finish a lot of questions; this is called an "Überhangklausur" (overhang exam). But don't panic: to get the highest grade (1.0), you don't need full points or a fixed percentage of points. You ONLY have to beat your peers and land in the top x% of hundreds of students (the grading distribution or bar is defined by the examiner).
One important thing to know: once you pass an exam, you can never attend any of its retakes. For someone who cares about grades, getting a 4.0 (the lowest passing grade) is the biggest nightmare. Many students fail an exam on purpose when they find it's not worth passing this time😱.
As a fan of FCB for more than 15 years, of course I am happy to move to Munich. Now I live only about 1300 kilometers away from my favorite team - FC Barcelona❤️💙!
I boulder 2-3 times a week at Boulderwelt, but Fysiken Klätterlabbet Centrum will always be my favorite bouldering gym.
The adventure is a continuous and differentiable function; I'm still exploring and optimizing it, and I know which direction to go.
Instant NGP speeds up the training of the original NeRF by 1000x, while still using a neural network to implicitly store the scene. What is the magic in it?
The trainable multiresolution hash encoding permits the use of a smaller neural network without sacrificing quality, and remains general. Several techniques are used to make the encoding work better on modern GPUs.
The trainable features are arranged into $L=16$ levels of hash tables, each mapped to one resolution of a virtual 3D voxel grid. Given a 3D location $(x,y,z)$, on each level of the voxel grid we interpolate a feature vector from the feature vectors of its 8 integer corners (4 corners in 2D, as shown in the picture). All feature vectors ($F$-dimensional) of these integer corners are stored in a static data structure, i.e. a hash table of size $T$. So for each location of interest, at each of the $L$ levels, we look up the hash table 8 times and interpolate to get a feature vector of size $F$. Then we concatenate the feature vectors of all levels with an auxiliary feature vector (which can be anything!) of size $E$. Finally, we get a feature vector of size $(L\times F + E)$.
Note that this process can be done efficiently in parallel. For all pixels we try to render at a time, we load one level of the hash table into the GPU cache, do the hash, look up the feature vectors of all these pixels, then interpolate. Then we move on to the next level and do the same thing. Finally, all the interpolated features are concatenated, along with the auxiliary input, and become the input of the neural network.
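The per-level lookup described above is easy to sketch in code. Below is a toy pure-Python version for the 2D case (4 corners per level, bilinear interpolation); the level count, table size, growth factor, and hash primes here are illustrative assumptions, not the paper's exact hyperparameters:

```python
import math
import random

# A toy sketch of the multiresolution hash encoding in 2D.
L, T, F = 4, 2**10, 2                  # levels, table size, features per entry
PRIMES = (1, 2654435761)               # per-dimension primes of the spatial hash

def h(ix, iy):
    """Hash an integer grid corner into an index of the level's table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % T

def encode_level(x, y, table, res):
    """Bilinearly interpolate the F-dim features at (x, y) on one grid level."""
    gx, gy = x * res, y * res
    x0, y0 = math.floor(gx), math.floor(gy)
    wx, wy = gx - x0, gy - y0
    out = [0.0] * F
    for dx, dy, w in ((0, 0, (1 - wx) * (1 - wy)), (1, 0, wx * (1 - wy)),
                      (0, 1, (1 - wx) * wy),       (1, 1, wx * wy)):
        feat = table[h(x0 + dx, y0 + dy)]   # one hash lookup per corner
        for f in range(F):
            out[f] += w * feat[f]
    return out

random.seed(0)
tables = [[[random.uniform(-1e-4, 1e-4) for _ in range(F)] for _ in range(T)]
          for _ in range(L)]
res = [16 * 2**l for l in range(L)]        # resolution grows per level
enc = sum((encode_level(0.3, 0.7, tables[l], res[l]) for l in range(L)), [])
print(len(enc))  # L*F = 8 interpolated features, before appending the aux input
```

Concatenating the per-level outputs (plus the auxiliary input of size $E$) yields the final encoding that feeds the small MLP.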
The efficiency of a static hash table is better than dynamic structures like trees, and it is more general. When the resolution of a certain level is larger than the hash table size, there will be hash collisions, but they are automatically resolved by the multiresolution structure and the interpolation. (The chance that 2 different locations get the same final feature vector input is near zero.)
The proposed hash encoding is already highly efficient, and several techniques are tailored to improve it even more.
The hash table entries are stored in half precision, and mixed-precision training is used. That enables faster training and faster inference.
As mentioned before, the hash tables are evaluated level by level, so at any given time only some levels of the hash tables reside in caches, where they are reused over and over again.
More importantly, the use of the multiresolution hash encoding makes it possible to use a smaller neural network without sacrificing quality.
Instant NGP uses a highly optimized fully-fused MLP, which is 5-10x faster than a TensorFlow implementation (e.g. the one in the original NeRF).
By using a relatively small neural network and making good use of the GPU, instant NGP gets its neural-network part close to voxel-lookup speed.
Voxel-based methods store the scene in 3D voxels, like storing image data in 2D pixels. To know the attributes of a given position (3D coordinates), a simple look-up is enough. Methods like Plenoxels use voxels in place of a neural network to significantly speed up the pipeline. But storing a high-resolution scene needs excessive memory, and that amount of storage makes a simple look-up not so simple anymore. When training, huge amounts of voxel data need to be transferred into memory and cache repeatedly; these memory operations bound the speed (though it is still relatively fast).
Theoretically, voxel-based methods can be faster when we have more memory and cache. Neural-network-based methods trade memory footprint for compute; they can be faster if we make the computation more effective.
For a standard neural network with a fixed batch size, the compute cost is $O(M)$ and the memory cost is $O(M^2)$, where $M$ is the number of neurons per layer. For bigger neural networks it is wise to focus on optimizing computation, but for smaller ones, memory is the most important thing.
They made their neural network so small that the whole network fits into the on-chip memory of the GPU. When evaluating the network (imagine ray marching and querying thousands of values at the same time), each thread block can run the whole network independently, using the weights and biases stored in on-chip memory.
The authors are from NVIDIA; they know their hardware well and they know CUDA well, so they implemented instant NGP in CUDA and integrated it with the fully-fused MLPs of the tiny-cuda-nn framework. With a carefully tailored neural network and good use of the NVIDIA GPU, a 5-10x speedup is achieved compared with the TensorFlow version.
Overall, instant NGP takes 10-100x fewer steps than naïve dense stepping, which means 10-100x fewer queries of the neural network.
Typically, larger scenes have more empty regions, and coarser details are less noticeable. An exponential step size is used so that the computation scales gracefully with scene size.
A multi-scale occupancy grid is maintained to indicate where the space is empty. For empty space we don't have to run the neural network, hence computation is saved. (A little extra memory, but much less computation.)
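The empty-space skipping can be sketched as follows. This is a toy pure-Python setup with a hand-made occupancy pattern in a unit cube (the real implementation maintains a multi-scale bitfield on the GPU); only samples that land in occupied cells would trigger a network query:

```python
# Toy sketch: skip empty space with a binary occupancy grid during ray marching.
G = 64                                    # occupancy grid resolution (assumed)
occ = [[[ (24 <= i < 40) and (24 <= j < 40) and (24 <= k < 40)
          for k in range(G)] for j in range(G)] for i in range(G)]

def march(origin, direction, n_steps=256):
    """Return the sample points along the ray that need a network query."""
    queries = []
    for s in range(n_steps):
        t = s / (n_steps - 1)
        p = [o + t * d for o, d in zip(origin, direction)]
        cell = [min(G - 1, max(0, int(c * G))) for c in p]   # grid cell of p
        if occ[cell[0]][cell[1]][cell[2]]:   # occupied -> query the network
            queries.append(p)
    return queries

q = march((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
print(f"{len(q)} of 256 samples need a network query")
```

With the occupancy box covering only the center of the cube, most of the 256 dense samples are skipped for free.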
Here is a recursive function generating Fibonacci numbers.
int fib (int z) {
    int r;
    if (z == 0)
        r = 0;
    else if (z == 1)
        r = 1;
    else
        r = fib(z-1) + fib(z-2);
    return r;
}
The first two items of the Fibonacci sequence are 0 and 1, so it is very natural to write the code like this. But it is under-optimized, in a "Fibonacci way".
Now let's do the WCET (worst-case execution time) analysis, assuming each declaration, comparison, assignment, and return takes 1 time unit, each ALU operation (add/sub) takes 4, and each function call takes 2 (these unit costs appear in the annotated snippets below).
Let $f(z)$ be the WCET of fib(z), so $f(0)$ is the WCET of executing fib(0). For different values of z, the code being executed is different, with a different length and a different WCET.
Now divide the code snippet into 3 paths and analyze them separately:
// path a (z=0): 4
int fib (int z) {
    int r;          // declaration (1)
    if (z == 0)     // compare (1)
        r = 0;      // assignment (1)
    return r;       // return (1)
}
For path a, the WCET of fib(0) is 4, that is $f(0)=4$.
// path b (z=1): 5
int fib (int z) {
    int r;           // declaration (1)
    if (z == 0)      // compare (1)
        ;
    else if (z == 1) // compare (1)
        r = 1;       // assignment (1)
    return r;        // return (1)
}
For path b, $f(1)=5$, because one more comparison is executed in the if-else chain.
// path c (z>=2): 21+f(z-1)+f(z-2)
int fib (int z) {
    int r;           // declaration (1)
    if (z == 0)      // compare (1)
        ;
    else if (z == 1) // compare (1)
        ;
    else
        r = fib(z-1) + fib(z-2); // 3*4 + 2*2 + 1 + f(z-1) + f(z-2)
    return r;        // return (1)
}
The line r = fib(z-1) + fib(z-2); takes 3 ALU operations (add/sub), 2 function calls, and 1 assignment, and it contains the cost of the 2 recursive fib calls. So $f(z)=21+f(z-1)+f(z-2)$ for path c. You may also have noticed that, as in path b, one more comparison is executed than in path a.
For an if-else chain, the structure/placement matters, especially when it is called recursively:
if (cond1)
    func1(); // executes after cond1
else if (cond2)
    func1(); // executes after cond1, cond2
else if (cond3)
    func1(); // executes after cond1, cond2, cond3
else
    func1(); // executes after cond1, cond2, cond3
Let me use some real numbers. If we want to compute the WCET of fib(5), writing $a=f(0)=4$, $b=f(1)=5$, and $c=21$ for the per-call overhead of path c, we'll have:
$$
\begin{align*}
f(2)&=f(1)+f(0)+c=a+b+c \\
f(3)&=f(2)+f(1)+c=a+2b+2c \\
f(4)&=f(3)+f(2)+c=2a+3b+4c \\
f(5)&=f(4)+f(3)+c=3a+5b+7c
\end{align*}
$$
The WCET of fib(5) is composed of some path a, more path b, and much more path c. Try to look at it vertically, and you may understand why I said this is under-optimized in a Fibonacci way.
The weight of path a grows in a Fibonacci way; the weight of path b grows in the same fashion, but one step ahead (one step in Fibonacci...). And path c? It also grows in a Fibonacci fashion, but each time 1 is added to the weight (for the execution of path c itself).
For this natural version of the code, the WCET of fib(5) is $3\times 4+5\times 5+7\times 21=184$.
int fib (int z) {
    int r;
    if (z > 1)
        r = fib(z-1) + fib(z-2);
    else if (z == 1)
        r = 1;
    else
        r = 0;
    return r;
}
If we swap the positions of path c and path a, it saves one comparison for path c and adds one to path a, which makes the WCET of the new fib function $3\times 5 + 5\times 5 + 7\times 20=180$. The difference is 4 time units of comparison.
For a computation of fib(10), the difference will be 54 time units.
For a computation of fib(20), the difference will be 6764 time units...
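The numbers above are easy to check mechanically. Here is a small Python sketch of my own (with the unit costs read off the annotated snippets: 4 or 5 for the base-case paths, and 21 or 20 of per-call overhead for the recursive path) that evaluates both variants of the recurrence:

```python
from functools import lru_cache

def wcet(n, base0, base1, rec_overhead):
    """WCET of fib(n): base cases cost base0/base1; the recursive path adds
    rec_overhead on top of the WCETs of the two recursive calls."""
    @lru_cache(maxsize=None)
    def f(z):
        if z == 0:
            return base0
        if z == 1:
            return base1
        return rec_overhead + f(z - 1) + f(z - 2)
    return f(n)

# Original ordering: path a = 4, path b = 5, path c overhead = 21
print(wcet(5, 4, 5, 21))                        # 184
# Reordered (z > 1 tested first): path a = 5, path b = 5, path c overhead = 20
print(wcet(5, 5, 5, 20))                        # 180
print(wcet(10, 4, 5, 21) - wcet(10, 5, 5, 20))  # 54
print(wcet(20, 4, 5, 21) - wcet(20, 5, 5, 20))  # 6764
```

The saving grows in a Fibonacci fashion too, since the weight of path c itself does.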
GEMM stands for general matrix multiply; it is a "level 3" routine of BLAS (Basic Linear Algebra Subprograms), a specification for common linear algebra operations. GEMM is also widely used in areas like computer vision and machine learning.
The formula of GEMM is:
$C=\alpha AB+\beta C$
where $A$, $B$, and $C$ are matrices, and $\alpha$ and $\beta$ are scalar constants.
Here is a very neat GEMM implementation written in C, from the well-known neural network framework darknet.
void gemm_cpu(int TA, int TB, int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float BETA,
        float *C, int ldc)
{
    int i, j;
    for(i = 0; i < M; ++i){
        for(j = 0; j < N; ++j){
            C[i*ldc + j] *= BETA;
        }
    }
    if(!TA && !TB)
        gemm_nn(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
    else if(TA && !TB)
        gemm_tn(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
    else if(!TA && TB)
        gemm_nt(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
    else
        gemm_tt(M, N, K, ALPHA, A, lda, B, ldb, C, ldc);
}
By default, matrices $A$ and $B$ are not transposed (TA=0 && TB=0), which means:
*A is a 1-d array which stores an $(M, K)$ matrix
*B is a 1-d array which stores a $(K, N)$ matrix
*C is a 1-d array which stores an $(M, N)$ matrix, and it will be used to store the final result
$C=\beta C$ is computed first for better efficiency, and then the $C=\alpha AB+C$ part is done.
Matrices $A$, $B$, and $C$ are all stored in row-major order, which means elements of the same row are stored consecutively in memory. (This doesn't mean all elements of the matrix are stored consecutively in memory.)
Elements of the matrix used in the gemm function are not necessarily stored consecutively in memory? A little counter-intuitive, right? To explain this, I need to introduce the leading dimension (arguments lda, ldb, and ldc).
Actually, the elements of a matrix are stored consecutively in memory, but when multiplying matrices, sometimes we want to use part of an existing matrix as the input/output, not all of it.
Suppose we have a $(6, 8)$ matrix $Q$ in our memory (row-major order), and we want to do the matrix multiply on part of it, a $(3, 4)$ matrix $q$.
Apparently, the elements of matrix $q$ are not stored consecutively in memory. Instead of copying the data first and then doing the gemm, we can do the gemm directly if we use the right parameters *A, M, K, and most importantly, lda. In this example:
TA=0 means matrix $Q$, and of course matrix $q$, is row-major, i.e. not transposed.
lda=8 means the leading dimension (the number of columns in this case) of the matrix stored in memory is $8$, which is the dimension of matrix $Q$.
K=4 means the dimension (the number of columns in this case) of the matrix used for the gemm is $4$, which is the dimension of matrix $q$.
M=3 means the number of rows is $3$ for matrix $q$.
A=Q+10 means the first element of matrix $q$ is the 11th element of matrix $Q$; the starting address and offset are given together.
These are all we need for one input/output of the gemm function. And if you're familiar with numpy, here's an example in Python:
import numpy as np
# 1-d array Q, to get the idea how it is stored in memory
Q = np.arange(6 * 8)
print(Q)
# 2-d array QQ, how we understand the matrix, with 2-d shape information
QQ = Q.reshape(6, 8)
print(QQ)
# to help you understand the C explanation above
lda = QQ.shape[1] # 8
K = 4
M = 3
offset = 10
# these are all we need to get the q, or to use it directly in gemm function
q = QQ[offset//lda: offset//lda+M, offset%lda: offset%lda+K]
# q = QQ[1:4, 2:6]
print(q)
After the easy part $C=\beta C$ is done, $\alpha AB$ is computed. The storage order of matrices $A$ and $B$ must be considered and taken care of.
Matrices can be stored in row-major or column-major order. Row-major order is used for C-style arrays; that means, by default, elements of the same row are stored consecutively. But in some cases (an example below shows a situation that benefits from it) we need to store matrices in column-major order. Storing a matrix in column-major order under a row-major convention is equivalent to storing the transpose of the original matrix in memory.
Now you may see why we need the int TA and int TB parameters in our gemm function. In our simple example, TA=0 means matrix $A$ is stored in row-major order, and TA!=0 means matrix $A$ is stored in column-major order; you could also say that the transpose of $A$, which is $A^T$, is stored in memory.
// if (TA == 0 && TB == 0)
void gemm_nn(int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float *C, int ldc)
{
    int i, j, k;
    #pragma omp parallel for
    for(i = 0; i < M; ++i){
        for(k = 0; k < K; ++k){
            register float A_PART = ALPHA*A[i*lda+k];
            for(j = 0; j < N; ++j){
                C[i*ldc+j] += A_PART*B[k*ldb+j];
            }
        }
    }
}
When TA==0 && TB==0, the snippet above is used to compute $C=C+\alpha AB$.
// if (TA == 0 && TB != 0)
void gemm_nt(int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float *C, int ldc)
{
    int i, j, k;
    #pragma omp parallel for
    for(i = 0; i < M; ++i){
        for(j = 0; j < N; ++j){
            register float sum = 0;
            for(k = 0; k < K; ++k){
                sum += ALPHA*A[i*lda+k]*B[j*ldb + k];
            }
            C[i*ldc+j] += sum;
        }
    }
}
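To make the flat row-major indexing with lda/ldb/ldc concrete, here is a small pure-Python port of gemm_nn (my own illustrative sketch, not darknet code), multiplying a $(2, 3)$ by a $(3, 2)$ matrix, both stored as flat row-major arrays:

```python
def gemm_nn(M, N, K, ALPHA, A, lda, B, ldb, C, ldc):
    """C += ALPHA * A @ B, with all matrices as flat row-major lists."""
    for i in range(M):
        for k in range(K):
            a_part = ALPHA * A[i * lda + k]
            for j in range(N):
                C[i * ldc + j] += a_part * B[k * ldb + j]

A = [1, 2, 3,
     4, 5, 6]          # (2, 3) matrix, lda = 3
B = [1, 0,
     0, 1,
     1, 1]             # (3, 2) matrix, ldb = 2
C = [0, 0,
     0, 0]             # (2, 2) result, ldc = 2
gemm_nn(2, 2, 3, 1.0, A, 3, B, 2, C, 2)
print(C)  # [4.0, 5.0, 10.0, 11.0]
```

To operate on a submatrix, you would pass the full matrix's column count as lda and slice the flat list at the right offset, just like the pointer arithmetic in the C version.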
cvgear 0.1.0 was released on 20 May, 2020.
CVGear means Computer Vision Gear. It is under the MIT License and contains computer vision gears for good use.
TorchNestedLoader
TorchNestedLoader allows you to save/load between different modules that share the same logical structure.
Suppose we have a SimpleNet:
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels=3,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False
        )
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(
            in_channels=32,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False
        )
        self.bn2 = nn.BatchNorm2d(32)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        return x
simplenet = SimpleNet()
The structure of SimpleNet is:
SimpleNet(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
And we have a more "wrapped" version WrappedSimpleNet:
import torch.nn as nn

class Conv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        norm = kwargs.pop("norm", None)
        super().__init__(*args, **kwargs)
        self.norm = norm

    def forward(self, x):
        x = super().forward(x)
        if self.norm is not None:
            x = self.norm(x)
        return x

class WrappedSimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = Conv2d(
            in_channels=3,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False,
            norm=nn.BatchNorm2d(32)
        )
        self.conv1 = Conv2d(
            in_channels=32,
            out_channels=32,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False,
            norm=nn.BatchNorm2d(32)
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.conv1(x)
        return x
wrappedsimplenet = WrappedSimpleNet()
The structure of WrappedSimpleNet is:
WrappedSimpleNet(
  (stem): Conv2d(
    3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
    (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (conv1): Conv2d(
    32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
    (norm): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
The logical structure of SimpleNet and WrappedSimpleNet is exactly the same, but they differ in submodule names and tree structure. So you cannot easily save/load the state_dict between these two modules using the .state_dict() and .load_state_dict() methods.
But with TorchNestedLoader, you can save/load a nested_dict between these two modules easily:
from cvgear.framework.torch import TorchNestedLoader
simplenetloader = TorchNestedLoader(simplenet)
wrappedsimplenetloader = TorchNestedLoader(wrappedsimplenet)
# save as nested_dict
nested_dict = simplenetloader.nested_dict()
# load nested_dict
wrappedsimplenetloader.load_nested_dict(nested_dict)
Imagine you have just implemented a state-of-the-art model in torch and want to test it. Training the model from scratch would be time-consuming. Have you ever downloaded a pre-trained model from the Internet and then found it painful to load into your model manually?
Use TorchNestedLoader as your gear!
DarknetParser
Everyone loves darknet. It is very fast and in the public domain.
The configuration file of a darknet network is often long and tedious (due to its sequential structure) and hard to read through. With DarknetParser, the network configuration file can be parsed easily, and you get a clear sense of the network structure from the information it displays.
Without darknet installed:
from cvgear.framework.darknet import DarknetParser, build_darknet_parser
# create a DarknetParser instance, then load network configuration
darknet53 = DarknetParser("darknet53")
darknet53.load_darknet_cfg("path/to/darknet53.cfg")
# or build a DarknetParser from network configuration file directly
darknet53 = build_darknet_parser("path/to/darknet53.cfg")
print(darknet53)
Crystal clear!
DarknetNestedLoader
Save/load torch modules...
Parse darknet networks...
What about saving/loading darknet networks?
Even more: save/load between a darknet network and a torch.nn.Module! DarknetNestedLoader is made for saving/loading a darknet network (DarknetParser) as binary weights (a .weights file) or as a nested_dict.
from cvgear.framework.darknet import DarknetNestedLoader, build_darknet_nested_loader
# create a DarknetNestedLoader instance with DarknetParser, then load from binary weights file
darknet53loader = DarknetNestedLoader(darknet53)
darknet53loader.load_darknet_weights("path/to/darknet53.weights")
# or build a DarknetNestedLoader from network configuration file and binary weights file
darknet53loader = build_darknet_nested_loader("path/to/darknet53.cfg", "path/to/darknet53.weights")
# save weights to nested_dict
nested_dict = darknet53loader.nested_dict()
# load nested_dict to a torch.nn.Module with TorchNestedLoader
# ...
- DarknetParser describes a darknet network (as torch.nn.Module describes a torch module)
- DarknetNestedLoader can save/load a darknet network as a nested_dict or a binary file
- TorchNestedLoader can save/load a torch module as a nested_dict
- With DarknetNestedLoader and TorchNestedLoader, you can convert between darknet weights and torch weights easily.
That is little cvgear 0.1.0.
More gears are coming up...
Happy inauguration!🎉🎉🎉
New research starts with understanding, reproducing and verifying previous results in the literature. Detectron2 made the process easy for computer vision tasks.
This post covers the #installation, #demo and #training of detectron2 on Windows.
update:
2020/07/08
Learning detectron2 starts with installation.
REM "Create a conda environment named 'detectron2' with the latest version of Python 3.7.x"
conda create --name detectron2 python=3.7
REM "Activate the conda environment for 'detectron2'"
conda activate detectron2
Note: all required python packages will be installed in this environment (including detectron2 itself). Make sure to activate the environment with conda activate detectron2 before you do anything with detectron2. Deactivate the environment with conda deactivate to go back to your previous working environment.
The latest version of detectron2 requires pycocotools >= 2.0.1. Install it with pip install pycocotools>=2.0.1 on Linux.
But on Windows, you should first download pycocotools-2.0.1.tar.gz from PyPI.
Unzip it, then edit pycocotools-2.0.1\setup.py:
replace extra_compile_args=['-Wno-cpp', '-Wno-unused-function', '-std=c99'] with extra_compile_args={'gcc': ['/Qstd=c99']},
Back in the command prompt, install pycocotools into the site-packages of the current environment (detectron2):
cd pycocotools-2.0.1
python setup.py build_ext install
If it works, you should see the message Finished processing dependencies for pycocotools==2.0.1, and then you can delete the pycocotools directory if you like:
cd ..
RMDIR /S pycocotools-2.0.1
Check your CUDA version first:
nvcc --version
It should be ≥ 9.2 (that is 9.2, 10.0, or 10.1). Go to https://pytorch.org/get-started/locally/, select your CUDA version, and copy the command (e.g. for CUDA 10.1 it should be):
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
The official version doesn't support Windows currently. To build and use it successfully on Windows, you should edit some files: File 1, File 2, File 3, File 4, File 5, File 6.
The repository ivanpp/detectron2 contains the latest version of official detectron2 with the Windows patches mentioned above. So the easy way is to clone and build it:
git clone https://github.com/ivanpp/detectron2.git
cd detectron2
pip install -e .
Or use the official version:
git clone https://github.com/facebookresearch/detectron2.git
Then edit the files mentioned above and build it:
cd detectron2
pip install -e .
Note: it may take a while to build all the .cu and .cpp files, be patient!
Check the installation:
python -m detectron2.utils.collect_env
The result should look like:
Make sure the NVCC version of detectron2 matches the NVCC version of PyTorch. If not, you may have chosen the wrong version in Step 2.
Choose a model from the model zoo, set the input config file, and specify the corresponding MODEL.WEIGHTS for it.
python demo/demo.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
--input datasets/coco/unlabeled2017/000000000361.jpg ^
--output output.jpg ^
--opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x/137260431/model_final_a54504.pkl
Note:
- Model weights with the detectron2:// prefix are resolved by Detectron2Handler; see detectron2/detectron2/checkpoint/catalog.py for details.
- The downloaded model is cached at %USERPROFILE%/.torch/fvcore_cache if the $FVCORE_CACHE environment variable is not set (on Linux, the default cache directory is ~/.torch/fvcore_cache); see fvcore/fvcore/common/file_io.py for details.
- You can also download the model file manually, then pass --opts MODEL.WEIGHTS PATH/TO/model_final_a54504.pkl.
All the config files are made for 8-GPU training. To reproduce the results on 1 GPU, there are changes to be made. For example, to reproduce the result of configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml, you can edit the corresponding .yaml file (mask_rcnn_R_50_FPN_1x.yaml or Base-RCNN-FPN.yaml) or overwrite the training parameters on the command line.
Inconvenient but once-and-for-all way:
Edit configs\Base-RCNN-FPN.yaml:
SOLVER:
  IMS_PER_BATCH: 2
  BASE_LR: 0.0025
  STEPS: (480000, 640000)
  MAX_ITER: 720000
Train the model:
python tools/train_net.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
OUTPUT_DIR output/mask_rcnn_r50_fpn_1x
Convenient way:
Simply overwrite it through command line, no need to edit any file:
python tools/train_net.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 ^
SOLVER.MAX_ITER 720000 SOLVER.STEPS (480000,640000) ^
OUTPUT_DIR output/mask_rcnn_r50_fpn_1x
All the checkpoints and the final model will be stored in the OUTPUT_DIR we defined, output/mask_rcnn_r50_fpn_1x, along with the tensorboard event file, the log file, and more. A comprehensive model config file is generated automatically (output/mask_rcnn_r50_fpn_1x/config.yaml).
Training may shut down sometimes, manually or accidentally. To resume training, simply run:
python tools/train_net.py ^
--config-file output/mask_rcnn_r50_fpn_1x/config.yaml ^
--resume
The training will be resumed from the last checkpoint automatically; there is no need to specify the checkpoint unless you need to for some reason.
Use tensorboard to visualize the training progress during or after training:
tensorboard --logdir output
Detectron2 evaluates the final model after training finishes. To evaluate the performance of any checkpoint:
python tools/train_net.py ^
--config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml ^
--eval-only MODEL.WEIGHTS /path/to/checkpoint_file
Conda cheat sheet
Conda documentation
Conda installs packages from the default channel if no channel is specified. Use conda config --show channels to see the current channel list.
Subject to the GFW, downloading can be very slow in mainland China. The once-and-for-all solution is breaking the wall or leaving the mainland, both hard to achieve.
You can circumvent the GFW through a proxy or use a domestic channel as an alternative. I prefer the former.
Suppose you have a local socks5 proxy listening on port 1080; simply modify the .condarc:
proxy_servers:
  http: socks5://127.0.0.1:1080
  https: socks5://127.0.0.1:1080
Two ways to edit the channel list:
Create .condarc in %UserProfile%/.conda, follow the YAML syntax, and override the channel list configuration like:
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
Channels are organized from highest to lowest priority.
REM "Add to the top of the channel list"
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
OR
REM "Add to the bottom of the channel list"
conda config --append channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --append channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
Use the -c or --channel flag to add an additional channel to search when installing:
conda install numpy --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
And use the --override-channels flag to skip the channel list in .condarc:
conda install numpy --override-channels --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
REM "Search packages and display detailed information"
conda search PKGNAME --info
REM "Specify version"
conda install PKGNAME==3.14
REM "Specify channel"
conda install --channel conda-forge
conda install -c conda-forge
REM "Specify environment"
conda install PKGNAME --name ENVNAME
conda install PKGNAME -n ENVNAME
REM "Install from local directory"
conda install PATH/TO/PKGNAME.tar.bz2 --offline
conda update PKGNAME --name ENVNAME
conda remove PKGNAME --name ENVNAME
Conda allows you to create environments containing different packages and even different python version that will not interact with other environments.
Conda will install package to base
environment as default if no environment is specified when installing.
Here are some convenient commands to manage conda environments:
REM "Current environment is highlighted with an asterisk(*)"
conda info --envs
Or just go to %UserProfile%/.conda/environments.txt
REM "Create environment with specific python version"
conda create --name ENVNAME python=VERSION
REM "Create environment to specific path"
conda create --prefix PATH/TO/ENVNAME
REM "Create environment from .yaml file"
conda env create --file PATH/TO/environment.yaml
REM "Create environment from .txt file"
conda create --name ENVNAME --file PATH/TO/spec-file.txt
REM "Create environment from existing environment"
conda create --name NEWNAME --clone OLDNAME
conda activate ENVNAME
conda deactivate
conda list --name ENVNAME
conda install --name ENVNAME pip
conda activate ENVNAME
pip <pip_subcommand>
REM "example: pip install PKGNAME -f LINK/LOCAL_PATH"
REM "Export to .yaml file"
conda activate ENVNAME
conda env export > PATH/TO/environment.yaml
REM "Export spec list as txt file"
conda list --name ENVNAME --explicit > PATH/TO/spec-file.txt
conda remove --name ENVNAME --all
Happy Chinese New Year🎉! It's the seventh day of the first lunar month and the end of the holiday. So I decided to do something meaningful and get rid of some bad shit.
The ringleader of China's increasingly closed Internet, WeChat, has kidnapped my family and friends and forces me to use its so-called social software (spyware, actually). And I'm not asking for network neutrality, because we're far, far away from it. That spyware does something worse: it scans my disk for private data, runs content review to decide what I can see, and filters what I say without telling me👿.
Since that imp enslaves me and I have no escape, I decided to let my Telegram enslave that imp, and then talk to my Telegram as an equal, freely, without getting my hands dirty.
Libraries to use:
It's important to know that EWS is still in alpha, so it's unstable and changes rapidly; that's why the versions installed below are pinned.
Talk to @BotFather to create your bot: send /newbot, then set its name (WeChat Slave) and its username (panda_wechat_bot).
Send /setprivacy and set the status to Disable.
Send /setjoingroups and set the status to Enable.
Optional:
Set bot's profile photo: /setuserpic
Set bot's description: /setdescription
Set bot's about text: /setabouttext
Set commands helper: /setcommands
Commands helper:
help - Show commands list.
link - Link a remote chat to a group.
chat - Generate a chat head.
info - Display information of the current Telegram chat.
update_info - Update the group name and profile picture.
unlink_all - Unlink all remote chats from a group.
extra - Access additional features from Slave Channels.
Ask @BotFather for your bot's token with /token and record it, e.g. 123456789:EXAMPLEOF5BOTTOEKN5TOACCESS5HTTPAPI.
Ask @get_id_bot for your Chat ID and record it, e.g. 716124421.
sudo apt update
sudo apt install -y python3 python3-pip python3-pil python3-setuptools python3-numpy python3-yaml python3-requests
sudo apt install -y ffmpeg libmagic-dev libwebp-dev screen
pip3 install imageio==2.4.0
pip3 install ehforwarderbot==2.0.0b13
pip3 install efb-telegram-master==2.0.0b18
pip3 install efb-wechat-slave==2.0.0a16
mkdir -p ~/.ehforwarderbot/profiles/default
vim ~/.ehforwarderbot/profiles/default/config.yaml
Set the master and slave:
master_channel: "blueset.telegram"
slave_channels:
- "blueset.wechat"
mkdir -p ~/.ehforwarderbot/profiles/default/blueset.telegram
vim ~/.ehforwarderbot/profiles/default/blueset.telegram/config.yaml
Set token to the bot token recorded before, to access the bot.
And set admins to the Chat ID recorded before, so that only you can access it.
token: "123456789:EXAMPLEof5BOTtoken5toaccess5HTTPAPI"
admins:
- 716124421
screen ehforwarderbot
Post cover image from Self-Censorship in China Continues, Extends to Mobile Apps
Note: This post is based on AlexeyAB/darknet version, the procedure of pjreddie/darknet version may differ slightly (have not tried, maybe identical).
8 steps to build your own deep learning lego module in Darknet:
Define LAYER_TYPE
Add LAYER_TYPE for your custom layer in layer.h
typedef enum {
    // ...
    CUSTOM
} LAYER_TYPE;
Define layer string
Add layer string for your custom layer in parser.c
LAYER_TYPE string_to_layer_type(char * type)
{
// ...
if (strcmp(type, "[custom]")==0) return CUSTOM;
}
Then Darknet will be able to recognize your custom layer in the cfg file:
[net]
#...
[custom]
#...
Implement your custom layer in custom_layer.c and custom_layer.h. It should contain at least these 4 functions:
layer make_custom_layer(int batch, int w, int h, .....);
void forward_custom_layer(const layer l, network_state state);
void backward_custom_layer(const layer l, network_state state);
void resize_custom_layer(layer *l, int w, int h);
(optional) If you want to train it with GPU, implement these:
#ifdef GPU
void forward_custom_layer_gpu(const layer l, network_state state);
void backward_custom_layer_gpu(const layer l, network_state state);
#endif
In parser.c, include the header of your custom layer (to use make_custom_layer()):
#include "custom_layer.h"
Implement the parse function:
layer parse_custom(list *options, size_params params)
{
int param1 = option_find_int(options, "param1", 1);
//...
layer l = make_custom_layer(params.batch, params.w, params.h, param1, ...);
l.param2 = option_find_float(options, "param2", .1);
//...
return l;
}
Add your parse function in parse_network_cfg_custom()
:
network parse_network_cfg_custom(char *filename, int batch)
{
//...
while(n){
//...
LAYER_TYPE lt = string_to_layer_type(s->type);
if(lt == CONVOLUTIONAL){
l = parse_convolutional(options, params);
}else if(lt == CUSTOM){
l = parse_custom(options, params);
}
}
//...
return net;
}
In network.c, include the header of your custom layer (to use resize_custom_layer()):
#include "custom_layer.h"
Modify int resize_network(network *net, int w, int h)
function:
int resize_network(network *net, int w, int h)
{
//...
for (i = 0; i < net->n; ++i){
layer l = net->layers[i];
if(l.type == CONVOLUTIONAL){
resize_convolutional_layer(&l, w, h);
}else if(l.type == CUSTOM){
resize_custom_layer(&l, w, h);
}
}
//...
}
[optional] If your custom layer is used to produce results (like YOLO, REGION or DETECTION):
Implement custom_num_detections() and get_custom_detections() in custom_layer.c, then modify 2 functions in network.c (to count the detections and fetch the detections):
int num_detections(network *net, float thresh)
{
int i;
int s = 0;
for (i = 0; i < net->n; ++i) {
layer l = net->layers[i];
if (l.type == CUSTOM) {
s += custom_num_detections(l, thresh);
}
//...
}
return s;
}
void fill_network_boxes(network *net, int w, int h, float thresh, float hier, int *map, int relative, detection *dets, int letter)
{
int prev_classes = -1;
int j;
for (j = 0; j < net->n; ++j) {
layer l = net->layers[j];
if (l.type == CUSTOM){
int count = get_custom_detections(...);
//...
}
//...
}
}
Add custom_layer.c and custom_layer.h to your Visual Studio solution build/darknet.sln, or add custom_layer.o to your Makefile.
Rebuild your project
Post cover image from Lego Store | Copenhagen
This blog post records some thoughts from my customization of the Rime input method 😄
Before doing this, one thing must be clear: what is customization?
Customization means tailoring things to your own needs.
Rime has a huge number of customizable components, but a component being customizable is no reason to customize it. Fundamentally, it should be driven by my own habits: customizing @ivanpp's own Rime.
This post was written in Traditional Chinese characters, because in the process of learning and using this software I came to feel the cultural significance of Traditional Chinese. Given my environment, though, it's not convenient for me to use Traditional characters frequently, so this small gesture is my way of showing respect!
For the color scheme I went with the default ps4 preset; I'm also very happy with the default font and size, so rather than customizing for customization's sake, I simply picked this scheme.
This is likewise configured in the weasel.custom.yaml file, which is why I write about it here; honestly, I don't know why the customization of this feature is placed there.
Very simple: to match my needs, I use the default English input in bash, cmd, Atom, MSVS and similar software.
Rime supports a great many input schemas; I kept only the three I need:
Just like calling up the console with ~ in Counter-Strike 1.6, I use Control_L + ~ to bring up Rime's switcher menu. I removed the default F4 binding because I find it hard to remember and it often causes hotkey conflicts.
Option 1 is pinned to the schema currently in use, and option 2 is pinned to the mode-switching menu. In fact, choosing either 1 or 2 takes you to the mode-switching menu. It makes sense: when you are already in the schema you would select, rather than performing a meaningless no-op, option 1 becomes a concrete mode-switching button. In actual use this feels very comfortable.
I rearranged the mode-switching menu itself as follows. Although full-width punctuation already looks deprecated, I never go through the menu to switch between Chinese/Western text or Chinese/Western punctuation (I just press Shift_L), so those entries are actually used less and deserve a lower position. As a result, I get a convenient Traditional/Simplified toggle, reachable by either 1 2 or 2 1, with no hotkey conflicts to worry about, while 1 1 and 2 2 become true no-ops. In fact, when you've called up the menu and realize you don't know what you wanted, you need a no-op or a cancel. From my own use, when my mind is active I quickly hit 1 1 as a no-op and keep typing, while when I'm thinking or sluggish I tend to press Esc for the same effect.
I use the following Chinese/English switching setup to handle different situations:
Control_L is set to commit_code. When I'm in Chinese mode, about to type a passage of English, and only notice after the first word, pressing Enter and then Shift_L to switch modes is tedious; Control_L immediately commits the English I've already typed to the screen and switches to English mode, saving quite some effort.
Also, when there is no pending input, Control_L can serve as a Chinese/English toggle too, but my little finger is already glued to Shift_L, so I doubt I'll use it for that much.
Shift_L is set to inline_ascii. With no pending input it toggles Chinese/English, which I use a lot. Here it has another use: amid a lot of Chinese I need to insert a phrase 'greater than 1' word long, and once again I forgot to switch to English mode, or deliberately didn't. Then all I have to do is press Shift_L after typing the first letter, finish the whole phrase, or sentence, or email address, and press Enter. More effective!
I often type email addresses and file names, so _ and @ must not commit the text immediately; it has to be possible to finish the whole address or file name.
Of course, if I remember to switch to Western mode beforehand, that works fine too. Same for emails:
Convenient~
For '朙月拼音·简化字' (Luna Pinyin · Simplified), the schema I use most, I made some customizations and extensions to make my own use more convenient.
My most-used schema is '朙月拼音·简化字' (below, 'the Simplified schema'), and what I do most is coding. Based on these two facts, I banned Chinese punctuation from the Simplified schema, thoroughly! Program errors caused by Chinese punctuation drive me crazy, and a full set of Western punctuation also works fine in daily chat. At least it feels comfortable to me.😃
I added the macOS-style paging keys [ ] while keeping all the defaults, including the Emacs-style ones. In practice I still use - + most often.
Pinyin is the schema I'm most fluent in, and I rarely type rare characters, so I don't need stroke input for reverse lookup. So I defined ~ as the custom-phrase prefix:
~f for common emoji, 😋
~m for math symbols, ±
~ar for arrows, ↑
and of course plenty more! I even buried a few easter eggs for myself 😂
The main thing was extending the English vocabulary, to satisfy my own (frequent 😄) mixed Chinese-English input.
It's stuffed with plenty of 'personal extras':
Shift + Control + 1/2/3/4/5 maps to the five menu options, where 1 means: next input schema.
For me, though, pressing Control_L + ~ and then the specific number is actually quicker (palm plus two fingers).
Use Control + Delete or Shift + Delete to delete a wrong word from the dictionary.
Files to back up: default.custom.yaml, weasel.custom.yaml, luna_pinyin_simp.custom.yaml, plus the punctuation definition file ivanpp_punc.yaml, the dictionary definition file ivanpp_dict.extended.dict.yaml, and all dictionary files it uses.
You can also back up a dictionary snapshot periodically: luna_pinyin.userdb.txt, located in the user folder.
RIME also provides a GUI for backing up and merging dictionary snapshots and for exporting/importing the text code table.
So... in a few months, I'll export the text code table and see which words I use most! 😋
The convolutional layer before a yolo layer should have filters=n*(4+1+classes), where n is the number of prior anchors used in the following yolo layer, namely sizeof(mask), and classes is the number of classes.
[convolutional]
size=1
stride=1
pad=1
filters=75
activation=linear
The shape of the input tensor is $(b, n*(4+1+classes), h, w)$. More specifically, it's the concatenation of $n$ individual $(4+1+classes, h, w)$ tensors per image. It's actually a 1-D array, but imagine it as a $(b, n, 4+1+classes, h, w)$ tensor: the innermost dimensions are $w$ and $h$ (with $w$ varying fastest), and the next dimension is $(4+1+classes)$. So for all b images and all n anchors, we have a $(4+1+classes)$ prediction vector at each location, and the stride between that prediction's elements is l.h*l.w.
static int entry_index(layer l, int batch, int location, int entry)
{
int n = location / (l.w*l.h);
int loc = location % (l.w*l.h);
return batch*l.outputs + n*l.w*l.h*(4+l.classes+1) + entry*l.w*l.h + loc;
}
location should have a value between 0 and l.n*l.h*l.w-1; it gives both the prior-anchor index n and the spatial location loc. entry should be between 0 and 4+1+classes-1; it gives the index along the third dimension (the prediction vector).
In the yolo layer, the net predicts offsets from the bounding-box prior width and height. Three options, mask, num and anchors, configure the prior anchor boxes.
In cfg file:
[net]
width=416
height=416
[yolo]
mask = 0,1,2
num=9
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
In c source file:
// https://github.com/pjreddie/darknet/blob/master/src/parser.c#L306-L342
int total = option_find_int(options, "num", 1);
int num = total;
char *a = option_find_str(options, "mask", 0); // char *a = "0,1,2";
int *mask = parse_yolo_mask(a, &num); // int *mask = {0, 1, 2};
num in the cfg file (total in the source) is the total number of prior anchors available to the entire network. mask in the cfg file gives the indices of the prior anchors used in the current yolo layer, so we can define lots of anchors and use only a few of them per yolo layer. anchors in the cfg file gives all num available anchors as $(p_w, p_h)$ pairs. The anchor sizes $(p_w, p_h)$ are actual pixel values on the network's input image, in this case $(416, 416)$. So $(10, 13)$ is a prior anchor 10 pixels wide and 13 pixels high on the resized $(416, 416)$ input image.
# Example
#yolo_layer0
[yolo]
mask = 0,1
num=3
anchors = 10,13, 16,30, 33,23
#yolo_layer1
[yolo]
mask = 1,2
num = 3
anchors = 10,13, 16,30, 33,23
#yolo_layer2
[yolo]
num=2
In the example above, yolo_layer0 uses anchors $(10, 13), (16, 30)$, yolo_layer1 uses anchors $(16, 30), (33, 23)$, and yolo_layer2 does not use prior anchors; in fact, yolo_layer2 falls back to 2 $(0.5, 0.5)$ anchors by default.
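The cfg parsing above can be sketched in Python (function names here are mine, not darknet's):

```python
# Sketch: parse the "mask" and "anchors" strings from a [yolo] section.
def layer_anchors(mask_str, anchors_str):
    mask = [int(v) for v in mask_str.split(",")]
    vals = [float(v) for v in anchors_str.split(",")]
    pairs = list(zip(vals[0::2], vals[1::2]))  # (p_w, p_h) pairs
    return [pairs[i] for i in mask]            # only the anchors used in this layer

# yolo_layer0 from the example above: mask = 0,1
print(layer_anchors("0,1", "10,13, 16,30, 33,23"))  # [(10.0, 13.0), (16.0, 30.0)]
```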
The yolo layer predicts l.n*l.h*l.w bounding boxes per image (l.n is the length of *mask, namely the number of prior anchors used in the current yolo layer). Each predicted bounding box has one objectness score, which gives its $Pr(Object)$. What we want is an objectness of 1 for all positive samples and 0 for all negative ones.
[yolo]
ignore_thresh = .5
truth_thresh = .9
Two kinds of predictions are considered positive:
Among all num prior anchors centered in the same cell as the GT Bbox (ground-truth bounding box), the anchor whose shape is most similar to the GT Bbox becomes the only anchor responsible for it. In other words, at most one best prior anchor is allocated to each GT Bbox in the current yolo layer. And if this best anchor is not used in the current yolo layer (its index is not in the layer's *mask), no anchor is allocated for that GT Bbox.
Among all l.n*l.h*l.w predictions, if the highest IoU between a prediction and all ground-truth bounding boxes is greater than truth_thresh, that prediction becomes responsible for the GT Bbox giving the highest IoU.
Additionally, the yolo layer sets truth_thresh = 1 by default. Since IoU is always less than or equal to 1, the second situation never happens. So the yolo layer penalizes at most 1 (of l.n*l.h*l.w) prediction per GT Bbox, penalizing its objectness for not being 1.
There is also an ignore_thresh for the negative (background) definition: if the highest IoU between a prediction and all GT Bboxes is less than or equal to ignore_thresh, that prediction is assigned as negative, and its objectness score is penalized for not being 0.
*output is the input *state.input of the last convolutional layer, namely the prediction tensor, and *delta is the gradient of the yolo layer. index gives the index of the first class probability $Pr(Class_0|Object)$ for a certain batch b, a certain anchor n and a certain position w, h. Remember we have b images, n anchors per position and w*h locations. class is the ground-truth class and classes gives the number of classes. stride is always l.w*l.h, and *avg_cat is for statistics, to compute the average class probability.
void delta_yolo_class(float *output, float *delta, int index, int class, int classes, int stride, float *avg_cat)
{
int n;
if (delta[index]){ // if some anchor is responsible for more than one GT
delta[index + stride*class] = 1 - output[index + stride*class];
if(avg_cat) *avg_cat += output[index + stride*class];
return;
}
for(n = 0; n < classes; ++n){ // common situation
// penalize Pr(Classi|Object) for all classes
delta[index + stride*n] = ((n == class)?1 : 0) - output[index + stride*n];
if(n == class && avg_cat) *avg_cat += output[index + stride*n];
}
}
Given the index of $Pr(Class_0|Object)$, delta_yolo_class penalizes $Pr(Class_i|Object)$ for every $Class_i$: it wants $Pr(Class_{i=gt}|Object)$ to be 1 and the others to be 0. And if some lucky anchor is responsible for more than one ground-truth box, those GT boxes may or may not contain the same class; it simply overwrites the gradient for the other ground truth's class probability and leaves the rest alone. For example, with 20 classes, if some lucky anchor is responsible for 2 different classes (say a dog and a cat) in some naughty image, it penalizes $Pr(Class_{i=dog}|Object)$ and $Pr(Class_{i=cat}|Object)$ for not being 1 and all the others for not being 0.
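The same logic as a short Python sketch (stride collapsed to 1 and names are mine; the C version works on the strided flat array):

```python
# Sketch of delta_yolo_class for one anchor's class probabilities.
def delta_class(output, delta, gt_class):
    if delta[0]:  # gradients already written: anchor responsible for another GT
        delta[gt_class] = 1 - output[gt_class]  # overwrite only the new GT class
        return delta
    for i in range(len(output)):  # common case: penalize every class probability
        delta[i] = (1 if i == gt_class else 0) - output[i]
    return delta

probs = [0.2, 0.5, 0.1]                       # toy Pr(Class_i|Object), 3 classes
delta = delta_class(probs, [0.0] * 3, gt_class=1)
print(delta)                                  # [-0.2, 0.5, -0.1]
delta = delta_class(probs, delta, gt_class=2) # same anchor, a second GT class
print(delta)                                  # [-0.2, 0.5, 0.9]
```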
int obj_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4); // index of objectiveness
avg_anyobj += l.output[obj_index]; // sum the objectiveness for all pred box
l.delta[obj_index] = 0 - l.output[obj_index]; // common situation, low iou
if (best_iou > l.ignore_thresh) { // best_iou > ignore_thresh -> ignored, don't penalize the objectness
l.delta[obj_index] = 0;
}
if (best_iou > l.truth_thresh) { // never gonna happen when l.truth_thresh = 1
l.delta[obj_index] = 1 - l.output[obj_index];
int class_id = state.truth[best_t*(4 + 1) + b*l.truths + 4]; // get the class_id of the GT box
if (l.map) class_id = l.map[class_id];
int class_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4 + 1);
delta_yolo_class(l.output, l.delta, class_index, class_id, l.classes, l.w*l.h, 0);
box truth = float_to_box_stride(state.truth + best_t*(4 + 1) + b*l.truths, 1);
delta_yolo_box(truth, l.output, l.biases, l.mask[n], box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.delta, (2-truth.w*truth.h), l.w*l.h);
}
The network predicts 4 coordinates for each bounding box, $t_x, t_y, t_w, t_h$.
$\sigma(t_x)$ and $\sigma(t_y)$ are the box-center position relative to the cell. $t_w$ and $t_h$ predict how much greater or smaller the bounding box is than the prior anchor. For example, if $t_w > 0$, then $\mathrm{e}^{t_w} > 1$, and we will have $b_w > p_w$.
$$
\begin{align}
b_x & = \sigma(t_x) + c_x \\
b_y & = \sigma(t_y) + c_y \\
b_w & = p_w\mathrm{e}^{t_w} \\
b_h & = p_h\mathrm{e}^{t_h} \\
\end{align}
$$
$b_x$ and $b_y$ are the pixel distances from the top-left corner of the current feature map $(l.w, l.h)$. And since $p_w$ and $p_h$ are actual pixel values, $b_w$ and $b_h$ are actual pixel values on the resized network input image. To get a normalized prediction, $b_x$ and $b_y$ should be divided by the current feature-map size lw and lh; similarly, $b_w$ and $b_h$ should be divided by the resized network input size w and h.
box get_yolo_box(float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, int stride)
{
box b;
b.x = (i + x[index + 0*stride]) / lw;
b.y = (j + x[index + 1*stride]) / lh;
b.w = exp(x[index + 2*stride]) * biases[2*n] / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
return b;
}
get_yolo_box converts the prediction $\sigma(t_x), \sigma(t_y), t_w, t_h$ to a normalized box struct instance. Inversely, to compute the gradients of the bounding-box prediction, we should convert the already-normalized ground-truth label box truth back to $\sigma(\hat{t}_x), \sigma(\hat{t}_y), \hat{t}_w, \hat{t}_h$.
float delta_yolo_box(box truth, float *x, float *biases, int n, int index, int i, int j, int lw, int lh, int w, int h, float *delta, float scale, int stride)
{
box pred = get_yolo_box(x, biases, n, index, i, j, lw, lh, w, h, stride);
float iou = box_iou(pred, truth);
float tx = (truth.x*lw - i);
float ty = (truth.y*lh - j);
float tw = log(truth.w*w / biases[2*n]);
float th = log(truth.h*h / biases[2*n + 1]);
delta[index + 0*stride] = scale * (tx - x[index + 0*stride]);
delta[index + 1*stride] = scale * (ty - x[index + 1*stride]);
delta[index + 2*stride] = scale * (tw - x[index + 2*stride]);
delta[index + 3*stride] = scale * (th - x[index + 3*stride]);
return iou;
}
So using delta_yolo_box, we can convert a normalized bounding-box label to $\sigma(\hat{t}_x), \sigma(\hat{t}_y), \hat{t}_w, \hat{t}_h$ and then subtract $\sigma(t_x), \sigma(t_y), t_w, t_h$ to get the gradients. But if we only did the subtraction, large bounding boxes would exploit their size to dominate the gradient. To compensate, we multiply the gradients by scale to magnify the gradients of relatively small GT bounding boxes. Setting scale to 2-truth.w*truth.h does this: gradients of small GT bounding boxes are magnified by a factor close to 2, while gradients of big GT bounding boxes are magnified by a factor close to 1.
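Numerically, the size-compensation factor behaves like this (Python sketch; the inputs are the normalized GT width and height):

```python
# Sketch: box-size compensation factor used in delta_yolo_box.
def box_scale(truth_w, truth_h):  # truth_w, truth_h normalized to [0, 1]
    return 2 - truth_w * truth_h

print(box_scale(0.1, 0.1))  # small box -> ~1.99, gradient magnified almost 2x
print(box_scale(0.9, 0.9))  # big box   -> ~1.19, close to 1
```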
doc: torch.Tensor — PyTorch master documentation
ref: How to create a tensor on GPU as default - PyTorch Forums
torch.Tensor is an alias for the default tensor type (torch.FloatTensor), as the documentation says.
import torch
import torch.nn as nn
import torch.nn.functional as F
Torch defines 8 CPU tensor types and 8 GPU tensor types.
The default tensor type is torch.FloatTensor, which is a CPU tensor and has a dtype of torch.float32.
print(torch.get_default_dtype()) # To get the default Tensor dtype(torch.float32)
And this (torch.FloatTensor) makes tensors be created on the CPU if no device is specified.
To make tensors be created on the GPU by default:
torch.set_default_tensor_type('torch.cuda.FloatTensor')
After this, all tensors will be created on the selected GPU device, and still has a dtype of torch.float32
by default.
a = torch.tensor([1.])
print(a.dtype)
print(a.device)
One more example:
torch.set_default_tensor_type('torch.cuda.DoubleTensor')
This makes tensors be created on the GPU by default, with a dtype of torch.float64.
Use torch.device to get a torch.device object.
Get the CPU device
cpu = torch.device('cpu') # Current CPU device
cpu1 = torch.device('cpu:0')
They're exactly the same, since there is no multiple-CPU mode.
Get the GPU device
# Current GPU device
cuda = torch.device('cuda')
cuda = torch.device('cuda', None)
# GPU 0
cuda0 = torch.device('cuda:0')
cuda0 = torch.device('cuda', 0)
# GPU 1
cuda1 = torch.device('cuda:1')
cuda1 = torch.device('cuda', 1)
The current CPU device will always be 'cpu:0', but the current GPU device depends on the currently selected device.
So, if the currently selected device is GPU 0, cuda refers to GPU 0. But when we change the selected device to GPU 1 (if you have one...😂), cuda becomes GPU 1.
Create Tensors on device
Get the index of currently selected device:
print(torch.cuda.current_device())
Let's suppose it's 0
, now we can
# Create a tensor on CPU, given a torch.device object or a string
a = torch.tensor([1.], device=cpu)
a = torch.tensor([1.], device='cpu')
# Create a tensor on currently selected GPU, which is GPU 0 now
b = torch.tensor([1.], device=cuda)
b = torch.tensor([1.], device='cuda')
# Create a tensor on specific GPU
c = torch.tensor([1.], device=cuda1)
c = torch.tensor([1.], device='cuda:1')
With one GPU, we only care whether a tensor is on the CPU or the GPU. No need to worry about the currently selected device, since there is only 1 GPU that can be selected :joy:.
torch.Tensor.cuda() returns a copy of this torch.Tensor object in CUDA memory on the specified device, copying to the currently selected device if no device argument is given.
cuda = torch.device('cuda')
cuda0 = torch.device('cuda:0')
tensor = torch.randn(2, 2)
# To currently selected GPU device or specific device(both 'cuda:0' in this situation)
tensor = tensor.cuda()
tensor = tensor.cuda(cuda0)
Inversely, use torch.Tensor.cpu() to get a copy in CPU memory.
tensor = torch.randn(2, 2)
# CPU -> GPU
tensor = tensor.cuda()
# GPU -> CPU
tensor = tensor.cpu()
torch.Tensor.to()
performs Tensor dtype and/or device conversion. It returns a copy of the desired Tensor.
cuda0 = torch.device('cuda:0')
cpu = torch.device('cpu')
tensor = torch.randn(2, 2)
# to float64
tensor = tensor.to(torch.float64)
# to float 32, using torch.Tensor.type()
tensor = tensor.type(torch.float32)
# to GPU
tensor = tensor.to(cuda0)
# to CPU
tensor = tensor.to(cpu)
So torch.Tensor.to(device, dtype) can be considered a combination of torch.Tensor.cuda(device), torch.Tensor.cpu() and torch.Tensor.type(dtype).
Once a data tensor is allocated (to CPU or GPU), we can operate on it irrespective of the selected device, and the results are always placed on the same device as the tensor.
Furthermore, if we do operations between 2 or more tensors, they must all be on the same device: the operation takes place there, and the result is placed there too.
torch.nn.Parameter is a kind of Tensor that is to be considered a module parameter, and Parameters are Tensor subclasses. Correspondingly, the torch.nn module provides the torch.nn.Module.cuda() and torch.nn.Module.cpu() methods for easily transferring tensors (parameters) between CPU and GPU, and also the torch.nn.Module.to() method to do the transfer/cast.
class Model(nn.Module):
def __init__(self):
super(Model, self).__init__()
self.conv1 = nn.Conv2d(1, 20, 5)
self.conv2 = nn.Conv2d(20, 20, 5)
def forward(self, x):
x = F.relu(self.conv1(x))
return F.relu(self.conv2(x))
model = Model()
# list contains parameters of model.conv1(weight and bias)
param_list_conv1 = list(model.conv1.parameters())
print(param_list_conv1[0].device)
# CPU -> GPU(.cuda() method)
model.cuda()
print(param_list_conv1[0].device)
# GPU -> CPU(.cpu() method)
model.cpu()
print(param_list_conv1[0].device)
# CPU -> GPU(.to() method)
cuda0 = torch.device('cuda:0')
model.to(cuda0)
print(param_list_conv1[0].device)
After allocating data and model to the GPU, we can use the GPU to accelerate our training process.
With multiple GPUs, you should care about the currently selected device. Use the context manager torch.cuda.device() to manually control which GPU a tensor is created on, which also makes the code clearer.
cuda = torch.device('cuda')
# Create tensor a,b,c on device cuda:0
with torch.cuda.device(0):
a = torch.tensor([1., 2.], device=cuda)
b = torch.tensor([1., 2.]).cuda()
c = torch.tensor([1., 2.]).to(cuda)
# Create tensor d,e,f on device cuda:1
with torch.cuda.device(1):
d = torch.tensor([1., 2.], device=cuda)
e = torch.tensor([1., 2.]).cuda()
f = torch.tensor([1., 2.]).to(cuda)
doc: J. CUDA Environment Variables :: CUDA Toolkit Documentation
ref: CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Developer Blog
ref: 2Pac – Can't C Me Lyrics | Genius Lyrics
Let's suppose (or dream) that you have 4 GPUs and want to use three of them to train your model while keeping the remaining one free to play with. Set CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application (the model-training process) sees.
There are many ways to achieve that; here are 2 of them:
Set the environment variable in your python script (not recommended)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'
This method is not recommended because it's not flexible. Use it only when this is your usual setup.
Set the environment variable when you run the python script (recommended)
CUDA_VISIBLE_DEVICES=1,2,3 python train.py
Use this if you just want to play around.
And after that,
The blind stares of a million pairs of eyes
Lookin' hard but won't realize
That they will never see the 'GPU0'!
ref: Optional: Data Parallelism — PyTorch Tutorials
doc: torch.nn — PyTorch master documentation
Torch will only use one GPU by default. Simply use torch.nn.DataParallel to run your model in parallel over multiple GPUs along the batch dimension.
model = nn.DataParallel(model)
ref: When to set pin_memory to true? - vision - PyTorch Forums
ref: How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Developer Blog
torch.utils.data.DataLoader accepts a parameter pin_memory; if True, the tensors will be copied into CUDA pinned memory.
#https://github.com/pytorch/examples/blob/master/imagenet/main.py#L211-L223
train_loader = torch.utils.data.DataLoader(
train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
num_workers=args.workers, pin_memory=True, sampler=train_sampler)
By default, GPU operations are asynchronous, which allows more computation to be executed in parallel. But copying data between CPU and GPU, or between GPUs, is synchronous by default, e.g. torch.Tensor.to(), torch.Tensor.cuda() and torch.nn.Module.to(). These functions accept a non_blocking argument (previously named async). When non_blocking is set, the copy is performed asynchronously with respect to the host if possible, e.g. when moving CPU tensors with pinned memory to CUDA devices.
#https://github.com/pytorch/examples/blob/master/imagenet/main.py#L270-L272
input = input.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)
These methods provide a larger bandwidth between the host (CPU) and the device (GPU) and improve data-transfer performance.
# At the beginning of the script
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# When loading data
image, label = image.to(device), label.to(device)
# Create the model
model = Model().to(device)
# https://pytorch.org/docs/stable/notes/cuda.html#device-agnostic-code
import argparse
import torch
parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
args.device = torch.device('cuda')
else:
args.device = torch.device('cpu')
# When loading the data
for i, x in enumerate(train_loader):
x = x.to(args.device)
# When creating the model
model = Model().to(args.device)
Actually, it's a brief conclusion. So in practice, we should:
Post cover image from Quick Guide for setting up PyTorch with Window in 2 mins
CSE455: Computer Vision - Spring 2018
I saw this course on pjreddie's GitHub page and found it interesting.👍
It is an undergraduate course offered by the School of Computer Science and Engineering at the University of Washington. I did the assignments out of personal interest.😋
My solutions to the assignments include the code to finish the homework plus the extra things to get the credits.
Use the -std=c99 flag to tell the compiler to use C99. Still, I think it's cooler to do the declarations outside the loop:
int i, j, k;
for (i = 0; i < im.c; ++i){
for (j = 0; j < im.h; ++j){
for (k = 0; k < im.w; ++k){
/*body*/
}
}
}
Use if(expression) for single-line things and if (expression){ for multiple lines. And always use if(1) or if(0) to enable/disable a code snippet.
if(!sum) return;
if (a == LOGISTIC){
d.data[i][j] *= x * (1 - x);
} else if (a == RELU){
d.data[i][j] *= x > 0 ? 1 : 0;
} else if (a == LRELU){
d.data[i][j] *= x > 0 ? 1 : 0.1;
}
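The same element-wise gradients as a Python sketch (names are mine; the homework's C version operates on the matrix struct):

```python
# Sketch: d/dx of the activations above, applied to the upstream gradient d.
def grad(activation, x, d):
    if activation == "logistic":  # note: x is the logistic *output* here
        return d * x * (1 - x)
    if activation == "relu":
        return d * (1 if x > 0 else 0)
    if activation == "lrelu":
        return d * (1 if x > 0 else 0.1)
    raise ValueError(activation)

print(grad("logistic", 0.5, 1.0))  # 0.25
print(grad("lrelu", -2.0, 1.0))    # 0.1
```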
if(0){
/*disabled body*/
} else
{
/*enabled body*/
}
So I can search for if(0) to locate the snippet and flip the switch quickly.
Prefer ++i when I have a choice.
Makefile: TODO, should write a gist for it.
Compile with OpenCV (using MinGW).
Structs with pointers inside
When we define a struct with at least one pointer in it:
typedef struct matrix{
int rows, cols;
double **data;
int shallow;
} matrix;
We should write a function to allocate and initialize memory for it, for safety and convenience:
matrix make_matrix(int rows, int cols)
{
matrix m;
m.rows = rows;
m.cols = cols;
m.shallow = 0;
m.data = calloc(m.rows, sizeof(double *));
int i;
for(i = 0; i < m.rows; ++i) m.data[i] = calloc(m.cols, sizeof(double));
return m;
}
And also a function to free the memory:
void free_matrix(matrix m)
{
if (m.data) {
int i;
if (!m.shallow) for(i = 0; i < m.rows; ++i) free(m.data[i]);
free(m.data);
}
}
Remember to call it manually to free the memory and avoid a ⚠️segmentation fault.
And also a function for deep copy (if necessary):
matrix copy_matrix(matrix m)
{
int i,j;
matrix c = make_matrix(m.rows, m.cols);
for(i = 0; i < m.rows; ++i){
for(j = 0; j < m.cols; ++j){
c.data[i][j] = m.data[i][j];
}
}
return c;
}
Never use a struct with a pointer inside it as an intermediate variable in an expression
In ./vision-hw4/src/classifier.c, I used to write things like this:
// THIS IS TOTALLY WRONG!
matrix backward_layer(layer *l, matrix delta)
{
// back propagation through the activation
gradient_matrix(l->out, l->activation, delta);
// calculate dL/dw and save it in l->dw
free_matrix(l->dw);
matrix dw = matrix_mult_matrix(transpose_matrix(l->in), delta);
l->dw = dw;
// calculate dL/dx and return it.
matrix dx = matrix_mult_matrix(delta, transpose_matrix(l->w));
return dx;
}
It is totally wrong because the intermediate struct variables transpose_matrix(l->in) and transpose_matrix(l->w) will never ever be freed. This stupid Python-like convenient writing will fill your memory up with intermediate garbage until it runs out, and finally throw a ⚠️segmentation fault.
The right way to do this is:
matrix backward_layer(layer *l, matrix delta)
{
// back propagation through the activation
gradient_matrix(l->out, l->activation, delta);
// calculate dL/dw and save it in l->dw
free_matrix(l->dw);
matrix inT = transpose_matrix(l->in);
matrix dw = matrix_mult_matrix(inT, delta);
free_matrix(inT);
l->dw = dw;
// calculate dL/dx and return it.
matrix wT = transpose_matrix(l->w);
matrix dx = matrix_mult_matrix(delta, wT);
free_matrix(wT);
return dx;
}
String things could cause fatal mistakes
After finishing my code in ./vision-hw4, I trained the model on my Windows laptop and it worked well. But when I tried to do the same thing on Linux, the training procedure just crashed, giving me 0% training and test accuracy.
After debugging, I found that I had accidentally changed the line ending of the file mnist.labels from LF to CRLF, which is the default on Windows.
This converts all \n (line break on Linux) to \r\n (line break on Windows). So num0\n becomes num0\r\n in mnist.labels, and so does the rest.
Now see the char *fgetl(FILE *fp) function in ./src/data.c. This function parses labels from the text file and stores them for the training and test phases.
char *fgetl(FILE *fp)
{
if(feof(fp)) return 0;
size_t size = 512;
char *line = malloc(size*sizeof(char));
if(!fgets(line, size, fp)){
free(line);
return 0;
}
size_t curr = strlen(line);
while((line[curr-1] != '\n') && !feof(fp)){
if(curr == size-1){
size *= 2;
line = realloc(line, size*sizeof(char));
if(!line) {
fprintf(stderr, "malloc failed %ld\n", size);
exit(0);
}
}
size_t readsize = size-curr;
if(readsize > INT_MAX) readsize = INT_MAX-1;
fgets(&line[curr], readsize, fp);
curr = strlen(line);
}
if(line[curr-1] == '\n') line[curr-1] = '\0';
return line;
}
And most importantly, this function looks for \n as the marker of a line ending. So the label num0 becomes num0\r, and so do all the other labels.
At the training phase, all the training samples will be considered negative, and so will the test samples. Surprisingly but reasonably, I got 0% for both training and test accuracy.
Remember: use LF as the default option.
More Extra Credit of vision-hw2 (spherical coordinates)
All GPU implementations have been ignored
Written by ivanpp for fun, contact me: ding@ivanpp.me
make_convolutional_layer
convolutional_layer make_convolutional_layer(int batch, int h, int w, int c, int n, int groups, int size, int stride, int padding, ACTIVATION activation, int batch_normalize, int binary, int xnor, int adam)
{
int i;
// create a convolutional_layer(layer) type variable l, initialize all struct members to 0.
convolutional_layer l = {0};
l.type = CONVOLUTIONAL;
// Get the params
l.groups = groups; // optional: weight sharing across 'groups' channels
l.h = h; // input height
l.w = w; // input width
l.c = c; // input channels
l.n = n; // num of filters
l.binary = binary; // optional: ?
l.xnor = xnor; // optional: ?
l.batch = batch; // num of image per batch
l.stride = stride; // stride of the conv operation
l.size = size; // kernel size of filters
l.pad = padding; // padding of the conv operation
l.batch_normalize = batch_normalize; // optional: bn after conv
// Allocate memory (for conv weight and conv weight_update)
l.weights = calloc(c/groups*n*size*size, sizeof(float)); // stored as (n*(c/groups)*size*size)
l.weight_updates = calloc(c/groups*n*size*size, sizeof(float));
l.biases = calloc(n, sizeof(float));
l.bias_updates = calloc(n, sizeof(float));
l.nweights = c/groups*n*size*size; // num of params for l.weights
l.nbiases = n; // num of params for l.biases
// Initialize weights to random_uniform
float scale = sqrt(2./(size*size*c/l.groups));
for(i = 0; i < l.nweights; ++i) l.weights[i] = scale*rand_normal();
// Allocate memory (for forward and backward)
int out_w = convolutional_out_width(l); // compute output width
int out_h = convolutional_out_height(l); // compute output height
l.out_h = out_h;
l.out_w = out_w;
l.out_c = n; // output channel should be num of filter, n
l.outputs = l.out_h * l.out_w * l.out_c;
l.inputs = l.w * l.h * l.c;
l.output = calloc(l.batch*l.outputs, sizeof(float)); // for conv output(forward pass)
l.delta = calloc(l.batch*l.outputs, sizeof(float)); // for prev layer's gradient(backward pass)
// Assign forward, backward and update function
l.forward = forward_convolutional_layer;
l.backward = backward_convolutional_layer;
l.update = update_convolutional_layer;
if(binary){
l.binary_weights = calloc(l.nweights, sizeof(float));
l.cweights = calloc(l.nweights, sizeof(char));
l.scales = calloc(n, sizeof(float));
}
if(xnor){
l.binary_weights = calloc(l.nweights, sizeof(float));
l.binary_input = calloc(l.inputs*l.batch, sizeof(float));
}
if(batch_normalize){
l.scales = calloc(n, sizeof(float));
l.scale_updates = calloc(n, sizeof(float));
for(i = 0; i < n; ++i){
l.scales[i] = 1;
}
l.mean = calloc(n, sizeof(float));
l.variance = calloc(n, sizeof(float));
l.mean_delta = calloc(n, sizeof(float));
l.variance_delta = calloc(n, sizeof(float));
l.rolling_mean = calloc(n, sizeof(float));
l.rolling_variance = calloc(n, sizeof(float));
l.x = calloc(l.batch*l.outputs, sizeof(float));
l.x_norm = calloc(l.batch*l.outputs, sizeof(float));
}
if(adam){
l.m = calloc(l.nweights, sizeof(float));
l.v = calloc(l.nweights, sizeof(float));
l.bias_m = calloc(n, sizeof(float));
l.scale_m = calloc(n, sizeof(float));
l.bias_v = calloc(n, sizeof(float));
l.scale_v = calloc(n, sizeof(float));
}
l.workspace_size = get_workspace_size(l);
l.activation = activation; // which activation to use
fprintf(stderr, "conv %5d %2d x%2d /%2d %4d x%4d x%4d -> %4d x%4d x%4d %5.3f BFLOPs\n", n, size, size, stride, w, h, c, l.out_w, l.out_h, l.out_c, (2.0 * l.n * l.size*l.size*l.c/l.groups * l.out_h*l.out_w)/1000000000.);
return l;
}
Describe Sth please
Optional params:
Optional params | Forward | Backward | Update | Usage | Defined in |
---|---|---|---|---|---|
l.groups | Y | Y | N | | [convolutional] |
l.binary | TODO | TODO | TODO | | [convolutional] |
l.xnor | TODO | TODO | TODO | | [convolutional] |
l.batch_normalize | Y | Y | N | Regularization | [convolutional] |
adam | | | | Optimization Algorithm | [net] |
Batch normalization and Adam will not be covered in this blog.
Image(or image like) input *net.input (given by the net) has the size of $[batch\times c\times h\times w]$, consider it as a $(batch, c, h, w)$ matrix.
Conv filter *l.weights has the size of $[n\times \frac{c}{groups}\times\ size\times size]$, consider it as a $(n, \frac{c}{groups}, size, size)$ matrix.
Conv bias *l.biases has the size of $[n]$, namely 1 float bias for 1 filter.
Conv output *l.output has the size of $[batch\times n\times out_h\times out_w]$, consider it as a $(batch, n, out_h, out_w)$ matrix.
Conv workspace *l.workspace should have the size of $[\frac{c}{groups}\times out_h\times out_w\times size\times size]$. But the actual size of *net.workspace (the workspace shared by all conv/deconv/local layers in a net) is chosen to suit the layer that needs the most workspace memory. All conv/deconv/local layers share the net's workspace, so most of them use only part of it.
Use default value for all optional params:
net.adam = 0;
l.groups = 1;
l.batch_normalize = 0;
l.binary = 0;
l.xnor = 0;
We can get the minimal implementation:
// minimal implementation of conv forward
void forward_convolutional_layer_min(convolutional_layer l, network net)
{
int i;
fill_cpu(l.outputs*l.batch, 0, l.output, 1);
int m = l.n;
int k = l.size*l.size*l.c;
int n = l.out_w*l.out_h;
for(i = 0; i < l.batch; ++i){
float *a = l.weights;
float *b = net.workspace;
float *c = l.output + i*n*m;
im2col_cpu(net.input + i*l.c*l.h*l.w, l.c, l.h, l.w, l.size, l.stride, l.pad, b);
gemm(0,0,m,n,k,1,a,k,b,n,1,c,n);
}
add_bias(l.output, l.biases, l.batch, l.n, l.out_h*l.out_w);
activate_array(l.output, l.outputs*l.batch, l.activation);
}
Now *net.input remains the same, has the size of $[batch\times c\times h\times\ w]$. For one single image in current batch, it has the size of $[c\times h\times w]$. And conv filter matrix has the shape of $(n,c,size,size)$.
Instead of using for loops to do conv operations at each input location using all the filters, we use im2col and then just do matrix multiplication.
// src/im2col.c
void im2col_cpu(float* data_im,
int channels, int height, int width,
int ksize, int stride, int pad, float* data_col)
{
int c,h,w;
int height_col = (height + 2*pad - ksize) / stride + 1; // height after reconstruct
int width_col = (width + 2*pad - ksize) / stride + 1; // width after reconstruct
int channels_col = channels * ksize * ksize; // filter(channels, ksize, ksize)
for (c = 0; c < channels_col; ++c) { // flatten the filter
int w_offset = c % ksize; // from which column
int h_offset = (c / ksize) % ksize; // from which row
int c_im = c / ksize / ksize; // from which channel
for (h = 0; h < height_col; ++h) { // iterate the reconstructed img(out_h*out_w)
for (w = 0; w < width_col; ++w) {
// mapping reconstructed img to padded img(which row)
int im_row = h_offset + h * stride;
// mapping reconstructed img to padded img(which col)
int im_col = w_offset + w * stride;
// index of the data_col(reconstructed img)
int col_index = (c * height_col + h) * width_col + w;
data_col[col_index] = im2col_get_pixel(data_im, height, width, channels,
im_row, im_col, c_im, pad); // mapping pixel by pixel
}
}
}
}
im2col_cpu() accepts 2 pointers as input. The input pointer (a.k.a. image data pointer) points to the start address of the image input, net.input + i*l.c*l.h*l.w. The output pointer (a.k.a. col data pointer) points to the start address of the workspace, b = net.workspace.
im2col_cpu() reconstructs image data $(c,h,w)$ into col data $(c\times size\times size, out_h\times out_w)$.
gemm() stands for General Matrix Multiplication. So the weight matrix $(n, c\times size\times size)$ multiplies the col data matrix $(c\times size\times size, out_h\times out_w)$, and we finally get the output matrix $(n, out_h, out_w)$ for one single image. Note that the pointer *l.output already points to the right place: float *c = l.output + i*n*m.
And for batch images, we will get the $(batch, n, out_h, out_w)$ output for *l.output.
Use add_bias() to add bias to *l.output and use activate_array() to pass through some chosen activation function. Conv forward done!
Just a review:
Each image has been divided into l.groups groups, or more specifically, grouped by channels. So each group of an image has the shape $(\frac{c}{groups},h,w)$. The size of the filters is $[n\times \frac{c}{groups}\times size\times size]$; in other words, there are $n$ filters of shape $(\frac{c}{groups}, size, size)$.
We don't use all the $n\times (\frac{c}{groups}, size, size)$ kernels to do the conv operation with all $groups\times (\frac{c}{groups},h,w)$ partial-channel images; we group the filters as well as the image (actually the image channels) first. The conv kernels have also been divided into l.groups groups, so each filter group has $\frac{n}{groups}$ filters of shape $(\frac{c}{groups},size,size)$.
The image (channel) groups and filter groups are in one-to-one correspondence. $Group_j$ filters are only responsible for $Group_j$ of the image, like a sort of conv pair.
int i, j;
int m = l.n/l.groups; // num of filters
int k = l.size*l.size*l.c/l.groups; // len of filter
int n = l.out_w*l.out_h; // len of output per output channel
for(i = 0; i < l.batch; ++i){
for(j = 0; j < l.groups; ++j){
float *a = l.weights + j*l.nweights/l.groups;
float *b = net.workspace;
float *c = l.output + (i*l.groups + j)*n*m;
// use im2col_cpu() to reconstruct input for each (input, weight) pair
im2col_cpu(net.input + (i*l.groups + j)*l.c/l.groups*l.h*l.w,
l.c/l.groups, l.h, l.w, l.size, l.stride, l.pad, b);
// conv operation(actually matrix multiplication) for one pair
gemm(0,0,m,n,k,1,a,k,b,n,1,c,n);
}
}
*a has the start address of $Group_j$ filters
*b has the start address of the workspace
*c has the start address of the output for $Group_j$ of $image_i$
net.input + (i*l.groups + j)*l.c/l.groups*l.h*l.w gives the address for $Group_j$ of $image_i$
Using im2col_cpu(), each $(\frac{c}{groups},h,w)$ partial-channel image will get the 'partial-channel col data' of shape $(\frac{c}{groups}\times size\times size, out_h\times out_w)$. Along with its $(\frac{n}{groups},\frac{c}{groups}\times size\times size)$ filter pair, do the matrix multiplication, and the outcome will be $(\frac{n}{groups},out_h,out_w)$.
Concatenating $groups \times (\frac{n}{groups},out_h,out_w)$ output, we will get $(n,out_h,out_w)$ output for one image as usual. The output shape remains the same, regardless of using this group thing.
Just a review:
E.G. No fuckin examples because it is stupid.
*l.delta has the same size as *l.output, as it will store the gradients w.r.t. the output of the current conv layer.
What forward_convolutional_layer() should compute for each input $x$ is $y=x\ast W$, and what it actually does is to compute $y=W\times x_{col}$.
So for backward_convolutional_layer(), it computes:
$\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y}\cdot \frac{\partial y}{\partial W}=\frac{\partial L}{\partial y} \times {x_{col}}^T$
$\frac{\partial L}{\partial x_{col}}=\frac{\partial L}{\partial y}\cdot \frac{\partial y}{\partial x_{col}}=W^T\times \frac{\partial L}{\partial y}$
void backward_convolutional_layer(convolutional_layer l, network net)
{
int i, j;
int m = l.n;
int n = l.size*l.size*l.c;
int k = l.out_w*l.out_h;
// gradients pass through activation function
gradient_array(l.output, l.outputs*l.batch, l.activation, l.delta);
if(l.batch_normalize){
backward_batchnorm_layer(l, net);
} else {
backward_bias(l.bias_updates, l.delta, l.batch, l.n, k);
}
for(i = 0; i < l.batch; ++i){
for(j = 0; j < l.groups; ++j){
float *a = l.delta + (i*l.groups + j)*m*k;
float *b = net.workspace;
float *c = l.weight_updates + j*l.nweights/l.groups;
float *im = net.input+(i*l.groups + j)*l.c/l.groups*l.h*l.w;
im2col_cpu(im, l.c/l.groups, l.h, l.w,
l.size, l.stride, l.pad, b);
// compute gradients w.r.t. weights
gemm(0,1,m,n,k,1,a,k,b,k,1,c,n);
if(net.delta){ // if gradient descent continues(not the first layer)
a = l.weights + j*l.nweights/l.groups;
b = l.delta + (i*l.groups + j)*m*k;
c = net.workspace;
// compute gradients w.r.t. the reconstructed inputs(x_col)
gemm(1,0,n,k,m,1,a,n,b,k,0,c,k);
// reconstruct the im_col using col2im_cpu, restore the structure,
// and get the gradients w.r.t. the inputs
col2im_cpu(net.workspace, l.c/l.groups, l.h, l.w, l.size, l.stride,
l.pad, net.delta + (i*l.groups + j)*l.c/l.groups*l.h*l.w);
}
}
}
}
$X$ has the shape of $(batch,c,h,w)$, and $X_{col}$ has the shape of $(batch, c\times size\times size, out_h, out_w)$.
For $Group_j$ in $Image_i$, $x_{col}^{ij}$ should have the shape of $(\frac{c}{groups}\times size\times size,out_h\times out_w)$.
$W$ has the shape of $(n,\frac{c}{groups},size,size)$. And what is responsible for $x_{col}^{ij}$, $w^{ij}$ has the shape of $(\frac{n}{groups},\frac{c}{groups}\times size\times size)$.
$\frac{\partial L}{\partial Y}$ has the shape of $(batch,n,out_h,out_w)$. What we need at a time is $\frac{\partial L}{\partial y^{ij}}$, has the shape of $(\frac{n}{groups},out_h\times out_w)$.
Given $\frac{\partial L}{\partial y^{ij}}$ and $x_{col}^{ij}$, call gemm(TA=0, TB=1, ...), we will get $\frac{\partial L}{\partial w^{ij}}=\frac{\partial L}{\partial y^{ij}}\times {x_{col}^{ij}}^T$, which has the shape of $(\frac{n}{groups}, \frac{c}{groups}\times size\times size)$. And will finally get the $\frac{\partial L}{\partial W}$, which will be stored in the memory block started from *l.weight_updates, of the size $(n,\frac{c}{groups},size,size)$.
Also, given $w^{ij}$ and $\frac{\partial L}{\partial y^{ij}}$, call gemm(TA=1, TB=0), and we will get $\frac{\partial L}{\partial x_{col}^{ij}}={w^{ij}}^T\times \frac{\partial L}{\partial y^{ij}}$, which has the shape of $(\frac{c}{groups}\times size\times size, out_h\times out_w)$. And we will finally get $\frac{\partial L}{\partial X_{col}}$ of the shape $(batch, c\times size\times size, out_h, out_w)$, stored starting from *net.workspace; $X_{col}$ will be overwritten.
Using col2im_cpu(), $\frac{\partial L}{\partial X_{col}}$ will be reconstructed to $\frac{\partial L}{\partial X}$, stored in the memory block that starts from *net.delta, of the size $(batch,c,h,w)$.
Just a review (again...):
void update_convolutional_layer(convolutional_layer l, update_args a)
{
float learning_rate = a.learning_rate*l.learning_rate_scale;
float momentum = a.momentum;
float decay = a.decay;
int batch = a.batch;
axpy_cpu(l.n, learning_rate/batch, l.bias_updates, 1, l.biases, 1);
scal_cpu(l.n, momentum, l.bias_updates, 1);
if(l.scales){
axpy_cpu(l.n, learning_rate/batch, l.scale_updates, 1, l.scales, 1);
scal_cpu(l.n, momentum, l.scale_updates, 1);
}
axpy_cpu(l.nweights, -decay*batch, l.weights, 1, l.weight_updates, 1);
axpy_cpu(l.nweights, learning_rate/batch, l.weight_updates, 1, l.weights, 1);
scal_cpu(l.nweights, momentum, l.weight_updates, 1);
}
*l.weight_updates has the same size as *l.weights, and *l.bias_updates has the same size as *l.biases.