Tuesday, August 28, 2018

Deep Learning with Raspberry Pi -- Real-time object detection with YOLO v3 Tiny! [updated on Dec 19 2018, detailed instruction included]

A quick note on Dec 18 2018:
Since I posted this article in late August, I have been asked many times for detailed instructions and for the Python wrapper. Having been really busy over the last several months, I finally found some spare time to complete this blog post with detailed instructions! All the information can be found in my GitHub repo, which was forked from shizukachan/darknet-nnpack. I have modified the Makefile, added two nonblocking Python wrappers, and made some other minor modifications. It should "almost" work out of the box!

Here goes the updated article

I am a big fan of Yolo (You Only Look Once, Yolo website). Redmon & Farhadi's famous Yolo series has had a big impact on the deep learning community. BTW, their recent "paper" (YOLOv3: An Incremental Improvement) is an interesting read as well.

So, what is Yolo? Yolo is a cutting-edge object detection algorithm, i.e., it detects objects in images. Traditionally, people slid a window across the image and tried to recognize the crop at every possible window location. This method is of course very time consuming, because there are many different ways to place the window and many computations are repeated. Yolo, standing for "You Only Look Once" (not You Only Live Once), smartly avoids those heavy computations by directly predicting object categories and their bounding boxes simultaneously.
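To get a feel for why the sliding-window approach is expensive, here is a rough count of how many crops a classifier would have to score on a single image. The image size, window sizes, and stride are illustrative assumptions, not values taken from any particular detector.

```python
def count_windows(img_w, img_h, win_w, win_h, stride):
    """Number of positions a win_w x win_h window can take on the image."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


# Hypothetical setup: a 416x416 image scanned at three square window
# sizes with a stride of 8 pixels.
total = sum(count_windows(416, 416, s, s, 8) for s in (64, 128, 256))
print(total)
```

Each of those crops would need its own classifier pass, whereas Yolo runs the network once per image.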

YoloV3 is one of the latest updates to the Yolo algorithm. The biggest change is that YoloV3 now uses only convolutional layers, with no fully-connected layers. Don't let the technical terms scare you away! What this implies is that YoloV3 no longer cares about the input image size! As long as the height and width are integer multiples of 32 (such as 224x224, 288x288, 608x288, etc.), YoloV3 will work fine! Another major improvement is that YoloV3 also makes predictions from intermediate layers. What this means is that YoloV3 now does a better job of detecting small objects than its previous versions!
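The multiple-of-32 rule follows from the network's total downsampling: the coarsest prediction grid has one cell per 32x32 block of input pixels. A small sketch of that arithmetic, where the function name is made up for illustration and the stride of 32 is taken from the rule above:

```python
def grid_size(width, height, stride=32):
    """Cells in the coarsest prediction grid for a given input size."""
    if width % stride or height % stride:
        raise ValueError("width and height must be multiples of %d" % stride)
    return width // stride, height // stride


for w, h in [(224, 224), (288, 288), (608, 288)]:
    print((w, h), "->", grid_size(w, h))
```

An input whose sides are not multiples of 32 would leave a partial cell at the edge, which is why the sketch rejects it.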

I will have to skip the technical details here because the paper explains everything. The only thing you need to know is that Yolo is lightweight, fast, and decently accurate. It is so lightweight and fast that it can even be used on a Raspberry Pi, a single-board computer with a smartphone-grade CPU, limited RAM, and no CUDA GPU, to run object detection in real time! It is also convenient because the authors provide configuration files and weights trained on the COCO dataset, so there is no need to train your own model if you are only interested in detecting common objects.

Although Yolo is super efficient, it still requires quite a lot of computation. The original YoloV3, which is implemented in Darknet, a neural network framework written in C by the same authors, reports a "segmentation fault" on the Raspberry Pi 3 Model B+ because the Pi simply cannot provide enough memory to load the weights. The YoloV3-tiny version, however, can run on the RPI 3, albeit very slowly.

Again, I wasn't able to run YoloV3 full version on Pi 3. I think it wouldn't be possible to do so considering the large memory requirement by YoloV3. This article is all about implementing YoloV3-Tiny on Raspberry Pi Model 3B!

Quite a few steps are still needed to speed up YoloV3-tiny on the Pi:
1. Install NNPACK, an acceleration library that lets neural networks run on multi-core CPUs
2. Add some special configuration to the Makefile so that the Darknet Yolo source code compiles for the Cortex CPU with NNPACK optimization
3. Either install OpenCV for C++ (a big pain on Raspberry Pi) or write some Python code to wrap Darknet. I believe Yolo comes with a Python wrapper, but I haven't had a chance to test it on the RPI.
4. Download yolov3-tiny.cfg and yolov3-tiny.weights, then run Darknet with the Yolo tiny version (not the full version)!

Sounds complicated? Luckily, digitalbrain79 (not me) has already figured it out (https://github.com/digitalbrain79/darknet-nnpack). I had more luck with shizukachan's fork. I made a few more changes to make it easier to follow:

Step 0: prepare Python and Pi Camera

Log in to Raspberry Pi using SSH or directly in terminal.
Make sure pip is installed (it should come with Debian):
sudo apt-get install python-pip
Install OpenCV. The simplest way on RPI is as follows (do not build from source!):
sudo apt-get install python-opencv
Enable pi camera
sudo raspi-config
Go to Interfacing Options, and enable P1/Camera
You will have to reboot the pi to be able to use the camera
A few additional words here. In the advanced options of raspi-config, you can adjust the memory split between the CPU and GPU. Although we would like to allocate more RAM to the CPU so that the Pi can load a larger model, you should allocate at least 64MB to the GPU, as the camera module requires it.

Step 1: Install NNPACK

NNPACK was used to optimize Darknet without using a GPU. It is useful for embedded devices using ARM CPUs.
Idein's qmkl is also used to accelerate the SGEMM using the GPU. This is slower than NNPACK on NEON-capable devices and primarily useful for ARM CPUs without NEON.
The NNPACK implementation in Darknet was improved to use transform-based convolution computation, allowing for 40%+ faster inference performance on non-initial frames. This is most useful for repeated inferences, i.e., video, or if Darknet is left open to continue processing input instead of being allowed to terminate after processing input.

Install Ninja (build tool)

Install PeachPy and confu
sudo pip install --upgrade git+https://github.com/Maratyszcza/PeachPy
sudo pip install --upgrade git+https://github.com/Maratyszcza/confu
Install Ninja
git clone https://github.com/ninja-build/ninja.git
cd ninja
git checkout release
./configure.py --bootstrap
Install clang (I'm not sure why we need this, NNPACK doesn't use it unless you specifically target it).
sudo apt-get install clang

Install NNPACK

Install modified NNPACK
git clone https://github.com/shizukachan/NNPACK
cd NNPACK
confu setup
python ./configure.py --backend auto
If you are compiling for the Pi Zero, change the last line to python ./configure.py --backend scalar
You can skip the following several lines, which come from the original darknet-nnpack repo. I did not find them necessary (or maybe I missed something).
It's also recommended to examine and edit https://github.com/digitalbrain79/NNPACK-darknet/blob/master/src/init.c#L215 to match your CPU architecture if you're on ARM, as the cache size detection code only works on x86.
Since none of the ARM CPUs have a L3, it's recommended to set L3 = L2 and set inclusive=false. This should lead to the L2 size being set equal to the L3 size.
Ironically, after some trial and error, I've found that setting L3 to an arbitrary 2MB seems to work pretty well.
Build NNPACK by running ninja in the NNPACK directory (this might take *quite* a while, so be patient. In fact, my Pi crashed the first time; I just rebooted and ran it again).
Run ls; you should find the folders lib and include if all went well.
Test if NNPACK is working by running the convolution-inference-smoketest.
In my case, the test actually failed the first time, but I ran it again and all items passed. So if your test fails, don't panic; try one more time.
Copy the libraries and header files to the system environment:
sudo cp -a lib/* /usr/lib/
sudo cp include/nnpack.h /usr/include/
sudo cp deps/pthreadpool/include/pthreadpool.h /usr/include/
If the convolution-inference-smoketest fails, you've probably hit a compiler bug and will have to change to Clang or an older version of GCC.
You can skip the qmkl/qasm/qbin2hex steps if you aren't targeting the QPU.
Install qmkl
sudo apt-get install cmake
git clone https://github.com/Idein/qmkl.git
cd qmkl
cmake .
sudo make install
Install qasm2
sudo apt-get install flex
git clone https://github.com/Terminus-IMRC/qpu-assembler2
cd qpu-assembler2
make
sudo make install
Install qbin2hex
git clone https://github.com/Terminus-IMRC/qpu-bin-to-hex
cd qpu-bin-to-hex
sudo make install
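Before moving on, it can be worth confirming that the two headers copied to /usr/include earlier in this step are actually there. The helper below is a minimal sketch; its name is made up for illustration, and it only checks the two header files named in the copy commands above.

```python
import os


def missing_files(directory, names):
    """Return the entries of names that do not exist inside directory."""
    return [n for n in names
            if not os.path.isfile(os.path.join(directory, n))]


# Only the headers from the copy commands are checked; library file
# names are left out because they depend on how NNPACK was built.
print(missing_files("/usr/include", ["nnpack.h", "pthreadpool.h"]))
```

An empty list means both headers were found.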

Step 2. Install darknet-nnpack

We have finally finished configuring everything needed. Now simply clone this repository. Note that we are cloning the yolov3 branch. It comes with the Python wrappers I wrote, the correct Makefile, and the YoloV3 weights:
git clone -b yolov3 https://github.com/zxzhaixiang/darknet-nnpack
cd darknet-nnpack
git checkout yolov3
At this point, you can build darknet-nnpack using make. Be sure to edit the Makefile before compiling.

Step 3. Test with YoloV3-tiny

Even after all this preparation, the Raspberry Pi is not powerful enough to run the full YoloV3 version. The YoloV3-tiny version, however, can run at about 1 frame per second.
I wrote two nonblocking Python wrappers to run Yolo, rpi_video.py and rpi_record.py. These two scripts take pictures with the PiCamera Python library and hand each picture to a spawned darknet executable, which runs the detection and saves the result to prediction.png; the Python code then loads prediction.png and displays it on the screen via OpenCV. All the detection work is therefore done by darknet, and Python simply handles input and output. rpi_video.py only displays the real-time object detection result on the screen as an animation (about 1 frame every 1-1.5 seconds); rpi_record.py also saves each frame for your own records (for example, to make a GIF animation afterwards).
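For readers curious what "nonblocking" means here, the following is a generic sketch of the pattern, not the code in rpi_video.py. It uses a small stand-in child process instead of darknet, and the function names are made up for illustration. It relies on select on pipes, so it is POSIX only (fine on the Pi).

```python
import os
import select
import sys
from subprocess import Popen, PIPE

# Stand-in child (NOT darknet): prints "done <line>" for each line it reads.
CHILD = (
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line:\n"
    "        break\n"
    "    sys.stdout.write('done ' + line)\n"
    "    sys.stdout.flush()\n"
)


def send_line(proc, text):
    """Write one line (for example an image path) to the child's stdin."""
    proc.stdin.write((text + "\n").encode())
    proc.stdin.flush()


def poll_output(proc, timeout=0.1):
    """Return bytes the child has printed so far, or b'' if none are ready."""
    ready, _, _ = select.select([proc.stdout], [], [], timeout)
    return os.read(proc.stdout.fileno(), 4096) if ready else b""


def wait_for(proc, marker, max_polls=100):
    """Poll until marker appears; other work could run between polls."""
    seen = b""
    for _ in range(max_polls):
        seen += poll_output(proc)
        if marker in seen:
            return True
    return False


def demo():
    proc = Popen([sys.executable, "-c", CHILD], stdin=PIPE, stdout=PIPE)
    try:
        send_line(proc, "frame.jpg")
        return wait_for(proc, b"done frame.jpg")
    finally:
        proc.stdin.close()
        proc.wait()
        proc.stdout.close()
```

The point of polling with a timeout is that the Python side never hangs while the child is busy, so it can keep the display responsive between results.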
To test it, simply run
sudo python rpi_video.py
sudo python rpi_record.py
You can adjust the task type (detection or classification), weights, configuration file, and threshold in this line:
yolo_proc = Popen(["./darknet",
                   "-thresh", "0.1"],
                   stdin = PIPE, stdout = PIPE)
For more details/weights/configuration/different ways to call darknet, refer to the official YOLO homepage.
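To make the adjustable parts explicit, here is a hedged sketch that builds the argument list from named parameters. The function names are hypothetical, and the default cfg and weights paths are assumptions about where those files sit in your checkout; adjust them to match your setup.

```python
from subprocess import Popen, PIPE


def darknet_command(task="detect",
                    cfg="./cfg/yolov3-tiny.cfg",         # assumed path
                    weights="./yolov3-tiny.weights",     # assumed path
                    thresh=0.1,
                    binary="./darknet"):
    """Assemble the argument list that gets passed to Popen."""
    return [binary, task, cfg, weights, "-thresh", str(thresh)]


def start_darknet(**kwargs):
    """Launch darknet with pipes attached, as in the snippet above."""
    return Popen(darknet_command(**kwargs), stdin=PIPE, stdout=PIPE)


# Show the command line for a stricter threshold without launching anything.
print(" ".join(darknet_command(thresh=0.25)))
```

Keeping the command in one small function makes it easy to swap the weights or threshold without touching the rest of the wrapper.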
As I mentioned, YoloV3-tiny does not care about the size of the input image. So feel free to adjust the camera resolution, as long as both height and width are integer multiples of 32.

#camera.resolution = (224, 224)
#camera.resolution = (608, 608)
camera.resolution = (544, 416)
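If you have a target resolution in mind that is not a multiple of 32, a tiny helper can round it down for you. This is just an illustrative sketch with made-up function names; it is not part of the wrappers.

```python
def snap_to_multiple(value, base=32):
    """Round value down to the nearest multiple of base (at least base)."""
    return max(base, (value // base) * base)


def snap_resolution(width, height, base=32):
    """Apply snap_to_multiple to both sides of a resolution."""
    return snap_to_multiple(width, base), snap_to_multiple(height, base)


print(snap_resolution(600, 400))
```

The result can be assigned directly to camera.resolution.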

Here are my test results:

1. It worked. YoloV3-tiny on a Raspberry Pi 3 Model B+ runs at about 1 frame per second (FPS). rpi_video.py prints the time YoloV3-tiny takes to predict on each image, and I was able to get numbers like 0.9 to 1.1 seconds per frame. Not bad at all! Of course, you can't do any rigorous fast object tracking. But for a surveillance camera, a slow robot, or even a drone, 1 FPS is promising. NNPACK is critical here. As pointed out by shizukachan, without NNPACK the frame rate would be lower than 0.1 FPS!

2. Make sure the power supply you are using can truly provide 2.4A (which is what the RPI 3B needs). I have seen cases where the detection speed dropped to 1 frame per 1.7 seconds because the power supply did not provide sufficient power.

3. It worked, with limits. YoloV3-tiny is not as accurate as the full YoloV3. But if you want to detect specific objects in a specific scene, you can probably train your own YoloV3 model (it must be the tiny version) on a GPU desktop and transfer it to the RPI. Never try to train the model on the RPI. Don't even think about it. Starting from YoloV3-tiny pre-trained on the COCO dataset, transfer learning can be leveraged to speed up training.

4. I didn't modify the source code of Yolo. When performing a detection task, Yolo outputs an image with bounding boxes, labels, and confidences overlaid on top. If you would like to get this information in digital form, you will have to dig into Yolo's source code and modify the output part. It should be relatively straightforward.
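As a partial workaround for item 4: stock Darknet prints one "label: NN%" line per detection to stdout. Assuming this fork keeps that format (I have not verified it), the labels and confidences can be scraped without touching the C code. Box coordinates are not printed, so those still need the source change described above. The sample text below is made up for illustration.

```python
import re

# Matches lines such as "person: 87%"; timing lines do not match because
# the text after the colon is not a bare percentage.
LINE = re.compile(r"^\s*(?P<label>[^:]+):\s*(?P<pct>\d+)%\s*$")


def parse_detections(stdout_text):
    """Extract (label, confidence) pairs from 'label: NN%' lines."""
    detections = []
    for line in stdout_text.splitlines():
        match = LINE.match(line)
        if match:
            label = match.group("label").strip()
            detections.append((label, int(match.group("pct")) / 100.0))
    return detections


# Invented sample output, not captured from a real run.
sample = "image.jpg: Predicted in 1.02 seconds.\nperson: 87%\nteddy bear: 54%\n"
print(parse_detections(sample))
```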

Finally, the results. Note that I accelerated the video 5 times. The actual frame rate is about 1 frame per second.

YoloV3-tiny successfully detected the keyboard, banana, person (me), cup, and sometimes the sofa, car, etc. It took Curious George for a teddy bear all the time, probably because the COCO dataset does not have a category called "Curious George stuffed animal". It got confused by the old-fashioned calculator and sometimes recognized it as a laptop or a cell phone. But in general, I was very surprised by the results, and by the frame rate!

Tuesday, August 21, 2018

Deep Learning With Raspberry Pi - Installation

This is going to be the beginning of a series of posts about the fusion of deep learning and Raspberry Pi!

Deep learning has become a new world language in the last 5 years. With the latest developments in convolutional neural networks, LSTMs, attention models, GANs, and reinforcement learning, we see a promising trend of training models to do things that people used to believe only the human brain could master, for example writing a caption for an image, composing a piece of music, or driving a car. With millions of images or a large text corpus, a properly designed deep neural network can somehow be calibrated to “learn” a specific task without explicit programming. Normally, when people talk about training deep learning models, they talk about CUDA, GPU matrix operations, parallelization, massive memory requirements, etc.

Now, the most popular single-board computer/development kit/IoT board, the Raspberry Pi, even in its latest 3 Model B+ form (1.4GHz CPU, 1GB LPDDR2 RAM), does not have enough computation power to train any decent deep learning model. Forget about training. However, this does not mean that deep learning and Raspberry Pi are mutually exclusive. It is still possible to run a deep learning framework and a deep learning model on the Raspberry Pi. In fact, it is super fun, and probably also super useful, to run deep learning inference (the forward pass) on a Raspberry Pi. Imagine that your Pi Camera can now identify human beings and probably who they are, or issue an alert when a bunny is eating your garden, or recognize obstacles for a Pi-powered robot, or display camera frames in van Gogh style, or maybe just play endless Pi-composed jazz. A new world is enabled by Raspberry Pi + Deep Learning!

As a lazy person, I don’t want to reinvent the wheel. Given that there are well-established, robust deep learning libraries such as tensorflow and PyTorch, it makes sense to first try those libraries on the Pi. In this article, I will show how to install tensorflow and keras (a high-level API that runs on top of tensorflow) on a Raspberry Pi 3 Model B+ running Raspbian Stretch (version 9). I haven’t tested the workflow on other Raspberry Pi models or other Raspbian versions. However, my intuition tells me that the Pi 3 Model B or Raspbian Jessie should work the same way.

To proceed, you’ll need to understand basic Linux commands and Python programming, and know how to use a Raspberry Pi. You do not need to know deep learning; just treat it as a magic black box. I got a lot of help from this post:

1. Which version of Python? Python 2.7!

Raspbian comes with Python 2.7 and 3.5. Although I am a fan of Python 3 and tensorflow prefers Python 3, for the Pi I still highly recommend Python 2.7. The reason is that installing numpy, scipy, and opencv with Python 2.7 is so much easier and hassle-free! The last thing I want to do is to build scipy and opencv from source on the Pi. IT IS GOING TO TAKE FOREVER!

2. Installing prerequisite libraries

In order to install and run tensorflow and keras, you have to install numpy, scipy, and h5py. I also recommend installing OpenCV because, come on, we want to do image stuff with deep learning.

I highly recommend installing those libraries pre-compiled. Because the Pi is a slow computer, it might take anywhere from 10 minutes to 2 hours to install those libraries by compiling them from source on the Pi. And forget about installing OpenCV from source code on the Pi! Trust me, it is a painful process!

So how do you install pre-compiled libraries? There are two approaches, apt-get and pip:

pi:~ $ sudo apt-get install python-numpy python-scipy python-h5py python-opencv

pi:~ $ pip install numpy scipy h5py opencv-python

The second approach, most of the time, ends up downloading source packages and running setup.py for a long, long time. I think scipy took me more than 30 minutes and still failed for some reason. The first approach is easy and fast.

3. Install Tensorflow

I basically followed the official tensorflow website for this part. Some people said that they had to install an older version of tensorflow such as 1.0; however, I was able to install 1.9.0 and run it without a problem (well, there were some harmless warnings).

First, make sure that the libatlas library, a linear algebra library, is installed. Simply do

pi:~ $ sudo apt-get install libatlas-base-dev

Second, let’s install tensorflow. A simple pip install is likely to fail here. This is because tensorflow and some associated libraries take up more than 100MB, and by default Raspbian allocates only 100MB for swap. If you use pip install directly, it is highly likely that you will encounter memory errors. There are two ways to overcome this. One is to temporarily change the swap size, install tensorflow, and change the swap size back. This requires rebooting the Pi twice. An easier way, I believe, is to add an additional argument to pip install:

pi:~ $ pip install --no-cache-dir tensorflow

In this way, we are installing tensorflow without caching. No need to change the swap size.
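If you want to see how much swap your Pi actually has before deciding, /proc/meminfo reports it on Linux. The following is a minimal sketch; the helper name is made up for illustration.

```python
def swap_total_mb(meminfo_text):
    """Return SwapTotal in MB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1]) // 1024  # the value is in kB
    return None


try:
    with open("/proc/meminfo") as f:
        print(swap_total_mb(f.read()))
except IOError:
    print("no /proc/meminfo on this system")
```

On a default Raspbian install this should print a value close to 100.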

Installing tensorflow took a while, as for Python 2 we have to compile some libraries. Time for a cup of coffee.

4. Installing keras

This took me a while, because for some reason installing keras wants to recompile scipy, which always failed for me due to some dependency issues. Since I was sure that I had all the key libraries keras needs installed, I only wanted pip to install keras itself. I finally realized that I only need to tell pip install to ignore dependencies. To do this, simply type

pi:~ $ pip install keras==2.1.5 --no-cache-dir --no-deps

I didn’t test other keras versions, but I think newer versions should be fine.

5. Test that packages are all installed correctly.

As I said, there are some warnings. But, hooray!
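One way to run such a check is to try importing each package and print its version. This is a minimal sketch rather than the exact commands I typed; the module list simply mirrors the packages installed above, and the helper name is made up.

```python
import importlib


def check_imports(names):
    """Try to import each module; map name -> version, or None on failure."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except Exception:  # any failure to import counts as "not usable"
            report[name] = None
    return report


packages = ["numpy", "scipy", "h5py", "cv2", "tensorflow", "keras"]
for name, version in sorted(check_imports(packages).items()):
    print("%-12s %s" % (name, version if version else "NOT AVAILABLE"))
```

Warnings printed during the tensorflow import can be ignored as long as every package reports a version.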

6. Run a pre-trained model

Keras comes with many well-known pre-trained CNN models for image recognition. As a first try, I tested MobileNet, a small, lightweight CNN first introduced by Howard et al. at Google in April 2017. The idea behind MobileNet is that it is lightweight and simple enough to run on mobile devices.
To test it, I downloaded this image

from this website http://www.shadesofgreensafaris.net/images/uploads/mikumi.jpg by typing the following command in terminal
pi:~ $ curl http://www.shadesofgreensafaris.net/images/uploads/mikumi.jpg > image.jpg

Here is the python code
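A minimal sketch of what such a script might look like, assuming the MobileNet model bundled with Keras 2.1.5 and its ImageNet weights. This is a reconstruction for illustration, not necessarily the exact script I ran; the function names are made up.

```python
def classify(path, top=5):
    """Run MobileNet (ImageNet weights) on one image file.

    Returns a list of (class_id, label, score) tuples.
    """
    # Imports live inside the function so that the formatting helper
    # below can be used even where Keras is not installed.
    import numpy as np
    from keras.applications.mobilenet import (
        MobileNet, preprocess_input, decode_predictions)
    from keras.preprocessing import image

    model = MobileNet(weights="imagenet")     # downloads weights on first use
    img = image.load_img(path, target_size=(224, 224))
    x = np.expand_dims(image.img_to_array(img), axis=0)  # add batch axis
    preds = model.predict(preprocess_input(x))
    return decode_predictions(preds, top=top)[0]


def format_predictions(decoded):
    """Turn (class_id, label, score) tuples into printable lines."""
    return ["%d. %s (%.1f%%)" % (i + 1, label, 100.0 * score)
            for i, (_, label, score) in enumerate(decoded)]


# On the Pi, after downloading image.jpg:
#     for line in format_predictions(classify("image.jpg")):
#         print(line)
```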

And, the bottom of the outputs:

So MobileNet does recognize the impala correctly in its first guess. It took about 40 seconds to load the 4-million-parameter model, and only 3 seconds to make a prediction. Not bad!
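If you want to measure load and prediction times on your own Pi, a tiny timing helper is enough. This is a generic sketch with a made-up function name; wrap whichever call you want to time.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


# Stand-in workload; replace it with the model loading or predict call.
result, seconds = timed(sum, range(1000000))
print("took %.3f s" % seconds)
```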