Near-perfect YOLO3 Object Detection from scratch

I recently completed the Microsoft Professional Program in Artificial Intelligence and have been really impressed by some of the many computer vision examples I’ve seen. It’s a great course and if you are interested in AI I highly recommend it, along with the fantastic blog and training offered by Adrian Rosebrock at

A number of key technologies and platforms come up again and again as you learn AI: TensorFlow, CNTK, OpenCV and of course Keras. Once you start exploring computer vision, and specifically Convolutional Neural Networks (CNNs), you are bound to run into numerous examples of real-time object detection from video, whether it’s a car, person, dog or street sign. Most of these examples use a pre-built model, laboriously created to detect dozens or even thousands of classes of objects out of the box, and ready for you to use in your own projects with little to no effort.

That’s all great, but what if you wanted to detect something that is not included in the pre-built model? The solution lies in building and training your own from scratch, which is what I did for this post.

I’ve found YOLO3 to be really fantastic, and since I’m a Windows user my focus was on being able to build and train a model without having to struggle with code or tutorials designed for Linux. I found a pretty good set of scripts on GitHub and started off by getting it all running locally and training their example detector which detects raccoons.

Sometimes I use a laptop with an Intel HD5000 GPU, with PlaidML serving as the Keras backend in place of TensorFlow. This works well in most cases, but training a YOLO3 model needs a better setup, so I deployed an Azure Windows Server 2016 VM and loaded it with Python 3.6, TensorFlow and Keras.

The VM comes with 112GB of RAM and dual Nvidia K80 GPUs. It’s not cheap to operate, so I do all my prep work locally, making sure the model starts training without obvious errors, and then copy everything over to the VM for the training run.

For this post I decided that while raccoons are cool, rats would be more interesting. Rats are fast, come in all shapes, sizes and colours, and can unfortunately cause problems when not kept as pets. They nest, chew through electrical wiring, and cause havoc in agriculture and food manufacturing. They are also used for neuroscience research with the classic example being a rat running a maze.

Because of the speed at which they move and the ways they can contort their bodies, rats should, in theory, be pretty hard to detect and classify using a CNN. Let’s give it a try.

I started off by collecting 200 assorted images of rats and mice using both Google and Bing, then did the annotation using LabelImg as shown below.
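For readers who have not used LabelImg before: by default it saves one Pascal VOC style XML file per image, with one `object` element per bounding box. The sketch below reads such a file back into plain tuples. The sample XML, the file name `rat_001.jpg` and the helper name `parse_voc` are made up for illustration, and I am assuming the VOC output format rather than LabelImg's alternative YOLO text format.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, shaped like the Pascal VOC XML that LabelImg writes.
SAMPLE_XML = """
<annotation>
    <filename>rat_001.jpg</filename>
    <size><width>640</width><height>480</height><depth>3</depth></size>
    <object>
        <name>rodent</name>
        <bndbox><xmin>120</xmin><ymin>200</ymin><xmax>380</xmax><ymax>410</ymax></bndbox>
    </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text.strip())
    filename = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are pixel values; go through float() in case a tool wrote "120.0".
        coords = tuple(
            int(float(bndbox.findtext(key)))
            for key in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return filename, boxes


filename, boxes = parse_voc(SAMPLE_XML)
print(filename, boxes)
```

A quick pass like this over the annotation folder is a cheap way to catch misspelled labels or empty files before paying for GPU time.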


This presents the first major decision we need to make: do we include the tail in the annotation or not? We need to take a step back and think carefully about what we are trying to achieve.

  • We want to detect rats (and mice), and detecting their bodies or heads is good enough
  • Sometimes all you see is a tail, no body, and yet it’s still a rat!
  • Including the tail also introduces the visual environment around the tail, which could throw our training

Consider for a moment if our task was to build a model that detects both rats and earthworms. Suddenly a rat tail can (and likely will) be detected as an earthworm, or the other way around since they are both similar in shape and colour. I don’t really have an answer here, and I’ve opted to ignore tails completely, except for maybe a stump or an inch of the tail, no more. Let’s see how that works out. We don’t have a lot of training images so our options are limited.

I modified the config.json file as shown below to include our single class (rodent), generated the anchors as recommended, and updated them in the config file. I am not using the YOLO3 pre-trained weights file, as I want to train completely from scratch. (Tip: I did a run with pre-trained weights as a test and the results were disappointing.)

    {
        "model" : {
            "min_input_size":       128,
            "max_input_size":       872,
            "anchors":              [76,100, 94,201, 139,285, 188,127, 222,339, 234,225, 317,186, 323,281, 331,382],
            "labels":               ["rodent"]
        },

        "train": {
            "train_image_folder":   "C:/Users/xalvm/Documents/Projects/keras-yolo3/data/rodent_dataset/images/",
            "train_annot_folder":   "C:/Users/xalvm/Documents/Projects/keras-yolo3/data/rodent_dataset/anns/",
            "cache_name":           "rodent_train.pkl",
            "train_times":          10,
            "pretrained_weights":   "",
            "batch_size":           4,
            "learning_rate":        1e-4,
            "nb_epochs":            30,
            "warmup_epochs":        3,
            "ignore_thresh":        0.5,
            "gpus":                 "0,1",
            "grid_scales":          [1,1,1],
            "obj_scale":            5,
            "noobj_scale":          1,
            "xywh_scale":           1,
            "class_scale":          1,
            "tensorboard_dir":      "logs",
            "saved_weights_name":   "rodent.h5",
            "debug":                false
        },

        "valid": {
            "valid_image_folder":   "",
            "valid_annot_folder":   "",
            "cache_name":           "",
            "valid_times":          1
        }
    }

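The nine width,height pairs in the `anchors` entry come from clustering the annotated box sizes. I used the script that ships with the repository for this; the code below is not that script, only a minimal sketch of the usual idea: k-means over (width, height) pairs, using 1 - IoU as the distance so that large and small boxes are treated fairly. The box sizes, the function names and the choice of three clusters are all assumptions made to keep the example short.

```python
import random


def iou_wh(a, b):
    """IoU of two boxes given as (width, height), both anchored at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by width."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # start from k randomly chosen boxes
    assignment = None
    for _ in range(iters):
        # Assign each box to the centroid it overlaps most (highest IoU).
        new_assignment = [
            max(range(k), key=lambda j: iou_wh(box, centroids[j]))
            for box in boxes
        ]
        if new_assignment == assignment:
            break  # converged: no box changed cluster
        assignment = new_assignment
        # Move each centroid to the mean width/height of its members.
        for j in range(k):
            members = [box for box, a in zip(boxes, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(w for w, _ in members) / len(members),
                    sum(h for _, h in members) / len(members),
                )
    return sorted(centroids)


# Hypothetical box sizes in pixels: small, medium and large rodents.
sample_boxes = [
    (30, 40), (35, 45), (32, 38),
    (100, 120), (110, 115), (95, 130),
    (220, 300), (240, 280), (230, 310),
]
anchors = kmeans_anchors(sample_boxes, k=3)
print([(round(w), round(h)) for w, h in anchors])
```

For the real config you would run this with k=9 over every annotated box and paste the rounded result into `anchors` as a flat list.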

A typical training run in-progress is shown below, and I stopped the training at around 27 epochs since there was no loss reduction after epoch 24.
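I made that call by eye, watching the loss in the console. The same rule can be written down as a simple patience check, sketched below. The loss values are invented to mimic a run that stops improving after epoch 24, and `stop_epoch` is a hypothetical helper, not something taken from the training scripts.

```python
def stop_epoch(losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which to stop, or None if never triggered.

    Training stops once `patience` consecutive epochs fail to beat the
    best loss seen so far by more than `min_delta`.
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Invented losses: steady improvement for 24 epochs, then a plateau.
losses = [100 - 3 * i for i in range(24)] + [31.2, 31.1, 31.3]
print(stop_epoch(losses, patience=3))  # 27
```

Keras offers the same behaviour through its `EarlyStopping` callback, which saves you from babysitting a long run on an expensive VM.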


Using a sample video off YouTube I ran and viewed the results frame by frame, noticing some good results and a fair amount of missed predictions. The best way to improve prediction is with more training data, so back we go to Google and Bing for more images, and we also grab some frames from random rat videos for more annotation.

My resulting set now contains 560 annotated training images, which the script will split into a train/test set for me. With more training images come longer training runs, and the next run took 20 hours before I stopped it at epoch 30. This time the results were a lot more impressive.
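Because the `valid` folders in the config are left empty, the split happens automatically. A minimal sketch of such a split is shown below. The 80/20 ratio, the fixed seed and the file names are assumptions for illustration, so check the training script for the values it really uses.

```python
import random


def split_train_valid(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and cut it into (train, valid) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 560 annotated images.
images = [f"img_{i:03d}.jpg" for i in range(560)]
train, valid = split_train_valid(images)
print(len(train), len(valid))  # 448 112
```

A fixed seed matters here: if the split changed between runs, images used for validation in one run could leak into training in the next, and the loss figures would stop being comparable.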

There were still some failures, so let’s look at those first.


Here are three consecutive frames: in the first we have a hit, the second, nearly identical frame was missed, and the third again got a hit. This is quite bizarre, as our predictor works frame by frame. It does not see the video clip as a whole; it literally detects one frame at a time, and yet it failed on the middle frame.
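Since every frame is predicted on its own, these hit, miss, hit patterns are easy to find automatically, and the missed frames are good candidates to annotate for the next training round. The sketch below assumes you have already collected one boolean per frame saying whether anything was detected; the list and the function name are hypothetical.

```python
def isolated_misses(hits):
    """Indices of frames with no detection whose two neighbours both have one."""
    return [
        i
        for i in range(1, len(hits) - 1)
        if not hits[i] and hits[i - 1] and hits[i + 1]
    ]


# Hypothetical per-frame results: True means at least one rodent was detected.
hits = [True, False, True, True, False, False, True]
print(isolated_misses(hits))  # [1]
```

Only frame 1 is flagged here; the two-frame gap at indices 4 and 5 is left alone, since a longer gap may simply mean the animal left the shot.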


Again we see three frames where the first was missed, and we would assume the low quality of the frame is to blame. However, notice the following sequence:


Here we barely have the silhouette of a head appearing and yet we get a 98% probability on what is a small, very fuzzy image.


The final sequence above is quite impressive though, a good hit on what is no more than a ball of white fur. If you watch the full clip you will see a few more misses that should have been obvious, and then some pretty incredible hits.

All in all really impressive results, and we only had 560 training images.

Watch the clip here: (I removed 10 seconds from the clip to protect privacy)

YOLO3 Results

2 thoughts on “Near-perfect YOLO3 Object Detection from scratch”

    1. Pretty incredible result for such a small training set. I am also interested, whether the data available. I have tons of rat videos from my research that need processing and I was wondering can I use your code to start with?

