Gesture Controls for Robotics

In my previous post I built a dice detection library via OpenCV, the idea being that using a small camera I can detect the dice and maneuver a robotic arm to pick it up and move it around, sorting it by color. Well it turns out that was way too easy and a bit lame to take up a whole blog post. Suffice it to say it works unbelievably well.

Instead, I figured maybe I can train a model to recognize hand gestures and have the robotic arm respond to commands made via these gestures. Turns out that is fairly easy too but let’s do it anyway.

Hand gesture recognition is really, really hard. I started off with Haar cascades I found on the web, and some worked really well: palm and fist. However, I needed at least four gestures, and finding the remaining two turned out harder than expected. There are plenty of posts with photos showing it working, but for some reason recognizing an “okay” or “vickey” just failed for me.

Instead I pulled out my trusty multi-label Keras model I used previously for X-Ray detection. Using a few dozen video clips, with frames split out into folders, I managed to get together around 2000 training images: 500 for each gesture I want to respond to, in 4 different folders, one for each class. These are shown below.


We have flat palm for forward motion, flipped backhand for backward motion of the robotic arm, and then one each to open and close the claw for grabbing.

The Keras model training code in Python is shown below, a very simple model.

import numpy as np
import keras
from keras.preprocessing.image import img_to_array
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.callbacks import ModelCheckpoint
from imutils import paths
import random
import pickle
import cv2 as cv2
import os

# set our parameters
BatchSize = 32
Classes = 4
Epochs = 15 
InputShape = (64, 64, 3)
data = []
labels = []

print("Loading training data...")
imagePaths = sorted(list(paths.list_images('/home/gideon/Pictures/Hands')))

# loop over the input images
for imagePath in imagePaths:
    image = cv2.imread(imagePath)
    image = cv2.resize(image,(64,64)) # larger than 64x64 results in a model too big for RPi, this yields 86MB
    image = img_to_array(image)
    # augment the data here if required
    # rotate or swap on hor & ver axis

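    # train images are spread across four folders based on their classes;
    # the folder name (up to the first underscore) is used as the label
    label = imagePath.split(os.path.sep)[-2].split("_")[0]
    data.append(image)
    labels.append(label)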

data = np.array(data, dtype="float") / 255.0
labels = np.array(labels)
mlb = LabelBinarizer()
labels = mlb.fit_transform(labels)

# partition the data into training and test splits (80/20)
(x_train, x_test, y_train, y_test) = train_test_split(data, labels, test_size=0.20, random_state=42)

# construct our model
model = Sequential()
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', input_shape=InputShape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # flatten the feature maps so the dense layers receive a vector
model.add(Dense(128, activation='relu'))
model.add(Dense(Classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])

# train
model.fit(x_train, y_train, batch_size=BatchSize, epochs=Epochs, verbose=1, validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

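# save final model
model.save('hands.model')

# save the label binarizer so the prediction script can map outputs back to class names
f = open('hands.pickle', "wb")
f.write(pickle.dumps(mlb))
f.close()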



15 Epochs later we had really good results for accuracy and loss as shown here. The model is not going to give us bounding boxes, only a detected class, but that is good enough. If you want bounding boxes use Yolo3 instead.
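
The training loop above leaves augmentation as a comment ("rotate or swap on hor & ver axis"). A minimal sketch of what that step could look like with NumPy is shown here. The function name augment is hypothetical, and whether a flipped or rotated hand still means the same gesture is an assumption you would need to check for your own classes.

```python
import numpy as np

def augment(image):
    """Return the original image plus flipped and rotated copies.

    Assumes a square (height, width, channels) array, such as the
    64x64x3 arrays produced in the training loop.
    """
    return [
        image,
        np.fliplr(image),   # mirror on the vertical axis (left/right swap)
        np.flipud(image),   # mirror on the horizontal axis (top/bottom swap)
        np.rot90(image),    # rotate 90 degrees counter-clockwise
    ]
```

Each copy would be appended to data together with the same label, so the class balance stays the same while the training set grows fourfold.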


Assembling the robotic arm is much less enjoyable. For $20 you get a box of screws and some acrylic parts and instructions not even an IKEA engineer could make sense of. If you do buy one of these, make sure to center your servos prior to assembly. You do NOT want to disassemble this thing and start again, trust me. A sense of humor and patience truly is a virtue in this department.

If you are thinking of buying a robotic arm I highly recommend spending more and getting one that is aluminum, with 6 degrees of freedom, a decent claw, and preferably already assembled. Make sure the servos are high-quality with good torque too.

The servos run off 5 volts and ideally need 1500 to 2000 mA from a separate power supply. Connecting the data pins directly to your Raspberry Pi is not advised, so I built a small circuit to protect the Pi from any malfunctioning servo using four 100K resistors as shown below. You could use one of the more expensive servo drivers available as well. I opted to just make my own.


The final assembly with Pi and circuit board is shown below mounted on a heavy board. The arm moves really fast and makes a lot of noise, so make sure you add weight to the floor portion to keep things steady when it’s in motion.


Using the fantastic servoblaster library I wrote a couple of functions to control the arm movements, and then connected it all together with the trained model and image detection code.
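
The movement functions in the script below shell out through os.system for every command. As a rough alternative sketch, assuming the same "channel=widthus" command format used in this post, the device file can be written to directly, with the pulse width clamped first. The names servo_command and move_servo are hypothetical, and the 300 to 2500 microsecond limits are simply the extremes that appear in this post, not servoblaster settings.

```python
def servo_command(channel, pulse_us, min_us=300, max_us=2500):
    """Build one servoblaster command line, e.g. channel 3 at 2000us.

    min_us and max_us are assumed safety limits for this arm.
    """
    clamped = max(min_us, min(max_us, int(pulse_us)))
    return "{}={}us\n".format(channel, clamped)


def move_servo(channel, pulse_us, device="/dev/servoblaster"):
    """Write a single command to the servoblaster device file."""
    with open(device, "w") as dev:
        dev.write(servo_command(channel, pulse_us))
```

With this in place, ClawOpen() would reduce to move_servo(3, 2000), and an out-of-range value can never reach the servo.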

My model works off 64×64 input images which keeps the final model under 90MB. Bigger than that and it won’t run on the Pi. If you want to use Yolo3 instead, tiny-yolo is the way to go for sure.
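
To see why input size matters so much, here is a back-of-the-envelope parameter count for the network described in this post. It is a sketch under stated assumptions: two 3x3 convolutions with 64 filters and no padding, a 2x2 max pool, a flatten step, a 128-unit dense layer and a 4-class output, with 32-bit weights. The helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    # weights plus one bias per output filter
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    # weights plus one bias per output unit
    return n_in * n_out + n_out

def model_params(side=64, channels=3, classes=4):
    total = conv_params(3, channels, 64)
    side -= 2                      # 3x3 convolution without padding
    total += conv_params(3, 64, 64)
    side -= 2
    side //= 2                     # 2x2 max pooling
    flat = side * side * 64        # flattened feature vector
    total += dense_params(flat, 128)
    total += dense_params(128, classes)
    return total

for side in (64, 128):
    params = model_params(side)
    print(side, params, round(params * 4 / 1e6, 1), "MB of float32 weights")
```

Almost all of the parameters sit in the first dense layer, so doubling the input side roughly quadruples the model. The file on disk can also be larger than the raw weights, since Keras stores optimizer state alongside the model by default. If Adadelta keeps two accumulators per weight, that would put the 64×64 model in the range of the size quoted above.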

from keras.preprocessing.image import img_to_array
from keras.models import load_model
import numpy as np
import cv2 as cv2
import pickle
import time
import os

model = load_model("hands.model")
mlb = pickle.loads(open("hands.pickle", "rb").read())

state_open = False
state_forward = False

# robotic arm movement functions
def ClawOpen():
  os.system("echo 3=2000us > /dev/servoblaster")

def ClawClose():
  os.system("echo 3=500us > /dev/servoblaster")

def ArmForward():
  os.system("echo 4=2000us > /dev/servoblaster")

def ArmBack():
  os.system("echo 4=1100us > /dev/servoblaster")

def ArmMiddle():
  os.system("echo 4=1400us > /dev/servoblaster")

def ArmUp():
  os.system("echo 0=2000us > /dev/servoblaster")

def ArmDown():
  os.system("echo 0=300us > /dev/servoblaster")

def BaseMiddle():
  os.system("echo 1=1300us > /dev/servoblaster")

def BaseLeft():
  os.system("echo 1=2500us > /dev/servoblaster")

def BaseLeftHalf():
  os.system("echo 1=1900us > /dev/servoblaster")

def BaseRight():
  os.system("echo 1=500us > /dev/servoblaster")

def BaseRightHalf():
  os.system("echo 1=900us > /dev/servoblaster")

# Init arm to default starting position and start video capture
video_capture =  cv2.VideoCapture(0)

while True:
    ret, frame = video_capture.read()
    if ret == True:
        image = cv2.resize(frame, (64, 64))
        image = img_to_array(image)
        image = image.astype("float") / 255.0
        image = np.expand_dims(image, axis=0)
        proba = model.predict(image)[0]
        idxs = np.argsort(proba)[::-1][:2]

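        state_changed = False
        for (i, j) in enumerate(idxs):
            if ((proba[j] * 100) > 90.00): # 90% or higher certainty before we react
                detected = mlb.classes_[j]
                # assumed wiring: a gesture flips its flag only when it differs from the current state
                if (detected == "close"):
                    if (state_open==True):
                        state_open = False
                        state_changed = True
                if (detected == "open"):
                    if (state_open==False):
                        state_open = True
                        state_changed = True
                if (detected =="forward"):
                    if (state_forward==False):
                        state_forward = True
                        state_changed = True
                if (detected =="back"):
                    if (state_forward==True):
                        state_forward = False
                        state_changed = True
            break # only care about the top prediction

        # assumed mapping from the state flags to the movement functions defined above
        if (state_forward==True):
            if (state_changed==True):
                ArmForward()
        if (state_forward==False):
            if (state_changed==True):
                ArmBack()
        if (state_open==True):
            if (state_changed==True):
                ClawOpen()
        if (state_open==False):
            if (state_changed==True):
                ClawClose()

        # display current state on lcd as a sanity check
        state = ("forward" if state_forward else "back") + " / " + ("open" if state_open else "closed")
        cv2.putText(frame, state, (10, (i * 30) + 25), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

        cv2.imshow("Output", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

video_capture.release()
cv2.destroyAllWindows()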



So the idea is that using gestures you should be able to perform basic movements and grab hold of something, like a small plastic bottle. Keep in mind this is a $20 robotic arm so don’t expect it to lift anything heavier than a dry teabag (actually part of their online demo, which now makes perfect sense).
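
The decision logic in the prediction loop can also be written as a small pure function, which makes it easy to test without a camera or an arm attached. This is a sketch rather than the code that ran on the Pi: the function name apply_gesture and the dictionary layout are hypothetical, while the four labels and the 90% threshold come from the script above.

```python
def apply_gesture(state, label, confidence, threshold=0.90):
    """Return (new_state, changed) for one top prediction.

    state is a dict with boolean keys 'open' and 'forward'.
    Predictions at or below the threshold leave the state untouched.
    """
    new_state = dict(state)
    if confidence > threshold:
        if label == "open":
            new_state["open"] = True
        elif label == "close":
            new_state["open"] = False
        elif label == "forward":
            new_state["forward"] = True
        elif label == "back":
            new_state["forward"] = False
    return new_state, new_state != state
```

The arm only needs to move when changed is True, which avoids resending the same servo command on every frame while a gesture is held.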

To see the system in action I’ve uploaded a 6 second clip over here: YouTube

It’s very basic, but considering the simplicity of the model and the low cost of the parts, this could make a nice attraction at a trade show, if nothing else.


From AI Model to Production in Azure

Problem Description (courtesy of

When a patient has a CT scan taken, a special device uses X-rays to take measurements from a variety of angles which are then computationally reconstructed into a 3D matrix of intensity values. Each layer of the matrix shows one very thin “slice” of the patient’s body.

This data is saved in an industry-standard format known as DICOM, which saves the image matrix in a set binary format and then wraps this data with a huge variety of metadata tags.

Some of these fields (e.g. hardware manufacturer, device serial number, voltage) are usually correct because they are automatically read from hardware and software settings.

The problem is that many important fields must be added manually by the technician and are therefore subject to human error factors like confusion, fatigue, loss of situational awareness, and simple typos.

A doctor scrutinising image data will usually be able to detect incorrect metadata, but in an era when more and more diagnoses are being carried out by computers it is becoming increasingly important that patient record data is as accurate as possible.

This is where Artificial Intelligence comes in. We want to improve the error checking for one single but incredibly important value: a field known as Image Orientation (Patient) which indicates the 3D orientation of the patient’s body in the image.
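
For readers who have not met this field before: Image Orientation (Patient) holds six numbers, the direction cosines of the image's first row and first column relative to the patient. The sketch below only illustrates that format by deriving the slice normal and the dominant anatomical plane. It is not part of the challenge pipeline, and the function names are hypothetical.

```python
def slice_normal(iop):
    """Cross product of the row and column direction cosines."""
    rx, ry, rz, cx, cy, cz = iop
    return (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)

def dominant_plane(iop):
    """Name the plane whose normal is closest to the slice normal."""
    normal = slice_normal(iop)
    axis = max(range(3), key=lambda i: abs(normal[i]))
    return ("sagittal", "coronal", "axial")[axis]

# a standard axial slice: rows run along x, columns run along y
print(dominant_plane([1, 0, 0, 0, 1, 0]))
```

A wrong value in this field silently changes how every downstream tool interprets left, right, front and back, which is why it is worth checking against the pixels.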

For this challenge we’re given 20,000 CT scan images, sized 64×64 pixels and labelled correctly for training. The basic premise is given an image, the AI model needs to predict the correct orientation as explained graphically below. The red arrow shows the location of the spine, which our AI model needs to find to figure out the image orientation.


We’ll use Tensorflow and Keras to build and train an AI model in Python and validate against another 20,000 unlabelled images. The pipeline I used had three parts to it, but the core is shown in Python below and achieved 99.98% accuracy on the validation set. The second and third parts (not shown) pushed this to 100%, landing me a #6 ranking on the leader board. A preview of the 20,000 sample training images is shown below.


Our model in Python:

(x_train, x_test, y_train, y_test) = train_test_split(data, labels, test_size=0.15, random_state=42)

# construct our model
model = Sequential()
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', input_shape=InputShape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # flatten the feature maps so the dense layers receive a vector
model.add(Dense(128, activation='relu'))
model.add(Dense(Classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])

checkpoint = ModelCheckpoint("model.h5", monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# start training
model.fit(x_train, y_train, batch_size=BatchSize, epochs=Epochs, verbose=1, validation_data=(x_test, y_test), callbacks=callbacks_list)
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# save the model and multi-label binarizer to disk
model.save('capstone.model')
f = open('capstone.pickle', "wb")
f.write(pickle.dumps(mlb))
f.close()


I split the sample images into four folders according to their labels and I used ZERO, ONE, TWO and THREE as the class labels. So, given a test image the model will do a prediction and return one of those class labels to assign.
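
The mapping from class label to a human-readable orientation can be kept in one place as a dictionary. This is a small sketch using the four descriptions from the server script in this post; the names ORIENTATIONS and describe are hypothetical.

```python
ORIENTATIONS = {
    "ZERO": "Spine at bottom, patient facing up.",
    "ONE": "Spine at right, patient facing left.",
    "TWO": "Spine at top, patient facing down.",
    "THREE": "Spine at left, patient facing right.",
}

def describe(label):
    """Translate a predicted class label, falling back to 'None'."""
    return ORIENTATIONS.get(label, "None")
```

A lookup table like this avoids a chain of if statements and makes it obvious when a label has no description.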

First things first, we’ll construct our model and start the training. On my dual-K80 GPU server this took about an hour. The model is saved at various stages, and once we are happy with the accuracy we’ll save the resulting model and pickle file (capstone.model & capstone.pickle in the code).

To deploy this as an API in Azure we’ll create a new web app with default Azure settings. Once deployed, we’ll add the Python 3.6 extension. Switch to the console mode and use pip to install any additional libraries we need, including Flask, OpenCV, Tensorflow and Keras. Modify the web.config to look like the one shown below. Note that our Python server script will be named

<configuration>
  <appSettings>
    <add key="PYTHONPATH" value="D:\home\site\wwwroot"/>
    <add key="WSGI_HANDLER" value=""/>
    <add key="WSGI_LOG" value="D:\home\LogFiles\wfastcgi.log"/>
  </appSettings>
  <system.webServer>
    <handlers>
      <add name="PythonHandler" path="*" verb="*" modules="FastCgiModule" scriptProcessor="D:\home\Python364x64\python.exe|D:\home\Python364x64\" resourceType="Unspecified" requireAccess="Script"/>
    </handlers>
  </system.webServer>
</configuration>


Our Python script:

import numpy as np
from keras.preprocessing.image import img_to_array
from keras.applications import imagenet_utils
from keras.models import load_model
import cv2
import flask
import io
import pickle

app = flask.Flask(__name__)

model = load_model("capstone.model")
mlb = pickle.loads(open('capstone.pickle', "rb").read())

def _grab_image(stream=None):
    image = None
    if stream is not None:
        # read the uploaded file's raw bytes and decode them into a BGR image
        data = stream.read()
        image = np.asarray(bytearray(data), dtype="uint8")
        image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

@app.route("/predict", methods=["POST"])
def predict():
    data = {"success": False, "label":"None"}

    if flask.request.method == "POST":
        if flask.request.files.get('image'):
            image = _grab_image(stream=flask.request.files["image"])
            image = image.astype("float") / 255.0
            image = img_to_array(image)
            image = np.expand_dims(image, axis=0)
            proba = model.predict(image)[0]
            idxs = np.argsort(proba)[::-1][:2]
            label = mlb.classes_[idxs[0]]
            if label == "ZERO":
                label = "Spine at bottom, patient facing up."
            if label == "ONE":
                label = "Spine at right, patient facing left."
            if label == "TWO":
                label = "Spine at top, patient facing down."
            if label == "THREE":
                label = "Spine at left, patient facing right."
            data["label"] = label
            data["success"] = True

    return flask.jsonify(data)

if __name__ == "__main__":
    app.run()


Using your FTP tool of choice, upload the script, along with capstone.model and capstone.pickle, into the D:\home\site\wwwroot folder. Restart the web app from within Azure.

We can test our API using Postman, or the C# script shown below, which takes a sample image and performs a prediction.

using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace CallPythonAPI
{
    class Program
    {
        private static readonly HttpClient client = new HttpClient();

        static void Main(string[] args)
        {
            string responsePayload = Upload().GetAwaiter().GetResult();
            Console.WriteLine(responsePayload);
        }

        private static async Task<string> Upload()
        {
            var request = new HttpRequestMessage(HttpMethod.Post, "");
            var content = new MultipartFormDataContent();
            byte[] byteArray = System.IO.File.ReadAllBytes("20.png");
            content.Add(new ByteArrayContent(byteArray), "image", "20.png");
            request.Content = content;
            var response = await client.SendAsync(request);
            return await response.Content.ReadAsStringAsync();
        }
    }
}


Our sample image looks like this:


Running the prediction on this image yields the following result:


That’s it. We can incorporate the API call into a web site, desktop client app or even a Raspberry Pi device, since all the heavy lifting is done on the server-side.
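
For a Python client, such as one running on a Raspberry Pi, the same multipart upload can be sketched with only the standard library. The "/predict" route and the "image" field name come from the server script above; the function names are hypothetical, and the URL in the usage note is a placeholder, not the real endpoint.

```python
import os
import urllib.request
import uuid

def build_multipart(field, filename, payload, boundary=None):
    """Return (body, content_type) for a single-file multipart/form-data upload."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{f}"; filename="{n}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).format(b=boundary, f=field, n=filename).encode("utf-8")
    tail = "\r\n--{}--\r\n".format(boundary).encode("utf-8")
    return head + payload + tail, "multipart/form-data; boundary=" + boundary

def predict(url, image_path):
    """POST an image file to the prediction API and return the raw JSON text."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart(
            "image", os.path.basename(image_path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Usage would look like predict("https://your-app.example/predict", "20.png"), with the host replaced by your own web app's address.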

Near-perfect YOLO3 Object Detection from scratch

I recently completed the Microsoft Professional Program in Artificial Intelligence and have been really impressed by some of the many computer vision examples I’ve seen. It’s a great course and if you are interested in AI I highly recommend it, along with the fantastic blog and training offered by Adrian Rosebrock at

There are a number of key technologies and platforms that will continuously come up in AI as you learn: Tensorflow, CNTK, OpenCV and of course Keras. Once you start exploring computer vision, and specifically Convolutional Neural Networks, you are bound to run into numerous examples of real-time object detection from video, whether it’s a car, person, dog or street sign. Most of these examples will use a pre-built model, laboriously created to detect dozens or even thousands of classes of objects out of the box, and ready for you to use in your own models with little to no effort required.

That’s all great, but what if you wanted to detect something that is not included in the pre-built model? The solution lies in building and training your own from scratch, which is what I did for this post.

I’ve found YOLO3 to be really fantastic, and since I’m a Windows user my focus was on being able to build and train a model without having to struggle with code or tutorials designed for Linux. I found a pretty good set of scripts on GitHub and started off by getting it all running locally and training their example detector which detects raccoons.

Sometimes I use a laptop with Intel HD5000 GPU and PlaidML sitting between Keras and Tensorflow. This works well in most cases but for training a YOLO3 model you’ll need a better setup, and I used an Azure Windows 2016 Server VM I deployed and loaded it with Python 3.6, Tensorflow and Keras.

The VM comes with 112GB of RAM and dual Nvidia K80 GPUs. It’s not cheap to operate, so I do all my prep work locally, making sure the model starts training without obvious errors, and then copy it all over to the VM for the training run.

For this post I decided that while raccoons are cool, rats would be more interesting. Rats are fast, come in all shapes, sizes and colours, and can unfortunately cause problems when not kept as pets. They nest, chew through electrical wiring, and cause havoc in agriculture and food manufacturing. They are also used for neuroscience research with the classic example being a rat running a maze.

Because of the speed at which they move and the ways they can contort their bodies, rats should, in theory, be pretty hard to detect and classify using a CNN. Let’s give it a try.

I started off by collecting 200 assorted images of rats and mice using both Google and Bing, then did the annotation using LabelImg as shown below.


This presents the first major decision we need to make. Do we include the tail in the annotation or not? So, we need to take a step back and think carefully what it is we are trying to achieve.

  • We want to detect rats (and mice), and detecting their bodies or heads is good enough
  • Sometimes all you see is a tail, no body, and yet it’s still a rat!
  • Including the tail also introduces the visual environment around the tail, which could throw our training

Consider for a moment if our task was to build a model that detects both rats and earthworms. Suddenly a rat tail can (and likely will) be detected as an earthworm, or the other way around since they are both similar in shape and colour. I don’t really have an answer here, and I’ve opted to ignore tails completely, except for maybe a stump or an inch of the tail, no more. Let’s see how that works out. We don’t have a lot of training images so our options are limited.
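
Whatever you decide about tails, it helps to be able to inspect the boxes you drew. LabelImg saves Pascal VOC XML by default, and a minimal sketch for reading the boxes back with the standard library is shown here. The function name read_boxes is hypothetical and the sample annotation is made up for illustration.

```python
import xml.etree.ElementTree as ET

def read_boxes(xml_text):
    """Return a list of labelled bounding boxes from one VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        boxes.append({
            "label": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return boxes

# made-up annotation in the VOC layout
SAMPLE = """<annotation>
  <filename>rat_001.jpg</filename>
  <object>
    <name>rodent</name>
    <bndbox><xmin>48</xmin><ymin>60</ymin><xmax>310</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""

print(read_boxes(SAMPLE))
```

Looping this over the annotation folder gives a quick way to spot boxes that are suspiciously long and thin, which is usually a tail that slipped in.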

I modified the config.json file as shown below to include our single class (rodent) and generated the anchors as recommended and changed that in the config file. I am not using the YOLO3 pre-trained weights file as I want to train from scratch completely. (Tip: I did a run with pre-trained weights as a test and the results were disappointing.)

{
    "model" : {
        "min_input_size":       128,
        "max_input_size":       872,
        "anchors":              [76,100, 94,201, 139,285, 188,127, 222,339, 234,225, 317,186, 323,281, 331,382],
        "labels":               ["rodent"]
    },

    "train": {
        "train_image_folder":   "C:/Users/xalvm/Documents/Projects/keras-yolo3/data/rodent_dataset/images/",
        "train_annot_folder":   "C:/Users/xalvm/Documents/Projects/keras-yolo3/data/rodent_dataset/anns/",
        "cache_name":           "rodent_train.pkl",
        "train_times":          10,
        "pretrained_weights":   "",
        "batch_size":           4,
        "learning_rate":        1e-4,
        "nb_epochs":            30,
        "warmup_epochs":        3,
        "ignore_thresh":        0.5,
        "gpus":                 "0,1",
        "grid_scales":          [1,1,1],
        "obj_scale":            5,
        "noobj_scale":          1,
        "xywh_scale":           1,
        "class_scale":          1,
        "tensorboard_dir":      "logs",
        "saved_weights_name":   "rodent.h5",
        "debug":                false
    },

    "valid": {
        "valid_image_folder":   "",
        "valid_annot_folder":   "",
        "cache_name":           "",
        "valid_times":          1
    }
}

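
The nine width/height pairs in the anchors list come from clustering the annotated box sizes. The sketch below shows the general idea, k-means with IoU as the similarity measure, in plain Python. It is an illustration under my own assumptions and not the anchor script from the GitHub project; the function names and the fixed seed are hypothetical.

```python
import random

def iou_wh(box, anchor):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iterations=50, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # assign each box to the anchor it overlaps most
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                updated.append((w, h))
            else:
                updated.append(anchors[i])  # keep an anchor that attracted no boxes
        if updated == anchors:
            break
        anchors = updated
    return sorted(anchors, key=lambda a: a[0] * a[1])
```

With one class and a few hundred boxes this runs in well under a second, so it is cheap to regenerate the anchors whenever the training set grows.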

A typical training run in-progress is shown below, and I stopped the training at around 27 epochs since there was no loss reduction after epoch 24.
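
I stopped this run by hand, but the same rule is easy to express in code. The sketch below is a plain-Python illustration of a patience-based stop, with a hypothetical function name and a patience of 3 chosen to match the numbers above. Keras offers an EarlyStopping callback with monitor and patience arguments that does this during training.

```python
def stop_epoch(losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# made-up loss curve: improves until epoch 24, then flat
losses = [30 - i for i in range(24)] + [7, 7, 7]
print(stop_epoch(losses))
```

On a VM billed by the hour, an automatic stop like this saves paying for epochs that no longer improve the model.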


Using a sample video off YouTube I ran and viewed the results frame by frame, noticing some good results and a fair amount of missed predictions. The best way to improve prediction is with more training data, so back we go to Google and Bing for more images, and we also grab some frames from random rat videos for more annotation.

My resulting set now contains 560 annotated training images, which the script will split into a train/test set for me. With more training images come longer training runs, and the next run took 20 hours before I stopped it at Epoch 30. This time the results were a lot more impressive.
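
The split itself is nothing more than a shuffle and a cut. Here is a minimal sketch, assuming an 80/20 ratio and a fixed seed so the split is repeatable; both are my assumptions for illustration, and the function name is hypothetical.

```python
import random

def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items and cut it into (train, valid) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train, valid = split_dataset(range(560))
print(len(train), len(valid))
```

With 560 images that leaves 112 for validation, which is enough to see a trend but still small enough that a handful of odd frames can move the score.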

There were still some failures, so let’s look at those first.


Here are three consecutive frames: the first is a hit, the second, nearly identical frame was missed, and the third is a hit again. This is quite bizarre, as our predictor works frame by frame. It’s not seeing the video clip as a whole; it literally detects frame by frame, and yet the middle frame failed.
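
Because each frame is judged on its own, a single dropped frame like this can be papered over after the fact. The sketch below is one simple way to do it and was not part of my pipeline: a detection is held for a fixed number of frames after the last real hit. The function name and the default of one held frame are assumptions for illustration.

```python
def hold_detections(hits, hold=1):
    """Keep a detection alive for up to `hold` frames after the last hit.

    hits is a sequence of per-frame booleans (True means detected).
    """
    smoothed = []
    since_hit = None  # frames since the last real hit, None before any hit
    for hit in hits:
        if hit:
            since_hit = 0
        elif since_hit is not None:
            since_hit += 1
        smoothed.append(since_hit is not None and since_hit <= hold)
    return smoothed

# hit, miss, hit: the middle frame is filled in
print(hold_detections([True, False, True]))
```

The trade-off is that a box lingers for one extra frame after the animal really has left the shot, which is usually acceptable for video.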


Again we see three frames where the first was missed, and we would assume the low quality of the frame is to blame. However, notice the following sequence:


Here we barely have the silhouette of a head appearing and yet we get a 98% probability on what is a small, very fuzzy image.


The final sequence above is quite impressive though, a good hit on what is no more than a ball of white fur. If you watch the full clip you will see a few more misses that should have been obvious, and then some pretty incredible hits.

All in all really impressive results, and we only had 560 training images.

Watch the clip here: (I removed 10 seconds from the clip to protect privacy)

YOLO3 Results