Object detection using ML with Turi Create and augmentation using ARKit

March 14, 2018

Suyash Gupta

Contributor

Introduction

Over the past few years, the use of Machine Learning to solve complex problems has been increasing. Machine learning (ML) is a field of computer science that gives computer systems the ability to “learn” (i.e. progressively improve performance on a specific task) with data, without being explicitly programmed.

Last year was a good year for the freedom of information, as titans of the industry Google, Microsoft, Facebook, Amazon, Apple and even Baidu open-sourced their ML frameworks. In this blog, let’s explore a framework provided by Apple named Turi Create.

Turi Create

At WWDC 2017, Apple presented machine learning sessions built around the Core ML framework. However, those sessions assumed that developers already had Machine Learning knowledge. To simplify the development of custom Machine Learning models, Apple released Turi Create in December 2017. With Turi Create, you don’t have to be a Machine Learning expert to add recommendations, object detection or activity classification to your iOS app. Turi Create focuses on tasks instead of algorithms, and with it we can tackle a number of scenarios:

Recommender systems: Allows you to provide personalized recommendations to users. With this toolkit, you can create a model based on past interaction data and use that model to make recommendations.
Image classification: Given an image, assigns it to one of a predetermined set of labels or categories.
Image similarity: Given an image, finds similar images.
Object detection: The task of simultaneously classifying (what) and localizing (where) object instances in an image.
Activity classifier: The task of identifying a pre-defined set of physical actions using motion-sensory inputs. Such sensors include accelerometers, gyroscopes, thermostats, and many more found in most handheld devices today.
Text classifier: Text classification, commonly used in tasks such as sentiment analysis, refers to the use of Natural Language Processing (NLP) techniques to extract subjective information such as the polarity of the text. For example, whether or not the author is speaking positively or negatively about a particular topic.

You can also work with essential Machine Learning models organized into algorithm-based toolkits, such as classifiers, regression, clustering, nearest neighbors, graph analytics and topic models.

Moving on, let’s talk a bit about Augmented Reality technology.

Augmented Reality (AR)

AR is a technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view. It is used to enhance natural environments or situations and to offer perceptually enriched experiences. The first functional AR systems that provided immersive mixed reality experiences for users were invented in the early 1990s, starting with the Virtual Fixtures system developed at the U.S. Air Force’s Armstrong Labs in 1992. Check out the image below of an AR implementation in a native app, which shows directions to the check-in counter of an airport.

Case Study

The purpose of this blog is to detect an object with the help of a model file generated by Turi Create and then show details related to that object in an AR view. I decided to run this experiment inside a conference room in my organization. Luckily, the rooms here are named after computer scientists, and the room I used goes by the name ‘Donald Knuth’. Once the object is detected, the AR view shows a short profile of the computer scientist: name, photo and description.

As for the object to identify, the room has paintings that are unique to it. I selected one of them for this experiment:

To summarize, we will create an iOS app that scans the AR view for the presence of the unique object (the painting). Once it is detected, we will show the details related to that object, i.e. the room in which it is placed. The details are the room name, a description and a picture, all shown in the AR view.

Prerequisites

Sample images: Turi Create needs sample pictures of the object in order to generate the Machine Learning model. According to the official Turi Create documentation, we need 30-200 sample images of the object in different contexts, from a variety of angles and scales, under different lighting conditions, etc. The more samples we have, the better the detection is likely to be. I took 30 images of the painting from different angles and distances using my iPhone 7. Some of the samples are shown below:

Bounding box details: Each sample image should have the complete painting in view, and we need to know where the painting sits inside the image. In other words, we need the bounding box of the painting inside each sample image. I used GIMP to work out the bounding box dimensions for my 30 sample images. The following image shows the details:

The bounding box details are required for every sample image. Collecting them by hand is time-consuming, especially with more than 100 images. To reduce this manual work, we can use an annotation tool that lets us mark the bounding boxes quickly. I found a list of annotation tools here, and for a quick start you can go for this simple image annotator tool here. It also provides its output in CSV format.

Using the bounding box, we need the following 4 values for the painting in each sample image:

width of the bounding box
height of the bounding box
x-center and y-center of the bounding box relative to the sample image
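
As a minimal sketch of how these four values can be derived, assume the image editor reports the box as a top-left corner plus a width and height in pixels. The function name and the sample numbers below are hypothetical; the output follows the annotation dictionary layout (a label plus centre-based coordinates) used by Turi Create's object detector.

```python
def to_turi_annotation(label, left, top, width, height):
    """Convert a top-left-anchored box (in pixels) to a centre-based annotation."""
    return {
        "label": label,
        "coordinates": {
            "x": left + width / 2.0,   # x-center of the box
            "y": top + height / 2.0,   # y-center of the box
            "width": width,
            "height": height,
        },
    }


# Hypothetical measurements for one sample photo of the painting.
annotation = to_turi_annotation("painting", left=1200, top=800, width=1000, height=600)
print(annotation["coordinates"])
```

Each sample image gets a list of such dictionaries, one per object instance visible in it.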

Once done, we need to create a file named build.py, which can be written in any text editor. But first, let’s set up Turi Create.

Setup and usage of Turi Create

Let’s get started with the setup of Turi Create. The prerequisites are:

A computer with a 64-bit processor and Python v2.7.x installed. You can install Python from here.
ARKit for developing an iOS app
A Mac with Xcode (v9.2).
Install Turi Create by executing the following command in Terminal

https://gist.github.com/suyashgupta25/47ce27b8821ea91bf9774c25cf11ebcd

NOTE: Apple recommends installing Turi Create inside a virtualenv. Since I was only trying the tool for the first time, I installed it directly, but I would recommend using virtualenv too.

Now let’s start with its usage.

Create a folder named ‘MLModel’.
Create a file named build.py inside the MLModel folder.
Create a folder named ‘images’ inside the MLModel folder and put all the sample images in it.
Add the following code snippet to build.py (read the comment before each line of code for a step-by-step explanation).

https://gist.github.com/suyashgupta25/10d755bd728eeac1a2771bd2dd22f630
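
For orientation, here is a minimal sketch of what such a build.py could look like. It is not necessarily identical to the gist above. The file name IMG_0001.jpg, the box values and the helper name build_model are hypothetical, and the sketch assumes the annotations are kept in a plain dictionary keyed by image file name.

```python
import os

# Hypothetical annotations: image file name -> list of labelled boxes.
# Coordinates are the box centre (x, y) plus width and height, in pixels.
ANNOTATIONS = {
    "IMG_0001.jpg": [
        {"label": "painting",
         "coordinates": {"x": 1700, "y": 1100, "width": 1000, "height": 600}},
    ],
    # Add one entry per sample image here.
}


def build_model(images_dir="images", annotations=ANNOTATIONS,
                output_path="Painting.mlmodel", max_iterations=1000):
    # Imported lazily so the file can be read without Turi Create installed.
    import turicreate as tc

    # Load every image in the folder; with_path=True adds a 'path' column.
    data = tc.image_analysis.load_images(images_dir, with_path=True)

    # Attach the annotation list for each image, looked up by file name.
    data["annotations"] = data["path"].apply(
        lambda path: annotations[os.path.basename(path)])

    # Keep roughly 80% of the images for training and the rest for testing.
    train_data, test_data = data.random_split(0.8)

    # Train the object detector on the annotated training images.
    model = tc.object_detector.create(
        train_data, feature="image", annotations="annotations",
        max_iterations=max_iterations)

    # Print the evaluation metrics computed on the held-out images.
    print(model.evaluate(test_data))

    # Export the Core ML model file that the iOS app will load.
    model.export_coreml(output_path)
    return model
```

Calling build_model() from inside the MLModel folder would then train and export the model. Every image in the images folder needs an entry in the annotations dictionary, otherwise the lookup fails.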

After creating the build.py file with the above code snippet, open Terminal, navigate to the directory that contains build.py (using the cd command) and execute the command below:

https://gist.github.com/suyashgupta25/20909bea6e389fb5bba74dcfa31a4928

This command starts the building process and exports the Machine Learning model file used by the iOS App for object detection. The key parameters for the creation of the model file are:

Number of images used: 30
Train data and test data split ratio: 0.8 (training data: 24, test data: 6)
Number of iterations using train data: 1000 (a higher max_iterations value generally improves the accuracy of the model, at the cost of a longer training time)
Size of end product (‘Painting.mlmodel’ file): ~ 65 MB
Time taken to complete the processing: ~3 hours
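
The 24/6 figures follow directly from the 0.8 ratio. As a rough illustration in plain Python, an exact split can be sketched as below. The helper split_dataset and the file names are hypothetical. This is not Turi Create's random_split, which samples rows randomly and so only approximates the requested ratio.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items and cut it at train_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 30 hypothetical image file names standing in for the sample photos.
images = ["IMG_%04d.jpg" % i for i in range(1, 31)]
train, test = split_dataset(images)
print(len(train))
print(len(test))
```

With only 6 test images, the evaluation metrics are a coarse indication at best, which is one more reason to collect a larger dataset.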

Setup of Xcode project

Let’s start by creating an Xcode project named RoomDetector with the default settings. ARKit is available from iOS SDK v11.0 onward and, apart from camera access, requires no specific permission. Drag and drop your Painting.mlmodel file into your project as shown below:

The above project, including all the assets used to develop it, is available on GitHub.

To use the ML model file for object detection, first import the CoreML and Vision frameworks into your UIViewController and then create a VNCoreMLModel:

https://gist.github.com/suyashgupta25/710fb4498656c7521ab9a10898cc61f6

VNCoreMLModel is a container for a Core ML model used with Vision requests. The Vision framework provides high-performance image analysis and computer vision techniques to identify faces, detect features, and classify scenes in images and video.

The next step is to create an ML request (a VNCoreMLRequest, which is provided by the Vision framework). Here you can supply a completion handler that is triggered when there are detections:

https://gist.github.com/suyashgupta25/7356660863f973c2c7c9bf753d0c515b

Now we need to feed the ML request with input from ARKit’s scene, i.e. its camera view. We need to capture ARKit’s camera view in the form of a CVPixelBuffer. A CVPixelBuffer is an image buffer that holds pixels in memory, and it can be captured using the API below:

https://gist.github.com/suyashgupta25/e13c0ce48dccd45f829168ffe6475cb3

We need to call this API repeatedly at a fixed interval to get new images, and the ML request then evaluates each one for object detection. Remember that the CVPixelBuffer is an in-memory image buffer, so don’t store it globally in an array. As soon as the image in the buffer has been processed for object detection, it is released from memory automatically by iOS. Also, choose the interval for calling the above API wisely (>= 1000 ms).

Finally, to execute the request, create a request handler that takes the pixel buffer as a parameter. It calls the completion handler when an object is detected:

https://gist.github.com/suyashgupta25/9d05f8738e43659f1fffe33d7d90a348

Conclusion

Object detection using Turi Create is amazing. We used 30 images in the dataset, and when we tested the model there were cases where detection took some time (> 3 seconds). In 70% of the cases, however, detection was fast (< 3 seconds). There are a few ways to increase the accuracy:

We used a dataset of 30 images, which is the minimum sample size according to the Turi Create documentation. For greater accuracy, the documentation suggests providing up to 200 sample images.
We capture the ARKit camera view at an interval of 1 second, which is a reasonable interval for capturing an image buffer. Reducing this interval lets the model evaluate more frames, which should improve the chance of a prompt detection at the cost of more processing.

I used ARKit’s scene view (camera view) to show the image and text (the details of the room) right after detection. You can check out the project for more details here and see the demo of the app below:

Happy Coding!

Lastly, if you are looking to implement AI, you may refer to the AI implementation blog for a detailed list of factors to consider before committing to such an investment.