# CV on Nicla Vision {.unnumbered}

As we initiate our studies into embedded machine learning, or tinyML, it's impossible to overlook the transformative impact of Computer Vision (CV) and Artificial Intelligence (AI) on our lives. These two intertwined disciplines redefine what machines can perceive and accomplish, from autonomous vehicles and robotics to healthcare and surveillance.

More and more, we are facing an artificial intelligence (AI) revolution where, as stated by Gartner, **Edge AI** has a very high impact potential, and **it is happening now**!

At the "bull's-eye" of the emerging technologies radar is *Edge Computer Vision*, and when we talk about Machine Learning (ML) applied to vision, the first thing that comes to mind is **Image Classification**, a kind of ML "Hello World"!

This exercise will explore a computer vision project utilizing Convolutional Neural Networks (CNNs) for real-time image classification. Leveraging TensorFlow's robust ecosystem, we'll implement a pre-trained MobileNet model and adapt it for edge deployment. The focus will be on optimizing the model to run efficiently on resource-constrained hardware without sacrificing accuracy.

We'll employ techniques like quantization and pruning to reduce the computational load. By the end of this tutorial, you'll have a working prototype capable of classifying images in real time, all running on a low-power embedded system based on the Arduino Nicla Vision board.

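To give a flavor of what quantization looks like in practice, here is a minimal sketch of post-training int8 quantization with the TensorFlow Lite converter. It is not the pipeline Edge Impulse runs for this project; the `trained_model` directory and the random representative images are placeholders for illustration only.

```python
import tensorflow as tf
import numpy as np

# Hypothetical path to a SavedModel exported after training.
saved_model_dir = "trained_model"

def representative_images():
    # Yield a few samples shaped like the model input (96x96 RGB, float32).
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_images
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# The resulting .tflite file stores weights and activations as 8-bit integers.
tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```
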
## Computer Vision

At its core, computer vision aims to enable machines to interpret and make decisions based on visual data from the world, essentially mimicking the capability of the human optical system. Conversely, AI is a broader field encompassing machine learning, natural language processing, and robotics, among other technologies. When you bring AI algorithms into computer vision projects, you supercharge the system's ability to understand, interpret, and react to visual stimuli.

When discussing Computer Vision projects applied to embedded devices, the most common applications that come to mind are *Image Classification* and *Object Detection*.

Both models can be implemented on tiny devices like the Arduino Nicla Vision and used in real projects. Let's start with the first one.

## Image Classification Project

The first step in any ML project is to define our goal. In this case, it is to detect and classify two specific objects present in one image. For this project, we will use two small toys: a *robot* and a small Brazilian parrot (named *Periquito*). Also, we will collect images of a *background* where those two objects are absent.

## Data Collection

Once you have defined your Machine Learning project goal, the next and most crucial step is dataset collection. You can use the Edge Impulse Studio, the OpenMV IDE we installed, or even your phone for the image capture. Here, we will use the OpenMV IDE.

**Collecting Dataset with OpenMV IDE**

First, create a folder on your computer where your data will be saved, for example, "data." Next, on the OpenMV IDE, go to Tools > Dataset Editor and select New Dataset to start the dataset collection.

The IDE will ask you to open the folder where your data will be saved; choose the "data" folder that was just created. Note that new icons will appear on the left panel.

Using the upper icon (1), enter the first class name, for example, "periquito".

Run the dataset_capture_script.py; clicking on the bottom icon (2) will start capturing images.

Repeat the same procedure for the other classes.

> *We suggest around 60 images from each category. Try to capture different angles, backgrounds, and light conditions.*

The stored images use a QVGA frame size (320x240) and the RGB565 color pixel format.

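For reference, the snippet below is a minimal sketch of what a capture loop like dataset_capture_script.py does on the OpenMV camera: configure the sensor and save one frame at a time. It is not the IDE's actual script, and the folder path and image count are illustrative assumptions.

```python
import sensor, time

sensor.reset()                       # Initialize the camera sensor
sensor.set_pixformat(sensor.RGB565)  # RGB565 color format
sensor.set_framesize(sensor.QVGA)    # 320x240 frames
sensor.skip_frames(time=2000)        # Let the camera settle

for i in range(60):                  # ~60 samples per class
    img = sensor.snapshot()
    img.save("periquito/%03d.jpg" % i)   # Hypothetical class folder
    time.sleep_ms(500)               # Time to vary angle, background, light
```
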

After capturing your dataset, close the Dataset Editor Tool on Tools > Dataset Editor.

On your computer, you will end up with a dataset that contains three classes: periquito, robot, and background.

You should return to Edge Impulse Studio and upload the dataset to your project.

## Training the model with Edge Impulse Studio

We will use the Edge Impulse Studio for training our model. Enter your account credentials at Edge Impulse and create a new project.

> *Here, you can clone a similar project: [NICLA-Vision_Image_Classification](https://studio.edgeimpulse.com/public/273858/latest).*

## Dataset

Using the EI Studio (or *Studio*), we will go through four main steps to have our model ready for use on the Nicla Vision board: Dataset, Impulse, Tests, and Deploy (on the Edge Device, in this case, the NiclaV).

Regarding the Dataset, it is essential to point out that our Original Dataset, captured with the OpenMV IDE, will be split into three parts: Training, Validation, and Test. The Test Set will be split off at the beginning and set aside, to be used only in the Test phase after training. The Validation Set will be used during training.

On the Studio, go to the Data acquisition tab, and in the UPLOAD DATA section, upload from your computer the files for the chosen categories.

Leave it to the Studio to automatically split the original dataset into training and test, and choose the label related to that specific data.

Repeat the procedure for all three classes. At the end, you should see your "raw data" in the Studio.

The Studio allows you to explore your data, showing a complete view of all the data in your project. You can clear, inspect, or change labels by clicking on individual data items. In our case, a simple project, the data seems OK.

## The Impulse Design

In this phase, we should define how to:

- Pre-process our data, which consists of resizing the individual images and determining the color depth to use (RGB or Grayscale), and

- Design a Model that will be "Transfer Learning (Images)" to fine-tune a pre-trained MobileNet V2 image classification model on our data. This method performs well even with relatively small image datasets (around 150 images in our case).

Transfer Learning with MobileNet offers a streamlined approach to model training, which is especially beneficial for resource-constrained environments and projects with limited labeled data. MobileNet, known for its lightweight architecture, is a pre-trained model that has already learned valuable features from a large dataset (ImageNet).

By leveraging these learned features, you can train a new model for your specific task with fewer data and computational resources yet achieve competitive accuracy.

This approach significantly reduces training time and computational cost, making it ideal for quick prototyping and deployment on embedded devices where efficiency is paramount.

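To make the pattern concrete, here is a minimal Keras sketch of transfer learning: a frozen MobileNetV2 backbone with a small classification head on top. It is not the exact model Edge Impulse builds; the alpha value and head sizes are illustrative (Keras ships ImageNet weights only for certain alpha values, 0.35 being the smallest, whereas the Studio offers its own 0.1 variant).

```python
import tensorflow as tf

NUM_CLASSES = 3  # periquito, robot, background

# Pre-trained backbone; its weights stay frozen while the head is trained.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), alpha=0.35, include_top=False, weights="imagenet"
)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(12, activation="relu"),   # small dense head
    tf.keras.layers.Dropout(0.15),                  # matches the Studio settings
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```
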
Go to the Impulse Design Tab and create the *impulse*, defining an image size of 96x96 and squashing them (squared form, without crop). Select the Image and Transfer Learning blocks. Save the Impulse.

### **Image Pre-Processing**

All input QVGA/RGB565 images will be converted to 27,648 features (96x96x3).

Press [Save parameters] and Generate all features.

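The feature count follows directly from the resize: a 96x96 RGB image flattens to 96 x 96 x 3 = 27,648 values. A minimal sketch of the equivalent pre-processing step (the helper name is illustrative, not the Studio's code):

```python
import tensorflow as tf

def preprocess(image):
    # Resize ("squash") to 96x96, keeping the 3 RGB channels.
    image = tf.image.resize(image, (96, 96))
    return tf.reshape(image, [-1])   # flattened feature vector

print(96 * 96 * 3)  # 27648 features per image
```
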
## Model Design

In 2017, Google introduced [MobileNetV1](https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html), a family of general-purpose computer vision neural networks designed with mobile devices in mind to support classification, detection, and more. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of various use cases. In 2018, Google launched [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381).

MobileNet V1 and MobileNet V2 both aim at mobile efficiency and embedded vision applications, but they differ in architectural complexity and performance. While both use depthwise separable convolutions to reduce the computational cost, MobileNet V2 introduces Inverted Residual Blocks and Linear Bottlenecks to enhance performance. These new features allow V2 to capture more complex features using fewer parameters, making it computationally more efficient and generally more accurate than its predecessor. Additionally, V2 employs a non-linear activation in the intermediate expansion layer but uses a linear activation for the bottleneck layer, a design choice found to better preserve important information through the network. MobileNet V2 offers a more optimized architecture for higher accuracy and efficiency and will be used in this project.

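As a quick illustration of the building block both MobileNets rely on, here is a minimal Keras sketch of a depthwise separable convolution: a per-channel (depthwise) 3x3 convolution followed by a 1x1 (pointwise) convolution. The filter count is illustrative, not the MobileNet configuration.

```python
import tensorflow as tf

def depthwise_separable_block(x, filters):
    # Depthwise 3x3: one filter per input channel (cheap spatial filtering).
    x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # Pointwise 1x1: mixes channels to produce the desired number of filters.
    x = tf.keras.layers.Conv2D(filters, kernel_size=1, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

inputs = tf.keras.Input(shape=(96, 96, 3))
outputs = depthwise_separable_block(inputs, filters=32)
tf.keras.Model(inputs, outputs).summary()
```
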

Although the base MobileNet architecture is already tiny and has low latency, a specific use case or application may often require the model to be smaller and faster. MobileNets introduce a straightforward parameter α (alpha), called the width multiplier, to construct these smaller, less computationally expensive models. The role of the width multiplier α is to thin the network uniformly at each layer.

Edge Impulse Studio has MobileNetV1 (96x96 images) and V2 (96x96 and 160x160 images) available, with several different **α** values (from 0.05 to 1.0). For example, you will get the highest accuracy with V2, 160x160 images, and α=1.0. Of course, there is a trade-off: the higher the accuracy, the more memory (around 1.3M RAM and 2.6M ROM) will be needed to run the model, implying more latency. The smallest footprint is obtained at the other extreme, with MobileNetV1 and α=0.10 (around 53.2K RAM and 101K ROM).

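A small sketch of the width multiplier in action: instantiating MobileNetV2 with different alpha values (untrained, `weights=None`) and printing the parameter counts shows how alpha thins every layer and shrinks the model. The values here are for illustration, not the Studio's memory estimates.

```python
import tensorflow as tf

for alpha in (1.0, 0.35, 0.1):
    m = tf.keras.applications.MobileNetV2(
        input_shape=(96, 96, 3), alpha=alpha, weights=None, include_top=False)
    print(f"alpha={alpha}: {m.count_params():,} parameters")
```
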

For this project, we will use **MobileNetV2 96x96 0.1**, which estimates a memory cost of 265.3 KB in RAM. This model should be OK for the Nicla Vision with 1MB of SRAM. On the Transfer Learning Tab, select this model.

Another necessary technique to use with Deep Learning is **Data Augmentation**. Data augmentation is a method that can help improve the accuracy of machine learning models by creating additional artificial data. A data augmentation system makes small, random changes to your training data during the training process (such as flipping, cropping, or rotating the images).

Under the hood, here you can see how Edge Impulse implements a data augmentation policy on your data:

```python
import math
import random
import tensorflow as tf

# INPUT_SHAPE is defined elsewhere in the generated training code,
# e.g., (96, 96, 3) for this project.

# Implements the data augmentation policy
def augment_image(image, label):
    # Flips the image randomly
    image = tf.image.random_flip_left_right(image)

    # Increase the image size, then randomly crop it down to
    # the original dimensions
    resize_factor = random.uniform(1, 1.2)
    new_height = math.floor(resize_factor * INPUT_SHAPE[0])
    new_width = math.floor(resize_factor * INPUT_SHAPE[1])
    image = tf.image.resize_with_crop_or_pad(image, new_height, new_width)
    image = tf.image.random_crop(image, size=INPUT_SHAPE)

    # Vary the brightness of the image
    image = tf.image.random_brightness(image, max_delta=0.2)

    return image, label
```
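
In the generated training pipeline, a function like this is mapped over the training split with `tf.data`, so the changes are applied on the fly at every epoch. A minimal sketch of that usage (the `train_dataset` variable is an assumption):

```python
# Apply the augmentation only to the training split, in parallel.
train_dataset = train_dataset.map(
    augment_image, num_parallel_calls=tf.data.AUTOTUNE
)
```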

Exposure to these variations during training can help prevent your model from taking shortcuts by "memorizing" superficial clues in your training data, meaning it may better reflect the deep underlying patterns in your dataset.

The final layer of our model will have 12 neurons with a 15% dropout for overfitting prevention. The training result is excellent, with 77 ms of latency, which should result in about 13 fps (frames per second) during inference (1,000 ms / 77 ms ≈ 13).

## Model Testing

Now, you should take the data set aside at the start of the project and run the trained model with it as input.

The result was, again, excellent.

## Deploying the model

At this point, we can deploy the trained model as a .tflite file and use the OpenMV IDE to run it using MicroPython, or we can deploy it as a C/C++ or an Arduino library.

**Arduino Library**

First, let's deploy it as an Arduino Library.

You should install the library as a .zip file in the Arduino IDE and run the sketch nicla_vision_camera.ino, available in Examples under your library name.

> *Note that the Arduino Nicla Vision has, by default, 512KB of RAM allocated for the M7 core and an additional 244KB on the M4 address space. In the code, this allocation was changed to 288 kB to guarantee that the model will run on the device (`malloc_addblock((void*)0x30000000, 288 * 1024);`).*

The result was good, with 86 ms of measured latency.

Here is a short video showing the inference results: [https://youtu.be/bZPZZJblU-o](https://youtu.be/bZPZZJblU-o)

**OpenMV**

It is possible to deploy the trained model to be used with OpenMV in two ways: as a library and as firmware.

When deployed as a library, three files are generated: the .tflite model, a list with the labels, and a simple MicroPython script that can make inferences using the model.

Running this model as a .tflite directly on the Nicla was impossible. So, we can either sacrifice accuracy by using a smaller model or deploy the model as OpenMV Firmware (FW). As FW, the Edge Impulse Studio generates the optimized models, libraries, and frameworks needed to make the inference. Let's explore this last option.

Select OpenMV Firmware on the Deploy Tab and press [Build].

On your computer, you will find a ZIP file. Open it.

Use the Bootloader tool on the OpenMV IDE to load the FW on your board.

Select the appropriate file (.bin for Nicla-Vision).

After the download is finished, press OK.

If a message says that the FW is outdated, DO NOT UPGRADE. Select [NO].

Now, open the script **ei_image_classification.py** that was downloaded from the Studio together with the .bin file for the Nicla.

Run it. Pointing the camera at the objects we want to classify, the inference result will be displayed on the Serial Terminal.

**Changing the Code to Add Labels**

The code provided by Edge Impulse can be modified so that, for testing purposes, we can see the inference result directly on the image displayed on the OpenMV IDE.

[Upload the code from GitHub](https://github.com/Mjrovai/Arduino_Nicla_Vision/blob/main/Micropython/nicla_image_classification.py), or modify it as below:

```python
# Marcelo Rovai - NICLA Vision - Image Classification
# Adapted from Edge Impulse - OpenMV Image Classification Example
# @24Aug23

import sensor, image, time, os, tf, uos, gc

sensor.reset()                         # Reset and initialize the sensor.
sensor.set_pixformat(sensor.RGB565)    # Set pxl fmt to RGB565 (or GRAYSCALE)
sensor.set_framesize(sensor.QVGA)      # Set frame size to QVGA (320x240)
sensor.set_windowing((240, 240))       # Set 240x240 window.
sensor.skip_frames(time=2000)          # Let the camera adjust.

net = None
labels = None

try:
    # Load built in model
    labels, net = tf.load_builtin_model('trained')
except Exception as e:
    raise Exception(e)

clock = time.clock()
while(True):
    clock.tick()  # Starts tracking elapsed time.

    img = sensor.snapshot()

    # default settings just do one detection
    for obj in net.classify(img,
                            min_scale=1.0,
                            scale_mul=0.8,
                            x_overlap=0.5,
                            y_overlap=0.5):
        fps = clock.fps()
        lat = clock.avg()

        print("**********\nPrediction:")
        img.draw_rectangle(obj.rect())
        # This combines the labels and confidence values into a list of tuples
        predictions_list = list(zip(labels, obj.output()))

        max_val = predictions_list[0][1]
        max_lbl = 'background'
        for i in range(len(predictions_list)):
            val = predictions_list[i][1]
            lbl = predictions_list[i][0]

            if val > max_val:
                max_val = val
                max_lbl = lbl

        # Print label with the highest probability
        if max_val < 0.5:
            max_lbl = 'uncertain'
        print("{} with a prob of {:.2f}".format(max_lbl, max_val))
        print("FPS: {:.2f} fps ==> latency: {:.0f} ms".format(fps, lat))

        # Draw label with highest probability to image viewer
        img.draw_string(
            10, 10,
            max_lbl + "\n{:.2f}".format(max_val),
            mono_space=False,
            scale=2
        )
```

Here you can see the result.

Note that the latency (136 ms) is almost double what we got directly with the Arduino IDE. This is because we are using the IDE as an interface, and the measurement includes the time spent waiting for the camera to be ready. If we start the clock just before the inference, as sketched below, the latency drops to only 71 ms.

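A minimal sketch of that change to the loop (only the placement of `clock.tick()` differs from the full script above):

```python
while True:
    img = sensor.snapshot()      # Camera capture happens outside the timed section

    clock.tick()                 # Start timing right before the inference
    for obj in net.classify(img,
                            min_scale=1.0,
                            scale_mul=0.8,
                            x_overlap=0.5,
                            y_overlap=0.5):
        fps = clock.fps()
        lat = clock.avg()        # Now reflects the inference time only
```
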
> *The NiclaV runs about half as fast when connected to the IDE. The FPS should increase once disconnected.*

### **Post-Processing with LEDs**

When working with embedded machine learning, we are looking for devices that can continuously run inference and act on the results directly in the physical world, rather than displaying them on a connected computer. To simulate this, we will define one LED to light up for each of the possible inference results.

For that, we should [upload the code from GitHub](https://github.com/Mjrovai/Arduino_Nicla_Vision/blob/main/Micropython/nicla_image_classification_LED.py) or change the last code to include the LEDs:

```python
# Marcelo Rovai - NICLA Vision - Image Classification with LEDs
# Adapted from Edge Impulse - OpenMV Image Classification Example
# @24Aug23

import sensor, image, time, os, tf, uos, gc, pyb

ledRed = pyb.LED(1)
ledGre = pyb.LED(2)
ledBlu = pyb.LED(3)

sensor.reset()                         # Reset and initialize the sensor.
sensor.set_pixformat(sensor.RGB565)    # Set pixl fmt to RGB565 (or GRAYSCALE)
sensor.set_framesize(sensor.QVGA)      # Set frame size to QVGA (320x240)
sensor.set_windowing((240, 240))       # Set 240x240 window.
sensor.skip_frames(time=2000)          # Let the camera adjust.

net = None
labels = None

ledRed.off()
ledGre.off()
ledBlu.off()

try:
    # Load built in model
    labels, net = tf.load_builtin_model('trained')
except Exception as e:
    raise Exception(e)

clock = time.clock()


def setLEDs(max_lbl):

    if max_lbl == 'uncertain':
        ledRed.on()
        ledGre.off()
        ledBlu.off()

    if max_lbl == 'periquito':
        ledRed.off()
        ledGre.on()
        ledBlu.off()

    if max_lbl == 'robot':
        ledRed.off()
        ledGre.off()
        ledBlu.on()

    if max_lbl == 'background':
        ledRed.off()
        ledGre.off()
        ledBlu.off()


while(True):
    img = sensor.snapshot()
    clock.tick()  # Starts tracking elapsed time.

    # default settings just do one detection.
    for obj in net.classify(img,
                            min_scale=1.0,
                            scale_mul=0.8,
                            x_overlap=0.5,
                            y_overlap=0.5):
        fps = clock.fps()
        lat = clock.avg()

        print("**********\nPrediction:")
        img.draw_rectangle(obj.rect())
        # This combines the labels and confidence values into a list of tuples
        predictions_list = list(zip(labels, obj.output()))

        max_val = predictions_list[0][1]
        max_lbl = 'background'
        for i in range(len(predictions_list)):
            val = predictions_list[i][1]
            lbl = predictions_list[i][0]

            if val > max_val:
                max_val = val
                max_lbl = lbl

        # Print label and turn on LED with the highest probability
        if max_val < 0.8:
            max_lbl = 'uncertain'

        setLEDs(max_lbl)

        print("{} with a prob of {:.2f}".format(max_lbl, max_val))
        print("FPS: {:.2f} fps ==> latency: {:.0f} ms".format(fps, lat))

        # Draw label with highest probability to image viewer
        img.draw_string(
            10, 10,
            max_lbl + "\n{:.2f}".format(max_val),
            mono_space=False,
            scale=2
        )
```

Now, each time a class gets a result above 0.8, the corresponding LED will be lit as follows:

- LED Red On: Uncertain (no class is over 0.8)

- LED Green On: Periquito > 0.8

- LED Blue On: Robot > 0.8

- All LEDs Off: Background > 0.8

Here is the result.

### **Image Classification (non-official) Benchmark**

Several development boards can be used for embedded machine learning (tinyML), and the most common ones for Computer Vision applications (with low energy) are the ESP32-CAM, the Seeed XIAO ESP32S3 Sense, and the Arduino Nicla Vision and Portenta.

Taking the opportunity, the same trained model was deployed on the ESP32-CAM, the XIAO, and the Portenta (on this last one, the model was trained again, using grayscale images to be compatible with its camera). Here is the result of deploying the models as Arduino Libraries.

## Conclusion

Before we finish, consider that Computer Vision is more than just image classification. For example, you can develop Edge Machine Learning projects around vision in several areas, such as:

- **Autonomous Vehicles**: Use sensor fusion, lidar data, and computer vision algorithms to navigate and make decisions.

- **Healthcare**: Automated diagnosis of diseases through MRI, X-ray, and CT scan image analysis.

- **Retail**: Automated checkout systems that identify products as they pass through a scanner.

- **Security and Surveillance**: Facial recognition, anomaly detection, and object tracking in real-time video feeds.

- **Augmented Reality**: Object detection and classification to overlay digital information in the real world.

- **Industrial Automation**: Visual inspection of products, predictive maintenance, and robot and drone guidance.

- **Agriculture**: Drone-based crop monitoring and automated harvesting.

- **Natural Language Processing**: Image captioning and visual question answering.

- **Gesture Recognition**: For gaming, sign language translation, and human-machine interaction.

- **Content Recommendation**: Image-based recommendation systems in e-commerce.