Running a PyTorch machine learning model on an ESP32
I wanted to figure out how to run a PyTorch machine learning model on an ESP32. I use MicroPython pretty often, so it would be even better if I could do it natively in MicroPython.
I didn't really care what the model did, so I chose handwriting detection as an arbitrary test case. There is a PCB with a bunch of pads that I use as touch sensors; I draw letters on them with my finger and try to get the model to guess which letter I drew.
Couple things I find cool about this:
The model was trained with PyTorch on my laptop and inference runs on an ESP32
The way this is implemented, it would be quite easy to completely change the model architecture and reconfigure it for other tasks like audio recognition or object detection
The ML related custom code-base is extremely small (~120 lines of MicroPython code and ~235 lines of C optimizations)
Fast enough for real-time use (inference takes 31ms)
Has 95% validation set accuracy (as a reference point - a simpler technique like measuring distance from the mean of each category got only 31% accuracy)
Hardware
ESP32 connected to a display
Pads on a PCB that act as touch sensors
Only 13 of the 16 pads are connected due to MCU peripheral limitations - the MCU only has 13 capacitive touch input pins
USB for power
Data
To collect training data I used my finger to draw out specific shapes for each letter and stored that data onto the ESP
Each "gesture" has touch sensing data from each of the pads collected at 50ms intervals for 1.5s
The data array is 13 sensors x 30 datapoints, each datapoint being 50ms apart.
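If you're curious what that sampling loop can look like, here's a minimal MicroPython sketch - the GPIO numbers are placeholders, not the pins on my board:

from machine import Pin, TouchPad
import time

### Placeholder GPIOs - use whichever touch-capable pins your board exposes
TOUCH_GPIOS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
pads = [TouchPad(Pin(g)) for g in TOUCH_GPIOS]

def record_gesture(n_samples=30, interval_ms=50):
    ### Returns a 13 x 30 matrix: one row per pad, one column per 50ms sample
    data = [[0] * n_samples for _ in pads]
    for t in range(n_samples):
        for i, pad in enumerate(pads):
            data[i][t] = pad.read()
        time.sleep_ms(interval_ms)
    return data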
Since this is a 2D array of floats, the techniques that apply to image classification models will also apply here.
To see why, let's visualize the data. Below is what one sample of each 1.5s "recording" of the input data looks like when visualized for each letter. The more green something is, the higher the capacitance i.e. the stronger the touch detection signal (larger contact area will lead to larger signal amplitude)
Data is stored on the ESP32 as a CSV file and then transferred to my desktop for training
I collected 120 samples of each letter, which takes about 3 minutes of writing the same letter onto the pad over and over (Note: Since I had to manually record the data I have only implemented detection for the letters A, B, C, X, Y, and Z. Doing all 26 letters would be around 1.5 hours of just writing letters onto the touchpad - not necessary for a proof of concept on a for-fun blog post)
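Appending a labelled sample to that CSV can be as simple as the sketch below (the filename and row layout are illustrative, not necessarily what I used):

def save_sample(data, label, fname="gestures.csv"):
    ### Flatten the 13 x 30 matrix into one row: label first, then the 390 readings
    flat = [str(v) for row in data for v in row]
    with open(fname, "a") as f:
        f.write(label + "," + ",".join(flat) + "\n")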
Training - PyTorch / Laptop
Here is the model architecture I landed on after trying a couple.
model = nn.Sequential(
    ### 13 x 30 input
    ### 2 stride, 3x3 kernel, 3 output channel convolution layer
    ### = 3 channels x 6 rows x 14 columns output
    nn.Conv2d(1, 3, (3,3), 2),
    ### Activation layer
    nn.ReLU(),
    ### Dropout to help generalize
    nn.Dropout(0.1),
    ### 2 stride, 3x3 kernel, 12 output channel convolution layer
    ### = 12 channels x 2 rows x 6 columns output
    nn.Conv2d(3, 12, (3,3), 2),
    ### Activation layer
    nn.ReLU(),
    ### Dropout to help generalize
    nn.Dropout(0.1),
    ### 1 stride, 2x3 kernel, 12 output channel convolution layer
    ### = 12 channels x 1 row x 4 columns output
    nn.Conv2d(12, 12, (2,3), 1),
    ### Activation layer
    nn.ReLU(),
    ### Dropout to help generalize
    nn.Dropout(0.1),
    ### Reshape data from 12 x 1 x 4 tensor -> 48 long vector
    nn.Flatten(),
    ### Linear layer with 48 inputs and 7 outputs
    ### One output each for:
    ### no gesture detected / A / B / C / X / Y / Z
    nn.Linear(48,7)
)
This model has 1585 parameters in total and requires 11,287 floating point operations.
I wouldn't be surprised if this architecture can be optimized further to have fewer parameters (reduce storage requirements) and reduce the number of floating point operations (run faster by reducing the amount of computation required)
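If you want to verify the parameter count yourself, PyTorch makes it a one-liner (counting floating point operations takes a bit more bookkeeping, so only parameters are shown here):

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  ### 1585 for the architecture above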
It got 95% validation set accuracy, but in real-life testing there's still plenty of room for improvement: it occasionally confuses one letter for another or isn't confident enough about the letter it predicts.
Collecting more training data with more varied ways of writing the letters would probably improve this. eg. B currently needs to be written a specific way because that's how I happened to write it while recording training data.
This model took 13 minutes to train on my laptop with an RTX4060 to get to 95% accuracy (I could get to around 90% accuracy by training for 2.5 minutes)
I won't get too into the details of how the model was trained - that's not really the focus of this article. The key thing to note is that there is a model that takes a 13 x 30 matrix of touch sensor data as input and can predict which of the 6 letters it is with 95% accuracy (in theory - in practice it is worse).
The default way to store the weights for a PyTorch model is a .pt file, but since we want to write our own weights loader in MicroPython it will be a lot easier to save the weights as a JSON file. Luckily this is pretty easy and can be done with:
with open(f'{fname}.json', 'w') as json_file:
    json.dump(model.state_dict(), json_file, cls=EncodeTensor)
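EncodeTensor isn't a PyTorch built-in - it's a small custom json.JSONEncoder subclass that turns tensors into nested lists so json can serialize them. A minimal sketch of one looks like this:

import json
from torch import Tensor

class EncodeTensor(json.JSONEncoder):
    def default(self, obj):
        ### Convert tensors to plain nested lists so the default encoder can handle them
        if isinstance(obj, Tensor):
            return obj.detach().cpu().tolist()
        return super().default(obj)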
We will look at how to load these into our MicroPython model later but first - we need to actually make the MicroPython model.
Inference - MicroPython / ESP32
To recreate the model for inference on the ESP32 here's what we need to do:
1. Recreate the different model layers (nn.Conv2d, nn.Linear, nn.ReLU, nn.Flatten)
2. Write a weights loader that can read the JSON weights file and load the weights into the model layers from step 1
3. Add performance optimizations to speed up inference (optional, of course)
4. Validate results against running the model with the same data on desktop to ensure the same output
Creating different blocks
I really just needed 5 types of layers - nn.Conv2d, nn.Linear, nn.ReLU, nn.Flatten, and a dummy layer that does nothing to simulate the nn.Dropout layers, which don't do anything during inference and only make a difference during training.
I first wrote all the code in pure MicroPython so that I could validate my logic and find out which bits of code were slow and needed to be sped up using C.
I'm going to show how the most complicated layer (the convolution layer) works below since I want to show that even that isn't really a lot of code and is actually quite easy to implement. At the end of the article I'll include all the layers as an appendix so anyone that wants to can have a look.
For those who don't care to read the code (I get it!) here's a quick walkthrough showing you what's being done. Note that I'm going to assume you already know what a convolution is - if not, check this out (link). Alternatively for the purposes of this article you can think of it like this: a convolution takes in a 2D array as an input and applies a function to it and gives back another 2D array as an output. The function can be something like edge detection and the output can be a 2D array where all edges are white and any pixel that isn't an edge is black.
Back to the layer implementation - an actual convolution layer can have multiple input channels, but I've shown a simplified case with one input channel and multiple output channels below to keep things easy to understand. With one input channel, what a convolution layer does is apply n_out_channel kernels (functions) to the input "image". As an example - if I had a convolution layer that had 3 output channels and the channels detected left edges, top edges, and outlines, then this is what that would look like.
The first layer of our model also has one input and 3 outputs, but the kernels seem more complicated than just edge detection. Here is what the first convolution layer of the model I trained actually does. I could take a guess as to what it might be doing but honestly I'm not confident, so let's leave that out for now. There are techniques to figure out what the different layers are doing but that's probably a whole blog post in and of itself.
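(If you want to peek at the learned kernels yourself, a quick matplotlib sketch like this is enough - it pulls the first Conv2d's weights out of the trained Sequential and draws each 3x3 kernel as a small heatmap.)

import matplotlib.pyplot as plt

### model[0] is the first Conv2d in the nn.Sequential; its weights have shape (3, 1, 3, 3)
kernels = model[0].weight.detach().cpu().numpy()
fig, axes = plt.subplots(1, kernels.shape[0])
for i, ax in enumerate(axes):
    ax.imshow(kernels[i, 0], cmap="viridis")
    ax.set_title(f"kernel {i}")
    ax.axis("off")
plt.show()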
In code, the nn.Conv2d equivalent is:
class convLayer(Module):
    def __init__(self, in_channels, out_channels, kernel_size: tuple, stride):
        self.kernels = [[None for _ in range(in_channels)] for _2 in range(out_channels)]
        self.bias = [None for _ in range(out_channels)]
        self.stride = stride
        self.kernel_size = kernel_size

    def __call__(self, images):
        conv_func = convolution2D
        output_data = []
        for out_channel in range(len(self.kernels)):
            channel_output = []
            for in_channel, image in enumerate(images):
                channel_output.append(conv_func(image, self.kernels[out_channel][in_channel], self.stride))
            ### Matrix addition operation
            ### sum up all the different outputs for each channel
            summed_channel_output = sumMatrices(channel_output)
            ### Add the bias for each channel to each element
            ### in the output matrix for that channel
            biased_channel_output = addBias(summed_channel_output, self.bias[out_channel])
            output_data.append(biased_channel_output)
        return output_data

def convolution2D(image, kernel, stride):
    ### Convolve single 2D matrix with kernel using stride
    ### returns a single 2D matrix
    ### Get dimensions of the input matrix (image) and the kernel
    image_height, image_width = int(len(image)), int(len(image[0]))
    kernel_height, kernel_width = int(len(kernel)), int(len(kernel[0]))
    ### Calculate dimensions of the output image
    output_height = (image_height - kernel_height) // stride + 1
    output_width = (image_width - kernel_width) // stride + 1
    ### Initialize output image with zeros
    output = [[0]*output_width for _ in range(output_height)]
    ### Iterate over the image
    for i in range(0, image_height, stride):
        ### optimization
        output_y = i//stride
        for j in range(0, image_width, stride):
            ### optimization
            output_x = j//stride
            ### Check if the kernel can fit in the image
            if i + kernel_height <= image_height and j + kernel_width <= image_width:
                ### Apply the kernel to the image
                for m in range(kernel_height):
                    i_plus_m = i+m
                    for n in range(kernel_width):
                        output[output_y][output_x] += image[i_plus_m][j+n] * kernel[m][n]
    return output
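To make the shapes concrete, here's a tiny hand-checkable example of convolution2D with a 3x3 input, a 2x2 kernel, and stride 1 (numbers chosen arbitrarily):

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  ### adds each element to its lower-right neighbour
print(convolution2D(image, kernel, 1))
### [[6, 8], [12, 14]]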
Similarly I've implemented the other layers from my PyTorch model and you can have a look at the code in the appendix. To put all of them together I have a Model class. Let's look at how it works first then we can look at how it's implemented.
This is how you'd use the Model class.
layers_pred = [convLayer(1, 3, (3,3), 2),
               ### 3 x 6 x 14
               ReLU(),
               dummyLayer(),
               convLayer(3, 12, (3,3), 2),
               ### 12 x 2 x 6
               ReLU(),
               dummyLayer(),
               convLayer(12, 12, (2,3), 1),
               ### 12 x 1 x 4
               ReLU(),
               dummyLayer(),
               Flatten(),
               linearLayer(48,7)]

model_name = "model_weights.json"
### The two numbers are the mean and standard deviation of the data
### and they are used to normalize data before feeding it into the model
pred_model = Model(layers_pred, 35887.27, 9239.95).load_weights(model_name)
And this is how it's implemented - pretty simple.
class Model:
    def __init__(self, layers, inp_norm_mean=0, inp_norm_sd=1):
        self.layers = layers
        ### Subtract the mean, divide by SD
        self.inp_norm = lambda inp: normInput(inp, 1/inp_norm_sd, -inp_norm_mean)

    def __call__(self, inp):
        return self.forward(inp)

    def __getitem__(self, i):
        return self.layers[i]

    def forward(self, inp):
        out = self.inp_norm(inp)
        for l in self.layers:
            out = l(out)
        return out
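Once the weights are loaded (next section), using it is just a forward pass plus an argmax over the 7 outputs. A quick sketch - the index-to-letter mapping here is an assumption, use whatever order your training labels had:

LABELS = ["none", "A", "B", "C", "X", "Y", "Z"]  ### assumed label order

def predict_letter(model, sample):
    ### sample is the 13 x 30 matrix, wrapped in a list because there's one input channel
    logits = model([sample])
    best = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[best]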
And that's it - we have a working model. We'll need to figure out how to load weights into this next.
Load weights
PyTorch stores the model weights as a dictionary. Here's an example:
model_filename = "model_conv_medium_2_300_epochs_bs_5.pt"
state_dict = torch.load(model_filename)
state_dict.keys()
---
output:
---
odict_keys(['0.weight', '0.bias', '2.weight', '2.bias', '4.weight', '4.bias', '7.weight', '7.bias', '9.weight', '9.bias', '11.weight', '11.bias', '13.weight', '13.bias'])
---
The weights for the model are stored in a dictionary where the keys are {layer_number}.weight or {layer_number}.bias. We save the weights as a JSON file and then write a simple weights loader.
Here's how we can extend the Model class to load the weights from a JSON file:
Load the JSON file into a dictionary and go through the layers one at a time. If weights for that layer exist, pass the weights to that layer's load_state_dict function.
class Model:
    *** skip earlier code ***
    def load_weights(self, filename):
        with open(filename) as f:
            parameters = json.load(f)
        keys = [entry for entry in sorted(parameters.keys())]
        for i, layer in enumerate(self.layers):
            weights = None
            bias = None
            if f"{i}.weight" in keys:
                weights = parameters[f"{i}.weight"]
            if f"{i}.bias" in keys:
                bias = parameters[f"{i}.bias"]
            self.layers[i].load_state_dict(weights, bias)
        return self
Then we need to implement load_state_dict for each layer, where we take in the weights and biases and store them for use during inference.
This is how it works for a linear layer:
class linearLayer(Module):
    *** skip earlier code ***
    def load_state_dict(self, weights, bias):
        self.weights = weights
        self.bias = bias
And this is how it works for a convolution layer:
class convLayer(Module):
    *** skip earlier code ***
    def load_state_dict(self, weights, bias):
        for i, in_channels in enumerate(weights):
            for j, kernels in enumerate(in_channels):
                kernel_data = []
                for row in kernels:
                    kernel_data.append(row)
                self.kernels[i][j] = kernel_data
        for i, b in enumerate(bias):
            self.bias[i] = b
Validate ESP32 output against output from PyTorch model
It's pretty important to make sure that the model has the same outputs on an ESP32 and on PyTorch.
I wrote test scripts with dummy data for each type of layer. Example test for linear layer:
from model import linearLayer
import torch
from torch import nn

def test_linear_layer(weights, biases, input_data):
    ### Create an instance of the custom linearLayer class
    custom_layer = linearLayer(weights, biases)
    custom_output = custom_layer.forward(input_data)
    print("Custom Layer Output: ", custom_output)

    ### Create an instance of PyTorch's nn.Linear class
    torch_layer = nn.Linear(len(input_data), len(biases))
    with torch.no_grad():
        torch_layer.weight = nn.Parameter(torch.tensor(weights).double())
        torch_layer.bias = nn.Parameter(torch.tensor(biases).double())
        torch_output = torch_layer(torch.tensor(input_data).double())
    print("PyTorch Layer Output: ", torch_output.tolist())

### Test case 5
weights = [[0.5, 0.6, 0.7], [0.8, 0.9, 1.0], [1.1, 1.2, 1.3]]
biases = [0.9, 1.0, 1.1]
input_data = [1.5, 1.6, 1.7]
test_linear_layer(weights, biases, input_data)
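Since the two outputs will never match bit-for-bit (more on that below), comparing them with a small tolerance is easier than eyeballing the printed values - a quick sketch:

def outputs_match(a, b, tol=1e-4):
    ### a: flat list from the MicroPython layer, b: flat list from the PyTorch layer
    return all(abs(x - y) < tol for x, y in zip(a, b))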
The really nice thing about using MicroPython is that I can import code that is supposed to run on my ESP32 into regular old desktop Python and execute it to validate it. That's what the from model import linearLayer
line is doing - importing MicroPython code into desktop Python since it's also perfectly valid desktop Python code. The ability to develop like this makes MicroPython extremely powerful.
Similarly, I picked one datapoint from my dataset and ran it through each model I was testing to make sure the output of the model on MicroPython matches the output of the original PyTorch model for that datapoint.
from model import *
import random
from single_datapoint_for_testing import *
import time
import sys

if sys.implementation.name != "micropython":
    ### I run this on both desktop python and micropython
    ### but desktop python doesn't have ticks_ms so I use time.time instead
    time.ticks_ms = time.time

def testModel(model):
    start_time = time.ticks_ms()
    out = model(single_data_point)
    end_time = time.ticks_ms()
    print(end_time - start_time)
    print(out)

testModel(pred_model)
These tests were especially helpful later when I was writing C optimisations - I just had to rerun the tests to make sure I didn't break anything.
Output from benchmark_model (not the same as our final inference model) used during development that was run on PyTorch:
tensor([-309.3656, 66.2828, -38.7419, -72.0498], grad_fn=<AddBackward0>)
Output from benchmark_model used during development that was run on ESP32:
[-309.3655, 66.28277, -38.74186, -72.04981]
For anyone wondering why the outputs don't match exactly, eg. -309.3656 vs -309.3655 - this is because of small errors that are inherent to floating point calculations, partially due to the limitations of floating point numbers themselves and partially due to GPUs preferring speed over absolute accuracy, so they make optimisations that return "close enough" values faster.
Here's an example where desktop Python gets a floating point calculation slightly wrong:
>>> 0.7+0.6
1.2999999999999998
Correct answer is:
1.3
So the answer is off by 0.0000000000000002.
But interestingly enough if you do the same calculation but break it up, you get the "correct" answer:
>>> 0.2+0.2+0.2+0.7
1.3
# Even this isn't entirely correct - the actual value in floating point notation is 5854679515581645 / 4503599627370496 which translates to around 1.300000000000000044408920985 - but Python truncates it to 1.3 for display purposes
This is because in floating point arithmetic the order of operations matters: each number is just an approximate representation of the true value, and as you stack more calculations up the errors can grow.
Does it matter? These types of errors can cause issues during training but not so much during inference, so for now we can safely ignore them.
Performance optimization
Inference on the benchmark model I was initially using for testing took 194ms. Is that good enough? Maybe. It depends on the use case. But being faster would make this a lot more useful across other embedded use cases, and honestly optimization is just fun. A perk of this being on my blog rather than in a product is that I can do it just because I want to.
For performance optimization you definitely want to start with a target. Mine was to have inference complete in less than 50ms, because then you could run inference in real time if you wanted, since a new datapoint is collected every 50ms.
There are three approaches to speeding up inference:
1. Rewrite specific function calls in C. eg. the Convolution2D function that convolves a single 2D matrix is a great candidate to be rewritten in C because it has a bunch of loops, and rewriting it in C still lets us retain all of the flexibility of Python - we can keep restructuring the model while calling the function as if it were any other Python function. There are a few other functions like this that I'll talk about later in the article, along with the exact impact they had on execution time. The rewritten C functions were compiled as a native module called "mlops", which is a collection of operations useful for machine learning applications. A native module is essentially just a library, but compiled into a binary rather than Python code. I upload the "mlops.mpy" library onto the ESP32 like any other MicroPython file and can use it like any other module with import mlops.
2. Do inference entirely in C and just call the model from MicroPython, eg. c_model.predict(data). This would definitely be the fastest but least flexible - I'd have to rewrite and recompile the C module each time I change the model architecture.
3. Apply application-specific optimizations. eg. if I were to predict gestures each time a datapoint is collected (every 50ms), only the last entry of the 1.5s long recording would change, so I'd just need to update the calculations that depend on the last row.
For my use case (a proof of concept on a for-fun blog) I felt the first type of optimization was good enough. I can just predict once every 1.5s and it still works fine.
So now we need to rewrite specific Python functions in C. Let's look at rewriting the ReLU function in C. As a quick refresher, ReLU is a function that replaces all negative elements of a matrix with 0 and leaves the positive elements untouched.
The original Python version is:
class reluConv(Module):
    ### In case anyone is wondering why I've wrapped
    ### forward in __call__ - it's because when I am
    ### profiling I only want to know the execution time
    ### for the full relu call, not for each layer of the recursive call
    def __call__(self, output):
        return self.forward(output)

    def forward(self, output):
        if isinstance(output, list):
            return [self.forward(i) for i in output]
        else:
            return max(0, output)
The C optimized forward function is below.
STATIC mp_obj_t relu(mp_obj_t input) {
    // Check if input is a list
    if (mp_obj_is_type(input, &mp_type_list)) {
        // Create a new list to hold relu-ed values
        mp_obj_t outputs = mp_obj_new_list(0, NULL);
        // Convert input list to pointer and get length
        mp_obj_list_t *input_c = MP_OBJ_TO_PTR(input);
        unsigned int list_len = input_c->len;
        for (int i = 0; i < list_len; i++) {
            // relu each item in input list
            mp_obj_t output = relu(input_c->items[i]);
            // append to new list
            mp_obj_list_append(outputs, output);
        }
        return outputs;
    } else {
        // relu the scalar input
        mp_float_t output = mp_obj_get_float(input);
        if (output < 0) {
            output = 0;
        }
        return mp_obj_new_float(output);
    }
}
And after we compile that and load the new mlops.mpy library, we rewrite the Python class as shown below. This is why I love this approach - it's a really happy medium between the development speed of Python and the runtime execution speed of C.
class reluConv(Module):
    def __call__(self, output):
        return self.forward(output)

    def forward(self, output):
        return mlops.relu(output)
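One wrinkle: mlops.mpy only exists on the ESP32, but as mentioned earlier the same code also gets run on desktop Python for testing, where the module isn't available. A guarded import keeps both working - a sketch, where mlops_pure_python is a hypothetical module holding the pure Python versions of the same functions:

try:
    import mlops
except ImportError:
    ### On desktop Python, fall back to the pure Python implementations
    import mlops_pure_python as mlops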
A note on the measurements: everything below was measured at a 160MHz clock speed because that's what I started off with, and I figured I'd treat the speed increase from switching to 240MHz as "margin of error" in case things went wrong. I'm glad I did, since I later had to increase the model size, which made inference take longer, but I was still under my 50ms target.
Here are the optimizations and then a chart showing how much time each optimization saved.
194ms was the original inference time
138ms after using the native code emitter in micropython (out of which 56ms/40% of the time is spent in the Convolution2D function)
129ms after specific small performance optimizations (eg. cache values in loops, flatten in C)
74ms after Convolution2D implemented in C
66ms after ReLU forward call implemented in C
49ms after linearLayer matrix multiplication implemented in C
32ms after implementing a matrix summation function in C
21ms after changing the input normalization function to C
These optimizations brought inference time down from 194ms to 23ms @ 160MHz. Inference time was 15ms @ 240MHz.
After I switched to a bigger model inference took 31ms @ 240MHz. If you call this once every 1.5s that's just using 2% of one of the two cores on the ESP32. Plenty of cycles left to do lots of other stuff.
Side Notes
Here are a few small asides.
How many samples are needed?
I looked into whether I really needed 120 samples for each gesture and it does seem like it's roughly correct. Here is model accuracy vs number of samples. These were trained for 2 minutes (50 epochs with one cycle LR scheduler) rather than 13 minutes.
It looks like 75 samples may be enough too, but I went with more just to be safe - 96 training samples and 24 validation samples (120 total).
Note: I don't think the slight drop from 75 to 100 is significant; with more runs to average over it would likely cancel out. From a decision-making perspective, 75 or 100 are probably both fine, and 100 is safer.
Using coins while I was waiting for the PCB to arrive
After I had the idea to try this, and before the touch pad PCBs I'd sent off to get made had arrived, I wanted to carry on with the firmware and software development. So I soldered a bunch of coins and put them in a 3D printed casing to start working on the software while I waited for the hardware. It was too big and didn't feel good to "write" on, but it was definitely fun to use coins as touch sensors!
Confidence Threshold
On the ESP32 after the model makes a prediction I only allow it to update the display if it is at least 75% confident that the answer is correct. This does two things:
Prevents it from interpreting ambiguous signals, like my palm resting on the touchpad, as a letter
If a letter is drawn in a way that confuses the model, it prevents the model from forcing a guess - I found it was often wrong with these types of guesses. It has to be quite confident about a prediction to display it on the screen.
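A note on how that confidence number is computed: the last linear layer outputs raw scores rather than probabilities, so getting a "75% confident" value means running the 7 outputs through a softmax first. A minimal sketch:

import math

def confidence(logits):
    ### Softmax over the raw outputs; returns the best class index and its probability
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]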
Libraries
I tried to look at other libraries like ndarray or upytorch but none of them could do what I needed yet:
uPyTorch doesn't have inference yet
ndarray actually made things slower - pure MicroPython code with the native emitter that took 138ms without ndarray took 179ms with it.
Appendix - All the other layers
Earlier in the article I mentioned I would show the code for all the other layers - so I've put the code down here along with their C equivalents.
Linear Layer
class linearLayer(Module):
    def __init__(self, weights, bias):
        # Initialize weights and bias with given values
        self.weights = weights
        self.bias = bias

    def __call__(self, inp):
        return self.forward(inp)

    def forward(self, input_data):
        intermediate_output = [0 for _ in range(len(self.weights))]
        for i in range(len(self.weights)):
            for j in range(len(self.weights[0])):
                intermediate_output[i] += self.weights[i][j] * input_data[j]
        # Add bias
        output_data = [intermediate_output[i] + self.bias[i] for i in range(len(self.bias))]
        return output_data
Linear Layer with C Optimization
class linearLayer(Module):
    def __init__(self, weights, bias):
        # Initialize weights and bias with given values
        self.weights = weights
        self.bias = bias

    def __call__(self, inp):
        return self.forward(inp)

    @micropython.native
    def forward(self, input_data):
        return mlops.linear(input_data, self.weights, self.bias)

    def load_state_dict(self, weights, bias):
        self.weights = weights
        self.bias = bias
STATIC mp_obj_t linear(mp_obj_t input_data, mp_obj_t weights, mp_obj_t bias) {
    // Convert from mpy lists to pointers
    mp_obj_list_t *input_data_c = MP_OBJ_TO_PTR(input_data);
    mp_obj_list_t *weights_c = MP_OBJ_TO_PTR(weights);
    mp_obj_list_t *weights_row = MP_OBJ_TO_PTR(weights_c->items[0]);
    mp_obj_list_t *bias_c = MP_OBJ_TO_PTR(bias);
    const int weights_height = weights_c->len; // 24 - comments below use 48 input features / 24 output features as an example
    const int weights_width = weights_row->len; // 48
    // Convert data to float
    float input_data_f[weights_width]; // 48
    for(int i = 0; i < weights_width; i++){
        input_data_f[i] = mp_obj_get_float(input_data_c->items[i]);
    }
    float weights_f[weights_height][weights_width]; // weights_f[24][48]
    mp_obj_list_t *weights_vector;
    for(int i = 0; i < weights_height; i++){ // 24
        weights_vector = MP_OBJ_TO_PTR(weights_c->items[i]);
        for(int j = 0; j < weights_width; j++){ // 48
            weights_f[i][j] = mp_obj_get_float(weights_vector->items[j]);
        }
    }
    // Create array of zeroes
    float output[weights_height];
    for(int i = 0; i < weights_height; i++) { // 24
        output[i] = mp_obj_get_float(mp_obj_new_float(0.0)); // weird workaround for linker bug
    }
    // Calculate result
    for(int i = 0; i < weights_height; i++) { // 0 to 24
        for(int j = 0; j < weights_width; j++) { // 0 to 48
            output[i] += weights_f[i][j] * input_data_f[j];
        }
    }
    // Add bias
    for(int i = 0; i < weights_height; i++) {
        output[i] += mp_obj_get_float(bias_c->items[i]);
    }
    // Convert to mpy
    mp_obj_t output_converted_to_mp[weights_height]; // store the output floats in this
    for(int i = 0; i < weights_height; i++){
        output_converted_to_mp[i] = mp_obj_new_float(output[i]); // convert each float to an mp float object
    }
    return mp_obj_new_list(weights_height, output_converted_to_mp);
}
ReLU Layer
class reluConv(Module):
    ### In case anyone is wondering why I've wrapped
    ### forward in __call__ - it's because when I am
    ### profiling I only want to know the execution time
    ### for the full relu call, not for each layer of the recursive call
    def __call__(self, output):
        return self.forward(output)

    def forward(self, output):
        if isinstance(output, list):
            return [self.forward(i) for i in output]
        else:
            return max(0, output)
ReLU Layer with C Optimization
class reluConv(Module):
    def __call__(self, output):
        return self.forward(output)

    def forward(self, output):
        return mlops.relu(output)
STATIC mp_obj_t relu(mp_obj_t input) {
    // Check if input is a list
    if (mp_obj_is_type(input, &mp_type_list)) {
        // Create a new list to hold relu-ed values
        mp_obj_t outputs = mp_obj_new_list(0, NULL);
        // Convert input list to pointer and get length
        mp_obj_list_t *input_c = MP_OBJ_TO_PTR(input);
        unsigned int list_len = input_c->len;
        for (int i = 0; i < list_len; i++) {
            // relu each item in input list
            mp_obj_t output = relu(input_c->items[i]);
            // append to new list
            mp_obj_list_append(outputs, output);
        }
        return outputs;
    } else {
        // relu the scalar input
        mp_float_t output = mp_obj_get_float(input);
        if (output < 0) {
            output = 0;
        }
        return mp_obj_new_float(output);
    }
}
Dummy Layer
class Module:
    def load_state_dict(self, weights, bias):
        pass

class dummyLayer(Module):
    def __call__(self, output):
        return output
Flatten Layer
class Flatten(Module):
    def __call__(self, inp):
        return self.forward(inp)

    @micropython.native
    def forward(self, inp):
        if not isinstance(inp, (list, tuple)):
            return [inp]
        result = []
        for i in inp:
            if isinstance(i, (list, tuple)):
                result.extend(self.forward(i))
            else:
                result.append(i)
        return result
Flatten Layer with C Optimization
class Flatten(Module):
    def __call__(self, inp):
        return self.forward(inp)

    @micropython.native
    def forward(self, inp):
        return mlops.flatten(inp)
STATIC mp_obj_t flatten(mp_obj_t input) {
    // Check if input is a list or tuple
    if (mp_obj_is_type(input, &mp_type_list) || mp_obj_is_type(input, &mp_type_tuple)) {
        // Create a new list to hold flattened values
        mp_obj_t outputs = mp_obj_new_list(0, NULL);
        // Convert input list to pointer and get length
        mp_obj_list_t *input_c = MP_OBJ_TO_PTR(input);
        unsigned int list_len = input_c->len;
        for (int i = 0; i < list_len; i++) {
            // Check if item is a list or tuple
            if (mp_obj_is_type(input_c->items[i], &mp_type_list) || mp_obj_is_type(input_c->items[i], &mp_type_tuple)) {
                // Flatten the item and extend the output list
                mp_obj_t flattened = flatten(input_c->items[i]);
                mp_obj_list_t *flattened_c = MP_OBJ_TO_PTR(flattened);
                for (int j = 0; j < flattened_c->len; j++) {
                    mp_obj_list_append(outputs, flattened_c->items[j]);
                }
            } else {
                // Append the item to the output list
                mp_obj_list_append(outputs, input_c->items[i]);
            }
        }
        return outputs;
    } else {
        // Return the scalar input as a list
        return mp_obj_new_list(1, &input);
    }
}
Input Normalization
@micropython.native
def normInput(matrix, multiplier, adder):
    for i, element in enumerate(matrix):
        if isinstance(element, (list, array.array)):
            # If the current element is a list, recursively call the function on this element
            normInput(element, multiplier, adder)
        else:
            # If the current element is not a list, add the adder then multiply by the multiplier
            matrix[i] = (element + adder) * multiplier
    return matrix
Input Normalization with C Optimization
@micropython.native
def normInput(matrix, multiplier, adder):
    return mlops.norm_input(matrix, multiplier, adder)
STATIC mp_obj_t norm_input(mp_obj_t input, mp_obj_t multiplier, mp_obj_t adder) {
    // Check if input is a list
    if (mp_obj_is_type(input, &mp_type_list)) {
        // Create a new list to hold normalized values
        mp_obj_t outputs = mp_obj_new_list(0, NULL);
        // Convert input list to pointer and get length
        mp_obj_list_t *input_c = MP_OBJ_TO_PTR(input);
        unsigned int list_len = input_c->len;
        for (int i = 0; i < list_len; i++) {
            // normalize each item in input list
            mp_obj_t normalized = norm_input(input_c->items[i], multiplier, adder);
            // append to new list
            mp_obj_list_append(outputs, normalized);
        }
        return outputs;
    } else {
        // normalize the scalar input
        float input_f = mp_obj_get_float(input);
        float multiplier_f = mp_obj_get_float(multiplier);
        float adder_f = mp_obj_get_float(adder);
        return mp_obj_new_float((input_f + adder_f) * multiplier_f);
    }
}
Conv2D Layer
class convLayer(Module):
    def __init__(self, in_channels, out_channels, kernel_size: tuple, stride):
        self.kernels = [[None for _ in range(in_channels)] for _2 in range(out_channels)]
        self.bias = [None for _ in range(out_channels)]
        self.stride = stride
        self.kernel_size = kernel_size

    def __call__(self, images):
        conv_func = convolution2D
        output_data = []
        for out_channel in range(len(self.kernels)):
            channel_output = []
            for in_channel, image in enumerate(images):
                channel_output.append(conv_func(image, self.kernels[out_channel][in_channel], self.stride))
            ### Matrix addition operation
            ### sum up all the different outputs for each channel
            summed_channel_output = sumMatrices(channel_output)
            ### Add the bias for each channel to each element
            ### in the output matrix for that channel
            biased_channel_output = addBias(summed_channel_output, self.bias[out_channel])
            output_data.append(biased_channel_output)
        return output_data

def convolution2D(image, kernel, stride):
    ### Convolve single 2D matrix with kernel using stride
    ### returns a single 2D matrix
    ### Get dimensions of the input matrix (image) and the kernel
    image_height, image_width = int(len(image)), int(len(image[0]))
    kernel_height, kernel_width = int(len(kernel)), int(len(kernel[0]))
    ### Calculate dimensions of the output image
    output_height = (image_height - kernel_height) // stride + 1
    output_width = (image_width - kernel_width) // stride + 1
    ### Initialize output image with zeros
    output = [[0]*output_width for _ in range(output_height)]
    ### Iterate over the image
    for i in range(0, image_height, stride):
        ### optimization
        output_y = i//stride
        for j in range(0, image_width, stride):
            ### optimization
            output_x = j//stride
            ### Check if the kernel can fit in the image
            if i + kernel_height <= image_height and j + kernel_width <= image_width:
                ### Apply the kernel to the image
                for m in range(kernel_height):
                    i_plus_m = i+m
                    for n in range(kernel_width):
                        output[output_y][output_x] += image[i_plus_m][j+n] * kernel[m][n]
    return output
Conv2D Layer with C optimization
class convLayer(Module):
    def __init__(self, in_channels, out_channels, kernel_size: tuple, stride):
        self.kernels = [[None for _ in range(in_channels)] for _2 in range(out_channels)]  # Start with no kernels
        self.bias = [None for _ in range(out_channels)]  # Start with no biases
        self.stride = stride
        self.kernel_size = kernel_size

    @micropython.native
    def __call__(self, images):
        conv_func = mlops.conv2d
        output_data = []
        for out_channel in range(len(self.kernels)):
            channel_output = []
            for in_channel, image in enumerate(images):
                channel_output.append(conv_func(image, self.kernels[out_channel][in_channel], self.stride))
            biased_channel_output = sumMatricesAddBias(channel_output, self.bias[out_channel])
            output_data.append(biased_channel_output)
        return output_data

    def load_state_dict(self, weights, bias):
        for i, in_channels in enumerate(weights):
            for j, kernels in enumerate(in_channels):
                kernel_data = []
                for row in kernels:
                    kernel_data.append(row)
                self.kernels[i][j] = kernel_data
        for i, b in enumerate(bias):
            self.bias[i] = b

@micropython.native
def sumMatricesAddBias(matrices, bias=0.0):
    return mlops.sum_matrices(matrices, bias)
STATIC mp_obj_t conv2d(mp_obj_t image, mp_obj_t kernel, mp_obj_t stride) {
    // Convert from mpy lists to pointers
    mp_obj_list_t *image_c = MP_OBJ_TO_PTR(image);
    mp_obj_list_t *image_row = MP_OBJ_TO_PTR(image_c->items[0]);
    mp_obj_list_t *kernel_c = MP_OBJ_TO_PTR(kernel);
    mp_obj_list_t *kernel_row = MP_OBJ_TO_PTR(kernel_c->items[0]);
    // Convert from mpy object to c
    const int stride_c = mp_obj_get_int(stride);
    const size_t image_height = image_c->len;
    const size_t image_width = image_row->len;
    const size_t kernel_height = kernel_c->len;
    const size_t kernel_width = kernel_row->len;
    const int output_height = (image_height - kernel_height) / stride_c + 1;
    const int output_width = (image_width - kernel_width) / stride_c + 1;
    // Create output array of zeroes
    float output[output_height][output_width];
    for(int i = 0; i < output_width; i++) {
        for(int j = 0; j < output_height; j++){
            output[j][i] = 0.0;
        }
    }
    // Convolve
    int output_y;
    int output_x;
    int i_plus_m;
    for(int i = 0; i < image_height; i += stride_c){
        output_y = i/stride_c;
        for(int j = 0; j < image_width; j += stride_c){
            output_x = j/stride_c;
            if (i + kernel_height <= image_height && j + kernel_width <= image_width) {
                // Apply the kernel to this window of the image
                for(int m = 0; m < kernel_height; m++){
                    i_plus_m = i+m;
                    image_row = MP_OBJ_TO_PTR(image_c->items[i_plus_m]);
                    kernel_row = MP_OBJ_TO_PTR(kernel_c->items[m]);
                    for(int n = 0; n < kernel_width; n++){
                        output[output_y][output_x] += mp_obj_get_float(image_row->items[j+n]) * mp_obj_get_float(kernel_row->items[n]);
                    }
                }
            }
        }
    }
    // Convert result calculated above to mpy compatible object
    mp_obj_t temp_row[output_width]; // store rows in this
    mp_obj_t output_converted_to_mp[output_height]; // store lists of rows in this
    for(int i = 0; i < output_height; i++){
        for(int j = 0; j < output_width; j++){
            // first convert all values to float
            temp_row[j] = mp_obj_new_float(output[i][j]);
        }
        output_converted_to_mp[i] = mp_obj_new_list(output_width, temp_row); // convert row to mp list and store into array
    }
    return mp_obj_new_list(output_height, output_converted_to_mp);
}

STATIC mp_obj_t sum_matrices(mp_obj_t matrices, mp_obj_t bias) {
    // Convert from mpy lists to pointers and mpy float to c float
    float bias_f = mp_obj_get_float(bias);
    mp_obj_list_t *matrices_c = MP_OBJ_TO_PTR(matrices);
    mp_obj_list_t *first_matrix = MP_OBJ_TO_PTR(matrices_c->items[0]);
    mp_obj_list_t *first_matrix_first_row = MP_OBJ_TO_PTR(first_matrix->items[0]);
    int num_matrices = matrices_c->len;
    int matrix_height = first_matrix->len;
    int matrix_width = first_matrix_first_row->len;
    // Initialize result matrix with the bias value
    float result[matrix_height][matrix_width];
    for(int i = 0; i < matrix_height; i++){
        for (int j = 0; j < matrix_width; j++){
            result[i][j] = bias_f;
        }
    }
    mp_obj_list_t *matrix;
    mp_obj_list_t *matrix_row;
    for(int n = 0; n < num_matrices; n++){
        matrix = MP_OBJ_TO_PTR(matrices_c->items[n]);
        for(int i = 0; i < matrix_height; i++){
            matrix_row = MP_OBJ_TO_PTR(matrix->items[i]);
            for(int j = 0; j < matrix_width; j++){
                result[i][j] += mp_obj_get_float(matrix_row->items[j]);
            }
        }
    }
    // Convert result calculated above to mpy compatible object
    mp_obj_t temp_row[matrix_width]; // store rows in this
    mp_obj_t output_converted_to_mp[matrix_height]; // store lists of rows in this
    for(int i = 0; i < matrix_height; i++){
        for(int j = 0; j < matrix_width; j++){
            // first convert all values to float
            temp_row[j] = mp_obj_new_float(result[i][j]);
        }
        output_converted_to_mp[i] = mp_obj_new_list(matrix_width, temp_row); // convert row to mp list and store into array
    }
    return mp_obj_new_list(matrix_height, output_converted_to_mp);
}