Making MicroPython computations run 109x faster
For those not familiar - MicroPython is an implementation of Python developed for microcontrollers. It works pretty well out of the box but I was curious how much faster we could make things run if we tried to optimize it.
There are two things that could be making code run slow:
Waiting for external input like an HTTP request or GPIO button press
Lots of computations
To speed up code in scenario #1, you can use asyncio to run code concurrently. There's lots of information about this online so there's no need for me to write about it.
Here I'm looking at scenario #2 where there's lots of computation going on. Just in case anyone wants to follow along I am using an ESP-WROOM-32 development board to run this code.
As a spoiler, I tested 3 different methods to improve performance and here are the results I got:
I also provide the complete source towards the end in case you'd like to check the source for more information.
Baseline Function
Let's set up a baseline first.
Here's a function that tests whether a number is prime or not (source):
def testPrime(num):
if num == 1:
# 1 is defined as not prime
return False
elif num > 1:
# check for factors
for i in range(2, num):
if (num % i) == 0:
# if factor is found, this is not a prime number
return False
# No factors found - number is prime
return True
Let's benchmark how fast it runs on MicroPython.
Here's the test code - we measure how many CPU clock cycles it takes for the function to run and convert it to a milliseconds estimate based on clock frequency.
def compareRunTime(num):
print("------------")
print(f"TESTING whether {num} is prime")
print("------------")
start_time = time.ticks_cpu()
output1 = testPrime(num)
end_time = time.ticks_cpu()
regular = time.ticks_diff(end_time, start_time) # How many clock cycles did it take to run the function
time_ms = regular/cpu_frequency*1000
print(f"Regular script took {time_ms} ms / {regular} clock cycles")
compareRunTime(71569) # This is a prime number
Here's the output:
------------
TESTING whether 71569 is prime
------------
Regular script took 487.2375 ms / 77958003 clock cycles
So our baseline time to check a 5 digit number is 487ms.
Native Code Emitter
Next we are going to use Python's native code emitter - documentation can be found here. All we need to do is add a decorator. What this does is converts the code to machine op-codes rather than using Python bytecode as an intermediary which then needs to be converted to machine op-codes (which is what happened when we ran our baseline example). The trade-off here is larger compiled code size which shouldn't really be an issue if you only use this in the performance-critical sections of code.
@micropython.native
def testPrimeNative(num):
# No changes to anything below this
if num == 1:
# 1 is defined as not prime
return False
elif num > 1:
# check for factors
for i in range(2, num):
if (num % i) == 0:
# if factor is found, this is not a prime number
return False
# No factors found - number is prime
return True
Making code run with the native emitter is extremely easy - lots of code should run fine without any changes and the only features that can't be used are context managers and generators (so no async code either since that runs as a generator under the hood).
Here's the performance comparison code in case you want to see it.
def compareRunTime(num):
print("------------")
print(f"TESTING whether {num} is prime")
print("------------")
start_time = time.ticks_cpu()
output1 = testPrime(num)
end_time = time.ticks_cpu()
regular = time.ticks_diff(end_time, start_time)
time_ms = regular/cpu_frequency*1000
print(f"Regular script took {time_ms} ms / {regular} clock cycles")
start_time = time.ticks_cpu()
output2 = testPrimeNative(num)
end_time = time.ticks_cpu()
native = time.ticks_diff(end_time, start_time)
time_ms = native/cpu_frequency*1000
print(f"Native script took {time_ms} ms / {native} clock cycles")
And here are the results:
------------
TESTING whether 71569 is prime
------------
Regular script took 487.2414 ms / 77958627 clock cycles
Native script took 203.7882 ms / 32606107 clock cycles
So that's over twice as fast and with essentially no changes to the code.
Viper Code Emitter
Next we can use the viper code emitter. This produces machine op-codes similar to the native emitter but it optimises things further. To do that it needs to know the variable types and you will need to follow a few other restrictions (link). You will have to write your code to work specifically with the viper emitter and it may not run on the regular desktop version of CPython.
Here is the code:
@micropython.viper
def testPrimeViper(num:int) -> bool:
if num == 1:
# 1 is defined as not prime
return False
elif num > 1:
# check for factors
for i in range(2, num):
if (num % i) == 0:
# if factor is found, this is not a prime number
return False
Here are the results:
------------
TESTING whether 71569 is prime
------------
Regular script took 487.2413 ms / 77958605 clock cycles
Native script took 203.7885 ms / 32606159 clock cycles
Viper script took 30.02811 ms / 4804497 clock cycles
That's a lot faster - we are already running 16x faster than the initial MicroPython code and we didn't really have to do too much work.
Micropython C module
There's one last thing we can do - we can write our function in C and import it into MicroPython and use that to run the calculation. The nice thing about this method is most of the firmware can still be written in MicroPython and you can use all the nice high-level Python features for most of the firmware. You only write the performance-critical sections of code in C.
Once you have a compiled C module it's pretty simple to use in MicroPython:
import test_prime_c
is_prime = test_prime_c.test_prime(71569)
Compiling it takes a bit more effort, however. You need to set up the toolchain and then write your C module. Setting up the tool-chain shouldn't take more than 2 hours. As for writing the C module - the simple one here took about 5 minutes but something more complex could take considerably longer. I'd recommend using the MicroPython examples (link) and modifying them to suit your needs. Make sure you can compile and import their example module before you try and make your own.
Here is what the C code looks like (and here is the documentation to start building your own C modules - link).
test_prime_c.c
shown below.
// Include the header file to get access to the MicroPython API
#include "py/dynruntime.h"
// Helper function to check whether a number is prime
STATIC mp_int_t prime_helper(mp_int_t x) {
if (x == 1){
// 1 is not prime
return 0;
}
for (int i = 2; i<x; i++){
if(x%i == 0){
// factor found - not a prime number
return 0;
}
}
return 1; // if none of the above conditions determined a factor, then this is a prime
}
// This is the function which will be called from Python, as test_prime(x). We pass it a micropython object (everything is an object in python - int is an object of type int)
STATIC mp_obj_t test_prime(mp_obj_t x_obj) {
// Get the C compatible integer from the MicroPython integer object
mp_int_t x = mp_obj_get_int(x_obj);
// Check whether it is prime
mp_int_t result = prime_helper(x);
// Convert the result to a MicroPython boolean object and return it
return mp_obj_new_bool(result);
}
// Define a Python reference to the function above
// so that you can call it through test_prime_c.test_prime
STATIC MP_DEFINE_CONST_FUN_OBJ_1(test_prime_obj, test_prime);
// This is the entry point and is called when the module is imported
mp_obj_t mpy_init(mp_obj_fun_bc_t *self, size_t n_args, size_t n_kw, mp_obj_t *args) {
// This must be first, it sets up the globals dict and other things
MP_DYNRUNTIME_INIT_ENTRY
// Make the function available in the module's namespace
mp_store_global(MP_QSTR_test_prime, MP_OBJ_FROM_PTR(&test_prime_obj));
// This must be last, it restores the globals dict
MP_DYNRUNTIME_INIT_EXIT
}
We compile that for our target MCU and it will generate a test_prime_c.mpy
file that we upload onto our MCU. We can then import the module into MicroPython and start using it like any other module. Here's a copy of the module I compiled for the ESP32: link.
Here is the performance benchmark. It’s also printing the results to make sure that all of them return the same result.
------------
TESTING whether 71569 is prime
------------
Regular script took 487.2414 ms / 77958627 clock cycles
Native script took 203.7882 ms / 32606107 clock cycles
Viper script took 30.02839 ms / 4804543 clock cycles
Compiled C module took 4.476794 ms / 716287 clock cycles
------------
RESULTS
------------
True True True True
------------
108.8x faster than the baseline - not bad!
Let's check with a larger number and a smaller number to see how the performance compares for larger and smaller workloads.
compareRunTime(313) # This is a prime number - should be quick to calculate
------------
TESTING whether 313 is prime
------------
Regular script took 2.238606 ms / 358177 clock cycles
Native script took 0.9453438 ms / 151255 clock cycles
Viper script took 0.1852 ms / 29632 clock cycles
Compiled C module took 0.03605625 ms / 5769 clock cycles
------------
RESULTS
------------
True True True True
------------
compareLongRunTime(4951277) # This is a prime number - should take much longer to calculate
------------
TESTING whether 4951277 is prime
------------
Regular script took 34 s
Native script took 14 s
Viper script took 2 s
Compiled C module took 0s / 305.7551 ms
------------
RESULTS
------------
True True True True
------------
With a small workload we see a 62x improvement and with a large workload we see a 112x improvement.
Conclusion
In general I would try and stick to the native emitter or viper emitter because they are decent enough improvements but still quite quick and easy to implement.
There are also some other things you can do such as caching references to global objects that would make things faster without having to move to C (link).
However if there are some specific performance critical sections that are too slow then C is worth a shot. I haven't tried accessing hardware through C just yet so that should make an interesting experiment in the future.
Here are the results of the above tests in graph form.
In case anyone needs these for reference, here are source files of the micropython test script, the compiled C module, and C module source file:
test_prime_c.c (uncompiled C code that you could compile)
test_prime_c.mpy (compiled C module that you can import onto an ESP-WROOM-32)