Making MicroPython computations run 109x faster

Mar 26, 2023

For those not familiar - MicroPython is an implementation of Python developed for microcontrollers. It works pretty well out of the box but I was curious how much faster we could make things run if we tried to optimize it.

There are two things that could be making code run slow:

Waiting for external input like an HTTP request or GPIO button press
Lots of computations

To speed up code in scenario #1, you can use asyncio to run code concurrently. There's lots of information about this online so there's no need for me to write about it.

Here I'm looking at scenario #2 where there's lots of computation going on. Just in case anyone wants to follow along I am using an ESP-WROOM-32 development board to run this code.

As a spoiler, I tested 3 different methods to improve performance and here are the results I got:

I also provide the complete source towards the end in case you'd like to check the source for more information.

Baseline Function

Let's set up a baseline first.

Here's a function that tests whether a number is prime or not (source):

def testPrime(num): 
    if num == 1: 
        # 1 is defined as not prime 
        return False 
         
    elif num > 1: 
        # check for factors 
        for i in range(2, num): 
            if (num % i) == 0: 
                # if factor is found, this is not a prime number 
                return False 
        # No factors found - number is prime 
        return True

Let's benchmark how fast it runs on MicroPython.

Here's the test code - we measure how many CPU clock cycles it takes for the function to run and convert it to a milliseconds estimate based on clock frequency.

def compareRunTime(num): 
    print("------------") 
    print(f"TESTING whether {num} is prime") 
    print("------------") 
    start_time = time.ticks_cpu() 
    output1 = testPrime(num) 
    end_time = time.ticks_cpu() 
    regular = time.ticks_diff(end_time, start_time) # How many clock cycles did it take to run the function 
    time_ms = regular/cpu_frequency*1000 
    print(f"Regular script took {time_ms} ms / {regular} clock cycles")

compareRunTime(71569) # This is a prime number

Here's the output:

------------ 
TESTING whether 71569 is prime 
------------ 
Regular script took 487.2375 ms / 77958003 clock cycles

So our baseline time to check a 5 digit number is 487ms.

Native Code Emitter

Next we are going to use Python's native code emitter - documentation can be found here. All we need to do is add a decorator. What this does is converts the code to machine op-codes rather than using Python bytecode as an intermediary which then needs to be converted to machine op-codes (which is what happened when we ran our baseline example). The trade-off here is larger compiled code size which shouldn't really be an issue if you only use this in the performance-critical sections of code.

@micropython.native 
def testPrimeNative(num):
# No changes to anything below this
    if num == 1: 
        # 1 is defined as not prime 
        return False 
         
    elif num > 1: 
        # check for factors 
        for i in range(2, num): 
            if (num % i) == 0: 
                # if factor is found, this is not a prime number 
                return False 
        # No factors found - number is prime 
        return True

Making code run with the native emitter is extremely easy - lots of code should run fine without any changes and the only features that can't be used are context managers and generators (so no async code either since that runs as a generator under the hood).

Here's the performance comparison code in case you want to see it.

def compareRunTime(num): 
    print("------------") 
    print(f"TESTING whether {num} is prime") 
    print("------------") 
    start_time = time.ticks_cpu() 
    output1 = testPrime(num) 
    end_time = time.ticks_cpu() 
    regular = time.ticks_diff(end_time, start_time) 
    time_ms = regular/cpu_frequency*1000 
    print(f"Regular script took {time_ms} ms / {regular} clock cycles") 
    start_time = time.ticks_cpu() 
    output2 = testPrimeNative(num) 
    end_time = time.ticks_cpu() 
    native = time.ticks_diff(end_time, start_time) 
    time_ms = native/cpu_frequency*1000 
    print(f"Native script took {time_ms} ms / {native} clock cycles")

And here are the results:

------------ 
TESTING whether 71569 is prime 
------------ 
Regular script took 487.2414 ms / 77958627 clock cycles 
Native script took 203.7882 ms / 32606107 clock cycles

So that's over twice as fast and with essentially no changes to the code.

Viper Code Emitter

Next we can use the viper code emitter. This produces machine op-codes similar to the native emitter but it optimises things further. To do that it needs to know the variable types and you will need to follow a few other restrictions (link). You will have to write your code to work specifically with the viper emitter and it may not run on the regular desktop version of CPython.

Here is the code:

@micropython.viper 
def testPrimeViper(num:int) -> bool: 
    if num == 1: 
        # 1 is defined as not prime 
        return False 
         
    elif num > 1: 
        # check for factors 
        for i in range(2, num): 
            if (num % i) == 0: 
                # if factor is found, this is not a prime number 
                return False

Here are the results:

------------ 
TESTING whether 71569 is prime 
------------ 
Regular script took 487.2413 ms / 77958605 clock cycles 
Native script took 203.7885 ms / 32606159 clock cycles 
Viper script took 30.02811 ms / 4804497 clock cycles

That's a lot faster - we are already running 16x faster than the initial MicroPython code and we didn't really have to do too much work.

Micropython C module

There's one last thing we can do - we can write our function in C and import it into MicroPython and use that to run the calculation. The nice thing about this method is most of the firmware can still be written in MicroPython and you can use all the nice high-level Python features for most of the firmware. You only write the performance-critical sections of code in C.

Once you have a compiled C module it's pretty simple to use in MicroPython:

import test_prime_c 
is_prime = test_prime_c.test_prime(71569)

Compiling it takes a bit more effort, however. You need to set up the toolchain and then write your C module. Setting up the tool-chain shouldn't take more than 2 hours. As for writing the C module - the simple one here took about 5 minutes but something more complex could take considerably longer. I'd recommend using the MicroPython examples (link) and modifying them to suit your needs. Make sure you can compile and import their example module before you try and make your own.

Here is what the C code looks like (and here is the documentation to start building your own C modules - link).

test_prime_c.c shown below.

// Include the header file to get access to the MicroPython API
#include "py/dynruntime.h"

// Helper function to check whether a number is prime
STATIC mp_int_t prime_helper(mp_int_t x) {
    if (x == 1){
	// 1 is not prime
        return 0;
    }
	for (int i = 2; i<x; i++){
		if(x%i == 0){
			// factor found - not a prime number
			return 0;
		}
	}
	return 1; // if none of the above conditions determined a factor, then this is a prime
}

// This is the function which will be called from Python, as test_prime(x). We pass it a micropython object (everything is an object in python - int is an object of type int)
STATIC mp_obj_t test_prime(mp_obj_t x_obj) {
    // Get the C compatible integer from the MicroPython integer object
    mp_int_t x = mp_obj_get_int(x_obj);
    // Check whether it is prime
    mp_int_t result = prime_helper(x);
    // Convert the result to a MicroPython boolean object and return it
    return mp_obj_new_bool(result);
}

// Define a Python reference to the function above
// so that you can call it through test_prime_c.test_prime
STATIC MP_DEFINE_CONST_FUN_OBJ_1(test_prime_obj, test_prime);

// This is the entry point and is called when the module is imported
mp_obj_t mpy_init(mp_obj_fun_bc_t *self, size_t n_args, size_t n_kw, mp_obj_t *args) {
    // This must be first, it sets up the globals dict and other things
    MP_DYNRUNTIME_INIT_ENTRY

    // Make the function available in the module's namespace
    mp_store_global(MP_QSTR_test_prime, MP_OBJ_FROM_PTR(&test_prime_obj));

    // This must be last, it restores the globals dict
    MP_DYNRUNTIME_INIT_EXIT
}

We compile that for our target MCU and it will generate a test_prime_c.mpy file that we upload onto our MCU. We can then import the module into MicroPython and start using it like any other module. Here's a copy of the module I compiled for the ESP32: link.

Here is the performance benchmark. It’s also printing the results to make sure that all of them return the same result.

------------ 
TESTING whether 71569 is prime 
------------ 
Regular script took 487.2414 ms / 77958627 clock cycles 
Native script took 203.7882 ms / 32606107 clock cycles 
Viper script took 30.02839 ms / 4804543 clock cycles 
Compiled C module took 4.476794 ms / 716287 clock cycles 
------------ 
RESULTS 
------------ 
True True True True 
------------

108.8x faster than the baseline - not bad!

Let's check with a larger number and a smaller number to see how the performance compares for larger and smaller workloads.

compareRunTime(313) # This is a prime number - should be quick to calculate

------------ 
TESTING whether 313 is prime 
------------ 
Regular script took 2.238606 ms / 358177 clock cycles 
Native script took 0.9453438 ms / 151255 clock cycles 
Viper script took 0.1852 ms / 29632 clock cycles 
Compiled C module took 0.03605625 ms / 5769 clock cycles 
------------ 
RESULTS 
------------ 
True True True True 
------------

compareLongRunTime(4951277) # This is a prime number - should take much longer to calculate

------------ 
TESTING whether 4951277 is prime 
------------ 
Regular script took 34 s 
Native script took 14 s 
Viper script took 2 s 
Compiled C module took 0s / 305.7551 ms  
------------ 
RESULTS 
------------ 
True True True True 
------------

With a small workload we see a 62x improvement and with a large workload we see a 112x improvement.

Conclusion

In general I would try and stick to the native emitter or viper emitter because they are decent enough improvements but still quite quick and easy to implement.

There are also some other things you can do such as caching references to global objects that would make things faster without having to move to C (link).

However if there are some specific performance critical sections that are too slow then C is worth a shot. I haven't tried accessing hardware through C just yet so that should make an interesting experiment in the future.

Here are the results of the above tests in graph form.

In case anyone needs these for reference, here are source files of the micropython test script, the compiled C module, and C module source file:

test_prime_script.py

test_prime_c.c (uncompiled C code that you could compile)

test_prime_c.mpy (compiled C module that you can import onto an ESP-WROOM-32)