FOTA: Firmware Over The Air

When prototyping and writing the first version of the firmware, you rarely spend much time thinking about updating it; the focus is mainly on getting everything to work. After that you start ironing out all the other issues, like watchdogs, logging, power hogs and correct behaviour when you lose connectivity or anything else goes off the happy path, and finally the first version is ready to be deployed. If you are lucky you have just a few test devices in relatively close proximity that can be fixed whenever something in the real world isn’t quite what you hoped it would be. Add a few more devices in the field and you find yourself spending more time upgrading devices than finding and fixing bugs.

The really nice thing about IoT devices is that they can be placed just about anywhere and the really annoying thing about IoT devices is that they’re placed just about anywhere.

You might be tempted to roll your own upgrades – after all, how hard can it be? Retrieve a relatively small file, write it to flash, then reboot the device. It sounds really simple until you start thinking about all the things that can go wrong:

  • How are you going to retrieve it? If you are running with a downsized IP stack you might not have TCP at all.
  • The image must be valid and signed: if the image is corrupted during transfer (or worse, someone has tampered with it) you don’t want to install it.
  • If an image is invalid you must be able to revert to the old one.
  • If an image is valid but crashes on startup you want to revert to the old one.
  • You want the device to report the version it’s running so you can keep track of which devices have been upgraded.

HTTP might be your first go-to protocol for upgrades, but there are a few pitfalls. If the connection is lost during an upgrade you can restart the download, but that consumes both power and bandwidth. You can get around this with HTTP range requests, but that requires an additional roundtrip and support for range requests in both the client and the server. The second go-to protocol is MQTT, but then you find yourself designing a protocol on top of MQTT that must handle partial and slow downloads and dropped TCP connections, and you still have to piece together the binary on the device side.
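To make the range-request workaround a bit more concrete: a client that keeps the partially downloaded file can ask for only the missing bytes by sending a `Range` header that starts at the current file size. This is a minimal sketch, assuming the partial download is kept in a local file. `range_header_for` is a hypothetical helper. When curl resumes into an existing output file with `curl -C -`, it does this calculation for you.

```shell
# Print the HTTP Range header that resumes a download into FILE.
# Usage: range_header_for FILE
range_header_for() {
    file=$1
    if [ -f "$file" ]; then
        # Resume after the bytes already on disk; $(( )) strips the padding
        # that some wc implementations add to their output.
        offset=$(( $(wc -c < "$file") ))
    else
        offset=0   # nothing downloaded yet: ask for the whole image
    fi
    printf 'Range: bytes=%s-\n' "$offset"
}
```

With 4096 bytes already on disk, it prints `Range: bytes=4096-`. A server that supports ranges answers with `206 Partial Content`, which is the extra support needed on both sides.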

There aren’t a lot of standards out there that handle FOTA either. Most solutions are proprietary and run on top of HTTP, MQTT or CoAP. But there is one standard: LwM2M. It’s not pretty by any stretch of the imagination, but Zephyr (and by extension the nRF91 DK) has a client that we can use. If you’re not running Zephyr you might want to look into Eclipse Wakaama.

The LwM2M standard

If you read through the LwM2M standard you pretty soon discover that it’s… a lot, but at its core it is built on top of a resource model similar to what we see in HTTP and CoAP. At the top level there are objects. Each object has one or more instances, and each instance contains resources (some resources can hold several values, called resource instances). Objects and resources have predefined IDs and instances are numbered, so /3/0/0 is resource 0 (Manufacturer) in instance 0 of object 3 (Device). The lower range of object IDs is reserved for the basic LwM2M objects, while higher-numbered IDs are reserved for predefined sensors and user-defined sensors. The bottom range of objects looks like this (some of these are regular resources that support GET and PUT, while others are commands that only support POST):

/Object/Instance/Resource  Description
/0                         Security object
/3                         Device information object
/3/0/0                     Manufacturer
/3/0/1                     Model number
/3/0/2                     Serial number
/3/0/3                     Firmware version
/3/0/4                     Reboot command
/4                         Connectivity monitoring object
/5                         Firmware update object
/5/0/1                     Firmware image URI
/5/0/2                     Firmware update command
/5/0/3                     Firmware state
/5/0/5                     Firmware update result
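Every path is an object ID, an instance ID and a resource ID separated by slashes, so taking one apart is mechanical. The small sketch below names the parts of a path. `lwm2m_split_path` is a hypothetical helper and is not part of Zephyr, Wakaama or Span.

```shell
# Split an LwM2M path such as /5/0/3 into object, instance and resource IDs.
# Parts that are missing (e.g. in /5) are printed as "-".
lwm2m_split_path() {
    path=${1#/}        # drop the leading slash
    old_ifs=$IFS
    IFS=/
    set -f             # no filename globbing while splitting
    set -- $path
    set +f
    IFS=$old_ifs
    printf 'object=%s instance=%s resource=%s\n' "${1:--}" "${2:--}" "${3:--}"
}
```

For /3/0/4 it prints `object=3 instance=0 resource=4`, which is the Reboot resource in the single instance of the Device object.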
You can tell the device to update itself by writing a valid URI for the firmware image you want to apply to /5/0/1 (firmware image URI). When the device has downloaded the image, you can start the update by POSTing to /5/0/2 on the device. There are a few details I’m omitting here, but let’s just say that the Zephyr LwM2M and Wakaama libraries take care of all the nitty-gritty details. There are a lot of them. Trust me.

This takes care of the delivery, but we still have to manage the image on the device itself. But fear not!

The firmware update in Span works like this when talking to a LwM2M device:

FOTA update in Span

It all starts with the device registering with Span. Next, Span queries the device for its firmware version and a few other properties. If the firmware version maps to one of the firmware images uploaded to Span, the device is assigned to that firmware image. If there’s an upgrade scheduled for the device, Span checks whether the device is ready to upgrade. If the update state (at /5/0/3) is reported as “Idle”, the upgrade starts.

The firmware image URI is set on the device and Span starts observing the update state property (/5/0/3) on the device. Whenever this changes the device will send a notification to Span. Once the download is complete Span issues the update command to the device (/5/0/2) and the device updates itself.

The device installs the new firmware image, then restarts and re-registers with Span and the version check is repeated. If everything went as planned the device will report the new version.

If one of the upgrade steps fails, the firmware update process is halted for the device and Span won’t attempt any further firmware upgrades.
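On the server side, the flow above boils down to reacting to the value the device reports at /5/0/3. Below is a minimal sketch of that decision. It assumes the state values defined by the LwM2M Firmware Update object (0 Idle, 1 Downloading, 2 Downloaded, 3 Updating). `fota_next_action` and the action names are hypothetical and are not taken from Span.

```shell
# Decide what an LwM2M server does next, given the state value at /5/0/3.
# The action names are illustrative only.
fota_next_action() {
    case $1 in
        0) echo "write-uri" ;;      # Idle: write the image URI to /5/0/1
        1) echo "wait" ;;           # Downloading: keep observing /5/0/3
        2) echo "execute-update" ;; # Downloaded: execute /5/0/2
        3) echo "wait-register" ;;  # Updating: wait for the device to re-register
        *) echo "halt" ;;           # unexpected value: stop upgrading this device
    esac
}
```

A device also reports Idle after an update has finished. A real server therefore has to remember whether it has already started an upgrade for the device. In the flow above, the version check after re-registration covers that.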

That takes care of the upgrade protocol. Writing everything to flash and managing the different firmware images is a whole different ball game. Luckily we can use another nice library in Zephyr for this: MCUBoot.

The MCUBoot library

If you dig through the Zephyr documentation you’ll find MCUBoot, which takes care of the image shuffling on the device side. The MCUBoot design document has most of the details, but it works roughly like this:

  • The flash is divided into two slots (a primary and a secondary) plus a scratch area, which is used when swapping the two images.
  • The current image is in the primary slot.
  • The new image is written to the secondary slot as it is downloaded.
  • If a new image is found in the secondary slot, the primary and secondary images are swapped and the new image is marked as “test”.
  • When the device has booted into the new image and it works, the image must be marked as good.
  • If the test image hasn’t been marked as good by the next reboot, the known good image in the secondary slot is swapped back into the primary slot.

…in short: It handles the image details.
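The swap, test and revert rules above can be modelled in a few lines. This is a toy simulation of the behaviour described in the list, not MCUBoot code. The slot variables and function names are hypothetical. On a real Zephyr device, the application performs the “mark as good” step by calling `boot_write_img_confirmed()` from the MCUBoot DFU API.

```shell
# Toy model of MCUBoot's test-and-revert cycle; each slot holds a version.
primary="1.0.0"
secondary=""
pending=no      # "yes" when a new image waits in the secondary slot
confirmed=yes   # "yes" when the image in the primary slot is marked as good

swap_slots() {
    tmp=$primary
    primary=$secondary
    secondary=$tmp
}

# The new image has been downloaded into the secondary slot.
stage_image() {
    secondary=$1
    pending=yes
}

# What the bootloader does on every reset.
boot() {
    if [ "$pending" = yes ]; then
        swap_slots          # boot the new image as a "test" image
        pending=no
        confirmed=no
    elif [ "$confirmed" = no ]; then
        swap_slots          # test image was never confirmed: revert
        confirmed=yes
    fi
}

# Called by the running application once it knows the image works.
confirm_image() {
    confirmed=yes
}
```

If you stage 1.1.0 and reset twice without calling `confirm_image`, 1.0.0 ends up back in the primary slot. This is the “valid but crashes on startup” case from the list at the top. It is also why a hung test image needs a watchdog: the revert only happens after a reset.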

Let’s try it out!

Let’s run through an example. I’m using an nRF91 DK board and the nRF91 samples in the Exploratory Engineering GitHub repo. Make sure you have a working toolchain and the nRF91 SDK installed.

Start by registering the device at https://span.lab5e.com/. If you haven’t got the IMSI and IMEI you can use the AT Client sample from the nRF SDK and type AT+CIMI (for IMSI) plus AT+CGSN=1 (for IMEI) in a serial terminal.

Build the FOTA sample with west build samples/fota and flash it with west flash. When you run it you’ll see the client register itself:

***** Booting Zephyr OS build v2.0.99-ncs1 *****
[00:00:00.338,378] <inf> lte_lc: PDP Context: AT+CGDCONT=0,"IP","mda.ee"
[00:00:33.209,472] <inf> app_fota: Modem firmware version: mfw_nrf9160_1.0.1
[00:00:33.209,533] <inf> net_lwm2m_rd_client: Start LWM2M Client: nrf-352656100299737
[00:00:33.849,853] <inf> net_lwm2m_rd_client: RD Client started with endpoint 'nrf-352656100299737' with client lifetim0
[00:00:34.342,041] <inf> net_lwm2m_rd_client: Registration Done (EP='242016000001673')

If you want to manage the firmware on the device via the API, you have to configure the collection. You can either manage all devices in the collection or each device individually. The latter is nice if you just want to test firmware on a single device or want fine-grained control over which device gets upgraded when.

I’m using curl to talk to the API (create an admin API token for the collection that you’re using). The collection and device IDs in these examples will be different from yours, so make sure you change them.
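The commands in this walkthrough read the API token and the collection and device IDs from environment variables. Below is a sketch of that setup using the IDs from this example. The token value is a placeholder, and your IDs will differ.

```shell
# Admin API token for the collection (placeholder, use your own token).
export TOKEN="your-admin-api-token"
# Collection (C) and device (D) IDs used in the curl commands.
export C="17dh0cf43jfnni"
export D="17dh0cf43jg6hn"
```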

If we look at the collection resource where our device lives, the firmware section has management set to “disabled”. Note that I’m using environment variables for the API token and IDs to make life a bit easier:

$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}
{
  "collectionId": "17dh0cf43jfnni",
  "teamId": "17dh0cf43jfida",
  "fieldMask": {
    "imsi": false,
    "imei": false,
    "location": true,
    "msisdn": false
  },
  "firmware": {
    "management": "disabled"
  },
  "tags": {
    "name": "FOTA collection"
  }
}

Let’s update it to "device":

$ curl -XPATCH -d'{"firmware":{"management": "device"}}' -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}
{
  "collectionId": "17dh0cf43jfnni",
  "teamId": "17dh0cf43jfida",
  "fieldMask": {
    "imsi": false,
    "imei": false,
    "location": true,
    "msisdn": false
  },
  "firmware": {
    "management": "device"
  },
  "tags": {
    "name": "FOTA collection"
  }
}

Now that the API is ready, we can reboot the device. The next time it registers, you’ll see a firmware section with a few new fields:

$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  "deviceId": "17dh0cf43jg6hn",
  "collectionId": "17dh0cf43jfnni",
  "imei": "352656100299737",
  "imsi": "242016000001673"
}
# ... the device registers
$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  "deviceId": "17dh0cf43jg6hn",
  "collectionId": "17dh0cf43jfnni",
  "imei": "352656100299737",
  "imsi": "242016000001673",
  "tags": {
    "3gpp-ms-timezone": "4000",
    "radius-allocated-at": "2019-12-18T12:03:19Z",
    "radius-ip-address": "10.8.1.98"
  },
  "network": {
    "allocatedIp": "10.8.1.98",
    "allocatedAt": 1576670599183,
    "cellId": null
  },
  "firmware": {
    "firmwareVersion": "1.0.0",
    "serialNumber": "1",
    "modelNumber": "EE-FOTA-00",
    "manufacturer": "Exploratory Engineering",
    "state": "Current"
  }
}

If you go to the sample code in samples/fota/src/fota.h you’ll see the same values. If you change the values and re-flash the firmware you’ll see the fields change accordingly. It’s sort of nice but probably not why you’ve read this far so let’s upload the initial firmware image to Span so that we have the same image running on the device and in the API.

Upload the signed binary image (not the .hex file!) to the firmware library in the API. The file should be at build/zephyr/app_update.bin. Again, I’m using curl for this:

$ curl -HX-API-Token:${TOKEN} -XPOST -F image=@build/zephyr/app_update.bin https://api.lab5e.com/span/collections/${C}/firmware
{
  "imageId": "17dh0cf43jfgna",
  "version": "d7186ac87a8c1ab8a28d47999a79e465b51b1f3574a1fe8f39b1e6fb462d36de",
  "filename": "app_update.bin",
  "sha256": "d7186ac87a8c1ab8a28d47999a79e465b51b1f3574a1fe8f39b1e6fb462d36de",
  "length": 221956,
  "collectionId": "17dh0cf43jfnni",
  "created": 1576670719140,
  "tags": {}
}
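The response echoes the image’s sha256 and length. That makes it easy to check that what Span stored is exactly the file you built. Below is a minimal sketch, assuming a host with GNU `sha256sum` (on macOS, `shasum -a 256` prints the same format). `check_upload` is a hypothetical helper, and you paste the values from the response into it by hand.

```shell
# Compare a local image with the sha256 and length reported by the API.
# Usage: check_upload FILE EXPECTED_SHA256 EXPECTED_LENGTH
check_upload() {
    file=$1
    want_sha=$2
    want_len=$3
    # Hash stdin so the output is "<hash>  -"; keep only the hash.
    have_sha=$(sha256sum < "$file" | cut -d' ' -f1)
    have_len=$(( $(wc -c < "$file") ))
    if [ "$have_sha" = "$want_sha" ] && [ "$have_len" -eq "$want_len" ]; then
        echo "ok"
    else
        echo "mismatch: sha256=$have_sha length=$have_len" >&2
        return 1
    fi
}
```

For the upload above, that would be `check_upload build/zephyr/app_update.bin d7186ac87a8c1ab8a28d47999a79e465b51b1f3574a1fe8f39b1e6fb462d36de 221956`. The check only shows that the upload wasn’t corrupted. Authenticity comes from the image signature.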

The version isn’t set correctly: when you upload an image, the version defaults to the image’s SHA-256 hash, as you can see in the response. This is version 1.0.0, so let’s update it:

$ curl -HX-API-Token:${TOKEN} -XPATCH -d'{"version":"1.0.0"}' https://api.lab5e.com/span/collections/${C}/firmware/17dh0cf43jfgna
{
  "imageId": "17dh0cf43jfgna",
  "version": "1.0.0",
  "filename": "app_update.bin",
  "sha256": "d7186ac87a8c1ab8a28d47999a79e465b51b1f3574a1fe8f39b1e6fb462d36de",
  "length": 221956,
  "collectionId": "17dh0cf43jfnni",
  "created": 1576670719140,
  "tags": {}
}

Now that version 1.0.0 is uploaded you can try rebooting the device. When it has registered you should see the field currentFirmwareId set to the same ID as the image you uploaded since Span now knows what version maps to which firmware image:

$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  "deviceId": "17dh0cf43jg6hn",
  "collectionId": "17dh0cf43jfnni",
  "imei": "352656100299737",
  "imsi": "242016000001673",
  "tags": {
    "3gpp-ms-timezone": "4000",
    "radius-allocated-at": "2019-12-18T12:07:17Z",
    "radius-ip-address": "10.8.1.98"
  },
  "network": {
    "allocatedIp": "10.8.1.98",
    "allocatedAt": 1576670837902,
    "cellId": null
  },
  "firmware": {
    "currentFirmareId": "17dh0cf43jfgna",
    "firmwareVersion": "1.0.0",
    "serialNumber": "1",
    "modelNumber": "EE-FOTA-00",
    "manufacturer": "Exploratory Engineering",
    "state": "Current"
  }
}

Updating the firmware

Edit samples/fota/src/fota.h and change the CLIENT_FIRMWARE_VER setting to 1.1.0.

You can add some log statements so that you’ll see the new version running in a serial terminal. Run west build to build the new version.

Your fingers have incredible muscle memory and are probably halfway to writing west flash but let’s do it the Proper Way – upload the image! As before the version must be set after the upload. This is probably something you want to automate later on:

$ curl -HX-API-Token:${TOKEN} -XPOST -F image=@build/zephyr/app_update.bin  https://api.lab5e.com/span/collections/${C}/firmware
{
  "imageId": "17dh0cf43jfgnb",
  "version": "c5397c323b8bf4816b4bc3d0095fa13ae5d11fcf4d0feeb519c334b89cab8830",
  "filename": "app_update.bin",
  "sha256": "c5397c323b8bf4816b4bc3d0095fa13ae5d11fcf4d0feeb519c334b89cab8830",
  "length": 222080,
  "collectionId": "17dh0cf43jfnni",
  "created": 1576671247292,
  "tags": {}
}
$ curl -HX-API-Token:${TOKEN} -XPATCH -d'{"version":"1.1.0"}' https://api.lab5e.com/span/collections/${C}/firmware/17dh0cf43jfgnb
{
  "imageId": "17dh0cf43jfgnb",
  "version": "1.1.0",
  "filename": "app_update.bin",
  "sha256": "c5397c323b8bf4816b4bc3d0095fa13ae5d11fcf4d0feeb519c334b89cab8830",
  "length": 222080,
  "collectionId": "17dh0cf43jfnni",
  "created": 1576671247292,
  "tags": {}
}
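Uploading and then patching the version is easy to forget, so it is worth wrapping both steps in a small script. This sketch makes several assumptions. `jq` must be installed, and the `TOKEN` and `C` environment variables from the commands above must be set. It also assumes fota.h defines the version as `#define CLIENT_FIRMWARE_VER "1.1.0"`, so check your copy of the sample. The function names are hypothetical.

```shell
API="https://api.lab5e.com/span"

# Extract the version from a line like: #define CLIENT_FIRMWARE_VER "1.1.0"
firmware_version_from_header() {
    sed -n 's/^#define[[:space:]]*CLIENT_FIRMWARE_VER[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Upload IMAGE, set its version to VERSION and print the new image ID.
# Usage: upload_firmware IMAGE VERSION
upload_firmware() {
    id=$(curl -sf -HX-API-Token:"${TOKEN}" -XPOST -F image=@"$1" \
        "${API}/collections/${C}/firmware" | jq -r .imageId)
    # jq prints nothing (or "null") if the upload failed.
    if [ -z "$id" ] || [ "$id" = null ]; then
        echo "upload failed" >&2
        return 1
    fi
    curl -sf -HX-API-Token:"${TOKEN}" -XPATCH -d"{\"version\":\"$2\"}" \
        "${API}/collections/${C}/firmware/${id}" > /dev/null || return 1
    echo "$id"
}
```

Running `upload_firmware build/zephyr/app_update.bin "$(firmware_version_from_header samples/fota/src/fota.h)"` replaces the two commands above. Reading the version from the same header the firmware is built from keeps the API’s version and the version the device reports in sync.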

Right now nothing new will happen with the device unless you tell the API you want to use a new image on the device. You do this by setting the targetFirmwareId to the image you want to upgrade to:

$ curl -HX-API-Token:${TOKEN} -XPATCH -d'{"firmware":{"targetFirmwareId":"17dh0cf43jfgnb"}}' https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  "deviceId": "17dh0cf43jg6hn",
  "collectionId": "17dh0cf43jfnni",
  "imei": "352656100299737",
  "imsi": "242016000001673",
  "tags": {
    "3gpp-ms-timezone": "4000",
    "radius-allocated-at": "2019-12-18T12:07:17Z",
    "radius-ip-address": "10.8.1.98"
  },
  "network": {
    "allocatedIp": "10.8.1.98",
    "allocatedAt": 1576670837902,
    "cellId": null
  },
  "firmware": {
    "currentFirmareId": "17dh0cf43jfgna",
    "targetFirmwareId": "17dh0cf43jfgnb",
    "firmwareVersion": "1.0.0",
    "serialNumber": "1",
    "modelNumber": "EE-FOTA-00",
    "manufacturer": "Exploratory Engineering",
    "state": "Pending"
  }
}
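Scheduling an upgrade is a single PATCH with a small JSON body. The sketch below builds the body separately so it can be checked before anything is sent. `target_body` and `schedule_upgrade` are hypothetical helpers, and `TOKEN` and `C` are the environment variables used above.

```shell
API="https://api.lab5e.com/span"

# JSON body that points a device at a firmware image.
target_body() {
    printf '{"firmware":{"targetFirmwareId":"%s"}}' "$1"
}

# Usage: schedule_upgrade DEVICE_ID IMAGE_ID
schedule_upgrade() {
    curl -sf -HX-API-Token:"${TOKEN}" -XPATCH -d"$(target_body "$2")" \
        "${API}/collections/${C}/devices/$1"
}
```

With per-device management, upgrading a handful of test devices is then a loop such as `for d in 17dh0cf43jg6hn; do schedule_upgrade "$d" 17dh0cf43jfgnb; done`.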

Now it’s time to wait. The backend won’t upgrade the device until it reports back and refreshes its session. The sample has the session timeout set to 300 seconds, i.e. 5 minutes, so you might want to grab a coffee now. When the upgrade starts, you’ll see several log messages in the terminal before the device reboots and registers again:

[00:00:35.416,839] <inf> net_lwm2m_rd_client: Start LWM2M Client: nrf-352656100299737
[00:00:36.342,773] <inf> net_lwm2m_rd_client: RD Client started with endpoint 'nrf-352656100299737' with client lifetim0
[00:00:36.843,139] <inf> net_lwm2m_rd_client: Registration Done (EP='242016000001673')
[00:10:31.024,505] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
[00:10:31.024,536] <inf> net_lwm2m_rd_client: Update Done
[00:10:31.434,295] <inf> net_lwm2m_obj_firmware_pull: Connecting to server coap://172.16.15.14:5683/fw
[00:10:31.832,916] <inf> app_fota: Started downloading MCUBoot image
[00:10:32.128,936] <inf> fota_flash_block: Erasing sector at offset 0x00000000
[00:10:34.476,928] <inf> app_fota: 1%
[00:10:36.516,906] <inf> fota_flash_block: Erasing sector at offset 0x00001000
[00:10:37.204,864] <inf> app_fota: 2%
[00:10:40.356,994] <inf> app_fota: 3%

    ...several log messages here. The upgrade process might take a minute or two
    to finish.

[00:13:05.565,643] <inf> app_fota: 99%
[00:13:06.405,700] <inf> fota_flash_block: Erasing sector at offset 0x00036000
[00:13:07.045,684] <inf> app_fota: 100%
[00:13:07.052,124] <inf> fota_flash_block: Erasing sector at offset 0x00068000
[00:13:07.137,542] <inf> dfu_target_mcuboot: MCUBoot image upgrade scheduled. Reset the device to apply
[00:13:07.317,657] <inf> app_fota: Executing firmware update
***** Booting Zephyr OS build v2.0.99-ncs1 *****
[00:00:00.004,791] <inf> mcuboot: Starting bootloader
[00:00:00.013,061] <inf> mcuboot: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x1
[00:00:00.025,970] <inf> mcuboot: Scratch: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
[00:00:00.038,238] <inf> mcuboot: Boot source: primary slot
[00:00:00.049,316] <inf> mcuboot: Swap type: test
[00:00:25.429,077] <inf> mcuboot: Bootloader chainload address offset: 0xc000
[00:00:25.436,584] <inf> mcuboot: Jumping to the first image slot
***** Booting Zephyr OS build v2.0.99-ncs1 *****
Flash region            Domain          Permissions
00 0x00000 0x08000      Secure          rwxl
01 0x08000 0x10000      Secure          rwxl

    ....boot messages from nRF91 ...

SPM: NS image at 0x18200
SPM: NS MSP at 0x2002c708
SPM: NS reset vector at 0x1ca05
SPM: prepare to jump to Non-Secure image.
***** Booting Zephyr OS build v2.0.99-ncs1 *****
[00:00:01.636,291] <inf> lte_lc: PDP Context: AT+CGDCONT=0,"IP","mda.ee"
[00:00:33.115,142] <inf> app_fota: Firmware version: 1.1.0
[00:00:33.115,142] <inf> app_fota: Model number:     EE-FOTA-00
[00:00:33.115,173] <inf> app_fota: Serial numbera:   1
[00:00:33.115,173] <inf> app_fota: Manufacturer:     Exploratory Engineering
[00:00:33.115,173] <inf> app_fota: This is the new version of the firmware!
[00:00:33.115,997] <inf> app_fota: Firmware update succeeded
[00:00:33.117,248] <inf> app_fota: Modem firmware version: mfw_nrf9160_1.0.1
[00:00:33.117,248] <inf> net_lwm2m_rd_client: Start LWM2M Client: nrf-352656100299737
[00:00:33.647,369] <inf> net_lwm2m_rd_client: RD Client started with endpoint 'nrf-352656100299737' with client lifetim0
[00:00:34.138,977] <inf> net_lwm2m_rd_client: Registration Done (EP='242016000001673')

It is now running the new version. If you query the device during the upgrade you’ll see the different states as it is updating:

$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  # ....
  "firmware": {
    "currentFirmareId": "17dh0cf43jfgna",
    "targetFirmwareId": "17dh0cf43jfgnb",
    "firmwareVersion": "1.0.0",
    "serialNumber": "1",
    "modelNumber": "EE-FOTA-00",
    "manufacturer": "Exploratory Engineering",
    "state": "Pending"
  }
}
# Device starts downloading firmware
$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  # ....
  "firmware": {
    "currentFirmareId": "17dh0cf43jfgna",
    "targetFirmwareId": "17dh0cf43jfgnb",
    "firmwareVersion": "1.0.0",
    "serialNumber": "1",
    "modelNumber": "EE-FOTA-00",
    "manufacturer": "Exploratory Engineering",
    "state": "Downloading",
    "stateMessage": "Waiting for device to download firmware image"
  }
}
# Device has downloaded the image and is restarting
$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  # .....
  "firmware": {
    "currentFirmareId": "17dh0cf43jfgna",
    "targetFirmwareId": "17dh0cf43jfgnb",
    "firmwareVersion": "1.0.0",
    "serialNumber": "1",
    "modelNumber": "EE-FOTA-00",
    "manufacturer": "Exploratory Engineering",
    "state": "Completed",
    "stateMessage": "Device has downloaded firmware image and is performing update"
  }
}
# Device has rebooted and is running the new firmware
$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/devices/${D}
{
  # .....
  "firmware": {
    "currentFirmareId": "17dh0cf43jfgnb",
    "targetFirmwareId": "17dh0cf43jfgnb",
    "firmwareVersion": "1.1.0",
    "serialNumber": "1",
    "modelNumber": "EE-FOTA-00",
    "manufacturer": "Exploratory Engineering",
    "state": "Current"
  }
}

When the update is complete the state is set to “Current” and the current and target firmware IDs are set to the same value.

That’s it!

What if something goes wrong?

Firmware upgrades can fail in a lot of ways: the device can get stuck while downloading, it can drop off the network, the battery can run out, and something might crash in either the old or the new firmware. Make sure you use a watchdog timer in your firmware so the device can recover.

Also make sure that the version you report is the correct one. If the device reports the same version for different firmware images, Span will assume that the update has failed in some way and that the device has rolled back to the previous version.

Firmware states

During the upgrade the device transitions between the following states:

  • In the Current state the firmware is up to date.
  • In the Pending state a firmware upgrade is scheduled.
  • The device is in the Initializing state when the firmware upgrade has started.
  • In the Downloading state the device is downloading the new image from Span.
  • In the Downloaded state the device has downloaded the image and is installing it.
  • If everything is OK, the device returns to the Current state when it is finished.
  • In case of an error, the state is set to Error, TimedOut or Reverted. The stateMessage field provides additional information.
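For scripts that wait for an upgrade to finish, the states fall into three groups: done, still in progress and failed. “Completed” is listed as in progress because the example output above shows it while the device installs the image. Below is a sketch that assumes `jq` is installed and the same environment variables as the curl commands are set. The helper names and the 30-second poll interval are arbitrary choices, not something Span prescribes.

```shell
API="https://api.lab5e.com/span"

# Classify a firmware state reported by Span: done, busy, failed or unknown.
fota_state_kind() {
    case $1 in
        Current) echo "done" ;;
        Pending|Initializing|Downloading|Downloaded|Completed) echo "busy" ;;
        Error|TimedOut|Reverted) echo "failed" ;;
        *) echo "unknown" ;;
    esac
}

# Poll DEVICE_ID until its upgrade finishes; returns non-zero on failure.
# Usage: wait_for_upgrade DEVICE_ID
wait_for_upgrade() {
    while :; do
        state=$(curl -sf -HX-API-Token:"${TOKEN}" \
            "${API}/collections/${C}/devices/$1" | jq -r .firmware.state)
        kind=$(fota_state_kind "$state")
        [ "$kind" = done ] && return 0
        if [ "$kind" = failed ]; then
            echo "upgrade stopped in state $state" >&2
            return 1
        fi
        sleep 30   # busy, or the request failed: try again later
    done
}
```

Something like `wait_for_upgrade "$D" && echo upgraded` then blocks until the device is Current. Keep in mind that the first step can take up to the session timeout.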

Keeping track of several devices

When you reach a critical mass of devices (in my case this can be in the single digits), you probably want an overview of which firmware the different devices are running. Send a request to the usage resource of a firmware image and you’ll get the devices that are targeted for that image and the devices currently running it:

$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/firmware/17dh0cf43jfgnb/usage
{
  "firmwareId": "17dh0cf43jfgnb",
  "targetedDevices": [
    "17dh0cf43jg6hn"
  ],
  "currentDevices": [
    "17dh0cf43jg6hn"
  ]
}

The old version (1.0.0) is no longer in use by any device, so both lists are empty:

$ curl -HX-API-Token:${TOKEN} https://api.lab5e.com/span/collections/${C}/firmware/17dh0cf43jfgna/usage
{
  "firmwareId": "17dh0cf43jfgna",
  "targetedDevices": [],
  "currentDevices": []
}

Version numbering

There’s no hard and fast rule for how you number firmware versions. In this example we’ve used an n.n.n numbering scheme. Whether you name versions after your git commit hashes, give them code names like flawless-recording-walrus (4deacee) or just stick to a year-month-day scheme, it’s all up to you; versions are just opaque strings for Span.

Photo credits: Einar Jenssen, a Telenor employee at Svalbard. He’ll probably like this feature.


-- sd, 2019-12-12