Some things are notoriously hard to test.
Since we’re not designing our own gateway or packet forwarder (yet…), we end up with a third-party “thing” between our beautiful Congress backend and our elegant EE02 LoRa modules.
This “thing” is a LoRa gateway, whose behaviour we simply have to trust. To claim to be a LoRa gateway, it should reliably implement all the happy-path scenarios needed for a successful LoRa deployment.
But what about the not-so-happy paths?
Let’s say that the LoRa specification defines three join frequencies for our frequency band. What will happen if four or more devices attempt to join at exactly the same time?
I decided to test this - and the LoRa Collider was born. It consists of a motherboard and up to 9 LoRa modules that will boot simultaneously.
The status LEDs indicate (from the top) JOIN REQUEST, JOINED, TX, TX_COMPLETE, RX and ACK. Two rows of buttons enable the user to trigger custom events on the LoRa modules.
Looking at the output from GQRX using a HackRF, I could see that booting 9 devices simultaneously resulted in what appeared to be a shouting match on the same frequency. One device managed to join, but the remaining 8 were forced to retry - resulting in a new shouting match with one less participant.
This process repeated itself until the number of retries exceeded the default limit in the firmware implementation.
I was actually a bit surprised that one device managed to join from time to time, having assumed that simultaneous transmissions on the same frequency would result in a CRC error on the gateway.
This was not optimal. We decided to add a pseudorandom back-off delay, using part of the deviceEUI as a random seed. A maximum delay of one second resolved the potential traffic jam and all devices were able to join. We should probably also implement a random delay modifier for TX timing, since fixed TX intervals could result in permanent conflicts between devices.
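To make the idea concrete, here is a minimal sketch of such a back-off in C. It is not the actual two-line firmware change, which is not shown here. The function names, the choice of the last four bytes of the deviceEUI as the seed, and the xorshift32 generator are all assumptions made for illustration. Only the one-second maximum delay comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEV_EUI_LEN         8u     /* a deviceEUI is 8 bytes long */
#define JOIN_BACKOFF_MAX_MS 1000u  /* maximum delay of one second */

/* Hypothetical helper: build a 32-bit seed from the last four bytes of the
 * deviceEUI. Which part of the EUI to use is an assumption; the trailing
 * bytes are usually the ones that differ between devices of one vendor. */
static uint32_t seed_from_dev_eui(const uint8_t dev_eui[DEV_EUI_LEN])
{
    assert(dev_eui != NULL);

    uint32_t seed = 0;
    for (size_t i = 4; i < DEV_EUI_LEN; i++) {
        seed = (seed << 8) | dev_eui[i];
    }
    /* xorshift32 gets stuck at zero, so never hand it a zero seed. */
    return seed != 0u ? seed : 1u;
}

/* Marsaglia's xorshift32: a tiny generator that is good enough for
 * spreading join attempts (not for anything security related). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: next back-off delay in the range 0..1000 ms. */
static uint32_t join_backoff_ms(uint32_t *state)
{
    return xorshift32(state) % (JOIN_BACKOFF_MAX_MS + 1u);
}
```

In firmware, the state would be seeded once at boot, and each join attempt would sleep for join_backoff_ms(&state) milliseconds before transmitting. Because the state advances on every call, devices that collide once are unlikely to pick the same delay on the retry. The same generator could also provide the random modifier for TX timing.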
The necessary firmware modifications consisted of only two lines of C, but they made a huge difference in join reliability. The observations that led to these bugfixes were trivial once the collider was up and running. Without the collider, only a real deployment in a simultaneous boot scenario would have uncovered this weakness in the firmware.
The most important aspect of the collider is probably that it enables us to force the gateway to produce traffic patterns that we haven’t seen before and therefore haven’t been able to simulate in our load testing client.
Sourcing a specialized tool like this on the market would have been hard, and probably prohibitively expensive. Solving this in-house required a modest investment of approximately 4 hours of design time, 4 hours of soldering and 4 minutes of programming. We already had all the necessary parts lying around in the lab.
Once again we see that Ray Kurzweil’s law of accelerating returns is also valid on shorter timescales. It kicks in the instant you make something that is reusable in a new context. Each iteration is faster and cheaper than the previous one, since you accumulate increasingly powerful tools and components that can be used to solve the next task at hand.