Lost ACK for a confirmed uplink causes data loss #38
Comments
The race condition I am referring to here is that occasionally LoRaMac receives an ACK but AT Slave does not process/queue the received ACK before the module goes to sleep. Thus, the AT Slave must process the ACK the next time the host wakes up the module. This doesn't happen all the time but it is fairly reproducible. In the attached graph below, you can see it occurs 12 times in a period (2 hours) where there should have been 120 uplinks. This is about a 10% failure rate. |
The problem is here: mkrwan1300-fw/Drivers/BSP/Components/sx1276/sx1276.c Lines 1644 to 1649 in b1c48b2
It appears the module firmware transmits an "+ACK" anytime there is a downlink received. This is not correct. The +ACK should only be sent when there is an ACK in the frame, i.e.:
|
I was having the same issue but didn't know why. Do you have the understanding to fix it? I guess the modem should only be put into sleep mode once the ack is delivered. |
The +ACK response could be fixed with a change to the firmware. Unfortunately, the firmware is based on an outdated version of some STM firmware that isn't on GitHub. So, fixing these problems is kind of difficult because we can never get the changes into the source tree for the STM firmware. The second issue is that the development environment is based on the STM Cube IDE, which is proprietary to STM. I would suggest Arduino consider a modem firmware using the Zephyr stack. Zephyr is open source, has lots of STM support, and includes the Semtech LoRa stack. The development environment is pretty good and it is an RTOS. Most of the work I do now using the MKRWAN uses Zephyr for my application. I have created a "modem driver" for the Murata module to talk to it, and this driver overcomes many of the limitations of the existing Arduino library. |
That's great! And the change would only be in the firmware or does it need to implement a new hardware? |
As it turns out, ACKs for confirmed uplinks are not guaranteed to arrive. If an ACK is lost along the way from the server to the end device, the end device will never get the ACK because the server will never resend an ACK it has already sent. |
No change to hardware is required. |
Just to clarify an earlier comment, I think the underlying issue is that the ACK is lost (not received by the mac layer). What I am observing is that LoRaMac transmits a null payload occasionally after a failed ACK. It is not clear to me why it is doing so. Here is an example of this:
The frames are shown with the most recent at the top. You can see the ConfirmedDataUp has a payload attached. The subsequent UnconfirmedDataUp has had its payload stripped from the frame. This is what I would expect to happen when LoRaMac does an ACK retry. But I do not actually see any of the retries in the frame log, probably because Chirpstack filters them out. However, the retries should have expired after about 40 seconds, so by the time the unconfirmed frame is sent I expect that the frame would contain the payload. I do not understand why the payload is missing. |
The root cause of this data loss issue is SendFrame(). SendFrame() strips the payload from an uplink following a confirmed uplink where the ACK is not received. This happens because confirmed uplink retries are ignored by the server and the ADR algorithm for retries reduces the data rate as the retries progress. For a US915 node the data rate becomes DR0 when the ACK retry process completes when the ACK is lost. The max payload size for an uplink at DR0 is 11 bytes. So, if your payload is larger than 11 bytes, SendFrame() strips the payload. The problem with all of this is that the application is not notified that this occurred. SendFrame() returns success if it strips the payload and sends the uplink. So, there is no way for the application to know the data was lost. |
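To illustrate the kind of check that would surface this to the application, here is a minimal sketch. It is not the STM or Semtech code: `tx_status_t` and `check_uplink_fits()` are hypothetical names, and only the DR0 limit of 11 bytes comes from this thread. The other table entries are what I recall from the US915 regional parameters and should be verified before use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; not part of the STM or Semtech API. */
typedef enum {
    TX_OK = 0,
    TX_ERR_PAYLOAD_TOO_LONG,
    TX_ERR_BAD_DATARATE
} tx_status_t;

/* Max application payload in bytes for US915 uplink DR0..DR4.
 * DR0 = 11 is the figure quoted in this thread; the remaining entries
 * are assumptions taken from the regional parameters document. */
static const uint8_t us915_max_payload[] = { 11, 53, 125, 242, 242 };

#define US915_NUM_DR (sizeof(us915_max_payload) / sizeof(us915_max_payload[0]))

/* Report whether a payload fits at the data rate that will actually be
 * used for the transmission. fopts_len is the number of MAC command
 * bytes carried in FOpts, which reduce the room left for application data. */
tx_status_t check_uplink_fits(uint8_t datarate, size_t payload_len, size_t fopts_len)
{
    if (datarate >= US915_NUM_DR) {
        return TX_ERR_BAD_DATARATE;
    }
    size_t limit = us915_max_payload[datarate];
    if (fopts_len > limit || payload_len > limit - fopts_len) {
        return TX_ERR_PAYLOAD_TOO_LONG;  /* tell the caller instead of stripping */
    }
    return TX_OK;
}
```

The point is only that the oversize case returns a distinct status. A 12-byte payload at DR0 would yield `TX_ERR_PAYLOAD_TOO_LONG` rather than a silent empty uplink, provided the check runs against the data rate in effect after ADR has adjusted it.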
Hi guys, Got a bit lost in creating automation scripts for my LoRa tests. Just to confirm: the ack change in the driver should be done anyway and you suggest switching to Zephyr for future releases? In short, I would add these changes and try to fix known issues, where possible, before switching. This is just to avoid legacy issues. |
@flhofer Yes, the "+ACK" message appears to be sent anytime the end device receives a downlink. So, it's really a "+Downlink" indication. Of course, a downlink can include an ack but not always. The Semtech LoRaMac-node stack is included in the Zephyr libraries. I think the current version is 4.4.3 and someone is working on upgrading to 4.5.1 but that is not done yet. Zephyr has support for many of the STM MCU's including the stm32l0 series so I think it could be used for the modem firmware. |
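For reference, a rough sketch of how a downlink could be tested for an actual ACK before emitting "+ACK". It assumes the standard LoRaWAN 1.0.x frame layout (MHDR, DevAddr, FCtrl, FCnt, ..., MIC) with the ACK flag in bit 5 of FCtrl. The function name `downlink_carries_ack()` is hypothetical and this is not how the firmware currently does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MTYPE_UNCONFIRMED_DATA_DOWN 0x03u
#define MTYPE_CONFIRMED_DATA_DOWN   0x05u
#define FCTRL_OFFSET                5u     /* MHDR (1) + DevAddr (4) */
#define FCTRL_ACK_BIT               0x20u  /* bit 5 of FCtrl */
#define MIN_DATA_FRAME_LEN          12u    /* MHDR + FHDR (7) + MIC (4) */

/* Return true only if the raw PHYPayload is a data downlink whose
 * FCtrl field has the ACK bit set. FHDR is not encrypted, so the bit
 * can be read directly from the received bytes. */
bool downlink_carries_ack(const uint8_t *phy, size_t len)
{
    assert(phy != NULL);
    if (len < MIN_DATA_FRAME_LEN) {
        return false;
    }
    uint8_t mtype = (uint8_t)(phy[0] >> 5);
    if (mtype != MTYPE_UNCONFIRMED_DATA_DOWN && mtype != MTYPE_CONFIRMED_DATA_DOWN) {
        return false;  /* join accept, uplink types, proprietary, ... */
    }
    return (phy[FCTRL_OFFSET] & FCTRL_ACK_BIT) != 0u;
}
```

With a test like this, a downlink that only carries MAC commands or application data would not produce "+ACK", which matches the distinction between a "+Downlink" indication and a real acknowledgement.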
I checked the Firmware, and this print executes in the chip driver. However, neither the AT slave, nor the MKRWAN library consider this print. The correct way to check for an ACK is to call I think you're getting lost somewhere here. DR0 payload length should be more than 11, that seems really short. Also, I followed send and didn't find any place where the payload has been stripped. I would rather guess that the lost ack causes 8 retries (hardcoded somehow). It does say in the Anyway, it seems that more and more flaws of this fw surface. We might need to change soon. |
I just looked up the standard. The modem should resend the full data if it didn't receive the server's ack. What is strange is that it sends an empty packet. Somehow the buffer might get overwritten by the next send request or so. Or, it fails the @sslupsky Want to try to limit DR and then we see? ( |
That technique works if you want to periodically poll for the confirmed response. However, if the firmware correctly emitted an AT response, the response would facilitate better power consumption since you would only wake up when the response arrived and not unnecessarily wake up to poll the device from time to time.
The US915 DR0 payload is only 11 bytes. SendFrame() strips the response during the retry process. The retry process is guaranteed to fail by design. Thus, ADR guarantees that by the end of the retry process, the data rate is DR0. Therefore, the next uplink is guaranteed to fail if the payload is more than 11 bytes. I looked into the Semtech stack to see if I could submit a PR to fix this but this function does not exist in the Semtech stack. SendFrame() appears to be specific to the STM library. mkrwan1300-fw/Projects/Multi/Applications/LoRa/AT_Slave/src/lora.c Lines 234 to 246 in b1c48b2
Yes. These are flaws in the STM library. Since the STM library is not in a public repo making changes to the upstream is difficult if not impossible. I asked STM to make the library public on GitHub a couple years ago but they declined. So, the STM library should be ditched and replaced with the Semtech stack so that it can be properly maintained. This particular issue is the biggest impediment to progress for the MKRWAN.
Yes, this is how the retry mechanism of the STM library attempts to confirm the uplink and it is broken. Moreover, the way LoRaMac 1.0.x works, if the server responds and the device fails to receive the ACK within the RX window of the uplink, the device will NEVER receive the ACK. By design, the server will not resend a confirmed uplink response so the device stack will retry 8 times and then fail. As mentioned earlier, if ADR is enabled, at the end of the retries, the data rate can end up at DR0. |
An issue I see here is that the STM library silently drops the payload. So, the application has no knowledge that the uplink was not sent. I suggest that the firmware should, in addition to emitting the AT "+ACK" response correctly, emit an AT response that indicates whether the frame was actually sent or not. Could this be tied into the McpsIndication() event? Perhaps the lora_config struct should include the LoRaMac status and an asynchronous AT response could be sent from McpsIndication()? A set of AT commands could be created to poll the status as well. mkrwan1300-fw/Projects/Multi/Applications/LoRa/AT_Slave/src/lora.c Lines 68 to 84 in b1c48b2
mkrwan1300-fw/Projects/Multi/Applications/LoRa/AT_Slave/src/lora.c Lines 355 to 384 in b1c48b2
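To make the suggestion a bit more concrete, here is a sketch of what an asynchronous status line could look like. Everything in it is hypothetical: the "+UPLINK=" prefix, the result names, and `format_uplink_event()` do not exist in the current firmware, and the wiring into McpsIndication() or the confirm callback is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical outcome of one uplink request. */
typedef enum {
    UPLINK_SENT = 0,          /* unconfirmed frame left the radio */
    UPLINK_ACKED,             /* confirmed frame was acknowledged */
    UPLINK_ACK_TIMEOUT,       /* retries exhausted without an ACK */
    UPLINK_PAYLOAD_DROPPED    /* frame sent, but payload was stripped */
} uplink_result_t;

/* Format a hypothetical unsolicited AT line such as
 * "+UPLINK=PAYLOAD_DROPPED,42\r\n" into buf.
 * Returns the number of characters written, or -1 if the result code
 * is unknown or the buffer is too small. */
int format_uplink_event(char *buf, size_t size, uplink_result_t result, unsigned fcnt)
{
    static const char *const names[] = {
        "SENT", "ACKED", "ACK_TIMEOUT", "PAYLOAD_DROPPED"
    };
    assert(buf != NULL);
    if ((size_t)result >= sizeof names / sizeof names[0]) {
        return -1;
    }
    int n = snprintf(buf, size, "+UPLINK=%s,%u\r\n", names[result], fcnt);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

A host could then sleep until a line like this arrives instead of polling, and would see `PAYLOAD_DROPPED` explicitly rather than inferring it from a gap in the data.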
|
The STM 1.3.1 stack is much closer to the current Semtech stack which has a "LoRaMac layer handling" (LmHandler). LmHandler appears to be "Inspired by the examples provided on the en.i-cube_lrwan fork". It is not clear why it hasn't been a priority to migrate the STM firmware to be up to date with the latest STM stack. Neither is it clear why it hasn't been a priority to keep the mkrwan firmware in sync with the STM stack. The Zephyr implementation includes a layer for handling LoRaMac that looks similar to LmHandler. Thus, if we focused on migrating the mkrwan firmware to Zephyr we would have an open source repo for the stack. https://github.com/zephyrproject-rtos/zephyr/tree/main/subsys/lorawan |
As said before, the +ACK is written somewhere in the chipset driver and shouldn't actually appear (I removed that in my FW version). I tested the send ACK behavior again, and the send returns +ERR_BUSY if there has been no ack, or the payload gets too big, and an empty frame is sent. I bumped into this when testing data length changes. Anyway, the more I work with the firmware, the more I see the sloppily implemented parts, causing me to perform some "provvisorio definitivo" changes like we used to say in the old days 😉. |
@flhofer Hmmm, I do not think I am communicating the subtlety here. Yes, the firmware will respond with an error if you send a packet too large for the current DR. However, if ADR forces the DR lower after the transmission is queued, the packet is dropped silently if the DR falls below the size threshold. Take a closer look at the code snippets I referenced, it is there. This condition does not return an error. Regarding the "+ACK", that code was put in there by @facchinm . Perhaps he can provide some feedback on what, if anything, Arduino uses this particular code for. "ERR_BUSY" is sent if you attempt to ask the modem to do something while it is processing a transmission. It has nothing else to do with a confirmed packet ACK. More specifically, it is not an indication that an ack was not received. |
@sslupsky you're right; it just happened to match the confirmation. The TTN Uno board behaves like that, and I was for a moment inclined to believe this might do the same. 😁 My fault. I just finalized some changes for my test MKRs, and I must say, the library/FW interaction is quite basic, maybe too basic. In my last implementation, I check the FCU every 3 seconds (+- standard retry time, i.e., 1-second send window, 1-second RX1 window, 1-second RX2 window). The counter increases if the CTX transmission succeeded or the max retries have been reached. Then I poll Anyway, if you have a payload that exceeds the maximum length in lower data rates, those data rates should not be settable. Such a thing can be set on the server and communicated over OTAA. Another option would be to force higher data rates by monitoring it regularly. |
@flhofer I do not recall the data rate fallback process checks for a minimum data rate. |
@sslupsky On EU868 when configuring channels and bands, you set minimum and maximum data rates for each channel which then are used by the ADR procedure. If you perform an ABP join, unfortunately, the MKRWAN firmware is implemented to only use the default channels, for EU868 this is 1-3 with the default data rates 0-5. Don't know what these default settings are for US915. I use Loriot for my tests where I can set those parameters per device or device class/profile. I saw the MAC sets some default parameters
|
@flhofer Apologies, I thought I had posted this reference earlier but I think I overlooked that. Here is the section of code that performs the retransmission if a packet is not acknowledged. Every second retry, the data rate is automatically lowered to the "next lowest data rate". Further, if an acknowledgement is not received by the end device because it was "lost over the air", the LoRaWAN specification guarantees that this retry process always times out because the server will never send another acknowledgement for a frame that has already been acknowledged. So, AckTimeoutRetriesCounter always expires when the acknowledgement is lost. mkrwan1300-fw/Middlewares/Third_Party/Lora/Mac/LoRaMac.c Lines 1315 to 1351 in b1c48b2
This is the function that lowers the data rate for the US915 regional parameters: mkrwan1300-fw/Middlewares/Third_Party/Lora/Mac/region/RegionUS915-Hybrid.c Lines 69 to 82 in b1c48b2
It seems that the process of lowering the data rate does not pay attention to the server setting and only uses the minimum regional setting. |
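A small simulation of the fallback described above may help. It is a sketch based on my description ("every second retry, lower the data rate") and not a copy of LoRaMac.c. The names `next_lower_dr()` and `dr_after_lost_ack()` are made up, and the assumption is that the first transmission counts as attempt 1 and the data rate drops whenever the attempt counter becomes odd.

```c
#include <assert.h>
#include <stdint.h>

/* Next lower data rate, clamped at the regional minimum. */
static int8_t next_lower_dr(int8_t dr, int8_t min_dr)
{
    return (dr > min_dr) ? (int8_t)(dr - 1) : min_dr;
}

/* Data rate in effect once every attempt has timed out without an ACK.
 * Assumption: attempts are counted from 1 and the data rate is lowered
 * each time the counter becomes odd, i.e. on every second retry. */
int8_t dr_after_lost_ack(int8_t start_dr, int8_t min_dr, uint8_t max_attempts)
{
    assert(max_attempts >= 1u);
    int8_t dr = start_dr;
    uint8_t attempt = 1u;  /* the initial transmission */
    while (attempt < max_attempts) {
        attempt++;
        if ((attempt % 2u) == 1u) {
            dr = next_lower_dr(dr, min_dr);
        }
    }
    return dr;
}
```

Under these assumptions, 8 attempts give three downward steps, so `dr_after_lost_ack(3, 0, 8)` yields DR0, which is where the 11-byte limit starts to bite. Note that `min_dr` here is the regional minimum, not anything the server configured.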
@sslupsky yes, you're right. It seems that the US implementation does not give the server any power over this process. I also noticed recently that sometimes the FW hangs for more than a second when polling simple stuff like |
@sslupsky ok, done some tests and figured that sometimes the modem does not react at all to a command. I tried issuing periodic polls and put the timeout unreasonably high. After a timeout, the poll works perfectly. If instead, I issue an |
Hi @sslupsky Another issue I bumped into in the meantime is that 3G gateways can have very inconsistent latencies. It happens to me now that the ACK response does not reach the GW in time to transmit the response to the MKR. Could be that it is the same case for you. I found latencies varying from 300 to 2400ms; a disaster with an RX1 delay of 1 second. Florian |
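As a back-of-the-envelope check of the latency numbers above, here is a simplified sketch. It assumes RX2 opens one second after RX1, treats the latency as the total time from the end of the uplink until the gateway has the downlink ready to transmit, and ignores any scheduling margin the gateway or network server needs. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    RX_WINDOW_NONE = 0,  /* downlink arrives too late for either window */
    RX_WINDOW_1 = 1,
    RX_WINDOW_2 = 2
} rx_window_t;

/* Earliest receive window a downlink can still hit, given the total
 * latency between the end of the uplink and the moment the gateway
 * could start transmitting. RX2 is assumed to open 1000 ms after RX1. */
rx_window_t reachable_rx_window(uint32_t latency_ms, uint32_t rx1_delay_ms)
{
    assert(rx1_delay_ms > 0u);
    if (latency_ms < rx1_delay_ms) {
        return RX_WINDOW_1;
    }
    if (latency_ms < rx1_delay_ms + 1000u) {
        return RX_WINDOW_2;
    }
    return RX_WINDOW_NONE;
}
```

With an RX1 delay of 1000 ms, a 300 ms latency still reaches RX1, 1500 ms can only reach RX2, and 2400 ms misses both. That last case would look to the device exactly like a lost ACK.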
There is an occasional race condition that causes a dropped packet. The problem can be seen in the graph below.
There appears to be some problem with missing data from time to time. The graph above shows periodic gaps in the data received. The purpose of this note is to describe the observations around the missing point at 8:49.
When the application queues an uplink with an ACK request, the confirmed request is transmitted:
and Chirpstack receives the confirmed uplink and tags it at 8:48:07:
Chirpstack sends an ACK within the receive window of the confirmed uplink at 8:48:08:
But the application doesn’t receive the ACK notification from the murata modem until the next uplink is received at 8:49:08 (which corresponds to 00:21:07 in the application log below):
For some reason, a datapoint is not written to the database and a gap appears in the data.
The confirmed uplink was timestamped by chirp-stack at 8:48:07 and the corresponding downlink was timestamped as 8:48:08. The downlink contains the ACK.
Note that the next uplink at 8:49:07 does not have a payload:
This corresponds to the “missing data point” in the graph at 8:49:07.
This particular scenario repeats again at 8:53:07 and the uplink at 8:54:07 has no payload and the datapoint is missing from the graph.
When we request an unconfirmed packet at 8:49, the AT Slave stack appears to strip the payload from the request and sends an uplink without a payload. It appears to be similar to what occurs when the AT Slave “retransmits” when attempting to confirm delivery of an earlier packet. That is, the AT Slave transmits a “null” unconfirmed packet to open a receive window for the ACK. But in this case, the ACK was already received in the earlier response.
The consequence is that there is data loss because the AT Slave drops the payload from the uplink request.