Lost ACK for a confirmed uplink causes data loss #38

Open
sslupsky opened this issue May 19, 2021 · 26 comments

Comments

@sslupsky

sslupsky commented May 19, 2021

There is an occasional race condition that causes a dropped packet. The problem can be seen in the graph below.

[Image: Screen Shot 2021-05-19 at 9 17 46 AM]

There appears to be some problem with missing data from time to time. The graph above shows periodic gaps in the data received. The purpose of this note is to describe the observations around the missing point at 8:49.

When the application queues an uplink with an ACK request, the confirmed request is transmitted:

[00:20:06.373,000] <dbg> modem_lora.uart_pipe_send: uplink queued
[00:20:06.373,000] <dbg> modem_lora.uart_pipe_send: uplink ack requested
[00:20:06.412,000] <dbg> modem_lora.lora_cmd_ok: +OK: 

and Chirpstack receives the confirmed uplink and tags it at 8:48:07:

    {
        "uplinkMetaData": {
            "rxInfo": [
                {
                    "gatewayId": "647fdafffe0057e6",
                    "time": null,
                    "timeSinceGpsEpoch": null,
                    "timestamp": 142537524,
                    "rssi": 6,
                    "loraSnr": 9.5,
                    "channel": 5,
                    "rfChain": 0,
                    "board": 0,
                    "antenna": 0,
                    "location": {
                        "latitude": 53.47134582549126,
                        "longitude": -113.55489418681104,
                        "altitude": 0,
                        "source": "UNKNOWN",
                        "accuracy": 0
                    },
                    "fineTimestampType": "NONE"
                }
            ],
            "txInfo": {
                "frequency": 903300000,
                "modulation": "LORA",
                "loRaModulationInfo": {
                    "bandwidth": 125,
                    "spreadingFactor": 7,
                    "codeRate": "4/5",
                    "polarizationInversion": false
                }
            }
        },
        "phyPayload": {
            "mhdr": {
                "mType": "ConfirmedDataUp",
                "major": "LoRaWANR1"
            },
            "macPayload": {
                "fhdr": {
                    "devAddr": "0736c2f9",
                    "fCtrl": {
                        "adr": true,
                        "adrAckReq": false,
                        "ack": false,
                        "fPending": false,
                        "classB": false
                    },
                    "fCnt": 19,
                    "fOpts": null
                },
                "fPort": 10,
                "frmPayload": [
                    {
                        "bytes": "uoUynSAj0SgtYxPj/rscV1u2n3fOX7a+d2tj7jOoRA=="
                    }
                ]
            },
            "mic": "c0e9331c"
        }
    },

Chirpstack sends an ACK within the receive window of the confirmed uplink at 8:48:08:

    {
        "downlinkMetaData": {
            "txInfo": {
                "gatewayId": "647fdafffe0057e6",
                "immediately": false,
                "timeSinceGpsEpoch": null,
                "timestamp": 143537524,
                "frequency": 926300000,
                "power": 20,
                "modulation": "LORA",
                "loraModulationInfo": {
                    "bandwidth": 500,
                    "spreadingFactor": 7,
                    "codeRate": "4/5",
                    "polarizationInversion": true
                },
                "board": 0,
                "antenna": 0
            }
        },
        "phyPayload": {
            "mhdr": {
                "mType": "UnconfirmedDataDown",
                "major": "LoRaWANR1"
            },
            "macPayload": {
                "fhdr": {
                    "devAddr": "0736c2f9",
                    "fCtrl": {
                        "adr": true,
                        "adrAckReq": false,
                        "ack": true,
                        "fPending": false,
                        "classB": false
                    },
                    "fCnt": 19,
                    "fOpts": [
                        {
                            "cid": "LinkADRReq",
                            "payload": {
                                "dataRate": 3,
                                "txPower": 6,
                                "chMask": [
                                    true,
                                    true,
                                    true,
                                    true,
                                    true,
                                    true,
                                    true,
                                    true,
                                    false,
                                    false,
                                    false,
                                    false,
                                    false,
                                    false,
                                    false,
                                    false
                                ],
                                "redundancy": {
                                    "chMaskCntl": 0,
                                    "nbRep": 1
                                }
                            }
                        }
                    ]
                },
                "fPort": null,
                "frmPayload": null
            },
            "mic": "af76998f"
        }
    },

But the application doesn’t receive the ACK notification from the murata modem until the next uplink is received at 8:49:08 (which corresponds to 00:21:07 in the application log below):

[00:20:06.373,000] <dbg> modem_lora.uart_pipe_send: uplink queued
[00:20:06.373,000] <dbg> modem_lora.uart_pipe_send: uplink ack requested
[00:20:06.412,000] <dbg> modem_lora.lora_cmd_ok: +OK: 
[00:21:06.374,000] <dbg> modem_lora.uart_pipe_send: uplink queued
[00:21:06.410,000] <dbg> modem_lora.lora_cmd_ok: +OK: 
[00:21:07.786,000] <inf> modem_lora: network ack

For some reason, a datapoint is not written to the database and a gap appears in the data.

The confirmed uplink was timestamped by chirp-stack at 8:48:07 and the corresponding downlink was timestamped as 8:48:08. The downlink contains the ACK.
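
As a cross-check on the timing, the two gateway `timestamp` fields can be subtracted. This is a minimal sketch that assumes those fields are the gateway's free-running 32-bit microsecond counter; the function name is made up for illustration.

```c
#include <stdint.h>

/*
 * Elapsed microseconds between two gateway counter readings.
 * Assumption: both values come from the same 32-bit microsecond counter.
 * Unsigned subtraction gives the right answer across a single wrap-around.
 */
uint32_t gw_elapsed_us(uint32_t uplink_tmst, uint32_t downlink_tmst)
{
    return (uint32_t)(downlink_tmst - uplink_tmst);
}
```

For the two frames above, 143537524 - 142537524 gives 1,000,000 µs, so the downlink was scheduled one second after the uplink, which is consistent with the ACK being sent in the RX1 window.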

Note that the next uplink at 8:49:07 does not have a payload:

    {
        "uplinkMetaData": {
            "rxInfo": [
                {
                    "gatewayId": "647fdafffe0057e6",
                    "time": null,
                    "timeSinceGpsEpoch": null,
                    "timestamp": 202732028,
                    "rssi": 8,
                    "loraSnr": 9.8,
                    "channel": 4,
                    "rfChain": 0,
                    "board": 0,
                    "antenna": 0,
                    "location": {
                        "latitude": 53.47134582549126,
                        "longitude": -113.55489418681104,
                        "altitude": 0,
                        "source": "UNKNOWN",
                        "accuracy": 0
                    },
                    "fineTimestampType": "NONE"
                }
            ],
            "txInfo": {
                "frequency": 903100000,
                "modulation": "LORA",
                "loRaModulationInfo": {
                    "bandwidth": 125,
                    "spreadingFactor": 10,
                    "codeRate": "4/5",
                    "polarizationInversion": false
                }
            }
        },
        "phyPayload": {
            "mhdr": {
                "mType": "UnconfirmedDataUp",
                "major": "LoRaWANR1"
            },
            "macPayload": {
                "fhdr": {
                    "devAddr": "0736c2f9",
                    "fCtrl": {
                        "adr": true,
                        "adrAckReq": false,
                        "ack": false,
                        "fPending": false,
                        "classB": false
                    },
                    "fCnt": 20,
                    "fOpts": null
                },
                "fPort": null,
                "frmPayload": null
            },
            "mic": "7ff66043"
        }
    },

This corresponds to the “missing data point” in the graph at 8:49:07.

The same scenario repeats at 8:53:07: the uplink at 8:54:07 has no payload and the datapoint is missing from the graph.

When we request an unconfirmed packet at 8:49, the AT Slave stack appears to strip the payload from the request and sends an uplink without a payload. It appears to be similar to what occurs when the AT Slave “retransmits” when attempting to confirm delivery of an earlier packet. That is, the AT Slave transmits a “null” unconfirmed packet to open a receive window for the ACK. But in this case, the ACK was already received in the earlier response.

The consequence is that there is data loss because the AT Slave drops the payload from the uplink request.

@sslupsky
Author

sslupsky commented May 19, 2021

The race condition I am referring to here is that occasionally LoRaMac receives an ACK but AT Slave does not process/queue the received ACK before the module goes to sleep. Thus, the AT Slave must process the ACK the next time the host wakes up the module.

This doesn't happen all the time but it is fairly reproducible. In the attached graph below, you can see it occurs 12 times in a period (2 hours) where there should have been 120 uplinks. This is about a 10% failure rate.

[Image: Screen Shot 2021-05-19 at 10 30 05 AM]

@sslupsky
Author

The problem is here:

if( ( RadioEvents != NULL ) && ( RadioEvents->RxDone != NULL ) )
{
    RadioEvents->RxDone( RxTxBuffer, SX1276.Settings.LoRaPacketHandler.Size, SX1276.Settings.LoRaPacketHandler.RssiValue, SX1276.Settings.LoRaPacketHandler.SnrValue );
    PRINTF( "+ACK\r" );
    //PRINTF("+RECV=");
}

It appears the module firmware transmits an "+ACK" any time a downlink is received. This is incorrect. The +ACK should only be sent when there is an ack in the frame, i.e.:

if (McpsConfirm.AckReceived == true) {
        PRINTF( "+ACK\r" );
}
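
To make the intended behaviour concrete, here is a minimal, self-contained sketch of the decision described above. `McpsConfirmSketch_t` and `rx_done_notification()` are hypothetical stand-ins, not names from the firmware; only the `AckReceived` field and the `"+ACK\r"` string come from the snippets in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the MAC confirm structure; only the field used here. */
typedef struct {
    bool AckReceived;
} McpsConfirmSketch_t;

/*
 * Returns the AT notification to emit after a downlink was received,
 * or NULL when the downlink did not carry an ACK and nothing should be sent.
 */
const char *rx_done_notification(const McpsConfirmSketch_t *confirm)
{
    if (confirm != NULL && confirm->AckReceived) {
        return "+ACK\r";
    }
    return NULL;
}
```

With this shape, a downlink that only carries MAC commands would no longer look like an acknowledgement to the host. Whether `AckReceived` is already up to date at the point where the driver prints would need to be checked in the actual firmware.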

@aalbinati

I was having the same issue but didn't know why. Do you know how to fix it? I guess the modem should only be put into sleep mode once the ack is delivered.

@sslupsky
Author

The +ACK response could be fixed with a change to the firmware. Unfortunately, the firmware is based on an outdated version of some STM firmware that isn't on GitHub. So, fixing these problems is difficult because we can never get the changes into the source tree for the STM firmware. A second issue is that the development environment is based on the STM Cube IDE, which is proprietary to STM.

I would suggest Arduino consider a modem firmware using the Zephyr stack. Zephyr is open source and has lots of STM support and has the Semtech LoRa stack. The development environment is pretty good and it is an RTOS. Most of the work I do now using the MKRWAN uses Zephyr for my application. I have created a "modem driver" for the Murata module to talk to it and this driver overcomes many of the limitations of the existing Arduino library.

@aalbinati

That's great! Would the change only be in the firmware, or would it also require new hardware?

@sslupsky
Author

As it turns out, ACKs for confirmed uplinks are really not guaranteed to arrive. If an ACK is lost along the way from the server to the end device, the end device will never get the ACK because the server will never resend an ACK it has already sent.

@sslupsky
Author

No change to hardware is required.

@aalbinati

Then I think this should be the way to go. @facchinm @flhofer what do you think?

@sslupsky
Author

Just to clarify an earlier comment, I think the underlying issue is that the ACK is lost (not received by the mac layer). What I am observing is that LoRaMac transmits a null payload occasionally after a failed ACK. It is not clear to me why it is doing so.

Here is an example of this:

    {
        "downlinkMetaData": {
            "txInfo": {
                "gatewayId": "647fdafffe0057e6",
                "immediately": false,
                "timeSinceGpsEpoch": null,
                "timestamp": 3496804708,
                "frequency": 927500000,
                "power": 20,
                "modulation": "LORA",
                "loraModulationInfo": {
                    "bandwidth": 500,
                    "spreadingFactor": 7,
                    "codeRate": "4/5",
                    "polarizationInversion": true
                },
                "board": 0,
                "antenna": 0
            }
        },
        "phyPayload": {
            "mhdr": {
                "mType": "UnconfirmedDataDown",
                "major": "LoRaWANR1"
            },
            "macPayload": {
                "fhdr": {
                    "devAddr": "06989832",
                    "fCtrl": {
                        "adr": true,
                        "adrAckReq": false,
                        "ack": true,
                        "fPending": false,
                        "classB": false
                    },
                    "fCnt": 50,
                    "fOpts": null
                },
                "fPort": null,
                "frmPayload": null
            },
            "mic": "88dcf2d6"
        }
    },
    {
        "uplinkMetaData": {
            "rxInfo": [
                {
                    "gatewayId": "647fdafffe0057e6",
                    "time": null,
                    "timeSinceGpsEpoch": null,
                    "timestamp": 3495804708,
                    "rssi": 3,
                    "loraSnr": 9.5,
                    "channel": 7,
                    "rfChain": 0,
                    "board": 0,
                    "antenna": 0,
                    "location": {
                        "latitude": 53.47134582549126,
                        "longitude": -113.55489418681104,
                        "altitude": 0,
                        "source": "UNKNOWN",
                        "accuracy": 0
                    },
                    "fineTimestampType": "NONE"
                }
            ],
            "txInfo": {
                "frequency": 903700000,
                "modulation": "LORA",
                "loRaModulationInfo": {
                    "bandwidth": 125,
                    "spreadingFactor": 7,
                    "codeRate": "4/5",
                    "polarizationInversion": false
                }
            }
        },
        "phyPayload": {
            "mhdr": {
                "mType": "ConfirmedDataUp",
                "major": "LoRaWANR1"
            },
            "macPayload": {
                "fhdr": {
                    "devAddr": "06989832",
                    "fCtrl": {
                        "adr": true,
                        "adrAckReq": false,
                        "ack": false,
                        "fPending": false,
                        "classB": false
                    },
                    "fCnt": 114,
                    "fOpts": null
                },
                "fPort": 2,
                "frmPayload": [
                    {
                        "bytes": "/ECtJ6ctzjkaQ1ZkPN31Q8U56KMPrOXvvTuU3CThhw=="
                    }
                ]
            },
            "mic": "7eb89a3d"
        }
    },

The frames are shown with the most recent at the top. You can see the ConfirmedDataUp has a payload attached. The subsequent UnconfirmedDataUp has had its payload stripped from the frame. This is what I would expect to happen when LoRaMac does an ACK retry. But I do not actually see any of the retries in the frame log, probably because Chirpstack filters them out. However, the retries should have expired after about 40 seconds, so by the time the unconfirmed frame is sent I would expect the frame to contain the payload. I do not understand why the payload is missing.

@sslupsky
Author

The root cause of this data loss issue is SendFrame().

SendFrame() strips the payload from an uplink following a confirmed uplink where the ACK is not received.

This happens because confirmed uplink retries are ignored by the server, and the ADR algorithm for retries reduces the data rate as the retries progress. For a US915 node whose ACK was lost, the data rate becomes DR0 by the time the ACK retry process completes. The max payload size for an uplink at DR0 is 11 bytes. So, if your payload is larger than 11 bytes, SendFrame() strips the payload.

The problem with all of this is that the application is not notified that this occurred. SendFrame() returns success if it strips the payload and sends the uplink.

So, there is no way for the application to know the data was lost.
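
Until the firmware reports this, an application could guard against it by checking the payload length against the current data rate before queuing an uplink. The sketch below is only an illustration: the function names are made up, and the table values are the US915 uplink application payload limits (no repeater) as I recall them from the LoRaWAN regional parameters, so verify them against the version your network uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed US915 maximum application payload in bytes for uplink DR0..DR4. */
static const uint8_t US915_MAX_PAYLOAD[] = { 11, 53, 125, 242, 242 };

/* Returns the payload limit for a data rate, or -1 for an unknown data rate. */
int us915_max_payload(int8_t dr)
{
    const int count = (int)(sizeof US915_MAX_PAYLOAD / sizeof US915_MAX_PAYLOAD[0]);
    if (dr < 0 || dr >= count) {
        return -1;
    }
    return US915_MAX_PAYLOAD[dr];
}

/* True when a payload of len bytes can be sent at the given data rate. */
bool us915_payload_fits(int8_t dr, size_t len)
{
    int max = us915_max_payload(dr);
    return max >= 0 && len <= (size_t)max;
}
```

The 44-character base64 `frmPayload` in the first uplink of this thread appears to decode to 31 bytes, which fits at DR3 but not at DR0. That is the situation in which SendFrame() drops the payload.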

sslupsky changed the title from "Race condition drops data" to "Lost ACK for a confirmed uplink causes data loss" on May 21, 2021
@flhofer

flhofer commented May 25, 2021

Hi guys,

Got a bit lost in creating automation scripts for my LoRa tests.

Just to confirm: the ack change in the driver should be done anyway and you suggest switching to Zephyr for future releases?
Anyway, I confirm that there is a null-polling procedure that is triggered for non-received ACKs. Also, I can imagine that the MKR application, being synchronous, may lose some responses until the next up/downlink processing. I already added methods to the library to check whether the last uplink got an ACK, but due to the bug discovered by @sslupsky, I think I will always get an ACK.

In short, I would add these changes and try to fix known issues, where possible, before switching. This is just to avoid legacy issues.

@sslupsky
Author

@flhofer Yes, the "+ACK" message appears to be sent anytime the end device receives a downlink. So, it's really a "+Downlink" indication. Of course, a downlink can include an ack but not always.

The Semtech LoRaMac-node stack is included in the Zephyr libraries. I think the current version is 4.4.3 and someone is working on upgrading to 4.5.1 but that is not done yet. Zephyr has support for many of the STM MCU's including the stm32l0 series so I think it could be used for the modem firmware.

@flhofer

flhofer commented Jun 4, 2021

I checked the Firmware, and this print executes in the chip driver. However, neither the AT slave, nor the MKRWAN library consider this print. The correct way to check for an ACK is to call getMsgConfirmed which directly accesses McpsConfirm.AckReceived on the Mac layer.

I think you're getting lost somewhere here. DR0 payload length should be more than 11, that seems really short. Also, I followed send and didn't find any place where the payload has been stripped. I would rather guess that the lost ack causes 8 retries (hardcoded somehow). It does say in the Lora.c of ATSlave that until successfully sent, there will be null uplinks. In short, the modem firmware is not able to piggyback, which is not foreseen by LoRa, as far as I know.

Anyway, it seems that more and more flaws of this fw surface. We might need to change soon.

@flhofer

flhofer commented Jun 4, 2021

I just looked up the standard. The modem should resend the full data if it didn't receive the server's ack. What is strange is that it sends an empty packet. Somehow the buffer might get overwritten by the next send request or so. Or, it fails the LoRaMacQueryTxPossible test, sending thus an empty frame but somehow returning true. If your payload gets bigger than the max allowable for DR0, you might never be able to transmit. You should thus not allow the ADR server to switch to DR0 in the first place. I could fix LoRaMacQueryTxPossible to return the correct status, but then in your case it would just send empty frames for an infinite amount of time. Not a solution.

@sslupsky Want to try to limit DR and then we see?

(+MSIZE is not implemented, we could also add that..)

@sslupsky
Author

sslupsky commented Jun 7, 2021

> The correct way to check for an ACK is to call getMsgConfirmed which directly accesses McpsConfirm.AckReceived on the Mac layer.

That technique works if you want to periodically poll for the confirmed response. However, if the firmware correctly emitted an AT response, the response would facilitate better power consumption since you would only wake up when the response arrived and not unnecessarily wake up to poll the device from time to time.

> I think you're getting lost somewhere here. DR0 payload length should be more than 11, that seems really short. Also, I followed send and didn't find any place where the payload has been stripped. I would rather guess that the lost ack causes 8 retries (hardcoded somehow). It does say in the Lora.c of ATSlave that until successfully sent, there will be null uplinks. In short, the modem firmware is not able to piggyback, which is not foreseen by LoRa, as far as I know.

The US915 DR0 payload is only 11 bytes. SendFrame() strips the payload during the retry process. When the ACK has been lost, the retry process is guaranteed to fail by design. Thus, ADR guarantees that by the end of the retry process, the data rate is DR0. Therefore, the next uplink is guaranteed to fail if the payload is more than 11 bytes.

I looked into the Semtech stack to see if I could submit a PR to fix this but this function does not exist in the Semtech stack. SendFrame() appears to be specific to the STM library.

static bool SendFrame( void )
{
    McpsReq_t mcpsReq;
    LoRaMacTxInfo_t txInfo;
    if( LoRaMacQueryTxPossible( AppData.BuffSize, &txInfo ) != LORAMAC_STATUS_OK )
    {
        // Send empty frame in order to flush MAC commands
        mcpsReq.Type = MCPS_UNCONFIRMED;
        mcpsReq.Req.Unconfirmed.fBuffer = NULL;
        mcpsReq.Req.Unconfirmed.fBufferSize = 0;
        mcpsReq.Req.Unconfirmed.Datarate = LoRaParamInit->TxDatarate;
    }

> Anyway, it seems that more and more flaws of this fw surface. We might need to change soon.

Yes. These are flaws in the STM library. Since the STM library is not in a public repo making changes to the upstream is difficult if not impossible. I asked STM to make the library public on GitHub a couple years ago but they declined. So, the STM library should be ditched and replaced with the Semtech stack so that it can be properly maintained. This particular issue is the biggest impediment to progress for the MKRWAN.

> I just looked up the standard. The modem should resend the full data if it didn't receive the server's ack. What is strange is that it sends an empty packet. Somehow the buffer might get overwritten by the next send request or so. Or, it fails the LoRaMacQueryTxPossible test, sending thus an empty frame but somehow returning true.

Yes, this is how the retry mechanism of the STM library attempts to confirm the uplink, and it is broken. Moreover, the way LoRaMac 1.0.x works, if the server responds and the device fails to receive the ACK within the RX window of the uplink, the device will NEVER receive the ACK. By design, the server will not resend a confirmed uplink response, so the device stack will retry 8 times and then fail. As mentioned earlier, if ADR is enabled, at the end of the retries, the data rate can end up at DR0.

@sslupsky
Author

sslupsky commented Jun 7, 2021

> If your payload gets bigger than the max allowable for DR0, you might never be able to transmit. You should thus not allow the ADR server to switch to DR0 in the first place. I could fix LoRaMacQueryTxPossible to return the correct status, but then in your case it would just send empty frames for an infinite amount of time. Not a solution.
>
> @sslupsky Want to try to limit DR and then we see?
>
> (+MSIZE is not implemented, we could also add that..)

An issue I see here is that the STM library silently drops the payload. So, the application has no knowledge that the uplink was not sent. I suggest that the firmware should, in addition to emitting the AT "+ACK" response correctly, emit an AT response that indicates the frame was actually sent or not.

Could this be tied into the McpsIndication() event? Perhaps the lora_config struct should include the LoRaMac status and an asynchronous AT response could be sent from McpsIndication()? A set of AT commands could be created to poll the status as well.

static lora_configuration_t lora_config =
{
    .otaa = ((OVER_THE_AIR_ACTIVATION == 0) ? DISABLE : ENABLE),
    .duty_cycle = DISABLE,
    .DevEui = LORAWAN_DEVICE_EUI,
    .AppEui = LORAWAN_APPLICATION_EUI,
    .AppKey = LORAWAN_APPLICATION_KEY,
    .NetworkID = LORAWAN_NETWORK_ID,
    .DevAddr = LORAWAN_DEVICE_ADDRESS,
    .NwkSKey = LORAWAN_NWKSKEY,
    .AppSKey = LORAWAN_APPSKEY,
    .Rssi = 0,
    .Snr = 0,
    .application_port = 2,
    .ReqAck = DISABLE,
    .McpsConfirm = NULL,
};

static void McpsIndication( McpsIndication_t *mcpsIndication )
{
    if( mcpsIndication->Status != LORAMAC_EVENT_INFO_STATUS_OK )
    {
        return;
    }
    switch( mcpsIndication->McpsIndication )
    {
        case MCPS_UNCONFIRMED:
        {
            set_comm_param(mcpsIndication);
            break;
        }
        case MCPS_CONFIRMED:
        {
            set_comm_param(mcpsIndication);
            break;
        }
        case MCPS_PROPRIETARY:
        {
            break;
        }
        case MCPS_MULTICAST:
        {
            break;
        }
        default:
            break;
    }
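
One way to picture the suggestion above (telling the application whether the payload actually went out) is a send routine that returns a distinct status instead of plain success. This is a rough sketch only: `send_result_t`, `send_frame_checked()`, `tx_possible()` and the `+ERR_PAYLOAD_DROPPED` string are all hypothetical and are not part of the STM firmware or the existing AT command set.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    SEND_PAYLOAD_QUEUED,  /* payload fits, the frame carries the application data */
    SEND_PAYLOAD_DROPPED  /* payload too large, only an empty frame would be sent */
} send_result_t;

/* Hypothetical stand-in for the LoRaMacQueryTxPossible() size check. */
static bool tx_possible(size_t payload_len, size_t max_len)
{
    return payload_len <= max_len;
}

/* Reports whether the payload is sent or replaced by an empty frame. */
send_result_t send_frame_checked(size_t payload_len, size_t max_len)
{
    if (!tx_possible(payload_len, max_len)) {
        /* The real firmware would queue an empty unconfirmed frame here. */
        return SEND_PAYLOAD_DROPPED;
    }
    return SEND_PAYLOAD_QUEUED;
}

/* Maps the result to an illustrative (made-up) AT response string. */
const char *send_result_to_at(send_result_t result)
{
    return (result == SEND_PAYLOAD_DROPPED) ? "+ERR_PAYLOAD_DROPPED\r" : "+OK\r";
}
```

The point of the design is only that the "empty frame" branch becomes visible to the host, so the application can keep the data and retry later instead of assuming it was delivered.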

@sslupsky
Author

sslupsky commented Jun 7, 2021

The STM 1.3.1 stack is much closer to the current Semtech stack, which has a "LoRaMac layer handling" (LmHandler). LmHandler appears to be "Inspired by the examples provided on the en.i-cube_lrwan fork". It is not clear why it hasn't been a priority to migrate the STM firmware to be up to date with the latest STM stack. Neither is it clear why it hasn't been a priority to keep the mkrwan firmware in sync with the STM stack.

The Zephyr implementation includes a layer for handling LoRaMac that looks similar to LmHandler. Thus, if we focused on migrating the mkrwan firmware to Zephyr we would have an open source repo for the stack.

https://github.com/zephyrproject-rtos/zephyr/tree/main/subsys/lorawan

@flhofer

flhofer commented Jun 21, 2021

@sslupsky

> An issue I see here is that the STM library silently drops the payload. So, the application has no knowledge that the uplink was not sent. I suggest that the firmware should, in addition to emitting the AT "+ACK" response correctly, emit an AT response that indicates the frame was actually sent or not.

As said before, the +ACK is written somewhere in the chipset driver and shouldn't actually appear (I removed that in my FW version). I tested the send ACK behavior again, and the send returns +ERR_BUSY if there has been no ack, or the payload gets too big, and an empty frame is sent. I bumped into this when testing data length changes.

Anyway, the more I work with the firmware, the more I see the sloppily implemented parts, causing me to perform some "provvisorio definitivo" changes like we used to say in the old days 😉.
It's time for a change

@sslupsky
Author

@flhofer Hmmm, I do not think I am communicating the subtlety here. Yes, the firmware will respond with an error if you send a packet too large for the current DR. However, if ADR forces the DR lower after the transmission is queued, the packet is dropped silently if the DR falls below the size threshold. Take a closer look at the code snippets I referenced, it is there. This condition does not return an error.

Regarding the "+ACK", that code was put in there by @facchinm . Perhaps he can provide some feedback on what, if anything, Arduino uses this particular code for.

"ERR_BUSY" is sent if you attempt to ask the modem to do something while it is processing a transmission. It has nothing else to do with a confirmed packet ACK. More specifically, it is not an indication that an ack was not received.

@flhofer

flhofer commented Jun 25, 2021

@sslupsky you're right; it just happened to match the confirmation. The TTN Uno board behaves like that, and I was for a moment inclined to believe this might do the same. 😁 My fault.

I just finalized some changes for my test MKR's, and I must say, the library/FW interaction is quite basic, maybe too basic. In my last implementation, I check the FCU every 3 seconds (+- standard retry time, i.e., 1-second send window, 1-second RX1 window, 1-second RX2 window). The counter increases if the CTX transmission succeeded or the max retries have been reached. Then I poll +CFS to know if it has been confirmed. It seems the only way to know if there has been a confirmation. The +ACK was there, but the library didn't really read it.

Anyway, if you have a payload that exceeds the maximum length at lower data rates, those data rates should not be settable. Such a thing can be set on the server and communicated over OTAA. Another option would be to force higher data rates by monitoring the data rate regularly.
I still didn't figure out where your ACK might get lost.

@sslupsky
Author

@flhofer I do not recall the data rate fallback process checks for a minimum data rate.

@flhofer

flhofer commented Jul 5, 2021

@sslupsky On EU868 when configuring channels and bands, you set minimum and maximum data rates for each channel which then are used by the ADR procedure. If you perform an ABP join, unfortunately, the MKRWAN firmware is implemented to only use the default channels, for EU868 this is 1-3 with the default data rates 0-5. Don't know what these default settings are for US915.
If you use OTAA instead, the server downloads channel configurations onto the device at join, setting thus also ADR limits.

I use Loriot for my tests where I can set those parameters per device or device class/profile.
[Image: Example]
Does TTN support this?

I saw the MAC sets some default parameters


/*!
 * LoRaMac maximum number of channels
 */
#define US915_MAX_NB_CHANNELS                       72

/*!
 * Minimal datarate that can be used by the node
 */
#define US915_TX_MIN_DATARATE                       DR_0

/*!
 * Maximal datarate that can be used by the node
 */
#define US915_TX_MAX_DATARATE                       DR_4

/*!
 * Minimal datarate that can be used by the node
 */
#define US915_RX_MIN_DATARATE                       DR_8

/*!
 * Maximal datarate that can be used by the node
 */
#define US915_RX_MAX_DATARATE                       DR_13

/*!
 * Default datarate used by the node
 */
#define US915_DEFAULT_DATARATE                      DR_0

@sslupsky
Author

sslupsky commented Jul 5, 2021

@flhofer Apologies, I thought I had posted this reference earlier but I think I overlooked that. Here is the section of code that performs the retransmission if a packet is not acknowledged. Every second retry, the data rate is automatically lowered to the "next lowest data rate". Further, if an acknowledgement is not received by the end device because it was "lost over the air", the LoRaWAN specification guarantees that this retry process always times out because the server will never send another acknowledgement for a frame that has already been acknowledged. So, AckTimeoutRetriesCounter always expires when the acknowledgement is lost.

if( ( AckTimeoutRetry == true ) && ( ( LoRaMacState & LORAMAC_TX_DELAYED ) == 0 ) )
{   // Retransmissions procedure for confirmed uplinks
    AckTimeoutRetry = false;
    if( ( AckTimeoutRetriesCounter < AckTimeoutRetries ) && ( AckTimeoutRetriesCounter <= MAX_ACK_RETRIES ) )
    {
        AckTimeoutRetriesCounter++;
        if( ( AckTimeoutRetriesCounter % 2 ) == 1 )
        {
            getPhy.Attribute = PHY_NEXT_LOWER_TX_DR;
            getPhy.UplinkDwellTime = LoRaMacParams.UplinkDwellTime;
            getPhy.Datarate = LoRaMacParams.ChannelsDatarate;
            phyParam = RegionGetPhyParam( LoRaMacRegion, &getPhy );
            LoRaMacParams.ChannelsDatarate = phyParam.Value;
        }
        // Try to send the frame again
        if( ScheduleTx( ) == LORAMAC_STATUS_OK )
        {
            LoRaMacFlags.Bits.MacDone = 0;
        }
        else
        {
            // The DR is not applicable for the payload size
            McpsConfirm.Status = LORAMAC_EVENT_INFO_STATUS_TX_DR_PAYLOAD_SIZE_ERROR;
            MacCommandsBufferIndex = 0;
            LoRaMacState &= ~LORAMAC_TX_RUNNING;
            NodeAckRequested = false;
            McpsConfirm.AckReceived = false;
            McpsConfirm.NbRetries = AckTimeoutRetriesCounter;
            McpsConfirm.Datarate = LoRaMacParams.ChannelsDatarate;
            if( IsUpLinkCounterFixed == false )
            {
                UpLinkCounter++;
            }
        }
    }

This is the function that lowers the data rate for the US915 regional parameters:

static int8_t GetNextLowerTxDr( int8_t dr, int8_t minDr )
{
    uint8_t nextLowerDr = 0;
    if( dr == minDr )
    {
        nextLowerDr = minDr;
    }
    else
    {
        nextLowerDr = dr - 1;
    }
    return nextLowerDr;
}

It seems that the process of lowering the data rate does not pay attention to the server setting and only uses the minimum regional setting.
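
The combined effect of the two excerpts above can be simulated in a few lines. This is a simplified sketch, not the MAC code: it assumes the retry counter starts at 1 for the first transmission and that every retry times out. The function names are illustrative.

```c
#include <stdint.h>

/* Same clamping rule as the GetNextLowerTxDr() excerpt above. */
static int8_t next_lower_tx_dr(int8_t dr, int8_t min_dr)
{
    return (dr == min_dr) ? min_dr : (int8_t)(dr - 1);
}

/*
 * Data rate left in effect after every retry of a confirmed uplink times out.
 * Assumption: the retry counter starts at 1 for the first transmission.
 */
int8_t dr_after_failed_retries(int8_t start_dr, int8_t min_dr, uint8_t max_retries)
{
    int8_t dr = start_dr;
    uint8_t counter = 1;

    while (counter < max_retries) {
        counter++;
        /* The data rate is lowered whenever the counter becomes odd. */
        if ((counter % 2) == 1) {
            dr = next_lower_tx_dr(dr, min_dr);
        }
    }
    return dr;
}
```

Under these assumptions, 8 retries lower the data rate three times, so a node that starts at DR3 with a regional minimum of DR0 ends at DR0, where the 11-byte limit applies. Passing a higher `min_dr` shows what a floor set by the server or the application would change.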

@flhofer

flhofer commented Jul 10, 2021

@sslupsky yes, you're right. It seems that the US implementation does not give the server any power over this process.
No no, the ACK for the same message is not resent, but if the modem does not get an ACK, it should resend the whole frame again. Your problem is only with DR0 for message size, and that for some reason, the ack of the retries in DR4 to DR1 are not received either.

I also noticed recently that sometimes the FW hangs for more than a second when polling simple stuff like +FCU. Wouldn't be surprised if there is a bug that hangs the thing in a loop from time to time until it gets interrupted. At the moment I wouldn't know where to look for this, but I'll let you know if I have something new.

@flhofer

flhofer commented Jul 12, 2021

@sslupsky ok, I did some tests and figured out that sometimes the modem does not react at all to a command. I tried issuing periodic polls and set the timeout unreasonably high. After a timeout, the poll works perfectly. If, instead, I issue an AT after a shorter timeout, the modem reacts immediately. It really seems there is a sort of race condition where the modem firmware disables interrupts on entering a critical section, causing the incoming interrupt for the serial to be ignored. It could really be that the FW leaves the critical section somewhere with interrupts disabled, or stays in a loop inside the critical section, leaving the modem unresponsive to the downlink or serial until the next wakeup.
Any clues or suggestions?

@flhofer

flhofer commented Aug 4, 2021

Hi @sslupsky
Just a heads up: I found a problem in the transmission of data when passing from the library to the firmware, and I fixed it in the latest versions. The problem was that the library attempted to send binary data as raw binary over an ASCII serial link. As you can imagine, special characters such as 0x00 cause problems. I changed it so that the library switches the data format and then sends hex-encoded bytes, which the FW side converts back to raw bytes, so that pure bytes are again transmitted over the air. The latest firmware now also supports a max of 242 bytes (no matter what), which is the maximum payload size permitted by LoRaWAN.
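
For readers unfamiliar with the approach, the hex translation described above could look roughly like this. It is a generic sketch, not the code from the library or the firmware, and the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes len bytes as uppercase hex. Returns characters written, or -1 if out is too small. */
int hex_encode(const uint8_t *in, size_t len, char *out, size_t out_size)
{
    static const char digits[] = "0123456789ABCDEF";
    if (out_size < 2 * len + 1) {
        return -1;
    }
    for (size_t i = 0; i < len; i++) {
        out[2 * i] = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0F];
    }
    out[2 * len] = '\0';
    return (int)(2 * len);
}

/* Converts one hex character to its value, or -1 if it is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decodes a NUL-terminated hex string. Returns bytes written, or -1 on bad input. */
int hex_decode(const char *in, uint8_t *out, size_t out_size)
{
    size_t n = 0;
    while (in[0] != '\0') {
        if (in[1] == '\0' || n >= out_size) {
            return -1; /* odd number of digits, or output buffer full */
        }
        int hi = hex_nibble(in[0]);
        int lo = hex_nibble(in[1]);
        if (hi < 0 || lo < 0) {
            return -1;
        }
        out[n++] = (uint8_t)((hi << 4) | lo);
        in += 2;
    }
    return (int)n;
}
```

Because every byte becomes two printable characters, a 0x00 in the payload travels over the serial link as the text "00" and can no longer be mistaken for a string terminator.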

Another issue I bumped into in the meantime is that 3G gateways can have very inconsistent latencies. It now happens to me that the ACK response does not reach the GW in time to be transmitted to the MKR. I found latencies varying from 300 to 2400 ms; a disaster with an RX1 delay of 1 second.
Maybe it's the same situation for you.

Florian
