
iOS 2.5 upload audit #66

MMel099 opened this issue Jun 4, 2024 · 10 comments

@MMel099 (Collaborator) commented Jun 4, 2024

Morning @biblicabeebli

Hassan has asked me to look into data uploading and if there are noticeable improvements in consistency/volume of any data streams since v2.5.

To do this, I plan to look at file uploading for all the RAs' Beiwe IDs for one month before the update (Feb 15 - March 15) and one month after (April 15 - May 15). Would it be possible to get the JSON files with the full upload histories for these users?

Studies server: Yale_Fucito_Young Adult Alcohol - Live Study

Staging server: Michelle Test Study 10.3.2023

Staging server: Zhimeng Liu - Beta Test - 2.5.24

Staging server: Jenny_Prince_Test_Study_11.30.23

Thanks so much! This is not time sensitive.

@biblicabeebli (Member)

I'm working on building these endpoints, and there's no reason I can't do uploads first.

These will be going up on staging dailyish or possibly even more frequently.

@biblicabeebli (Member) commented Jun 5, 2024

And now it exists!

This is a script that will work for the endpoint; you will need the Python requests and orjson libraries installed via pip.

from datetime import datetime
from pprint import pprint

import orjson
import requests


# make a post request to the get-participant-upload-history/v1 endpoint, including the api key,
# secret key, and participant_id as post parameters.
t1 = datetime.now()
print("Starting request at", t1, flush=True)
response = requests.post(
    "https://staging.beiwe.org/get-participant-upload-history/v1/",
    data={
        "access_key": "your key part one",
        "secret_key": "your key part two",
        "participant_id": "some participant id",
        # "omit_keys": "true",
    },
    allow_redirects=False,
)
t2 = datetime.now()
print("Request completed at", t2, "duration:", (t2 - t1).total_seconds(), "seconds")

print("http status code:", response.status_code)
assert 200 <= response.status_code < 300, f"Why is it not a 200? {response.status_code} (if it's a 301 you may have cut off the s in https)"

print("Data should be a bytes object...")
assert isinstance(response.content, bytes), f"Why is it not a bytes? {type(response.content)}"

assert response.content != b"", "buuuuuut its empty."

print("cool, cool... is it valid json?")
imported_json_response = orjson.loads(response.content)
print("json was imported! Most of these endpoints return json lists...")

if isinstance(imported_json_response, list):
    print("it is a list with", len(imported_json_response), "entries!")
    print("\nthe first entry is:")
    pprint(imported_json_response[0])
    print("\nthe last entry is:")
    pprint(imported_json_response[-1])
else:
    print("it is not a list, it is a", type(imported_json_response), "so you will have to inspect it yourself.")

@MMel099 (Collaborator, Author) commented Jun 5, 2024

This looks great! Going to go ahead and give it a try. Thanks!

@biblicabeebli (Member)

Speed is alright on staging, but when we get it onto production it is going to be S L O W and potentially a problem for database load.

I want to brainstorm ways to reduce the amount of data.

  • we could drop keys, making it a list of lists instead of a list of dicts. I might make the endpoints support this and have Mano use them.
  • we could drop the +00:00 from the datetimes, since they will always be UTC. This is an orjson library thing; if there's an option for that I will make the change, so just be aware that the format might change. (orjson is a highly accelerated JSON parser, written in Rust.)
  • applying real compression is ... hard, it just is.
  • I guess we could make optimized binary representations - yuck.
  • I guess we could cut unnecessary data out of the file names; they don't need to include the participant id (honestly we should be doing that in the database itself).
  • Am I intentionally going overboard here because I've been up for too long? Yes.
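The drop-keys idea above (first bullet) could look something like this minimal sketch. The field names here are made up for illustration and are not the endpoint's actual schema:

```python
# Sketch of the "drop keys" idea: convert a list of dicts into a compact
# list of lists plus a single shared key list. Field names are hypothetical.

def pack_rows(records: list[dict]) -> tuple[list[str], list[list]]:
    """Return (keys, rows) where each row holds values in key order."""
    keys = list(records[0].keys())
    rows = [[record[k] for k in keys] for record in records]
    return keys, rows

def unpack_rows(keys: list[str], rows: list[list]) -> list[dict]:
    """Rebuild the original list of dicts from (keys, rows)."""
    return [dict(zip(keys, row)) for row in rows]

records = [
    {"file_name": "gps/1717500000.csv", "timestamp": "2024-06-04T12:00:00Z"},
    {"file_name": "gps/1717500060.csv", "timestamp": "2024-06-04T12:01:00Z"},
]
keys, rows = pack_rows(records)
assert unpack_rows(keys, rows) == records  # lossless round trip
```

The key list is transmitted once instead of repeating every key name per record, which is where the size savings come from.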

@biblicabeebli (Member) commented Jun 6, 2024

I did make some of those changes.

  • +00:00 is now converted to Z.
  • No more microseconds in the timestamp.
  • There is now an omit_keys parameter that returns the data as a list of lists instead of a list of dicts. Item ordering matches the dict key order...... uuuuhhhh, except that those calls to pprint sort the keys, so the order is not as printed above.
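As a rough stdlib illustration of the two timestamp changes above (UTC offset rendered as Z, microseconds dropped); this just mimics the output format, it is not the server's actual code:

```python
from datetime import datetime, timezone

def compact_iso(dt: datetime) -> str:
    """Render a UTC datetime without microseconds and with a trailing Z."""
    return (
        dt.astimezone(timezone.utc)
        .replace(microsecond=0)
        .isoformat()
        .replace("+00:00", "Z")
    )

dt = datetime(2024, 6, 5, 14, 30, 45, 123456, tzinfo=timezone.utc)
print(compact_iso(dt))  # 2024-06-05T14:30:45Z
```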

@biblicabeebli (Member)

Just want to check in on this briefly, brainstorm what you will want to look at.

@MMel099 (Collaborator, Author) commented Jun 12, 2024

Here's a little something on what I have so far.

I started by looking at the upload histories of the three RAs on the staging server. Here is the history for one of the RAs; another RA showed a similar pattern. The last RA had no data collected before March, making it hard to compare. You can see visually that collection is much improved in early/mid March, which is definitely a positive sign!

[Images: upload history plots for one RA]

Next, I want to shift focus to quantifying 'coverage': something like "Beiwe has good data collection for XX hours in a day, on average". Still brainstorming what exactly this will look like, so if you have any input, let me know! One idea I had is to look at how long the gaps are between the upload times of consecutive files. The overwhelming majority of these gaps are seconds or milliseconds, indicating good Beiwe data collection.

Here I pulled one RA's upload history for May of 2024. If we consider any gap between consecutive upload times of at least one hour to be 'significant', these are all the significant gaps, with units in hours. Notice how many of the gaps are right around whole numbers, which I attribute to the heartbeat feature.

Originally, I was thinking that I could add up all these gaps and then divide by the total time in the period being considered. With the example above, there are about 70 hours of gaps, which is equivalent to about 10% of all of May. Therefore, we would conclude that "Beiwe has good data collection 90% of the time, on average".
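The gap arithmetic above can be sketched like this; the timestamps are invented for illustration, and real ones would come from the upload-history endpoint:

```python
from datetime import datetime, timedelta

# Sketch of the gap-based coverage estimate: find gaps between consecutive
# uploads, keep the 'significant' ones (>= 1 hour), and express their total
# as a fraction of the observation window.
uploads = [
    datetime(2024, 5, 1, 0, 0, 0),
    datetime(2024, 5, 1, 0, 0, 30),
    datetime(2024, 5, 1, 3, 0, 0),   # a ~3 hour 'significant' gap before this
    datetime(2024, 5, 1, 3, 1, 0),
]

gaps = [b - a for a, b in zip(uploads, uploads[1:])]
significant = [g for g in gaps if g >= timedelta(hours=1)]
gap_hours = sum(g.total_seconds() for g in significant) / 3600

window_hours = 31 * 24  # all of May
coverage = 1 - gap_hours / window_hours
print(f"{gap_hours:.2f} significant gap hours, coverage {coverage:.1%}")
```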

However, I am not convinced that these gaps are actually a good surrogate for tracking coverage - there may be long gaps that don't actually indicate worse data collection, but just less data TO collect. Would love to hear your thoughts on this.

Sorry, this reply ended up being a lot denser than I anticipated, but hoping it's clear.

@biblicabeebli (Member)

  1. Wow.
  2. I actually thought this was our internal tracking issue 😅 - but I am very much in favor of being outwardly open, especially with efforts to find flaws in the platform.

Some comments that may be useful:

  • I've done some cleanup over on New APIs - platform improvement - extremely useful overlooked datapoints beiwe-backend#354, including listing the new endpoints (this is all still just on staging and subject to change; also looking for feedback). I don't have full documentation yet, but you should be able to hit any of those endpoints with the script I posted above. The ones I mention below require the participant_id parameter.
  • the unix timestamp in the file name is the time the file was created; starting with some 2.5 beta version it should slightly precede the first data point in the file. Prior to 2.5 the app would create its files WAY in advance.
  • The timestamp in the upload history is the file upload time; this value may be completely unrelated to the data in the file.
  • get-participant-heartbeat-history/v1 may be of interest to us; it is the record of when the app checked in. It's currently ~duplicated, but configured to hit every 5 minutes. The reason I wanted to create that data stream is so we could look at things like heartbeats compared to upload times and data gathering, with upload being our ~proxy view into historical app performance (because it's what we've got).
  • get-participant-version-history/v1 may be of solid utility, so can we sanity-check that it is working as expected? (Unfortunately, v1 of recording this detail was totally broken; real enablement was pretty recent. I think it's on production already, at the very least.)
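As a minimal sketch of how the heartbeat history could be compared against expectations, assuming the 5-minute check-in interval mentioned above (the timestamps are invented for illustration):

```python
from datetime import datetime, timedelta

# Sketch: flag windows where the app went quiet, given an expected
# 5-minute heartbeat interval. Tolerance and timestamps are made up.
EXPECTED = timedelta(minutes=5)
TOLERANCE = timedelta(minutes=1)

heartbeats = [
    datetime(2024, 6, 6, 12, 0),
    datetime(2024, 6, 6, 12, 5),
    datetime(2024, 6, 6, 12, 40),  # the app was silent for 35 minutes
    datetime(2024, 6, 6, 12, 45),
]

silences = [
    (a, b) for a, b in zip(heartbeats, heartbeats[1:])
    if b - a > EXPECTED + TOLERANCE
]
for start, end in silences:
    minutes = (end - start).total_seconds() / 60
    print(f"silent for {minutes:.0f} min after {start}")
```

The same windows could then be cross-referenced against upload times to see whether silence in the heartbeat stream lines up with gaps in data collection.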

Nothing else from me for now.

@jprince127 (Collaborator) commented Jul 23, 2024

Hello!

I was also working on some file upload checks. I plotted the number of files being uploaded, binned by the hour of the day they were created (in UTC), and found that for GPS, gyro, and accelerometer there are significantly more uploads.
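The hour-of-day binning could be sketched like this; the timestamps are invented, and real ones would come from the upload-history endpoint:

```python
from collections import Counter
from datetime import datetime

# Sketch: count uploads per UTC hour of day (0-23). Timestamps are made up.
uploads = [
    datetime(2024, 5, 1, 9, 15),
    datetime(2024, 5, 1, 9, 40),
    datetime(2024, 5, 1, 14, 5),
    datetime(2024, 5, 2, 9, 2),
]

by_hour = Counter(dt.hour for dt in uploads)
for hour in sorted(by_hour):
    print(f"{hour:02d}:00  {'#' * by_hour[hour]}  ({by_hour[hour]})")
```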

Next, I'm working on something similar to Max's analysis: I'm going to bin data collection by the known "sensor on" periods (for example, for GPS we know there should be about a minute of GPS data collected, then a period of time without data) and then count the discrete number of data-collection periods.

@biblicabeebli (Member)

F*ck yeah.
