
iOS 2.5 upload audit #66

MMel099 opened this issue Jun 4, 2024 · 10 comments

@MMel099 (Collaborator) commented Jun 4, 2024

Morning @biblicabeebli

Hassan has asked me to look into data uploading and if there are noticeable improvements in consistency/volume of any data streams since v2.5.

To do this, I plan to look at file uploading for all the RAs' Beiwe IDs for one month before the update (Feb 15 - March 15) and one month after (April 15 - May 15). Would it be possible to get the JSON files with the full upload histories for these users?

Studies server: Yale_Fucito_Young Adult Alcohol - Live Study

Staging server: Michelle Test Study 10.3.2023

Staging server: Zhimeng Liu - Beta Test - 2.5.24

Staging server: Jenny_Prince_Test_Study_11.30.23

Thanks so much! This is not time sensitive.

@biblicabeebli (Member)

I'm working on building these endpoints, and there's no reason I can't do uploads first.

These will be going up on staging dailyish or possibly even more frequently.

@biblicabeebli (Member) commented Jun 5, 2024

And now it exists!

This is a script that will work for the endpoint; you will need the Python requests and orjson libraries installed via pip.

from datetime import datetime
from pprint import pprint

import orjson
import requests


# make a post request to the get-participant-upload-history/v1 endpoint, including the api key,
# secret key, and participant_id as post parameters.
t1 = datetime.now()
print("Starting request at", t1, flush=True)
response = requests.post(
    "https://staging.beiwe.org/get-participant-upload-history/v1/",
    data={
        "access_key": "your key part one",
        "secret_key": "your key part two",
        "participant_id": "some participant id",
        # "omit_keys": "true",
    },
    allow_redirects=False,
)
t2 = datetime.now()
print("Request completed at", t2, "duration:", (t2 - t1).total_seconds(), "seconds")

print("http status code:", response.status_code)
assert 200 <= response.status_code < 300, f"Why is it not a 200? {response.status_code} (if it's a 301 you may have cut off the s in https)"

print("Data should be a bytes object...")
assert isinstance(response.content, bytes), f"Why is it not a bytes? {type(response.content)}"

assert response.content != b"", "buuuuuut its empty."

print("cool, cool... is it valid json?")
imported_json_response = orjson.loads(response.content)
print("json was imported! Most of these endpoints return json lists...")

if isinstance(imported_json_response, list):
    print("it is a list with", len(imported_json_response), "entries!")
    print("\nthe first entry is:")
    pprint(imported_json_response[0])
    print("\nthe last entry is:")
    pprint(imported_json_response[-1])
else:
    print("it is not a list, it is a", type(imported_json_response), "so you will have to inspect it yourself.")

@MMel099 (Collaborator, Author) commented Jun 5, 2024

This looks great! Going to go ahead and give it a try. Thanks!

@biblicabeebli (Member)

Speed is alright on staging, but when we get it onto production it is going to be S L O W and potentially a problem for database load.

I want to brainstorm ways to reduce the amount of data.

  • we could drop keys, making it a list of lists instead of a list of dicts. I might make the endpoints support this and have Mano use them.
  • we could drop the +00:00 from the datetimes, since they will always be UTC. This is an orjson library thing; if there's an option for that I will make the change, so just be aware that the format might change. (orjson is a highly accelerated JSON parser, written in Rust.)
  • applying real compression is ... hard, it just is.
  • I guess we could make optimized binary representations - yuck.
  • I guess we could cut unnecessary data out of the file names; they don't need to include the participant id (honestly we should be doing that in the database itself).
  • Am I intentionally going overboard here because I've been up for too long? Yes.
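The drop-keys idea above (first bullet) could look something like this minimal sketch. The field names here are made up for illustration and are not the endpoint's actual schema:

```python
# Sketch of the "drop keys" idea: convert a list of dicts into a compact
# list of lists plus a single shared key list. Field names are hypothetical.

def pack_rows(records: list[dict]) -> tuple[list[str], list[list]]:
    """Return (keys, rows) where each row holds values in key order."""
    keys = list(records[0].keys())
    rows = [[record[k] for k in keys] for record in records]
    return keys, rows

def unpack_rows(keys: list[str], rows: list[list]) -> list[dict]:
    """Rebuild the original list of dicts from (keys, rows)."""
    return [dict(zip(keys, row)) for row in rows]

records = [
    {"file_name": "gps/1717500000.csv", "timestamp": "2024-06-04T12:00:00Z"},
    {"file_name": "gps/1717500060.csv", "timestamp": "2024-06-04T12:01:00Z"},
]
keys, rows = pack_rows(records)
assert unpack_rows(keys, rows) == records  # lossless round trip
```

The key list is transmitted once instead of repeating every key name per record, which is where the size savings come from.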

@biblicabeebli (Member) commented Jun 6, 2024

I did make some of those changes.

  • +00:00 is now converted to Z.
  • No more microseconds in the timestamp.
  • There is now an omit_keys parameter that returns the data as a list of lists instead of a list of dicts. Item ordering matches the dict key order...... uuuuhhhh, except that those calls to pprint sort the keys, so the order is not as printed above.
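As a rough stdlib illustration of the two timestamp changes above (UTC offset rendered as Z, microseconds dropped); this just mimics the output format, it is not the server's actual code:

```python
from datetime import datetime, timezone

def compact_iso(dt: datetime) -> str:
    """Render a UTC datetime without microseconds and with a trailing Z."""
    return (
        dt.astimezone(timezone.utc)
        .replace(microsecond=0)
        .isoformat()
        .replace("+00:00", "Z")
    )

dt = datetime(2024, 6, 5, 14, 30, 45, 123456, tzinfo=timezone.utc)
print(compact_iso(dt))  # 2024-06-05T14:30:45Z
```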

@biblicabeebli (Member)

Just want to check in on this briefly, brainstorm what you will want to look at.

@MMel099 (Collaborator, Author) commented Jun 12, 2024

Here's a little something on what I have so far.

I started by looking at the upload histories of the three RAs on the staging server. Here is the history for one of the RAs; another RA showed a similar pattern. The last RA had no data collected before March, making it hard to compare. You can see visually that collection is much improved in early/mid March, which is definitely a positive sign!

[Images: upload history plots for one RA]

Next, I want to shift focus to quantifying 'coverage': something like "Beiwe has good data collection for XX hours in a day, on average". Still brainstorming what exactly this will look like, so if you have any input, let me know! One idea I had is to look at how long the gaps are between the upload times of consecutive files. The overwhelming majority of these gaps are seconds or milliseconds, indicating good Beiwe data collection.

Here I pulled one RA's upload history for May of 2024. If we consider any gap between consecutive upload times of at least one hour to be 'significant', these are all the significant gaps, with units in hours. Notice how many of the gaps are right around whole numbers, which I attribute to the heartbeat feature.

Originally, I was thinking that I could add up all these gaps and then divide by the total time in the period being considered. With the example above, there are about 70 hours of gaps, which is equivalent to about 10% of all of May. Therefore, we would conclude that "Beiwe has good data collection 90% of the time, on average".
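The gap arithmetic above can be sketched like this; the timestamps are invented for illustration, and real ones would come from the upload-history endpoint:

```python
from datetime import datetime, timedelta

# Sketch of the gap-based coverage estimate: find gaps between consecutive
# uploads, keep the 'significant' ones (>= 1 hour), and express their total
# as a fraction of the observation window.
uploads = [
    datetime(2024, 5, 1, 0, 0, 0),
    datetime(2024, 5, 1, 0, 0, 30),
    datetime(2024, 5, 1, 3, 0, 0),   # a ~3 hour 'significant' gap before this
    datetime(2024, 5, 1, 3, 1, 0),
]

gaps = [b - a for a, b in zip(uploads, uploads[1:])]
significant = [g for g in gaps if g >= timedelta(hours=1)]
gap_hours = sum(g.total_seconds() for g in significant) / 3600

window_hours = 31 * 24  # all of May
coverage = 1 - gap_hours / window_hours
print(f"{gap_hours:.2f} significant gap hours, coverage {coverage:.1%}")
```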

However, I am not convinced that these gaps are actually a good surrogate for tracking coverage - there may be long gaps that don't actually indicate worse data collection, but just less data TO collect. Would love to hear your thoughts on this.

Sorry, this reply ended up being a lot denser than I anticipated, but hoping it's clear.

@biblicabeebli (Member)

  1. Wow.
  2. I actually thought this was our internal tracking issue 😅 - but I am very much in favor of being outwardly open, especially with efforts to find flaws in the platform.

Some comments that may be useful:

  • I've done some cleanup over on New APIs - platform improvement - extremely useful overlooked datapoints beiwe-backend#354, including listing the new endpoints (this is all still just on staging and subject to change; also looking for feedback). I don't have full documentation yet, but you should be able to hit any of those endpoints with the script I posted above. The ones I mention below require the participant_id parameter.
  • the unix timestamp in the file name is the time the file was created; starting with some 2.5 beta version it should slightly precede the first data point in the file. Prior to 2.5 the app would create its files WAY in advance.
  • The timestamp in the upload history is the file upload time; this value may be completely unrelated to the data in the file.
  • get-participant-heartbeat-history/v1 may be of interest to us; it is the record of when the app checked in. It's currently ~duplicated, but configured to hit every 5 minutes. The reason I wanted to create that data stream is so we could look at things like heartbeats compared to upload times and data gathering, with upload being our ~proxy view into historical app performance (because it's what we've got).
  • get-participant-version-history/v1 may be of solid utility, so can we sanity-check that it is working as expected? (Unfortunately, v1 of recording this detail was totally broken; real enablement was pretty recent. I think it's on production already, at the very least.)
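As a minimal sketch of how the heartbeat history could be compared against expectations, assuming the 5-minute check-in interval mentioned above (the timestamps are invented for illustration):

```python
from datetime import datetime, timedelta

# Sketch: flag windows where the app went quiet, given an expected
# 5-minute heartbeat interval. Tolerance and timestamps are made up.
EXPECTED = timedelta(minutes=5)
TOLERANCE = timedelta(minutes=1)

heartbeats = [
    datetime(2024, 6, 6, 12, 0),
    datetime(2024, 6, 6, 12, 5),
    datetime(2024, 6, 6, 12, 40),  # the app was silent for 35 minutes
    datetime(2024, 6, 6, 12, 45),
]

silences = [
    (a, b) for a, b in zip(heartbeats, heartbeats[1:])
    if b - a > EXPECTED + TOLERANCE
]
for start, end in silences:
    minutes = (end - start).total_seconds() / 60
    print(f"silent for {minutes:.0f} min after {start}")
```

The same windows could then be cross-referenced against upload times to see whether silence in the heartbeat stream lines up with gaps in data collection.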

Nothing else from me for now.

@jprince127 (Collaborator) commented Jul 23, 2024

Hello!

I was also working on some file upload checks. I plotted the number of files being uploaded, binned by the hour of the day they were created (in UTC), and found that for GPS, gyro, and accelerometer there are significantly more uploads.
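The hour-of-day binning could be sketched like this; the timestamps are invented, and real ones would come from the upload-history endpoint:

```python
from collections import Counter
from datetime import datetime

# Sketch: count uploads per UTC hour of day (0-23). Timestamps are made up.
uploads = [
    datetime(2024, 5, 1, 9, 15),
    datetime(2024, 5, 1, 9, 40),
    datetime(2024, 5, 1, 14, 5),
    datetime(2024, 5, 2, 9, 2),
]

by_hour = Counter(dt.hour for dt in uploads)
for hour in sorted(by_hour):
    print(f"{hour:02d}:00  {'#' * by_hour[hour]}  ({by_hour[hour]})")
```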

Next, I'm working on something similar to Max's analysis: I'm going to bin data collection by the known "sensor on" periods (for example, for GPS we know there should be about a minute of GPS data collected, then a period of time without data) and then count the discrete number of data-collection periods.

@biblicabeebli (Member)

F*ck yeah.
