Dealing with pathogen nondetects #356

Closed
cristinamullin opened this issue Nov 13, 2023 · 9 comments

@cristinamullin
Collaborator

Is your feature request related to a problem? Please describe.

TADA's simple censored data function does not allow for characteristic-specific non-detect handling, e.g., for bacteria. Regardless of the method chosen and applied to all other characteristics, should bacteria (e.g., E. coli) always default to a whole number close to zero, such as 1-2 Colony Forming Units (CFU) or Most Probable Number (MPN) per 100 mL? Does applying a single method across all characteristics cause issues for any characteristics or characteristic groups other than bacteria?

Here are the notes from the meeting in Jan 2021 where we discussed nondetect handling methods:
Subgroup A
IssuePaper_RetrievalQAQC_Jan2021.docx

There is some more information in the TADA Master List of Requirements (starting on pg 12).
“Substituted values can include the reporting limit or a very low pathogen count (1 to 2 Colony Forming Units (CFU) or Most Probable Number (MPN) per 100 mL). A concentration of 0 CFU or MPN per 100 mL is not typically used because the summary statistic used for pathogens (geometric mean) cannot be calculated when the dataset includes zeros.”
TADA Master List of Requirements.docx
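
To illustrate why a zero count breaks the geometric mean, here is a minimal base R sketch. The values are made up for illustration, and `geo_mean` is a hypothetical helper, not a TADA function.

```r
# Hypothetical helper: geometric mean via the mean of logs
geo_mean <- function(x) exp(mean(log(x)))

counts_with_zero <- c(0, 10, 100)   # illustrative values, MPN/100 mL
counts_substituted <- c(1, 10, 100) # nondetect replaced with 1

geo_mean(counts_with_zero)   # log(0) is -Inf, so the result collapses to 0
geo_mean(counts_substituted) # approximately 10
```

A single zero forces the result to 0 no matter how high the other counts are, which is why a small positive substitute is used instead.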

@hillarymarler
Collaborator

Are you thinking this should be a separate function or an additional argument in the simple censored data function? I'm interested in working on this issue.

hillarymarler self-assigned this Dec 6, 2023
@cristinamullin
Collaborator Author

Great! I think it should go into the simple censored data function. However, this would be the only exception to the way all other censored data is handled in that function, so we will need good documentation of this nuance for bacteria in both the Shiny app and the function itself.

@hillarymarler
Collaborator

Working on this using Fecal Coliform as an example. I noticed that "MPN/100ML" and "MPN/100 ML" are both listed in TADA.ResultMeasure.MeasureUnitCode. This does not affect adding a simple censor option for pathogens, but it may be something to keep in mind.

@cristinamullin
Collaborator Author

cristinamullin commented Dec 8, 2023 via email

@hillarymarler
Collaborator

Here is the Fecal Coliform data set I downloaded for testing, where I observed both "MPN/100ML" and "MPN/100 ML" as units. I'll take a look at the TADA_ConvertResultUnits function and see if I can standardize it to the WQX version.

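```r
library(TADA)
library(dplyr) # for %>%

# Download Fecal Coliform results for summer 2017
test <- TADA_DataRetrieval(characteristicName = "Fecal Coliform",
                           startDate = "2017-05-01",
                           endDate = "2017-09-01")

# List the distinct characteristic/unit combinations
test_fecalcol_units <- test %>%
  dplyr::select(TADA.CharacteristicName, TADA.ResultMeasure.MeasureUnitCode) %>%
  unique()
```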
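
One way to spot and collapse the two spellings is to strip whitespace from the unit code before comparing. This is only a sketch on a made-up vector: `normalize_unit` is a hypothetical helper, and the actual fix may belong in the unit reference table rather than in string handling.

```r
# Hypothetical helper: upper-case the unit code and remove all whitespace
normalize_unit <- function(u) gsub("\\s+", "", toupper(u))

units <- c("MPN/100ML", "MPN/100 ML") # illustrative values

unique(units)                 # two distinct spellings
unique(normalize_unit(units)) # a single spelling, "MPN/100ML"
```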

I took a closer look into the "Coerced to NA" Fecal Coliform results that I mentioned during the Team Meeting. Specifically, I looked at those from the sample data set.

```r
# Keep only the results whose values were coerced to NA
test_coerce_na <- subset(test, TADA.ResultMeasureValueDataTypes.Flag == "Coerced to NA")
```

I found that all had the same OrganizationFormalName, "Maryland Dept. of the Environment Shellfish Data". The original ResultMeasureValue for all of these entries was "." and they were coerced to NA in TADA_ConvertSpecialChars() but are flagged as "Uncensored" in TADA.CensoredData.Flag.

Currently, I have the "Coerced to NA" pathogen results also receiving a TADA.ResultMeasureValue of 1 (default) or the user-specified value. But I can remove this if it is confusing or incorrect to have a censored data function acting on non-censored data.

Additionally, the test data set includes TADA.ResultMeasureValue entries of -99, -1, and 0. These will also cause problems when calculating geometric means. I can add these cases (a ResultMeasureValue less than or equal to 0) to the function too, but this is another instance of the censored data function acting on non-censored data.
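
A quick way to see how many results would break a geometric mean is to classify each value. The sketch below uses a made-up vector in place of the real result column, and `classify_value` is a hypothetical helper.

```r
# Hypothetical helper: label values that a geometric mean cannot use
classify_value <- function(x) {
  ifelse(is.na(x), "NA",
         ifelse(x < 0, "negative",
                ifelse(x == 0, "zero", "positive")))
}

values <- c(-99, -1, 0, NA, 23, 110) # illustrative values

# Count how many results fall into each category
table(classify_value(values))
```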

What are your thoughts? Should I continue developing this as part of the TADA_SimpleCensoredMethods function? Or would it make more sense to create a separate function, given that some non-censored data will also require substitute values before users can calculate geometric means?

@cristinamullin
Collaborator Author

Excellent! Let me know if you run into any questions or issues with the "MPN/100 ML" to "MPN/100ML" unit synonym conversion.

Regarding the issue with Maryland data with the result value "." being coerced to NA... I looked at the data and don't see any metadata included that we could use to identify that data as censored, or that indicated how the result value "." should be interpreted. Currently, all non-numeric results like this that cannot be easily interpreted or wrangled end up getting removed.

My initial thought is that the censored data functions should only handle the data we are able to ID as censored based on the metadata provided in the Detection Limit Type Name, Result Detection Condition Text, and possibly the MeasureQualifierCode (the other issue you're working on). I don't think the censored data function should act on non-censored data.

Would it make more sense to include substitutions for negative values within a new geometric mean calculator function?

@hillarymarler
Collaborator

I like the idea of a geometric mean calculator function. That would be an appropriate place to substitute negative or zero values.

Do you think TADA_SimpleCensoredMethods should handle substitutions for non-detect pathogen data, with additional substitutions for non-censored data in the geometric mean function? Or should all substitutions happen in the geometric mean function?

@cristinamullin
Collaborator Author

Yes, I think TADA_SimpleCensoredMethods should handle substitutions for non-detect pathogen data, with additional substitutions for non-censored data in the geometric mean function.
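
As a rough sketch of this split, a geometric mean helper could substitute non-positive values itself and leave censored data handling to the censored data function. The name `geo_mean_with_substitution`, its arguments, and the default of 1 are assumptions for illustration, not an existing TADA function.

```r
# Hypothetical sketch, not part of TADA
geo_mean_with_substitution <- function(x, substitute_value = 1) {
  stopifnot(is.numeric(x), substitute_value > 0)
  x <- x[!is.na(x)]             # drop missing results rather than guess
  x[x <= 0] <- substitute_value # replace zero and negative values
  if (length(x) == 0) return(NA_real_)
  exp(mean(log(x)))
}

# Illustrative values: -99 and 0 become 1, NA is dropped
geo_mean_with_substitution(c(-99, 0, 10, 100, NA))
```

Dropping NA values instead of substituting them matches the point above that results with no interpretable metadata should not be treated as censored.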

I think most states do calculate a geometric mean for bacteria, but many do not (https://usepa.sharepoint.com/:x:/r/sites/WQPDataAssessmentTeam/_layouts/15/Doc.aspx?sourcedoc=%7B49ACC717-A45E-41F0-9F3F-00BAE8743D81%7D&file=Summary%20of%20CALMs.xlsx&action=default&mobileredirect=true).

I recommend separating these two because the censored data substitutions should be a prerequisite for any assessment methodology (any criteria/methodology), whereas the substitutions for negative values may only be needed if the assessment method includes calculating the geomean. Some E. coli, enterococci, and fecal coliform assessment criteria and methods simply use a "not to exceed x percent of the time" methodology (e.g., AZ: fecal coliform not to exceed 400 cfu/100 mL more than 25% of the time from May-Sept, with a minimum of 8 samples evenly distributed across the season). In that case, the TADA user would still need to run the censored data substitutions function, but wouldn't need the negative value substitutions from the geometric mean function.
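
The Arizona-style rule can be sketched as a simple exceedance fraction. The threshold (400), allowed fraction (25%), and minimum sample count (8) come from the example above; the function name and the made-up data are assumptions, and the sketch ignores the requirement that samples be evenly distributed across the season.

```r
# Hypothetical sketch of a "not to exceed x percent of the time" check
exceeds_too_often <- function(x, threshold = 400, max_fraction = 0.25,
                              min_samples = 8) {
  x <- x[!is.na(x)]
  if (length(x) < min_samples) return(NA) # too few samples to assess
  mean(x > threshold) > max_fraction
}

season <- c(1, 20, 35, 410, 90, 650, 15, 1, 500, 60) # illustrative cfu/100 mL
exceeds_too_often(season) # 3 of 10 results exceed 400, so TRUE
```

Nondetects substituted with a low value such as 1 simply count as non-exceedances here, so no geometric mean substitutions are needed.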

@hillarymarler
Collaborator

I updated TADA_SimpleCensoredMethods to include only non-detect pathogen data. I think the next step is to decide, with the work group, which characteristics to include.
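
For discussion, the pathogen exception might look something like the sketch below. It is not the code in the censored data function: the function name, the `pathogens` argument, the flag value "Non-Detect", and the toy data frame are all assumptions. Only the column names come from this thread.

```r
# Hypothetical sketch: substitute a fixed value for nondetect pathogen results
substitute_pathogen_nondetects <- function(df, pathogens, value = 1) {
  is_target <- df$TADA.CharacteristicName %in% pathogens &
    df$TADA.CensoredData.Flag == "Non-Detect" # assumed flag value
  df$TADA.ResultMeasureValue[which(is_target)] <- value
  df
}

# Toy data, not real WQP results
toy <- data.frame(
  TADA.CharacteristicName = c("FECAL COLIFORM", "FECAL COLIFORM", "NITRATE"),
  TADA.CensoredData.Flag = c("Non-Detect", "Uncensored", "Non-Detect"),
  TADA.ResultMeasureValue = c(NA, 240, NA)
)

# Only the first row (a pathogen nondetect) receives the substitute value
substitute_pathogen_nondetects(toy, pathogens = "FECAL COLIFORM")
```

Passing the characteristic list as an argument keeps the open question of which characteristics to include outside the function body.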
