How should we handle censoring / loss-of-follow-up for readmission risk prediction? #9
Comments
To make this more concrete and timely -- barring community input over the coming days, I will stick with Option 1 as our current default for this initial push.
@mmcdermott I probably mentioned this already, but sorry for not being very responsive. I'm working towards a tight deadline for a grant proposal, so I have very little time for the next ~month. I have only scanned the discussion above, but, if I understand correctly, Option 1 only includes patients if they have one or more data points >30 days from prediction time (/ discharge time). In real life we can't begin by filtering out people we won't see in the future, so making this a requirement for a benchmark seems like a bad idea. I think it makes more sense to (1) assume that if a patient returns to hospital then they are returning to the same hospital and (2) make sure that our prediction window doesn't exceed the censor date for the dataset.
Thinking about the question of how to deal with patients who die within the prediction window... doesn't this just mean that we are trying to force a non-binary classification task into a binary task?
No worries @tompollard -- whatever cycles you can spare to offer insight are appreciated!
So, obviously you are correct in that we can't filter patients by future (unseen) data in a deployment scenario. However, I disagree with the logic that this makes the task bad for a benchmark. In fact, many tasks are implicitly characterized by future data dependencies -- for example, any study on MIMIC-IV has the implicit exclusion criterion that a patient will be excluded from the task cohort if they have not and will not ever go to the ED while they remain in the dataset. I'm not suggesting this means that the property is not a problem. Instead, what I would say is problematic about these tasks is not their inclusion in a benchmark, but any subsequent use of results over these tasks to justify inappropriate deployment strategies.

In particular, in this case, when I say we should do "Option 1", I am also explicitly proposing that this task could not and should not be used in a deployment scenario without an additional predictor also being leveraged to predict whether or not the patient will be in the dataset for more than 30 days. These two tasks together give us the unconditioned probability of an "admission within the next 30 days", when that is of interest. They also give us more precise predictors of things like "is this patient likely to be in the dataset for more than 30 days?" and "presuming this patient doesn't leave the dataset, would they likely be readmitted?". I would argue that in almost all cases when restricted to binary tasks, multiple predictors will be necessary to form a complete picture of the relevant probability distributions to motivate use in deployment. I would go further and say (while I acknowledge this poses very real HCI and interpretability challenges) that this property is a good thing, because it reflects that we are making more precise predictions of simpler probabilistic outcomes, rather than broader predictions of more complex, often more poorly understood probability distributions.
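(For concreteness, one way to write out the "two predictors" argument, with $x$ the patient's data at prediction time, $R$ the 30-day readmission indicator, and $E$ the indicator of having any data after 30 days, as defined in the issue body. The decomposition is just the law of total probability; how to treat the second term is our own annotation, not something stated above:)

$$
P(R = 1 \mid x) \;=\; \underbrace{P(R = 1 \mid x, E = 1)}_{\text{Option 1 predictor}} \, P(E = 1 \mid x) \;+\; P(R = 1 \mid x, E = 0) \, P(E = 0 \mid x)
$$

Option 1 trains the first conditional; a second model can estimate $P(E = 1 \mid x)$. The last term is not directly learnable from a cohort that drops $E = 0$ patients, so it requires either a further assumption (e.g. that readmission followed by immediate loss of follow-up is rare) or a third predictor.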
Can you map this into a concrete proposal of inclusion/exclusion & label under the tabular form above, to make sure I understand?
I instead prefer to think about this from the perspective that we are breaking down a complex task into simpler binary components, but that argument is at least half just semantics.
For me this discussion emphasizes the importance of being clear about the intended goals of the benchmarks, and how we intend them to be used (not that we aren't; I'm still behind on reading up).
I assume this should be "not ever go to the ED [or ICU] while...". I agree -- for me the construction of the cohort is a major problem with MIMIC-IV (along with the confusing temporal misalignment of modules). I wouldn't want to use these existing problems as justification for creating a new one.
Maybe a side note, but is there also an upper bound (e.g. more than 30 days and less than 365 days)? Otherwise we're skewing the population towards those who were admitted early in the period spanned by the dataset.
e.g. without the upper bound on future data, we'll be lowballing the probability of being in the dataset for more than 30 days (e.g. patient has COVID? less likely to show up again).
I'll try to find the time but no promises!
So in practice you would run two separate models -- e.g. one predicting 30-day mortality and one predicting 30-day readmission -- and this would give you a cleaner estimate of both outcomes? That makes sense, I guess. It just feels odd to break down the task using a dependency on information not available at prediction time.
Couldn't agree more!
Yeah, that is a good point, but I'm not sure the best way to handle it is to add a recency constraint. I'm very open to it, but it seems like it could also introduce other unintended confounders, like focusing the model only on patients with certain diseases who are more likely to be seen more regularly.
Yes, you'd run two (or more) models. This is not as weird as it sounds (or it shouldn't be, imo) -- in IPW for causal analyses, for example, you do something similar by predicting whether the patient would receive the treatment and then using that to reweight treatment response predictions. In general, almost all tasks we care about will be on restricted cohorts, with conditions on those cohorts that we can't know in advance. E.g., predicting an abnormal lab result is conditioned on the lab being measured, predicting any generic future event is conditioned on the patient still being in the dataset in that period, predicting treatment response is conditioned on the treatment being continued for a sufficient time to observe a response, etc. From that perspective, I think the fact that binary classification models make some of these things very explicit is an advantage.
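(As an illustration only, here is a minimal sketch of what running two such models might look like; the sklearn-style `predict_proba` interface and the simple multiplication of the two probabilities are assumptions made for concreteness, not something specified in this thread:)

```python
def readmission_risk_estimates(stay_model, readmit_model, X):
    """Combine two binary predictors into a fuller picture of patient risk.

    stay_model    -- estimates P(E = 1 | x): patient still has data after 30 days
    readmit_model -- estimates P(R = 1 | x, E = 1): readmission, given follow-up
    X             -- feature matrix at prediction (discharge) time
    Both models are assumed to expose a sklearn-style predict_proba method.
    """
    p_stay = stay_model.predict_proba(X)[:, 1]
    p_readmit_given_stay = readmit_model.predict_proba(X)[:, 1]

    # Approximate unconditioned readmission risk, ignoring the contribution of
    # patients who are readmitted and then leave the dataset within 30 days.
    p_readmit = p_readmit_given_stay * p_stay
    return p_stay, p_readmit_given_stay, p_readmit
```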
TL;DR: I like the nuanced modelling of the task but am afraid that nuance will get lost in the paper, making things worse than just going for Option 3.

Long version: Similar to Tom's comment, I think several of the options outlined above are valid and our choice depends on our intended goals and how well we can explain them. For example, Option 1 (the current default) estimates $P(R = 1 \mid x, E = 1)$, the probability of readmission conditional on the patient remaining in the dataset. This alone gives us an incomplete picture, which is why the proposal is to additionally model quantities like $P(E = 1 \mid x)$ (and perhaps $P(M = 1 \mid x)$), from which the unconditioned probabilities of interest can be recovered.

We don't always need all of those additional distributions, though. To see how our underlying prediction goal changes the task, let's contrast this with a setting where we aren't actually interested in the readmission itself but instead are one of those users who just want to create a predictor for acuity as "readmission or death". Now we don't care about the conditional $P(R = 1 \mid x, E = 1)$ at all; we can target $P(R = 1 \lor M = 1 \mid x)$ directly.

Finally, if all we care about is whether the patient turns up at our doorstep (e.g., because our model is not used for medical decision making but solely for planning of staff workload), life gets even simpler and we just need $P(R = 1 \mid x)$.

So which one of the above is the right task to use for our benchmark? I personally think that either would be fine and may be appropriate in practice depending on the goal for our model. As long as we make very clear in the benchmark what we model and why, all is good.

There is a but, though. I think the discussion so far has shown that there is a lot of nuance to Option 1, including issues of both competing risks and censoring. My primary fear is that this crucial nuance may very well get lost in a benchmarking paper with multiple tasks, which is why I think we may want to consider Option 3 with the explicit assumption that any (or at least most) readmissions within 30 days are likely to the same hospital (as per Tom's suggestion). Option 1 would work best if we can show the entire modelling task including all the individual probabilities and devote a lot of space to discussing the intricacies.

I also think that this may be an even bigger problem for some of the other tasks. If we look at the MI task, the current setup estimates the probability of having an MI in the next five years conditional on the patient surviving said MI. Shown in isolation, this is a very odd task. If we are worried about users misinterpreting the predictor (which is listed as a con in both Options 2 & 3), I am not sure we are making things better by adding this complexity.
You're right @prockenschaub, I spoke too carelessly. What I meant (but did not say) was that if we predict
You're right. We should have separate issues for each task. I'm going to re-title this issue to focus specifically on readmission, and we can have more targeted discussions for other tasks in new issues. Here's a task for MI: #10
I looked into this a bit. Data on how frequently this happens is sparse, but it is not insignificant. The only study I found pegs it at around 20%, though this will obviously vary widely. Regardless, I don't think we can assume the fraction that get readmitted to a different hospital is negligible. That said, it is not clear that our criteria here can actually catch a readmission to a different hospital even when it happens, because going to a different hospital for your readmission is very different from never coming back to the health system of the original admission.
So, some overall thoughts.

Thought 1: We should (as a field, and in the long term for this benchmark, not necessarily in the immediate term) make our tasks more aligned with prospective use cases in general.

Right now, a lot of the confusion and apparent disagreement in this discussion stems, I think, from us having different views in mind of what "readmission risk prediction" means -- or, rather, what we are actually trying to predict, meaning explicitly what operationalized value of information we expect such a prediction to offer. This is a general, systemic problem in the field, I think, but one that we are well poised to try to target in the long term. This discussion is already diving more deeply into important questions and aspects of this task than I have seen in many more traditional resources in ML for health. In particular, my actionable takeaway from this is that while simplicity is extremely valuable, I think we should prioritize first providing task definitions that, insofar as we are able within the confines of ACES' syntax and the datasets we have access to, reflect in part the complexity of their use cases, and clearly document and describe why the tasks are configured the way they are. While this will make our tasks harder to understand, it will make it easier for our community to iterate on these tasks and narrow them down to the tasks that truly matter in this space. Obviously this is a balancing act, but I think simply defaulting to the simplest possible cohort would be a mistake.

Thought 2: For readmission risk specifically, a good "operationalized value proposition" we could consider trying to target is the U.S. Hospital Readmissions Reduction Program (HRRP).

The HRRP aims to reduce unplanned and avoidable readmissions by applying financial penalties to hospitals that have higher than average readmission rates for patients with certain conditions. These financial incentives are, to the best of my (admittedly relatively limited) knowledge, the main driver behind the readmission risk prediction task being of such wide interest in the U.S. in particular. Other countries also have similar readmission programs, though the exact criteria differ. Similar financial penalties do not currently exist, to the best of my knowledge, for mortality within 30 days after discharge, but other metrics do capture and penalize unplanned mortality through more indirect measures. Here is a ChatGPT summary of unknown veracity of these programs: https://chatgpt.com/share/15f740dc-ba52-47c6-a059-d5e1a8043eac

If we wanted to treat the HRRP (or analogs of the HRRP) as our guiding principle, there are two larger areas of change we should consider:

Change 1: Have "initial admission disease" based inclusion criteria and "potential readmission disease" based exclusion criteria.

Under most of these programs, a readmission is only viable for penalization if it is thought to be both unplanned and avoidable. These are often codified via the initial admission being for one of a specific set of conditions and the potential readmission not being a planned or elective one. We should consider including both of these kinds of conditions in our readmission task (or in a variant of a readmission task).

Change 2: We should adjust our exclusion criteria to mirror this prospective use case.

If we imagine that this model is used to delay discharge for patients who will have a subsequent readmission in a manner that causes a financial penalty, we can break patient populations down into a few groups to examine the possible failure modes. To characterize these groups, imagine that each patient has a known time-to-mortality as of discharge (assuming no subsequent admission) that is given by a random variable.

Thought 3: We should open GitHub Discussions, or a wiki, or something other than just issues to curate discussions on tasks.
To expand on Thought 2 above, how do those possible changes reflect in what we might want in our readmission task?

We should only count unplanned readmissions

The desire to count only admissions that are not elective in nature may reveal a limitation in ACES' configuration. @justin13601 and I will consider. If this is expressible (I do not think it currently is, but it likely would be were justin13601/ACES#54 solved), I would advocate we include it; but for now, given that I do not think it is expressible, we should consider all admissions.

Neither Option 1 nor Option 3 is best

Based on the analysis above, I think the following assignment of labels to settings would best align with the hospital use case (though it may not be expressible simply with ACES):
I think that perhaps this would be best because:
However, I don't think at first glance that this configuration is capturable in ACES without solving justin13601/ACES#54.

Given ACES' current limitations, I advocate we proceed with Option 1

This is not because I think it is the best configuration, but rather for a more tactical reason. In particular, I think it will be helpful if we have at least one task with some justifiable, but complex and slightly non-standard, inclusion/exclusion criteria, so that we can emphasize to readers and users that a big part of this benchmark is defining and improving task definitions as a community. Of the tasks currently in our set, this task has the lowest potential for dangerous errors due to misconfiguration, as it would likely be used operationally more than clinically; so of the tasks to include more complexity in, even if we're not fully sure of, or fully able to express, the best version of the criteria, this is the one we should pick. It is also the best choice tactically, because it is a common task, so if we can convincingly show that it is actually much more complex than we typically consider, that will be the most valuable for the community.

If we take that as a goal, then Option 3 is eliminated for being too simple. Option 2 is not viable within ACES, nor is the new option I proposed above. Frankly, we also don't have the right expertise to really decide which of the various options is truly "most aligned" with how hospitals are likely to care about this. Given all that, and given that our ability to produce the "right" cohort is limited until we can both get a true expert on this task to weigh in and express more complex relationships with ACES, Option 1 is the only choice that meets our needs: it is not the simplest possible version of this task, yet it is a reasonable expression of the task in a way that aligns with hospital needs.
That all being said, practically for now we should just decide and stick with it, so I'm going to post a poll on Slack between Options 1 and 3 so we can just vote and be done for the ML4H push, and we can relegate further improvements to after the benchmark. At this point I'd be fine with either option, and see clear strengths and weaknesses to both.
I agree with these points. I think if we devote some space to it in the manuscript, highlighting the complexity of even common tasks like readmission prediction is a strength. The Slack poll shows a clear preference for Option 1 anyway :)
I also like the real-world use case with HRRP - something to revisit once the initial benchmark push is done. |
For example, consider the `readmission/general_hospital/30d` task.

Tagging for comments: @prockenschaub @shalmalijoshi @justin13601 @tompollard @Jwoo5
As of the commit referenced in the link above, this task currently excludes all patients who do not have at least one data element after 30 days. That may or may not be advisable.
Let's define $x$ to be a patient's data as of a prediction time, $R$ to be the label of whether or not there is an admission event within 30 days ($R=1$ if so, $R=0$ otherwise), $E$ to be a binary variable indicating whether or not we have data after 30 days ($E=1$ if there is data after 30 days, $E=0$ otherwise), and $M$ to be a binary variable indicating whether or not the patient dies within 30 days ($M=1$ if they do, $M=0$ otherwise).
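(As one possible illustration of these definitions, here is a minimal sketch of how $R$, $E$, and $M$ might be derived per patient. The timestamp fields and their availability are assumptions made for the example, not the benchmark's actual extraction logic:)

```python
from datetime import datetime, timedelta
from typing import Optional

HORIZON = timedelta(days=30)

def derive_R_E_M(prediction_time: datetime,
                 next_admission_time: Optional[datetime],
                 death_time: Optional[datetime],
                 last_observation_time: datetime) -> tuple[int, int, int]:
    """Compute (R, E, M) for one patient relative to one prediction time."""
    horizon_end = prediction_time + HORIZON
    R = int(next_admission_time is not None and next_admission_time <= horizon_end)
    E = int(last_observation_time > horizon_end)   # any data strictly after 30 days
    M = int(death_time is not None and death_time <= horizon_end)
    return R, E, M
```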
Note that we are constrained here to a binary classification task; we can't, within the current scope, change frameworks to survival analysis or something similar.
The question here is which patients to include or exclude, and what label to assign for training, for this task. There are a few options:
Option 1: Only predict for patients with data observed more than 30 days out
In this setting, we combine the notions of "death" and "loss of follow up" and try to predict readmission only for the population of patients who will still have data after 30 days. One can use this in a clinical pipeline appropriately by also including predictors for a patient's likelihood to leave the dataset within 30 days, either due to death or lack of follow up (either jointly or via individual predictors), giving a nuanced picture of the patient's state (e.g., this patient is likely to die within 30 days, vs. this patient is likely to still have data for the full next 30 days but within that period to need a readmission).
Pros:
Cons:
Option 2: Predict on all patients where ground truth is known; omit patients where it is not.
In this setting, whenever we know a definite answer, we include the patient. If the patient is readmitted within 30 days, they get a 1. If they die within 30 days before readmission, they get a 0. If they have the full 30 days observed without a readmission, they get a 0. If they don't meet any of those criteria, they are excluded.
Pros:
Cons:
Option 3: Include all patients, assume data is complete.
In this option, we don't exclude anybody on the basis of future information (either due to death or loss of follow-up). If $R=1$, we label $y=1$. If $R=0$, we label $y=0$.
Pros:
Cons:
Option 4: ???
Other suggestions or options are welcome.
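(To summarize how the three concrete options above differ, here is a minimal sketch of their inclusion and labelling rules in terms of $(R, E, M)$. This is an illustration of the definitions as written, where `None` means the patient is excluded from the cohort; Option 2's "dies before readmission" case is approximated here as $M=1$ with $R=0$:)

```python
from typing import Optional

def option_1_label(R: int, E: int, M: int) -> Optional[int]:
    """Only predict for patients with data observed more than 30 days out."""
    return R if E == 1 else None

def option_2_label(R: int, E: int, M: int) -> Optional[int]:
    """Include whenever ground truth is known; omit patients where it is not."""
    if R == 1:
        return 1            # readmitted within 30 days
    if M == 1 or E == 1:
        return 0            # died before any readmission, or fully observed
    return None             # outcome unknown -> excluded

def option_3_label(R: int, E: int, M: int) -> Optional[int]:
    """Include all patients, assume data is complete."""
    return R
```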