You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Inference of joining keys by parent seems insufficient.
Consider a case where two child datasets are joined. ADLB and ADRS have the same primary keys, "STUDYID", "USUBJID", "PARAMCD", "AVISIT".
> library(teal.data)
> library(dplyr)
>
> ADLB <- rADLB
> ADRS <- rADRS
>
> jk <- default_cdisc_join_keys["ADLB", "ADRS"]
>
> full_join(ADLB, ADRS) |> dim()
Joining with `by = join_by(STUDYID, USUBJID, SUBJID, SITEID, AGE, AGEU, SEX, RACE, ETHNIC, COUNTRY, DTHFL, INVID, INVNAM, ARM, ARMCD, ACTARM, ACTARMCD, TRT01P, TRT01A, TRT02P, TRT02A, REGION1, STRATA1,
STRATA2, BMRKR1, BMRKR2, ITTFL, SAFFL, BMEASIFL, BEP01FL, AEWITHFL, RANDDT, TRTSDTM, TRTEDTM, TRT01SDTM, TRT01EDTM, TRT02SDTM, TRT02EDTM, AP01SDTM, AP01EDTM, AP02SDTM, AP02EDTM, EOSSTT, EOTSTT, EOSDT,
EOSDY, DCSREAS, DTHDT, DTHCAUS, DTHCAT, LDDTHELD, LDDTHGR1, LSTALVDT, DTHADY, ADTHAUT, ASEQ, PARAM, PARAMCD, AVAL, ADTM, ADY, AVISIT, AVISITN)`
[1] 11600 104
> full_join(ADLB, ADRS, by = jk) |> dim()
[1] 67200 165
Warning message:
In full_join(ADLB, ADRS, by = jk) :
Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
>
Joining by default, i.e. using intersect(names(x), names(y)) correctly uses all primary keys as joining keys. Extracting a join_key_set from default_cdisc_join_keys results in a cartesian product.
sessionInfo()
> packageVersion("teal.data")
[1] '0.6.0'
Relevant log output
No response
Code of Conduct
I agree to follow this project's Code of Conduct.
Contribution Guidelines
I agree to follow this project's Contribution Guidelines.
Security Policy
I agree to follow this project's Security Policy.
The text was updated successfully, but these errors were encountered:
This is a byproduct of the incomplete specification of the CDISC datasets file and the implicit relationship inference.
The implicit relationship inference currently works:
This creates problem when the 2 datasets have additional shared primary keys, resulting in "unnecessary" duplicate rows when merging.
Possible solution
We could evolve the mechanism to:
(keep) only determine implicit relationship if 2 datasets have same keys
Infer the joining keys (foreign) keys by intersecting the primary keys of the 2 datasets
And union with shared parent keys
POC Patch
This would require just a few modifications.
diff --git a/R/join_keys-utils.R b/R/join_keys-utils.R
index 7ace0e74..6626755a 100644
--- a/R/join_keys-utils.R+++ b/R/join_keys-utils.R@@ -136,12 +136,14 @@ update_keys_given_parents <- function(x) {
common_ix_1 <- unname(keys_d1_parent) %in% unname(keys_d2_parent)
common_ix_2 <- unname(keys_d2_parent) %in% unname(keys_d1_parent)
+ pk_intersect <- intersect(jk[d1, d1], jk[d2, d2])+
# No common keys between datasets - leave empty
if (all(!common_ix_1)) next
fk <- structure(
- names(keys_d2_parent)[common_ix_2],- names = names(keys_d1_parent)[common_ix_1]+ union(pk_intersect, names(keys_d2_parent)[common_ix_2]),+ names = union(pk_intersect, names(keys_d1_parent)[common_ix_1])
)
jk[[d1]][[d2]] <- fk # mutate join key
}
Assumption on this:
joining keys names that are the same, connect two datasets
This fix/enhancement would change the behavior of joining keys and we would need to communicate the change to our users. It's not a breaking change though.
Alternative solution (manual)
Expand the cdisc_datasets.yaml file to include joining keys for additional relationships between datasets that are different from parent.
I would advise against this as it would greatly expand on that file and its complexity
What happened?
Inference of joining keys by parent seems insufficient.
Consider a case where two child datasets are joined.
ADLB
andADRS
have the same primary keys,"STUDYID", "USUBJID", "PARAMCD", "AVISIT"
.Joining by default, i.e. using
intersect(names(x), names(y))
correctly uses all primary keys as joining keys. Extracting ajoin_key_set
fromdefault_cdisc_join_keys
results in a cartesian product.sessionInfo()
Relevant log output
No response
Code of Conduct
Contribution Guidelines
Security Policy
The text was updated successfully, but these errors were encountered: