Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean-up of $fit slot of fitted glmnet models #266

Merged
merged 4 commits into from
Nov 30, 2023

Conversation

hfrick
Copy link
Member

@hfrick hfrick commented Apr 26, 2023

closes #265

But: the size of the total object is now bigger?? 🤔

To compare from the issue/main:

lobstr::obj_size(glmn_fit)
#> 112.16 kB
library(censored)
#> Loading required package: parsnip
#> Loading required package: survival
lung2 <- lung[-14, ]

glmn_fit <- proportional_hazards(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit(Surv(time, status) ~ age + ph.ecog, data = lung2)

lobstr::obj_size(lung2)
#> 34.61 kB
lobstr::obj_size(glmn_fit)
#> 172.58 kB
lobstr::obj_size(glmn_fit$fit)
#> 9.54 kB
lobstr::obj_size(glmn_fit$fit$fit)
#> 0 B

Created on 2023-04-26 with reprex v2.0.2

Second thoughts

I think this change is the right call regardless of my surprise with the size because extract_fit_engine() expects the model to sit in model_fit$fit, not model_fit$fit$fit. Maybe something else also builds on this convention (but I haven't checked).

Tests are updated in tidymodels/extratests#149

With main dev version

library(censored)
#> Loading required package: parsnip
#> Loading required package: survival

lung2 <- lung[-14, ]

ph_fit <- proportional_hazards(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit(Surv(time, status) ~ age + ph.ecog, data = lung2)

extract_fit_engine(ph_fit) 
#> $fit
#> 
#> Call:  glmnet::glmnet(x = data_obj$x, y = data_obj$y, family = "cox",      weights = weights, alpha = alpha, lambda = lambda) 
#> 
#>    Df %Dev   Lambda
#> 1   0 0.00 0.225300
#> 2   1 0.21 0.205300
[snip]
#> 
#> $preproc
#> $preproc$x
#>     age ph.ecog
#> 1    74       1
#> 2    68       0
[snip]
#> 
#> $preproc$y
#>   [1]  306   455  1010+  210   883  1022+  310   361   218   166   170   654 
#>  [13]  728   567   144   613   707    61    88   301    81   624   371   394 
[snip]

Created on 2023-11-23 with reprex v2.0.2

With this PR

library(censored)
#> Loading required package: parsnip
#> Loading required package: survival

lung2 <- lung[-14, ]

ph_fit <- proportional_hazards(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit(Surv(time, status) ~ age + ph.ecog, data = lung2)

extract_fit_engine(ph_fit)
#> 
#> Call:  glmnet::glmnet(x = data_obj$x, y = data_obj$y, family = "cox",      weights = weights, alpha = alpha, lambda = lambda) 
#> 
#>    Df %Dev   Lambda
#> 1   0 0.00 0.225300
#> 2   1 0.21 0.205300
[snip]

Created on 2023-11-23 with reprex v2.0.2

@hfrick
Copy link
Member Author

hfrick commented Nov 23, 2023

revdep check job_name: "8d466dbc-c0f2-4516-a1f8-7092a1dfcc45"

glmnet-cleanup-training-data* > revdepcheck::cloud_summary()
ℹ Syncing results to revdep/cloud.noindex/8d466dbc-c0f2-4516-a1f8-7092a1dfcc45
ℹ Comparing results
✔ MachineShop 3.7.0                      ── E: 0     | W: 0     | N: 0    
✔ survex 1.2.0                           ── E: 0     | W: 0     | N: 0    

@simonpcouch
Copy link
Contributor

I would also be perplexed by an increase in object size, but I'm seeing different object sizes than you for some reason. I do indeed see a decrease in object sizes with this PR🤔

With main:

library(censored)
#> Loading required package: parsnip
#> Loading required package: survival

lung2 <- lung[-14, ]

glmn_fit <- proportional_hazards(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit(Surv(time, status) ~ age + ph.ecog, data = lung2)

lobstr::obj_size(glmn_fit)
#> 112.04 kB

lobstr::obj_size(glmn_fit$fit)
#> 63.98 kB

lobstr::obj_size(glmn_fit$fit$fit)
#> 9.49 kB

Created on 2023-11-28 with reprex v2.0.2

With this PR:

library(censored)
#> Loading required package: parsnip
#> Loading required package: survival

lung2 <- lung[-14, ]

glmn_fit <- proportional_hazards(penalty = 0.1) %>%
  set_engine("glmnet") %>%
  fit(Surv(time, status) ~ age + ph.ecog, data = lung2)

lobstr::obj_size(glmn_fit)
#> 111.13 kB

lobstr::obj_size(glmn_fit$fit)
#> 9.49 kB

lobstr::obj_size(glmn_fit$fit$fit)
#> 0 B

Created on 2023-11-28 with reprex v2.0.2

Regardless, now transitioning into sinking my teeth into the substance of the PR🏄

@simonpcouch
Copy link
Contributor

Possibly related oddities: tidymodels/stacks#117.

Copy link
Contributor

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, phew. I'm with it!

A bit of my mental map that needed adjusting for me to wrap my head around this: coxnet_train() is mostly analogous to parsnip::xgb_train() or bonsai::train_lightgbm(), except that the latter two only pre-process the inputs from parsnip and return what the engine functions return, while coxnet_train() also post-processes the engine output and returns it as a slot, $fit, in the output. It still gives its output the same class()[1] as its $fit slot, though. This is all for good reasons. :)

@hfrick hfrick merged commit 789ca07 into main Nov 30, 2023
8 checks passed
@hfrick hfrick deleted the glmnet-cleanup-training-data branch November 30, 2023 12:41
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't carry glmnet training data twice
2 participants