PipeOp to try repair predicting with unseen factor levels #71

Closed
berndbischl opened this issue Dec 20, 2018 · 4 comments
Labels
Priority: Medium
Status: Contrib (unprepared) (In someone's opinion, this is an issue that could be handled by a contributor with the right support)
Status: Needs Design (Needs some thought and design decisions.)
Type: New PipeOp (Issue suggests a new PipeOp)

Comments

@berndbischl
Member

Problem: quite often a learner breaks because SOME rows in a larger prediction table contain new, unseen factor levels. In that case, the predict step of the underlying learner fails completely.

see reprex here:
mlr-org/mlr3#97

This is really annoying, especially since it can be triggered by only a few observations, yet the complete prediction still fails.

The current option is the mlr3 fallback learner. That does not really help, because it then produces fallback predictions for the complete test set.

here is MAYBE a better option.

PipeOpUnseenLevels

Before we go into the learner, we can store, during training, which levels are present in each factor.

PipeOpUnseenLevels
train: task --stored-levels--> task
predict: task --stored-levels--> task

train: simply stores a list, one element per factor feature, with the seen levels.
predict: goes through all observations. For each observation where we see "unseen" levels, we create a random row by sampling from the marginals of the columns.
That is a bit hacky, but should work?
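
The train/predict behaviour described above could be sketched roughly as follows. This is a language-agnostic illustration in plain Python, not mlr3pipelines code: the class name `UnseenLevelsSketch`, its methods, and the choice to keep the raw training columns for sampling the marginals are all assumptions made for the example.

```python
import random
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


class UnseenLevelsSketch:
    """Hypothetical stand-in for the proposed PipeOp (illustration only)."""

    def __init__(self, factor_cols: List[str], seed: int = 0) -> None:
        self.factor_cols = factor_cols
        self.rng = random.Random(seed)
        self.levels: Dict[str, Set[Any]] = {}
        self.columns: Dict[str, List[Any]] = {}

    def train(self, rows: List[Row]) -> List[Row]:
        # Keep the training columns (assumption: needed to sample marginals)
        # and the set of seen levels for each factor column.
        for col in rows[0]:
            self.columns[col] = [r[col] for r in rows]
        for col in self.factor_cols:
            self.levels[col] = set(self.columns[col])
        return rows  # the data itself passes through unchanged

    def predict(self, rows: List[Row]) -> List[Row]:
        out = []
        for r in rows:
            if any(r[c] not in self.levels[c] for c in self.factor_cols):
                # Replace the whole row; each column is drawn independently
                # from its training marginal.
                r = {c: self.rng.choice(v) for c, v in self.columns.items()}
            out.append(r)
        return out


train_rows = [{"color": "red", "size": 1}, {"color": "blue", "size": 2}]
op = UnseenLevelsSketch(["color"])
op.train(train_rows)
fixed = op.predict([{"color": "red", "size": 5}, {"color": "green", "size": 7}])
```

Only the second row (with the unseen level "green") is replaced; the first row reaches the learner untouched, so a handful of problematic observations no longer breaks the whole prediction.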

mb706 added this to the far milestone Jan 30, 2019
mb706 removed this from the far range milestone Aug 19, 2019
@prockenschaub
Contributor

By "random row", do you mean that for an observation with an unseen level we sample every variable from the marginals, even those variables whose observed value was within the training sample, or that we just replace the unseen value itself with a draw from its own column marginal?

If I were to use this PipeOp, I would personally favour a more deterministic approach: either filtering out, during prediction, those observations that contain a value that wasn't seen during training (ideally with a warning), or prespecifying a category that should be used if a new value is seen (the specification could say "marginals", which would cover the second case in my question above).
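
The two deterministic variants suggested here might look like the following minimal sketch. It is plain Python for illustration only; the function `handle_unseen` and its `fallback` parameter are hypothetical names, not an existing API.

```python
import warnings
from typing import Any, Dict, List, Optional, Set

Row = Dict[str, Any]


def handle_unseen(
    rows: List[Row],
    seen: Dict[str, Set[Any]],
    fallback: Optional[Dict[str, Any]] = None,
) -> List[Row]:
    """Drop rows with unseen levels, or replace those levels if a fallback is given.

    seen:     column name -> levels observed during training
    fallback: column name -> prespecified replacement level (None means: drop rows)
    """
    out: List[Row] = []
    dropped = 0
    for row in rows:
        bad = [c for c, levels in seen.items() if row[c] not in levels]
        if not bad:
            out.append(row)
        elif fallback is None:
            dropped += 1  # variant 1: filter the observation out
        else:
            # variant 2: overwrite only the offending columns
            out.append({**row, **{c: fallback[c] for c in bad}})
    if dropped:
        warnings.warn(f"dropped {dropped} row(s) with unseen factor levels")
    return out


seen = {"color": {"red", "blue"}}
new_rows = [{"color": "red", "size": 5}, {"color": "green", "size": 7}]

# Variant 1: filter, capturing the warning so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = handle_unseen(new_rows, seen)

# Variant 2: replace the unseen level with a prespecified category.
replaced = handle_unseen(new_rows, seen, fallback={"color": "red"})
```

Note that filtering changes the number of predictions returned, so downstream code has to cope with a shorter result, whereas the replacement variant keeps one prediction per input row.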

mb706 added the Status: Needs Design, Status: Contrib (unprepared), and Type: New PipeOp labels Feb 10, 2020
@mb706
Collaborator

mb706 commented Feb 12, 2020

Note we have the POBackupLearner for something like this: #204

@pfistfl
Member

pfistfl commented Mar 30, 2020

This should be solved via fixfactors and imputation.
See e.g.
https://mlr3gallery.mlr-org.com/basics_pipelines_titanic/
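
As a rough illustration of the "fix factors, then impute" idea: unseen levels are first turned into missing values, and an imputation step then fills them in. The sketch below is plain Python and does not use the mlr3pipelines API; the helper names and the choice of mode imputation are assumptions made for the example.

```python
from collections import Counter
from typing import Any, List, Optional, Set


def fix_factor(column: List[Any], levels: Set[Any]) -> List[Optional[Any]]:
    # Step 1: any level not seen during training becomes missing (None).
    return [v if v in levels else None for v in column]


def impute_mode(column: List[Optional[Any]], train_column: List[Any]) -> List[Any]:
    # Step 2: fill missing values with the most frequent training level
    # (assumption: mode imputation; sampling would work the same way).
    mode = Counter(train_column).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


train_col = ["a", "b", "a", "c"]
levels = set(train_col)

fixed_col = fix_factor(["a", "z", "c"], levels)
imputed_col = impute_mode(fixed_col, train_col)
```

Splitting the work into two steps means only the offending cell is changed, and the imputation strategy can be swapped without touching the level check.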

berndbischl self-assigned this Mar 30, 2020
berndbischl added this to the v0.2 milestone Mar 30, 2020
@mb706
Collaborator

mb706 commented Jun 21, 2020

Together with the robustify pipeline, this is probably as good as it gets.


4 participants