Feature/839 revision sklearn #846
Conversation
Loved the `ArrayDataset` idea, very abstract and powerful.
Hi @Ce11an! I left a couple of comments. I know it looks like a lot, but I have to say, I really like the ideas in this PR and the general style! Keep it up! 💪 🎉
```python
data: Any,
target: Any,
test_dataset: Optional[ArrayDataset] = None,
val_size: Union[float, int] = 0.2,
test_size: Optional[Union[float, int]] = None,
random_state: Optional[int] = None,
shuffle: bool = True,
stratify: bool = False,
num_workers: int = 0,
batch_size: int = 1,
pin_memory: bool = False,
drop_last: bool = False,
persistent_workers: bool = False,
```
There are a couple of issues with this:

- It suddenly looks like there is not a single reason why it should be called `SklearnDataModule`, since it doesn't really need any functionality from `scikit-learn`.
- It seems odd that `data` and `target` are supposed to be specified as `x` and `y` variables, whereas `test_dataset` has to be specified either as a fraction or as an instance of a dataset (and an `ArrayDataset` at that). I'd consider either supporting train data as a dataset or test data as multiple values.
- It makes sense to not explicitly declare arguments whose default values are the same as the base class arguments and their default values, and instead pass them in the call of `super().__init__(*args, **kwargs)` (see the sketch below).
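A minimal sketch of that last suggestion (assuming the base class accepts the common dataloader arguments, which is not guaranteed here):

```python
from typing import Any, Optional, Union

from pytorch_lightning import LightningDataModule


class SklearnDataModule(LightningDataModule):
    def __init__(
        self,
        data: Any,
        target: Any,
        val_size: Union[float, int] = 0.2,
        test_size: Optional[Union[float, int]] = None,
        *args: Any,
        **kwargs: Any,  # batch_size, num_workers, etc. are forwarded unchanged
    ) -> None:
        # dataloader defaults such as batch_size=1 live only in the base class
        super().__init__(*args, **kwargs)
        self.data, self.target = data, target
```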
```python
from pytorch_lightning import LightningDataModule
from torch import Tensor
from pytorch_lightning.utilities import exceptions
from sklearn import model_selection
```
This should definitely be guarded by `_SKLEARN_AVAILABLE`.
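Something along these lines (a sketch; the exact import path of the availability flag is an assumption):

```python
from pl_bolts.utils import _SKLEARN_AVAILABLE  # assumed location of the flag

if _SKLEARN_AVAILABLE:
    from sklearn import model_selection
else:  # pragma: no cover
    model_selection = None  # raise a helpful error later if sklearn is required
```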
```python
x_test_hold_out, y_test_holdout = x_holdout[test_i_start:], y_holdout[test_i_start:]
X, y = X[hold_out_size:], y[hold_out_size:]
```

```python
def train_dataloader(self) -> DataLoader:
    return self._data_loader(self.train_dataset, shuffle=True)
```
While shuffling a train dataset is definitely good practice, we can't force users to shuffle without the option to turn it off.
Completely understand. I will check other datamodules to see how they handle this situation.
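One likely shape for the fix (a sketch; the attribute name is an assumption):

```python
def train_dataloader(self) -> DataLoader:
    # self.shuffle is set from an __init__ argument defaulting to True,
    # so users can opt out of shuffling
    return self._data_loader(self.train_dataset, shuffle=self.shuffle)
```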
```python
return x, y
```

```python
@under_review()
class SklearnDataModule(LightningDataModule):
```
Btw, one idea for how this could again become an `sklearn` datamodule (i.e. actually use some indispensable functionality from `sklearn`) might be to try to add a `DataModule` that would automatically load an sklearn dataset (https://scikit-learn.org/stable/datasets.html). This would have to be named differently (something like `SklearnNamedDataModule`) and it's absolutely a new feature, but it's just an idea 🚀
Definitely. I am happy to do this 🚀
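For the record, such a datamodule might look roughly like this (the class name comes from the suggestion above; the loader mapping and constructor are assumptions):

```python
from pytorch_lightning import LightningDataModule
from sklearn import datasets as sk_datasets


class SklearnNamedDataModule(LightningDataModule):
    """Sketch: load a built-in scikit-learn dataset by name."""

    _LOADERS = {
        "iris": sk_datasets.load_iris,
        "wine": sk_datasets.load_wine,
        "diabetes": sk_datasets.load_diabetes,
    }

    def __init__(self, name: str) -> None:
        super().__init__()
        if name not in self._LOADERS:
            raise ValueError(f"Unknown sklearn dataset: {name!r}")
        bunch = self._LOADERS[name]()
        self.data, self.target = bunch.data, bunch.target
```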
```python
self._init_datasets(X, y, x_val, y_val, x_test, y_test)
```

```python
def _sklearn_train_test_split(self, x, y, split: Optional[Union[float, int]] = None):
```
It might make sense to raise an error from our side saying sklearn is needed. On the other hand, this is such a small amount of code that it might also make sense to provide our own implementation.
I would be happy to move away from sklearn, as only one function is used. Using `random_split` from torch is a potential option. I will have a think.
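A minimal sketch of that option (the split fractions are illustrative):

```python
from torch.utils.data import random_split

dataset = ArrayDataset(X, y)  # the dataset proposed in this PR
n = len(dataset)
n_val, n_test = int(0.2 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    dataset, [n - n_val - n_test, n_val, n_test]
)
```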
Thank you! Appreciate the feedback ⚡
Thank you @luca-medeiros @otaj for reviewing the PR 🥳 My actions from your comments are the following:
Let me know if I have missed anything. Thanks again for your feedback 😄
Hey @otaj and @luca-medeiros, I have updated the
Good stuff! I think `ArrayDataset` became much more powerful this time, and using a Bolts `Dataset` allows it to be very lean.
Hi @otaj and @luca-medeiros, I have done some thinking regarding a

In this example, I have created the
Hi @Ce11an, I really like the last iteration of
…to feature/839_revision_sklearn
```diff
@@ -19,7 +19,7 @@ def test_linear_regression_model(tmpdir):
     X = np.array([[1.0, 1], [1, 2], [2, 2], [2, 3], [3, 3], [3, 4], [4, 4], [4, 5]])
     y = np.dot(X, np.array([1.0, 2])) + 3
     y = y[:, np.newaxis]
-    loader = DataLoader(SklearnDataset(X, y), batch_size=2)
+    loader = DataLoader(ArrayDataset(X, y), batch_size=2)
```
This will need fixing 🧰
I have to say, I didn't take a look at the tests or at the `SklearnDataModule` again, as they seem not to have changed since my last viewing. It's getting there, but I'm still nitpicking a bit 😄
```python
data: ARRAYS
transform: Optional[Callable] = None

def process(self, data: ARRAYS) -> ARRAYS:
```
Why does it need an argument `data`? And if that's there and it's to serve only as a transform, then having `data` as an instance attribute doesn't make much sense. Or am I missing something?
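For context, in the design under review the attribute and the argument play different roles, roughly like this (a sketch reconstructed from the snippets in this thread, not code from the PR):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DataModel:
    data: Any  # the whole array-like; stands in for the ARRAYS alias
    transform: Optional[Callable] = None

    def process(self, sample: Any) -> Any:
        # called per item: `sample` is data[idx], not the whole array
        return self.transform(sample) if self.transform is not None else sample
```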
```python
        return len(self.data_models[0].data)

    def __getitem__(self, idx: int) -> Tuple[ARRAYS, ...]:
        return tuple(data_model.process(data_model.data[idx]) for data_model in self.data_models)
```
I don't think we want to call `process` on the data for every `__getitem__`, as this could get very expensive very fast.
I do agree with you. However, from what I have seen, every `Dataset` performs the transform for every `__getitem__`, so I was being consistent. What do you recommend going forward?
Gah, that's sad. I fully agree that keeping up with the rest of the world probably makes more sense, so consistency over anything else, but I just feel it's so inefficient.
Your call ⚡ I'll be fine either way ⚡
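For what it's worth, the eager alternative being weighed here would look roughly like this (a hypothetical sketch, not code from this PR):

```python
from typing import Tuple

from torch.utils.data import Dataset


class EagerArrayDataset(Dataset):
    """Applies process() once up front, trading memory for per-item speed."""

    def __init__(self, data_models) -> None:
        self.processed = [
            [dm.process(sample) for sample in dm.data] for dm in data_models
        ]

    def __len__(self) -> int:
        return len(self.processed[0])

    def __getitem__(self, idx: int) -> Tuple:
        return tuple(processed[idx] for processed in self.processed)
```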
```python
    Returns:
        bool: True if the sizes of the data_models are equal in the first dimension; False if not.
    """
    return all(len(data_model.data) == len(self.data_models[0].data) for data_model in self.data_models)
```
Can data processing change the shape of the data? If so, then this won't really work. If not, then it should be mentioned somewhere in the docstrings.
Good point, I will mention that in the docstring. Unless you want a check after the processing?
I think mentioning it in the docstring will suffice
Hi both, I am closing this PR and splitting it out into several different ones, as the
What does this PR do?

Part of #839. Affected modules:

- `pl_bolts.datamodules.sklearn_datamodule.SklearnDataModule`
- `pl_bolts.datamodules.sklearn_datamodule.SklearnDataset`
- `pl_bolts.datamodules.sklearn_datamodule.TensorDataset`
Summary

- Revise `sklearn_datamodule.py`. (breaking change ❗)
- Replace `SklearnDataset` with `ArrayDataset`. (breaking change ❗)
- Move `ArrayDataset` to the `datasets` module.

Instead of a `SklearnDataset`, I propose the `ArrayDataset`, which can take any array-like input, such as lists, `numpy` arrays, or `torch` tensors. On initialisation, unless the array-like input objects are already torch tensors, they are converted to torch tensors, thereby replacing the need for the `TensorDataset`.
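To make that concrete, here is a minimal sketch of the conversion behaviour described above (the real implementation in the PR may differ):

```python
from typing import Tuple

import numpy as np
import torch
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Sketch: wrap array-likes (lists, numpy arrays, tensors) as torch tensors."""

    def __init__(self, *arrays) -> None:
        # convert anything that is not already a torch tensor
        self.arrays = tuple(
            a if isinstance(a, torch.Tensor) else torch.as_tensor(np.asarray(a))
            for a in arrays
        )

    def __len__(self) -> int:
        return len(self.arrays[0])

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, ...]:
        return tuple(a[idx] for a in self.arrays)
```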
Regarding the `SklearnDataModule`, as discussed on Slack with @otaj, we can assume that if the `SklearnDataModule` is being used, then `scikit-learn` is available. Therefore, we can take advantage of the `train_test_split` function to split the input data into train, validation, and test `ArrayDataset`s for the `DataLoader`s.

As there are a few breaking changes and new implementations of previous features, I have yet to write unit tests or correct the documentation. However, I intend to! I wanted to create a draft PR and have a discussion on whether this PR is reasonable. Please provide feedback 😃
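For concreteness, the split logic described above might look roughly like this (fractions and variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

# carve out a validation set, then a test set, from the raw arrays
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.1, random_state=42)

train_ds = ArrayDataset(x_train, y_train)
val_ds = ArrayDataset(x_val, y_val)
test_ds = ArrayDataset(x_test, y_test)
```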
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Of course! 🥳