
Feature/839 revision sklearn #846

Closed

Conversation

Contributor

@Ce11an commented Jul 31, 2022

What does this PR do?

Part of #839

  • pl_bolts.datamodules.sklearn_datamodule.SklearnDataModule
  • pl_bolts.datamodules.sklearn_datamodule.SklearnDataset
  • pl_bolts.datamodules.sklearn_datamodule.TensorDataset

Summary

  • Refactored sklearn_datamodule.py. (breaking change ❗ )
  • Replaced SklearnDataset with ArrayDataset. (breaking change ❗ )
  • Moved ArrayDataset to datasets module.

Instead of a SklearnDataset, I propose an ArrayDataset that can take any array-like input, such as lists, numpy arrays, or torch tensors. On initialisation, any array-like inputs that are not already torch tensors are converted to torch tensors, removing the need for the TensorDataset.
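
For illustration, a minimal sketch of the usage as first proposed here (the ArrayDataset(X, y) call shape matches the updated test further down; the later DataModel iteration in this thread changes this):

# illustrative only: assumes the initial ArrayDataset(X, y) call shape,
# where any non-tensor array-like input is converted to a torch tensor
import numpy as np
from torch.utils.data import DataLoader

from pl_bolts.datasets import ArrayDataset  # module location proposed in this PR

X = np.random.rand(100, 3)    # features as a numpy array
y = list(range(100))          # targets as a plain Python list
dataset = ArrayDataset(X, y)  # both inputs become torch tensors
loader = DataLoader(dataset, batch_size=16)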

Regarding the SklearnDataModule: as discussed on Slack with @otaj, we can assume that if the SklearnDataModule is being used, scikit-learn is available. We can therefore take advantage of the train_test_split function to split the input data into train, validation, and test ArrayDatasets for the DataLoaders.
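
A rough sketch of the splitting idea (scikit-learn is assumed available whenever SklearnDataModule is used, per the point above):

# sketch: carve out a validation split with train_test_split; a second
# call on the remainder could carve out a test split the same way
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)
target = np.arange(10)

x_train, x_val, y_train, y_val = train_test_split(
    data, target, test_size=0.2, random_state=42, shuffle=True
)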

As there are a few breaking changes and reimplementations of previous features, I have not yet written unit tests or corrected the documentation. However, I intend to! I wanted to create a draft PR and have a discussion on whether this PR is reasonable. Please provide feedback 😃

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Of course! 🥳

@github-actions bot added the datamodule label Jul 31, 2022
@otaj mentioned this pull request Aug 2, 2022
Contributor

@luca-medeiros left a comment


Loved the ArrayDataset idea, very abstract and powerful.

Contributor

@otaj left a comment


Hi @Ce11an! I left a couple of comments. I know it looks like a lot, but I have to say, I really like the ideas in this PR and the general style! Keep it up! 💪 🎉

Comment on lines +56 to +68
data: Any,
target: Any,
test_dataset: Optional[ArrayDataset] = None,
val_size: Union[float, int] = 0.2,
test_size: Optional[Union[float, int]] = None,
random_state: Optional[int] = None,
shuffle: bool = True,
stratify: bool = False,
num_workers: int = 0,
batch_size: int = 1,
pin_memory: bool = False,
drop_last: bool = False,
persistent_workers: bool = False,
Contributor


There are a couple of issues with this.

  1. It suddenly looks like there is not a single reason why it should be called SklearnDataModule, since it doesn't really need any functionality from scikit-learn.
  2. It seems odd that data and target are supposed to be specified as x and y variables, whereas test_dataset has to be specified either as a fraction or as an instance of a dataset (and an ArrayDataset at that). I'd consider either supporting train data as a dataset or test data as multiple values.
  3. It makes sense not to explicitly declare arguments whose default values are the same as the base class's - instead, pass them in the call to super().__init__(*args, **kwargs). (See the sketch below.)
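
A generic illustration of point 3 (BaseDataModule and its arguments here are hypothetical stand-ins, not pl_bolts' actual base class):

class BaseDataModule:
    # hypothetical base class that owns the loader defaults
    def __init__(self, batch_size: int = 1, num_workers: int = 0) -> None:
        self.batch_size = batch_size
        self.num_workers = num_workers


class SklearnDataModule(BaseDataModule):
    def __init__(self, data, target, val_size: float = 0.2, *args, **kwargs) -> None:
        # forward loader options instead of redeclaring their defaults here
        super().__init__(*args, **kwargs)
        self.data = data
        self.target = target
        self.val_size = val_size

This keeps the subclass signature focused on the data arguments while the loader options stay in one place.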

from pytorch_lightning import LightningDataModule
from torch import Tensor
from pytorch_lightning.utilities import exceptions
from sklearn import model_selection
Contributor


This should definitely be guarded by _SKLEARN_AVAILABLE
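
A minimal sketch of that guard pattern (the flag and helper names follow what pl_bolts uses elsewhere, but treat them as assumptions):

from pl_bolts.utils import _SKLEARN_AVAILABLE
from pl_bolts.utils.warnings import warn_missing_pkg

if _SKLEARN_AVAILABLE:
    from sklearn import model_selection
else:  # pragma: no cover
    warn_missing_pkg("sklearn")

This mirrors how other optional imports are typically guarded in the repo.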

def train_dataloader(self) -> DataLoader:
return self._data_loader(self.train_dataset, shuffle=True)
Contributor


While shuffling a train dataset is definitely good practice, we can't force shuffling on users without an option to turn it off.

Contributor Author


Completely understand. I will check other datamodules to see how they handle this situation.
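
For illustration, one way the flag could be surfaced (a sketch with assumed names, not the PR's final code):

from torch.utils.data import DataLoader, Dataset


class SklearnDataModule:
    def __init__(self, train_dataset: Dataset, batch_size: int = 1, shuffle: bool = True) -> None:
        self.train_dataset = train_dataset
        self.batch_size = batch_size
        self.shuffle = shuffle  # user-controlled instead of hard-coded

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=self.shuffle)

Storing the flag on the instance keeps the dataloader methods themselves trivial.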

return x, y


@under_review()
class SklearnDataModule(LightningDataModule):
Contributor


Btw, one idea for how this could again become a sklearn datamodule (i.e. actually use some indispensable functionality from sklearn) would be to add a DataModule that automatically loads a sklearn dataset (https://scikit-learn.org/stable/datasets.html). It would have to be named differently (something like SklearnNamedDataModule) and it's absolutely a new feature, but it's just an idea 🚀

Contributor Author


Definitely. I am happy to do this 🚀
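
A rough sketch of the idea (all names here are hypothetical):

from sklearn import datasets

# map user-facing names to scikit-learn's loader functions
_NAMED_LOADERS = {
    "iris": datasets.load_iris,
    "wine": datasets.load_wine,
    "digits": datasets.load_digits,
}


def load_named_arrays(name: str):
    """Return (data, target) arrays for a named scikit-learn dataset."""
    bunch = _NAMED_LOADERS[name]()
    return bunch.data, bunch.target

A SklearnNamedDataModule could then wrap the returned arrays in an ArrayDataset.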


self._init_datasets(X, y, x_val, y_val, x_test, y_test)
def _sklearn_train_test_split(self, x, y, split: Optional[Union[float, int]] = None):
Contributor


It might make sense to have an error from our side saying sklearn is needed. On the other hand, this is such a small amount of code that it might also make sense to provide our own implementation.

Contributor Author


I would be happy to move away from sklearn, as only one function is used. Using random_split from torch is a potential option. I will have a think.
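
A sketch of the torch-only alternative (illustrative sizes and seed):

import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(10, 3), torch.arange(10))

# split 10 samples into train/val/test; the generator keeps it reproducible
train_ds, val_ds, test_ds = random_split(
    dataset, [6, 2, 2], generator=torch.Generator().manual_seed(42)
)

random_split works on any Dataset with a length, so it would pair naturally with ArrayDataset.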

Contributor Author

@Ce11an commented Aug 2, 2022

Hi @Ce11an! I left a couple of comments. I know it looks like a lot, but I have to say, I really like the ideas in this PR and the general style! Keep it up! 💪 🎉

Thank you! Appreciate the feedback ⚡

Contributor Author

@Ce11an commented Aug 2, 2022

Thank you @luca-medeiros @otaj for reviewing the PR 🥳

My actions from your comments are the following:

  • Remove the dependence of x/data and y/target
  • Remove the conversion of dtypes
  • Explore using apply_to_collection
  • Explore options to remove the dependency of using sklearn and train_test_split
  • Explore creating a SklearnNamedDataModule
  • Add custom type hints
  • Explore creating a "bolts" Dataset base class.

Let me know if I have missed anything. Thanks again for your feedback 😄

Contributor Author

@Ce11an commented Aug 9, 2022

Hey @otaj and @luca-medeiros

I have updated the ArrayDataset along with creating a BoltsDataset base class. Let me know your thoughts. Once we are happy with the dataset, I will move on to resolving the comments for the SklearnDataModule. Thanks!!

Contributor

@luca-medeiros left a comment


Good stuff! I think ArrayDataset became much more powerful this time and using a Bolts Dataset allows it to be very lean.

Contributor Author

@Ce11an commented Aug 12, 2022

Hi @otaj and @luca-medeiros

I have done some thinking regarding a BoltsDataset and I believe I have a solution that fixes some of the issues:

from typing import Optional, Callable, Tuple, List, Union

import numpy as np
import torch
from pytorch_lightning.utilities import exceptions
from torch.utils.data import Dataset

ARRAYS = Union[torch.Tensor, np.ndarray, List[Union[float, int]]]


class DataModel:
    """Base class for DataModel."""

    def __init__(self, data: ARRAYS, transform: Optional[Callable] = None) -> None:
        self.data = data
        self.transform = transform

    def process(self):
        if self.transform is not None:
            self.data = self.transform(self.data)
        return self.data


class BoltsDataset(Dataset):
    """Base class for Bolts datasets.

    Args:
        data_models: data models to use to create a Dataset.
    """

    def __init__(self, *data_models: DataModel) -> None:
        self.data_models = data_models

        if not self._equal_size():
            raise exceptions.MisconfigurationException("Shape mismatch between arrays in the first dimension")

    def __getitem__(self, idx: int):
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError

    def _equal_size(self) -> bool:
        """Check the size of the tensors are equal in the first dimension."""
        return all(len(data_model.data) == len(self.data_models[0].data) for data_model in self.data_models)


class ArrayDataset(BoltsDataset):
    def __init__(self, *data_models: DataModel) -> None:
        super().__init__(*data_models)

    def __len__(self) -> int:
        return len(self.data_models[0].data)

    def __getitem__(self, idx: int) -> Tuple[ARRAYS, ...]:
        # index into the processed arrays so each item is a single sample
        return tuple(data_model.process()[idx] for data_model in self.data_models)


def add_one(integers: np.ndarray) -> np.ndarray:
    output = []
    for data in integers:
        output.append(data + 1)
    return np.array(output)


if __name__ == '__main__':
    dm1 = DataModel(np.array([[0, 0, 1]]))
    dm2 = DataModel(np.array([[1, 0, 3]]), add_one)
    ds = ArrayDataset(dm1, dm2)
    print(ds[0])

In this example, I have created a DataModel class that ties together an array-like object and a transform. This means we do not have to restrict ourselves to just x and y: we can have any number of data and targets and perform transformations. This example definitely needs some refinement, but I wanted to get your thoughts. Let me know what you think! 👍🏻

Contributor

@otaj commented Aug 15, 2022

Hi @Ce11an, I really like the last iteration of DataModel and BoltsDataset. I have a couple of comments, but in general it's a very cool thing to do 👍

  1. The type of ARRAYS is a bit incomplete, as we can have nested lists. Not a big issue, but it probably needs to be resolved properly.
  2. DataModel could be a dataclass. It's absolutely not needed, but I like them and it fits perfectly 😂 (see the sketch after this list)
  3. However, DataModel needs to be a little bit smarter: validation against types, a setter at least for transform that performs the same validation, and, biggest of all, you shouldn't overwrite self.data in process, so that you still have a reference to the original data, and you should run self.transform only once for every combination of self.data and self.transform.
  4. I'm not convinced by the distinction between BoltsDataset and ArrayDataset. BoltsDataset is already ArrayDataset, because you are forcing arrays through DataModel.
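
For point 2, a minimal sketch of the dataclass form (the review snippet further down shows essentially this shape):

from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DataModel:
    data: Any  # Any stands in for the ARRAYS alias from point 1
    transform: Optional[Callable] = None

The validation and run-once transform behaviour from point 3 would still need to be layered on top.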

@Ce11an requested a review from otaj August 15, 2022 18:26
@@ -19,7 +19,7 @@ def test_linear_regression_model(tmpdir):
X = np.array([[1.0, 1], [1, 2], [2, 2], [2, 3], [3, 3], [3, 4], [4, 4], [4, 5]])
y = np.dot(X, np.array([1.0, 2])) + 3
y = y[:, np.newaxis]
-loader = DataLoader(SklearnDataset(X, y), batch_size=2)
+loader = DataLoader(ArrayDataset(X, y), batch_size=2)
Contributor Author


This will need fixing 🧰

Contributor

@otaj left a comment


I have to say, I didn't take a look at the tests or at the SklearnDataModule again, as they seem not to have changed since my last viewing. It's getting there, but I'm still nitpicking a bit 😄

data: ARRAYS
transform: Optional[Callable] = None

def process(self, data: ARRAYS) -> ARRAYS:
Contributor


Why does it need a data argument? And if that's there and process is meant to serve only as a transform, then having data as an instance attribute doesn't make much sense. Or am I missing something?

return len(self.data_models[0].data)

def __getitem__(self, idx: int) -> Tuple[ARRAYS, ...]:
return tuple(data_model.process(data_model.data[idx]) for data_model in self.data_models)
Contributor


I don't think we want to call process on the data for every __getitem__, as this could get very expensive very fast.

Contributor Author

@Ce11an commented Aug 20, 2022


I do agree with you. However, from what I have seen, every Dataset performs the transform for every __getitem__, so I was being consistent. What do you recommend going forward?

Contributor


Gah, that's sad. I fully agree that keeping up with the rest of the world probably makes more sense, so consistency over anything else, but I just feel it's so inefficient.

Your call ⚡ I'll be fine either way ⚡
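
For reference, a sketch of what a run-once transform could look like (illustrative only, not what the PR adopted):

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class DataModel:
    data: Any
    transform: Optional[Callable] = None
    _processed: Optional[Any] = field(default=None, init=False, repr=False)

    def process(self) -> Any:
        # run the transform at most once, leaving self.data untouched
        if self._processed is None:
            self._processed = self.transform(self.data) if self.transform else self.data
        return self._processed

Whether that trade-off is worth breaking consistency with other Datasets is exactly the open question above.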

Returns:
bool: True if size of data_models are equal in the first dimension. False, if not.
"""
return all(len(data_model.data) == len(self.data_models[0].data) for data_model in self.data_models)
Contributor


Can data processing change the shape of the data? If so, this won't really work. If not, it should be mentioned somewhere in the docstrings.

Contributor Author


Good point, I will mention that in the docstring. Unless you want a check after the processing?

Contributor


I think mentioning it in the docstring will suffice

@Ce11an mentioned this pull request Aug 24, 2022
Contributor Author

@Ce11an commented Aug 24, 2022

Hi both,

I am closing this PR and splitting it out into several different ones, as the SklearnDataModule needs more refinement. I have made a PR for the ArrayDataset: #872. Thanks again for your input!

@Ce11an closed this Aug 24, 2022
@Ce11an deleted the feature/839_revision_sklearn branch August 24, 2022 18:12