
Feature/839 revision sklearn #846

Closed

Conversation

Contributor

@Ce11an commented Jul 31, 2022

What does this PR do?

Part of #839

  • pl_bolts.datamodules.sklearn_datamodule.SklearnDataModule
  • pl_bolts.datamodules.sklearn_datamodule.SklearnDataset
  • pl_bolts.datamodules.sklearn_datamodule.TensorDataset

Summary

  • Refactored sklearn_datamodule.py. (breaking change ❗ )
  • Replaced SklearnDataset with ArrayDataset. (breaking change ❗ )
  • Moved ArrayDataset to datasets module.

Instead of a SklearnDataset, I propose an ArrayDataset that can take any array-like input, such as lists, numpy arrays, or torch tensors. On initialisation, any array-like inputs that are not already torch tensors are converted to torch tensors, removing the need for the TensorDataset.
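
For illustration, a minimal sketch of the usage as first proposed here (the ArrayDataset(X, y) call shape matches the updated test further down; the later DataModel iteration in this thread changes this):

# illustrative only: assumes the initial ArrayDataset(X, y) call shape,
# where any non-tensor array-like input is converted to a torch tensor
import numpy as np
from torch.utils.data import DataLoader

from pl_bolts.datasets import ArrayDataset  # module location proposed in this PR

X = np.random.rand(100, 3)    # features as a numpy array
y = list(range(100))          # targets as a plain Python list
dataset = ArrayDataset(X, y)  # both inputs become torch tensors
loader = DataLoader(dataset, batch_size=16)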

Regarding the SklearnDataModule: as discussed on Slack with @otaj, we can assume that if the SklearnDataModule is being used, scikit-learn is available. We can therefore take advantage of the train_test_split function to split the input data into train, validation, and test ArrayDatasets for the DataLoaders.
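
A rough sketch of the splitting idea (scikit-learn is assumed available whenever SklearnDataModule is used, per the point above):

# sketch: carve out a validation split with train_test_split; a second
# call on the remainder could carve out a test split the same way
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)
target = np.arange(10)

x_train, x_val, y_train, y_val = train_test_split(
    data, target, test_size=0.2, random_state=42, shuffle=True
)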

As there are a few breaking changes and reimplementations of previous features, I have not yet written unit tests or corrected the documentation. However, I intend to! I wanted to create a draft PR and have a discussion on whether this PR is reasonable. Please provide feedback 😃

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Of course! 🥳

@github-actions bot added the datamodule label Jul 31, 2022
@otaj mentioned this pull request Aug 2, 2022
Contributor

@luca-medeiros left a comment


Loved the ArrayDataset idea, very abstract and powerful.

Contributor

@otaj left a comment


Hi @Ce11an! I left a couple of comments. I know it looks like a lot, but I have to say, I really like the ideas in this PR and the general style! Keep it up! 💪 🎉

Comment on lines +56 to +68
data: Any,
target: Any,
test_dataset: Optional[ArrayDataset] = None,
val_size: Union[float, int] = 0.2,
test_size: Optional[Union[float, int]] = None,
random_state: Optional[int] = None,
shuffle: bool = True,
stratify: bool = False,
num_workers: int = 0,
batch_size: int = 1,
pin_memory: bool = False,
drop_last: bool = False,
persistent_workers: bool = False,
Contributor


There are a couple of issues with this.

  1. It suddenly looks like there is not a single reason why it should be called SklearnDataModule, since it doesn't really need any functionality from scikit-learn.
  2. It seems odd that data and target are supposed to be specified as x and y variables, whereas test_dataset has to be specified either as a fraction or as an instance of a dataset (and an ArrayDataset at that). I'd consider either supporting train data as a dataset or test data as multiple values.
  3. It makes sense not to explicitly declare arguments whose default values are the same as the base class's - instead, pass them in the call to super().__init__(*args, **kwargs). (See the sketch below.)
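
A generic illustration of point 3 (BaseDataModule and its arguments here are hypothetical stand-ins, not pl_bolts' actual base class):

class BaseDataModule:
    # hypothetical base class that owns the loader defaults
    def __init__(self, batch_size: int = 1, num_workers: int = 0) -> None:
        self.batch_size = batch_size
        self.num_workers = num_workers


class SklearnDataModule(BaseDataModule):
    def __init__(self, data, target, val_size: float = 0.2, *args, **kwargs) -> None:
        # forward loader options instead of redeclaring their defaults here
        super().__init__(*args, **kwargs)
        self.data = data
        self.target = target
        self.val_size = val_size

This keeps the subclass signature focused on the data arguments while the loader options stay in one place.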

from pytorch_lightning import LightningDataModule
from torch import Tensor
from pytorch_lightning.utilities import exceptions
from sklearn import model_selection
Contributor


This should definitely be guarded by _SKLEARN_AVAILABLE
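
A minimal sketch of that guard pattern (the flag and helper names follow what pl_bolts uses elsewhere, but treat them as assumptions):

from pl_bolts.utils import _SKLEARN_AVAILABLE
from pl_bolts.utils.warnings import warn_missing_pkg

if _SKLEARN_AVAILABLE:
    from sklearn import model_selection
else:  # pragma: no cover
    warn_missing_pkg("sklearn")

This mirrors how other optional imports are typically guarded in the repo.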

def train_dataloader(self) -> DataLoader:
return self._data_loader(self.train_dataset, shuffle=True)
Contributor


While shuffling a train dataset is definitely good practice, we can't force shuffling on users without an option to turn it off.

Contributor Author


Completely understand. I will check other datamodules to see how they handle this situation.
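
For illustration, one way the flag could be surfaced (a sketch with assumed names, not the PR's final code):

from torch.utils.data import DataLoader, Dataset


class SklearnDataModule:
    def __init__(self, train_dataset: Dataset, batch_size: int = 1, shuffle: bool = True) -> None:
        self.train_dataset = train_dataset
        self.batch_size = batch_size
        self.shuffle = shuffle  # user-controlled instead of hard-coded

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=self.shuffle)

Storing the flag on the instance keeps the dataloader methods themselves trivial.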

return x, y


@under_review()
class SklearnDataModule(LightningDataModule):
Contributor


Btw, one idea for how this could again become a sklearn datamodule (i.e. actually use some indispensable functionality from sklearn) would be to add a DataModule that automatically loads a sklearn dataset (https://scikit-learn.org/stable/datasets.html). It would have to be named differently (something like SklearnNamedDataModule) and it's absolutely a new feature, but it's just an idea 🚀

Contributor Author


Definitely. I am happy to do this 🚀
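
A rough sketch of the idea (all names here are hypothetical):

from sklearn import datasets

# map user-facing names to scikit-learn's loader functions
_NAMED_LOADERS = {
    "iris": datasets.load_iris,
    "wine": datasets.load_wine,
    "digits": datasets.load_digits,
}


def load_named_arrays(name: str):
    """Return (data, target) arrays for a named scikit-learn dataset."""
    bunch = _NAMED_LOADERS[name]()
    return bunch.data, bunch.target

A SklearnNamedDataModule could then wrap the returned arrays in an ArrayDataset.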


self._init_datasets(X, y, x_val, y_val, x_test, y_test)
def _sklearn_train_test_split(self, x, y, split: Optional[Union[float, int]] = None):
Contributor


It might make sense to have an error from our side saying sklearn is needed. On the other hand, this is such a small amount of code that it might also make sense to provide our own implementation.

Contributor Author


I would be happy to move away from sklearn, as only one function is used. Using random_split from torch is a potential option. I will have a think.
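
A sketch of the torch-only alternative (illustrative sizes and seed):

import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(10, 3), torch.arange(10))

# split 10 samples into train/val/test; the generator keeps it reproducible
train_ds, val_ds, test_ds = random_split(
    dataset, [6, 2, 2], generator=torch.Generator().manual_seed(42)
)

random_split works on any Dataset with a length, so it would pair naturally with ArrayDataset.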

Contributor Author

@Ce11an commented Aug 2, 2022

Hi @Ce11an! I left a couple of comments. I know it looks like a lot, but I have to say, I really like the ideas in this PR and the general style! Keep it up! 💪 🎉

Thank you! Appreciate the feedback ⚡

Contributor Author

@Ce11an commented Aug 2, 2022

Thank you @luca-medeiros @otaj for reviewing the PR 🥳

My actions from your comments are the following:

  • Remove the dependence of x/data and y/target
  • Remove the conversion of dtypes
  • Explore using apply_to_collection
  • Explore options to remove the dependency of using sklearn and train_test_split
  • Explore creating a SklearnNamedDataModule
  • Add custom type hints
  • Explore creating a "bolts" Dataset base class.

Let me know if I have missed anything. Thanks again for your feedback 😄

Contributor Author

@Ce11an commented Aug 9, 2022

Hey @otaj and @luca-medeiros

I have updated the ArrayDataset along with creating a BoltsDataset base class. Let me know your thoughts. Once we are happy with the dataset, I will move on to resolving the comments for the SklearnDataModule. Thanks!!

Contributor

@luca-medeiros left a comment


Good stuff! I think ArrayDataset became much more powerful this time and using a Bolts Dataset allows it to be very lean.

Contributor Author

@Ce11an commented Aug 12, 2022

Hi @otaj and @luca-medeiros

I have done some thinking regarding a BoltsDataset and I believe I have a solution that fixes some of the issues:

from typing import Optional, Callable, Tuple, List, Union

import numpy as np
import torch
from pytorch_lightning.utilities import exceptions
from torch.utils.data import Dataset

ARRAYS = Union[torch.Tensor, np.ndarray, List[Union[float, int]]]


class DataModel:
    """Base class for DataModel."""

    def __init__(self, data: ARRAYS, transform: Optional[Callable] = None) -> None:
        self.data = data
        self.transform = transform

    def process(self):
        if self.transform is not None:
            self.data = self.transform(self.data)
        return self.data


class BoltsDataset(Dataset):
    """Base class for Bolts datasets.

    Args:
        data_models: data models to use to create a Dataset.
    """

    def __init__(self, *data_models: DataModel) -> None:
        self.data_models = data_models

        if not self._equal_size():
            raise exceptions.MisconfigurationException("Shape mismatch between arrays in the first dimension")

    def __getitem__(self, idx: int):
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError

    def _equal_size(self) -> bool:
        """Check the size of the tensors are equal in the first dimension."""
        return all(len(data_model.data) == len(self.data_models[0].data) for data_model in self.data_models)


class ArrayDataset(BoltsDataset):
    def __init__(self, *data_models: DataModel) -> None:
        super().__init__(*data_models)

    def __len__(self) -> int:
        return len(self.data_models[0].data)

    def __getitem__(self, idx: int) -> Tuple[ARRAYS, ...]:
        # index into the processed arrays so each item is a single sample
        return tuple(data_model.process()[idx] for data_model in self.data_models)


def add_one(integers: np.ndarray) -> np.ndarray:
    output = []
    for data in integers:
        output.append(data + 1)
    return np.array(output)


if __name__ == '__main__':
    dm1 = DataModel(np.array([[0, 0, 1]]))
    dm2 = DataModel(np.array([[1, 0, 3]]), add_one)
    ds = ArrayDataset(dm1, dm2)
    print(ds[0])

In this example, I have created a DataModel class that ties together an array-like object and a transform. This means we do not have to restrict ourselves to just x and y: we can have any number of data and targets and perform transformations. This example definitely needs some refinement, but I wanted to get your thoughts. Let me know what you think! 👍🏻

Contributor

@otaj commented Aug 15, 2022

Hi @Ce11an, I really like the last iteration of DataModel and BoltsDataset. I have a couple of comments, but in general it's a very cool thing to do 👍

  1. The type of ARRAYS is a bit incomplete, as we can have nested lists. Not a big issue, but it probably needs to be resolved properly.
  2. DataModel could be a dataclass. It's absolutely not needed, but I like them and it fits perfectly 😂 (see the sketch after this list)
  3. However, DataModel needs to be a little bit smarter: validation against types, a setter at least for transform that performs the same validation, and, biggest of all, you shouldn't overwrite self.data in process, so that you still have a reference to the original data, and you should run self.transform only once for every combination of self.data and self.transform.
  4. I'm not convinced by the distinction between BoltsDataset and ArrayDataset. BoltsDataset is already ArrayDataset, because you are forcing arrays through DataModel.
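
For point 2, a minimal sketch of the dataclass form (the review snippet further down shows essentially this shape):

from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DataModel:
    data: Any  # Any stands in for the ARRAYS alias from point 1
    transform: Optional[Callable] = None

The validation and run-once transform behaviour from point 3 would still need to be layered on top.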

@Ce11an requested a review from otaj August 15, 2022 18:26
@@ -19,7 +19,7 @@ def test_linear_regression_model(tmpdir):
X = np.array([[1.0, 1], [1, 2], [2, 2], [2, 3], [3, 3], [3, 4], [4, 4], [4, 5]])
y = np.dot(X, np.array([1.0, 2])) + 3
y = y[:, np.newaxis]
-loader = DataLoader(SklearnDataset(X, y), batch_size=2)
+loader = DataLoader(ArrayDataset(X, y), batch_size=2)
Contributor Author


This will need fixing 🧰

Contributor

@otaj left a comment


I have to say, I didn't take a look at the tests or at the SklearnDataModule again, as they seem not to have changed since my last viewing. It's getting there, but I'm still nitpicking a bit 😄

data: ARRAYS
transform: Optional[Callable] = None

def process(self, data: ARRAYS) -> ARRAYS:
Contributor


Why does it need a data argument? And if that's there and process is meant to serve only as a transform, then having data as an instance attribute doesn't make much sense. Or am I missing something?

return len(self.data_models[0].data)

def __getitem__(self, idx: int) -> Tuple[ARRAYS, ...]:
return tuple(data_model.process(data_model.data[idx]) for data_model in self.data_models)
Contributor


I don't think we want to call process on the data for every __getitem__, as this could get very expensive very fast.

Contributor Author

@Ce11an commented Aug 20, 2022


I do agree with you. However, from what I have seen, every Dataset performs the transform for every __getitem__, so I was being consistent. What do you recommend going forward?

Contributor


Gah, that's sad. I fully agree that keeping up with the rest of the world probably makes more sense, so consistency over anything else, but I just feel it's so inefficient.

Your call ⚡ I'll be fine either way ⚡
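
For reference, a sketch of what a run-once transform could look like (illustrative only, not what the PR adopted):

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class DataModel:
    data: Any
    transform: Optional[Callable] = None
    _processed: Optional[Any] = field(default=None, init=False, repr=False)

    def process(self) -> Any:
        # run the transform at most once, leaving self.data untouched
        if self._processed is None:
            self._processed = self.transform(self.data) if self.transform else self.data
        return self._processed

Whether that trade-off is worth breaking consistency with other Datasets is exactly the open question above.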

Returns:
bool: True if size of data_models are equal in the first dimension. False, if not.
"""
return all(len(data_model.data) == len(self.data_models[0].data) for data_model in self.data_models)
Contributor


Can data processing change the shape of the data? If so, this won't really work. If not, it should be mentioned somewhere in the docstrings.

Contributor Author


Good point, I will mention that in the docstring. Unless you want a check after the processing?

Contributor


I think mentioning it in the docstring will suffice

@Ce11an mentioned this pull request Aug 24, 2022
Contributor Author

@Ce11an commented Aug 24, 2022

Hi both,

I am closing this PR and splitting it out into several different ones, as the SklearnDataModule needs more refinement. I have made a PR for the ArrayDataset: #872. Thanks again for your input!

@Ce11an closed this Aug 24, 2022
@Ce11an deleted the feature/839_revision_sklearn branch August 24, 2022 18:12