diff --git a/.github/actions/install-python-and-package/action.yml b/.github/actions/install-python-and-package/action.yml index 510da80a7..fd9f1a206 100644 --- a/.github/actions/install-python-and-package/action.yml +++ b/.github/actions/install-python-and-package/action.yml @@ -50,8 +50,8 @@ runs: conda install -c bioconda msms ## PyTorch, PyG, PyG adds ### Installing for CPU only on the CI - conda install pytorch torchvision torchaudio cpuonly -c pytorch - conda install pyg -c pyg + conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 -c pytorch + pip install torch_geometric==2.3.1 pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-$(python3 -c "import torch; print(torch.__version__)")+cpu.html - name: Install dependencies on MacOS shell: bash {0} diff --git a/README.md b/README.md index c41c9c636..2407ba66a 100644 --- a/README.md +++ b/README.md @@ -38,7 +38,7 @@ DeepRank2 extensive documentation can be found [here](https://deeprank2.rtfd.io/ - [Table of contents](#table-of-contents) - [Installation](#installation) - [Dependencies](#dependencies) - - [Deeprank2 Package](#deeprank2-package) + - [Deeprank2 Package](#deeprank2-package) - [Test installation](#test-installation) - [Contributing](#contributing) - [Data generation](#data-generation) @@ -46,6 +46,7 @@ DeepRank2 extensive documentation can be found [here](https://deeprank2.rtfd.io/ - [GraphDataset](#graphdataset) - [GridDataset](#griddataset) - [Training](#training) + - [Run a pre-trained model on new data](#run-a-pre-trained-model-on-new-data) - [Computational performances](#computational-performances) - [Package development](#package-development) @@ -61,7 +62,8 @@ Before installing deeprank2 you need to install some dependencies. We advise to * [Here](https://ssbio.readthedocs.io/en/latest/instructions/msms.html) for MacOS with M1 chip users. 
* [PyTorch](https://pytorch.org/get-started/locally/) * We support torch's CPU library as well as CUDA. -* [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html) and its optional dependencies: `torch_scatter`, `torch_sparse`, `torch_cluster`, `torch_spline_conv`. + * Currently, the package is tested using [PyTorch 2.0.1](https://pytorch.org/get-started/previous-versions/#v201). +* [PyG](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html) and its optional dependencies: `torch_scatter`, `torch_sparse`, `torch_cluster`, `torch_spline_conv`. * [DSSP 4](https://swift.cmbi.umcn.nl/gv/dssp/) * Check if `dssp` is installed: `dssp --version`. If this gives an error or shows a version lower than 4: * on ubuntu 22.04 or newer: `sudo apt-get install dssp`. If the package cannot be located, first run `sudo apt-get update`. @@ -70,7 +72,7 @@ Before installing deeprank2 you need to install some dependencies. We advise to * Check if gcc is installed: `gcc --version`. If this gives an error, run `sudo apt-get install gcc`. * For MacOS with M1 chip users only install [the conda version of PyTables](https://www.pytables.org/usersguide/installation.html). 
-### Deeprank2 Package +## Deeprank2 Package Once the dependencies are installed, you can install the latest stable release of deeprank2 using the PyPi package manager: @@ -214,14 +216,12 @@ dataset_train = GraphDataset( dataset_val = GraphDataset( hdf5_path = hdf5_paths, subset = valid_ids, - train = False, - dataset_train = dataset_train + train_source = dataset_train ) dataset_test = GraphDataset( hdf5_path = hdf5_paths, subset = test_ids, - train = False, - dataset_train = dataset_train + train_source = dataset_train ) ``` @@ -248,14 +248,12 @@ dataset_train = GridDataset( dataset_val = GridDataset( hdf5_path = hdf5_paths, subset = valid_ids, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, ) dataset_test = GridDataset( hdf5_path = hdf5_paths, subset = test_ids, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, ) ``` @@ -313,6 +311,40 @@ trainer.test() ``` +### Run a pre-trained model on new data + +If you want to analyze new PDB files using a pre-trained model, the first step is to process and save them into HDF5 files [as we have done above](#data-generation). + +Then, the `DeeprankDataset` instance for the newly processed data can be created. Do this by specifying the path for the pre-trained model in `train_source`, together with the path to the HDF5 files just created. Note that there is no need of setting the dataset's parameters, since they are inherited from the information saved in the pre-trained model. 
Let's suppose that the model has been trained with `GraphDataset` objects: + +```python +from deeprank2.dataset import GraphDataset + +dataset_test = GraphDataset( + hdf5_path = "/", + train_source = "" +) +``` + +Finally, the `Trainer` instance can be defined and the new data can be tested: + +```python +from deeprank2.trainer import Trainer +from deeprank2.neuralnets.gnn.naive_gnn import NaiveNetwork +from deeprank2.utils.exporters import HDF5OutputExporter + +trainer = Trainer( + NaiveNetwork, + dataset_test = dataset_test, + pretrained_model = "", + output_exporters = [HDF5OutputExporter("")] +) + +trainer.test() +``` + +For more details about how to run a pre-trained model on new data, see the [docs](https://deeprank2.readthedocs.io/en/latest/getstarted.html#run-a-pre-trained-model-on-new-data). + ## Computational performances We measured the efficiency of data generation in DeepRank2 using the tutorials' [PDB files](https://zenodo.org/record/8187806) (~100 data points per data set), averaging the results run on Apple M1 Pro, using a single CPU. 
diff --git a/deeprank2/dataset.py b/deeprank2/dataset.py index 53e90890a..c78ef9d67 100644 --- a/deeprank2/dataset.py +++ b/deeprank2/dataset.py @@ -3,11 +3,11 @@ import inspect import logging import os +import pickle import re import sys import warnings -from ast import literal_eval -from typing import Literal +from typing import Literal, Union import h5py import matplotlib.pyplot as plt @@ -27,24 +27,25 @@ class DeeprankDataset(Dataset): - def __init__( # pylint: disable=too-many-arguments - self, - hdf5_path: str | list[str], - subset: list[str] | None, - target: str | None, - task: str | None, - classes: list[str] | list[int] | list[float] | None, - use_tqdm: bool, - root_directory_path: str, - target_filter: dict[str, str] | None, - check_integrity: bool + def __init__(self, # pylint: disable=too-many-arguments + hdf5_path: str | list[str], + subset: list[str] | None, + train_source: str | GridDataset | GraphDataset | None, + target: str | None, + target_transform: bool | None, + target_filter: dict[str, str] | None, + task: str | None, + classes: list[str] | list[int] | list[float] | None, + use_tqdm: bool, + root: str, + check_integrity: bool ): """Parent class of :class:`GridDataset` and :class:`GraphDataset` which inherits from :class:`torch_geometric.data.dataset.Dataset`. More detailed information about the parameters can be found in :class:`GridDataset` and :class:`GraphDataset`. 
""" - super().__init__(root_directory_path) + super().__init__(root) if isinstance(hdf5_path, str): self.hdf5_paths = [hdf5_path] @@ -55,17 +56,17 @@ def __init__( # pylint: disable=too-many-arguments else: raise TypeError(f"hdf5_path: unexpected type: {type(hdf5_path)}") - self.use_tqdm = use_tqdm - - self.target = target self.subset = subset - + self.train_source = train_source + self.target = target + self.target_transform = target_transform self.target_filter = target_filter if check_integrity: self._check_hdf5_files() self._check_task_and_classes(task, classes) + self.use_tqdm = use_tqdm # create the indexing system # alows to associate each mol to an index @@ -75,10 +76,49 @@ def __init__( # pylint: disable=too-many-arguments self.df = None self.means = None self.devs = None + self.train_means = None + self.train_devs = None # get the device self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + def _check_and_inherit_train(self, data_type: Union[GridDataset, GraphDataset], inherited_params): + """Check if the pre-trained model or training set provided are valid for validation and/or testing, and inherit the parameters. + """ + if isinstance(self.train_source, str): # pylint: disable=too-many-nested-blocks + try: + if torch.cuda.is_available(): + data = torch.load(self.train_source) + else: + data = torch.load(self.train_source, map_location=torch.device('cpu')) + if data["data_type"] is not data_type: + raise TypeError (f"""The pre-trained model has been trained with data of type {data["data_type"]}, but you are trying + to define a {data_type}-class validation/testing dataset. 
Please provide a valid DeepRank2 + model trained with {data_type}-class type data, or define the dataset using the appropriate class.""") + if data_type is GraphDataset: + self.train_means = data["means"] + self.train_devs = data["devs"] + # convert strings in 'transform' key to lambda functions + if data["features_transform"]: + for _, key in data["features_transform"].items(): + if key['transform'] is None: + continue + key['transform'] = eval(key['transform']) # pylint: disable=eval-used + except pickle.UnpicklingError as e: + raise ValueError("""The path provided to `train_source` is not a valid DeepRank2 pre-trained model. + Please provide a valid path to a DeepRank2 pre-trained model.""") from e + elif isinstance(self.train_source, data_type): + data = self.train_source + if data_type is GraphDataset: + self.train_means = self.train_source.means + self.train_devs = self.train_source.devs + else: + raise TypeError(f"""The train data provided is type: {type(self.train_source)} + Please provide a valid training {data_type} or the path to a valid DeepRank2 pre-trained model.""") + + #match parameters with the ones in the training set + self._check_inherited_params(inherited_params, data) + def _check_hdf5_files(self): """Checks if the data contained in the .HDF5 file is valid.""" _log.info("\nChecking dataset Integrity...") @@ -120,7 +160,7 @@ def _check_task_and_classes(self, task: str, classes: str | None = None): if self.task == targets.CLASSIF: if classes is None: self.classes = [0, 1] - _log.info(f'Target classes set up to: {self.classes}') + _log.info(f'Target classes set to: {self.classes}') else: self.classes = classes @@ -134,25 +174,27 @@ def _check_task_and_classes(self, task: str, classes: str | None = None): def _check_inherited_params( self, inherited_params: list[str], - dataset_train: GraphDataset | GridDataset, + data: dict | GraphDataset | GridDataset, ): - """"Check if the parameters for validation and/or testing are the same as in the training 
set. + """"Check if the parameters for validation and/or testing are the same as in the pre-trained model or training set provided. Args: - inherited_params (list[str]): list of parameters that need to be checked for inheritance. - dataset_train (class:`GraphDataset` | class:`GridDataset`): The parameters in `inherited_param` will be inherited from `dataset_train`. + inherited_params (List[str]): List of parameters that need to be checked for inheritance. + data (Union[dict, class:`GraphDataset`, class:`GridDataset`]): The parameters in `inherited_param` will be inherited + from the information contained in `data`. """ self_vars = vars(self) - dataset_train_vars = vars(dataset_train) + if not isinstance(data, dict): + data = vars(data) for param in inherited_params: - if (self_vars[param] != dataset_train_vars[param]): + if (self_vars[param] != data[param]): if (self_vars[param] != self.default_vars[param]): _log.warning(f"The {param} parameter set here is: {self_vars[param]}, " + - f"which is not equivalent to the one in the training phase: {dataset_train_vars[param]}./n" + + f"which is not equivalent to the one in the training phase: {data[param]}./n" + f"Overwriting {param} parameter with the one used in the training phase.") - setattr(self, param, dataset_train_vars[param]) + setattr(self, param, data[param]) def _create_index_entries(self): """Creates the indexing of each molecule in the dataset. @@ -192,14 +234,14 @@ def _create_index_entries(self): except Exception: _log.exception(f"on {hdf5_path}") - def _filter_targets(self, entry_group: h5py.Group) -> bool: + def _filter_targets(self, grp: h5py.Group) -> bool: """Filters the entry according to a dictionary. The filter is based on the attribute self.target_filter that must be either of the form: { target_name : target_condition } or None. Args: - entry_group (:class:`h5py.Group`): The entry group in the .HDF5 file. + grp (:class:`h5py.Group`): The entry group in the .HDF5 file. 
Returns: bool: True if we keep the entry False otherwise. @@ -213,7 +255,7 @@ def _filter_targets(self, entry_group: h5py.Group) -> bool: for target_name, target_condition in self.target_filter.items(): - present_target_names = list(entry_group[targets.VALUES].keys()) + present_target_names = list(grp[targets.VALUES].keys()) if target_name in present_target_names: @@ -221,17 +263,18 @@ def _filter_targets(self, entry_group: h5py.Group) -> bool: if isinstance(target_condition, str): operation = target_condition + target_value = grp[targets.VALUES][target_name][()] for operator_string in [">", "<", "==", "<=", ">=", "!="]: - operation = operation.replace(operator_string, "target_value" + operator_string) + operation = operation.replace(operator_string, f"{target_value}" + operator_string) - if not literal_eval(operation): + if not eval(operation): # pylint: disable=eval-used return False elif target_condition is not None: raise ValueError("Conditions not supported", target_condition) else: - _log.warning(f" :Filter {target_name} not found for entry {entry_group}\n" + _log.warning(f" :Filter {target_name} not found for entry {grp}\n" f" :Filter options are: {present_target_names}") return True @@ -408,19 +451,18 @@ def _compute_mean_std(self): class GridDataset(DeeprankDataset): - def __init__( # pylint: disable=too-many-arguments + def __init__( # pylint: disable=too-many-arguments, too-many-locals self, hdf5_path: str | list, subset: list[str] | None = None, - train: bool = True, - dataset_train: GridDataset | None = None, + train_source: str | GridDataset | None = None, features: list[str] | str | Literal["all"] | None = "all", target: str | None = None, target_transform: bool = False, target_filter: dict[str, str] | None = None, task: Literal["regress", "classif"] | None = None, classes: list[str] | list[int] | list[float] | None = None, - tqdm: bool = True, + use_tqdm: bool = True, root: str = "./", check_integrity: bool = True ): @@ -429,17 +471,14 @@ def 
__init__( # pylint: disable=too-many-arguments Args: hdf5_path (str | list): Path to .HDF5 file(s). For multiple .HDF5 files, insert the paths in a list. Defaults to None. subset (list[str] | None, optional): list of keys from .HDF5 file to include. Defaults to None (meaning include all). - train (bool, optional): Boolean flag to determine if the instance represents the training set. - If False, a dataset_train of the same class must be provided as well. - The latter will be used to scale the validation/testing set according to its features values and to match the datasets' parameters. - Defaults to True. - dataset_train (class:`GridDataset` | None, optional): If `train` is True, assign here the training set. - If `train` is False and `dataset_train` is assigned, - the parameters `features`, `target`, `traget_transform`, `task`, and `classes` will be inherited from `dataset_train`. + train_source (str | class:`GridDataset` | None, optional): data to inherit information from the training dataset or the pre-trained model. + If None, the current dataset is considered as the training set. Otherwise, `train_source` needs to be a dataset of the same class or + the path of a DeepRank2 pre-trained model. If set, the parameters `features`, `target`, `traget_transform`, `task`, and `classes` + will be inherited from `train_source`. Defaults to None. features (list[str] | str | Literal["all"] | None, optional): Consider all pre-computed features ("all") or some defined node features (provide a list, example: ["res_type", "polarity", "bsa"]). The complete list can be found in `deeprank2.domain.gridstorage`. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to "all". target (str | None, optional): Default options are irmsd, lrmsd, fnat, binary, capri_class, and dockq. 
It can also be a custom-defined target given to the Query class as input (see: `deeprank2.query`); in this case, @@ -447,11 +486,11 @@ def __init__( # pylint: disable=too-many-arguments Only numerical target variables are supported, not categorical. If the latter is your case, please convert the categorical classes into numerical class indices before defining the :class:`GraphDataset` instance. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to None. target_transform (bool, optional): Apply a log and then a sigmoid transformation to the target (for regression only). This puts the target value between 0 and 1, and can result in a more uniform target distribution and speed up the optimization. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to False. target_filter (dict[str, str] | None, optional): Dictionary of type [target: cond] to filter the molecules. Note that the you can filter on a different target than the one selected as the dataset target. @@ -460,43 +499,49 @@ def __init__( # pylint: disable=too-many-arguments ['irmsd', 'lrmsd', 'fnat', 'binary', 'capri_class', or 'dockq'], otherwise this setting is ignored. Automatically set to 'classif' if the target is 'binary' or 'capri_classes'. Automatically set to 'regress' if the target is 'irmsd', 'lrmsd', 'fnat', or 'dockq'. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to None. classes (list[str] | list[int] | list[float] | None): Define the dataset target classes in classification mode. 
- Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to None. - tqdm (bool, optional): Show progress bar. + use_tqdm (bool, optional): Show progress bar. Defaults to True. root (str, optional): Root directory where the dataset should be saved. Defaults to "./". check_integrity (bool, optional): Whether to check the integrity of the hdf5 files. Defaults to True. """ - super().__init__(hdf5_path, subset, target, task, classes, tqdm, root, target_filter, check_integrity) - + super().__init__(hdf5_path, subset, train_source, target, target_transform, + target_filter, task, classes, use_tqdm, root, check_integrity) self.default_vars = { k: v.default for k, v in inspect.signature(self.__init__).parameters.items() if v.default is not inspect.Parameter.empty } - self.train = train - self.dataset_train = dataset_train + self.default_vars["classes_to_index"] = None self.features = features self.target_transform = target_transform - self._check_features() - if not train: - if not isinstance(dataset_train, GridDataset): - raise TypeError(f"""The train dataset provided is type: {type(dataset_train)} - Please provide a valid training GridDataset.""") + if train_source is not None: + self.inherited_params = ["features", "target", "target_transform", "task", "classes", "classes_to_index"] + self._check_and_inherit_train(GridDataset, self.inherited_params) + self._check_features() - #check inherited parameter with the ones in the training set - inherited_params = ["features", "target", "target_transform", "task", "classes"] - self._check_inherited_params(inherited_params, dataset_train) + else: + self._check_features() + self.inherited_params = None - elif train and dataset_train: - _log.warning("""dataset_train has been set but train flag was set to True. 
- dataset_train will be ignored since the current dataset will be considered as training set.""") + try: + fname, mol = self.index_entries[0] + except IndexError as exc: + raise IndexError("No entries found in the dataset. Please check the dataset parameters.") from exc + with h5py.File(fname, 'r') as f5: + grp = f5[mol] + possible_targets = grp[targets.VALUES].keys() + if self.target is None: + raise ValueError(f"Please set the target during training dataset definition; targets present in the file/s are {possible_targets}.") + if self.target not in possible_targets: + raise ValueError(f"Target {self.target} not present in the file/s; targets present in the file/s are {possible_targets}.") self.features_dict = {} self.features_dict[gridstorage.MAPPED_FEATURES] = self.features @@ -512,25 +557,26 @@ def _check_features(self): hdf5_path = self.hdf5_paths[0] # read available features - with h5py.File(hdf5_path, "r") as hdf5_file: - entry_name = list(hdf5_file.keys())[0] - - hdf5_all_feature_names = list(hdf5_file[f"{entry_name}/{gridstorage.MAPPED_FEATURES}"].keys()) + with h5py.File(hdf5_path, "r") as f: + mol_key = list(f.keys())[0] + if isinstance(self.features, list): + self.features = [GRID_PARTIAL_FEATURE_NAME_PATTERN.match(feature_name).group(1) + if GRID_PARTIAL_FEATURE_NAME_PATTERN.match(feature_name) is not None + else feature_name for feature_name in self.features] # be sure to remove the dimension number suffix + self.features = list(set(self.features)) # remove duplicates + available_features = list(f[f"{mol_key}/{gridstorage.MAPPED_FEATURES}"].keys()) + available_features = [key for key in available_features if key[0] != '_'] # ignore metafeatures hdf5_matching_feature_names = [] # feature names that match with the requested list of names unpartial_feature_names = [] # feature names without their dimension number suffix - - for feature_name in hdf5_all_feature_names: - - if feature_name.startswith("_"): - continue # ignore metafeatures + for feature_name 
in available_features: partial_feature_match = GRID_PARTIAL_FEATURE_NAME_PATTERN.match(feature_name) if partial_feature_match is not None: # there's a dimension number in the feature name unpartial_feature_name = partial_feature_match.group(1) - if self.features == "all" or isinstance(self.features, list) and unpartial_feature_name in self.features: + if self.features == "all" or (isinstance(self.features, list) and unpartial_feature_name in self.features): hdf5_matching_feature_names.append(feature_name) @@ -538,7 +584,7 @@ def _check_features(self): else: # no numbers, it's a one-dimensional feature name - if self.features == "all" or isinstance(self.features, list) and feature_name in self.features: + if self.features == "all" or (isinstance(self.features, list) and feature_name in self.features): hdf5_matching_feature_names.append(feature_name) @@ -547,7 +593,7 @@ def _check_features(self): # check for the requested features missing_features = [] if self.features == "all": - self.features = sorted(hdf5_all_feature_names) + self.features = sorted(available_features) self.default_vars["features"] = self.features else: if not isinstance(self.features, list): @@ -565,11 +611,11 @@ def _check_features(self): # raise error if any features are missing if len(missing_features) > 0: raise ValueError( - f"Not all features could be found in the file {hdf5_path} under entry {entry_name}.\ + f"Not all features could be found in the file {hdf5_path} under entry {mol_key}.\ \nMissing features are: {missing_features} \ \nCheck feature_modules passed to the preprocess function. \ \nProbably, the feature wasn't generated during the preprocessing step. \ - Available features: {hdf5_all_feature_names}") + Available features: {available_features}") def get(self, idx: int) -> Data: """Gets one grid item from its unique index. 
@@ -596,22 +642,37 @@ def load_one_grid(self, hdf5_path: str, entry_name: str) -> Data: """ feature_data = [] - target_value = None with h5py.File(hdf5_path, 'r') as hdf5_file: - entry_group = hdf5_file[entry_name] + grp = hdf5_file[entry_name] - mapped_features_group = entry_group[gridstorage.MAPPED_FEATURES] + mapped_features_group = grp[gridstorage.MAPPED_FEATURES] for feature_name in self.features: if feature_name[0] != '_': # ignore metafeatures feature_data.append(mapped_features_group[feature_name][:]) + x=torch.tensor(np.expand_dims(np.array(feature_data), axis=0), dtype=torch.float) - target_value = entry_group[targets.VALUES][self.target][()] + # target + if self.target is None: + y = None + else: + if targets.VALUES in grp and self.target in grp[targets.VALUES]: + y = torch.tensor([grp[targets.VALUES][self.target][()]], dtype=torch.float) - # Wrap up the data in this object, for the collate_fn to handle it properly: - data = Data(x=torch.tensor(np.expand_dims(np.array(feature_data), axis=0), dtype=torch.float), - y=torch.tensor([target_value], dtype=torch.float)) + if self.task == targets.REGRESS and self.target_transform is True: + y = torch.sigmoid(torch.log(y)) + elif self.task is not targets.REGRESS and self.target_transform is True: + raise ValueError(f"Sigmoid transformation is not possible for {self.task} tasks. \ + Please change `task` to \"regress\" or set `target_transform` to `False`.") + else: + y = None + possible_targets = grp[targets.VALUES].keys() + if self.train_source is None: + raise ValueError(f"Target {self.target} missing in entry {entry_name} in file {hdf5_path}, possible targets are {possible_targets}." 
+ + "\n Use the query class to add more target values to input data.") + # Wrap up the data in this object, for the collate_fn to handle it properly: + data = Data(x=x, y=y) data.entry_names = entry_name return data @@ -622,8 +683,7 @@ def __init__( # noqa: MC0001, pylint: disable=too-many-arguments, too-many-local self, hdf5_path: str | list, subset: list[str] | None = None, - train: bool = True, - dataset_train: GridDataset | None = None, + train_source: str | GridDataset | None = None, node_features: list[str] | str | Literal["all"] | None = "all", edge_features: list[str] | str | Literal["all"] | None = "all", features_transform: dict | None = None, @@ -633,32 +693,31 @@ def __init__( # noqa: MC0001, pylint: disable=too-many-arguments, too-many-local target_filter: dict[str, str] | None = None, task: Literal["regress", "classif"] | None = None, classes: list[str] | list[int] | list[float] | None = None, - tqdm: bool = True, + use_tqdm: bool = True, root: str = "./", - check_integrity: bool = True + check_integrity: bool = True, ): """Class to load the .HDF5 files data into graphs. Args: hdf5_path (str | list): Path to .HDF5 file(s). For multiple .HDF5 files, insert the paths in a list. Defaults to None. subset (list[str] | None, optional): list of keys from .HDF5 file to include. Defaults to None (meaning include all). - train (bool, optional): Boolean flag to determine if the instance represents the training set. - If False, a dataset_train of the same class must be provided as well. - The latter will be used to scale the validation/testing set according to its features values and to match the datasets' parameters. - Defaults to True. - dataset_train (class:`GridDataset` | None, optional): If `train` is True, assign here the training set. - If `train` is False and `dataset_train` is assigned, - the parameters `features`, `target`, `traget_transform`, `task`, and `classes` will be inherited from `dataset_train`. 
+ train_source (str | class:`GraphDataset` | None, optional): data to inherit information from the training dataset or the pre-trained model. + If None, the current dataset is considered as the training set. Otherwise, `train_source` needs to be a dataset of the same class or + the path of a DeepRank2 pre-trained model. If set, the parameters `node_features`, `edge_features`, `features_transform`, + `target`, `target_transform`, `task`, and `classes` will be inherited from `train_source`. If standardization was performed in the + training dataset/step, also the attributes `means` and `devs` will be inherited from `train_source`, and they will be used to scale + the validation/testing set. Defaults to None. node_features (list[str] | str | Literal["all"] | None, optional): Consider all pre-computed node features ("all") or some defined node features (provide a list, example: ["res_type", "polarity", "bsa"]). The complete list can be found in `deeprank2.domain.nodestorage`. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to "all". edge_features (list[str] | str | Literal["all"] | None, optional): Consider all pre-computed edge features ("all") or some defined edge features (provide a list, example: ["dist", "coulomb"]). The complete list can be found in `deeprank2.domain.edgestorage`. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to "all". features_transform (dict | None, optional): Dictionary to indicate the transformations to apply to each feature in the dictionary, being the transformations lambda functions and/or standardization. 
@@ -666,7 +725,7 @@ def __init__( # noqa: MC0001, pylint: disable=too-many-arguments, too-many-local An `all` key can be set in the dictionary for indicating to apply the same `standardize` and `transform` to all the features. Example: `features_transform = {'all': {'transform': lambda t:np.log(t+1), 'standardize': True}}`. If both `all` and feature name/s are present, the latter have the priority over what indicated in `all`. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to None. clustering_method (str | None, optional): "mcl" for Markov cluster algorithm (see https://micans.org/mcl/), or "louvain" for Louvain method (see https://en.wikipedia.org/wiki/Louvain_method). @@ -683,11 +742,11 @@ def __init__( # noqa: MC0001, pylint: disable=too-many-arguments, too-many-local Only numerical target variables are supported, not categorical. If the latter is your case, please convert the categorical classes into numerical class indices before defining the :class:`GraphDataset` instance. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to None. target_transform (bool, optional): Apply a log and then a sigmoid transformation to the target (for regression only). This puts the target value between 0 and 1, and can result in a more uniform target distribution and speed up the optimization. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to False. target_filter (dict[str, str] | None, optional): Dictionary of type [target: cond] to filter the molecules. 
Note that the you can filter on a different target than the one selected as the dataset target. @@ -696,47 +755,53 @@ def __init__( # noqa: MC0001, pylint: disable=too-many-arguments, too-many-local ['irmsd', 'lrmsd', 'fnat', 'binary', 'capri_class', or 'dockq'], otherwise this setting is ignored. Automatically set to 'classif' if the target is 'binary' or 'capri_classes'. Automatically set to 'regress' if the target is 'irmsd', 'lrmsd', 'fnat', or 'dockq'. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to None. classes (list[str] | list[int] | list[float] | None): Define the dataset target classes in classification mode. - Value will be ignored and inherited from `dataset_train` if `train` is set as False and `dataset_train` is assigned. + Value will be ignored and inherited from `train_source` if `train_source` is assigned. Defaults to None. - tqdm (bool, optional): Show progress bar. - Defaults to True. - root (str, optional): Root directory where the dataset should be saved. - Defaults to "./". + use_tqdm (bool, optional): Show progress bar. Defaults to True. + root (str, optional): Root directory where the dataset should be saved. Defaults to "./". check_integrity (bool, optional): Whether to check the integrity of the hdf5 files. Defaults to True. 
""" - super().__init__(hdf5_path, subset, target, task, classes, tqdm, root, target_filter, check_integrity) + super().__init__(hdf5_path, subset, train_source, target, target_transform, + target_filter, task, classes, use_tqdm, root, check_integrity) self.default_vars = { k: v.default for k, v in inspect.signature(self.__init__).parameters.items() if v.default is not inspect.Parameter.empty } - self.train = train - self.dataset_train = dataset_train + self.default_vars["classes_to_index"] = None self.node_features = node_features self.edge_features = edge_features self.clustering_method = clustering_method self.target_transform = target_transform self.features_transform = features_transform - self._check_features() - if not train: - if not isinstance(dataset_train, GraphDataset): - raise TypeError(f"""The train dataset provided is type: {type(dataset_train)} - Please provide a valid training GraphDataset.""") + if train_source is not None: + self.inherited_params = ["node_features", "edge_features", "features_transform", "target", + "target_transform", "task", "classes", "classes_to_index"] + self._check_and_inherit_train(GraphDataset, self.inherited_params) + self._check_features() - #check inherited parameter with the ones in the training set - inherited_params = ["node_features", "edge_features", "features_transform", "target", "target_transform", "task", "classes"] - self._check_inherited_params(inherited_params, dataset_train) + else: + self._check_features() + self.inherited_params = None - elif train and dataset_train: - _log.warning("""dataset_train has been set but train flag was set to True. - dataset_train will be ignored since the current dataset will be considered as training set.""") + try: + fname, mol = self.index_entries[0] + except IndexError as exc: + raise IndexError("No entries found in the dataset. 
Please check the dataset parameters.") from exc + with h5py.File(fname, 'r') as f5: + grp = f5[mol] + possible_targets = grp[targets.VALUES].keys() + if self.target is None: + raise ValueError(f"Please set the target during training dataset definition; targets present in the file/s are {possible_targets}.") + if self.target not in possible_targets: + raise ValueError(f"Target {self.target} not present in the file/s; targets present in the file/s are {possible_targets}.") self.features_dict = {} self.features_dict[Nfeat.NODE] = self.node_features @@ -751,18 +816,14 @@ def __init__( # noqa: MC0001, pylint: disable=too-many-arguments, too-many-local if self.features_transform: standardize = any(self.features_transform[key].get("standardize") for key, _ in self.features_transform.items()) - if standardize and train: + if standardize and (train_source is None): if self.means or self.devs is None: if self.df is None: self.hdf5_to_pandas() self._compute_mean_std() - elif standardize and (not train): - if (dataset_train.means is None) or (dataset_train.devs is None): - if dataset_train.df is None: - dataset_train.hdf5_to_pandas() - dataset_train._compute_mean_std() - self.means = dataset_train.means - self.devs = dataset_train.devs + elif standardize and (train_source is not None): + self.means = self.train_means + self.devs = self.train_devs def get(self, idx: int) -> Data: """Gets one graph item from its unique index. @@ -905,12 +966,15 @@ def load_one_graph(self, fname: str, entry_name: str) -> Data: # pylint: disabl if self.task == targets.REGRESS and self.target_transform is True: y = torch.sigmoid(torch.log(y)) elif self.task is not targets.REGRESS and self.target_transform is True: - raise ValueError(f"Task is set to {self.task}. Please set it to regress to transform the target with a sigmoid.") + raise ValueError(f"Sigmoid transformation is not possible for {self.task} tasks. 
\ + Please change `task` to \"regress\" or set `target_transform` to `False`.") else: + y = None possible_targets = grp[targets.VALUES].keys() - raise ValueError(f"Target {self.target} missing in entry {entry_name} in file {fname}, possible targets are {possible_targets}." + - "\n Use the query class to add more target values to input data.") + if self.train_source is None: + raise ValueError(f"Target {self.target} missing in entry {entry_name} in file {fname}, possible targets are {possible_targets}." + + "\n Use the query class to add more target values to input data.") # positions pos = torch.tensor(grp[f"{Nfeat.NODE}/{Nfeat.POSITION}/"][()], dtype=torch.float).contiguous() diff --git a/deeprank2/trainer.py b/deeprank2/trainer.py index 70260d9e1..c5f0fc030 100644 --- a/deeprank2/trainer.py +++ b/deeprank2/trainer.py @@ -1,5 +1,7 @@ import copy +import inspect import logging +import re import warnings from time import time @@ -64,14 +66,8 @@ def __init__( # pylint: disable=too-many-arguments # noqa: MC0001 over the epochs. If None, defaults to :class:`HDF5OutputExporter`, which saves all the results in an .HDF5 file stored in ./output directory. Defaults to None. 
""" - - self.batch_size_train = None - self.batch_size_test = None - self.shuffle = None - - self._init_output_exporters(output_exporters) - self.neuralnet = neuralnet + self.pretrained_model = pretrained_model self._init_datasets(dataset_train, dataset_val, dataset_test, val_size, test_size) @@ -120,14 +116,22 @@ def __init__( # pylint: disable=too-many-arguments # noqa: MC0001 _log.info(f"CUDA device name is {torch.cuda.get_device_name(0)}.") _log.info(f"Number of GPUs set to {self.ngpu}.") - if pretrained_model is None: + self._init_output_exporters(output_exporters) + + # other attributes not set in init + self.data_type = None + self.batch_size_train = None + self.batch_size_test = None + self.shuffle = None + self.model_load_state_dict = None + + if self.pretrained_model is None: if self.dataset_train is None: raise ValueError("No training data specified. Training data is required if there is no pretrained model.") if self.neuralnet is None: raise ValueError("No neural network specified. Specifying a model framework is required if there is no pretrained model.") - self.classes = self.dataset_train.classes - self.classes_to_index = self.dataset_train.classes_to_index + self._init_from_dataset(self.dataset_train) self.optimizer = None self.class_weights = class_weights self.subset = self.dataset_train.subset @@ -159,20 +163,17 @@ def __init__( # pylint: disable=too-many-arguments # noqa: MC0001 "Please set clustering_method to 'mcl', 'louvain' or None. Default to 'mcl' \n\t") else: - if self.dataset_train is not None: - _log.warning("Pretrained model loaded: dataset_train will be ignored.") - if self.dataset_val is not None: - _log.warning("Pretrained model loaded: dataset_val will be ignored.") if self.neuralnet is None: raise ValueError("No neural network class found. Please add it to complete loading the pretrained model.") if self.dataset_test is None: raise ValueError("No dataset_test found. 
Please add it to evaluate the pretrained model.") - if self.target is None: - raise ValueError("No target set. Make sure the pretrained model explicitly defines the target to train against.") - - self.pretrained_model_path = pretrained_model - self.classes_to_index = self.dataset_test.classes_to_index - + if self.dataset_train is not None: + self.dataset_train = None + _log.warning("Pretrained model loaded: dataset_train will be ignored.") + if self.dataset_val is not None: + self.dataset_val = None + _log.warning("Pretrained model loaded: dataset_val will be ignored.") + self._init_from_dataset(self.dataset_test) self._load_params() self._load_pretrained_model() @@ -212,12 +213,6 @@ def _init_datasets( # pylint: disable=too-many-arguments else: _log.warning("Validation dataset was provided to Trainer; val_size parameter is ignored.") - # Copy settings from the dataset that we will use. - if self.dataset_train is not None: - self._init_from_dataset(self.dataset_train) - else: - self._init_from_dataset(self.dataset_test) - def _init_from_dataset(self, dataset: GraphDataset | GridDataset): if isinstance(dataset, GraphDataset): @@ -225,17 +220,26 @@ def _init_from_dataset(self, dataset: GraphDataset | GridDataset): self.node_features = dataset.node_features self.edge_features = dataset.edge_features self.features = None + self.features_transform = dataset.features_transform + self.means = dataset.means + self.devs = dataset.devs elif isinstance(dataset, GridDataset): self.clustering_method = None self.node_features = None self.edge_features = None self.features = dataset.features + self.features_transform = None + self.means = None + self.devs = None else: - raise TypeError(type(dataset)) + raise TypeError(f"Incorrect `dataset` type provided: {type(dataset)}. 
Please provide a `GridDataset` or `GraphDataset` object instead.") self.target = dataset.target + self.target_transform = dataset.target_transform self.task = dataset.task + self.classes = dataset.classes + self.classes_to_index = dataset.classes_to_index def _load_model(self): """Loads the neural network model.""" @@ -245,7 +249,7 @@ def _load_model(self): self.set_lossfunction() def _check_dataset_equivalence(self, dataset_train, dataset_val, dataset_test): - """Check train_dataset type, train parameter and dataset_train parameter settings.""" + """Check dataset_train type and train_source parameter settings.""" # dataset_train is None when pretrained_model is set if dataset_train is None: @@ -267,14 +271,14 @@ def _check_dataset_equivalence(self, dataset_train, dataset_val, dataset_test): def _check_dataset_value(self, dataset_train, dataset_check, type_dataset): """Check valid/test dataset settings.""" - # Check train parameter in valid/test is set as False. - if dataset_check.train is not False: - raise ValueError(f"""{type_dataset} dataset has train parameter {dataset_check.train} - Make sure to set it as False""") - # Check dataset_train parameter in valid/test is equivalent to train which passed to Trainer. - if dataset_check.dataset_train != dataset_train: - raise ValueError(f"""{type_dataset} dataset has different dataset_train parameter compared to the one given in Trainer. - Make sure to assign equivalent dataset_train in Trainer""") + # Check train_source parameter in valid/test is set. + if dataset_check.train_source is None: + raise ValueError(f"""{type_dataset} dataset has train_source parameter set to None. + Make sure to set it as a valid training data source.""") + # Check train_source parameter in valid/test is equivalent to train which passed to Trainer. + if dataset_check.train_source != dataset_train: + raise ValueError(f"""{type_dataset} dataset has different train_source parameter compared to the one given in Trainer. 
+ Make sure to assign equivalent train_source in Trainer""") def _load_pretrained_model(self): """ @@ -288,6 +292,7 @@ def _load_pretrained_model(self): self._put_model_to_device(self.dataset_test) # load the model and the optimizer state + self.optimizer = self.optimizer(self.model.parameters(), lr=self.lr, weight_decay = self.weight_decay) self.optimizer.load_state_dict(self.opt_loaded_state_dict) self.model.load_state_dict(self.model_load_state_dict) @@ -527,6 +532,10 @@ def train( # pylint: disable=too-many-arguments, too-many-branches, too-many-loc filename (str, optional): Name of the file where to save the selected model. If not None, the model is saved to `filename`. If None, the model is not saved. Defaults to 'model.pth.tar'. """ + if self.dataset_train is None: + raise ValueError("No training dataset provided.") + + self.data_type = type(self.dataset_train) self.batch_size_train = batch_size self.shuffle = shuffle @@ -551,6 +560,9 @@ def train( # pylint: disable=too-many-arguments, too-many-branches, too-many-loc else: self.valid_loader = None _log.info("No validation set provided\n") + _log.warning( + "Training data will be used both for learning and model selection, which may lead to overfitting." + + "\nIt is usually preferable to use a validation set during the training phase.") # Assign weights to each class if self.task == targets.CLASSIF and self.class_weights: @@ -626,9 +638,6 @@ def train( # pylint: disable=too-many-arguments, too-many-branches, too-many-loc # if no validation set, save the best performing model on the training set if best_model: if min(train_losses) == loss_: - _log.warning( - "Training data is used both for learning and model selection, which will to overfitting." 
+ - "\n\tIt is preferable to use an independent training and validation data sets.") checkpoint_model = self._save_model() saved_model = True self.epoch_saved_model = epoch @@ -652,7 +661,7 @@ def train( # pylint: disable=too-many-arguments, too-many-branches, too-many-loc self.optimizer.load_state_dict(self.opt_loaded_state_dict) self.model.load_state_dict(self.model_load_state_dict) - def _epoch(self, epoch_number: int, pass_name: str) -> float: + def _epoch(self, epoch_number: int, pass_name: str) -> float | None: """ Runs a single epoch @@ -700,7 +709,7 @@ def _epoch(self, epoch_number: int, pass_name: str) -> float: if count_predictions > 0: epoch_loss = sum_of_losses / count_predictions else: - epoch_loss = 0.0 + epoch_loss = None self._output_exporters.process( pass_name, epoch_number, entry_names, outputs, target_vals, epoch_loss) @@ -713,7 +722,7 @@ def _eval( # pylint: disable=too-many-locals loader: DataLoader, epoch_number: int, pass_name: str - ) -> float: + ) -> float | None: """ Evaluates the model @@ -748,6 +757,9 @@ def _eval( # pylint: disable=too-many-locals loss_ = loss_func(pred, y) count_predictions += pred.shape[0] sum_of_losses += loss_.detach().item() * pred.shape[0] + else: + target_vals += [None] * pred.shape[0] + eval_loss = None # Get the outputs for export # Remember that non-linear activation is automatically applied in CrossEntropyLoss @@ -764,7 +776,7 @@ def _eval( # pylint: disable=too-many-locals if count_predictions > 0: eval_loss = sum_of_losses / count_predictions else: - eval_loss = 0.0 + eval_loss = None self._output_exporters.process( pass_name, epoch_number, entry_names, outputs, target_vals, eval_loss) @@ -831,6 +843,13 @@ def test( num_workers (int, optional): How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. Defaults to 0. 
""" + if (not self.pretrained_model) and (not self.model_load_state_dict): + raise ValueError( + """ + No pretrained model provided and no training performed. + Please provide a pretrained model or train the model before testing.\n + """) + self.batch_size_test = batch_size if self.dataset_test is not None: @@ -857,28 +876,38 @@ def _load_params(self): Loads the parameters of a pretrained model """ - state = torch.load(self.pretrained_model_path) + if torch.cuda.is_available(): + state = torch.load(self.pretrained_model) + else: + state = torch.load(self.pretrained_model, map_location=torch.device('cpu')) + self.data_type = state["data_type"] + self.model_load_state_dict = state["model_state"] + self.optimizer = type(state["optimizer"]) + self.opt_loaded_state_dict = state["optimizer_state"] + self.lossfunction = state["lossfunction"] self.target = state["target"] + self.target_transform = state["target_transform"] + self.task = state["task"] + self.classes = state["classes"] + self.classes_to_index = state["classes_to_index"] + self.class_weights = state["class_weights"] self.batch_size_train = state["batch_size_train"] self.batch_size_test = state["batch_size_test"] self.val_size = state["val_size"] self.test_size = state["test_size"] self.lr = state["lr"] self.weight_decay = state["weight_decay"] + self.epoch_saved_model = state["epoch_saved_model"] self.subset = state["subset"] - self.class_weights = state["class_weights"] - self.task = state["task"] - self.classes = state["classes"] self.shuffle = state["shuffle"] - self.optimizer = state["optimizer"] - self.opt_loaded_state_dict = state["optimizer_state"] - self.lossfunction = state["lossfunction"] - self.model_load_state_dict = state["model_state"] self.clustering_method = state["clustering_method"] self.node_features = state["node_features"] self.edge_features = state["edge_features"] self.features = state["features"] + self.features_transform = state["features_transform"] + self.means = state["means"] + 
self.devs = state["devs"] self.cuda = state["cuda"] self.ngpu = state["ngpu"] @@ -889,14 +918,27 @@ def _save_model(self): Args: filename (str, optional): Name of the file. Defaults to None. """ + features_transform_to_save = copy.deepcopy(self.features_transform) + # prepare transform dictionary for being saved + if features_transform_to_save: + for _, key in features_transform_to_save.items(): + if key['transform'] is None: + continue + str_expr = inspect.getsource(key['transform']) + match = re.search(r'\'transform\':.*(lambda.*).*,.*\'standardize\'.*', str_expr).group(1) + key['transform'] = match + state = { + "data_type": self.data_type, "model_state": self.model.state_dict(), "optimizer": self.optimizer, "optimizer_state": self.optimizer.state_dict(), "lossfunction": self.lossfunction, "target": self.target, + "target_transform": self.target_transform, "task": self.task, "classes": self.classes, + "classes_to_index": self.classes_to_index, "class_weights": self.class_weights, "batch_size_train": self.batch_size_train, "batch_size_test": self.batch_size_test, @@ -904,12 +946,16 @@ def _save_model(self): "test_size": self.test_size, "lr": self.lr, "weight_decay": self.weight_decay, + "epoch_saved_model": self.epoch_saved_model, "subset": self.subset, "shuffle": self.shuffle, "clustering_method": self.clustering_method, "node_features": self.node_features, "edge_features": self.edge_features, "features": self.features, + "features_transform": features_transform_to_save, + "means": self.means, + "devs": self.devs, "cuda": self.cuda, "ngpu": self.ngpu } diff --git a/deeprank2/utils/graph.py b/deeprank2/utils/graph.py index fafa7519d..5d2f540c9 100644 --- a/deeprank2/utils/graph.py +++ b/deeprank2/utils/graph.py @@ -299,9 +299,9 @@ def write_as_grid_to_hdf5( # store target values with h5py.File(hdf5_path, 'a') as hdf5_file: - entry_group = hdf5_file[id_] + grp = hdf5_file[id_] - targets_group = entry_group.require_group(targets.VALUES) + targets_group = 
grp.require_group(targets.VALUES) for target_name, target_data in self.targets.items(): if target_name not in targets_group: targets_group.create_dataset(target_name, data=target_data) diff --git a/docs/getstarted.md b/docs/getstarted.md index e028d8558..893fff15e 100644 --- a/docs/getstarted.md +++ b/docs/getstarted.md @@ -187,14 +187,12 @@ dataset_train = GraphDataset( dataset_val = GraphDataset( hdf5_path = hdf5_paths, subset = valid_ids, - train = False, - dataset_train = dataset_train + train_source = dataset_train ) dataset_test = GraphDataset( hdf5_path = hdf5_paths, subset = test_ids, - train = False, - dataset_train = dataset_train + train_source = dataset_train ) ``` @@ -243,7 +241,7 @@ dataset_train = GraphDataset( ) ``` -If `standardize` functionality is used, validation and testing sets need to know the interested features' means and standard deviations in order to use the same values for standardizing validation and testing features. This can be done using `train` and `dataset_train` parameters of the `GraphDataset` class. Example: +If `standardize` functionality is used, validation and testing sets need to know the interested features' means and standard deviations in order to use the same values for standardizing validation and testing features. This can be done using `train_source` parameter of the `GraphDataset` class. 
Example: ```python features_transform = {'all': @@ -252,7 +250,7 @@ features_transform = {'all': train_ids = [] valid_ids = [] test_ids = [] -# `train` defaults to `True`, and `dataset_train` defaults to `None` +# `train_source` defaults to `None` dataset_train = GraphDataset( hdf5_path = hdf5_path, subset = train_ids, @@ -264,14 +262,12 @@ dataset_train = GraphDataset( dataset_val = GraphDataset( hdf5_path = hdf5_paths, subset = valid_ids, - train = False, - dataset_train = dataset_train # dataset_train means and stds will be used + train_source = dataset_train # dataset_train means and stds will be used ) dataset_test = GraphDataset( hdf5_path = hdf5_paths, subset = test_ids, - train = False, - dataset_train = dataset_train # dataset_train means and stds will be used + train_source = dataset_train # dataset_train means and stds will be used ) ``` @@ -298,14 +294,12 @@ dataset_train = GridDataset( dataset_val = GridDataset( hdf5_path = hdf5_paths, subset = valid_ids, - train = False, - dataset_train = dataset_train + train_source = dataset_train ) dataset_test = GridDataset( hdf5_path = hdf5_paths, subset = test_ids, - train = False, - dataset_train = dataset_train + train_source = dataset_train ) ``` @@ -411,3 +405,67 @@ fig.update_layout( title='Loss vs epochs' ) ``` + +## Run a pre-trained model on new data + +If you want to run a pre-trained model on new PDB files, the first step is to process and save them into HDF5 files. 
Let's suppose that the model has been trained with `ProteinProteinInterfaceQuery` queries mapped to graphs: + +```python +from deeprank2.query import QueryCollection, ProteinProteinInterfaceQuery + +queries = QueryCollection() + +# Append data points +queries.add(ProteinProteinInterfaceQuery( + pdb_path = "", + chain_id1 = "A", + chain_id2 = "B" +)) +queries.add(ProteinProteinInterfaceQuery( + pdb_path = "", + chain_id1 = "A", + chain_id2 = "B" +)) + +hdf5_paths = queries.process( + "/", + feature_modules = 'all') +``` + +Then, the GraphDataset instance for the newly processed data can be created. Do this by specifying the path for the pre-trained model in `train_source`, together with the path to the HDF5 files just created. Note that there is no need to set the dataset's parameters, since they are inherited from the information saved in the pre-trained model. + +```python +from deeprank2.dataset import GraphDataset + +dataset_test = GraphDataset( + hdf5_path = "/", + train_source = "" +) +``` + +Finally, the `Trainer` instance can be defined and the new data can be tested: + +```python +from deeprank2.trainer import Trainer +from deeprank2.neuralnets.gnn.naive_gnn import NaiveNetwork +from deeprank2.utils.exporters import HDF5OutputExporter + +trainer = Trainer( + NaiveNetwork, + dataset_test = dataset_test, + pretrained_model = "", + output_exporters = [HDF5OutputExporter("")] +) + +trainer.test() +``` + +The results can then be read into a Pandas DataFrame and visualized: + +```python +import os +import pandas as pd + +output = pd.read_hdf(os.path.join("", "output_exporter.hdf5"), key="testing") +output.head() +``` diff --git a/tests/data/hdf5/test_no_target.hdf5 b/tests/data/hdf5/test_no_target.hdf5 new file mode 100644 index 000000000..d51cbccec Binary files /dev/null and b/tests/data/hdf5/test_no_target.hdf5 differ diff --git a/tests/data/pretrained/testing_graph_model.pth.tar b/tests/data/pretrained/testing_graph_model.pth.tar new file mode 100644 index 
000000000..37363794e Binary files /dev/null and b/tests/data/pretrained/testing_graph_model.pth.tar differ diff --git a/tests/data/pretrained/testing_grid_model.pth.tar b/tests/data/pretrained/testing_grid_model.pth.tar new file mode 100644 index 000000000..2f71237b1 Binary files /dev/null and b/tests/data/pretrained/testing_grid_model.pth.tar differ diff --git a/tests/test_dataset.py b/tests/test_dataset.py index 92406b205..a33995194 100644 --- a/tests/test_dataset.py +++ b/tests/test_dataset.py @@ -7,6 +7,7 @@ import h5py import numpy as np import pytest +import torch from torch_geometric.loader import DataLoader from deeprank2.dataset import GraphDataset, GridDataset, save_hdf5_keys @@ -208,7 +209,7 @@ def test_classification_griddataset(self): # 1 entry with class value assert dataset[0].y.shape == (1,) - def test_inherit_info_training_griddataset(self): + def test_inherit_info_dataset_train_griddataset(self): dataset_train = GridDataset( hdf5_path = self.hdf5_path, @@ -221,19 +222,14 @@ def test_inherit_info_training_griddataset(self): dataset_test = GridDataset( hdf5_path = self.hdf5_path, - train = False, - dataset_train = dataset_train + train_source = dataset_train ) - # features, features_dict, target, target_transform, task, and classes - # in the test should be inherited from the train - inherited_param = ["features", "features_dict", "target", "target_transform", "task", "classes"] - _check_inherited_params(inherited_param, dataset_train, dataset_test) + _check_inherited_params(dataset_test.inherited_params, dataset_train, dataset_test) dataset_test = GridDataset( hdf5_path = self.hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, features = [Efeat.DISTANCE, Efeat.COVALENT, Efeat.SAMECHAIN], target = targets.IRMSD, target_transform = True, @@ -241,19 +237,107 @@ def test_inherit_info_training_griddataset(self): classes = None ) - # features, features_dict, target, target_transform, task, and classes - # in the 
test should be inherited from the train - _check_inherited_params(inherited_param, dataset_train, dataset_test) + _check_inherited_params(dataset_test.inherited_params, dataset_train, dataset_test) + + def test_inherit_info_pretrained_model_griddataset(self): + + # Test the inheritance not giving in any parameters + pretrained_model = "tests/data/pretrained/testing_grid_model.pth.tar" + dataset_test = GridDataset( + hdf5_path = self.hdf5_path, + train_source = pretrained_model + ) + + data = torch.load(pretrained_model, map_location=torch.device('cpu')) + + dataset_test_vars = vars(dataset_test) + for param in dataset_test.inherited_params: + assert dataset_test_vars[param] == data[param] + + # Test that even when different parameters from the training data are given, the inheritance works + dataset_test = GridDataset( + hdf5_path = self.hdf5_path, + train_source = pretrained_model, + features = [Efeat.DISTANCE, Efeat.COVALENT, Efeat.SAMECHAIN], + target = targets.IRMSD, + target_transform = True, + task = targets.REGRESS, + classes = None + ) + + ## features, target, target_transform, task, and classes + ## in the test should be inherited from the pre-trained model + dataset_test_vars = vars(dataset_test) + for param in dataset_test.inherited_params: + assert dataset_test_vars[param] == data[param] + + def test_no_target_dataset_griddataset(self): + hdf5_no_target = "tests/data/hdf5/test_no_target.hdf5" + hdf5_target = "tests/data/hdf5/1ATN_ppi.hdf5" + pretrained_model = "tests/data/pretrained/testing_grid_model.pth.tar" + + dataset = GridDataset( + hdf5_path = hdf5_no_target, + train_source = pretrained_model + ) + + assert dataset.target is not None + assert dataset.get(0).y is None + + # no target set, training mode + with self.assertRaises(ValueError): + dataset = GridDataset( + hdf5_path = hdf5_no_target, + ) + + # target set, but not present in the file + with self.assertRaises(ValueError): + dataset = GridDataset( + hdf5_path = hdf5_target, + target = 
'CAPRI' + ) + + def test_filter_griddataset(self): + + # filtering out all values + with self.assertRaises(IndexError): + GridDataset( + hdf5_path=self.hdf5_path, + subset=None, + target=targets.IRMSD, + target_filter={targets.IRMSD: "<10"} + ) + # filter out some values + dataset = GridDataset( + hdf5_path=self.hdf5_path, + subset=None, + target=targets.IRMSD, + target_filter={targets.IRMSD: ">15"} + ) + assert len(dataset) == 3 def test_filter_graphdataset(self): - GraphDataset( + + # filtering out all values + with self.assertRaises(IndexError): + GraphDataset( + hdf5_path=self.hdf5_path, + subset=None, + node_features=node_feats, + edge_features=[Efeat.DISTANCE], + target=targets.IRMSD, + target_filter={targets.IRMSD: "<10"} + ) + # filter out some values + dataset = GraphDataset( hdf5_path=self.hdf5_path, subset=None, node_features=node_feats, edge_features=[Efeat.DISTANCE], target=targets.IRMSD, - target_filter={targets.IRMSD: "<10"} + target_filter={targets.IRMSD: ">15"} ) + assert len(dataset) == 3 def test_multi_file_graphdataset(self): dataset = GraphDataset( @@ -304,12 +388,20 @@ def test_subset_graphdataset(self): n = 10 subset = hdf5_keys[:n] - dataset = GraphDataset( + dataset_train = GraphDataset( hdf5_path = "tests/data/hdf5/train.hdf5", subset = subset, + target = targets.BINARY ) - assert n == len(dataset) + dataset_test = GraphDataset( + hdf5_path = "tests/data/hdf5/train.hdf5", + subset = subset, + train_source = dataset_train + ) + + assert n == len(dataset_train) + assert n == len(dataset_test) hdf5.close() @@ -468,8 +560,7 @@ def test_logic_train_graphdataset(self):# noqa: MC0001, pylint: disable=too-many dataset_test = GraphDataset( hdf5_path = hdf5_path, target = 'binary', - train = False, - dataset_train = dataset_train + train_source = dataset_train ) # mean and devs should be None assert dataset_train.means == dataset_test.means @@ -477,14 +568,6 @@ def test_logic_train_graphdataset(self):# noqa: MC0001, pylint: disable=too-many assert 
dataset_train.means is None assert dataset_train.devs is None - # raise error if dataset_train is not provided - with self.assertRaises(TypeError): - GraphDataset( - hdf5_path = hdf5_path, - target = 'binary', - train = False - ) - # raise error if dataset_train is of the wrong type with self.assertRaises(TypeError): @@ -495,8 +578,7 @@ def test_logic_train_graphdataset(self):# noqa: MC0001, pylint: disable=too-many GraphDataset( hdf5_path = hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, target = 'binary', ) @@ -886,8 +968,7 @@ def test_features_transform_logic_graphdataset(self): dataset_test = GraphDataset( hdf5_path = hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, target = 'binary' ) @@ -901,8 +982,7 @@ def test_features_transform_logic_graphdataset(self): dataset_test = GraphDataset( hdf5_path = hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, features_transform = other_feature_transform, target = 'binary' ) @@ -927,7 +1007,7 @@ def test_invalid_value_features_transform(self): warnings.filterwarnings('ignore', r'divide by zero encountered in divide') _compute_features_with_get(hdf5_path, transf_dataset) - def test_inherit_info_training_graphdataset(self): + def test_inherit_info_dataset_train_graphdataset(self): hdf5_path = "tests/data/hdf5/train.hdf5" feature_transform = {'all': {'transform': None, 'standardize': True}} @@ -944,19 +1024,53 @@ def test_inherit_info_training_graphdataset(self): dataset_test = GraphDataset( hdf5_path = hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, + ) + + _check_inherited_params(dataset_test.inherited_params, dataset_train, dataset_test) + + dataset_test = GraphDataset( + hdf5_path = hdf5_path, + train_source = dataset_train, + node_features = "all", + edge_features = "all", + features_transform = None, + target = 'BA', + target_transform = True, + 
task = "regress", + classes = None + ) + + _check_inherited_params(dataset_test.inherited_params, dataset_train, dataset_test) + + def test_inherit_info_pretrained_model_graphdataset(self): + + hdf5_path = "tests/data/hdf5/test.hdf5" + pretrained_model = "tests/data/pretrained/testing_graph_model.pth.tar" + dataset_test = GraphDataset( + hdf5_path = hdf5_path, + train_source = pretrained_model ) - # node_features, edge_features, features_dict, feature_transform, target, target_transform, task, and classes - # in the test should be inherited from the train - inherited_param = ["node_features", "edge_features", "features_dict", "features_transform", "target", "target_transform", "task", "classes"] - _check_inherited_params(inherited_param, dataset_train, dataset_test) + data = torch.load(pretrained_model, map_location=torch.device('cpu')) + if data["features_transform"]: + for _, key in data["features_transform"].items(): + if key['transform'] is None: + continue + key['transform'] = eval(key['transform']) # pylint: disable=eval-used + + dataset_test_vars = vars(dataset_test) + for param in dataset_test.inherited_params: + if param == 'features_transform': + for item, key in data[param].items(): + assert key['transform'].__code__.co_code == dataset_test_vars[param][item]['transform'].__code__.co_code + assert key['standardize'] == dataset_test_vars[param][item]['standardize'] + else: + assert dataset_test_vars[param] == data[param] dataset_test = GraphDataset( hdf5_path = hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = pretrained_model, node_features = "all", edge_features = "all", features_transform = None, @@ -966,9 +1080,42 @@ def test_inherit_info_training_graphdataset(self): classes = None ) - # node_features, edge_features, features_dict, feature_transform, target, target_transform, task, and classes - # in the test should be inherited from the train - _check_inherited_params(inherited_param, dataset_train, dataset_test) + # 
node_features, edge_features, feature_transform, target, target_transform, task, and classes + # in the test should be inherited from the pre-trained model + dataset_test_vars = vars(dataset_test) + for param in dataset_test.inherited_params: + if param == 'features_transform': + for item, key in data[param].items(): + assert key['transform'].__code__.co_code == dataset_test_vars[param][item]['transform'].__code__.co_code + assert key['standardize'] == dataset_test_vars[param][item]['standardize'] + else: + assert dataset_test_vars[param] == data[param] + + def test_no_target_dataset_graphdataset(self): + hdf5_no_target = "tests/data/hdf5/test_no_target.hdf5" + hdf5_target = "tests/data/hdf5/test.hdf5" + pretrained_model = "tests/data/pretrained/testing_graph_model.pth.tar" + + dataset = GraphDataset( + hdf5_path = hdf5_no_target, + train_source = pretrained_model + ) + + assert dataset.target is not None + assert dataset.get(0).y is None + + # no target set, training mode + with self.assertRaises(ValueError): + dataset = GraphDataset( + hdf5_path = hdf5_no_target + ) + + # target set, but not present in the file + with self.assertRaises(ValueError): + dataset = GraphDataset( + hdf5_path = hdf5_target, + target = 'CAPRI' + ) def test_incompatible_dataset_train_type(self): dataset_train = GraphDataset( @@ -981,8 +1128,41 @@ def test_incompatible_dataset_train_type(self): with pytest.raises(TypeError): GridDataset( hdf5_path = "tests/data/hdf5/1ATN_ppi.hdf5", - train = False, - dataset_train = dataset_train + train_source = dataset_train + ) + + def test_invalid_pretrained_model_path(self): + + hdf5_graph = "tests/data/hdf5/test.hdf5" + with self.assertRaises(ValueError): + GraphDataset( + hdf5_path = hdf5_graph, + train_source = hdf5_graph + ) + + hdf5_grid = "tests/data/hdf5/1ATN_ppi.hdf5" + with self.assertRaises(ValueError): + GridDataset( + hdf5_path = hdf5_grid, + train_source = hdf5_grid + ) + + def test_invalid_pretrained_model_data_type(self): + + hdf5_graph 
= "tests/data/hdf5/test.hdf5" + pretrained_grid_model = "tests/data/pretrained/testing_grid_model.pth.tar" + with self.assertRaises(TypeError): + GraphDataset( + hdf5_path = hdf5_graph, + train_source = pretrained_grid_model + ) + + hdf5_grid = "tests/data/hdf5/1ATN_ppi.hdf5" + pretrained_graph_model = "tests/data/pretrained/testing_graph_model.pth.tar" + with self.assertRaises(TypeError): + GridDataset( + hdf5_path = hdf5_grid, + train_source = pretrained_graph_model ) diff --git a/tests/test_integration.py b/tests/test_integration.py index 26422cd2a..579d292bc 100644 --- a/tests/test_integration.py +++ b/tests/test_integration.py @@ -82,14 +82,12 @@ def test_cnn(): # pylint: disable=too-many-locals dataset_val = GridDataset( hdf5_path = hdf5_paths, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, ) dataset_test = GridDataset( hdf5_path = hdf5_paths, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, ) output_exporters = [HDF5OutputExporter(output_directory)] @@ -165,15 +163,13 @@ def test_gnn(): # pylint: disable=too-many-locals dataset_val = GraphDataset( hdf5_path = hdf5_paths, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, clustering_method = "mcl" ) dataset_test = GraphDataset( hdf5_path = hdf5_paths, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, clustering_method = "mcl" ) @@ -247,8 +243,7 @@ def test_nan_loss_cases(validate, best_model, hdf5_files_for_nan): dataset_valid = GraphDataset( hdf5_path = hdf5_files_for_nan, subset = [mols[0]], - dataset_train=dataset_train, - train=False + train_source=dataset_train ) trainer = Trainer( diff --git a/tests/test_query.py b/tests/test_query.py index 3415a30ca..9a3a7ffb8 100644 --- a/tests/test_query.py +++ b/tests/test_query.py @@ -40,37 +40,39 @@ def _check_graph_makes_sense( os.close(f) try: + g.targets[targets.BINARY] = 0 g.write_to_hdf5(tmp_path) with h5py.File(tmp_path, "r") as 
f5: - entry_group = f5[list(f5.keys())[0]] + grp = f5[list(f5.keys())[0]] for feature_name in node_feature_names: assert ( - entry_group[f"{Nfeat.NODE}/{feature_name}"][()].size > 0 + grp[f"{Nfeat.NODE}/{feature_name}"][()].size > 0 ), f"no {feature_name} feature" assert ( len( np.nonzero( - entry_group[f"{Nfeat.NODE}/{feature_name}"][()] + grp[f"{Nfeat.NODE}/{feature_name}"][()] ) ) > 0 ), f"{feature_name}: all zero" - assert entry_group[f"{Efeat.EDGE}/{Efeat.INDEX}"][()].shape[1] == 2, "wrong edge index shape" - assert entry_group[f"{Efeat.EDGE}/{Efeat.INDEX}"].shape[0] > 0, "no edge indices" + assert grp[f"{Efeat.EDGE}/{Efeat.INDEX}"][()].shape[1] == 2, "wrong edge index shape" + assert grp[f"{Efeat.EDGE}/{Efeat.INDEX}"].shape[0] > 0, "no edge indices" for feature_name in edge_feature_names: assert ( - entry_group[f"{Efeat.EDGE}/{feature_name}"][()].shape[0] - == entry_group[f"{Efeat.EDGE}/{Efeat.INDEX}"].shape[0] + grp[f"{Efeat.EDGE}/{feature_name}"][()].shape[0] + == grp[f"{Efeat.EDGE}/{Efeat.INDEX}"].shape[0] ), f"not enough edge {feature_name} feature values" - count_edges_hdf5 = entry_group[f"{Efeat.EDGE}/{Efeat.INDEX}"].shape[0] + count_edges_hdf5 = grp[f"{Efeat.EDGE}/{Efeat.INDEX}"].shape[0] - dataset = GraphDataset(hdf5_path=tmp_path) + dataset = GraphDataset(hdf5_path=tmp_path, target=targets.BINARY) torch_data_entry = dataset[0] + assert torch_data_entry is not None # expecting twice as many edges, because torch is directional @@ -351,7 +353,7 @@ def test_augmentation(): assert len(entry_names) == expected_entry_count, f"Found {len(entry_names)} entries, expected {expected_entry_count}" - dataset = GridDataset(hdf5_path) + dataset = GridDataset(hdf5_path, target = 'binary') assert len(dataset) == expected_entry_count, f"Found {len(dataset)} data points, expected {expected_entry_count}" finally: diff --git a/tests/test_set_lossfunction.py b/tests/test_set_lossfunction.py index 91d099a14..1ad42a732 100644 --- a/tests/test_set_lossfunction.py +++ 
b/tests/test_set_lossfunction.py @@ -4,13 +4,13 @@ import warnings import pytest -from deeprank2.dataset import GraphDataset -from deeprank2.neuralnets.gnn.naive_gnn import NaiveNetwork -from deeprank2.trainer import Trainer from torch import nn +from deeprank2.dataset import GraphDataset from deeprank2.domain import losstypes as losses from deeprank2.domain import targetstorage as targets +from deeprank2.neuralnets.gnn.naive_gnn import NaiveNetwork +from deeprank2.trainer import Trainer hdf5_path = 'tests/data/hdf5/test.hdf5' diff --git a/tests/test_trainer.py b/tests/test_trainer.py index 94fa1afcc..2cf6e151f 100644 --- a/tests/test_trainer.py +++ b/tests/test_trainer.py @@ -7,9 +7,14 @@ import warnings import h5py +import pandas as pd import pytest import torch + from deeprank2.dataset import GraphDataset, GridDataset +from deeprank2.domain import edgestorage as Efeat +from deeprank2.domain import nodestorage as Nfeat +from deeprank2.domain import targetstorage as targets from deeprank2.neuralnets.cnn.model3d import CnnClassification, CnnRegression from deeprank2.neuralnets.gnn.foutnet import FoutNet from deeprank2.neuralnets.gnn.ginet import GINet @@ -19,10 +24,6 @@ from deeprank2.utils.exporters import (HDF5OutputExporter, ScatterPlotExporter, TensorboardBinaryClassificationExporter) -from deeprank2.domain import edgestorage as Efeat -from deeprank2.domain import nodestorage as Nfeat -from deeprank2.domain import targetstorage as targets - _log = logging.getLogger(__name__) default_features = [Nfeat.RESTYPE, Nfeat.POLARITY, Nfeat.BSA, Nfeat.RESDEPTH, Nfeat.HSE, Nfeat.INFOCONTENT, Nfeat.PSSM] @@ -56,8 +57,7 @@ def _model_base_test( # pylint: disable=too-many-arguments, too-many-locals if val_hdf5_path is not None: dataset_val = GraphDataset( hdf5_path = val_hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, clustering_method = clustering_method, ) else: @@ -66,8 +66,7 @@ def _model_base_test( # pylint: 
disable=too-many-arguments, too-many-locals if test_hdf5_path is not None: dataset_test = GraphDataset( hdf5_path = test_hdf5_path, - train = False, - dataset_train = dataset_train, + train_source = dataset_train, clustering_method = clustering_method, ) else: @@ -383,6 +382,29 @@ def test_incompatible_pretrained_no_Net(self): pretrained_model = self.save_path ) + def test_no_training_no_pretrained(self): + dataset_train = GraphDataset( + hdf5_path = "tests/data/hdf5/test.hdf5", + clustering_method = "mcl", + target = targets.BINARY, + ) + dataset_val = GraphDataset( + hdf5_path = "tests/data/hdf5/test.hdf5", + train_source = dataset_train + ) + dataset_test = GraphDataset( + hdf5_path = "tests/data/hdf5/test.hdf5", + train_source = dataset_train + ) + trainer = Trainer( + neuralnet = GINet, + dataset_train = dataset_train, + dataset_val = dataset_val, + dataset_test = dataset_test + ) + with pytest.raises(ValueError): + trainer.test() + def test_no_valid_provided(self): dataset = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", @@ -397,6 +419,25 @@ def test_no_valid_provided(self): assert len(trainer.train_loader) == int(0.75 * len(dataset)) assert len(trainer.valid_loader) == int(0.25 * len(dataset)) + def test_no_test_provided(self): + dataset_train = GraphDataset( + hdf5_path = "tests/data/hdf5/test.hdf5", + clustering_method = "mcl", + target = targets.BINARY, + ) + dataset_val = GraphDataset( + hdf5_path = "tests/data/hdf5/test.hdf5", + train_source = dataset_train + ) + trainer = Trainer( + neuralnet = GINet, + dataset_train = dataset_train, + dataset_val = dataset_val, + ) + trainer.train(batch_size = 1, best_model=False, filename=None) + with pytest.raises(ValueError): + trainer.test() + def test_no_valid_full_train(self): dataset = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", @@ -438,7 +479,7 @@ def test_optim(self): dataset_test=dataset, pretrained_model=self.save_path) - assert isinstance(trainer_pretrained.optimizer, optimizer) + assert 
str(type(trainer_pretrained.optimizer)) == "" assert trainer_pretrained.lr == lr assert trainer_pretrained.weight_decay == weight_decay @@ -497,37 +538,14 @@ def test_dataset_equivalence_no_pretrained(self): dataset_train = dataset_invalid_train, ) - # Raise error when train parameter in dataset_val/test not set as False. dataset_train = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", edge_features = [Efeat.DISTANCE, Efeat.COVALENT], target = targets.BINARY ) - dataset_val = GraphDataset( - hdf5_path = "tests/data/hdf5/test.hdf5", - train = True, - dataset_train = dataset_train - ) - dataset_test = GraphDataset( - hdf5_path = "tests/data/hdf5/test.hdf5", - train = True, - dataset_train = dataset_train - ) - with pytest.raises(ValueError): - Trainer( - neuralnet = GINet, - dataset_train = dataset_train, - dataset_val = dataset_val, - ) - with pytest.raises(ValueError): - Trainer( - neuralnet = GINet, - dataset_train = dataset_train, - dataset_test = dataset_test, - ) - # Raise error when dataset_train parameter in GraphDataset/GridDataset - # not equivalent to the dataset_train passed to Trainer. + # Raise error when train_source parameter in GraphDataset/GridDataset + # is not equivalent to the dataset_train passed to Trainer. 
dataset_train_other = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", edge_features = [Efeat.SAMECHAIN, Efeat.COVALENT], @@ -536,13 +554,11 @@ def test_dataset_equivalence_no_pretrained(self): ) dataset_val = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", - train = False, - dataset_train = dataset_train + train_source = dataset_train ) dataset_test = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", - train = False, - dataset_train = dataset_train + train_source = dataset_train ) with pytest.raises(ValueError): Trainer( @@ -592,7 +608,7 @@ def test_trainsize(self): for t in test_cases: dataset_train, dataset_val =_divide_dataset( - dataset = GraphDataset(hdf5_path = hdf5), + dataset = GraphDataset(hdf5_path = hdf5, target = targets.BINARY), splitsize = t, ) assert len(dataset_train) == n_train @@ -622,12 +638,12 @@ def test_invalid_trainsize(self): def test_invalid_cuda_ngpus(self): dataset_train = GraphDataset( - hdf5_path = "tests/data/hdf5/test.hdf5" + hdf5_path = "tests/data/hdf5/test.hdf5", + target = targets.BINARY ) dataset_val = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", - train = False, - dataset_train = dataset_train + train_source = dataset_train ) with pytest.raises(ValueError): @@ -641,12 +657,12 @@ def test_invalid_cuda_ngpus(self): def test_invalid_no_cuda_available(self): if not torch.cuda.is_available(): dataset_train = GraphDataset( - hdf5_path = "tests/data/hdf5/test.hdf5" + hdf5_path = "tests/data/hdf5/test.hdf5", + target = targets.BINARY ) dataset_val = GraphDataset( hdf5_path = "tests/data/hdf5/test.hdf5", - train = False, - dataset_train = dataset_train + train_source = dataset_train ) with pytest.raises(ValueError): @@ -661,6 +677,132 @@ def test_invalid_no_cuda_available(self): warnings.warn('CUDA is available; test_invalid_no_cuda_available was skipped') _log.info('CUDA is available; test_invalid_no_cuda_available was skipped') + def test_train_method_no_train(self): + + # Graphs data + test_data_graph = 
"tests/data/hdf5/test.hdf5" + pretrained_model_graph = "tests/data/pretrained/testing_graph_model.pth.tar" + + dataset_test = GraphDataset( + hdf5_path = test_data_graph, + train_source = pretrained_model_graph + ) + trainer = Trainer( + neuralnet = NaiveNetwork, + dataset_test = dataset_test, + pretrained_model = pretrained_model_graph + ) + + with pytest.raises(ValueError): + trainer.train() + + # Grids data + test_data_grid = "tests/data/hdf5/1ATN_ppi.hdf5" + pretrained_model_grid = "tests/data/pretrained/testing_grid_model.pth.tar" + + dataset_test = GridDataset( + hdf5_path = test_data_grid, + train_source = pretrained_model_grid + ) + trainer = Trainer( + neuralnet = CnnClassification, + dataset_test = dataset_test, + pretrained_model = pretrained_model_grid + ) + + with pytest.raises(ValueError): + trainer.train() + + def test_test_method_pretrained_model_on_dataset_with_target(self): + + # Graphs data + test_data_graph = "tests/data/hdf5/test.hdf5" + pretrained_model_graph = "tests/data/pretrained/testing_graph_model.pth.tar" + + dataset_test = GraphDataset( + hdf5_path = test_data_graph, + train_source = pretrained_model_graph + ) + + trainer = Trainer( + neuralnet = NaiveNetwork, + dataset_test = dataset_test, + pretrained_model = pretrained_model_graph, + output_exporters = [HDF5OutputExporter("./")] + ) + + trainer.test() + + output = pd.read_hdf("output_exporter.hdf5", key="testing") + assert len(output) == len(dataset_test) + + # Grids data + test_data_grid = "tests/data/hdf5/1ATN_ppi.hdf5" + pretrained_model_grid = "tests/data/pretrained/testing_grid_model.pth.tar" + + dataset_test = GridDataset( + hdf5_path = test_data_grid, + train_source = pretrained_model_grid + ) + + trainer = Trainer( + neuralnet = CnnClassification, + dataset_test = dataset_test, + pretrained_model = pretrained_model_grid, + output_exporters = [HDF5OutputExporter("./")] + ) + + trainer.test() + + output = pd.read_hdf("output_exporter.hdf5", key="testing") + assert len(output) 
== len(dataset_test) + + def test_test_method_pretrained_model_on_dataset_without_target(self): + # Graphs data + test_data_graph = "tests/data/hdf5/test_no_target.hdf5" + pretrained_model_graph = "tests/data/pretrained/testing_graph_model.pth.tar" + + dataset_test = GraphDataset( + hdf5_path = test_data_graph, + train_source = pretrained_model_graph + ) + + trainer = Trainer( + neuralnet = NaiveNetwork, + dataset_test = dataset_test, + pretrained_model = pretrained_model_graph, + output_exporters = [HDF5OutputExporter("./")] + ) + + trainer.test() + + output = pd.read_hdf("output_exporter.hdf5", key="testing") + assert len(output) == len(dataset_test) + assert output.target.unique().tolist()[0] is None + assert output.loss.unique().tolist()[0] is None + + # Grids data + test_data_grid = "tests/data/hdf5/test_no_target.hdf5" + pretrained_model_grid = "tests/data/pretrained/testing_grid_model.pth.tar" + + dataset_test = GridDataset( + hdf5_path = test_data_grid, + train_source = pretrained_model_grid + ) + + trainer = Trainer( + neuralnet = CnnClassification, + dataset_test = dataset_test, + pretrained_model = pretrained_model_grid, + output_exporters = [HDF5OutputExporter("./")] + ) + + trainer.test() + + output = pd.read_hdf("output_exporter.hdf5", key="testing") + assert len(output) == len(dataset_test) + assert output.target.unique().tolist()[0] is None + assert output.loss.unique().tolist()[0] is None if __name__ == "__main__": diff --git a/tests/utils/test_graph.py b/tests/utils/test_graph.py index a413026b8..8d5563f1e 100644 --- a/tests/utils/test_graph.py +++ b/tests/utils/test_graph.py @@ -89,11 +89,11 @@ def test_graph_write_to_hdf5(graph): # check the contents of the hdf5 file with h5py.File(hdf5_path, "r") as f5: - entry_group = f5[entry_id] + grp = f5[entry_id] # nodes - assert Nfeat.NODE in entry_group - node_features_group = entry_group[Nfeat.NODE] + assert Nfeat.NODE in grp + node_features_group = grp[Nfeat.NODE] assert node_feature_narray in 
node_features_group assert len(np.nonzero( node_features_group[node_feature_narray][()])) > 0 @@ -102,8 +102,8 @@ def test_graph_write_to_hdf5(graph): 2, ) # edges - assert Efeat.EDGE in entry_group - edge_features_group = entry_group[Efeat.EDGE] + assert Efeat.EDGE in grp + edge_features_group = grp[Efeat.EDGE] assert edge_feature_narray in edge_features_group assert len(np.nonzero( edge_features_group[edge_feature_narray][()])) > 0 @@ -112,7 +112,7 @@ def test_graph_write_to_hdf5(graph): assert len(np.nonzero(edge_features_group[Efeat.INDEX][()])) > 0 # target - assert entry_group[Target.VALUES][target_name][()] == target_value + assert grp[Target.VALUES][target_name][()] == target_value finally: shutil.rmtree(tmp_dir_path) # clean up after the test @@ -138,11 +138,11 @@ def test_graph_write_as_grid_to_hdf5(graph): # check the contents of the hdf5 file with h5py.File(hdf5_path, "r") as f5: - entry_group = f5[entry_id] + grp = f5[entry_id] # mapped features - assert gridstorage.MAPPED_FEATURES in entry_group - mapped_group = entry_group[gridstorage.MAPPED_FEATURES] + assert gridstorage.MAPPED_FEATURES in grp + mapped_group = grp[gridstorage.MAPPED_FEATURES] ## narray features for feature_name in [ f"{node_feature_narray}_000", f"{node_feature_narray}_001", @@ -161,7 +161,7 @@ def test_graph_write_as_grid_to_hdf5(graph): assert np.all(data.shape == tuple(grid_settings.points_counts)) # target - assert entry_group[Target.VALUES][target_name][()] == target_value + assert grp[Target.VALUES][target_name][()] == target_value finally: shutil.rmtree(tmp_dir_path) # clean up after the test @@ -200,17 +200,17 @@ def test_graph_augmented_write_as_grid_to_hdf5(graph): with h5py.File(hdf5_path, "r") as f5: assert list( f5.keys()) == [entry_id, f"{entry_id}_000", f"{entry_id}_001"] - entry_group = f5[entry_id] - mapped_group = entry_group[gridstorage.MAPPED_FEATURES] + grp = f5[entry_id] + mapped_group = grp[gridstorage.MAPPED_FEATURES] # check that the feature value is 
preserved after augmentation unaugmented_data = mapped_group[node_feature_singleton][:] for aug_id in [f"{entry_id}_000", f"{entry_id}_001"]: - entry_group = f5[aug_id] + grp = f5[aug_id] # mapped features - assert gridstorage.MAPPED_FEATURES in entry_group - mapped_group = entry_group[gridstorage.MAPPED_FEATURES] + assert gridstorage.MAPPED_FEATURES in grp + mapped_group = grp[gridstorage.MAPPED_FEATURES] ## narray features for feature_name in [ f"{node_feature_narray}_000", @@ -237,7 +237,7 @@ def test_graph_augmented_write_as_grid_to_hdf5(graph): np.sum(unaugmented_data)).item() < 0.2 # target - assert entry_group[Target.VALUES][target_name][( + assert grp[Target.VALUES][target_name][( )] == target_value finally: diff --git a/tutorials/training.ipynb b/tutorials/training.ipynb index 0ed71ed1d..6e0340970 100644 --- a/tutorials/training.ipynb +++ b/tutorials/training.ipynb @@ -231,7 +231,7 @@ "- For the `GraphDataset` class it is possible to define a dictionary to indicate which transformations to apply to the features, being the transformations lambda functions and/or standardization.\n", " - If the `standardize` key is `True`, standardization is applied after transformation. Standardization consists in applying the following formula on each feature's value: ${x' = \\frac{x - \\mu}{\\sigma}}$, being ${\\mu}$ the mean and ${\\sigma}$ the standard deviation. Standardization is a scaling method where the values are centered around mean with a unit standard deviation.\n", " - The transformation to apply can be speficied as a lambda function as a value of the key `transform`, which defaults to `None`.\n", - " - Since in the provided example standardization is applied, the training features' means and standard deviations need to be used for scaling validation and test sets. For doing so, `train` and `dataset_train` parameters are used. 
When `train` is set as `False`, a `dataset_train` of the same class must be provided and it will be used to scale the validation/testing sets according to its features values. You need to pass `features_transform` to the training dataset only, since in other cases it will be ignored and only the one of `dataset_train` will be considered. \n", + " - Since in the provided example standardization is applied, the training features' means and standard deviations need to be used for scaling validation and test sets. To do so, the `train_source` parameter is used. When `train_source` is set, it is used to scale the validation/testing sets. You need to pass `features_transform` to the training dataset only, since for the other datasets it is ignored and only the one from `train_source` is considered. \n", " - Note that transformations have not currently been implemented for the `GridDataset` class. \n", " - In the example below a logarithmic transformation and then the standardization are applied to all the features. It is also possible to use specific features as keys for indicating that transformation and/or standardization need to be apply to few features only."
] @@ -262,15 +262,13 @@ "dataset_val = GraphDataset(\n", " hdf5_path = input_data_path,\n", " subset = list(df_valid.entry), # selects only data points with ids in df_valid.entry\n", - " train = False,\n", - " dataset_train = dataset_train\n", + " train_source = dataset_train\n", ")\n", "print('\\nLoading test data...')\n", "dataset_test = GraphDataset(\n", " hdf5_path = input_data_path,\n", " subset = list(df_test.entry), # selects only data points with ids in df_test.entry\n", - " train = False,\n", - " dataset_train = dataset_train\n", + " train_source = dataset_train\n", ")" ] }, @@ -552,15 +550,13 @@ "dataset_val = GridDataset(\n", " hdf5_path = input_data_path,\n", " subset = list(df_valid.entry), # selects only data points with ids in df_valid.entry\n", - " train = False,\n", - " dataset_train = dataset_train\n", + " train_source = dataset_train\n", ")\n", "print('\\nLoading test data...')\n", "dataset_test = GridDataset(\n", " hdf5_path = input_data_path,\n", " subset = list(df_test.entry), # selects only data points with ids in df_test.entry\n", - " train = False,\n", - " dataset_train = dataset_train \n", + " train_source = dataset_train \n", ")" ] },