Learning¶
This module contains all the tools for deep learning in DeepRank. The two main modules are deeprank.learn.DataSet and deeprank.learn.NeuralNet. The DataSet class processes several hdf5 files created by the deeprank.generate toolset for use by PyTorch. This is done by creating several torch.utils.data.DataLoader instances for the training, validation and test of the model. Several options are available to specify and filter which conformations should be used in the dataset. The NeuralNet class is in charge of the deep learning part. It also offers several options to specify the task to be performed, the architecture of the neural network, etc.
Example:
>>> from deeprank.learn import *
>>> from model3d import cnn_class
>>>
>>> database = '1ak4.hdf5'
>>>
>>> # declare the dataset instance
>>> data_set = DataSet(database,
...                    chain1='C',
...                    chain2='D',
...                    select_feature='all',
...                    select_target='IRMSD',
...                    dict_filter={'IRMSD':'<4. or >10.'})
>>>
>>>
>>> # create the network
>>> model = NeuralNet(data_set, cnn_class, model_type='3d', task='class')
>>>
>>> # start the training
>>> model.train(nepoch = 250,divide_trainset=0.8, train_batch_size = 50, num_workers=8)
>>>
>>> # save the model
>>> model.save_model()
The details of the submodules are presented here. The two main ones are deeprank.learn.DataSet and deeprank.learn.NeuralNet.
Note: The module deeprank.learn.modelGenerator can automatically create the file defining the neural network architecture.
DataSet: create a torch dataset¶
-
class
deeprank.learn.DataSet.
DataSet
(train_database, valid_database=None, test_database=None, chain1='A', chain2='B', mapfly=True, grid_info=None, use_rotation=None, select_feature='all', select_target='DOCKQ', normalize_features=True, normalize_targets=True, target_ordering=None, dict_filter=None, pair_chain_feature=None, transform_to_2D=False, projection=0, clip_features=True, clip_factor=1.5, rotation_seed=None, tqdm=False, process=True)[source]¶ Generates the dataset needed for pytorch.
This class handles the data generated by deeprank.generate to be used in the deep learning part of DeepRank.
Parameters: - train_database (list(str)) – names of the hdf5 files used for the training/validation. Example: ['1AK4.hdf5', '1B7W.hdf5', …]
- valid_database (list(str)) – names of the hdf5 files used for the validation. Example: ['1ACB.hdf5', '4JHF.hdf5', …]
- test_database (list(str)) – names of the hdf5 files used for the test. Example: ['7CEI.hdf5']
- chain1 (str) – first chain ID, defaults to ‘A’
- chain2 (str) – second chain ID, defaults to ‘B’
- mapfly (bool) – whether to compute the maps on the fly during batch preparation (True) or read precomputed maps (False)
- grid_info (dict) –
grid information used to map the features. If None, the original grid points are used. The dict contains:
- 'number_of_points', the shape of the grid
- 'resolution', the resolution of the grid, in Å
Example
{'number_of_points': [10, 10, 10], 'resolution': [3, 3, 3]}
- use_rotation (int) – number of rotations to use. Example: 0 (use only original data) Default: None (use all data of the database)
- select_feature (dict or 'all', optional) – Select the features used in the learning. Default: 'all'
If mapfly is True:
- {'AtomicDensities': 'all', 'Features': 'all'}
- {'AtomicDensities': config.atom_vdw_radius_noH, 'Features': ['PSSM_*', 'pssm_ic_*']}
If mapfly is False:
- {'AtomicDensities_ind': 'all', 'Feature_ind': 'all'}
- {'Feature_ind': ['PSSM_*', 'pssm_ic_*']}
- select_target (str,optional) – Specify required target. Default: ‘DOCKQ’
- normalize_features (Bool, optional) – normalize features or not Default: True
- normalize_targets (Bool, optional) – normalize targets or not Default: True
- target_ordering (str) – 'lower' (the lower the better) or 'higher' (the higher the better). If not specified (None, the default), the code tries to identify the ordering; if identification fails, 'lower' is used.
- dict_filter (None or dict, optional) – Specify whether to filter the complexes based on target values. Example: {'IRMSD': '<4. or >10'} (select complexes with IRMSD lower than 4 or larger than 10). Default: None
- pair_chain_feature (None or callable, optional) – method to pair the features of chainA and chainB. Example: np.sum (sum the chainA and chainB features)
- transform_to_2D (bool, optional) – Boolean to use 2d maps instead of full 3d Default: False
- projection (int) – Projection axis from 3D to 2D: Mapping: 0 -> yz, 1 -> xz, 2 -> xy Default = 0
- clip_features (bool, optional) – Remove excessively large values from the grid. This can be needed for native complexes, where the Coulomb feature might be too large
- clip_factor (float, optional) – the features are clipped at mean ± clip_factor * std
- tqdm (bool, optional) – Print the progress bar
- process (bool, optional) – Actually process the data set. Must be set to False when reusing a model for testing
- rotation_seed (int, optional) – random seed for getting rotation axis and angle.
Examples
>>> import numpy as np
>>> from deeprank.learn import *
>>> train_database = '1ak4.hdf5'
>>> data_set = DataSet(train_database,
...                    valid_database=None,
...                    test_database=None,
...                    chain1='C',
...                    chain2='D',
...                    grid_info={
...                        'number_of_points': (10, 10, 10),
...                        'resolution': (3, 3, 3)
...                    },
...                    select_feature={
...                        'AtomicDensities': 'all',
...                        'Features': ['PSSM_*', 'pssm_ic_*']
...                    },
...                    select_target='IRMSD',
...                    normalize_features=True,
...                    normalize_targets=True,
...                    pair_chain_feature=np.add,
...                    dict_filter={'IRMSD': '<4. or >10.'},
...                    process=True)
-
static
_get_database_name
(database)[source]¶ Get the list of hdf5 database file names.
Parameters: database (None, str or list(str)) – hdf5 database name(s). Returns: hdf5 file names Return type: list
-
process_dataset
()[source]¶ Process the data set.
Done by default. However, it must be turned off when one wants to test a pretrained model. This can be done by setting process=False when creating the DataSet instance.
-
create_index_molecules
()[source]¶ Create the indexing of each molecule in the dataset.
Create the indexing: [('1ak4.hdf5', '1AK4_100w'), …, ('1fqj.hdf5', '1FGJ_400w')]. This allows one to refer to a complex by its index in the list.
Raises: ValueError
– No available training data after filtering.
-
_select_pdb
(mol_names)[source]¶ Select complexes.
Parameters: mol_names (list) – list of complex names Returns: list of selected complexes Return type: list
-
filter
(molgrp)[source]¶ Filter the molecule according to a dictionary, e.g. dict_filter={'DOCKQ': '>0.1', 'IRMSD': '<=4 or >10'}.
The filter is based on the attribute self.dict_filter, which must be either None or of the form {'name': cond}.
Parameters: molgrp (str) – group name of the molecule in the hdf5 file Returns: True if we keep the complex, False otherwise Return type: bool Raises: ValueError
– If an unsupported condition is provided
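The condition strings used in dict_filter can be illustrated with a small standalone evaluator. This is only a sketch of the idea, not DeepRank's implementation: the function names (check_condition, keep_complex), the set of accepted operators, and the choice to require that every key of the dictionary is satisfied are assumptions made for the example.

```python
import operator
import re

# Comparison operators understood by this sketch (an assumption, not DeepRank's list).
_OPS = {'<=': operator.le, '>=': operator.ge, '==': operator.eq,
        '<': operator.lt, '>': operator.gt}
# Two-character operators come first in the alternation so '<=' is not read as '<'.
_TERM = re.compile(r'(<=|>=|==|<|>)\s*(\S+)$')


def _check_term(value, term):
    """Evaluate a single comparison such as '<4.' against value."""
    match = _TERM.match(term.strip())
    if match is None:
        raise ValueError(f'Unsupported condition: {term!r}')
    symbol, number = match.groups()
    # float() raises ValueError as well if the threshold is not a number.
    return _OPS[symbol](value, float(number))


def check_condition(value, cond):
    """Return True if value satisfies cond, e.g. '<4. or >10.'."""
    for or_part in re.split(r'\s+or\s+', cond.strip()):
        and_terms = re.split(r'\s+and\s+', or_part)
        if all(_check_term(value, term) for term in and_terms):
            return True
    return False


def keep_complex(targets, dict_filter):
    """Keep a complex only if all filtered targets satisfy their condition.

    targets is a plain dict of target values (hypothetical stand-in for
    the values stored in the hdf5 molecule group).
    """
    if dict_filter is None:
        return True
    return all(check_condition(targets[name], cond)
               for name, cond in dict_filter.items())
```

With dict_filter={'IRMSD': '<4. or >10.'}, a complex with an IRMSD of 2.5 or 12.0 is kept and one with an IRMSD of 6.0 is discarded.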
-
get_mapped_feature_name
()[source]¶ Get actual mapped feature names for feature selections.
Note
- class parameter self.select_feature examples:
- ‘all’
- {'AtomicDensities_ind': 'all', 'Feature_ind': 'all'}
- {‘Feature_ind’: [‘PSSM_*’, ‘pssm_ic_*’]}
- Feature type must be: ‘AtomicDensities_ind’ or ‘Feature_ind’.
Raises:
-
get_raw_feature_name
()[source]¶ Get actual raw feature names for feature selections.
Note
- class parameter self.select_feature examples:
- ‘all’
- {'AtomicDensities': 'all', 'Features': 'all'}
- {'AtomicDensities': config.atom_vdw_radius_noH, 'Features': ['PSSM_*', 'pssm_ic_*']}
- Feature type must be: ‘AtomicDensities’ or ‘Features’.
Raises:
-
get_input_shape
()[source]¶ Get the size of the data and input.
Note
- self.data_shape: shape of the raw 3d data set
- self.input_shape: input size of the CNN. Potentially after 2d transformation.
-
get_grid_shape
()[source]¶ Get the shape of the matrices.
Raises: ValueError
– If no grid shape is provided or is present in the HDF5 file
-
_get_target_ordering
(order)[source]¶ Determine the ordering of the target.
The ordering is either 'lower' (the lower the better) or 'higher' (the higher the better). If it cannot be determined, 'lower' is assumed.
-
backtransform_target
(data)[source]¶ Returns the values of the target after de-normalization.
Parameters: data (list(float)) – normalized data Returns: un-normalized data Return type: list(float)
-
_normalize_target
(target)[source]¶ Normalize the values of the targets.
Parameters: target (list(float)) – raw data Returns: normalized data Return type: list(float)
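The relation between _normalize_target and backtransform_target can be pictured as a transformation and its inverse. The sketch below assumes a min-max scaling to the range [0, 1]; the actual scaling used by DeepRank is not stated in this page, and the function names and the explicit vmin/vmax arguments are hypothetical.

```python
def normalize_target(values, vmin, vmax):
    """Scale raw target values to [0, 1] (min-max scaling, an assumption)."""
    span = vmax - vmin
    if span == 0:
        raise ValueError('vmin and vmax must differ')
    return [(v - vmin) / span for v in values]


def backtransform_target(values, vmin, vmax):
    """Invert normalize_target and recover values in the original units."""
    span = vmax - vmin
    return [v * span + vmin for v in values]
```

Whatever scaling is used, the property that matters is the round trip: back-transforming normalized predictions yields values comparable with the raw targets.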
-
_normalize_feature
(feature)[source]¶ Normalize the values of the features.
Parameters: feature (np.array) – raw feature values Returns: normalized feature values Return type: np.array
-
_clip_feature
(feature)[source]¶ Clip the values of the features at mean ± clip_factor * std.
Parameters: feature (np.array) – raw feature values Returns: clipped feature values Return type: np.array
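The effect of clip_factor can be shown on a plain list of numbers. This sketch assumes the clipping bounds are mean - clip_factor * std and mean + clip_factor * std, and it computes the mean and standard deviation from the values passed in; where DeepRank takes these statistics from is not described here, so treat that as an assumption. The function name clip_values is hypothetical.

```python
from statistics import mean, pstdev


def clip_values(values, clip_factor=1.5):
    """Clip values to the interval mean +/- clip_factor * std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    low = mu - clip_factor * sigma
    high = mu + clip_factor * sigma
    # Values outside [low, high] are replaced by the nearest bound.
    return [min(max(v, low), high) for v in values]
```

For the values [0, 0, 0, 0, 100] the mean is 20 and the standard deviation is 40, so with clip_factor=1.5 the outlier 100 is reduced to the upper bound 80 while the zeros are unchanged.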
-
static
_mad_based_outliers
(points, minv, maxv, thresh=3.5)[source]¶ Mean absolute deviation based outlier detection.
(Experimental).
Parameters: - points (np.array) – raw input data
- minv (float) – Minimum (negative) value requested
- maxv (float) – Maximum (positive) value requested
- thresh (float, optional) – Threshold for data detection
Returns: data where outliers were replaced by min/max values
Return type: TYPE
-
load_one_molecule
(fname, mol=None)[source]¶ Load the feature/target of a single molecule.
Parameters: Returns: features, targets
Return type: np.array,float
-
map_one_molecule
(fname, mol=None, angle=None, axis=None)[source]¶ Map the feature and load feature/target of a single molecule.
Parameters: Returns: features, targets
Return type: np.array,float
-
static
convert2d
(feature, proj2d)[source]¶ Convert the 3D volumetric feature to a 2D planar data set.
proj2d specifies the dimension that we want to consider as channel. For example, for proj2d = 0 the 2D images are in the yz plane and the stack along the x dimension is considered as extra channels.
Parameters: - feature (np.array) – raw features
- proj2d (int) – projection
Returns: projected features
Return type: np.array
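The projection described above can be sketched with nested lists, which keeps the example free of dependencies. It assumes a feature indexed as feature[channel][x][y][z] and returns one 2D plane per (channel, slice) pair; the real method works on np.array data and its exact memory layout may differ, so this only illustrates the mapping 0 -> yz, 1 -> xz, 2 -> xy.

```python
def convert2d(feature, proj2d):
    """Turn a 4D feature[c][x][y][z] into a list of 2D planes.

    The axis selected by proj2d is merged into the channel dimension.
    """
    planes = []
    for channel in feature:
        nx, ny, nz = len(channel), len(channel[0]), len(channel[0][0])
        if proj2d == 0:
            # yz planes, one per x position
            planes.extend([[channel[x][y][z] for z in range(nz)]
                           for y in range(ny)] for x in range(nx))
        elif proj2d == 1:
            # xz planes, one per y position
            planes.extend([[channel[x][y][z] for z in range(nz)]
                           for x in range(nx)] for y in range(ny))
        elif proj2d == 2:
            # xy planes, one per z position
            planes.extend([[channel[x][y][z] for y in range(ny)]
                           for x in range(nx)] for z in range(nz))
        else:
            raise ValueError('proj2d must be 0, 1 or 2')
    return planes
```

A feature with C channels on an nx × ny × nz grid therefore becomes C * nx planes of size ny × nz for proj2d = 0, and analogously for the other two axes.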
-
static
make_feature_pair
(feature, op)[source]¶ Pair the features of both chains.
Parameters: - feature (np.array) – raw features
- op (callable) – function to combine the features
Returns: combined features
Return type: np.array
Raises: ValueError
– if op is not callable
-
get_grid
(mol_data)[source]¶ Get meshed grids and number of points.
Parameters: mol_data (h5 group) – HDF5 molecule group Returns: meshgrid, npts Return type: tuple, tuple Raises: ValueError
– Grid points not found in mol_data.
-
map_atomic_densities
(feat_names, mol_data, grid, npts, angle, axis)[source]¶ Map atomic densities.
Parameters: Returns: atomic densities of each atom type on each chain
Return type:
-
static
_densgrid
(center, vdw_radius, grid, npts)[source]¶ Function to map individual atomic density on the grid.
The formula is equation (1) of the Koes paper Protein-Ligand Scoring with Convolutional NN Arxiv:1612.02751v1
Parameters: Returns: np.array (mapped density)
Return type: TYPE
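For orientation, the density of a single atom as a function of the distance to its center can be written as a scalar function. The sketch assumes the piecewise form usually quoted for equation (1) of the cited paper: a Gaussian inside the van der Waals radius, a quadratic tail up to 1.5 times the radius, and zero beyond. It should be checked against the paper and the DeepRank source before being relied upon, and the function name atomic_density is hypothetical.

```python
import math


def atomic_density(distance, vdw_radius):
    """Density contribution of one atom at a given distance from its center.

    Assumed form: exp(-2 d^2 / r^2) for d < r, a quadratic that reaches
    zero at d = 1.5 r for r <= d < 1.5 r, and 0 otherwise.
    """
    if distance < 0 or vdw_radius <= 0:
        raise ValueError('distance must be >= 0 and vdw_radius > 0')
    r = vdw_radius
    if distance < r:
        return math.exp(-2.0 * distance ** 2 / r ** 2)
    if distance < 1.5 * r:
        e2 = math.exp(2.0)
        return (4.0 * distance ** 2 / (e2 * r ** 2)
                - 12.0 * distance / (e2 * r)
                + 9.0 / e2)
    return 0.0
```

The two branches agree at distance = vdw_radius, where both give exp(-2), and the quadratic vanishes at 1.5 * vdw_radius, so the density is continuous. _densgrid evaluates a density of this kind on every grid point rather than at a single distance.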
-
static
_featgrid
(center, value, grid, npts)[source]¶ Map an individual feature (atomic or residue) on the grid.
Parameters: Returns: Mapped feature
Return type: np.array
Raises: ValueError
– Description
NeuralNet: perform deep learning¶
-
class
deeprank.learn.NeuralNet.
NeuralNet
(data_set, model, model_type='3d', proj2d=0, task='reg', class_weights=None, pretrained_model=None, chain1='A', chain2='B', cuda=False, ngpu=0, plot=False, save_hitrate=False, save_classmetrics=False, outdir='./')[source]¶ Train a Convolutional Neural Network for DeepRank.
Parameters: - data_set (deeprank.DataSet or list(str)) – Data set used for training or testing: a deeprank.DataSet for training; a str or list(str), e.g. 'x.hdf5' or ['x1.hdf5', 'x2.hdf5'], for testing when a pretrained model is loaded.
- model (nn.Module) – Definition of the NN to use. Must subclass nn.Module. See examples in model2d.py and model3d.py
- model_type (str) – Type of model we want to use. Must be ‘2d’ or ‘3d’. If we specify a 2d model, the data set is automatically converted to the correct format.
- proj2d (int) – Defines how to slice the 3D volumetric data to generate 2D data. Allowed values are 0, 1 and 2, which are to slice along the YZ, XZ or XY plane, respectively.
- task (str 'reg' or 'class') – Task to perform: 'reg' for regression or 'class' for classification. The loss function, the target datatype and the plot functions are automatically adjusted depending on the task.
- class_weights (Tensor) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size #classes. Only applicable to the 'class' task.
- pretrained_model (str) – Saved model to be used for further training or testing. When using a pretrained model, remember to set 'chain1' and 'chain2' for the new data.
- chain1 (str) – first chain ID of new data when using pretrained model
- chain2 (str) – second chain ID of new data when using pretrained model
- cuda (bool) – Use CUDA.
- ngpu (int) – number of GPUs to be used.
- plot (bool) – Plot the prediction results.
- save_hitrate (bool) – Save and plot hit rate.
- save_classmetrics (bool) – Save and plot classification metrics. Classification metrics include accuracy (ACC), sensitivity (TPR) and specificity (TNR).
- outdir (str) – output directory
Raises: - ValueError – if dataset format is not recognized
- ValueError – if task is not recognized
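The three classification metrics named for save_classmetrics have standard definitions in terms of the confusion matrix. The sketch below computes them for binary labels (1 = positive, 0 = negative); the function name and the dict it returns are hypothetical and do not reflect how DeepRank stores these values.

```python
def classification_metrics(targets, predictions):
    """Compute accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    nan = float('nan')
    return {
        'ACC': (tp + tn) / len(pairs) if pairs else nan,
        'TPR': tp / (tp + fn) if (tp + fn) else nan,  # sensitivity
        'TNR': tn / (tn + fp) if (tn + fp) else nan,  # specificity
    }
```

When near-native decoys are rare, accuracy alone can look good for a model that predicts everything as negative, which is why sensitivity and specificity are reported alongside it.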
Examples
Train models:
>>> data_set = DataSet(...)
>>> model = NeuralNet(data_set, cnn,
...                   model_type='3d', task='reg',
...                   plot=True, save_hitrate=True,
...                   outdir='./out/')
>>> model.train(nepoch=50, divide_trainset=0.8,
...             train_batch_size=5, num_workers=0)
Test a model on new data:
>>> data_set = ['test01.hdf5', 'test02.hdf5']
>>> model = NeuralNet(data_set, cnn,
...                   pretrained_model='./model.pth.tar',
...                   outdir='./out/')
>>> model.test()
-
train
(nepoch=50, divide_trainset=None, hdf5='epoch_data.hdf5', train_batch_size=10, preshuffle=True, preshuffle_seed=None, export_intermediate=True, num_workers=1, save_model='best', save_epoch='intermediate', hit_cutoff=None)[source]¶ Perform a simple training of the model.
Parameters: - nepoch (int, optional) – number of iterations
- divide_trainset (list, optional) – the fractions assigned to the training, validation and test sets. Examples: [0.7, 0.2, 0.1], [0.8, 0.2], None
- hdf5 (str, optional) – file to store the training results
- train_batch_size (int, optional) – size of the batch
- preshuffle (bool, optional) – preshuffle the dataset before dividing it.
- preshuffle_seed (int, optional) – set random seed for preshuffle
- export_intermediate (bool, optional) – export data at intermediate epochs.
- num_workers (int, optional) – number of workers to be used to prepare the batch data
- save_model (str, optional) – ‘best’ or ‘all’, save only the best model or all models.
- save_epoch (str, optional) – ‘intermediate’ or ‘all’, save the epochs data to HDF5.
- hit_cutoff (float, optional) – the cutoff used to define hit by comparing with docking models’ target value, e.g. IRMSD value
-
test
(hdf5='test_data.hdf5', hit_cutoff=None, has_target=False)[source]¶ Test a predefined model on a new dataset.
Parameters: - hdf5 (str, optional) – hdf5 file to store the test results
- hit_cutoff (float, optional) – the cutoff used to define hit by comparing with docking models’ target value, e.g. IRMSD value
- has_target (bool, optional) – specify the presence (True) or absence (False) of target values in the test set. No metrics can be computed if False.
Examples
>>> # address of the database
>>> database = '1ak4.hdf5'
>>> # Load the model in a new network instance
>>> model = NeuralNet(database, cnn,
...                   pretrained_model='./model/model.pth.tar',
...                   outdir='./test/')
>>> # test the model
>>> model.test()
-
save_model
(filename='model.pth.tar')[source]¶ Save the model to disk.
Parameters: filename (str, optional) – name of the file
-
_divide_dataset
(divide_set, preshuffle, preshuffle_seed)[source]¶ Divide the data set into training, validation and test according to the percentage in divide_set.
Parameters: Returns: - Indices of the
training/validation/test set.
Return type:
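Splitting a data set by fractions, with an optional seeded shuffle, can be sketched as follows. The argument names mirror the signature above, but the body is an illustration under assumptions: the rounding rule, the handling of a two-element divide_set (no test set), and the n_items argument are choices made for this example rather than DeepRank's behavior.

```python
import random


def divide_dataset(n_items, divide_set, preshuffle=True, preshuffle_seed=None):
    """Return (index_train, index_valid, index_test) for n_items samples."""
    if len(divide_set) not in (2, 3):
        raise ValueError('divide_set must contain 2 or 3 fractions')
    if abs(sum(divide_set) - 1.0) > 1e-9:
        raise ValueError('fractions in divide_set must sum to 1')

    indices = list(range(n_items))
    if preshuffle:
        # A dedicated Random instance keeps the shuffle reproducible
        # without touching the global random state.
        random.Random(preshuffle_seed).shuffle(indices)

    n_train = round(divide_set[0] * n_items)
    index_train = indices[:n_train]
    if len(divide_set) == 2:
        # Everything left after training goes to validation.
        return index_train, indices[n_train:], []

    n_valid = round(divide_set[1] * n_items)
    index_valid = indices[n_train:n_train + n_valid]
    index_test = indices[n_train + n_valid:]
    return index_train, index_valid, index_test
```

Fixing preshuffle_seed makes the split reproducible between runs, which matters when comparing models trained on the same data.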
-
_train
(index_train, index_valid, index_test, nepoch=50, train_batch_size=5, export_intermediate=False, num_workers=1, save_epoch='intermediate', save_model='best')[source]¶ Train the model.
Parameters: - index_train (list(int)) – Indices of the training set
- index_valid (list(int)) – Indices of the validation set
- index_test (list(int)) – Indices of the testing set
- nepoch (int, optional) – number of epochs
- train_batch_size (int, optional) – size of the batch
- export_intermediate (bool, optional) – export intermediate data
- num_workers (int, optional) – number of workers PyTorch uses to create the batches
- save_epoch (str,optional) – ‘intermediate’ or ‘all’
- save_model (str, optional) – ‘all’ or ‘best’
Returns: Parameters of the network after training
Return type: torch.tensor
-
_epoch
(data_loader, train_model, has_target=True)[source]¶ Perform one single epoch iteration over a data loader.
Parameters: - data_loader (torch.DataLoader) – DataLoader for the epoch
- train_model (bool) – train the model if True or not if False
Returns: loss of the model dict: data of the epoch
Return type:
-
_get_variables
(inputs, targets)[source]¶ Convert the features/targets into torch.Variables.
The format is different for regression where the targets are float and classification where they are int.
Parameters: - inputs (np.array) – raw features
- targets (np.array) – raw target values
Returns: features torch.Variable: target values
Return type: torch.Variable
-
_export_losses
(figname)[source]¶ Plot the losses vs the epoch.
Parameters: figname (str) – name of the file where to export the figure
-
_plot_scatter_reg
(figname)[source]¶ Plot a scatter plot of predictions vs. targets.
Useful to visualize the performance of the training algorithm.
Parameters: figname (str) – filename
-
_plot_boxplot_class
(figname)[source]¶ Plot a boxplot of predictions vs. targets.
It is only useful in classification tasks.
Parameters: figname (str) – filename
-
plot_hit_rate
(figname)[source]¶ Plot the hit rate of the different training/valid/test sets.
The hit rate is defined as the percentage of positive (near-native) decoys that are included among the top m decoys.
Parameters: figname (str) – filename for the plot
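The hit rate definition can be made concrete with a short function. It takes a list of booleans ordered from the best-ranked to the worst-ranked decoy, where True marks a hit (for example a decoy whose IRMSD is below hit_cutoff), and returns the hit rate for every m as a fraction between 0 and 1. The function name and input format are hypothetical.

```python
def hit_rate(is_hit_ranked):
    """Cumulative fraction of all hits found among the top m decoys.

    is_hit_ranked[i] is True if the decoy ranked at position i + 1
    is a hit. Element m - 1 of the result is the hit rate at top m.
    """
    total_hits = sum(1 for hit in is_hit_ranked if hit)
    if total_hits == 0:
        # No positive decoy at all: the hit rate stays at zero.
        return [0.0] * len(is_hit_ranked)
    rates = []
    found = 0
    for hit in is_hit_ranked:
        if hit:
            found += 1
        rates.append(found / total_hits)
    return rates
```

For a ranking [True, False, True, False] the result is [0.5, 0.5, 1.0, 1.0]: half of the hits are found in the top 1 and all of them in the top 3. A good scoring model reaches 1.0 for small m.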
modelGenerator: generate NN architecture¶
-
class
deeprank.learn.modelGenerator.
NetworkGenerator
(name='_tmp_model_', fname='_tmp_model_.py', conv_layers=None, fc_layers=None)[source]¶ Automatic generation of NN files.
This class allows for automatic generation of a Python file containing the definition of a torch-formatted neural network.
Parameters: Example
>>> from deeprank.learn.modelGenerator import *
>>>
>>> conv_layers = []
>>> conv_layers.append(conv(output_size=4,kernel_size=2,post='relu'))
>>> conv_layers.append(pool(kernel_size=2))
>>> conv_layers.append(conv(input_size=4,output_size=5,kernel_size=2,post='relu'))
>>> conv_layers.append(pool(kernel_size=2))
>>>
>>> fc_layers = []
>>> fc_layers.append(fc(output_size=84,post='relu'))
>>> fc_layers.append(fc(input_size=84,output_size=1))
>>>
>>> MG = NetworkGenerator(name='test',fname='model_test.py',
...                       conv_layers=conv_layers,fc_layers=fc_layers)
>>> MG.print()
>>> MG.write()
-
class
deeprank.learn.modelGenerator.
conv
(input_size=-1, output_size=None, kernel_size=None, post=None)[source]¶ Wrapper around the convolutional layer.
Parameters: Example:
>>> conv_layers.append(conv(output_size=4,kernel_size=2,post='relu'))
-
class
deeprank.learn.modelGenerator.
pool
(kernel_size=None, post=None)[source]¶ Wrapper around the pool layer.
Parameters: Example:
>>> conv_layers.append(pool(kernel_size=2))
-
class
deeprank.learn.modelGenerator.
dropout
(percent=0.5)[source]¶ Wrapper around the dropout layer.
Parameters: percent (float) – percent of dropout Example:
>>> fc_layers.append(dropout(percent=0.25))