Learning

This module contains all the tools for deep learning in DeepRank. The two main modules are deeprank.learn.DataSet and deeprank.learn.NeuralNet. The DataSet class processes one or several hdf5 files created by the deeprank.generate toolset for use by PyTorch. It does so by creating torch.data_utils.DataLoader instances for the training, validation and test of the model. Several options are available to specify and filter which conformations should be used in the dataset. The NeuralNet class is in charge of the deep learning part. Here as well, several options allow specifying the task to be performed, the architecture of the neural network, etc.

Example:

>>> from deeprank.learn import *
>>> from model3d import cnn_class
>>>
>>> database = '1ak4.hdf5'
>>>
>>> # declare the dataset instance
>>> data_set = DataSet(database,
>>>                    chain1='C',
>>>                    chain2='D',
>>>                    select_feature='all',
>>>                    select_target='IRMSD',
>>>                    dict_filter={'IRMSD':'<4. or >10.'})
>>>
>>>
>>> # create the network
>>> model = NeuralNet(data_set, cnn_class, model_type='3d', task='class')
>>>
>>> # start the training
>>> model.train(nepoch=250, divide_trainset=0.8, train_batch_size=50, num_workers=8)
>>>
>>> # save the model
>>> model.save_model()

The details of the submodules are presented here. The two main ones are deeprank.learn.DataSet and deeprank.learn.NeuralNet.

Note: The module deeprank.learn.modelGenerator can automatically create the file defining the neural network architecture.

DataSet: create a torch dataset

class deeprank.learn.DataSet.DataSet(train_database, valid_database=None, test_database=None, chain1='A', chain2='B', mapfly=True, grid_info=None, use_rotation=None, select_feature='all', select_target='DOCKQ', normalize_features=True, normalize_targets=True, target_ordering=None, dict_filter=None, pair_chain_feature=None, transform_to_2D=False, projection=0, clip_features=True, clip_factor=1.5, rotation_seed=None, tqdm=False, process=True)[source]

Generates the dataset needed for pytorch.

This class handles the data generated by deeprank.generate so that it can be used in the deep learning part of DeepRank.

Parameters:
  • train_database (list(str)) – names of the hdf5 files used for training/validation. Example: ['1AK4.hdf5', '1B7W.hdf5', …]
  • valid_database (list(str)) – names of the hdf5 files used for validation. Example: ['1ACB.hdf5', '4JHF.hdf5', …]
  • test_database (list(str)) – names of the hdf5 files used for testing. Example: ['7CEI.hdf5']
  • chain1 (str) – first chain ID, defaults to 'A'
  • chain2 (str) – second chain ID, defaults to 'B'
  • mapfly (bool) – whether to compute the feature maps on the fly during batch preparation (True) or read precomputed maps from the hdf5 files (False)
  • grid_info (dict) –

    grid information used to map the features. If None, the original grid points are used. The dict contains:

    • 'number_of_points', the shape of the grid
    • 'resolution', the resolution of the grid, in Å

    Example

    {'number_of_points': [10, 10, 10], 'resolution': [3, 3, 3]}

  • use_rotation (int) – number of rotations to use. Example: 0 (use only the original data). Default: None (use all data in the database)
  • select_feature (dict or 'all', optional) – select the features used in the learning. If mapfly is True: {'AtomicDensities': 'all', 'Features': 'all'} or {'AtomicDensities': config.atom_vdw_radius_noH, 'Features': ['PSSM_*', 'pssm_ic_*']}. If mapfly is False: {'AtomicDensities_ind': 'all', 'Feature_ind': 'all'} or {'Feature_ind': ['PSSM_*', 'pssm_ic_*']}. Default: 'all'
  • select_target (str,optional) – Specify required target. Default: ‘DOCKQ’
  • normalize_features (bool, optional) – normalize the features or not. Default: True
  • normalize_targets (bool, optional) – normalize the targets or not. Default: True
  • target_ordering (str) – 'lower' (the lower the better) or 'higher' (the higher the better). By default it is not specified (None) and the code tries to identify it. If identification fails, 'lower' is used.
  • dict_filter (None or dict, optional) – filter the complexes based on target values. Example: {'IRMSD': '<4. or >10'} (select complexes with IRMSD lower than 4 or larger than 10). Default: None
  • pair_chain_feature (None or callable, optional) – method to pair features of chainA and chainB Example: np.sum (sum the chainA and chainB features)
  • transform_to_2D (bool, optional) – Boolean to use 2d maps instead of full 3d Default: False
  • projection (int) – Projection axis from 3D to 2D: Mapping: 0 -> yz, 1 -> xz, 2 -> xy Default = 0
  • clip_features (bool, optional) – remove very large values from the mapped features. Can be needed for native complexes, where the Coulomb feature might be too large
  • clip_factor (float, optional) – the features are clipped at ±(mean + clip_factor * std)
  • tqdm (bool, optional) – Print the progress bar
  • process (bool, optional) – Actually process the data set. Must be set to False when reusing a model for testing
  • rotation_seed (int, optional) – random seed for getting rotation axis and angle.

Examples

>>> from deeprank.learn import *
>>> import numpy as np
>>> train_database = '1ak4.hdf5'
>>> data_set = DataSet(train_database,
>>>                    valid_database = None,
>>>                    test_database = None,
>>>                    chain1='C',
>>>                    chain2='D',
>>>                    grid_info = {
>>>                        'number_of_points': (10, 10, 10),
>>>                        'resolution': (3, 3, 3)
>>>                    },
>>>                    select_feature = {
>>>                       'AtomicDensities': 'all',
>>>                       'Features': [
>>>                            'PSSM_*', 'pssm_ic_*' ]
>>>                    },
>>>                    select_target='IRMSD',
>>>                    normalize_features = True,
>>>                    normalize_targets=True,
>>>                    pair_chain_feature=np.add,
>>>                    dict_filter={'IRMSD':'<4. or >10.'},
>>>                    process = True)
static _get_database_name(database)[source]

Get the list of hdf5 database file names.

Parameters:database (None, str or list(str)) – hdf5 database name(s).
Returns:hdf5 file names
Return type:list
process_dataset()[source]

Process the data set.

This is done by default. However, it must be turned off when one wants to test a pretrained model, which is achieved by setting process=False when creating the DataSet instance.
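
For instance, to evaluate a pretrained model on new data, one would typically write something along these lines (a minimal sketch; the file name is a placeholder and cnn_class is the architecture used for training):

>>> from deeprank.learn import DataSet, NeuralNet
>>> # do not process the data yet: the pretrained model defines how to do it
>>> data_set = DataSet('new_data.hdf5', process=False)
>>> model = NeuralNet(data_set, cnn_class,
...                   pretrained_model='./model.pth.tar')
>>> model.test()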

static check_hdf5_files(database)[source]

Check that the data contained in the hdf5 files is valid.

create_index_molecules()[source]

Create the indexing of each molecule in the dataset.

Create the indexing: [('1ak4.hdf5', '1AK4_100w'), …, ('1fqj.hdf5', '1FGJ_400w')]. This allows referring to a complex by its index in the list.

Raises:ValueError – No available training data after filtering.
_select_pdb(mol_names)[source]

Select complexes.

Parameters:mol_names (list) – list of complex names
Returns:list of selected complexes
Return type:list
filter(molgrp)[source]

Filter the molecule according to a dictionary, e.g., dict_filter={'DOCKQ': '>0.1', 'IRMSD': '<=4 or >10'}.

The filter is based on the attribute self.dict_filter, which must either be of the form {'name': cond} or None.

Parameters:molgrp (str) – group name of the molecule in the hdf5 file
Returns:True if we keep the complex, False otherwise
Return type:bool
Raises:ValueError – If an unsupported condition is provided
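
The condition strings are plain comparison expressions combined with 'and'/'or'. A minimal sketch of how such a condition can be evaluated against a target value (a hypothetical helper, not DeepRank's actual code):

>>> def keep(value, cond):
...     # prepend 'value' to each comparison: '<4. or >10.' -> 'value<4. or value>10.'
...     expr = ' '.join('value' + tok if tok[0] in '<>=' else tok
...                     for tok in cond.split())
...     return eval(expr)
>>> keep(3.2, '<4. or >10.')
True
>>> keep(6.0, '<4. or >10.')
False
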
get_mapped_feature_name()[source]

Get actual mapped feature names for feature selections.

Note

  • class parameter self.select_feature examples:
    • ‘all’
    • {'AtomicDensities_ind': 'all', 'Feature_ind': 'all'}
    • {‘Feature_ind’: [‘PSSM_*’, ‘pssm_ic_*’]}
  • Feature type must be: ‘AtomicDensities_ind’ or ‘Feature_ind’.
get_raw_feature_name()[source]

Get actual raw feature names for feature selections.

Note

  • class parameter self.select_feature examples:
    • ‘all’
    • {'AtomicDensities': 'all', 'Features': 'all'}
    • {'AtomicDensities': config.atom_vdw_radius_noH, 'Features': ['PSSM_*', 'pssm_ic_*']}
  • Feature type must be: ‘AtomicDensities’ or ‘Features’.
print_possible_features()[source]

Print the possible features in the group.

get_pairing_feature()[source]

Creates the index of paired features.

get_input_shape()[source]

Get the size of the data and input.

Note

  • self.data_shape: shape of the raw 3d data set
  • self.input_shape: input size of the CNN. Potentially after 2d transformation.
get_grid_shape()[source]

Get the shape of the matrices.

Raises:ValueError – If no grid shape is provided or is present in the HDF5 file
compute_norm()[source]

Compute the normalization factors.

get_norm()[source]

Get the normalization values for the features.

_read_norm()[source]

Read or create the normalization file for the complex.

_get_target_ordering(order)[source]

Determine the ordering of the target.

This can be 'lower the better' or 'higher the better'. If the ordering cannot be determined, 'lower' is assumed.

backtransform_target(data)[source]

Returns the values of the target after de-normalization.

Parameters:data (list(float)) – normalized data
Returns:un-normalized data
Return type:list(float)
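
Assuming the usual (x − mean) / std normalization, the back-transformation is the inverse affine map. A minimal sketch (mean and std are placeholder values):

>>> import numpy as np
>>> mean, std = 5.0, 2.0                   # placeholder normalization factors
>>> normalized = np.array([-0.5, 0.0, 1.5])
>>> normalized * std + mean                # inverse of (x - mean) / std
array([4., 5., 8.])
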
_normalize_target(target)[source]

Normalize the values of the targets.

Parameters:target (list(float)) – raw data
Returns:normalized data
Return type:list(float)
_normalize_feature(feature)[source]

Normalize the values of the features.

Parameters:feature (np.array) – raw feature values
Returns:normalized feature values
Return type:np.array
_clip_feature(feature)[source]

Clip the value of the features at ±(mean + clip_factor * std).

Parameters:feature (np.array) – raw feature values

Returns:clipped feature values
Return type:np.array
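
A minimal sketch of the clipping rule, assuming a symmetric bound built from the feature's mean and standard deviation (not DeepRank's exact code):

>>> import numpy as np
>>> def clip_feature(feature, clip_factor=1.5):
...     # clip at +/- (mean + clip_factor * std)
...     bound = np.abs(feature.mean()) + clip_factor * feature.std()
...     return np.clip(feature, -bound, bound)
>>> clipped = clip_feature(np.random.randn(8, 10, 10, 10))
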
static _mad_based_outliers(points, minv, maxv, thresh=3.5)[source]

Median absolute deviation (MAD) based outlier detection (experimental).

Parameters:
  • points (np.array) – raw input data
  • minv (float) – minimum (negative) value requested
  • maxv (float) – maximum (positive) value requested
  • thresh (float, optional) – threshold for outlier detection
Returns:data where outliers were replaced by min/max values
Return type:np.array
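
A sketch of a standard MAD-based scheme (the Iglewicz–Hoaglin modified z-score; DeepRank's exact thresholding may differ):

>>> import numpy as np
>>> def mad_outliers(points, minv, maxv, thresh=3.5):
...     med = np.median(points)
...     mad = np.median(np.abs(points - med))     # median absolute deviation
...     score = 0.6745 * np.abs(points - med) / mad   # modified z-score
...     out = points.copy()
...     out[(score > thresh) & (points < med)] = minv  # low outliers -> minv
...     out[(score > thresh) & (points > med)] = maxv  # high outliers -> maxv
...     return out
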
load_one_molecule(fname, mol=None)[source]

Load the feature/target of a single molecule.

Parameters:
  • fname (str) – hdf5 file name
  • mol (None or str, optional) – name of the complex in the hdf5
Returns:

features, targets

Return type:

np.array,float

map_one_molecule(fname, mol=None, angle=None, axis=None)[source]

Map the feature and load feature/target of a single molecule.

Parameters:
  • fname (str) – hdf5 file name
  • mol (None or str, optional) – name of the complex in the hdf5
  • angle (float, optional) – rotation angle
  • axis (list, optional) – rotation axis
Returns:

features, targets

Return type:

np.array,float

static convert2d(feature, proj2d)[source]

Convert the 3D volumetric feature to a 2D planar data set.

proj2d specifies the dimension to treat as channels: for example, for proj2d = 0 the 2D images are in the yz plane and the stack along the x dimension is considered as extra channels.

Parameters:
  • feature (np.array) – raw features
  • proj2d (int) – projection axis

Returns:projected features
Return type:np.array
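
A sketch of the projection for proj2d = 0, where the slices along x become extra channels of yz images (the (channel, x, y, z) axis order is an assumption):

>>> import numpy as np
>>> feature = np.random.rand(2, 10, 10, 10)    # (channels, x, y, z)
>>> nc, nx, ny, nz = feature.shape
>>> planar = feature.reshape(nc * nx, ny, nz)  # x slices -> extra channels
>>> planar.shape
(20, 10, 10)
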
static make_feature_pair(feature, op)[source]

Pair the features of both chains.

Parameters:
  • feature (np.array) – raw features
  • op (callable) – function to combine the features
Returns:

combined features

Return type:

np.array

Raises:

ValueError – if op is not callable
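
A sketch of the pairing, assuming the first half of the channels belongs to chain A and the second half to chain B:

>>> import numpy as np
>>> def make_feature_pair(feature, op):
...     if not callable(op):
...         raise ValueError('op must be callable')
...     half = feature.shape[0] // 2   # chain A channels, then chain B channels
...     return op(feature[:half], feature[half:])
>>> make_feature_pair(np.random.rand(4, 8, 8, 8), np.add).shape
(2, 8, 8, 8)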

get_grid(mol_data)[source]

Get meshed grids and the number of points.

Parameters:mol_data (h5 group) – HDF5 molecule group
Raises:ValueError – Grid points not found in mol_data.
Returns:meshgrid, npts
Return type:tuple, tuple
map_atomic_densities(feat_names, mol_data, grid, npts, angle, axis)[source]

Map atomic densities.

Parameters:
  • feat_names (dict) – Element type and vdw radius
  • mol_data (h5 group) – HDF5 molecule group
  • grid (tuple) – mesh grid of x,y,z
  • npts (tuple) – number of points on axis x,y,z
  • angle (float) – rotation angle
  • axis (list) – rotation axis
Returns:

atomic densities of each atom type on each chain

Return type:

list

static _densgrid(center, vdw_radius, grid, npts)[source]

Function to map individual atomic density on the grid.

The formula is equation (1) of the Koes paper, Protein-Ligand Scoring with Convolutional Neural Networks, arXiv:1612.02751v1.

Parameters:
  • center (list(float)) – position of the atoms
  • vdw_radius (float) – vdw radius of the atom
Returns:

mapped density

Return type:

np.array
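
For reference, equation (1) of that paper is a Gaussian core e^(−2 d²/r²) for d < r (with d the distance to the atom center and r the vdW radius), a quadratic tail up to d = 1.5 r, and zero beyond. A sketch of that formula on a mesh grid (DeepRank's exact implementation may differ):

>>> import numpy as np
>>> def densgrid(center, vdw_radius, grid):
...     x, y, z = grid
...     r = vdw_radius
...     d = np.sqrt((x - center[0])**2 + (y - center[1])**2 + (z - center[2])**2)
...     core = np.exp(-2 * d**2 / r**2)                       # d < r
...     tail = (4 * (d / r)**2 - 12 * (d / r) + 9) / np.e**2  # r <= d < 1.5 r
...     return np.where(d < r, core, np.where(d < 1.5 * r, tail, 0.0))
>>> grid = np.meshgrid(*[np.linspace(-5., 5., 11)] * 3, indexing='ij')
>>> dens = densgrid([0., 0., 0.], 1.7, grid)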

map_feature(feat_names, mol_data, grid, npts, angle, axis)[source]
static _featgrid(center, value, grid, npts)[source]

Map an individual feature (atomic or residue) on the grid.

Parameters:
  • center (list(float)) – position of the feature center
  • value (float) – value of the feature
  • type (str, optional) – method to map
Returns:

Mapped feature

Return type:

np.array

Raises:

ValueError

NeuralNet: perform deep learning

class deeprank.learn.NeuralNet.NeuralNet(data_set, model, model_type='3d', proj2d=0, task='reg', class_weights=None, pretrained_model=None, chain1='A', chain2='B', cuda=False, ngpu=0, plot=False, save_hitrate=False, save_classmetrics=False, outdir='./')[source]

Train a Convolutional Neural Network for DeepRank.

Parameters:
  • data_set (deeprank.DataSet or list(str)) – data set used for training or testing: a deeprank.DataSet for training; a str or list(str), e.g. 'x.hdf5' or ['x1.hdf5', 'x2.hdf5'], for testing when a pretrained model is loaded.
  • model (nn.Module) – Definition of the NN to use. Must subclass nn.Module. See examples in model2d.py and model3d.py
  • model_type (str) – Type of model we want to use. Must be ‘2d’ or ‘3d’. If we specify a 2d model, the data set is automatically converted to the correct format.
  • proj2d (int) – Defines how to slice the 3D volumetric data to generate 2D data. Allowed values are 0, 1 and 2, which are to slice along the YZ, XZ or XY plane, respectively.
  • task (str 'reg' or 'class') – task to perform: 'reg' for regression, 'class' for classification. The loss function, the target datatype and the plot functions are automatically adjusted to the task.
  • class_weights (Tensor) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size #classes. Only applicable to the 'class' task.
  • pretrained_model (str) – saved model to be used for further training or testing. When using a pretrained model, remember to set 'chain1' and 'chain2' below for the new data.
  • chain1 (str) – first chain ID of new data when using pretrained model
  • chain2 (str) – second chain ID of new data when using pretrained model
  • cuda (bool) – Use CUDA.
  • ngpu (int) – number of GPU to be used.
  • plot (bool) – Plot the prediction results.
  • save_hitrate (bool) – Save and plot hit rate.
  • save_classmetrics (bool) – save and plot classification metrics: accuracy (ACC), sensitivity (TPR) and specificity (TNR).
  • outdir (str) – output directory

Examples

Train models:

>>> data_set = DataSet(...)
>>> model = NeuralNet(data_set, cnn,
...                   model_type='3d', task='reg',
...                   plot=True, save_hitrate=True,
...                   outdir='./out/')
>>> model.train(nepoch=50, divide_trainset=0.8,
...             train_batch_size=5, num_workers=0)

Test a model on new data:

>>> data_set = ['test01.hdf5', 'test02.hdf5']
>>> model = NeuralNet(data_set, cnn,
...                   pretrained_model='./model.pth.tar',
...                   outdir='./out/')
>>> model.test()

train(nepoch=50, divide_trainset=None, hdf5='epoch_data.hdf5', train_batch_size=10, preshuffle=True, preshuffle_seed=None, export_intermediate=True, num_workers=1, save_model='best', save_epoch='intermediate', hit_cutoff=None)[source]

Perform a simple training of the model.

Parameters:
  • nepoch (int, optional) – number of epochs
  • divide_trainset (list, optional) – the fractions assigned to the training, validation and test sets. Examples: [0.7, 0.2, 0.1], [0.8, 0.2], None
  • hdf5 (str, optional) – file to store the training results
  • train_batch_size (int, optional) – size of the batch
  • preshuffle (bool, optional) – preshuffle the dataset before dividing it.
  • preshuffle_seed (int, optional) – set random seed for preshuffle
  • export_intermediate (bool, optional) – export data at intermediate epochs.
  • num_workers (int, optional) – number of workers to be used to prepare the batch data
  • save_model (str, optional) – ‘best’ or ‘all’, save only the best model or all models.
  • save_epoch (str, optional) – ‘intermediate’ or ‘all’, save the epochs data to HDF5.
  • hit_cutoff (float, optional) – the cutoff used to define a hit by comparison with the docking models' target value, e.g. the IRMSD value
static convertSeconds2Days(time)[source]
test(hdf5='test_data.hdf5', hit_cutoff=None, has_target=False)[source]

Test a predefined model on a new dataset.

Parameters:
  • hdf5 (str, optional) – hdf5 file to store the test results
  • hit_cutoff (float, optional) – the cutoff used to define a hit by comparison with the docking models' target value, e.g. the IRMSD value
  • has_target (bool, optional) – specify the presence (True) or absence (False) of target values in the test set. No metrics can be computed if False.

Examples

>>> # address of the database
>>> database = '1ak4.hdf5'
>>> # Load the model in a new network instance
>>> model = NeuralNet(database, cnn,
...                   pretrained_model='./model/model.pth.tar',
...                   outdir='./test/')
>>> # test the model
>>> model.test()
save_model(filename='model.pth.tar')[source]

Save the model to disk.

Parameters:filename (str, optional) – name of the file
load_model_params()[source]

Get model parameters from a saved model.

load_optimizer_params()[source]

Get optimizer parameters from a saved model.

load_nn_params()[source]

Get NeuralNet parameters from a saved model.

load_data_params()[source]

Get dataset parameters from a saved model.

_divide_dataset(divide_set, preshuffle, preshuffle_seed)[source]

Divide the data set into training, validation and test according to the percentage in divide_set.

Parameters:
  • divide_set (list(float)) – percentage used for training/validation/test. Example: [0.8, 0.1, 0.1], [0.8, 0.2]
  • preshuffle (bool) – shuffle the dataset before dividing it
  • preshuffle_seed (int, optional) – set random seed
Returns:

indices of the training/validation/test sets

Return type:

list(int), list(int), list(int)
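
A minimal sketch of such a split (a hypothetical helper, not the class's internal code):

>>> import numpy as np
>>> def divide_indices(n, divide_set=(0.8, 0.1, 0.1), preshuffle=True, seed=None):
...     idx = np.arange(n)
...     if preshuffle:
...         rng = np.random.RandomState(seed)
...         rng.shuffle(idx)
...     n_train = int(divide_set[0] * n)
...     n_valid = int(divide_set[1] * n) if len(divide_set) > 2 else n - n_train
...     return (idx[:n_train],
...             idx[n_train:n_train + n_valid],
...             idx[n_train + n_valid:])
>>> train, valid, test = divide_indices(100, (0.8, 0.1, 0.1), seed=42)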

_train(index_train, index_valid, index_test, nepoch=50, train_batch_size=5, export_intermediate=False, num_workers=1, save_epoch='intermediate', save_model='best')[source]

Train the model.

Parameters:
  • index_train (list(int)) – Indices of the training set
  • index_valid (list(int)) – Indices of the validation set
  • index_test (list(int)) – Indices of the testing set
  • nepoch (int, optional) – number of epochs
  • train_batch_size (int, optional) – size of the batch
  • export_intermediate (bool, optional) – export intermediate data
  • num_workers (int, optional) – number of workers pytorch uses to prepare the batches
  • save_epoch (str,optional) – ‘intermediate’ or ‘all’
  • save_model (str, optional) – ‘all’ or ‘best’
Returns:

Parameters of the network after training

Return type:

torch.tensor

_epoch(data_loader, train_model, has_target=True)[source]

Perform one single epoch iteration over a data loader.

Parameters:
  • data_loader (torch.DataLoader) – DataLoader for the epoch
  • train_model (bool) – train the model if True or not if False
Returns:

loss of the model (float) and data of the epoch (dict)

Return type:

float, dict
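
The core of such an epoch loop, sketched with hypothetical names (net, loss_fn, optimizer):

>>> def run_epoch(net, loss_fn, optimizer, data_loader, train_model=True):
...     running_loss = 0.0
...     for inputs, targets in data_loader:
...         outputs = net(inputs)
...         loss = loss_fn(outputs, targets)
...         if train_model:                # backpropagate only when training
...             optimizer.zero_grad()
...             loss.backward()
...             optimizer.step()
...         running_loss += loss.item()
...     return running_loss / len(data_loader)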

_get_variables(inputs, targets)[source]

Convert the features/targets into torch.Variables.

The format differs between regression, where the targets are float, and classification, where they are int.

Parameters:
  • inputs (np.array) – raw features
  • targets (np.array) – raw target values
Returns:

features and target values

Return type:

torch.Variable, torch.Variable
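
In modern PyTorch, Variables are plain tensors; the dtype distinction sketched below is the relevant part (regression targets as float, class labels as long):

>>> import torch
>>> def get_variables(inputs, targets, task='reg'):
...     x = torch.as_tensor(inputs, dtype=torch.float32)
...     dtype = torch.float32 if task == 'reg' else torch.long
...     y = torch.as_tensor(targets, dtype=dtype)
...     return x, y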

_export_losses(figname)[source]

Plot the losses vs the epoch.

Parameters:figname (str) – name of the file where to export the figure
_export_metrics(metricname)[source]
_plot_scatter_reg(figname)[source]

Plot a scatter plot of predictions vs. targets.

Useful to visualize the performance of the training algorithm.

Parameters:figname (str) – filename
_plot_boxplot_class(figname)[source]

Plot a boxplot of predictions vs. targets.

It is only useful for classification tasks.

Parameters:figname (str) – filename
plot_hit_rate(figname)[source]

Plot the hit rate of the different training/valid/test sets.

The hit rate is defined as the percentage of positive (near-native) decoys that are included among the top m decoys.
Parameters:figname (str) – filename for the plot
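
Following that definition, the hit rate can be computed by ranking the decoys by predicted score and taking the cumulative fraction of hits. A sketch (assuming lower predicted scores are better):

>>> import numpy as np
>>> def hit_rate(predictions, is_hit):
...     order = np.argsort(predictions)        # best (lowest) score first
...     hits = np.asarray(is_hit)[order]
...     return np.cumsum(hits) / max(hits.sum(), 1)
>>> hit_rate(np.array([0.2, 0.9, 0.1]), [True, False, True])
array([0.5, 1. , 1. ])
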
_compute_hitrate(hit_cutoff=None)[source]
_get_relevance(data, hit_cutoff=None)[source]
_get_classmetrics(data, metricname)[source]
static _get_binclass_prediction(data)[source]
_export_epoch_hdf5(epoch, data)[source]

Export the epoch data to the hdf5 file.

Export the data of a given epoch in the train/valid/test group. Each group stores the predicted values (outputs), the ground truth (targets) and the molecule names (mol).

Parameters:
  • epoch (int) – index of the epoch
  • data (dict) – data of the epoch

modelGenerator: generate NN architecture

class deeprank.learn.modelGenerator.NetworkGenerator(name='_tmp_model_', fname='_tmp_model_.py', conv_layers=None, fc_layers=None)[source]

Automatic generation of NN files.

This class allows for the automatic generation of a Python file containing the definition of a torch-formatted neural network.

Parameters:
  • name (str, optional) – name of the model in the python file
  • fname (str, optional) – name of the file containing the model
  • conv_layers (list(layers)) – list of convolutional layers
  • fc_layers (list(layers)) – list of fully connected layers

Example

>>> conv_layers = []
>>> conv_layers.append(conv(output_size=4,kernel_size=2,post='relu'))
>>> conv_layers.append(pool(kernel_size=2))
>>> conv_layers.append(conv(input_size=4,output_size=5,kernel_size=2,post='relu'))
>>> conv_layers.append(pool(kernel_size=2))
>>>
>>> fc_layers = []
>>> fc_layers.append(fc(output_size=84,post='relu'))
>>> fc_layers.append(fc(input_size=84,output_size=1))
>>>
>>> MG = NetworkGenerator(name='test',fname='model_test.py',conv_layers=conv_layers,fc_layers=fc_layers)
>>> MG.print()
>>> MG.write()
write()[source]

Write the model to file.

static _write_import(fhandle)[source]
_write_definition(fhandle)[source]
_write_init(fhandle)[source]
static _write_conv_output(fhandle)[source]
_write_forward_feature(fhandle)[source]
_write_forward(fhandle)[source]
print()[source]

Print the model to screen.

get_new_random_model()[source]

Get a new Random Model.

_init_conv_layer_random(ilayer)[source]
_init_fc_layer_random(ilayer)[source]
class deeprank.learn.modelGenerator.conv(input_size=-1, output_size=None, kernel_size=None, post=None)[source]

Wrapper around the convolutional layer.

Parameters:
  • input_size (int, optional) – input size (default, let the generator figure it out)
  • output_size (int, optional) – output size
  • kernel_size (int, optional) – kernel size
  • post (str, optional) – post process of the data

Example:

>>> conv_layers.append(conv(output_size=4,kernel_size=2,post='relu'))
class deeprank.learn.modelGenerator.pool(kernel_size=None, post=None)[source]

Wrapper around the pool layer.

Parameters:
  • kernel_size (int, optional) – kernel size
  • post (str, optional) – post process of the data

Example:

>>> conv_layers.append(pool(kernel_size=2))
class deeprank.learn.modelGenerator.dropout(percent=0.5)[source]

Wrapper around the dropout layer.

Parameters:percent (float) – percent of dropout

Example:

>>> fc_layers.append(dropout(percent=0.25))
class deeprank.learn.modelGenerator.fc(input_size=-1, output_size=None, post=None)[source]

Wrapper around the fully connected layer.

Parameters:
  • input_size (int, optional) – input size (default, let the generator figure it out)
  • output_size (int, optional) – output size
  • post (str, optional) – post process of the data

Example:

>>> fc_layers.append(fc(output_size=84,post='relu'))