Learning¶

This module contains all the tools for deep learning in DeepRank. The two main modules are deeprank.learn.DataSet and deeprank.learn.NeuralNet. The DataSet class processes the hdf5 files created by the deeprank.generate toolset for use by PyTorch. This is done by creating several torch.utils.data.DataLoader instances for the training, validation and test of the model. Several options are available to specify and filter which conformations should be used in the dataset. The NeuralNet class is in charge of the deep learning part. There as well, several options are available to specify the task to be performed, the architecture of the neural network, etc.

Example:

>>> from deeprank.learn import *
>>> from model3d import cnn_class
>>>
>>> database = '1ak4.hdf5'
>>>
>>> # declare the dataset instance
>>> data_set = DataSet(database,
...                    chain1='C',
...                    chain2='D',
...                    select_feature='all',
...                    select_target='IRMSD',
...                    dict_filter={'IRMSD':'<4. or >10.'})
>>>
>>>
>>> # create the network
>>> model = NeuralNet(data_set, cnn_class, model_type='3d', task='class')
>>>
>>> # start the training
>>> model.train(nepoch = 250,divide_trainset=0.8, train_batch_size = 50, num_workers=8)
>>>
>>> # save the model
>>> model.save_model()

The details of the submodules are presented here. The two main ones are deeprank.learn.DataSet and deeprank.learn.NeuralNet.

note:	The module `deeprank.learn.modelGenerator` can automatically create the file defining the neural network architecture.

DataSet: create a torch dataset¶

class deeprank.learn.DataSet.DataSet(train_database, valid_database=None, test_database=None, chain1='A', chain2='B', mapfly=True, grid_info=None, use_rotation=None, select_feature='all', select_target='DOCKQ', normalize_features=True, normalize_targets=True, target_ordering=None, dict_filter=None, pair_chain_feature=None, transform_to_2D=False, projection=0, clip_features=True, clip_factor=1.5, rotation_seed=None, tqdm=False, process=True)[source]¶

Generates the dataset needed for pytorch.

This class handles the data generated by deeprank.generate to be used in the deep learning part of DeepRank.

Parameters:

train_database (list(str)) – names of the hdf5 files used for the training/validation. Example: [‘1AK4.hdf5’,’1B7W.hdf5’,…]
valid_database (list(str)) – names of the hdf5 files used for the validation. Example: [‘1ACB.hdf5’,’4JHF.hdf5’,…]
test_database (list(str)) – names of the hdf5 files used for the test. Example: [‘7CEI.hdf5’]
chain1 (str) – first chain ID, defaults to ‘A’
chain2 (str) – second chain ID, defaults to ‘B’
mapfly (bool) – whether to compute the maps during batch preparation (True) or read them (False)
grid_info (dict) –
grid information to map the feature. If None the original grid points are used. The dict contains:
- ‘number_of_points’, the shape of the grid
- ‘resolution’, the resolution of the grid, in Å
Example:

{‘number_of_points’: [10, 10, 10], ‘resolution’: [3, 3, 3]}
use_rotation (int) – number of rotations to use. Example: 0 (use only original data) Default: None (use all data of the database)
select_feature (dict or 'all', optional) – Select the features used in the learning. Default: ‘all’
if mapfly is True:
- {‘AtomicDensities’: ‘all’, ‘Features’: ‘all’}
- {‘AtomicDensities’: config.atom_vdw_radius_noH, ‘Features’: [‘PSSM_*’, ‘pssm_ic_*’]}
if mapfly is False:
- {‘AtomicDensities_ind’: ‘all’, ‘Feature_ind’: ‘all’}
- {‘Feature_ind’: [‘PSSM_*’, ‘pssm_ic_*’]}
select_target (str,optional) – Specify required target. Default: ‘DOCKQ’
normalize_features (Bool, optional) – normalize features or not Default: True
normalize_targets (Bool, optional) – normalize targets or not Default: True
target_ordering (str) – ‘lower’ (the lower the better) or ‘higher’ (the higher the better). By default it is not specified (None) and the code tries to identify it. If identification fails, ‘lower’ is used.
dict_filter (None or dict, optional) – Filter the complexes based on target values. Example: {‘IRMSD’: ‘<4. or >10’} (select complexes with IRMSD lower than 4 or larger than 10) Default: None
pair_chain_feature (None or callable, optional) – method to pair features of chainA and chainB Example: np.sum (sum the chainA and chainB features)
transform_to_2D (bool, optional) – Boolean to use 2d maps instead of full 3d Default: False
projection (int) – Projection axis from 3D to 2D: Mapping: 0 -> yz, 1 -> xz, 2 -> xy Default = 0
clip_features (bool, optional) – Remove too large values of the grid. Can be needed for native complexes where the coulomb feature might be too large
clip_factor (float, optional) – the features are clipped at: +/-mean + clip_factor * std
tqdm (bool, optional) – Print the progress bar
process (bool, optional) – Actually process the data set. Must be set to False when reusing a model for testing
rotation_seed (int, optional) – random seed for getting rotation axis and angle.

Examples

>>> import numpy as np
>>> from deeprank.learn import *
>>> train_database = '1ak4.hdf5'
>>> data_set = DataSet(train_database,
...                    valid_database = None,
...                    test_database = None,
...                    chain1='C',
...                    chain2='D',
...                    grid_info = {
...                        'number_of_points': (10, 10, 10),
...                        'resolution': (3, 3, 3)
...                    },
...                    select_feature = {
...                       'AtomicDensities': 'all',
...                       'Features': [
...                            'PSSM_*', 'pssm_ic_*' ]
...                    },
...                    select_target='IRMSD',
...                    normalize_features = True,
...                    normalize_targets=True,
...                    pair_chain_feature=np.add,
...                    dict_filter={'IRMSD':'<4. or >10.'},
...                    process = True)

static _get_database_name(database)[source]¶

Get the list of hdf5 database file names.

Parameters:	database (None, str or list(str)) – hdf5 database name(s).
Returns:	hdf5 file names
Return type:	list

process_dataset()[source]¶

Process the data set.

Done by default. However, it must be turned off when testing a pretrained model. This can be done by setting process=False when creating the DataSet instance.

static check_hdf5_files(database)[source]¶: Check if the data contained in the hdf5 file is ok.

create_index_molecules()[source]¶

Create the indexing of each molecule in the dataset.

Create the indexing: [(‘1ak4.hdf5’, ‘1AK4_100w’), …, (‘1fqj.hdf5’, ‘1FGJ_400w’)]. This allows one to refer to a complex by its index in the list.

Raises:	`ValueError` – No available training data after filtering.

_select_pdb(mol_names)[source]¶

Select complexes.

Parameters:	mol_names (list) – list of complex names
Returns:	list of selected complexes
Return type:	list

filter(molgrp)[source]¶

Filter the molecule according to a dictionary, e.g., dict_filter={‘DOCKQ’: ‘>0.1’, ‘IRMSD’: ‘<=4 or >10’}.

The filter is based on the attribute self.dict_filter, which must be either None or of the form {‘name’: cond}.

Parameters:	molgrp (str) – group name of the molecule in the hdf5 file
Returns:	True if we keep the complex False otherwise
Return type:	bool
Raises:	`ValueError` – If an unsupported condition is provided
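
A condition string such as ‘<4. or >10.’ reads as a combination of comparisons against the target value. The following is a minimal sketch of how such strings could be evaluated. The helper `passes_filter` and its parsing rules (comparisons joined by `and`/`or`, with `or` binding more weakly) are assumptions made for illustration and are not taken from the DeepRank source.

```python
import operator
import re

# Comparison symbols supported by this sketch (two-character symbols listed first).
_OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
        "<": operator.lt, ">": operator.gt}
_TOKEN = re.compile(
    r"\s*(<=|>=|==|<|>)\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)\s*")


def _check(value, comparison):
    """Evaluate one comparison such as '<4.' against a numeric value."""
    match = _TOKEN.fullmatch(comparison)
    if match is None:
        raise ValueError(f"unsupported condition: {comparison!r}")
    symbol, number = match.groups()
    return _OPS[symbol](value, float(number))


def passes_filter(targets, dict_filter):
    """Return True if every filtered target satisfies its condition.

    targets     -- dict mapping target names to values (hypothetical input)
    dict_filter -- None or dict of the form {'name': cond}
    """
    if dict_filter is None:
        return True
    for name, cond in dict_filter.items():
        alternatives = re.split(r"\s+or\s+", cond.strip())
        keep = any(
            all(_check(targets[name], c) for c in re.split(r"\s+and\s+", alt))
            for alt in alternatives
        )
        if not keep:
            return False
    return True
```

With `dict_filter={'IRMSD': '<4. or >10.'}`, a complex with an IRMSD of 3.0 or 12.0 would be kept and one with an IRMSD of 5.0 would be dropped.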

get_mapped_feature_name()[source]¶

Get actual mapped feature names for feature selections.

Note

class parameter self.select_feature examples:
- ‘all’
- {‘AtomicDensities_ind’: ‘all’, ‘Feature_ind’: ‘all’}
- {‘Feature_ind’: [‘PSSM_*’, ‘pssm_ic_*’]}
Feature type must be: ‘AtomicDensities_ind’ or ‘Feature_ind’.

Raises:	`KeyError` – Wrong feature type.

get_raw_feature_name()[source]¶

Get actual raw feature names for feature selections.

Note

class parameter self.select_feature examples:
- ‘all’
- {‘AtomicDensities’: ‘all’, ‘Features’: ‘all’}
- {‘AtomicDensities’: config.atom_vdw_radius_noH, ‘Features’: [‘PSSM_*’, ‘pssm_ic_*’]}
Feature type must be: ‘AtomicDensities’ or ‘Features’.

Raises:	`KeyError` – Wrong feature type.
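
Selections such as [‘PSSM_*’, ‘pssm_ic_*’] use wildcards to name several features at once. The sketch below shows one way to expand such patterns against a list of available feature names, assuming shell-style, case-sensitive wildcard matching. The helper `expand_feature_patterns` and the feature names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_feature_patterns(patterns, available):
    """Return the available names matched by at least one pattern.

    patterns  -- 'all' or a list of wildcard patterns such as 'PSSM_*'
    available -- list of feature names found in the data (hypothetical input)
    The order of `available` is preserved in the result.
    """
    if patterns == "all":
        return list(available)
    return [name for name in available
            if any(fnmatchcase(name, pattern) for pattern in patterns)]
```

For example, with the made-up names `['PSSM_ALA', 'PSSM_GLY', 'pssm_ic_A', 'coulomb']`, the patterns `['PSSM_*', 'pssm_ic_*']` would select the first three names and leave `'coulomb'` out.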

print_possible_features()[source]¶: Print the possible features in the group.

get_pairing_feature()[source]¶: Creates the index of paired features.

get_input_shape()[source]¶

Get the size of the data and input.

Note

self.data_shape: shape of the raw 3d data set
self.input_shape: input size of the CNN. Potentially after 2d transformation.

get_grid_shape()[source]¶

Get the shape of the matrices.

Raises:	`ValueError` – If no grid shape is provided or is present in the HDF5 file

compute_norm()[source]¶: Compute the normalization factors.

get_norm()[source]¶: Get the normalization values for the features.

_read_norm()[source]¶: Read or create the normalization file for the complex.

_get_target_ordering(order)[source]¶

Determine the ordering of the target.

This can be ‘lower’ (the lower the better) or ‘higher’ (the higher the better). If the ordering cannot be determined, ‘lower’ is assumed.
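
As an illustration of this fallback logic, the sketch below resolves the ordering from the target name. The lookup sets and the function `guess_target_ordering` are hypothetical; the target names that DeepRank actually recognises may differ.

```python
# Hypothetical lookup tables, for illustration only.
LOWER_IS_BETTER = {"IRMSD", "LRMSD"}
HIGHER_IS_BETTER = {"DOCKQ", "FNAT"}


def guess_target_ordering(target_name, order=None):
    """Return 'lower' or 'higher' for a target.

    An explicit `order` wins; otherwise the name is looked up, and
    'lower' is used when the name is unknown.
    """
    if order is not None:
        if order not in ("lower", "higher"):
            raise ValueError("order must be 'lower', 'higher' or None")
        return order
    name = target_name.upper()
    if name in HIGHER_IS_BETTER:
        return "higher"
    if name in LOWER_IS_BETTER:
        return "lower"
    return "lower"  # fallback when the ordering cannot be identified
```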

backtransform_target(data)[source]¶

Returns the values of the target after de-normalization.

Parameters:	data (list(float)) – normalized data
Returns:	un-normalized data
Return type:	list(float)

_normalize_target(target)[source]¶

Normalize the values of the targets.

Parameters:	target (list(float)) – raw data
Returns:	normalized data
Return type:	list(float)
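
The normalization and its inverse (backtransform_target) form a round trip. A minimal sketch follows, assuming min-max scaling to the range [0, 1]. The scaling scheme, the function names and the explicit `vmin`/`vmax` arguments are assumptions for illustration; the class keeps its own normalization values.

```python
def normalize_target(values, vmin, vmax):
    """Scale raw target values to [0, 1] using the given minimum and maximum."""
    span = vmax - vmin
    if span == 0:
        raise ValueError("vmin and vmax must differ")
    return [(v - vmin) / span for v in values]


def backtransform_target(values, vmin, vmax):
    """Invert normalize_target: map values in [0, 1] back to the raw scale."""
    span = vmax - vmin
    return [v * span + vmin for v in values]
```

Applying `backtransform_target` to the output of `normalize_target` with the same `vmin` and `vmax` recovers the raw values up to floating-point error.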

_normalize_feature(feature)[source]¶

Normalize the values of the features.

Parameters:	feature (np.array) – raw feature values
Returns:	normalized feature values
Return type:	np.array

_clip_feature(feature)[source]¶

Clip the value of the features at +/- mean + clip_factor * std.

Parameters:	feature (np.array) – raw feature values

Returns:	clipped feature values
Return type:	np.array
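
A minimal sketch of this kind of clipping, assuming the bound is read as the interval [mean - clip_factor * std, mean + clip_factor * std]. The function below is illustrative and works on a plain list; `mean` and `std` are passed in explicitly, whereas the class presumably uses its stored normalization values.

```python
def clip_feature(values, mean, std, clip_factor=1.5):
    """Clamp each value to [mean - clip_factor * std, mean + clip_factor * std]."""
    low = mean - clip_factor * std
    high = mean + clip_factor * std
    return [min(max(v, low), high) for v in values]
```

With `mean=0`, `std=2` and the default `clip_factor=1.5`, values below -3 are raised to -3 and values above 3 are lowered to 3, while values in between are left unchanged.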

static _mad_based_outliers(points, minv, maxv, thresh=3.5)[source]¶

Mean absolute deviation based outlier detection.

(Experimental).

Parameters:

points (np.array) – raw input data
minv (float) – Minimum (negative) value requested
maxv (float) – Maximum (positive) value requested
thresh (float, optional) – Threshold for data detection

Returns:	data where outliers were replaced by min/max values
Return type:	TYPE

load_one_molecule(fname, mol=None)[source]¶

Load the feature/target of a single molecule.

Parameters:

fname (str) – hdf5 file name
mol (None or str, optional) – name of the complex in the hdf5

Returns:	features, targets
Return type:	np.array,float

map_one_molecule(fname, mol=None, angle=None, axis=None)[source]¶

Map the feature and load feature/target of a single molecule.

Parameters:

fname (str) – hdf5 file name
mol (None or str, optional) – name of the complex in the hdf5

Returns:	features, targets
Return type:	np.array,float

static convert2d(feature, proj2d)[source]¶

Convert the 3D volumetric feature to a 2D planar data set.

proj2d specifies the dimension that we want to consider as channel. For example, for proj2d = 0 the 2D images are in the yz plane and the stack along the x dimension is considered as extra channels.

Parameters:

feature (np.array) – raw features
proj2d (int) – projection

Returns:	projected features
Return type:	np.array
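
The effect of the projection on array shapes can be sketched as follows, assuming a channels-first layout (channels, nx, ny, nz). The helper `projected_shape` is hypothetical and only computes shapes; it does not move any data.

```python
def projected_shape(shape, proj2d):
    """Shape of the 2D data obtained from a (channels, nx, ny, nz) volume.

    The axis selected by proj2d (0 -> x, 1 -> y, 2 -> z) is folded into the
    channel dimension; the two remaining axes form the 2D image.
    """
    if proj2d not in (0, 1, 2):
        raise ValueError("proj2d must be 0, 1 or 2")
    if len(shape) != 4:
        raise ValueError("expected a shape of the form (channels, nx, ny, nz)")
    channels, *dims = shape
    stacked = dims.pop(proj2d)  # axis treated as extra channels
    return (channels * stacked, dims[0], dims[1])
```

For a volume of shape (4, 10, 20, 30), proj2d = 0 gives 40 channels of 20 x 30 images in the yz plane.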

static make_feature_pair(feature, op)[source]¶

Pair the features of both chains.

Parameters:

feature (np.array) – raw features
op (callable) – function to combine the features

Returns:	combined features
Return type:	np.array
Raises:	`ValueError` – if op is not callable

get_grid(mol_data)[source]¶

Get meshed grids and number of points.

Parameters:	mol_data (h5 group) – HDF5 molecule group
Raises:	`ValueError` – Grid points not found in mol_data.
Returns:	meshgrid, npts
Return type:	tuple, tuple
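
When grid_info is given as {‘number_of_points’: […], ‘resolution’: […]}, the axis coordinates follow from those two entries once a grid centre is chosen. The sketch below assumes a grid centred on a given point with evenly spaced points. The centring convention and the helpers `axis_points` and `grid_axes` are assumptions for illustration, not the DeepRank implementation.

```python
def axis_points(center, npts, resolution):
    """Coordinates of npts evenly spaced points centred on `center`."""
    half_width = 0.5 * resolution * (npts - 1)
    return [center - half_width + i * resolution for i in range(npts)]


def grid_axes(center, grid_info):
    """Return the x, y and z axis coordinates described by grid_info."""
    return tuple(
        axis_points(c, n, r)
        for c, n, r in zip(center,
                           grid_info["number_of_points"],
                           grid_info["resolution"])
    )
```

With 10 points at a resolution of 3 Å, each axis spans 27 Å, from 13.5 Å below the centre to 13.5 Å above it.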

map_atomic_densities(feat_names, mol_data, grid, npts, angle, axis)[source]¶

Map atomic densities.

Parameters:

feat_names (dict) – Element type and vdw radius
mol_data (h5 group) – HDF5 molecule group
grid (tuple) – mesh grid of x,y,z
npts (tuple) – number of points on axis x,y,z
angle (float) – rotation angle
axis (list) – rotation axis

Returns:	atomic densities of each atom type on each chain
Return type:	list
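
The angle and axis arguments describe a rotation applied before mapping. As an illustration, the sketch below rotates a single point about an axis through a chosen centre using Rodrigues' rotation formula. The function `rotate_point` and the choice of rotation centre are assumptions; how DeepRank applies the rotation internally is not stated here.

```python
import math


def rotate_point(point, axis, angle, center=(0.0, 0.0, 0.0)):
    """Rotate `point` by `angle` (radians) about `axis` passing through `center`."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be a non-zero vector")
    kx, ky, kz = (a / norm for a in axis)            # unit rotation axis
    px, py, pz = (p - c for p, c in zip(point, center))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Cross product k x p and dot product k . p
    cx, cy, cz = ky * pz - kz * py, kz * px - kx * pz, kx * py - ky * px
    dot = kx * px + ky * py + kz * pz
    rotated = (
        px * cos_a + cx * sin_a + kx * dot * (1.0 - cos_a),
        py * cos_a + cy * sin_a + ky * dot * (1.0 - cos_a),
        pz * cos_a + cz * sin_a + kz * dot * (1.0 - cos_a),
    )
    return tuple(r + c for r, c in zip(rotated, center))
```

Rotating the point (1, 0, 0) by a quarter turn about the z axis gives (0, 1, 0) up to floating-point error.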

static _densgrid(center, vdw_radius, grid, npts)[source]¶

Function to map individual atomic density on the grid.

The formula is equation (1) of the Koes paper Protein-Ligand Scoring with Convolutional NN Arxiv:1612.02751v1

Parameters:

center (list(float)) – position of the atoms
vdw_radius (float) – vdw radius of the atom

Returns:	mapped density
Return type:	np.array
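
A minimal sketch of the density as a function of the distance to the atom centre, assuming the piecewise form in which that equation is usually quoted: a Gaussian up to the vdW radius R, a quadratic between R and 1.5 R, and zero beyond. The function name and the scalar interface are illustrative; the method documented above works on a whole grid.

```python
import math


def atomic_density(distance, vdw_radius):
    """Density contribution of one atom at a given distance from its centre."""
    d, r = distance, vdw_radius
    if d < 0 or r <= 0:
        raise ValueError("distance must be >= 0 and vdw_radius > 0")
    if d < r:
        # Gaussian part inside the van der Waals radius
        return math.exp(-2.0 * d * d / (r * r))
    if d < 1.5 * r:
        # Quadratic part that decays to zero at 1.5 * r
        e2 = math.exp(2.0)
        return 4.0 * d * d / (e2 * r * r) - 12.0 * d / (e2 * r) + 9.0 / e2
    return 0.0
```

The two branches agree at d = R, where both give exp(-2), and the quadratic reaches zero at d = 1.5 R, so the density is continuous.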

map_feature(feat_names, mol_data, grid, npts, angle, axis)[source]¶

static _featgrid(center, value, grid, npts)[source]¶

Map an individual feature (atomic or residue) on the grid.

Parameters:

center (list(float)) – position of the feature center
value (float) – value of the feature
type (str, optional) – method to map

Returns:	Mapped feature
Return type:	np.array
Raises:	`ValueError` – Description

NeuralNet: perform deep learning¶

class deeprank.learn.NeuralNet.NeuralNet(data_set, model, model_type='3d', proj2d=0, task='reg', class_weights=None, pretrained_model=None, chain1='A', chain2='B', cuda=False, ngpu=0, plot=False, save_hitrate=False, save_classmetrics=False, outdir='./')[source]¶

Train a Convolutional Neural Network for DeepRank.

Parameters:

data_set (deeprank.DataSet or list(str)) – Data set used for training or testing:
- deeprank.DataSet for training;
- str or list(str), e.g. ‘x.hdf5’, [‘x1.hdf5’, ‘x2.hdf5’], for testing when a pretrained model is loaded.
model (nn.Module) – Definition of the NN to use. Must subclass nn.Module. See examples in model2d.py and model3d.py
model_type (str) – Type of model we want to use. Must be ‘2d’ or ‘3d’. If we specify a 2d model, the data set is automatically converted to the correct format.
proj2d (int) – Defines how to slice the 3D volumetric data to generate 2D data. Allowed values are 0, 1 and 2, which are to slice along the YZ, XZ or XY plane, respectively.
task (str 'reg' or 'class') – Task to perform:
- ‘reg’ for regression
- ‘class’ for classification.
The loss function, the target datatype and plot functions will be automatically adjusted depending on the task.
class_weights (Tensor) – a manual rescaling weight given to each class. If given, has to be a Tensor of size #classes. Only applicable on ‘class’ task.
pretrained_model (str) – Saved model to be used for further training or testing. When using pretrained model, remember to set the following ‘chain1’ and ‘chain2’ for the new data.
chain1 (str) – first chain ID of new data when using pretrained model
chain2 (str) – second chain ID of new data when using pretrained model
cuda (bool) – Use CUDA.
ngpu (int) – number of GPU to be used.
plot (bool) – Plot the prediction results.
save_hitrate (bool) – Save and plot hit rate.
save_classmetrics (bool) – Save and plot classification metrics. Classification metrics include:
- accuracy (ACC)
- sensitivity (TPR)
- specificity (TNR)
outdir (str) – output directory

Raises:

ValueError – if dataset format is not recognized
ValueError – if task is not recognized

Examples

Train models:

>>> data_set = DataSet(...)
>>> model = NeuralNet(data_set, cnn,
...                   model_type='3d', task='reg',
...                   plot=True, save_hitrate=True,
...                   outdir='./out/')
>>> model.train(nepoch = 50, divide_trainset=0.8,
...             train_batch_size = 5, num_workers=0)

Test a model on new data:

>>> data_set = ['test01.hdf5', 'test02.hdf5']
>>> model = NeuralNet(data_set, cnn,
...                   pretrained_model = './model.pth.tar',
...                   outdir='./out/')
>>> model.test()

train(nepoch=50, divide_trainset=None, hdf5='epoch_data.hdf5', train_batch_size=10, preshuffle=True, preshuffle_seed=None, export_intermediate=True, num_workers=1, save_model='best', save_epoch='intermediate', hit_cutoff=None)[source]¶

Perform a simple training of the model.

Parameters:

nepoch (int, optional) – number of epochs
divide_trainset (list, optional) – the fractions assigned to the training, validation and test sets. Examples: [0.7, 0.2, 0.1], [0.8, 0.2], None
hdf5 (str, optional) – file to store the training results
train_batch_size (int, optional) – size of the batch
preshuffle (bool, optional) – preshuffle the dataset before dividing it.
preshuffle_seed (int, optional) – set random seed for preshuffle
export_intermediate (bool, optional) – export data at intermediate epochs.
num_workers (int, optional) – number of workers to be used to prepare the batch data
save_model (str, optional) – ‘best’ or ‘all’, save only the best model or all models.
save_epoch (str, optional) – ‘intermediate’ or ‘all’, save the epochs data to HDF5.
hit_cutoff (float, optional) – the cutoff used to define hit by comparing with docking models’ target value, e.g. IRMSD value

static convertSeconds2Days(time)[source]¶

test(hdf5='test_data.hdf5', hit_cutoff=None, has_target=False)[source]¶

Test a predefined model on a new dataset.

Parameters:

hdf5 (str, optional) – hdf5 file to store the test results
hit_cutoff (float, optional) – the cutoff used to define hit by comparing with docking models’ target value, e.g. IRMSD value
has_target (bool, optional) – specify the presence (True) or absence (False) of target values in the test set. No metrics can be computed if False.

Examples

>>> # address of the database
>>> database = '1ak4.hdf5'
>>> # Load the model in a new network instance
>>> model = NeuralNet(database, cnn,
...                   pretrained_model='./model/model.pth.tar',
...                   outdir='./test/')
>>> # test the model
>>> model.test()

save_model(filename='model.pth.tar')[source]¶

Save the model to disk.

Parameters:	filename (str, optional) – name of the file

load_model_params()[source]¶: Get model parameters from a saved model.

load_optimizer_params()[source]¶: Get optimizer parameters from a saved model.

load_nn_params()[source]¶: Get NeuralNet parameters from a saved model.

load_data_params()[source]¶: Get dataset parameters from a saved model.

_divide_dataset(divide_set, preshuffle, preshuffle_seed)[source]¶

Divide the data set into training, validation and test according to the percentage in divide_set.

Parameters:

divide_set (list(float)) – percentage used for training/validation/test. Example: [0.8, 0.1, 0.1], [0.8, 0.2]
preshuffle (bool) – shuffle the dataset before dividing it
preshuffle_seed (int, optional) – set random seed

Returns:

Indices of the: training/validation/test set.

Return type:

list(int),list(int),list(int)
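
A minimal sketch of such a split, assuming the fractions are applied to the (optionally shuffled) list of indices in the order training, validation, test. The helper `divide_indices` and its rounding rule are assumptions for illustration and are not taken from the DeepRank source.

```python
import random


def divide_indices(n_items, divide_set, preshuffle=True, preshuffle_seed=None):
    """Split range(n_items) into training, validation and test index lists.

    divide_set -- two or three fractions, e.g. [0.8, 0.2] or [0.8, 0.1, 0.1]
    """
    if len(divide_set) not in (2, 3):
        raise ValueError("divide_set must contain two or three fractions")
    if any(f < 0 for f in divide_set) or sum(divide_set) > 1.0 + 1e-9:
        raise ValueError("fractions must be non-negative and sum to at most 1")

    indices = list(range(n_items))
    if preshuffle:
        # A dedicated generator keeps the shuffle reproducible for a given seed.
        random.Random(preshuffle_seed).shuffle(indices)

    # Cumulative fractions give the slice boundaries; a missing test
    # fraction is treated as zero.
    fractions = list(divide_set) + [0.0] * (3 - len(divide_set))
    bounds, cumulative = [], 0.0
    for fraction in fractions:
        cumulative += fraction
        bounds.append(min(n_items, round(cumulative * n_items)))

    train = indices[:bounds[0]]
    valid = indices[bounds[0]:bounds[1]]
    test = indices[bounds[1]:bounds[2]]
    return train, valid, test
```

Calling the function twice with the same `preshuffle_seed` returns the same split, which is the purpose of exposing a seed.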

_train(index_train, index_valid, index_test, nepoch=50, train_batch_size=5, export_intermediate=False, num_workers=1, save_epoch='intermediate', save_model='best')[source]¶

Train the model.

Parameters:

index_train (list(int)) – Indices of the training set
index_valid (list(int)) – Indices of the validation set
index_test (list(int)) – Indices of the testing set
nepoch (int, optional) – number of epochs
train_batch_size (int, optional) – size of the batch
export_intermediate (bool, optional) – export intermediate data
num_workers (int, optional) – number of workers PyTorch uses to prepare the batches
save_epoch (str, optional) – ‘intermediate’ or ‘all’
save_model (str, optional) – ‘all’ or ‘best’

Returns:	Parameters of the network after training
Return type:	torch.tensor

_epoch(data_loader, train_model, has_target=True)[source]¶

Perform one single epoch iteration over a data loader.

Parameters:

data_loader (torch.DataLoader) – DataLoader for the epoch
train_model (bool) – train the model if True or not if False

Returns:	loss of the model, data of the epoch
Return type:	float, dict

_get_variables(inputs, targets)[source]¶

Convert the feature/target into torch.Variables.

The format is different for regression where the targets are float and classification where they are int.

Parameters:

inputs (np.array) – raw features
targets (np.array) – raw target values

Returns:	features, target values
Return type:	torch.Variable, torch.Variable

_export_losses(figname)[source]¶

Plot the losses vs the epoch.

Parameters:	figname (str) – name of the file where to export the figure

_export_metrics(metricname)[source]¶

_plot_scatter_reg(figname)[source]¶

Plot a scatter plot of predictions vs. targets.

Useful to visualize the performance of the training algorithm

Parameters:	figname (str) – filename

_plot_boxplot_class(figname)[source]¶

Plot a boxplot of predictions VS targets.

It is only useful in classification tasks.

Parameters:	figname (str) – filename

plot_hit_rate(figname)[source]¶

Plot the hit rate of the different training/valid/test sets.

The hit rate is defined as the percentage of positive (near-native) decoys that are included among the top m decoys.

Parameters:	figname (str) – filename for the plot
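
This definition can be sketched in a few lines. The sketch assumes that a decoy counts as a hit when its true target value is at or below hit_cutoff (as for an IRMSD-like target) and that decoys are ranked by predicted score. The function `hit_rate`, its arguments and the choice of `<=` are assumptions for illustration.

```python
def hit_rate(scores, targets, hit_cutoff, lower_is_better=True):
    """Fraction of all hits found among the top m ranked decoys, for m = 1..N.

    scores  -- predicted values used for ranking
    targets -- true target values used to decide which decoys are hits
    """
    hits = [t <= hit_cutoff for t in targets]
    n_hits = sum(hits)
    if n_hits == 0:
        return [0.0] * len(scores)
    # Rank decoys from best to worst predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=not lower_is_better)
    rates, found = [], 0
    for index in order:
        found += hits[index]
        rates.append(found / n_hits)
    return rates
```

For four decoys with predicted scores [0.5, 3.0, 1.0, 8.0], true values [1.0, 2.0, 6.0, 9.0] and a cutoff of 4, the result is [0.5, 0.5, 1.0, 1.0]: one of the two hits is ranked first and the second is reached at rank three.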

_compute_hitrate(hit_cutoff=None)[source]¶

_get_relevance(data, hit_cutoff=None)[source]¶

_get_classmetrics(data, metricname)[source]¶
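
The classification metrics named for save_classmetrics (accuracy, sensitivity and specificity) follow from the counts of a binary confusion matrix. A minimal sketch is given below; the function `class_metrics` and its inputs (lists of 0/1 labels) are hypothetical and do not reproduce the signature of the method above.

```python
def class_metrics(predictions, targets):
    """Return ACC, TPR and TNR for binary predictions and targets (0 or 1)."""
    pairs = list(zip(predictions, targets))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)

    def ratio(numerator, denominator):
        # A metric is undefined when its denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),  # accuracy
        "TPR": ratio(tp, tp + fn),                 # sensitivity
        "TNR": ratio(tn, tn + fp),                 # specificity
    }
```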

static _get_binclass_prediction(data)[source]¶

_export_epoch_hdf5(epoch, data)[source]¶

Export the epoch data to the hdf5 file.

Export the data of a given epoch in train/valid/test group. In each group are stored the predicted values (outputs), ground truth (targets) and molecule name (mol).

Parameters:	epoch (int) – index of the epoch data (dict) – data of the epoch

modelGenerator: generate NN architecture¶

class deeprank.learn.modelGenerator.NetworkGenerator(name='_tmp_model_', fname='_tmp_model_.py', conv_layers=None, fc_layers=None)[source]¶

Automatic generation of NN files.

This class automatically generates a Python file containing the definition of a neural network in torch format.

Parameters:

name (str, optional) – name of the model in the python file
fname (str, optional) – name of the file containing the model
conv_layers (list(layers)) – list of convolutional layers
fc_layers (list(layers)) – list of fully connected layers

Example

>>> from deeprank.learn.modelGenerator import *
>>> conv_layers = []
>>> conv_layers.append(conv(output_size=4,kernel_size=2,post='relu'))
>>> conv_layers.append(pool(kernel_size=2))
>>> conv_layers.append(conv(input_size=4,output_size=5,kernel_size=2,post='relu'))
>>> conv_layers.append(pool(kernel_size=2))
>>>
>>> fc_layers = []
>>> fc_layers.append(fc(output_size=84,post='relu'))
>>> fc_layers.append(fc(input_size=84,output_size=1))
>>>
>>> MG = NetworkGenerator(name='test',fname='model_test.py',conv_layers=conv_layers,fc_layers=fc_layers)
>>> MG.print()
>>> MG.write()

write()[source]¶: Write the model to file.

static _write_import(fhandle)[source]¶

_write_definition(fhandle)[source]¶

_write_init(fhandle)[source]¶

static _write_conv_output(fhandle)[source]¶

_write_forward_feature(fhandle)[source]¶

_write_forward(fhandle)[source]¶

print()[source]¶: Print the model to screen.

get_new_random_model()[source]¶: Get a new Random Model.

_init_conv_layer_random(ilayer)[source]¶

_init_fc_layer_random(ilayer)[source]¶

class deeprank.learn.modelGenerator.conv(input_size=-1, output_size=None, kernel_size=None, post=None)[source]¶

Wrapper around the convolutional layer.

Parameters:

input_size (int, optional) – input size (default, let the generator figure it out)
output_size (int, optional) – output size
kernel_size (int, optional) – kernel size
post (str, optional) – post process of the data

Example:

>>> conv_layers.append(conv(output_size=4,kernel_size=2,post='relu'))

class deeprank.learn.modelGenerator.pool(kernel_size=None, post=None)[source]¶

Wrapper around the pool layer.

Parameters:

kernel_size (int, optional) – kernel size
post (str, optional) – post process of the data

Example:

>>> conv_layers.append(pool(kernel_size=2))

class deeprank.learn.modelGenerator.dropout(percent=0.5)[source]¶

Wrapper around the dropout layer.

Parameters:	percent (float) – percent of dropout

Example:

>>> fc_layers.append(dropout(percent=0.25))

class deeprank.learn.modelGenerator.fc(input_size=-1, output_size=None, post=None)[source]¶

Wrapper around the fully connected layer.

Parameters:

input_size (int, optional) – input size (default, let the generator figure it out)
output_size (int, optional) – output size
post (str, optional) – post process of the data

Example:

>>> fc_layers.append(fc(output_size=84,post='relu'))
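
The default input_size=-1 of conv and fc is described as letting the generator figure out the size. One plausible reading is that each layer inherits the output size of the layer before it. The sketch below illustrates that idea on plain dictionaries. The function `resolve_input_sizes`, the dictionary layout and the rule itself are assumptions for illustration, not the NetworkGenerator implementation.

```python
def resolve_input_sizes(layers, first_input_size):
    """Fill in input sizes left at -1 by chaining the layers.

    layers -- list of dicts with 'input_size' and 'output_size' keys;
              an output_size of None marks a layer (e.g. pooling) that
              leaves the number of channels unchanged.
    """
    resolved = []
    previous_output = first_input_size
    for layer in layers:
        input_size = layer["input_size"]
        if input_size == -1:
            input_size = previous_output  # inherit from the previous layer
        resolved.append({**layer, "input_size": input_size})
        if layer["output_size"] is not None:
            previous_output = layer["output_size"]
    return resolved
```

Under this reading, a first layer left at -1 takes the number of input channels of the data, and a layer following a pooling step sees the output size of the last layer that changed it.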