Data Generation

This module contains all the tools to compute the features and targets and to map the features onto a grid of points. The main class used for data generation is deeprank.generate.DataGenerator. Through this class you can specify the molecules to consider, the features and targets to compute, and the way the features are mapped onto the grid. The data is stored in a single HDF5 file. In this file, each conformation has its own group that contains all the information related to that conformation: the pdb data, the feature values (in human-readable format and in xyz-val format), the target values, the grid points and the features mapped on the grid.

At the moment a number of features are already implemented. These include:

  • Atomic densities
  • Coulomb & van der Waals interactions
  • Atomic charges
  • PSSM data
  • Information content
  • Buried surface area
  • Contact Residue Densities

More features can easily be implemented and integrated into the data generation workflow. You can see an example here. The calculation of a number of target values has also been implemented:

  • i-RMSD
  • l-RMSD
  • FNAT
  • DockQ
  • binary class

Here as well, new targets can be implemented and integrated into the workflow.
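To make the relation between the targets concrete, the sketch below combines FNAT, i-RMSD and l-RMSD into a DockQ score using the published DockQ definition (scaling constants 1.5 Å and 8.5 Å). The function name dockq is chosen for illustration only, and whether deeprank.targets.dockQ evaluates exactly this expression is an assumption.

```python
def dockq(fnat, irmsd, lrmsd, d1=1.5, d2=8.5):
    """Combine FNAT, i-RMSD and l-RMSD (both in Angstrom) into a score in [0, 1].

    Illustrative helper, not part of the deeprank API.
    """
    # Each RMSD is mapped to (0, 1]: 1 for a perfect match, towards 0 for large deviations.
    scaled_irmsd = 1.0 / (1.0 + (irmsd / d1) ** 2)
    scaled_lrmsd = 1.0 / (1.0 + (lrmsd / d2) ** 2)
    # DockQ is the plain average of the three quality measures.
    return (fnat + scaled_irmsd + scaled_lrmsd) / 3.0


perfect = dockq(fnat=1.0, irmsd=0.0, lrmsd=0.0)
poor = dockq(fnat=0.1, irmsd=6.0, lrmsd=20.0)
print(perfect, poor)
```

A decoy identical to the native structure scores 1, while a decoy with few native contacts and large RMSD values scores close to 0.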

Normalization of the data can be time consuming as the dataset becomes large. To alleviate this problem, the class deeprank.generate.NormalizeData has been created. This class directly computes and stores the standard deviation and mean value of each feature within a given hdf5 file.

Example:

>>> from deeprank.generate import *
>>> from time import time
>>>
>>> pdb_source     = ['./1AK4/decoys/']
>>> pdb_native     = ['./1AK4/native/']
>>> pssm_source    = './1AK4/pssm_new/'
>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source,pdb_native=pdb_native,pssm_source=pssm_source,
>>>                          data_augmentation=None,
>>>                          compute_targets  = ['deeprank.targets.dockQ'],
>>>                          compute_features = ['deeprank.features.AtomicFeature',
>>>                                              'deeprank.features.FullPSSM',
>>>                                              'deeprank.features.PSSM_IC',
>>>                                              'deeprank.features.BSA'],
>>>                          hdf5=h5file)
>>>
>>> t0 = time()
>>> #create new files
>>> database.create_database(prog_bar=True)
>>>
>>> # map the features
>>> grid_info = {
>>>     'number_of_points': [30,30,30],
>>>     'resolution': [1.,1.,1.],
>>>     'atomic_densities': {'CA':3.5,'N':3.5,'O':3.5,'C':3.5},
>>> }
>>> database.map_features(grid_info,try_sparse=True,time=False,prog_bar=True)
>>>
>>> # add a new target
>>> database.add_target(prog_bar=True)
>>> print(' '*25 + '--> Done in %f s.' %(time()-t0))
>>>
>>> # get the normalization
>>> norm = NormalizeData(h5file)
>>> norm.get()

The details of the different submodules are listed here. The only classes that really need to be used are DataGenerator and NormalizeData. The GridTools class should not be used directly by inexperienced users.

Structure Alignment

All the complexes contained in the dataset can be aligned in the same way to facilitate and improve the training of the model. This can easily be done using the align option of the DataGenerator. For example, to align all the complexes along the 'z' direction one can use:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"axis":'z'}, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

Other options are possible. For example, if you would like the alignment to be done using only a subpart of the complex, say the chains A and B, you can use:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"axis":'z', "selection": {"chainID":["A","B"]} }, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

All the selections offered by pdb2sql can be used in the align dictionary, e.g.: "resId":[1,2,3], "resName":['VAL','LEU'], … Only the selected atoms will be aligned in the given direction.
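The idea behind such a selection dictionary can be sketched in a few lines: each key names an atom field and each value lists the accepted entries, and an atom is kept only if it matches every key. The helper select_atoms and the atom records below are hypothetical and only mimic the behaviour; they are not the pdb2sql implementation.

```python
def select_atoms(atoms, **selection):
    """Return the atoms whose fields match every keyword in `selection`.

    Hypothetical helper: each keyword maps a field name to a list of accepted values.
    """
    return [
        atom for atom in atoms
        if all(atom[key] in allowed for key, allowed in selection.items())
    ]


# Toy atom records, invented for the illustration.
atoms = [
    {"chainID": "A", "resName": "VAL", "name": "CA"},
    {"chainID": "B", "resName": "LEU", "name": "CA"},
    {"chainID": "C", "resName": "GLY", "name": "CA"},
]

# Same shape as the "selection" entry of the align dictionary.
subset = select_atoms(atoms, chainID=["A", "B"])
print([atom["chainID"] for atom in subset])
```

Several keys can be combined, in which case all conditions must hold for an atom to be selected.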

You can also try to align the interface between two chains in a given plane. This can be done using:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"plane":'xy', "selection":"interface"}, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

By default this uses the interface between the first two chains. If you have more than two chains in the complex and want to specify which chains form the interface to be aligned, you can use:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"plane":'xy', "selection":"interface", "chain1":'A', "chain2":'C'}, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

DataGenerator

deeprank.generate.DataGenerator._printif(string, cond)[source]
class deeprank.generate.DataGenerator.DataGenerator(chain1, chain2, pdb_select=None, pdb_source=None, pdb_native=None, pssm_source=None, align=None, compute_targets=None, compute_features=None, data_augmentation=None, hdf5='database.h5', mpi_comm=None)[source]

Bases: object

Generate the data (features/targets/maps) required for deeprank.

Parameters:
  • chain1 (str) – First chain ID
  • chain2 (str) – Second chain ID
  • pdb_select (list(str), optional) – List of individual conformations for mapping
  • pdb_source (list(str), optional) – List of folders where to find the pdbs for mapping
  • pdb_native (list(str), optional) – List of folders where to find the native conformations. Must be set if "compute_targets" lists targets to compute.
  • pssm_source (list(str), optional) – List of folders where to find the PSSM files
  • align (dict, optional) – Dictionary to align the complexes, e.g. align = {"selection":{"chainID":["A","B"]},"axis":"z"} or align = {"selection":"interface","plane":"xy"}. If "selection" is not specified, the entire complex is used for alignment.
  • compute_targets (list(str), optional) – List of python files computing the targets. "pdb_native" must be set if there are targets to compute.
  • compute_features (list(str), optional) – List of python files computing the features
  • data_augmentation (int, optional) – Number of rotations performed on each complex
  • hdf5 (str, optional) – name of the hdf5 file where the data is saved, defaults to 'database.h5'
  • mpi_comm (MPI_COMM) – MPI COMMUNICATOR
Raises:

NotADirectoryError – if the sources are not found

Example

>>> from deeprank.generate import *
>>> # sources to assemble the data base
>>> pdb_source     = ['./1AK4/decoys/']
>>> pdb_native     = ['./1AK4/native/']
>>> pssm_source    = ['./1AK4/pssm_new/']
>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C',
>>>                          chain2='D',
>>>                          pdb_source=pdb_source,
>>>                          pdb_native=pdb_native,
>>>                          pssm_source=pssm_source,
>>>                          data_augmentation=None,
>>>                          compute_targets=['deeprank.targets.dockQ'],
>>>                          compute_features=['deeprank.features.AtomicFeature',
>>>                                            'deeprank.features.PSSM_IC',
>>>                                            'deeprank.features.BSA'],
>>>                          hdf5=h5file)
create_database(verbose=False, remove_error=True, prog_bar=False, contact_distance=8.5, random_seed=None)[source]

Create the hdf5 file architecture and compute the features/targets.

Parameters:
  • verbose (bool, optional) – Print creation details
  • remove_error (bool, optional) – remove the groups that errored
  • prog_bar (bool, optional) – use tqdm
  • contact_distance (float) – contact distance cutoff, defaults to 8.5Å
  • random_seed (int) – random seed for getting rotation axis and angle
Raises:

ValueError – If creation of the group errored.

Example:

>>> # sources to assemble the data base
>>> pdb_source     = ['./1AK4/decoys/']
>>> pdb_native     = ['./1AK4/native/']
>>> pssm_source    = ['./1AK4/pssm_new/']
>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C',
>>>                          chain2='D',
>>>                          pdb_source=pdb_source,
>>>                          pdb_native=pdb_native,
>>>                          pssm_source=pssm_source,
>>>                          data_augmentation=None,
>>>                          compute_targets  = ['deeprank.targets.dockQ'],
>>>                          compute_features = ['deeprank.features.AtomicFeature',
>>>                                              'deeprank.features.PSSM_IC',
>>>                                              'deeprank.features.BSA'],
>>>                          hdf5=h5file)
>>>
>>> #create new files
>>> database.create_database(prog_bar=True)
aug_data(augmentation, keep_existing_aug=True, random_seed=None)[source]

Augment existing original PDB data and features.

Parameters:
  • augmentation (int) – Number of augmentations
  • keep_existing_aug (bool, optional) – Keep existing augmented data. If False, existing augmented data will be removed. Defaults to True.

Examples

>>> database = DataGenerator(chain1='C', chain2='D', hdf5='database.h5')
>>> database.aug_data(augmentation=3, keep_existing_aug=True)
>>> grid_info = {
>>>     'number_of_points': [30,30,30],
>>>     'resolution': [1.,1.,1.],
>>>     'atomic_densities': {'C':1.7, 'N':1.55, 'O':1.52, 'S':1.8},
>>>     }
>>> database.map_features(grid_info)
add_feature(remove_error=True, prog_bar=True)[source]

Add a feature to an existing hdf5 file.

Parameters:
  • remove_error (bool) – remove errored molecule
  • prog_bar (bool, optional) – use tqdm

Example:

>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D',
>>>                          compute_features  = ['deeprank.features.ResidueDensity'],
>>>                          hdf5=h5file)
>>>
>>> database.add_feature(remove_error=True, prog_bar=True)
add_unique_target(targdict)[source]

Add identical targets for all the complexes in the datafile.

This is useful if you want to add the binary class of all the complexes created from decoys or natives.

Parameters:targdict (dict) – Example: {‘DOCKQ’:1.0}
>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>> database.add_unique_target({'DOCKQ':1.0})
add_target(prog_bar=False)[source]

Add a target to an existing hdf5 file.

Parameters:prog_bar (bool, optional) – Use tqdm

Example:

>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D',
>>>                          compute_targets =['deeprank.targets.binary_class'],
>>>                          hdf5=h5file)
>>>
>>> database.add_target(prog_bar=True)
realign_complexes(align, compute_features=None, pssm_source=None)[source]

Align all the complexes already present in the HDF5.

Parameters:
  • align (dict) – alignment dictionary
  • compute_features (list, optional) – list of features to be computed. If None, the features specified in the attrs['features'] of the file are computed (if present).
  • pssm_source (str, optional) – path of the PSSM files. If None, the source specified in the attrs['pssm_source'] will be used (if present). Defaults to None.
Raises:ValueError – If no PSSM detected

Example:

>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>> # if compute_features and pssm_source are not specified
>>> # the values in hdf5.attrs['features'] and hdf5.attrs['pssm_source'] will be used
>>> database.realign_complexes(align={'axis':'x'},
>>>                            compute_features=['deeprank.features.X'],
>>>                            pssm_source='./1ak4_pssm/')
_get_grid_center(pdb, contact_distance)[source]
precompute_grid(grid_info, contact_distance=8.5, prog_bar=False, time=False, try_sparse=True)[source]
map_features(grid_info={}, cuda=False, gpu_block=None, cuda_kernel='kernel_map.c', cuda_func_name='gaussian', try_sparse=True, reset=False, use_tmpdir=False, time=False, prog_bar=True, grid_prog_bar=False, remove_error=True)[source]

Map the feature on a grid of points centered at the interface.

If the features to map are not given, they will be automatically determined for each molecule. Otherwise, the given features will be mapped for all molecules (i.e. existing mapped features will be recalculated).

Parameters:
  • grid_info (dict) – Information for the grid. See deeprank.generate.GridTools.py for details.
  • cuda (bool, optional) – Use CUDA
  • gpu_block (None, optional) – GPU block size to be used
  • cuda_kernel (str, optional) – filename containing CUDA kernel
  • cuda_func_name (str, optional) – The name of the function in the kernel
  • try_sparse (bool, optional) – Try to save the grids as sparse format
  • reset (bool, optional) – remove grids if some are already present
  • use_tmpdir (bool, optional) – use a scratch directory
  • time (bool, optional) – time the mapping process
  • prog_bar (bool, optional) – use tqdm for each molecule
  • grid_prog_bar (bool, optional) – use tqdm for each grid
  • remove_error (bool, optional) – remove the data that errored

Example:

>>> #init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>>
>>> # map the features
>>> grid_info = {
>>>     'number_of_points': [30,30,30],
>>>     'resolution': [1.,1.,1.],
>>>     'atomic_densities': {'C':1.7, 'N':1.55, 'O':1.52, 'S':1.8},
>>> }
>>>
>>> database.map_features(grid_info,try_sparse=True,time=False,prog_bar=True)
remove(feature=True, pdb=True, points=True, grid=False)[source]

Remove data from the data set.

Equivalent to the cleandata command line tool. Once the data has been removed from the file, it is impossible to add new features/targets.

Parameters:
  • feature (bool, optional) – Remove the features
  • pdb (bool, optional) – Remove the pdbs
  • points (bool, optional) – remove the grid points
  • grid (bool, optional) – remove the maps
_tune_cuda_kernel(grid_info, cuda_kernel='kernel_map.c', func='gaussian')[source]

Tune the CUDA kernel using the kernel tuner http://benvanwerkhoven.github.io/kernel_tuner/

Parameters:
  • grid_info (dict) – information for the grid definition
  • cuda_kernel (str, optional) – file containing the kernel
  • func (str, optional) – function in the kernel to be used
Raises:

ValueError – If the tuner has not been used

_test_cuda(grid_info, gpu_block=8, cuda_kernel='kernel_map.c', func='gaussian')[source]

Test the CUDA kernel.

Parameters:
  • grid_info (dict) – Information for the grid definition
  • gpu_block (int, optional) – GPU block size to be used
  • cuda_kernel (str, optional) – File containing the kernel
  • func (str, optional) – function in the kernel to be used
Raises:

ValueError – If the kernel has not been installed

static _compile_cuda_kernel(cuda_kernel, npts, res)[source]

Compile the cuda kernel.

Parameters:
  • cuda_kernel (str) – filename
  • npts (tuple(int)) – number of grid points in each direction
  • res (tuple(float)) – resolution in each direction
Returns:

compiled kernel

Return type:

compiler.SourceModule

static _get_cuda_function(module, func_name)[source]

Get a single function from the compiled kernel.

Parameters:
  • module (compiler.SourceModule) – compiled kernel module
  • func_name (str) – Name of the function
Returns:

cuda function

Return type:

func

static _tunable_kernel(kernel)[source]

Make a tunable kernel.

Parameters:kernel (str) – String of the kernel
Returns:tunable kernel
Return type:TYPE
_filter_cplx()[source]

Filter the name of the complexes.

static _compute_features(feat_list, pdb_data, featgrp, featgrp_raw, chain1, chain2, logger)[source]

Compute the features.

Parameters:
  • feat_list (list(str)) – list of function name, e.g., [‘deeprank.features.ResidueDensity’, ‘deeprank.features.PSSM_IC’]
  • pdb_data (bytes) – PDB translated in bytes
  • featgrp (str) – name of the group where to store the xyz feature
  • featgrp_raw (str) – name of the group where to store the raw feature
  • chain1 (str) – First chain ID
  • chain2 (str) – Second chain ID
  • logger (logger) – name of logger object
Returns:

error happened or not

Return type:

bool

static _compute_targets(targ_list, pdb_data, targrp)[source]

Compute the targets.

Parameters:
  • targ_list (list(str)) – list of function name
  • pdb_data (bytes) – PDB translated in bytes
  • targrp (str) – name of the group where to store the targets
  • logger (logger) – name of logger object
_add_pdb(molgrp, pdbfile, name)[source]

Add a pdb to a molgrp.

Parameters:
  • molgrp (str) – molecule group where to add the pdb
  • pdbfile (str) – pdb file to add
  • name (str) – dataset name in the hdf5 molgroup
_get_aligned_sqldb(pdbfile, dict_align)[source]

Return a sqldb of the pdb that is aligned as specified in the dict.

Parameters:
  • pdbfile (str) – path to the pdb
  • dict_align (dict) – dictionary of options to align the pdb
static _get_aligned_rotation_axis_angle(random_seed, dict_align)[source]

Return the axis and angle of rotation for data augmentation with aligned complexes.

Parameters:
  • random_seed (int) – random seed of the rotation
  • dict_align (dict) – the dict describing the alignment
Returns:

axis of rotation (list(float)) and angle of rotation (float)

Return type:

list(float), float
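A minimal sketch of what such a function could look like is given below. It assumes that, to keep an aligned complex aligned, the rotation axis is the alignment axis itself (or the normal of the alignment plane) and that only the angle is drawn at random. The function name and this rule are assumptions made for illustration; the actual DeepRank implementation may differ.

```python
import math
import random


def aligned_rotation_axis_angle(random_seed, dict_align):
    """Pick a rotation that preserves the alignment (illustrative sketch)."""
    rng = random.Random(random_seed)  # local generator, reproducible for a given seed
    unit = {"x": [1.0, 0.0, 0.0], "y": [0.0, 1.0, 0.0], "z": [0.0, 0.0, 1.0]}

    if "axis" in dict_align:
        # Rotating around the alignment axis leaves that axis unchanged.
        name = dict_align["axis"]
    elif "plane" in dict_align:
        # The plane normal is the one axis that is not named in the plane.
        name = (set("xyz") - set(dict_align["plane"])).pop()
    else:
        raise ValueError("dict_align needs an 'axis' or a 'plane' key")

    angle = rng.uniform(0.0, 2.0 * math.pi)
    return list(unit[name]), angle


axis, angle = aligned_rotation_axis_angle(42, {"axis": "z"})
print(axis, angle)
```

Calling the function twice with the same seed returns the same angle, which makes the augmentation reproducible.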

_add_aug_pdb(molgrp, pdbfile, name, axis, angle)[source]

Add augmented pdbs to the dataset.

Parameters:
  • molgrp (str) – name of the molgroup
  • pdbfile (str) – pdb file name
  • name (str) – name of the dataset
  • axis (list(float)) – axis of rotation
  • angle (float) – angle of rotation
  • dict_align (dict) – dict for alignment of the original pdb
Returns:

center of the molecule

Return type:

list(float)

static _rotate_feature(molgrp, axis, angle, center, feat_name='all')[source]

Rotate the raw feature values.

Parameters:
  • molgrp (str) – name of the molgrp
  • axis (list(float)) – axis of rotation
  • angle (float) – angle of rotation
  • center (list(float)) – center of rotation
  • feat_name (str) – name of the feature to rotate or ‘all’
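
The rotation of a feature position around an axis passing through a center can be written with Rodrigues' rotation formula. The sketch below is a standalone illustration of that geometry for a single point; the function name rotate_about_axis is hypothetical and the code is not taken from DeepRank.

```python
import math


def rotate_about_axis(point, axis, angle, center):
    """Rotate `point` by `angle` (radians) around `axis` passing through `center`."""
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)                # unit rotation axis k
    vx, vy, vz = (p - c for p, c in zip(point, center))  # position relative to the center
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    dot = kx * vx + ky * vy + kz * vz                    # k . v
    # Cross product k x v
    cx = ky * vz - kz * vy
    cy = kz * vx - kx * vz
    cz = kx * vy - ky * vx
    # Rodrigues: v cos(a) + (k x v) sin(a) + k (k . v)(1 - cos(a))
    rotated = (
        vx * cos_a + cx * sin_a + kx * dot * (1.0 - cos_a),
        vy * cos_a + cy * sin_a + ky * dot * (1.0 - cos_a),
        vz * cos_a + cz * sin_a + kz * dot * (1.0 - cos_a),
    )
    # Shift back to the original frame
    return [r + c for r, c in zip(rotated, center)]


new_xyz = rotate_about_axis([2.0, 1.0, 0.0], axis=[0.0, 0.0, 1.0],
                            angle=math.pi / 2, center=[1.0, 1.0, 0.0])
print(new_xyz)
```

Only the xyz part of a feature needs to be rotated; the feature values themselves are unchanged by the rotation.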

NormalizeData

class deeprank.generate.NormalizeData.NormalizeData(fname, shape=None)[source]

Compute the normalization factor for the features and targets of a given HDF5 file.

The normalization of the features is done through the NormParam class, which assumes a Gaussian distribution. Hence the normalized data should be normally distributed with zero mean and unit standard deviation. The normalization of the targets is done via a min/max normalization. As a result the normalized targets should all lie between 0 and 1. By default the output file containing the normalization dictionary is called <hdf5name>_norm.pckl

Parameters:
  • fname (str) – name of the hdf5 file
  • shape (tuple(int), optional) – shape of the grid in the hdf5 file

Example

>>> norm = NormalizeData('1ak4.hdf5')
>>> norm.get()
get()[source]

Get the normalization and write them to file.

_load()[source]

Load data from already existing normalization file.

_extract_shape()[source]

Get the shape of the data in the hdf5 file.

_extract_data()[source]

Extract the data from the different maps.

_process_data()[source]

Compute the standard deviation of the data.

_export_data()[source]

Pickle the data to file.

class deeprank.generate.NormalizeData.NormParam(std=0, mean=0, var=0, sqmean=0)[source]

Compute gaussian normalization for a given feature.

This class extracts the standard deviation, mean value, variance and mean of the squared mean values of a mapped feature stored in the hdf5 file. As the entire data set is too large to fit in memory, the standard deviation of a given feature is calculated from the mean and variance of all the individual grids. This is done following: https://stats.stackexchange.com/questions/25848/how-to-sum-a-standard-deviation:

\[\sigma_{tot}=\sqrt{\frac{1}{N}\sum_i \sigma_i^2+\frac{1}{N}\sum_i\mu_i^2-(\frac{1}{N}\sum_i\mu_i)^2}\]
Parameters:
  • std (float, optional) – standard deviation
  • mean (float, optional) – mean value
  • var (float, optional) – variance
  • sqmean (float, optional) – mean of the squared mean values (the second term under the square root above)
add(mean, var)[source]

Add the mean value, sqmean and variance of a new molecule to the corresponding attributes.

process(n)[source]

Compute the standard deviation of the ensemble.
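
The formula given for NormParam can be checked numerically. The sketch below computes the total standard deviation from the per-grid means and variances only, and compares it with the standard deviation of all the values taken together. It assumes that every grid contains the same number of points and that population (not sample) variances are used; the function name pooled_std and the toy data are illustrative only.

```python
import math
import statistics


def pooled_std(means, variances):
    """Total standard deviation from per-grid means and population variances."""
    n = len(means)
    mean_of_var = sum(variances) / n             # (1/N) sum_i sigma_i^2
    mean_of_sq = sum(m * m for m in means) / n   # (1/N) sum_i mu_i^2
    sq_of_mean = (sum(means) / n) ** 2           # ((1/N) sum_i mu_i)^2
    return math.sqrt(mean_of_var + mean_of_sq - sq_of_mean)


# Three toy "grids" of equal size standing in for mapped features.
grids = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.0, 0.0, 1.0, 1.0]]
means = [statistics.mean(g) for g in grids]
variances = [statistics.pvariance(g) for g in grids]

sigma_tot = pooled_std(means, variances)
sigma_ref = statistics.pstdev([value for g in grids for value in g])
print(sigma_tot, sigma_ref)
```

Only three running sums are needed per feature, so the grids never have to be held in memory at the same time.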

class deeprank.generate.NormalizeData.MinMaxParam(minv=None, maxv=None)[source]

Compute the min/max of an ensemble of data.

This is principally used to normalize the target values.

Parameters:
  • minv (float, optional) – minimal value
  • maxv (float, optional) – maximal value
update(val)[source]
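
As a rough illustration of min/max normalization, the sketch below keeps a running minimum and maximum and rescales values to the range [0, 1]. The class RunningMinMax and its normalize method are hypothetical and written for this example; they only mirror the idea behind MinMaxParam and are not its implementation.

```python
class RunningMinMax:
    """Track the smallest and largest value seen so far (illustrative sketch)."""

    def __init__(self, minv=None, maxv=None):
        self.min = minv
        self.max = maxv

    def update(self, val):
        # The first value initializes both bounds; later values can only widen them.
        self.min = val if self.min is None else min(self.min, val)
        self.max = val if self.max is None else max(self.max, val)

    def normalize(self, val):
        # Map [min, max] linearly onto [0, 1].
        if self.min is None or self.max == self.min:
            raise ValueError("need at least two distinct values to normalize")
        return (val - self.min) / (self.max - self.min)


targets = [0.12, 0.85, 0.40, 0.03]  # toy target values
param = RunningMinMax()
for value in targets:
    param.update(value)

scaled = [param.normalize(value) for value in targets]
print(param.min, param.max, scaled)
```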

GridTools

deeprank.generate.GridTools.logif(string, cond)[source]
class deeprank.generate.GridTools.GridTools(molgrp, chain1, chain2, number_of_points=30, resolution=1.0, atomic_densities=None, atomic_densities_mode='ind', feature=None, feature_mode='ind', contact_distance=8.5, cuda=False, gpu_block=None, cuda_func=None, cuda_atomic=None, prog_bar=False, time=False, try_sparse=True)[source]

Map the feature of a complex on the grid.

Parameters:
  • molgrp (str) – name of the group of the molecule in the HDF5 file.
  • chain1 (str) – First chain ID.
  • chain2 (str) – Second chain ID.
  • number_of_points (int, optional) – number of points we want in each direction of the grid.
  • resolution (float, optional) – distance (in Å) between two points.
  • atomic_densities (dict, optional) – dictionary of element types with their vdw radius, see deeprank.config.atom_vdw_radius_noH
  • atomic_densities_mode (str, optional) – Mode for mapping (deprecated, must be 'ind').
  • feature (None, optional) – Name of the features to be mapped. By default all the features present in hdf5_file['<molgrp>/features/'] will be mapped.
  • feature_mode (str, optional) – Mode for mapping (deprecated, must be 'ind').
  • contact_distance (float, optional) – the maximum distance between two contact atoms, default 8.5Å.
  • cuda (bool, optional) – Use CUDA or not.
  • gpu_block (tuple(int), optional) – GPU block size to use.
  • cuda_func (None, optional) – Name of the CUDA function to be used for the mapping of the features. Must be present in kernel_cuda.c.
  • cuda_atomic (None, optional) – Name of the CUDA function to be used for the mapping of the atomic densities. Must be present in kernel_cuda.c.
  • prog_bar (bool, optional) – print progression bar for individual grid (default False).
  • time (bool, optional) – print timing statistic for individual grid (default False).
  • try_sparse (bool, optional) – Try to store the matrix in sparse format (default True).
create_new_data()[source]

Create new feature for a given complex.

update_feature()[source]

Update existing feature in a complex.

read_pdb()[source]

Create a sql database for the pdb.

get_contact_center()[source]

Get the center of contact atoms.
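
The notion of a contact center can be sketched as follows: atoms of the two chains are in contact when they lie within contact_distance of each other, and the center is the mean position of those atoms. The function below is a standalone illustration that works on plain coordinate lists; the exact atom selection and distance comparison used by GridTools are assumptions.

```python
import math


def contact_center(chain1_xyz, chain2_xyz, contact_distance=8.5):
    """Mean position of the atoms of both chains that are within the cutoff."""
    in_contact_1, in_contact_2 = set(), set()
    for i, a in enumerate(chain1_xyz):
        for j, b in enumerate(chain2_xyz):
            if math.dist(a, b) <= contact_distance:  # requires Python 3.8+
                in_contact_1.add(i)
                in_contact_2.add(j)

    coords = [chain1_xyz[i] for i in sorted(in_contact_1)]
    coords += [chain2_xyz[j] for j in sorted(in_contact_2)]
    if not coords:
        raise ValueError("no atom pair within the contact distance")
    return [sum(p[d] for p in coords) / len(coords) for d in range(3)]


# Toy coordinates: only the first atom of each chain is within 8.5 Angstrom.
chain_c = [(0.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
chain_d = [(4.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
center = contact_center(chain_c, chain_d)
print(center)
```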

add_all_features()[source]

Add all the features to a given molecule.

add_all_atomic_densities()[source]

Add all atomic densities.

define_grid_points()[source]

Define the grid points.
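
To illustrate how number_of_points and resolution determine the grid, the sketch below builds the coordinates along each direction for a grid placed symmetrically around a center. The symmetric placement and the function name grid_axes are assumptions made for the example; GridTools may position the points slightly differently.

```python
def grid_axes(center, number_of_points, resolution):
    """Return the point coordinates along x, y and z for a grid around `center`."""
    axes = []
    for c, n, res in zip(center, number_of_points, resolution):
        start = c - 0.5 * (n - 1) * res  # first point, so that the axis is centered on c
        axes.append([start + i * res for i in range(n)])
    return axes


# Same grid settings as in the grid_info examples: 30 points per direction, 1 Angstrom apart.
x, y, z = grid_axes(center=[10.0, 0.0, -5.0],
                    number_of_points=[30, 30, 30],
                    resolution=[1.0, 1.0, 1.0])
print(x[0], x[-1], len(x))
```

With these settings the grid spans 29 Å in each direction, and the full grid is the Cartesian product of the three axes.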

map_atomic_densities(only_contact=True)[source]

Map the atomic densities to the grid.

Parameters:only_contact (bool, optional) – Map only the contact atoms
Raises:ImportError – Description
densgrid(center, vdw_radius)[source]

Function to map individual atomic density on the grid.

The formula is equation (1) of the paper by Koes and co-workers, "Protein-Ligand Scoring with Convolutional Neural Networks", arXiv:1612.02751v1.

Parameters:
  • center (list(float)) – position of the atoms
  • vdw_radius (float) – vdw radius of the atom
Returns:

np.array (mapped density)

Return type:

TYPE
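
For reference, the density function of the cited paper can be written for a single distance as below: a Gaussian up to the van der Waals radius, a quadratic that goes smoothly to zero at 1.5 times the radius, and zero beyond. This scalar sketch follows the formula as published; the function name atomic_density is illustrative, and that densgrid applies exactly this expression to every grid point is an assumption.

```python
import math


def atomic_density(d, vdw_radius):
    """Density contribution of an atom at distance `d` (Angstrom) from a grid point."""
    r = vdw_radius
    if d < 0:
        raise ValueError("distance must be non-negative")
    if d < r:
        # Gaussian part inside the van der Waals radius
        return math.exp(-2.0 * d * d / (r * r))
    if d < 1.5 * r:
        # Quadratic part that reaches zero at d = 1.5 r
        e2 = math.e ** 2
        return 4.0 * d * d / (e2 * r * r) - 12.0 * d / (e2 * r) + 9.0 / e2
    return 0.0


# Carbon radius taken from the atomic_densities example: 1.7 Angstrom
for distance in (0.0, 1.0, 1.7, 2.0, 3.0):
    print(distance, atomic_density(distance, 1.7))
```

The two pieces take the same value, exp(-2), at d equal to the radius, so the mapped density has no jump there.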

map_features(featlist, transform=None)[source]

Map individual feature to the grid.

For residue-based features, the feature file must be of the format chainID residue_name(3-letter) residue_number [values]

For atom-based features, it must be chainID residue_name(3-letter) residue_number atom_name [values]
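
The two line formats can be illustrated with a small parser. The sketch below assumes whitespace-separated fields and numeric values; the function parse_feature_line is hypothetical and is not the reader used by GridTools.

```python
def parse_feature_line(line, atom_based=False):
    """Split one feature line into an identifying key and its list of values."""
    fields = line.split()
    n_key = 4 if atom_based else 3  # the atom name adds one identifying field
    if len(fields) <= n_key:
        raise ValueError("line has no feature value: %r" % line)

    key = (fields[0], fields[1], int(fields[2]))  # chainID, residue name, residue number
    if atom_based:
        key += (fields[3],)                       # atom name
    values = [float(v) for v in fields[n_key:]]
    return key, values


residue_entry = parse_feature_line("A VAL 12 0.5 1.25")
atom_entry = parse_feature_line("B LEU 7 CA -0.3", atom_based=True)
print(residue_entry)
print(atom_entry)
```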

Parameters:
  • featlist (list(str)) – list of features to be mapped
  • transform (callable, optional) – transformation of the feature (?)
Returns:

Mapped features

Return type:

np.array

Raises:
featgrid(center, value, type_='fast_gaussian')[source]

Map an individual feature (atomic or residue) on the grid.

Parameters:
  • center (list(float)) – position of the feature center
  • value (float) – value of the feature
  • type_ (str, optional) – method to map
Returns:

Mapped feature

Return type:

np.array

Raises:

ValueError – Description

export_grid_points()[source]

Export the grid points to the hdf5 file.

hdf5_grid_data(dict_data, data_name)[source]

Save the mapped feature to the hdf5 file.

Parameters:
  • dict_data (dict) – feature values stored as a dict
  • data_name (str) – feature name