Data Generation¶

This module contains all the tools to compute the features and targets and to map the features onto a grid of points. The main class used for data generation is deeprank.generate.DataGenerator. Through this class you can specify the molecules you want to consider, the features and targets that need to be computed, and the way the features are mapped onto the grid. The data is stored in a single HDF5 file. In this file, each conformation has its own group that contains all the information related to that conformation: the PDB data, the feature values (in human-readable format and in xyz-val format), the target values, the grid points, and the features mapped onto the grid.

A number of features are already implemented. These include:

Atomic densities

Coulomb & van der Waals interactions

Atomic charges

PSSM data

Information content

Buried surface area

Contact Residue Densities

More features can easily be implemented and integrated into the data generation workflow. You can see an example here. The calculation of a number of target values has also been implemented:

i-RMSD

l-RMSD

FNAT

DockQ

binary class

Here as well, new targets can be implemented and integrated into the workflow.
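To make the relation between the targets listed above concrete, here is a minimal sketch of how a DockQ score can be derived from FNAT, i-RMSD and l-RMSD. It assumes the commonly published DockQ definition with scaling distances of 1.5 Å and 8.5 Å; the function names scaled_rmsd and dockq are hypothetical and do not reflect the implementation in deeprank.targets.dockQ.

```python
def scaled_rmsd(rmsd, d):
    """Map an RMSD (in Angstrom) onto (0, 1]; 1 means a perfect match."""
    return 1.0 / (1.0 + (rmsd / d) ** 2)


def dockq(fnat, irmsd, lrmsd):
    """Average FNAT with the scaled i-RMSD and l-RMSD (assumed definition)."""
    # Assumed scaling distances: 1.5 A for i-RMSD and 8.5 A for l-RMSD.
    return (fnat + scaled_rmsd(irmsd, 1.5) + scaled_rmsd(lrmsd, 8.5)) / 3.0


# A decoy identical to the native structure gets the maximum score.
perfect = dockq(fnat=1.0, irmsd=0.0, lrmsd=0.0)
# A decoy with no native contacts and RMSDs equal to the scaling distances.
poor = dockq(fnat=0.0, irmsd=1.5, lrmsd=8.5)
```

Because every term lies between 0 and 1, the resulting score is bounded in the same way, which is convenient for a regression target.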

Normalizing the data can be time consuming as the dataset becomes large. To alleviate this problem, the class deeprank.generate.NormalizeData has been created. This class directly computes and stores the standard deviation and mean value of each feature within a given HDF5 file.

Example:

>>> from deeprank.generate import *
>>> from time import time
>>>
>>> pdb_source     = ['./1AK4/decoys/']
>>> pdb_native     = ['./1AK4/native/']
>>> pssm_source    = './1AK4/pssm_new/'
>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source,pdb_native=pdb_native,pssm_source=pssm_source,
>>>                          data_augmentation=None,
>>>                          compute_targets  = ['deeprank.targets.dockQ'],
>>>                          compute_features = ['deeprank.features.AtomicFeature',
>>>                                              'deeprank.features.FullPSSM',
>>>                                              'deeprank.features.PSSM_IC',
>>>                                              'deeprank.features.BSA'],
>>>                          hdf5=h5file)
>>>
>>> t0 = time()
>>> #create new files
>>> database.create_database(prog_bar=True)
>>>
>>> # map the features
>>> grid_info = {
>>>     'number_of_points': [30,30,30],
>>>     'resolution': [1.,1.,1.],
>>>     'atomic_densities': {'CA':3.5,'N':3.5,'O':3.5,'C':3.5},
>>> }
>>> database.map_features(grid_info,try_sparse=True,time=False,prog_bar=True)
>>>
>>> # add a new target
>>> database.add_target(prog_bar=True)
>>> print(' '*25 + '--> Done in %f s.' %(time()-t0))
>>>
>>> # get the normalization
>>> norm = NormalizeData(h5file)
>>> norm.get()

The details of the different submodules are listed here. The only classes that really need to be used are DataGenerator and NormalizeData. The GridTools class should not be used directly by inexperienced users.

Structure Alignment¶

All the complexes contained in the dataset can be aligned in the same way to facilitate and improve the training of the model. This can easily be done using the align option of the DataGenerator. For example, to align all the complexes along the 'z' direction one can use:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"axis":'z'}, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

Other options are possible. For example, if you would like the alignment to be done using only a subpart of the complex, say chains A and B, you can use:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"axis":'z', "selection": {"chainID":["A","B"]} }, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

All the selections offered by pdb2sql can be used in the align dictionary, e.g. "resId":[1,2,3], "resName":['VAL','LEU'], … Only the selected atoms will be aligned in the given direction.

You can also try to align the interface between two chains in a given plane. This can be done using:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"plane":'xy', "selection":"interface"}, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

which by default will use the interface between the first two chains. If you have more than two chains in the complex and want to specify which chains are forming the interface to be aligned you can use:

>>> database = DataGenerator(chain1='C',chain2='D',
>>>                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
>>>                          align={"plane":'xy', "selection":"interface", "chain1":'A', "chain2":'C'}, data_augmentation=2,
>>>                          compute_targets=[ ... ], compute_features=[ ... ], ... )

DataGenerator¶

deeprank.generate.DataGenerator._printif(string, cond)[source]¶

class deeprank.generate.DataGenerator.DataGenerator(chain1, chain2, pdb_select=None, pdb_source=None, pdb_native=None, pssm_source=None, align=None, compute_targets=None, compute_features=None, data_augmentation=None, hdf5='database.h5', mpi_comm=None)[source]¶

Bases: object

Generate the data (features/targets/maps) required for deeprank.

Parameters:

chain1 (str) – First chain ID
chain2 (str) – Second chain ID
pdb_select (list(str), optional) – List of individual conformations for mapping
pdb_source (list(str), optional) – List of folders where to find the pdbs for mapping
pdb_native (list(str), optional) – List of folders where to find the native conformations; it must be set if targets are to be computed through the “compute_targets” parameter.
pssm_source (list(str), optional) – List of folders where to find the PSSM files
align (dict, optional) – Dictionary to align the complexes, e.g. align = {"selection":{"chainID":["A","B"]},"axis":"z"} or align = {"selection":"interface","plane":"xy"}. If "selection" is not specified, the entire complex is used for the alignment.
compute_targets (list(str), optional) – List of python files computing the targets; “pdb_native” must be set if there are targets to compute.
compute_features (list(str), optional) – List of python files computing the features
data_augmentation (int, optional) – Number of rotations performed on each complex
hdf5 (str, optional) – name of the hdf5 file where the data is saved, defaults to ‘database.h5’
mpi_comm (MPI_COMM) – MPI COMMUNICATOR

Raises:

NotADirectoryError – if the sources are not found

Example

>>> from deeprank.generate import *
>>> # sources to assemble the data base
>>> pdb_source     = ['./1AK4/decoys/']
>>> pdb_native     = ['./1AK4/native/']
>>> pssm_source    = ['./1AK4/pssm_new/']
>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C',
>>>                          chain2='D',
>>>                          pdb_source=pdb_source,
>>>                          pdb_native=pdb_native,
>>>                          pssm_source=pssm_source,
>>>                          data_augmentation=None,
>>>                          compute_targets=['deeprank.targets.dockQ'],
>>>                          compute_features=['deeprank.features.AtomicFeature',
>>>                                            'deeprank.features.PSSM_IC',
>>>                                            'deeprank.features.BSA'],
>>>                          hdf5=h5file)

create_database(verbose=False, remove_error=True, prog_bar=False, contact_distance=8.5, random_seed=None)[source]¶

Create the hdf5 file architecture and compute the features/targets.

Parameters:

verbose (bool, optional) – Print creation details
remove_error (bool, optional) – remove the groups that errored
prog_bar (bool, optional) – use tqdm
contact_distance (float) – contact distance cutoff, defaults to 8.5 Å
random_seed (int) – random seed for getting rotation axis and angle
Raises:	`ValueError` – If creation of the group errored.

Example:

>>> # sources to assemble the data base
>>> pdb_source     = ['./1AK4/decoys/']
>>> pdb_native     = ['./1AK4/native/']
>>> pssm_source    = ['./1AK4/pssm_new/']
>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C',
>>>                          chain2='D',
>>>                          pdb_source=pdb_source,
>>>                          pdb_native=pdb_native,
>>>                          pssm_source=pssm_source,
>>>                          data_augmentation=None,
>>>                          compute_targets  = ['deeprank.targets.dockQ'],
>>>                          compute_features = ['deeprank.features.AtomicFeature',
>>>                                              'deeprank.features.PSSM_IC',
>>>                                              'deeprank.features.BSA'],
>>>                          hdf5=h5file)
>>>
>>> #create new files
>>> database.create_database(prog_bar=True)

aug_data(augmentation, keep_existing_aug=True, random_seed=None)[source]¶

Augment existing original PDB data and features.

Parameters:

augmentation (int) – Number of times the data is augmented
keep_existing_aug (bool, optional) – Keep existing augmented data. If False, existing augmented data will be removed. Defaults to True.

Examples

>>> database = DataGenerator(chain1='C', chain2='D', hdf5='database.h5')
>>> database.aug_data(augmentation=3, keep_existing_aug=True)
>>> grid_info = {
>>>     'number_of_points': [30,30,30],
>>>     'resolution': [1.,1.,1.],
>>>     'atomic_densities': {'C':1.7, 'N':1.55, 'O':1.52, 'S':1.8},
>>>     }
>>> database.map_features(grid_info)

add_feature(remove_error=True, prog_bar=True)[source]¶

Add a feature to an existing hdf5 file.

Parameters:

remove_error (bool) – remove errored molecule
prog_bar (bool, optional) – use tqdm

Example:

>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D',
>>>                          compute_features=['deeprank.features.ResidueDensity'],
>>>                          hdf5=h5file)
>>>
>>> database.add_feature(remove_error=True, prog_bar=True)

add_unique_target(targdict)[source]¶

Add identical targets for all the complexes in the datafile.

This is useful if you want to add the binary class of all the complexes created from decoys or natives.

Parameters:	targdict (dict) – Example: {'DOCKQ':1.0}

>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>> database.add_unique_target({'DOCKQ':1.0})

add_target(prog_bar=False)[source]¶

Add a target to an existing hdf5 file.

Parameters:	prog_bar (bool, optional) – Use tqdm

Example:

>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D',
>>>                          compute_targets=['deeprank.targets.binary_class'],
>>>                          hdf5=h5file)
>>>
>>> database.add_target(prog_bar=True)

realign_complexes(align, compute_features=None, pssm_source=None)[source]¶

Align all the complexes already present in the HDF5.

Parameters:

align (dict) – alignment dictionary
compute_features (list, optional) – list of features to be computed. If None, the features specified in attrs['features'] of the file are computed (if present).
pssm_source (str, optional) – path of the PSSM files. If None, the source specified in attrs['pssm_source'] is used (if present). Defaults to None.
Raises:	`ValueError` – If no PSSM detected

Example:

>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>> # if compute_features and pssm_source are not specified
>>> # the values in hdf5.attrs['features'] and hdf5.attrs['pssm_source'] will be used
>>> database.realign_complexes(align={'axis':'x'},
>>>                            compute_features=['deeprank.features.X'],
>>>                            pssm_source='./1ak4_pssm/')

_get_grid_center(pdb, contact_distance)[source]¶

precompute_grid(grid_info, contact_distance=8.5, prog_bar=False, time=False, try_sparse=True)[source]¶

map_features(grid_info={}, cuda=False, gpu_block=None, cuda_kernel='kernel_map.c', cuda_func_name='gaussian', try_sparse=True, reset=False, use_tmpdir=False, time=False, prog_bar=True, grid_prog_bar=False, remove_error=True)[source]¶

Map the feature on a grid of points centered at the interface.

If the features to map are not given, they are automatically determined for each molecule. Otherwise, the given features are mapped for all molecules (i.e. existing mapped features are recalculated).

Parameters:

grid_info (dict) – Information for the grid. See deeprank.generate.GridTools for details.
cuda (bool, optional) – Use CUDA
gpu_block (None, optional) – GPU block size to be used
cuda_kernel (str, optional) – filename containing CUDA kernel
cuda_func_name (str, optional) – The name of the function in the kernel
try_sparse (bool, optional) – Try to save the grids as sparse format
reset (bool, optional) – remove grids if some are already present
use_tmpdir (bool, optional) – use a scratch directory
time (bool, optional) – time the mapping process
prog_bar (bool, optional) – use tqdm for each molecule
grid_prog_bar (bool, optional) – use tqdm for each grid
remove_error (bool, optional) – remove the data that errored

Example:

>>> #init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>>
>>> # map the features
>>> grid_info = {
>>>     'number_of_points': [30,30,30],
>>>     'resolution': [1.,1.,1.],
>>>     'atomic_densities': {'C':1.7, 'N':1.55, 'O':1.52, 'S':1.8},
>>> }
>>>
>>> database.map_features(grid_info,try_sparse=True,time=False,prog_bar=True)

remove(feature=True, pdb=True, points=True, grid=False)[source]¶

Remove data from the data set.

Equivalent to the cleandata command line tool. Once the data has been removed from the file, it is impossible to add new features/targets.

Parameters:

feature (bool, optional) – Remove the features
pdb (bool, optional) – Remove the pdbs
points (bool, optional) – Remove the grid points
grid (bool, optional) – Remove the maps

_tune_cuda_kernel(grid_info, cuda_kernel='kernel_map.c', func='gaussian')[source]¶

Tune the CUDA kernel using the kernel tuner http://benvanwerkhoven.github.io/kernel_tuner/

Parameters:

grid_info (dict) – information for the grid definition
cuda_kernel (str, optional) – file containing the kernel
func (str, optional) – function in the kernel to be used
Raises:	`ValueError` – If the tuner has not been used

_test_cuda(grid_info, gpu_block=8, cuda_kernel='kernel_map.c', func='gaussian')[source]¶

Test the CUDA kernel.

Parameters:

grid_info (dict) – Information for the grid definition
gpu_block (int, optional) – GPU block size to be used
cuda_kernel (str, optional) – File containing the kernel
func (str, optional) – function in the kernel to be used
Raises:	`ValueError` – If the kernel has not been installed

static _compile_cuda_kernel(cuda_kernel, npts, res)[source]¶

Compile the cuda kernel.

Parameters:

cuda_kernel (str) – filename
npts (tuple(int)) – number of grid points in each direction
res (tuple(float)) – resolution in each direction
Returns:	compiled kernel
Return type:	compiler.SourceModule

static _get_cuda_function(module, func_name)[source]¶

Get a single function from the compiled kernel.

Parameters:

module (compiler.SourceModule) – compiled kernel module
func_name (str) – Name of the function
Returns:	cuda function
Return type:	func

static _tunable_kernel(kernel)[source]¶

Make a tunable kernel.

Parameters:	kernel (str) – String of the kernel
Returns:	tunable kernel
Return type:	TYPE

_filter_cplx()[source]¶: Filter the name of the complexes.

static _compute_features(feat_list, pdb_data, featgrp, featgrp_raw, chain1, chain2, logger)[source]¶

Compute the features.

Parameters:

feat_list (list(str)) – list of function names, e.g. ['deeprank.features.ResidueDensity', 'deeprank.features.PSSM_IC']
pdb_data (bytes) – PDB translated in bytes
featgrp (str) – name of the group where to store the xyz feature
featgrp_raw (str) – name of the group where to store the raw feature
chain1 (str) – First chain ID
chain2 (str) – Second chain ID
logger (logger) – name of logger object
Returns:	error happened or not
Return type:	bool

static _compute_targets(targ_list, pdb_data, targrp)[source]¶

Compute the targets.

Parameters:

targ_list (list(str)) – list of function names
pdb_data (bytes) – PDB translated in bytes
targrp (str) – name of the group where to store the targets
logger (logger) – name of logger object

_add_pdb(molgrp, pdbfile, name)[source]¶

Add a pdb to a molgrp.

Parameters:

molgrp (str) – mol group where to add the pdb
pdbfile (str) – pdb file to add
name (str) – dataset name in the hdf5 molgroup

_get_aligned_sqldb(pdbfile, dict_align)[source]¶

Return a sqldb of the pdb that is aligned as specified in the dict.

Parameters:

pdbfile (str) – path to the pdb
dict_align (dict) – dictionary of options to align the pdb

static _get_aligned_rotation_axis_angle(random_seed, dict_align)[source]¶

Return the axis and angle of rotation for data augmentation with aligned complexes.

Parameters:

random_seed (int) – random seed of the rotation
dict_align (dict) – the dict describing the alignment

Returns:	axis of rotation (list(float)) and angle of rotation (float)
Return type:	list(float), float

_add_aug_pdb(molgrp, pdbfile, name, axis, angle)[source]¶

Add augmented pdbs to the dataset.

Parameters:

molgrp (str) – name of the molgroup
pdbfile (str) – pdb file name
name (str) – name of the dataset
axis (list(float)) – axis of rotation
angle (float) – angle of rotation
dict_align (dict) – dict for alignment of the original pdb
Returns:	center of the molecule
Return type:	list(float)

static _rotate_feature(molgrp, axis, angle, center, feat_name='all')[source]¶

Rotate the raw feature values.

Parameters:

molgrp (str) – name of the molgrp
axis (list(float)) – axis of rotation
angle (float) – angle of rotation
center (list(float)) – center of rotation
feat_name (str) – name of the feature to rotate or ‘all’
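The augmentation methods above rotate a complex, and its raw xyz features, by an angle about an axis passing through a center. As an illustration only, the sketch below draws a seeded random axis and angle and applies the rotation with Rodrigues' formula. The helper names random_axis_angle and rotate are hypothetical, and the sampling scheme is an assumption rather than the one used by DataGenerator.

```python
import math
import random


def random_axis_angle(seed=None):
    """Draw a unit rotation axis and an angle in [0, 2*pi] from a seeded RNG."""
    rng = random.Random(seed)
    # Uniform point on the unit sphere: uniform height z and uniform azimuth.
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    axis = [s * math.cos(phi), s * math.sin(phi), z]
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return axis, angle


def rotate(point, axis, angle, center):
    """Rotate `point` by `angle` (radians) about the unit `axis` through `center`."""
    v = [p - c for p, c in zip(point, center)]
    kx, ky, kz = axis
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Cross product (axis x v) and dot product (axis . v).
    cross = [ky * v[2] - kz * v[1], kz * v[0] - kx * v[2], kx * v[1] - ky * v[0]]
    dot = kx * v[0] + ky * v[1] + kz * v[2]
    # Rodrigues' rotation formula applied component by component.
    rotated = [
        v[i] * cos_a + cross[i] * sin_a + axis[i] * dot * (1.0 - cos_a)
        for i in range(3)
    ]
    return [r + c for r, c in zip(rotated, center)]


axis, angle = random_axis_angle(seed=3)
quarter_turn = rotate([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], math.pi / 2, [0.0, 0.0, 0.0])
```

Using the same seed yields the same axis and angle, so the atom positions and the feature positions of one augmented copy can be rotated consistently.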

NormalizeData¶

class deeprank.generate.NormalizeData.NormalizeData(fname, shape=None)[source]¶

Compute the normalization factor for the features and targets of a given HDF5 file.

The normalization of the features is done through the NormParam class, which assumes a Gaussian distribution. Hence the normalized data should be normally distributed with zero mean and unit standard deviation. The normalization of the targets is done via a min/max normalization. As a result, the normalized targets should all lie between 0 and 1. By default, the output file containing the normalization dictionary is called <hdf5name>_norm.pckl.

Parameters:

fname (str) – name of the hdf5 file
shape (tuple(int), optional) – shape of the grid in the hdf5 file

Example

>>> norm = NormalizeData('1ak4.hdf5')
>>> norm.get()

get()[source]¶: Get the normalization and write them to file.

_load()[source]¶: Load data from already existing normalization file.

_extract_shape()[source]¶: Get the shape of the data in the hdf5 file.

_extract_data()[source]¶: Extract the data from the different maps.

_process_data()[source]¶: Compute the standard deviation of the data.

_export_data()[source]¶: Pickle the data to file.

class deeprank.generate.NormalizeData.NormParam(std=0, mean=0, var=0, sqmean=0)[source]¶

Compute gaussian normalization for a given feature.

This class extracts the standard deviation, mean value, variance and squared mean value of a mapped feature stored in the hdf5 file. As the entire data set is too large to fit in memory, the standard deviation of a given feature is calculated from the std of all the individual grids. This is done following https://stats.stackexchange.com/questions/25848/how-to-sum-a-standard-deviation:

\[\sigma_{tot}=\sqrt{\frac{1}{N}\sum_i \sigma_i^2+\frac{1}{N}\sum_i\mu_i^2-(\frac{1}{N}\sum_i\mu_i)^2}\]

Parameters:

std (float, optional) – standard deviation
mean (float, optional) – mean value
var (float, optional) – variance
sqmean (float, optional) – square of the mean value

add(mean, var)[source]¶: Add the mean value, sqmean and variance of a new molecule to the corresponding attributes.

process(n)[source]¶: Compute the standard deviation of the ensemble.
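The formula above can be checked numerically. The sketch below is a hypothetical stand-in for NormParam (the class name RunningNorm and its internals are assumptions, not the deeprank code): it accumulates the per-grid mean, variance and squared mean, then combines them as in the formula. The identity holds when all grids contain the same number of points and the per-grid variances are population variances.

```python
import math
import statistics


class RunningNorm:
    """Hypothetical accumulator of per-grid statistics for one feature."""

    def __init__(self):
        self.mean = 0.0    # running sum of the per-grid means
        self.var = 0.0     # running sum of the per-grid variances
        self.sqmean = 0.0  # running sum of the squared per-grid means
        self.std = 0.0

    def add(self, mean, var):
        """Add the statistics of one grid."""
        self.mean += mean
        self.var += var
        self.sqmean += mean ** 2

    def process(self, n):
        """Turn the sums into averages over n grids and compute the total std."""
        self.mean /= n
        self.var /= n
        self.sqmean /= n
        self.std = math.sqrt(self.var + self.sqmean - self.mean ** 2)


# Three toy grids of equal size, flattened to lists of values.
grids = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.0, 0.0, 1.0, 1.0]]
norm = RunningNorm()
for grid in grids:
    norm.add(statistics.fmean(grid), statistics.pvariance(grid))
norm.process(len(grids))

# Reference value computed on the concatenated data.
flat = [value for grid in grids for value in grid]
reference_std = statistics.pstdev(flat)
```

Only three numbers per feature need to be kept in memory, regardless of how many grids are visited.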

class deeprank.generate.NormalizeData.MinMaxParam(minv=None, maxv=None)[source]¶

Compute the min/max of an ensemble of data.

This is principally used to normalize the target values.

Parameters:

minv (float, optional) – minimal value
maxv (float, optional) – maximal value

update(val)[source]¶
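As a minimal sketch of the min/max normalization described above, the hypothetical class below tracks the extreme values of a target one value at a time and then rescales a value to the interval [0, 1]. The class name MinMax and its scale method are illustrative assumptions; they are not part of the documented MinMaxParam interface.

```python
class MinMax:
    """Hypothetical running min/max tracker for one target."""

    def __init__(self, minv=None, maxv=None):
        self.min = minv
        self.max = maxv

    def update(self, val):
        """Widen the [min, max] interval so that it contains val."""
        self.min = val if self.min is None else min(self.min, val)
        self.max = val if self.max is None else max(self.max, val)

    def scale(self, val):
        """Rescale val to [0, 1]; a degenerate interval maps to 0."""
        span = self.max - self.min
        return 0.0 if span == 0 else (val - self.min) / span


targets = [0.2, 0.9, 0.5]
tracker = MinMax()
for t in targets:
    tracker.update(t)
scaled = [tracker.scale(t) for t in targets]
```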

GridTools¶

deeprank.generate.GridTools.logif(string, cond)[source]¶

class deeprank.generate.GridTools.GridTools(molgrp, chain1, chain2, number_of_points=30, resolution=1.0, atomic_densities=None, atomic_densities_mode='ind', feature=None, feature_mode='ind', contact_distance=8.5, cuda=False, gpu_block=None, cuda_func=None, cuda_atomic=None, prog_bar=False, time=False, try_sparse=True)[source]¶

Map the feature of a complex on the grid.

Parameters:

molgrp (str) – name of the group of the molecule in the HDF5 file.
chain1 (str) – First chain ID.
chain2 (str) – Second chain ID.
number_of_points (int, optional) – number of points we want in each direction of the grid.
resolution (float, optional) – distance (in Å) between two points.
atomic_densities (dict, optional) – dictionary of element types with their vdw radius, see deeprank.config.atom_vdw_radius_noH
atomic_densities_mode (str, optional) – Mode for mapping (deprecated, must be ‘ind’).
feature (None, optional) – Name of the features to be mapped. By default all the features present in hdf5_file['<molgrp>/features/'] will be mapped.
feature_mode (str, optional) – Mode for mapping (deprecated, must be ‘ind’).
contact_distance (float, optional) – the maximum distance between two contact atoms (default 8.5 Å).
cuda (bool, optional) – Use CUDA or not.
gpu_block (tuple(int), optional) – GPU block size to use.
cuda_func (None, optional) – Name of the CUDA function to be used for the mapping of the features. Must be present in kernel_cuda.c.
cuda_atomic (None, optional) – Name of the CUDA function to be used for the mapping of the atomic densities. Must be present in kernel_cuda.c.
prog_bar (bool, optional) – print progression bar for individual grid (default False).
time (bool, optional) – print timing statistic for individual grid (default False).
try_sparse (bool, optional) – Try to store the matrix in sparse format (default True).

create_new_data()[source]¶: Create new feature for a given complex.

update_feature()[source]¶: Update existing feature in a complex.

read_pdb()[source]¶: Create a sql database for the pdb.

get_contact_center()[source]¶: Get the center of contact atoms.

add_all_features()[source]¶: Add all the features to a given molecule.

add_all_atomic_densities()[source]¶: Add all atomic densities.

define_grid_points()[source]¶: Define the grid points.
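To illustrate how the contact center and the grid points relate, here is a rough sketch under stated assumptions: contact atoms are taken to be atoms of one chain lying within contact_distance of an atom of the other chain, and the grid is taken to be symmetric about the contact center. The functions contact_center and grid_axis are hypothetical and may differ from what GridTools actually does.

```python
import math


def contact_center(chain1_xyz, chain2_xyz, cutoff=8.5):
    """Mean position of the atoms that have a partner in the other chain within cutoff."""
    contact = set()
    for i, a in enumerate(chain1_xyz):
        for j, b in enumerate(chain2_xyz):
            if math.dist(a, b) <= cutoff:
                contact.add((0, i))
                contact.add((1, j))
    if not contact:
        raise ValueError("no contact atoms within the cutoff")
    chains = (chain1_xyz, chain2_xyz)
    atoms = [chains[c][k] for c, k in sorted(contact)]
    return [sum(atom[d] for atom in atoms) / len(atoms) for d in range(3)]


def grid_axis(center, npts, res):
    """Coordinates of npts points spaced by res and centred on center (assumed layout)."""
    start = center - 0.5 * res * (npts - 1)
    return [start + k * res for k in range(npts)]


# Toy coordinates: only the first atom of each chain is within 8.5 A of the other chain.
chain1 = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
chain2 = [(4.0, 0.0, 0.0), (0.0, 30.0, 0.0)]
center = contact_center(chain1, chain2, cutoff=8.5)

# One axis of a grid with number_of_points=30 and resolution=1.0.
x = grid_axis(center[0], npts=30, res=1.0)
```

With 30 points and a resolution of 1 Å the grid spans 29 Å along each direction, so it only covers the neighbourhood of the interface.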

map_atomic_densities(only_contact=True)[source]¶

Map the atomic densities to the grid.

Parameters:	only_contact (bool, optional) – Map only the contact atoms
Raises:	`ImportError` – Description

densgrid(center, vdw_radius)[source]¶

Function to map individual atomic density on the grid.

The formula is equation (1) of the Koes paper Protein-Ligand Scoring with Convolutional NN Arxiv:1612.02751v1

Parameters:

center (list(float)) – position of the atoms
vdw_radius (float) – vdw radius of the atom

Returns:	mapped density
Return type:	np.array
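For reference, the sketch below evaluates an atomic density as a function of the distance d to the atom center and the vdw radius r. It assumes the piecewise form published in the cited paper: a Gaussian up to d = r, a quadratic tail up to d = 1.5 r, and zero beyond. The function name atomic_density is hypothetical, and the exact expression used by densgrid should be checked against the source.

```python
import math


def atomic_density(d, r):
    """Density at distance d from an atom of vdw radius r (assumed piecewise form)."""
    if d < 0 or r <= 0:
        raise ValueError("d must be non-negative and r must be positive")
    if d < r:
        # Gaussian core.
        return math.exp(-2.0 * d * d / (r * r))
    if d < 1.5 * r:
        # Quadratic tail, continuous with the Gaussian at d = r and zero at d = 1.5 r.
        e2 = math.exp(2.0)
        return 4.0 * d * d / (e2 * r * r) - 12.0 * d / (e2 * r) + 9.0 / e2
    return 0.0


radius = 1.7  # vdw radius used for carbon in the examples of this page
at_center = atomic_density(0.0, radius)
at_radius = atomic_density(radius, radius)
beyond = atomic_density(1.5 * radius, radius)
```

The finite support means each atom only contributes to grid points closer than 1.5 times its radius, which keeps the mapped grids sparse.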

map_features(featlist, transform=None)[source]¶

Map individual feature to the grid.

For residue-based features, the feature file must be of the format chainID residue_name(3-letter) residue_number [values]

For atom-based features, it must be chainID residue_name(3-letter) residue_number atom_name [values]

Parameters:

featlist (list(str)) – list of features to be mapped
transform (callable, optional) – transformation of the feature
Returns:	Mapped features
Return type:	np.array
Raises:	`ImportError` – Description `ValueError` – Description
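The two line formats described above can be parsed as follows. This is a minimal sketch that assumes whitespace-separated fields; the function name parse_feature_line and the dictionary keys are hypothetical and are not taken from GridTools.

```python
def parse_feature_line(line, atom_based=False):
    """Split one feature line into an identifying key and a list of values."""
    fields = line.split()
    n_key = 4 if atom_based else 3
    if len(fields) <= n_key:
        raise ValueError("expected %d key fields followed by at least one value" % n_key)
    key = {"chainID": fields[0], "resName": fields[1], "resSeq": int(fields[2])}
    if atom_based:
        key["name"] = fields[3]
    values = [float(x) for x in fields[n_key:]]
    return key, values


residue_key, residue_values = parse_feature_line("A VAL 12 0.5 1.25")
atom_key, atom_values = parse_feature_line("B LEU 7 CA -0.3", atom_based=True)
```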

featgrid(center, value, type_='fast_gaussian')[source]¶

Map an individual feature (atomic or residue) on the grid.

Parameters:

center (list(float)) – position of the feature center
value (float) – value of the feature
type_ (str, optional) – method to map
Returns:	Mapped feature
Return type:	np.array
Raises:	`ValueError` – Description

export_grid_points()[source]¶: export the grid points to the hdf5 file.

hdf5_grid_data(dict_data, data_name)[source]¶

Save the mapped feature to the hdf5 file.

Parameters:

dict_data (dict) – feature values stored as a dict
data_name (str) – feature name