Data Generation
This module contains all the tools to compute the features and targets and to map the features onto a grid of points. The main class used for data generation is deeprank.generate.DataGenerator. Through this class you can specify the molecules to consider, the features and targets to compute, and the way the features are mapped onto the grid. The data is stored in a single HDF5 file. In this file, each conformation has its own group that contains all the information related to that conformation: the PDB data, the feature values (in human-readable format and in xyz-val format), the target values, the grid points and the features mapped on the grid.
At the moment, a number of features are already implemented. These include:
- Atomic densities
- Coulomb & van der Waals interactions
- Atomic charges
- PSSM data
- Information content
- Buried surface area
- Contact Residue Densities
More features can easily be implemented and integrated into the data generation workflow. You can see an example here. The calculation of a number of target values has also been implemented:
- i-RMSD
- l-RMSD
- FNAT
- DockQ
- binary class
Here as well, new targets can be implemented and integrated into the workflow.
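To make the relation between the listed targets concrete, here is a minimal sketch of how DockQ is commonly defined from FNAT, i-RMSD and l-RMSD. The function name `dockq_score` and the scaling distances of 1.5 Å and 8.5 Å follow the published DockQ definition; whether `deeprank.targets.dockQ` computes the value in exactly this way is an assumption.

```python
def dockq_score(fnat, irmsd, lrmsd, d_irmsd=1.5, d_lrmsd=8.5):
    """Combine FNAT, i-RMSD and l-RMSD (both in Angstrom) into a score in [0, 1]."""

    def scaled(rmsd, d):
        # Maps an RMSD in [0, inf) onto (0, 1]; 1.0 means a perfect match.
        return 1.0 / (1.0 + (rmsd / d) ** 2)

    return (fnat + scaled(irmsd, d_irmsd) + scaled(lrmsd, d_lrmsd)) / 3.0


# A decoy identical to the native complex gets the maximum score.
perfect = dockq_score(fnat=1.0, irmsd=0.0, lrmsd=0.0)
# A decoy with no native contacts, sitting exactly at both scaling distances.
poor = dockq_score(fnat=0.0, irmsd=1.5, lrmsd=8.5)
print(perfect, poor)
```

Because each RMSD term is squashed into (0, 1], the three ingredients contribute on the same scale, which is why DockQ is convenient as a single regression target.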
Normalization of the data can be time-consuming as the dataset becomes large. To alleviate this problem, the class deeprank.generate.NormalizeData has been created. This class directly computes and stores the standard deviation and mean value of each feature within a given HDF5 file.
Example:
>>> from deeprank.generate import *
>>> from time import time
>>>
>>> pdb_source = ['./1AK4/decoys/']
>>> pdb_native = ['./1AK4/native/']
>>> pssm_source = './1AK4/pssm_new/'
>>> h5file = '1ak4.hdf5'
>>>
>>> #init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D',
...                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
...                          data_augmentation=None,
...                          compute_targets=['deeprank.targets.dockQ'],
...                          compute_features=['deeprank.features.AtomicFeature',
...                                            'deeprank.features.FullPSSM',
...                                            'deeprank.features.PSSM_IC',
...                                            'deeprank.features.BSA'],
...                          hdf5=h5file)
>>>
>>> t0 = time()
>>> #create new files
>>> database.create_database(prog_bar=True)
>>>
>>> # map the features
>>> grid_info = {
...     'number_of_points': [30, 30, 30],
...     'resolution': [1., 1., 1.],
...     'atomic_densities': {'CA': 3.5, 'N': 3.5, 'O': 3.5, 'C': 3.5},
... }
>>> database.map_features(grid_info,try_sparse=True,time=False,prog_bar=True)
>>>
>>> # add a new target
>>> database.add_target(prog_bar=True)
>>> print(' '*25 + '--> Done in %f s.' %(time()-t0))
>>>
>>> # get the normalization
>>> norm = NormalizeData(h5file)
>>> norm.get()
The details of the different submodules are listed below. The only classes that really need to be used are DataGenerator and NormalizeData. The GridTools class should not be used directly by inexperienced users.
Structure Alignment
All the complexes contained in the dataset can be aligned in the same way to facilitate and improve the training of the model. This is done with the align option of the DataGenerator. For example, to align all the complexes along the 'z' direction, use:
>>> database = DataGenerator(chain1='C', chain2='D',
...                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
...                          align={"axis": 'z'}, data_augmentation=2,
...                          compute_targets=[ ... ], compute_features=[ ... ], ... )
Other options are possible. For example, to perform the alignment using only a subpart of the complex, say chains A and B, use:
>>> database = DataGenerator(chain1='C', chain2='D',
...                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
...                          align={"axis": 'z', "selection": {"chainID": ["A", "B"]}}, data_augmentation=2,
...                          compute_targets=[ ... ], compute_features=[ ... ], ... )
All the selections offered by pdb2sql can be used in the align dictionary, e.g. "resId": [1, 2, 3], "resName": ['VAL', 'LEU'], … Only the selected atoms will be aligned in the given direction.
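As an illustration of how such a selection dictionary narrows down the atoms, the sketch below filters a toy atom list. It assumes that every key must match and that a list value matches any of its items; `select_atoms` and the toy records are hypothetical and are not part of pdb2sql or deeprank.

```python
def select_atoms(atoms, **selection):
    """Return the atoms that satisfy every key of the selection.

    A list value is satisfied when the atom matches any of its items.
    """

    def matches(atom):
        for key, wanted in selection.items():
            allowed = wanted if isinstance(wanted, (list, tuple)) else [wanted]
            if atom[key] not in allowed:
                return False
        return True

    return [atom for atom in atoms if matches(atom)]


# Toy records standing in for the rows of a parsed PDB file.
atoms = [
    {"chainID": "A", "resName": "VAL"},
    {"chainID": "B", "resName": "GLY"},
    {"chainID": "C", "resName": "LEU"},
]

# Same shape as the "selection" entry of the align dictionary.
picked = select_atoms(atoms, chainID=["A", "B"], resName=["VAL", "LEU"])
print(picked)
```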
You can also try to align the interface between two chains in a given plane. This can be done using:
>>> database = DataGenerator(chain1='C', chain2='D',
...                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
...                          align={"plane": 'xy', "selection": "interface"}, data_augmentation=2,
...                          compute_targets=[ ... ], compute_features=[ ... ], ... )
By default, this uses the interface between the first two chains. If the complex has more than two chains and you want to specify which chains form the interface to be aligned, use:
>>> database = DataGenerator(chain1='C', chain2='D',
...                          pdb_source=pdb_source, pdb_native=pdb_native, pssm_source=pssm_source,
...                          align={"plane": 'xy', "selection": "interface", "chain1": 'A', "chain2": 'C'}, data_augmentation=2,
...                          compute_targets=[ ... ], compute_features=[ ... ], ... )
DataGenerator
class deeprank.generate.DataGenerator.DataGenerator(chain1, chain2, pdb_select=None, pdb_source=None, pdb_native=None, pssm_source=None, align=None, compute_targets=None, compute_features=None, data_augmentation=None, hdf5='database.h5', mpi_comm=None)
Bases: object
Generate the data (features/targets/maps) required for deeprank.
Parameters:
- chain1 (str) – First chain ID
- chain2 (str) – Second chain ID
- pdb_select (list(str), optional) – List of individual conformations for mapping
- pdb_source (list(str), optional) – List of folders where to find the PDBs for mapping
- pdb_native (list(str), optional) – List of folders where to find the native conformations; must be set if "compute_targets" lists targets to compute
- pssm_source (list(str), optional) – List of folders where to find the PSSM files
- align (dict, optional) – Dictionary to align the complexes, e.g. align = {"selection": {"chainID": ["A", "B"]}, "axis": "z"} or align = {"selection": "interface", "plane": "xy"}. If "selection" is not specified, the entire complex is used for alignment
- compute_targets (list(str), optional) – List of Python files computing the targets; "pdb_native" must be set if there are targets to compute
- compute_features (list(str), optional) – List of Python files computing the features
- data_augmentation (int, optional) – Number of rotations performed on each complex
- hdf5 (str, optional) – Name of the HDF5 file where the data is saved, defaults to 'database.h5'
- mpi_comm (MPI_COMM) – MPI communicator
Raises: NotADirectoryError – if the sources are not found
Example:
>>> from deeprank.generate import *
>>> # sources to assemble the data base
>>> pdb_source = ['./1AK4/decoys/']
>>> pdb_native = ['./1AK4/native/']
>>> pssm_source = ['./1AK4/pssm_new/']
>>> h5file = '1ak4.hdf5'
>>>
>>> # init the data assembler
>>> database = DataGenerator(chain1='C',
...                          chain2='D',
...                          pdb_source=pdb_source,
...                          pdb_native=pdb_native,
...                          pssm_source=pssm_source,
...                          data_augmentation=None,
...                          compute_targets=['deeprank.targets.dockQ'],
...                          compute_features=['deeprank.features.AtomicFeature',
...                                            'deeprank.features.PSSM_IC',
...                                            'deeprank.features.BSA'],
...                          hdf5=h5file)
create_database(verbose=False, remove_error=True, prog_bar=False, contact_distance=8.5, random_seed=None)
Create the hdf5 file architecture and compute the features/targets.
Raises: ValueError – If creation of the group errored.
Example:
>>> # sources to assemble the data base
>>> pdb_source = ['./1AK4/decoys/']
>>> pdb_native = ['./1AK4/native/']
>>> pssm_source = ['./1AK4/pssm_new/']
>>> h5file = '1ak4.hdf5'
>>>
>>> # init the data assembler
>>> database = DataGenerator(chain1='C',
...                          chain2='D',
...                          pdb_source=pdb_source,
...                          pdb_native=pdb_native,
...                          pssm_source=pssm_source,
...                          data_augmentation=None,
...                          compute_targets=['deeprank.targets.dockQ'],
...                          compute_features=['deeprank.features.AtomicFeature',
...                                            'deeprank.features.PSSM_IC',
...                                            'deeprank.features.BSA'],
...                          hdf5=h5file)
>>>
>>> # create new files
>>> database.create_database(prog_bar=True)
aug_data(augmentation, keep_existing_aug=True, random_seed=None)
Augment existing original PDB data and features.
Example:
>>> database = DataGenerator(chain1='C', chain2='D', hdf5='database.h5')
>>> database.aug_data(augmentation=3, keep_existing_aug=True)
>>> grid_info = {
...     'number_of_points': [30, 30, 30],
...     'resolution': [1., 1., 1.],
...     'atomic_densities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
... }
>>> database.map_features(grid_info)
add_feature(remove_error=True, prog_bar=True)
Add a feature to an existing hdf5 file.
Example:
>>> h5file = '1ak4.hdf5'
>>>
>>> # init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D',
...                          compute_features=['deeprank.features.ResidueDensity'],
...                          hdf5=h5file)
>>>
>>> database.add_feature(remove_error=True, prog_bar=True)
add_unique_target(targdict)
Add identical targets for all the complexes in the datafile.
This is useful if you want to add the binary class of all the complexes created from decoys or natives.
Parameters: targdict (dict) – Example: {'DOCKQ': 1.0}
Example:
>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>> database.add_unique_target({'DOCKQ': 1.0})
add_target(prog_bar=False)
Add a target to an existing hdf5 file.
Parameters: prog_bar (bool, optional) – Use tqdm
Example:
>>> h5file = '1ak4.hdf5'
>>>
>>> # init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D',
...                          compute_targets=['deeprank.targets.binary_class'],
...                          hdf5=h5file)
>>>
>>> database.add_target(prog_bar=True)
realign_complexes(align, compute_features=None, pssm_source=None)
Align all the complexes already present in the HDF5.
Parameters:
- align (dict) – alignment dictionary
- compute_features (list, optional) – list of features to be computed; if None, the features specified in attrs['features'] of the file are computed (if present)
- pssm_source (str, optional) – path of the PSSM files; if None, the source specified in attrs['pssm_source'] is used (if present) (default: None)
Raises: ValueError – If no PSSM detected
Example:
>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>> # if compute_features and pssm_source are not specified
>>> # the values in hdf5.attrs['features'] and hdf5.attrs['pssm_source'] will be used
>>> database.realign_complexes(align={'axis': 'x'},
...                            compute_features=['deeprank.features.X'],
...                            pssm_source='./1ak4_pssm/')
precompute_grid(grid_info, contact_distance=8.5, prog_bar=False, time=False, try_sparse=True)
map_features(grid_info={}, cuda=False, gpu_block=None, cuda_kernel='kernel_map.c', cuda_func_name='gaussian', try_sparse=True, reset=False, use_tmpdir=False, time=False, prog_bar=True, grid_prog_bar=False, remove_error=True)
Map the features on a grid of points centered at the interface.
If the features to map are not given, they are automatically determined for each molecule. Otherwise, the given features are mapped for all molecules (i.e. existing mapped features are recalculated).
Parameters:
- grid_info (dict) – Information for the grid. See deeprank.generate.GridTools for details.
- cuda (bool, optional) – Use CUDA
- gpu_block (None, optional) – GPU block size to be used
- cuda_kernel (str, optional) – filename containing CUDA kernel
- cuda_func_name (str, optional) – The name of the function in the kernel
- try_sparse (bool, optional) – Try to save the grids as sparse format
- reset (bool, optional) – remove grids if some are already present
- use_tmpdir (bool, optional) – use a scratch directory
- time (bool, optional) – time the mapping process
- prog_bar (bool, optional) – use tqdm for each molecule
- grid_prog_bar (bool, optional) – use tqdm for each grid
- remove_error (bool, optional) – remove the data that errored
Example:
>>> # init the data assembler
>>> database = DataGenerator(chain1='C', chain2='D', hdf5='1ak4.hdf5')
>>>
>>> # map the features
>>> grid_info = {
...     'number_of_points': [30, 30, 30],
...     'resolution': [1., 1., 1.],
...     'atomic_densities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
... }
>>>
>>> database.map_features(grid_info, try_sparse=True, time=False, prog_bar=True)
remove(feature=True, pdb=True, points=True, grid=False)
Remove data from the data set.
Equivalent to the cleandata command line tool. Once the data has been removed from the file, it is impossible to add new features/targets.
_tune_cuda_kernel(grid_info, cuda_kernel='kernel_map.c', func='gaussian')
Tune the CUDA kernel using the kernel tuner http://benvanwerkhoven.github.io/kernel_tuner/
Raises: ValueError – If the tuner has not been used
_test_cuda(grid_info, gpu_block=8, cuda_kernel='kernel_map.c', func='gaussian')
Test the CUDA kernel.
Raises: ValueError – If the kernel has not been installed
static _compile_cuda_kernel(cuda_kernel, npts, res)
Compile the CUDA kernel.
Returns: compiled kernel
Return type: compiler.SourceModule
static _get_cuda_function(module, func_name)
Get a single function from the compiled kernel.
Parameters:
- module (compiler.SourceModule) – compiled kernel module
- func_name (str) – Name of the function
Returns: cuda function
Return type: func
static _tunable_kernel(kernel)
Make a tunable kernel.
Parameters: kernel (str) – String of the kernel
Returns: tunable kernel
static _compute_features(feat_list, pdb_data, featgrp, featgrp_raw, chain1, chain2, logger)
Compute the features.
Parameters:
- feat_list (list(str)) – list of function names, e.g. ['deeprank.features.ResidueDensity', 'deeprank.features.PSSM_IC']
- pdb_data (bytes) – PDB translated in bytes
- featgrp (str) – name of the group where to store the xyz feature
- featgrp_raw (str) – name of the group where to store the raw feature
- chain1 (str) – First chain ID
- chain2 (str) – Second chain ID
- logger (logger) – name of logger object
Returns: whether an error happened or not
_get_aligned_sqldb(pdbfile, dict_align)
Return a sqldb of the PDB that is aligned as specified in the dict.
Parameters:
- pdbfile (str) – path to the PDB
- dict_align (dict) – dictionary of options to align the PDB
static _get_aligned_rotation_axis_angle(random_seed, dict_align)
Return the axis and angle of rotation for data augmentation with aligned complexes.
Parameters:
- random_seed (int) – random seed of rotation
- dict_align (dict) – the dict describing the alignment
Returns: the axis of rotation and the angle of rotation (float)
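The sketch below illustrates one way an augmentation rotation can preserve an alignment: the rotation axis is the alignment axis (or the normal of the alignment plane) and only the angle is random. The helper names `aligned_axis_angle` and `rotate` are hypothetical, and the choice of axis is an assumption about the intent of this method rather than a copy of deeprank's implementation.

```python
import math
import random

# Unit vectors of the Cartesian axes.
UNIT = {"x": (1.0, 0.0, 0.0), "y": (0.0, 1.0, 0.0), "z": (0.0, 0.0, 1.0)}


def aligned_axis_angle(random_seed, dict_align):
    """Pick a rotation that leaves the alignment axis (or plane) unchanged."""
    rng = random.Random(random_seed)
    if "axis" in dict_align:
        letter = dict_align["axis"]
    else:
        # The normal of plane 'xy' is 'z', and so on.
        (letter,) = set("xyz") - set(dict_align["plane"])
    return UNIT[letter], rng.uniform(0.0, 2.0 * math.pi)


def rotate(point, center, axis, angle):
    """Rotate a point around a unit axis through center (Rodrigues' formula)."""
    v = [p - c for p, c in zip(point, center)]
    kx, ky, kz = axis
    cross = (ky * v[2] - kz * v[1], kz * v[0] - kx * v[2], kx * v[1] - ky * v[0])
    dot = kx * v[0] + ky * v[1] + kz * v[2]
    c, s = math.cos(angle), math.sin(angle)
    return tuple(
        v[i] * c + cross[i] * s + axis[i] * dot * (1.0 - c) + center[i]
        for i in range(3)
    )


axis, angle = aligned_axis_angle(0, {"axis": "z"})
moved = rotate((1.0, 2.0, 3.0), (0.0, 0.0, 0.0), axis, angle)
print(axis, angle, moved)
```

With the axis fixed to 'z', the z coordinate of every atom is unchanged by the rotation, so the augmented copies stay aligned along that direction.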
_add_aug_pdb(molgrp, pdbfile, name, axis, angle)
Add augmented PDBs to the dataset.
Returns: center of the molecule
NormalizeData
class deeprank.generate.NormalizeData.NormalizeData(fname, shape=None)
Compute the normalization factor for the features and targets of a given HDF5 file.
The normalization of the features is done through the NormParam class, which assumes a Gaussian distribution. Hence the normalized data should be normally distributed with a mean of 0 and a standard deviation of 1. The normalization of the targets is done via min/max normalization. As a result, the normalized targets should all lie between 0 and 1. By default, the output file containing the normalization dictionary is called <hdf5name>_norm.pckl
Example:
>>> norm = NormalizeData('1ak4.hdf5')
>>> norm.get()
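The two normalization schemes described above can be sketched in a few lines. This is only an illustration of the arithmetic, using the standard library; the helper names `standardize` and `min_max` are hypothetical and are not part of deeprank.

```python
import statistics


def standardize(values, mean, std):
    """Gaussian normalization: shift to zero mean, scale to unit standard deviation."""
    return [(v - mean) / std for v in values]


def min_max(values, vmin, vmax):
    """Min/max normalization: map [vmin, vmax] linearly onto [0, 1]."""
    return [(v - vmin) / (vmax - vmin) for v in values]


# Toy feature and target values standing in for data read from the HDF5 file.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
target = [0.1, 0.4, 0.9]

norm_feature = standardize(feature, statistics.mean(feature), statistics.pstdev(feature))
norm_target = min_max(target, min(target), max(target))
print(norm_feature, norm_target)
```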
class deeprank.generate.NormalizeData.NormParam(std=0, mean=0, var=0, sqmean=0)
Compute Gaussian normalization for a given feature.
This class extracts the standard deviation, mean value, variance and square of the mean value of a mapped feature stored in the hdf5 file. As the entire data set is too large to fit in memory, the standard deviation of a given feature is calculated from the standard deviations of all the individual grids, following https://stats.stackexchange.com/questions/25848/how-to-sum-a-standard-deviation:
\[\sigma_{tot}=\sqrt{\frac{1}{N}\sum_i \sigma_i^2+\frac{1}{N}\sum_i\mu_i^2-\left(\frac{1}{N}\sum_i\mu_i\right)^2}\]
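The formula can be checked on a small example. The sketch below, which uses only the standard library and a hypothetical helper `combined_std`, computes the total standard deviation from per-grid means and variances and compares it with the value obtained from the pooled data. The two agree exactly when all grids hold the same number of points, which is the assumption made here.

```python
import math
import statistics


def combined_std(grids):
    """Total standard deviation from per-grid statistics (equal-size grids)."""
    n = len(grids)
    means = [statistics.mean(g) for g in grids]
    variances = [statistics.pvariance(g) for g in grids]
    mean_of_var = sum(variances) / n                 # (1/N) sum_i sigma_i^2
    mean_of_sqmean = sum(m * m for m in means) / n   # (1/N) sum_i mu_i^2
    mean_tot = sum(means) / n                        # (1/N) sum_i mu_i
    return math.sqrt(mean_of_var + mean_of_sqmean - mean_tot ** 2)


# Three toy "grids" with the same number of points each.
grids = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 6.0, 6.0], [0.0, 5.0, 5.0, 10.0]]
pooled = [v for g in grids for v in g]

from_parts = combined_std(grids)
from_pooled = statistics.pstdev(pooled)
print(from_parts, from_pooled)
```

Only three running sums per feature are needed, so the grids can be read one at a time without holding the whole data set in memory.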
GridTools
class deeprank.generate.GridTools.GridTools(molgrp, chain1, chain2, number_of_points=30, resolution=1.0, atomic_densities=None, atomic_densities_mode='ind', feature=None, feature_mode='ind', contact_distance=8.5, cuda=False, gpu_block=None, cuda_func=None, cuda_atomic=None, prog_bar=False, time=False, try_sparse=True)
Map the feature of a complex on the grid.
Parameters:
- molgrp (str) – name of the group of the molecule in the HDF5 file.
- chain1 (str) – First chain ID.
- chain2 (str) – Second chain ID.
- number_of_points (int, optional) – number of points we want in each direction of the grid.
- resolution (float, optional) – distance (in Å) between two points.
- atomic_densities (dict, optional) – dictionary of element types with their vdW radius, see deeprank.config.atom_vdw_radius_noH
- atomic_densities_mode (str, optional) – Mode for mapping (deprecated, must be 'ind').
- feature (None, optional) – Name of the features to be mapped. By default all the features present in hdf5_file['<molgrp>/features/'] will be mapped.
- feature_mode (str, optional) – Mode for mapping (deprecated, must be 'ind').
- contact_distance (float, optional) – the maximum distance between two contact atoms, default 8.5 Å.
- cuda (bool, optional) – Use CUDA or not.
- gpu_block (tuple(int), optional) – GPU block size to use.
- cuda_func (None, optional) – Name of the CUDA function to be used for the mapping of the features. Must be present in kernel_cuda.c.
- cuda_atomic (None, optional) – Name of the CUDA function to be used for the mapping of the atomic densities. Must be present in kernel_cuda.c.
- prog_bar (bool, optional) – print progression bar for individual grid (default False).
- time (bool, optional) – print timing statistic for individual grid (default False).
- try_sparse (bool, optional) – Try to store the matrix in sparse format (default True).
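To show how number_of_points and resolution together define the extent of the grid, here is a minimal sketch that builds the axis coordinates of a grid centered on a given point. The symmetric placement around the center and the helper names `grid_axis` and `build_grid` are assumptions made for illustration; GridTools may place its points differently.

```python
def grid_axis(center, number_of_points, resolution):
    """Evenly spaced coordinates along one axis, symmetric around center."""
    half_span = (number_of_points - 1) * resolution / 2.0
    return [center - half_span + i * resolution for i in range(number_of_points)]


def build_grid(center, number_of_points=(30, 30, 30), resolution=(1.0, 1.0, 1.0)):
    """Return the x, y and z coordinate lists of a grid centered on center."""
    return tuple(
        grid_axis(c, n, r) for c, n, r in zip(center, number_of_points, resolution)
    )


# 30 points spaced 1 Angstrom apart span 29 Angstrom along each direction.
x, y, z = build_grid(center=(0.0, 0.0, 0.0))
print(len(x), x[0], x[-1])
```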
map_atomic_densities(only_contact=True)
Map the atomic densities to the grid.
Parameters: only_contact (bool, optional) – Map only the contact atoms
Raises: ImportError
densgrid(center, vdw_radius)
Map an individual atomic density on the grid.
The formula is equation (1) of the Koes paper, Protein-Ligand Scoring with Convolutional NN, arXiv:1612.02751v1
Returns: mapped density
Return type: np.array
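For reference, the density function of that paper, as I recall it, is a Gaussian up to the van der Waals radius r, followed by a quadratic tail that reaches zero at 1.5 r. The sketch below evaluates it for a single distance; the function name `atomic_density` is hypothetical, and whether densgrid matches this form term for term is an assumption.

```python
import math


def atomic_density(d, r):
    """Density contribution of one atom at distance d, for vdW radius r."""
    if d < 0.0:
        raise ValueError("distance must be non-negative")
    if d < r:
        # Gaussian core.
        return math.exp(-2.0 * d * d / (r * r))
    if d < 1.5 * r:
        # Quadratic tail, continuous with the core at d = r and zero at d = 1.5 r.
        e2 = math.e ** 2
        return 4.0 * d * d / (e2 * r * r) - 12.0 * d / (e2 * r) + 9.0 / e2
    return 0.0


# Carbon radius taken from the map_features example on this page.
r_carbon = 1.7
profile = [atomic_density(d, r_carbon) for d in (0.0, 0.85, 1.7, 2.0, 2.55, 3.0)]
print(profile)
```

The finite cutoff at 1.5 r means each atom only touches nearby grid points, which keeps the mapped grids sparse.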
map_features(featlist, transform=None)
Map individual features to the grid.
For residue-based features, the feature file must be of the format: chainID residue_name(3-letter) residue_number [values]
For atom-based features, it must be: chainID residue_name(3-letter) residue_number atom_name [values]
Returns: Mapped features
Return type: np.array
Raises: ImportError, ValueError
featgrid(center, value, type_='fast_gaussian')
Map an individual feature (atomic or residue) on the grid.
Returns: Mapped feature
Return type: np.array
Raises: ValueError