Machine Learning for Precision Oncology and Drug Design (MLPODD)
Our research focuses on the development and application of computational methods to predict and analyse the modulation of protein and cell function by small organic molecules. These problems can be tackled by generating predictive models from relevant data using machine learning (an approach that has recently been rebranded as AI for Drug Discovery). Within this area, problems of interest include predicting treatment response of tumours from their molecular profiles for precision oncology, cancer pharmaco-omics modelling for phenotypic drug design, molecular target prediction by bioactivity data mining and target-based drug design (e.g. structure-based virtual screening guided by highly-predictive machine-learning scoring functions).
Precision Oncology - Methods
The efficacy of a drug treatment depends strongly on the cancer patient. There is hence a great need to investigate computational methods able to predict which patients will respond to a given treatment. Each tumour is often described by many thousands of numerical features (e.g. those coming from cheap and fast molecular profiling technologies, such as RNA-seq or Methyl-Seq). Machine learning can be used to identify which combinations of these gene alterations can predict treatment response and thus guide precision oncology efforts. Unfortunately, the number of tumours of a given cancer type that have been both molecularly profiled and treated with the same drug is generally small (it rarely exceeds 100). Such high-dimensional classification problems are hard, as many algorithms struggle to build classifiers that ignore the thousands of irrelevant features.
We are investigating the integration of feature selection with machine learning algorithms to build classifiers that only make use of a much smaller subset of features (the most discriminative ones). For instance, when systematically analysing a comprehensive in vivo data set (1), we have observed that identifying an optimal subset of features using random forest as the base learner results in predictive models for most cancer types, treatments and profiles. We are also interested in the challenge of how best to interpret a prediction in terms of the selected gene alterations in order to explain why a specific tumour is sensitive or resistant to the treatment.
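The idea of wrapping feature selection around a learner can be sketched as follows. This is only an illustration under stated assumptions, not the protocol used in (1): the data are synthetic, the features are ranked with a simple univariate score, and a nearest-centroid classifier stands in for the random forest so that the example needs nothing beyond the Python standard library. All function names are hypothetical. The one point it is meant to show is that the feature subset is re-selected inside each cross-validation fold, using training tumours only, so that the held-out tumours never influence which features are kept.

```python
import random
from statistics import mean, pstdev


def make_toy_data(n_tumours=60, n_features=500, n_informative=5, shift=3.0, seed=0):
    """Synthetic profiles: only the first n_informative features carry signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_tumours)]  # 1 = responder, 0 = non-responder
    X = [[rng.gauss(0.0, 1.0) + (shift if j < n_informative and label else 0.0)
          for j in range(n_features)] for label in y]
    return X, y


def score_features(X, y):
    """Absolute difference of class means, scaled by the feature's spread."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev([row[j] for row in X]) or 1.0  # guard constant features
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    return scores


def select_top_k(X, y, k):
    """Indices of the k highest-scoring features."""
    scores = score_features(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, features):
    """Per-class mean profile restricted to the selected features."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    return centroids


def predict(centroids, features, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validated_accuracy(X, y, k_features, n_folds=5, seed=0):
    """k-fold CV in which feature selection is repeated on each training split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        feats = select_top_k(X_tr, y_tr, k_features)  # training tumours only
        model = fit_centroids(X_tr, y_tr, feats)
        correct += sum(predict(model, feats, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


X, y = make_toy_data()
print("selected features:", sorted(select_top_k(X, y, 5)))
print("cross-validated accuracy:", cross_validated_accuracy(X, y, k_features=5))
```

Selecting features once on the full data set and only then cross-validating would leak information from the held-out tumours and give optimistic estimates, which matters a great deal when there are fewer than 100 tumours and thousands of features.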
Precision Oncology - Applications
We have compared the standard approach of identifying single-gene markers to the emerging multi-gene approach of combining multiple gene alterations with machine learning using the same in vitro pharmaco-genomic data (2, 3). We have looked at the same question using in vivo preclinical data (1) and we are currently investigating this issue with in vivo clinical data as well.
All these studies reveal that a higher proportion of cancer type-treatment pairs can be accurately predicted if: 1) multi-gene classifiers are built (especially those integrating feature selection), 2) a higher number of machine learning algorithms is employed, and 3) a higher number of molecular profiles is considered. By systematically comparing single-gene and multi-gene classifiers, we have also found that the characteristic low recall (sensitivity) of a single-gene marker is not an intrinsic limitation of precision oncology, but a result of using a single-feature classifier instead of one effectively combining multiple gene alterations (1, 3).
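The recall argument can be made concrete with a toy calculation. The ten tumours, the gene names G1 to G3 and the hand-written "any of these genes is altered" rule below are all invented for illustration; the rule merely stands in for a learned multi-gene classifier and is not taken from (1) or (3).

```python
def recall_and_precision(predicted, actual):
    """Recall (sensitivity) and precision for boolean predictions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical cohort: (set of altered genes, tumour is sensitive to the drug)
tumours = [
    ({"G1"}, True), ({"G1"}, True),
    ({"G2"}, True), ({"G2"}, True),
    ({"G3"}, True), ({"G3"}, True),
    (set(), False), (set(), False), (set(), False),
    ({"G1"}, False),
]
sensitive = [label for _, label in tumours]

# Single-gene marker: predict "sensitive" only when G1 is altered.
single_gene = ["G1" in altered for altered, _ in tumours]

# Toy multi-gene rule: predict "sensitive" when any of G1, G2, G3 is altered.
multi_gene = [bool(altered & {"G1", "G2", "G3"}) for altered, _ in tumours]

print("single-gene (recall, precision):", recall_and_precision(single_gene, sensitive))
print("multi-gene  (recall, precision):", recall_and_precision(multi_gene, sensitive))
```

In this made-up cohort the single-gene marker misses every sensitive tumour whose sensitivity is associated with G2 or G3, which is where its low recall comes from; combining the three alterations recovers them.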
We are currently investigating the application of the developed tools to clinical pharmaco-omic data sets, such as those from acute myeloid leukaemia and metastatic breast cancer patients.
Drug Design - Methods
In addition to research intended to optimise the application of known drugs, there is a constant need to discover new drugs to treat cancer patients who do not respond to first-line treatments, relapse and/or have a poor prognosis with current treatments. This cannot be achieved without a way to identify molecules modulating a specific biological function of a therapeutic target. There is now a range of computational methods able to predict the biological activities of a molecule from ever-increasing volumes of relevant experimental data. For instance, Virtual Screening (VS) methods can be used to search vast libraries of molecules for those likely to be active against the considered target. In practice, these tools have been able to discover drug leads for a wide range of targets and are particularly useful for those targets where High-Throughput Screening (HTS) performs poorly or is not an option (e.g. technically not possible, too expensive or too slow). There are also methods devised for optimising the potency of drug leads as well as predicting their off-targets.
For the scenario where one has a molecule with affinity for the target of interest, we devised a ligand-based VS method named Ultrafast Shape Recognition (USR)(4). USR searches molecular libraries for molecules with a 3D shape similar to that of this template. This is beneficial in that similarly shaped molecules are likely to both hit the same targets as the search template and have a different chemical scaffold(4). Others have built upon this concept by incorporating the spatial distribution of pharmacophoric properties into the search, as in USRCAT(5). We have recently implemented both tools in the USR-VS(6) webserver to carry out large-scale prospective VS.
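A minimal sketch of the USR idea, written from the description in (4) and not taken from the USR-VS code: each molecule is encoded by the first three moments of its atomic distance distributions to four reference points (the molecular centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), giving 12 descriptors, and two molecules are compared with an inverse scaled Manhattan distance. The choice of variance and standardised skewness for the second and third moments is an assumption here, as implementations differ in how the moments are normalised; the coordinates and function names are made up for the example.

```python
import math


def _moments(distances):
    """Mean, variance and skewness of a list of distances."""
    n = len(distances)
    mu = sum(distances) / n
    var = sum((d - mu) ** 2 for d in distances) / n
    third = sum((d - mu) ** 3 for d in distances) / n
    skew = third / var ** 1.5 if var > 0 else 0.0
    return [mu, var, skew]


def usr_descriptors(coords):
    """12 shape descriptors from a list of (x, y, z) atomic coordinates."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(atom, ctd) for atom in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # closest atom to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # farthest atom from the centroid
    d_fct = [math.dist(atom, fct) for atom in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # farthest atom from fct
    descriptors = []
    for ref in (ctd, cst, fct, ftf):
        descriptors.extend(_moments([math.dist(atom, ref) for atom in coords]))
    return descriptors


def usr_similarity(query, candidate):
    """Similarity in (0, 1]; 1 means identical descriptor vectors."""
    manhattan = sum(abs(q - c) for q, c in zip(query, candidate))
    return 1.0 / (1.0 + manhattan / len(query))


# Hypothetical conformers given as bare coordinates (no chemistry involved).
template = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0),
            (3.6, 1.0, 0.8), (4.1, 2.3, 1.5)]
shifted = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in template]
compact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
           (0.0, 0.0, 1.0), (0.7, 0.7, 0.7)]

q = usr_descriptors(template)
print("same shape, translated:", usr_similarity(q, usr_descriptors(shifted)))
print("different shape:", usr_similarity(q, usr_descriptors(compact)))
```

Because the descriptors are built from interatomic distances only, no superposition of the two molecules is needed, which is what makes this kind of search fast enough for very large libraries.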
If a structural model of the protein target is available (e.g. X-ray crystal structure), structure-based methods such as molecular docking can be used to predict the strength with which a molecule binds the target. Docking is useful to identify new drug leads for a target or design more potent drug leads. The single most important limitation of docking is in ranking molecules by their predicted binding strength, which is carried out by specialised Scoring Functions (SFs). In this area, we demonstrated(7) the advantages of machine-learning SFs over classical SFs (i.e. those based on a linear combination of features). We revealed(8) that a more precise chemical description of the protein-ligand complex does not generally lead to more predictive SFs, as was generally thought. We recently showed(9) that the performance of classical SFs quickly stagnates with increasing training data size, unlike that of machine-learning SFs. When tailored for VS, we have found(10) that machine-learning SFs obtain substantially improved VS performance by training with unusually large sets of inactives.
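To make the distinction concrete, the sketch below computes a simple description of a protein-ligand complex and scores it with a linear combination of features, which is the functional form of a classical SF as defined above. The featurisation (counts of protein-ligand atom pairs per element pair within a 12 Å cutoff) is an assumption modelled on the kind of descriptor used by RF-Score; the element lists, toy coordinates, weights and function names are illustrative only. A machine-learning SF would keep such features but replace `linear_score` with a nonlinear regressor (e.g. a random forest) trained on complexes with measured binding affinities.

```python
import math
from collections import Counter

# Element types considered on each side of the complex (assumed for this sketch).
PROTEIN_ELEMENTS = ("C", "N", "O", "S")
LIGAND_ELEMENTS = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")


def contact_features(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand atom pairs closer than `cutoff`, per element pair.

    Atoms are (element, (x, y, z)) tuples. Returns a fixed-length vector with
    one entry per (protein element, ligand element) combination.
    """
    counts = Counter()
    for p_element, p_xyz in protein_atoms:
        for l_element, l_xyz in ligand_atoms:
            if math.dist(p_xyz, l_xyz) < cutoff:
                counts[(p_element, l_element)] += 1
    return [counts[(p, l)] for p in PROTEIN_ELEMENTS for l in LIGAND_ELEMENTS]


def linear_score(features, weights, intercept=0.0):
    """Classical-SF form: a weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Toy complex: two protein atoms near the ligand, one far away.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (20.0, 0.0, 0.0))]
ligand = [("C", (1.0, 0.0, 0.0)), ("O", (2.0, 0.0, 0.0))]

features = contact_features(protein, ligand)
weights = [0.5] * len(features)  # made-up weights, not fitted to any data
print("number of features:", len(features))
print("linear score:", linear_score(features, weights))
```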
In the best-case scenario, a drug lead with high potency against its intended target is generated at a high financial and time expense. Unfortunately, many of these optimised leads turn out not to be cell-active in the end and hence have no therapeutic value. With our collaborators in the UK, we have implemented as a webserver an existing method to predict the cell line growth inhibition induced by a molecule(19). This can be used to position a lead on a cancer type by predicting on which cell lines it would induce stronger growth inhibition. This tool can also be used for phenotypic drug design, where a large library of molecules is searched for those predicted to be more active on a given cancer type. Thereafter, it is desirable to predict the targets of the resulting phenotypic hits. For this purpose, we have developed and validated a target prediction method(11), which is available as a webserver(12). Recently, we have also developed a method to predict the synergy of drugs in inhibiting cancer cell lines(13).
Drug Design - Applications
In prospective VS studies, we have observed that USR excels at discovering bioactive molecules with new chemical scaffolds (14–17). Several collaborations are ongoing to discover novel ligands for other targets using USR and USRCAT. We have also used a machine-learning SF (RF-Score) as a part of a hierarchical VS protocol, which led to the discovery of a large proportion of inhibitors of an antibacterial target(15). However, unlike RF-Score, RF-Score-VS was devised specifically for VS, which results in substantially better VS results, as discussed in our review(18). We have now initiated collaborations to use machine-learning SFs for prospective VS against several cancer targets. Separately, we are using MolTarPred(12) to predict the targets of some clinical drugs. Our collaborators have experimentally confirmed some of the predicted targets (one of these previously unknown targets binds to the drug with a 300 nM potency).
On the phenotypic drug design side, we have predicted the growth inhibition potencies and pairwise synergies of a large set of clinical drugs on cancer cell lines using the methods in (19) and (13), respectively. Selected predictions are currently being validated in vitro by our collaborators.
1. Nguyen,L., Naulaerts,S., Bomane,A., Bruna,A., Ghislat,G. and Ballester,P. (2018). bioRxiv, 10.1101/277772.
2. Nguyen,L., Dang,C.C. and Ballester,P.J. (2017) F1000Research, 5, 2927.
3. Naulaerts,S., Dang,C.C. and Ballester,P.J. (2017) Oncotarget, 5.
4. Ballester,P.J. and Richards,W.G. (2007) J. Comput. Chem., 28, 1711–1723.
5. Schreyer,A. and Blundell,T. (2012) J. Cheminform., 4, 27.
6. Li,H., Leung,K.-S., Wong,M.-H. and Ballester,P.J. (2016) Nucleic Acids Res., 44, W436–W441.
7. Ballester,P.J. and Mitchell,J.B.O. (2010) Bioinformatics, 26, 1169–1175. (ISI highly-cited paper for being within the top 1% of citations within its JCR category)
8. Ballester,P.J., Schreyer,A. and Blundell,T.L. (2014) J. Chem. Inf. Model., 54, 944–955.
9. Li,H., Peng,J., Sidorov,P., Leung,Y., Leung,K.-S., Wong,M.-H., Lu,G. and Ballester,P.J. (2019) Bioinformatics, 10.1093/bioinformatics/btz183.
10. Wójcikowski,M., Ballester,P.J. and Siedlecki,P. (2017) Sci. Rep., 7, 46710. (79th most read paper out of the over 24,000 published by Scientific Reports in 2017: www.nature.com/collections/zzcpmcdkqp/content/76-100)
11. Peón,A., Naulaerts,S. and Ballester,P.J. (2017) Sci. Rep., 7, 3820.
12. Peón,A., Li,H., Ghislat,G., Leung,K., Wong,M., Lu,G. and Ballester,P.J. (2019) Chem. Biol. Drug Des., 10.1111/cbdd.13516.
13. Sidorov,P., Naulaerts,S., Ariey-Bonnet,J., Pasquier,E. and Ballester,P. (2018) bioRxiv, 10.1101/504076.
14. Ballester,P.J., Westwood,I., Laurieri,N., Sim,E. and Richards,W.G. (2010) J. R. Soc. Interface, 7, 335–342.
15. Ballester,P.J., Mangold,M., Howard,N.I., Marchese Robinson,R.L., Abell,C., Blumberger,J. and Mitchell,J.B.O. (2012) J. R. Soc. Interface, 9, 3196–3207.
16. Hoeger,B., Diether,M., Ballester,P.J. and Köhn,M. (2014) Eur. J. Med. Chem., 88, 89–100.
17. Patil,S.P., Ballester,P.J. and Kerezsi,C.R. (2014) J. Comput. Aided. Mol. Des., 28, 89–97.
18. Ain,Q.U., Aleksandrova,A., Roessler,F.D. and Ballester,P.J. (2015) Wiley Interdiscip. Rev. Comput. Mol. Sci., 5, 405–424. (among the top 10 most downloaded articles of this journal in 2018: wires.wiley.com/WileyCDA/WiresCollection/id-43.html)
19. Cortés-Ciriano,I., Murrell,D.S., Chetrit,B., Bender,A., Malliavin,T. and Ballester,P.J. bioRxiv, 10.1101/105478
This team was created with the arrival of Dr Pedro Ballester at the CRCM in October 2014. We are interested in acting as a host for applications to CR permanent research positions from Inserm or CNRS as well as postdoctoral fellowships (e.g. EU Marie Curie or HFSP programmes). Applications for PhD scholarships can also be supported. Prospective applicants should send a concise explanation of their research interests and a CV with publications to pedro.ballester(at)inserm(dot)fr.
Current team members are: Dr Pavel Sidorov (postdoc 2017-19), Ms Linh Nguyen (PhD student 2016-19), Ms Alexandra Bomane (PhD student 2016-19), Mr Adeolu Ogunleye (PhD student 2018-21), Mr Amad Diouf (M2 student 2019), Mr Louison Fresnais (M2 student 2019) and Dr Pedro Ballester (PI, tenured).
Past team members are: Dr Stefan Naulaerts (postdoc 2017-18), Dr Cuong Dang (postdoc 2015-17), Dr Antonio Peon (postdoc 2015-17), Ms Elva Novoa (PhD student 2016), Ms Fahmida Ahmad (PhD student 2016), Dr Hongjian Li (postdoc 2015), Mr Michal Zulcinski (M2 student 2018) and Mr Nicolas Jaume (M2 student 2016).
About the team leader
Dr Ballester has authored 57 papers since 2003 (53 published, 4 in review), 79% of them as single or joint corresponding author. When restricted to peer-reviewed papers in leading positions, either first or corresponding author, his h-index is 21 (Source: Google Scholar). His three largest grants as PI to date are: 2017-19 ANR Tremplin-ERC (France; €130,000), 2015-17 A*MIDEX Excellence Chair (France; €235,000) and 2010-14 MRC Methodology Research Fellowship (UK; £400,905). In addition, he has raised funding for 2 PhD scholarships from regional programmes (2016, 2019) and also attracted 4 PhD scholarships from international bilateral programmes between France and: Vietnam (2015), Mexico (2016), Pakistan (2016), Nigeria (2018). He is a referee for funding organisations (France ANR, Spain ANEP, Netherlands NOSR, Switzerland SNSF, Israel ISF, Luxembourg FNR, etc.) and also serves as editor and reviewer for several journals (certified here: https://publons.com/author/975063/).
Further information can be found in his CV [PDF]
Education and Work History
- 2016: HDR, Aix-Marseille University, France.
- Since 2015: Group leader of the team MLPODD, CRCM – Marseille, France.
- Since 2014: Researcher CR1 Inserm, France.
- 2010-2014: MRC Methodology Research Fellow at EMBL-EBI, UK.
- 2009-2010: postdoctoral researcher at University of Cambridge, UK.
- 2005-2008: postdoctoral researcher at University of Oxford, UK.
- 2001-2005: PhD and research assistant at Imperial College London, UK.
- 2000-2001: MSc with Distinction from King’s College London, UK.
Awards and Honours
- 2017: ANR TREMPLIN-ERC.
- 2015: A*MIDEX Excellence Chair.
- 2014: tenured CR1 Inserm position in French-wide competition.
- 2011: Junior Research Fellow at Wolfson College Cambridge, UK.
- 2010: MRC Methodology Research Fellow in UK-wide competition.
- 2007: Junior Research Fellow at St Cross College Oxford, UK.
- 2000: Sa Nostra Foundation Scholarship for funding a foreign MSc in Spanish-wide competition.
[last updated: April 2019]