Teach – Discover – Treat: A COMP initiative to provide high quality computational chemistry tutorials that impact education and drug discovery for neglected diseases http://www.TDTproject.org
Challenge 1 :: Malaria High-throughput Screen
Use a dataset from a high-throughput phenotypic screen for Malaria (single concentration screening data and selected IC50s) to build a predictive model for anti-Malaria activity, and use that model to select the next set of compounds for screening in order to identify high quality chemical starting points for optimization.
Workflow components:
- Analysis of single concentration screening data: hit list triaging, selection of compounds for IC50
- Building and validating a predictive activity model, including predicting activity in a held-out test set
- Follow-up hit-finding: applying predictive model to rank-order commercially available compounds
Screening set: 305,568 compounds with single concentration percent inhibition data
Held-out test set: 1,056 compounds with unpublished IC50 data.
The experimental IC50 values for the 1,056-compound held-out test set can be downloaded here (ASCII text file; tab-delimited).
NOTE: Please use the following reference for the provided experimental IC50 values:
“To be published”, David C. Smithson, Julie Clark, Michele Connelly, W. Armand Guiguemde, Anang A. Shelat, R. Kiplin Guy
Discovery opportunity: Commitment from our partners at St. Jude to screen at least 100 compounds in the Malaria phenotypic screen for an overall winner.
Background
Whole cell phenotypic screening campaigns against Plasmodium falciparum (Pf) have been successful in identifying new hit chemotypes. The widespread resistance of Pf to many current antimalarial drugs makes following up on these hit discoveries an urgent and important effort.
Computational challenge and discovery opportunity
The challenge to the computational chemistry community is to develop a workflow based on the available data that allows for the selection of a new set of compounds to be screened in the Pf phenotypic assay as part of hit validation and expansion from the initial screen. The computational methodology will be validated by comparing predicted activity with the measured values for a held-out test set from the Pf phenotypic assay. Successful approaches can be used to screen larger, more diverse sets of commercially available compounds, and if an overall winner is identified, the winner's selected compounds will be screened in the Pf phenotypic assay.
Input data-package
HTS hit list and compounds with confirmed IC50 data in malaria Pf whole cell assay (partnership Anang Shelat, Kip Guy, St. Jude)
Files in data-package:
1. malariaHTS_trainingSet.txt.gz: tab-delimited file with 305,568 rows containing compound information, single point percent inhibition data, and EC50 data. This ~10 MB gzipped file has six columns:
SAMPLE: sample identifier
Pf3D7_ps_green: primary screen, measuring green fluorescence intensity
Pf3D7_ps_red: primary screen, measuring red fluorescence intensity
Pf3D7_ps_hit: standardized call on hits: ‘true’ if activity in red AND green 80% <= x < 250%; ‘false’ if activity in red AND green <20%; ‘ambiguous’ for all other compounds (20 – 80%, or >250%)
Pf3D7_pEC50: Reported pEC50 value (NA for compounds not submitted for dose-response confirmation)
Canonical_Smiles: standardized structure information
2. malariaHTS_externalTestSet.txt: 1,056 compounds in held-out, external test set for validating the predictive model
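As a rough sketch of how the training set file might be read, the Python below parses the six columns listed above and re-derives the hit call from the green and red readouts. It assumes that the file has a header row with exactly these column names, that the green and red columns hold percent activity as plain numbers, and that a value of exactly 250% counts as ambiguous; none of this is stated explicitly in the data-package description. The function names are illustrative only.

```python
import csv
import gzip
import io


def hit_call(green, red):
    """Re-derive the hit call from the two percent-activity readouts."""
    if 80.0 <= green < 250.0 and 80.0 <= red < 250.0:
        return "true"
    if green < 20.0 and red < 20.0:
        return "false"
    return "ambiguous"


def read_rows(handle):
    """Yield one dict per compound from an open tab-delimited text handle."""
    for row in csv.DictReader(handle, delimiter="\t"):
        pec50 = row["Pf3D7_pEC50"]
        yield {
            "sample": row["SAMPLE"],
            "green": float(row["Pf3D7_ps_green"]),  # assumed numeric
            "red": float(row["Pf3D7_ps_red"]),      # assumed numeric
            "hit": row["Pf3D7_ps_hit"],
            # NA marks compounds without dose-response confirmation
            "pEC50": None if pec50 in ("", "NA") else float(pec50),
            "smiles": row["Canonical_Smiles"],
        }


def read_training_set(path="malariaHTS_trainingSet.txt.gz"):
    """Read the gzipped training set into a list of dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(read_rows(handle))
```

Comparing hit_call(row["green"], row["red"]) with the stored Pf3D7_ps_hit value is one way to check that the assumptions above match the real file before any modelling starts.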
Relevant literature
Malaria dataset
Nature, 2010, v465, pp311-315, http://dx.doi.org/10.1038/nature09099
Chemistry & Biology, 2012, v19, pp116-129, http://dx.doi.org/10.1016/j.chembiol.2012.01.004
HTS data analysis
http://dx.doi.org/10.1021/jm201328e
http://dx.doi.org/10.1021/ci2003285
http://jbx.sagepub.com/content/16/7/775
http://dx.doi.org/10.1021/ci900113d
http://dx.doi.org/10.1021/ci0502808
Tasks to be covered in tutorial
1. Select compounds from HTS hit-list for confirmation assays; use the HTS.txt file (the training set, malariaHTS_trainingSet.txt.gz in the data-package) and primary screening data
a) Hit-list triaging: filtering of primary hits
b) Hit selection: selecting representatives for confirmation assays; we request you propose a selection of 500 compounds (these do not have to be part of the set of compounds that were selected for EC50 in the real-life experiment!).
Note: The training set file provided by TDT contains both single concentration screening data and EC50 data for compounds that were selected for experimental follow-up. The workflow should start with the single concentration screening data and analyze those compounds (and associated data) as if there were no EC50 data available. Initially, the compounds would be triaged/filtered followed by selection of single concentration hits (compounds) for experimental EC50 follow-up. In short, the developed workflow does not need to match the selections that the experimental group made.
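One possible way to pick representatives from the triaged hits is a greedy, similarity-based selection: walk down the hits from strongest to weakest and keep a compound only if it is not too similar to anything already kept. The sketch below is not the method used by the experimental group. It assumes each hit comes with a score (for example the weaker of its two channel readouts) and a set of fingerprint bits computed elsewhere with a cheminformatics toolkit; the 0.7 similarity cut-off and all names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_representatives(hits, n_select=500, max_similarity=0.7):
    """Greedy selection of diverse, high-scoring hits.

    hits: list of (identifier, score, feature_set) tuples, where
    feature_set is a set of fingerprint bits (assumed precomputed).
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    picked = []
    for ident, score, features in ranked:
        # keep the compound only if it differs enough from every pick so far
        if all(tanimoto(features, p[2]) < max_similarity for p in picked):
            picked.append((ident, score, features))
            if len(picked) == n_select:
                break
    return [p[0] for p in picked]
```

Lowering max_similarity spreads the 500 picks over more chemotypes, while raising it allows several close analogues per chemotype; which trade-off is better depends on the goals of the confirmation assay.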
2. Build & apply predictive model for further hit finding
a) Dataset preparation for model building, including how to define a training set and an internal test set; use the HTS.txt file (the training set, malariaHTS_trainingSet.txt.gz in the data-package) and the standardized call of what constitutes a hit as captured in the column Pf3D7_ps_hit
b) Model building with training set
c) Model validation with internal test set; report early enrichment (top 5%) as well as AUC. Recommended reading:
http://www.springerlink.com/content/6k41w776567368q5/
http://www.springerlink.com/content/u8wp371t2l1182p6/
http://www.springerlink.com/content/k853876563745m26/
d) Model validation with the held-out, external test set; use extTest.txt (malariaHTS_externalTestSet.txt in the data-package) and report predicted activity for the whole set. Experimental data is available for this set and will be used to judge the quality of this submission.
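The two validation measures requested above can be computed without any external library. The sketch below assumes that "early enrichment (top 5%)" means the enrichment factor, i.e. the hit rate among the top-scoring 5% of compounds divided by the hit rate of the whole set; the challenge text does not give an exact formula, so this is one reading. AUC is computed from rank sums (the Mann-Whitney formulation), with tied scores sharing their average rank. Function names are illustrative.

```python
import math


def enrichment_factor(scores, labels, fraction=0.05):
    """Hit rate in the top-scoring fraction divided by the overall hit rate."""
    n = len(scores)
    total_actives = sum(1 for y in labels if y)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, math.ceil(fraction * n))
    order = sorted(range(n), key=lambda k: scores[k], reverse=True)
    actives_top = sum(1 for k in order[:n_top] if labels[k])
    return (actives_top / n_top) / (total_actives / n)


def roc_auc(scores, labels):
    """Area under the ROC curve via rank sums, averaging ranks over ties."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one inactive")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

Here labels are truthy for actives (for example Pf3D7_ps_hit equal to 'true') and scores are the model's predicted activities, with higher meaning more likely active.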
3. Follow-up hit finding
a) Download the most recent file of commercially available compounds from eMolecules: go to http://downloads.emolecules.com/ordersc/ and select the most recent directory. We recommend you download this in January 2014 for good availability; use the “parent” file (without salts and solvates)
b) Rank-order commercial compounds based on predicted activity
NOTE: At least 100 compounds from the winning submission will be acquired and tested in the Malaria Pf whole-cell assay (partnership Anang Shelat, Kip Guy, St. Jude); we ask that you submit a rank-ordered list of 1000 compounds from the eMolecules file.
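Because the eMolecules file is large, the top 1000 compounds can be collected while streaming through the scored compounds rather than sorting everything in memory. The sketch below assumes each compound has already been scored by the model and arrives as an (identifier, smiles, predicted score) tuple with higher scores meaning more likely active. The header row and the rank column in the output are assumptions, not a required submission format; the function names are illustrative.

```python
import heapq


def top_ranked(scored_compounds, n_top=1000):
    """Return the n_top best compounds, best first.

    scored_compounds: iterable of (identifier, smiles, predicted_score).
    """
    return heapq.nlargest(n_top, scored_compounds, key=lambda c: c[2])


def format_ranked_list(ranked):
    """Render a rank-ordered list as tab-delimited text (identifier, smiles)."""
    lines = ["rank\tidentifier\tsmiles"]
    for rank, (ident, smiles, _score) in enumerate(ranked, start=1):
        lines.append(f"{rank}\t{ident}\t{smiles}")
    return "\n".join(lines) + "\n"
```

Passing a generator that scores compounds one at a time to top_ranked keeps memory use proportional to n_top rather than to the size of the commercial file.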
Submission package
Submit your tutorial and data here: http://file.teach-discover-treat.org/submit/index.php
Specific files to include for this challenge
1. Predictions for held-out, external test set against the Pf whole-cell assay (identifier, smiles, and measure of predicted activity)
2. Rank-ordered list of top-1000 commercial compounds predicted to be active in the Pf whole-cell assay (identifier, smiles)
The Judging Criteria can be found here.