Datasets

For convenience, references and datasets related to benchmarking results of the IEDB Analysis Resource predictive tools are collected here.

MHC class I binding prediction

Automated benchmarking of peptide-MHC class I binding predictions
Trolle T, Metushi IG, Greenbaum JA, Kim Y, Sidney J, Lund O, Sette A, Peters B, Nielsen M.
Bioinformatics
- Description:
  Numerous in silico methods predicting peptide binding to major histocompatibility complex (MHC) class I molecules have been developed over the last decades. However, the multitude of available prediction tools makes it non-trivial for the end-user to select which tool to use for a given task. To provide a solid basis on which to compare different prediction tools, we here describe a framework for the automated benchmarking of peptide-MHC class I binding prediction tools. The framework runs weekly benchmarks on data that are newly entered into the Immune Epitope Database (IEDB), giving the public access to frequent, up-to-date performance evaluations of all participating tools. To overcome potential selection bias in the data included in the IEDB, a strategy was implemented that suggests a set of peptides for which different prediction methods give divergent predictions as to their binding capability. Upon experimental binding validation, these peptides entered the benchmark study.
- Links:
  Weekly results
  Participate

NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data.
Reynisson B., Alvarez B., Paul S., Peters B., Nielsen M.
Nucleic Acids Res. 2020.
- Description of the dataset:
  Dataset used for training of NetMHCpan 4.1. See manuscript for details.
- Dataset availability: https://services.healthtech.dtu.dk/suppl/immunology/NAR_NetMHCpan_NetMHCIIpan/

NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data.
Vanessa Jurtz, Sinu Paul, Massimo Andreatta, Paolo Marcatili, Bjoern Peters, Morten Nielsen
J Immunol, 2017.
- Description of the dataset:
  Dataset used for training of NetMHCpan 4.0. See manuscript for details.
- Data format: Text file format.
- Dataset availability: https://services.healthtech.dtu.dk/suppl/immunology/NetMHCpan-4.0/

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions.
Kim Y, Sidney J, Buus S, Sette A, Nielsen M, Peters B.
BMC Bioinformatics
- Description of the dataset:
  1) All binding data used in the paper: BD2009, BD2013, and Blind.
  2) For BD2009 data set, three cross-validation data partitions were generated: cv_rnd, cv_sr, and cv_gs.
  3) FILE_S1: Prediction performances for SMMPMBEC, NetMHC, and NetMHCpan against cv_rnd, cv_sr, cv_gs, and Blind data sets. An R script that constructs logistic regression models of deviations (i.e. |cv - blind|) is also included.
- Date of the dataset generation: 2014
- Details on the dataset generation: BD2009 and BD2013 refer to MHC-I binding data files compiled in 2009 and 2013. Blind data sets refer to data resulting after subtracting BD2009 from BD2013. In the paper, different cross-validation strategies (i.e. cv_rnd, cv_sr, and cv_gs) were tested. Please see the Methods section for details of the cross-validation strategies.
- Data format: Text file format.
- Dataset availability: benchmark_reliability.tar.gz
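
The construction of the Blind set described above (BD2013 minus BD2009) can be sketched as follows. This is only an illustration under stated assumptions: the record layout (allele, peptide, affinity), the choice of (allele, peptide) as the matching key, the function name `blind_set`, and the example values are all hypothetical and are not taken from the paper or the archive.

```python
def blind_set(bd2009, bd2013):
    """Return the BD2013 records whose (allele, peptide) pair is absent from BD2009.

    Assumption: each record is a tuple (allele, peptide, affinity) and two
    records refer to the same measurement when allele and peptide match.
    """
    seen = {(allele, peptide) for allele, peptide, _ in bd2009}
    return [row for row in bd2013 if (row[0], row[1]) not in seen]


# Made-up example records, for illustration only.
bd2009 = [("HLA-A*02:01", "SLYNTVATL", 25.0)]
bd2013 = [
    ("HLA-A*02:01", "SLYNTVATL", 25.0),
    ("HLA-A*02:01", "GILGFVFTL", 12.0),
]
blind = blind_set(bd2009, bd2013)
```

Matching on the (allele, peptide) pair rather than on the full record means that a re-measured peptide is still treated as already seen, which seems the safer reading for a blind evaluation set.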

Dataset used for retraining the IEDB class I binding prediction tools.
- Description of the dataset: The dataset is largely identical to that of Kim et al. (2014), described above, but includes additional data that was not publicly available at the time.
- Date of the dataset generation: 2013
- Details on the dataset generation: The dataset was compiled from three sources: the IEDB, the Sette lab, and the Buus lab. If a peptide/allele combination had more than one measurement among the three sources, the geometric mean of those measurements was taken.
- Data format: Compressed text file containing binding data.
- Dataset availability: binding_data_2013.zip
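
As a minimal sketch of the merging rule described above, repeated measurements for one peptide/allele combination can be collapsed to their geometric mean. The record layout (allele, peptide, value), the function name `merge_measurements`, and the example values are assumptions made for illustration; they do not reflect the actual file format in the archive.

```python
import math
from collections import defaultdict


def merge_measurements(records):
    """Collapse repeated (allele, peptide) measurements to their geometric mean.

    Assumption: each record is (allele, peptide, value) with value > 0.
    """
    grouped = defaultdict(list)
    for allele, peptide, value in records:
        grouped[(allele, peptide)].append(value)
    # Geometric mean computed in log space: exp(mean(log(values))).
    return {
        key: math.exp(sum(math.log(v) for v in values) / len(values))
        for key, values in grouped.items()
    }


# Made-up example values, for illustration only.
records = [
    ("HLA-A*02:01", "SLYNTVATL", 10.0),
    ("HLA-A*02:01", "SLYNTVATL", 1000.0),
    ("HLA-B*07:02", "TPRVTGGGA", 50.0),
]
merged = merge_measurements(records)
```

The geometric mean suits affinity values that span several orders of magnitude, because it averages on the log scale rather than letting the largest measurement dominate.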

Derivation of an amino acid similarity matrix for peptide: MHC binding and its application as a Bayesian prior.
Kim Y, Sidney J, Pinilla C, Sette A, Peters B.
BMC Bioinformatics, 2009.
- Description of the dataset: Cross-validated predictive performances for SMMPMBEC using the same binding data set as in [Peters et al. PLOS Comput Biol 2006].
- Date of the dataset generation: 2009
- Details on the dataset generation: Cross-validated predictions were made with SMMPMBEC using the same cross-validation data partitions as were used for ANN and ARB in 2006.
- Data format: A table in Excel file format.
- Dataset availability: http://www.biomedcentral.com/1471-2105/10/394/additional

A Community Resource Benchmarking Predictions of Peptide Binding to MHC-I Molecules.
Peters B, Bui HH, Frankild S, Nielsen M, Lundegaard C, Kostem E, Basch D, Lamberth K, Harndahl M, Fleri W, Wilson SS, Sidney J, Lund O, Buus S, Sette A.
PLOS Computational Biology, 2006.
- Description of the dataset: Experimentally measured peptide binding affinities for MHC class I molecules from two sources: the Alessandro Sette lab at the La Jolla Institute and the Soren Buus lab at the University of Copenhagen. The dataset contains 48,828 affinities and covers a total of 48 mouse, human, macaque and chimpanzee MHC class I alleles.
- Date of the dataset generation: 2006
- Details on the dataset generation: Two different assays were used to generate the binding data.
- Data format: Compressed text files containing experimental binding data as well as cross-validated predicted affinities.
- Dataset availability: ANN, ARB, SMM

MHC class II binding prediction

An automated benchmarking platform for MHC class II binding prediction methods.
Andreatta M, Trolle T, Yan Z, Greenbaum JA, Peters B, Nielsen M.
Bioinformatics, 2018.
- Description:
  Computational methods for the prediction of peptide-MHC binding have become an integral and essential component for candidate selection in experimental T cell epitope discovery studies. The sheer amount of published prediction methods—and often discordant reports on their performance—poses a considerable quandary to the experimentalist who needs to choose the best tool for their research.
- Links:
  Weekly results
  Participate

NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data.
Reynisson B., Alvarez B., Paul S., Peters B., Nielsen M.
Nucleic Acids Res. 2020.
- Description of the dataset:
  Dataset used for training of NetMHCIIpan 4.0. See manuscript for details.
- Dataset availability: https://services.healthtech.dtu.dk/suppl/immunology/NAR_NetMHCpan_NetMHCIIpan/

Improved methods for predicting peptide binding affinity to MHC class II molecules.
Jensen KK, Andreatta M, Marcatili P, Buus S, Greenbaum JA, Yan Z, Sette A, Peters B, Nielsen M.
Immunology, 2018.
- Description of the dataset:
  Dataset used for training of NetMHCIIpan 3.2. See manuscript for details.
- Data format: Text file format.
- Dataset availability: https://services.healthtech.dtu.dk/suppl/immunology/NetMHCIIpan-3.2/

Peptide binding predictions for HLA DR, DP and DQ molecules.
Wang P, Sidney J, Kim Y, Sette A, Lund O, Nielsen M, Peters B.
BMC Bioinformatics, 2010.
- Description of the dataset: Experimentally measured peptide binding affinities for MHC class II molecules. Compared with the [Wang et al. 2008] dataset, HLA DP and DQ molecules are additionally covered. The dataset contains 44,541 measured affinities and covers 26 MHC class II alleles.
- Date of the dataset generation: 2010
- Details on the dataset generation: A binding assay based on inhibition of binding of a radiolabeled probe peptide to MHC molecules was used.
- Data format: Compressed text files in gzip format.
- Dataset availability: http://tools.iedb.org/mhcii/download/

A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach.
Wang P, Sidney J, Dow C, Mothe B, Sette A, Peters B.
PLOS Computational Biology, 2008.
- Description of the dataset: Experimentally measured peptide binding affinities for MHC class II molecules. The dataset contains 10,017 peptide binding affinities. The data span a total of 16 human and mouse MHC class II alleles.
- Date of the dataset generation: 2008
- Details on the dataset generation: A binding assay based on inhibition of binding of a radiolabeled probe peptide to MHC molecules was used.
- Data format: Compressed text files in zip format.
- Dataset availability:
  - MHC class II peptide binding affinities for 16 alleles: peptide_affinity_dataset.zip
  - PDB structures used for binding core prediction: non_redundant_pdb_core_pep_allele.txt
  - CD4+ T-cell activation: LCMV_T_cell_activation.txt

B-cell epitope prediction

Structural analysis of B-cell epitopes in antibody:protein complexes.
Kringelum JV, Nielsen M, Padkjær SB, Lund O.
Mol Immunol., 2013.
- Description of the dataset: 107 non-similar 3D structures of antigen-antibody complexes extracted from the PDB database.
- Date of the dataset generation: 2012
- Details on the dataset generation: The dataset was derived from 224 unique antigen-antibody complexes identified in the PDB (the criteria for uniqueness are not provided). Complexes with antigens shorter than 20 amino acids were removed. The resulting 162 complexes were subjected to a similarity analysis based on contacting amino acid pairs in the antigen-antibody interface, leading to the additional removal of 53 entries, in which interactions were not mediated by the antibody variable regions, or in proximity of these. In the published study, the remaining complexes (107) were superimposed using the antibody heavy chain as template to study general features of the epitopes and paratopes.
- Data format: Word file containing the table of PDB IDs and chain IDs for each complex.
- Dataset availability: Supplementary Table 1 in the Word file available for download at http://www.sciencedirect.com/science/article/pii/S0161589012003239

Reliable B cell epitope predictions: impacts of method development and improved benchmarking.
Kringelum JV, Lundegaard C, Lund O, Nielsen M.
PLoS Comput Biol. 2012.

- Dataset #1 (DiscoTope dataset)
- Description of the dataset: 75 X-ray crystal structures of antigen-antibody complexes with a resolution < 3 Å, divided into 25 homology groups based on antigen sequence.
- Date of the dataset generation: 2006
- Details on the dataset generation: https://pubmed.ncbi.nlm.nih.gov/17001032/
- Data format: PDF file containing Supplementary Table S1 with PDB IDs for each complex and chain ID for each antigen. Information on biological units (as described in the publication associated with the structure) is also provided if available (available for PDB entries: 1XIW, 1TZH, 1CZ8, 1BJ1, 1K4D, 1K4C, 1KYO, 1EZV, 1NCA, 1NMC, 1A14, 1NCB, 1NCC, 1NCD, 1OTS, 1AR1, 1NFD, 2HMI, 1EO8, 1QFU).
- Dataset availability: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531324/bin/pcbi.1002829.s003.pdf
- Dataset #2 (Independent from DiscoTope evaluation dataset)
- Description of the dataset: 52 3D structures of antigen-antibody complexes, selected from 584 PDB structures of antigen-antibody complexes provided in the IEDB database (http://www.immuneepitope.org/browse_by_3D.php?name=BCELL), in which the antigens are not homologous to proteins in the DiscoTope dataset.
- Date of the dataset generation: 2012
- Details on the dataset generation: Antibody heavy/light chains were automatically identified based on homology to two databases of antibody heavy and light chains, respectively, from various organisms. Protein chains not identified as light or heavy chains were initially annotated as antigens. 132 PDB entries containing no protein antigen chain and 42 entries that did not have both light and heavy chains were discarded. 5 entries containing single-chained antibodies joining light and heavy chains were included. From the remaining set of 410 antigen-antibody complexes, 52 antigens were retrieved using the criteria: 1) Structure resolved by X-ray crystallography (405 entries), 2) Size of antigen chain >150 residues (136 entries) and 3) No sequence similarity overlap to antigens in the DiscoTope dataset (BLAST E-values < 0.01). The 52 PDB files were manually processed into files containing one copy of the biological unit (antibody and antigen) as described in the PDB entry. Epitope residues in the 52 antigens were annotated as described above for the DiscoTope dataset, and the antigens were clustered into 33 homology groups based on antigen sequence similarity. Two entries were considered similar if any two antigen chains from the two entries had a BLAST E-value < 0.01.
- Data format: PDF file containing Supplementary Table S4 with PDB IDs for each complex and chain ID for each antigen. Information on biological units is also provided if available.
- Dataset availability: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531324/bin/pcbi.1002829.s006.pdf
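
The homology grouping described for Dataset #2 can be sketched as a clustering over pairwise similarity. The source states only the pairwise rule (E-value below 0.01); treating groups as connected components (single linkage) is an assumption made here, as are the function name `homology_groups`, the precomputed `evalues` mapping, and the placeholder entry names.

```python
def homology_groups(entries, evalues, cutoff=0.01):
    """Group entries so that any pair with an E-value below cutoff shares a group.

    Assumption: groups are connected components of the similarity graph.
    evalues maps (entry_a, entry_b) to the best E-value between their antigen
    chains; pairs missing from the mapping are treated as not similar.
    """
    parent = {entry: entry for entry in entries}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), evalue in evalues.items():
        if evalue < cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for entry in entries:
        groups.setdefault(find(entry), []).append(entry)
    return sorted(sorted(members) for members in groups.values())


# Placeholder entry names and made-up E-values, for illustration only.
entries = ["entryA", "entryB", "entryC", "entryD"]
evalues = {
    ("entryA", "entryB"): 1e-5,
    ("entryB", "entryC"): 1e-3,
    ("entryC", "entryD"): 0.5,
}
groups = homology_groups(entries, evalues)
```

Under this reading, entryA and entryC land in the same group through entryB even without a direct hit between them, which is the usual consequence of single-linkage grouping.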

Epitopia: a web-server for predicting B-cell epitopes.
Rubinstein ND, Mayrose I, Martz E, Pupko T
BMC Bioinformatics. 2009.
- Description of the dataset: 66 non-redundant manually validated antibody-antigen co-crystal structures.
- Date of the dataset generation: July 2009
- Details on the dataset generation: Unavailable
- Data format: Tab-delimited txt-file with PDB IDs and chain IDs for the antibody and antigen
- Dataset availability: http://epitopia.tau.ac.il/trainData/structure_dataset.txt

EPSVR and EPMeta: prediction of antigenic epitopes using support vector regression and multiple server results.
Liang S, Zheng D, Standley DM, Yao B, Zacharias M, Zhang C.
BMC Bioinformatics. 2010.
- Description of the dataset: 48 structures of antigen-antibody complexes with antigens for which unbound 3D structures in PDB are also available and epitopes were located only on one protein chain.
- Date of the dataset generation: October 2008
- Details on the dataset generation: The training set was gathered and screened from three protein data sets: 1) 22 antigen-antibody complexes and their unbound structures from protein docking Benchmark 2.0 [23]; 2) 59 representative antigen-antibody complexes compiled by Ponomarenko and Bourne [22]; 3) 17 antigen-antibody complex structures released between February 2006 and October 2008 with available unbound antigen structures, which formed the test set in [21]. A complex structure was not used if its antigenic epitope consisted of amino acid residues located on multiple chains. A complex was included if the sequence identity between its antigen and all antigens from the other complex structures was less than 35% following local sequence alignment. For an antigen with a sequence identity in the range of 35% to 50%, the antigen-antibody complex was accepted if its binding topology differed from that of its homologous complex. For an antigen with more than one antigenic epitope, only one was used in order to avoid confusion in subsequent application of support vector regression methods.
- Data format: Original and modified coordinate files in PDB format (text files), plus a list of the PDB file names.
- Dataset availability: http://sysbio.unl.edu/services/EPSVR/training.tar.gz
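
The sequence-identity rule described above can be written as a small decision function. This is a sketch, not the authors' code: the function name `accept_complex` and its parameters are hypothetical, the treatment of the exact 35% and 50% boundaries is an assumption, and rejecting complexes above 50% identity is inferred rather than stated in the description.

```python
def accept_complex(max_identity, same_topology_as_homolog):
    """Decide whether a complex passes the redundancy filter.

    max_identity: highest percent sequence identity between this antigen and
        the antigens of the other complexes (after local alignment).
    same_topology_as_homolog: True if the binding topology matches that of
        the homologous complex.
    """
    if max_identity < 35.0:
        return True  # no close homolog: include
    if max_identity <= 50.0:
        # Intermediate identity: include only if the binding topology differs.
        return not same_topology_as_homolog
    return False  # assumption: above 50% identity the complex is redundant


decisions = [
    accept_complex(20.0, True),
    accept_complex(40.0, False),
    accept_complex(40.0, True),
    accept_complex(60.0, False),
]
```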

Antibody-protein interactions: benchmark datasets and prediction tools evaluation.
Ponomarenko JV, Bourne PE.
BMC Structural Biol. 2007
- Dataset #1
- Description of the dataset: 62 representative 3D structures of protein antigens, 52 of which are one-chain proteins and the rest are two-chain proteins, with structural epitopes inferred from 3D structures of antibody-protein complexes. This dataset is intended for the study of the antigenic properties of proteins as well as for development and evaluation of the methods based on protein structure alone, or protein-protein unbound docking methods, that is, if the structure of the antibody is known or can be modeled.
- Dataset #2
- Description of the dataset: 82 structures of antibody-protein complexes containing different structural epitopes: 70 structures of proteins in complexes with two-chain antibodies and 12 structures of proteins in complexes with one-chain antibodies. This dataset is useful for the study of the properties of individual epitopes as well as for development and evaluation of protein-protein docking methods.
- Date of the datasets generation: January 2006
- Details on the datasets generation: 169 structures of protein antigens (length >30 amino acids) in complex with antibody fragments were manually collected from the PDB of January 2006 at a resolution ≤ 4 Å. Every structure was manually curated. Structures in which the antibody binds antigen but involves no CDR residues were excluded from the analysis; there were four such structures [PDB: 1MHH, 1HEZ, 1DEE, 1IGC]. If a structure contained several complexes in one asymmetric unit (there were 46 such structures in 165) and the authors of the structure observed no structural difference between these complexes, only one complex was selected: the one specified as the reference complex by the authors of the article describing the structure (primary citation in the PDB); there were 18 such structures out of 46. If the authors did not provide this information, all complexes in the structure were considered for analysis. The authors of a few structures clearly stated in their papers that antibody-protein contacts in the complexes were different: [PDB: 1MLC, 1NFD, 1OB1, 1P2C, 1QFW]. This initial curation was performed in order to correctly assign the protein-antibody complexes and decreased the number of individual complexes analyzed from 226 to 187 from a total of 169 structures. A total of 24 complexes were formed by one-chain antibody fragments and 163 complexes by two-chain antibody fragments. Alignment of protein chains was performed using the CE algorithm. The epitope was defined as consisting of protein antigen residues in which any atom of the epitope residue is separated from any antibody atom by a distance ≤ 4 Å. For further details, see http://www.biomedcentral.com/1472-6807/7/64.
- Data format: Table in the Word file.
- Dataset availability: http://www.biomedcentral.com/1472-6807/7/64/suppl/S1
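
The distance-based epitope definition given above (any antigen atom within 4 Å of any antibody atom) can be illustrated with a brute-force sketch. The input layout, the function name `epitope_residues`, and the toy coordinates are assumptions for illustration; in practice the coordinates would be read from a PDB file, and a spatial index would be preferable for large structures.

```python
import math


def epitope_residues(antigen_atoms, antibody_atoms, cutoff=4.0):
    """Return antigen residues with at least one atom within cutoff of an antibody atom.

    antigen_atoms: list of (residue_id, (x, y, z)) tuples, one per atom.
    antibody_atoms: list of (x, y, z) tuples, one per atom.
    Distances are in angstroms.
    """
    epitope = set()
    for residue_id, coord in antigen_atoms:
        if residue_id in epitope:
            continue  # residue already qualifies through another atom
        if any(math.dist(coord, ab) <= cutoff for ab in antibody_atoms):
            epitope.add(residue_id)
    return sorted(epitope)


# Toy coordinates, for illustration only.
antigen_atoms = [
    ("A10", (0.0, 0.0, 0.0)),
    ("A10", (1.0, 0.0, 0.0)),
    ("A11", (10.0, 0.0, 0.0)),
]
antibody_atoms = [(3.0, 0.0, 0.0)]
epitope = epitope_residues(antigen_atoms, antibody_atoms)
```

Here residue A10 qualifies because one of its atoms lies 3 Å from the antibody atom, while A11 lies 7 Å away and is excluded.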