- Data Note
Models and data of AMPlify: a deep learning tool for antimicrobial peptide prediction
BMC Research Notes volume 16, Article number: 11 (2023)
Antibiotic resistance is a rising global threat to human health, prompting researchers to seek effective alternatives to conventional antibiotics, among them antimicrobial peptides (AMPs). Recently, we reported AMPlify, an attentive deep learning model for predicting AMPs in databases of peptide sequences. In our tests, AMPlify outperformed the state-of-the-art. We have illustrated its use on data describing the American bullfrog (Rana [Lithobates] catesbeiana) genome. Here we present the model files and training/test data sets we used in that study. The original model (the balanced model) was trained on a balanced set of AMP and non-AMP sequences curated from public databases. In this data note, we additionally provide a model trained on an imbalanced set, in which non-AMP sequences far outnumber AMP sequences. The balanced and imbalanced models serve different use cases, and both would benefit the research community, facilitating the discovery and development of novel AMPs.
This data note provides two sets of models, as well as two AMP and four non-AMP sequence sets used for training and testing the balanced and imbalanced models. Each model set comprises five sub-models that together form an ensemble model. The first model set corresponds to the original model, trained on a balanced training set and described in the original AMPlify manuscript, while the second model set was trained on an imbalanced training set.
Antimicrobial peptides (AMPs) are gaining great interest as a potential replacement for progressively ineffective conventional antibiotics [1, 2]. In previous work, we built a deep learning model, AMPlify, with bidirectional long short-term memory (Bi-LSTM) and attention mechanisms for AMP prediction. The model displays state-of-the-art performance, and has identified four novel AMPs from the bullfrog genome. We have released a command-line tool that implements AMPlify, publicly available at https://github.com/bcgsc/AMPlify.
This data note has two main objectives. First, we provide users with flexibility and convenience in accessing the data for training and testing AMPlify, or using the trained models. The original model files and data sets are easily accessible from a public repository, should users wish to embed or adapt our method into their own pipelines. Second, we provide an auxiliary model for use cases where AMPs must be predicted from imbalanced sets of peptide sequences composed of many more non-AMP sequences and few AMP sequences.
The model described in our earlier publication was trained on a balanced set containing the same number of AMP and non-AMP sequences. The balanced model works well on relatively curated candidate sets (e.g. mature sequences derived from candidate AMP precursors identified by homology approaches), as applied in our previous work. For other cases, where the candidate sets are less curated and AMPs comprise only a small proportion (e.g. filtering through an entire genomics and/or transcriptomics data set), keeping the number of false positives to a minimum is essential. The imbalanced model reported here gives users the flexibility to predict AMPs from such larger sequence databases.
Training and test sets
In our study, we built balanced and imbalanced training and test sets to both train and assess our AMP prediction models. Table 1 lists the two AMP and four non-AMP sequence sets that form the aforementioned training and test sets. All AMP sequences were obtained from two curated databases: the Antimicrobial Peptide Database (APD3, http://aps.unmc.edu/AP) and the Database of Anuran Defense Peptides (DADP, http://split4.pmfst.hr/dadp). All non-AMP sequences were sampled from the UniProtKB/Swiss-Prot database (https://www.uniprot.org).
Data set 1 (3338 AMPs) and Data set 2 (3338 non-AMPs) in Table 1 form the balanced training set, while Data set 3 (835 AMPs) and Data set 4 (835 non-AMPs) form the balanced test set. These balanced training and test sets were published in the original study. Non-AMP sequences were selected through keyword filtering and sampled to match the number and length distribution of the AMP sequences. A detailed description of the non-AMP sampling procedure can be found in the Methods section of the original AMPlify manuscript.
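The length-matched sampling described above can be sketched as follows. This is a simplified illustration of the general technique, not the authors' exact procedure; the function and variable names are hypothetical:

```python
import random
from collections import defaultdict

def sample_length_matched(amp_lengths, candidates, seed=0):
    """Sample one non-AMP per AMP so both sets share a length distribution.

    amp_lengths: list of AMP sequence lengths
    candidates:  list of candidate non-AMP sequences (strings)
    """
    rng = random.Random(seed)
    # Bin the candidate non-AMPs by sequence length
    by_len = defaultdict(list)
    for seq in candidates:
        by_len[len(seq)].append(seq)
    # For each AMP length, draw one candidate of the same length
    sampled = []
    for length in amp_lengths:
        pool = by_len.get(length)
        if pool:  # skip lengths with no remaining candidate
            sampled.append(pool.pop(rng.randrange(len(pool))))
    return sampled
```

In practice one would also need a fallback for AMP lengths with no exact-length candidate (e.g. nearest-length matching); the sketch simply skips them.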
Data set 1 (3338 AMPs) and Data set 5 (102,756 non-AMPs) in Table 1 form the imbalanced training set, while Data set 3 (835 AMPs) and Data set 6 (25,689 non-AMPs) form the imbalanced test set. Unlike for the balanced training and test sets, we took all 128,445 available non-AMP sequences selected through keyword filtering, as described in the original manuscript, to form the imbalanced training and test sets, without downsampling to the size of the positive sets. The ratio of non-AMPs to AMPs was kept the same (~30.8) across the imbalanced training and test sets. It is worth noting that there is no overlap between any of the training and test set pairs.
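A quick arithmetic check confirms that the set sizes quoted above are internally consistent:

```python
# The 128,445 keyword-filtered non-AMPs split into the training and test sets
assert 102756 + 25689 == 128445

# The non-AMP/AMP ratio is preserved between the two imbalanced splits
train_ratio = 102756 / 3338   # imbalanced training set
test_ratio = 25689 / 835      # imbalanced test set
print(round(train_ratio, 1), round(test_ratio, 1))  # both ~30.8
```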
All the aforementioned data sets were uploaded to Zenodo (https://zenodo.org) for better accessibility, and are included in the AMPlify GitHub repository (https://github.com/bcgsc/AMPlify) for convenience.
In this study, we acquired two sets of models, each containing five sub-models that form a final ensemble model. The sub-models were trained on different training subsets, and their output probabilities were averaged to form the final ensemble prediction. Please refer to the Methods section of the original AMPlify manuscript for details of the ensemble method. All models were built and saved with Keras 2.2.4 using TensorFlow 1.12.0 as the backend in Python 3.6.7. Note that we saved only the weights of the models to minimize the size of each file. An example of how to load the models can be found in the Python script AMPlify.py in our GitHub repository (https://github.com/bcgsc/AMPlify). Our models were saved in HDF5 (.h5) format, and each model set was packaged into a zip file (Table 1, Data file 1 and Data file 2).
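Because only the weights were saved, each sub-model must first be rebuilt with its original architecture and populated via Keras's `model.load_weights(path)`, as demonstrated in AMPlify.py. The ensemble step itself is simple probability averaging, which can be sketched with NumPy (the probabilities below are made up for illustration, not real sub-model outputs):

```python
import numpy as np

def ensemble_predict(sub_model_probs):
    """Average the per-peptide AMP probabilities from the sub-models.

    sub_model_probs: array-like of shape (n_models, n_peptides), each row
    holding one sub-model's predicted AMP probabilities.
    """
    return np.mean(np.asarray(sub_model_probs), axis=0)

# Hypothetical outputs of five sub-models for three peptides
probs = [
    [0.90, 0.20, 0.60],
    [0.80, 0.10, 0.70],
    [0.95, 0.30, 0.50],
    [0.85, 0.25, 0.65],
    [0.90, 0.15, 0.55],
]
print(ensemble_predict(probs))  # one averaged probability per peptide
```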
The first set of models (Data file 1) was trained on the balanced training set with all settings as described in the AMPlify manuscript. This set of five sub-models forms the original published ensemble model of AMPlify (the balanced model, i.e. AMPlify version 1.0); the discovery of the four novel AMPs presented in the AMPlify manuscript used this specific model. It is suitable for predicting putative AMPs from a relatively curated candidate set.
The second set of models (Data file 2) was trained on the imbalanced training set. Unlike the previous set of models, which was trained with early stopping (i.e. training stops if there is no further improvement within the next 50 epochs), this set of models was trained for a fixed number of epochs, given the long training time on such a large training set. The best tuned number of epochs was 50. Furthermore, we tuned the class weights applied to the loss function during training (AMP: 0.8333, non-AMP: 0.1667) to ensure that the model would not be biased too strongly toward the majority class; for the balanced model, the two classes were weighted equally (i.e. a weight ratio of 1:1). Both the optimal epoch number and the class weights were tuned based on accuracy scores from five-fold cross-validation. This set of five sub-models forms a new ensemble model of AMPlify (the imbalanced model), intended for situations where non-AMPs in the input sequence set far outnumber AMPs.
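Class weighting of this kind scales each sample's contribution to the loss by the weight of its class, so errors on the rare AMP class cost roughly five times as much as errors on non-AMPs. A minimal sketch of class-weighted binary cross-entropy, illustrating the general technique rather than AMPlify's exact training code:

```python
import math

def weighted_bce(y_true, y_prob, w_pos=0.8333, w_neg=0.1667):
    """Mean binary cross-entropy with each term scaled by its class weight.

    w_pos weights the AMP (positive) class, w_neg the non-AMP class;
    the 0.8333:0.1667 values are those quoted above (a ~5:1 ratio).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

In Keras, the equivalent effect can be obtained by passing `class_weight={0: 0.1667, 1: 0.8333}` to `model.fit`.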
All models were uploaded to the repositories along with our data sets. These two sets of models together form version 1.1 of AMPlify. We also provide a PDF file (Data file 3) that includes a detailed analysis of the performance of our models under different input conditions, so that users can better understand which model to use.
There are limitations to both our models and data sets. Since the number of positive training samples (known AMPs) is limited compared with that of other typical deep learning applications, such as natural language processing or computer vision tasks, the performance of our AMP prediction models has some room for improvement. On the other hand, the negative samples (non-AMPs) we collected were selected mainly by keyword filtering of the UniProtKB/Swiss-Prot database, since public databases report little information on peptide sequences without antimicrobial activities. Although we filtered with as many keywords as possible, this strategy might still retain some sequences that do have antimicrobial activities but have never been tested or reported. We expect the noise level in the non-AMP sets to decrease as more peptides are characterized in future studies, refining the quality of annotations.
Availability of data and materials
The data described in this data note can be freely and openly accessed on https://zenodo.org under https://doi.org/10.5281/zenodo.7320306. Please see Table 1 and the corresponding Zenodo record for details and links to the data.
Abbreviations
APD: Antimicrobial Peptide Database
Bi-LSTM: Bidirectional long short-term memory
DADP: Database of Anuran Defense Peptides
References
1. Reardon S. Antibiotic resistance sweeping developing world. Nature. 2014;509:141–2.
2. Zhang L, Gallo RL. Antimicrobial peptides. Curr Biol. 2016;26:R14–9.
3. Li C, Sutherland D, Hammond SA, Yang C, Taho F, Bergman L, et al. AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genomics. 2022;23:77.
4. Hammond SA, Warren RL, Vandervalk BP, Kucuk E, Khan H, Gibb EA, et al. The North American bullfrog draft genome provides insight into hormonal regulation of long noncoding RNA. Nat Commun. 2017;8:1433.
5. Li C, Warren RL, Birol I. Model files and data sets of AMPlify: a deep learning tool for antimicrobial peptide prediction. 2022. Zenodo. https://doi.org/10.5281/zenodo.7320306.
6. Wang G, Li X, Wang Z. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 2016;44:D1087–93.
7. Novković M, Simunić J, Bojović V, Tossi A, Juretić D. DADP: the database of anuran defense peptides. Bioinformatics. 2012;28:1406–7.
8. The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–15.
9. Chollet F. Keras. https://keras.io. 2015. Accessed 17 Apr 2019.
10. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: large-scale machine learning on heterogeneous systems. 2015. https://www.tensorflow.org. Accessed 17 Apr 2019.
This work was supported by Genome BC and Genome Canada [281ANV; 291PEP]; and the National Institutes of Health [2R01HG007182-04A1]. The content of this paper is solely the responsibility of the authors, and does not necessarily represent the official views of our funding organizations. Additional support was provided by the Canadian Agricultural Partnership, a federal-provincial-territorial initiative, under the Canada-BC Agri-Innovation Program. The program is delivered by the Investment Agriculture Foundation of BC. Opinions expressed in this document are those of the authors and not necessarily those of the Governments of Canada and British Columbia or the Investment Agriculture Foundation of BC. The Governments of Canada and British Columbia, and the Investment Agriculture Foundation of BC, and their directors, agents, employees, or contractors will not be liable for any claims, damages, or losses of any kind whatsoever arising out of the use of, or reliance upon, this information.
Competing interests
IB is a co-founder of and executive at Amphoraxe Life Sciences Inc.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Cite this article
Li, C., Warren, R.L. & Birol, I. Models and data of AMPlify: a deep learning tool for antimicrobial peptide prediction. BMC Res Notes 16, 11 (2023). https://doi.org/10.1186/s13104-023-06279-1
Keywords
- Antimicrobial peptide
- Deep learning
- Imbalanced classification