Machine learning pipeline of novel peptide and protein generation with refined selection for production in vivo

S.N. Dean, J.A.E. Alvarez, P.M. Legler, A.P. Malanoski
US Naval Research Laboratory,
United States

Keywords: machine learning, protein design, solubility


Certain antimicrobial peptides (AMPs) and other peptides and proteins are essential tools for combating infections. The rise of antibiotic resistance and the increasing rate of emergence of new pathogens have provided the impetus for the rapid design and development of new proteinaceous medical countermeasures. Machine and deep learning (ML/DL) have proven valuable for generating novel, functional protein sequences. However, filtering the vast sequence databases used to train generative deep learning models, and identifying the most promising candidates among the new sequences these models produce, remain challenging. In addition, promising candidates that pass initial selection criteria often have issues with expression and solubility when moved to larger-scale production. To overcome these problems, we report a new modular pipeline for generating novel peptides and proteins. The pipeline allows selection of sequences with desired characteristics, and with higher confidence in their activities, using tree-preserving embedding (TPE)-based sampling applied to recurrent neural networks (RNNs), Seq2Seq models, and variational autoencoders (VAEs). We demonstrate that this method can be applied to the production of new single-domain antibodies (sdAbs), AMPs, and fluorescent protein reporters, with control over the characteristics of the sequences produced. Finally, we report the use of a new, accurate protein solubility predictor that provides local explanations via explainable artificial intelligence (XAI), so that the user is informed about why a generated sequence is or is not predicted to be soluble, and that suggests changes to improve the predicted solubility. Through our approach, a variety of peptide and protein families can be produced and, because the pipeline is modular, different generative models can be used for sequence generation.
Taken together, the components of this pipeline provide a greater degree of control and confidence that the protein medical countermeasures produced will be both functional and producible in sufficient quantities, which avoids wasting time and resources in the wet lab and greatly improves time-to-production.
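
The modular structure described above (interchangeable generator, selection step, and solubility scoring with a per-residue explanation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the names `Pipeline`, `make_random_generator`, `is_cationic`, `gravy`, and `explain` are hypothetical; the random sampler merely stands in for a trained RNN, Seq2Seq, or VAE model; the net-charge filter stands in for TPE-based selection; and the mean Kyte-Doolittle hydropathy (GRAVY) is only a crude proxy for the solubility predictor and its XAI explanations.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Kyte-Doolittle hydropathy scale (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

Generator = Callable[[int], List[str]]   # n -> n candidate sequences
Filter = Callable[[str], bool]           # sequence -> keep or discard
Scorer = Callable[[str], float]          # sequence -> score (lower ranks first)


def make_random_generator(length: int, seed: int = 0) -> Generator:
    """Placeholder generator (assumption): uniform random residues.
    A trained generative model would be plugged in here instead."""
    rng = random.Random(seed)
    alphabet = sorted(KD)

    def generate(n: int) -> List[str]:
        return ["".join(rng.choice(alphabet) for _ in range(length))
                for _ in range(n)]

    return generate


def is_cationic(seq: str) -> bool:
    """Hypothetical selection criterion: net charge of at least +2."""
    positive = sum(seq.count(c) for c in "KR")
    negative = sum(seq.count(c) for c in "DE")
    return positive - negative >= 2


def gravy(seq: str) -> float:
    """Mean hydropathy; used here only as a crude solubility proxy
    (lower = more hydrophilic). Not the predictor from the abstract."""
    return sum(KD[c] for c in seq) / len(seq)


def explain(seq: str, top: int = 3) -> List[Tuple[int, str, float]]:
    """Toy local explanation: the `top` most hydrophobic residues as
    (position, residue, hydropathy), i.e. candidates for substitution."""
    scored = [(i, c, KD[c]) for i, c in enumerate(seq)]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:top]


@dataclass
class Pipeline:
    generate: Generator
    keep: Filter
    score: Scorer

    def run(self, n: int, top_k: int) -> List[str]:
        """Generate n candidates, filter them, and return the top_k
        sequences ranked by ascending score."""
        candidates = [s for s in self.generate(n) if self.keep(s)]
        return sorted(candidates, key=self.score)[:top_k]


if __name__ == "__main__":
    pipe = Pipeline(make_random_generator(20, seed=1), is_cationic, gravy)
    for seq in pipe.run(n=200, top_k=3):
        print(seq, round(gravy(seq), 2), explain(seq, top=2))
```

Because each stage is just a callable with a fixed signature, swapping the generator, the filter, or the scorer does not require changes to the other stages, which is the property the abstract attributes to the modular design.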