Predicting Antimalarial Activity of Compounds using SMILES strings and Machine Learning: A Study on the Relationship between Chemical Structure and Molecular Properties in the fight against Malaria

C. Ekenna
University at Albany,
United States

Keywords: Malaria, machine learning, molecular properties


Recent malaria research has identified novel drug targets, produced new medications and drug combinations, found genetic markers that make some people more susceptible to malaria, and used genetic engineering to build malaria-resistant mosquitoes. Researchers are also developing malaria vaccines and using CRISPR to modify the malaria parasite's DNA. To find new antimalarial drugs, researchers have used SMILES strings to search for molecules that are structurally similar to known antimalarial compounds, and have used virtual screening to scan large chemical databases for compounds that are likely to bind to malaria parasite protein targets. Novel drug candidates that bind to these targets may lead to more effective and safer malaria treatments. Computational approaches can predict a molecule's activity, solubility, and toxicity from its chemical structure, and several machine learning methods make these predictions directly from the SMILES string. For instance, a neural network can be trained on a dataset of SMILES strings paired with experimental measurements of a molecular property (e.g., activity against a target) to learn the relationship between the strings and the property, and can then predict that property for new molecules. SMILES-based molecular property prediction is an active research area in drug development and materials design. Such models can identify new compounds with desirable properties, prioritize compounds for synthesis and experimental testing, and help analyze the relationship between chemical structure and biological activity.
In this work, we implement a two-stage (pre-training and fine-tuning) deep learning model that uses large amounts of both labeled and unlabeled data for molecular property prediction. The model is pre-trained on large-scale unlabeled data with the unsupervised learning objective used in masked language models (MLMs). MLMs are deep learning models trained to predict missing or masked words in a sentence or phrase from the context provided by the surrounding words. Several studies have used MLMs in malaria research. For example, a study published in 2020 used a masked language model (RoBERTa) to analyze the scientific literature on malaria and drug discovery, and found that the model could extract relevant information, such as drug-target interactions and molecular mechanisms of action, that could be useful for drug discovery efforts. In this work, SMILES strings are canonicalized and shuffled to support large-scale pre-training on a subset of the dataset from ChemBERTa. During the unsupervised pre-training stage, the SMILES strings are tokenized and used as inputs to our model. We evaluate our models on several regression and classification tasks from MoleculeNet (i.e., BACE, BBBP, and Tox21). These datasets cover a range of sample sizes (1K to 8K examples) and different medicinal chemistry applications (brain penetrability, toxicity, solubility, and on-target inhibition).
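As an illustration of the pre-training input described above, the following is a minimal sketch of how a SMILES string might be tokenized and masked for a masked-language-model objective. It is not the implementation used in this work: the regular expression, the function names `tokenize` and `mask_tokens`, the `[MASK]` symbol, and the 15% masking rate are all assumptions chosen for illustration, and every selected token is replaced by the mask symbol, which is a simplification of common MLM recipes. Canonicalization and shuffling of SMILES require a cheminformatics toolkit and are not shown.

```python
import random
import re

# Assumed tokenization pattern: bracket atoms, two-letter halogens,
# two-digit ring closures, single-letter atoms, digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\\/\(\)\.:~@\?\*\$>])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is dropped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES string: {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Replace each token with mask_token with probability mask_prob.

    Returns (masked, labels). labels[i] holds the original token at masked
    positions and None elsewhere, so a loss would be computed only where
    the model has to recover a hidden token.
    The 15% default is an assumed value, not one reported in this work.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            masked[i] = mask_token
    return masked, labels


if __name__ == "__main__":
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    masked, labels = mask_tokens(tokens, rng=random.Random(0))
    print(tokens)
    print(masked)
```

In this sketch a bracket atom such as `[C@H]` is kept as a single token, so stereochemistry information is masked and predicted as one unit rather than character by character.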