Leveraging Physics-Based Simulations and Machine Learning to Identify Promising Formulations for Materials Science Applications

A.K. Chew, M.A.F. Afzal, A. Chandrasekaran, M.D. Halls
United States

Keywords: machine learning, formulations, quantitative structure-property relationships, classical molecular dynamics simulations


Formulations, or mixtures of chemical ingredients, are ubiquitous across materials science applications, such as copolymer blends, consumer packaged goods, and energy storage devices. These mixtures consist of multiple chemical species with known compositions, but their bulk properties are challenging to predict because they emerge from non-obvious intermolecular interactions between species that depend heavily on both molecular structure and composition. Trial-and-error experimentation to optimize these formulations is cost-prohibitive because the chemical design space is large and is further expanded by the tunability of the compositions. Computational approaches that can traverse this expansive design space offer a promising alternative for finding better formulations. Physics-based approaches, such as classical molecular dynamics (MD) simulations, can accurately predict formulation properties by accounting for the interactions between all molecules in the mixture. However, rapid screening with MD remains challenging because of its computational cost, which motivates the use of data-driven approaches to screen the formulation design space more efficiently. Given the lack of publicly available datasets, we generate a large formulation dataset from physics-based simulations, consisting of more than 30,000 solvent mixtures that were selected based on experimental solubility tables. We then benchmark descriptor-based and graph-based molecular representations, as well as a variety of machine learning architectures, to identify accurate formulation-property relationships that predict formulation properties from individual molecular structures and compositions. Given the large design space of chemistries and compositions, we leverage an active learning framework that starts from a small dataset (~100 examples) and iteratively suggests the next best compounds or compositions to test.
Curating a formulation dataset from physics-based simulations and developing accurate formulation-property relationships enable us to rapidly identify promising formulations for a wide range of materials applications, such as liquid electrolytes for batteries, copolymers for surface coatings, solvent additives for perfumes or paints, and more.
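
As an illustration of the active learning idea described in this abstract, the following is a minimal sketch and not the implementation used in this work. It assumes a composition-weighted average of per-component descriptors as the mixture representation, a bootstrap ensemble of ridge regressors as the surrogate model, and ensemble disagreement as the acquisition criterion; the abstract does not specify any of these choices. All function names are hypothetical, and the `oracle` callable stands in for an MD simulation or an experiment.

```python
import numpy as np

def mixture_features(descriptors, fractions):
    """Composition-weighted average of per-component descriptor vectors (assumed representation)."""
    fractions = np.asarray(fractions, dtype=float)
    fractions = fractions / fractions.sum()  # normalize to mole/mass fractions
    return fractions @ np.asarray(descriptors, dtype=float)

def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression with a bias column appended."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def ensemble_uncertainty(X_train, y_train, X_pool, n_models=20, rng=None):
    """Standard deviation of predictions across bootstrap-resampled models."""
    rng = np.random.default_rng(rng)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(predict_ridge(w, X_pool))
    return np.std(preds, axis=0)

def active_learning(X, oracle, n_init=10, n_rounds=5, batch=5, seed=0):
    """Start from a small random set, then repeatedly label the most uncertain candidates."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    y = {i: oracle(i) for i in labeled}
    for _ in range(n_rounds):
        pool = np.array([i for i in range(len(X)) if i not in y])
        if len(pool) == 0:
            break
        y_train = np.array([y[i] for i in labeled])
        sigma = ensemble_uncertainty(X[labeled], y_train, X[pool], rng=rng)
        # Uncertainty sampling: pick the candidates the ensemble disagrees on most.
        picks = pool[np.argsort(sigma)[::-1][:batch]]
        for i in picks:
            y[int(i)] = oracle(int(i))
            labeled.append(int(i))
    return labeled, y

# Toy usage: synthetic mixture features and a synthetic "simulation" oracle.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(50, 3))
true_w = np.array([1.0, 2.0, 3.0])
labeled_demo, y_demo = active_learning(X_demo, lambda i: float(X_demo[i] @ true_w))
```

In practice the surrogate, the representation, and the acquisition function would each be replaced by whichever variant performs best in benchmarking; uncertainty sampling is used here only because it is simple to state.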