N. Abu Zaid, Q. Yang, S. Changlani, D.S. Pendyala, B.P. Allen, A.V. Gulyuk, R. Chirkova, Y.G. Yingling
North Carolina State University,
United States
Keywords: knowledge graphs, Neo4j, large language models, multimodal information extraction, vector retrieval, natural-language-to-Cypher, explainable AI, materials discovery, phosphate removal, circular phosphorus economy
Summary:
Phosphorus management lies at the heart of global sustainability, influencing food production, water quality, and ecosystem resilience. Yet research on phosphorus remains deeply fragmented, spread across thousands of papers, datasets, and specialized domains ranging from environmental chemistry to materials science. This fragmentation prevents the synthesis of knowledge into actionable insights, slowing innovation in recovery technologies, nutrient recycling, and circular-economy policy. Overcoming it is essential for predictive, data-driven phosphorus management strategies that can reduce eutrophication, recover critical resources, and improve agricultural resilience.
To address this challenge, we developed PASTOR (Phosphorus AI Scraping, Tracking, Optimization, and Research), an end-to-end AI platform that unifies phosphorus research through automated knowledge extraction, multimodal data integration, and interactive reasoning. The goal is an interoperable ecosystem in which scientific literature, experimental datasets, and computational outputs are connected in a single, queryable infrastructure. Such integration is critical for transforming unstructured information into evidence-based knowledge that supports decision-making across environmental, agricultural, and materials systems. PASTOR combines small, domain-adapted language models with a scalable data architecture to support reasoning within specialized scientific contexts. Built on the Django and React frameworks, it integrates a Neo4j knowledge graph for structured semantic relationships and ChromaDB vector storage for high-performance retrieval. Using models including GPT-4, LLaMA-3, and nomic-embed-text, PASTOR processed over 3,300 peer-reviewed publications via LangChain workflows for entity extraction, relationship mapping, and classification.
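As a rough illustration of what vector retrieval involves, the sketch below ranks stored passage embeddings by cosine similarity to a query embedding. It is a minimal, dependency-free example and not PASTOR's implementation or the ChromaDB API; the function names (`cosine_similarity`, `rank_passages`), the passage identifiers, and the three-dimensional toy vectors are all assumptions made for illustration. Real embeddings, such as those from nomic-embed-text, have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passages, top_k=2):
    """Return the top_k (passage_id, score) pairs, most similar first."""
    scored = [
        (pid, cosine_similarity(query_vec, vec))
        for pid, vec in passages.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings standing in for stored passage vectors.
toy_passages = {
    "paper_A": [0.9, 0.1, 0.0],
    "paper_B": [0.1, 0.8, 0.3],
    "paper_C": [0.7, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

for pid, score in rank_passages(toy_query, toy_passages):
    print(pid, round(score, 3))
```

In this toy setup, `paper_A` and `paper_C` point in nearly the same direction as the query and so rank ahead of `paper_B`. A vector store performs the same kind of comparison at scale, typically with approximate nearest-neighbor indexing.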
Multimodal understanding through LLaVA extends the system’s reach beyond text to parse quantitative data from figures and tables. Researchers can then ask complex domain-specific questions, such as which biochar–zeolite composites achieve high phosphate removal at neutral pH, and receive ranked, citation-backed results. The platform’s INTEGRATE-KG workflow ensures semantic consistency across heterogeneous data sources by standardizing vocabulary, resolving synonyms, and aligning metadata within a shared ontology. With this workflow, PASTOR achieved 91% precision, 87% recall, and an F1-score of 0.89 in domain classification, as verified through expert validation. The GraphCypherQAChain component translates natural-language queries into Cypher commands, linking graph-based reasoning with visual and statistical analytics. Integrated regression and correlation tools built on Plotly let users test hypotheses and visualize data directly within the system, bridging literature evidence with empirical analysis.
PASTOR is closely aligned with the STEPS Center mission to advance phosphorus sustainability through interdisciplinary research. It connects curated datasets from EPA, USGS, Materials Project, and PubChem, promoting open, reproducible, and transparent research. By combining small, domain-specific LLMs with structured data fusion, PASTOR provides a scalable framework for knowledge-driven discovery that directly supports circular phosphorus economy goals. The platform demonstrates how specialized AI systems can transform scattered research into a cohesive, data-linked knowledge network, accelerating progress toward sustainable phosphorus management and global environmental resilience.
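The reported classification metrics can be cross-checked with the standard definition of the F1-score as the harmonic mean of precision and recall. The short sketch below is only an arithmetic check on the figures quoted in the summary (91% precision, 87% recall); the function name `f1_score` is an illustrative choice, and nothing here reflects how PASTOR's evaluation was actually run.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # conventionally defined as 0 when both are 0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the summary, expressed as fractions.
reported_f1 = f1_score(0.91, 0.87)
print(round(reported_f1, 2))  # prints 0.89
```

The computed value, about 0.8896, rounds to the stated F1-score of 0.89, so the three reported figures are mutually consistent.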