T. Trapier
North Carolina State University,
United States
Keywords: FAIR data, materials informatics, organic semiconductors, conjugated polymers, open-source infrastructure, data interoperability, experimental reproducibility, synthetic data generation, AI for materials discovery, open materials databases
Summary:
Data heterogeneity and fragmentation have long been substantial bottlenecks contributing to irreproducibility and unreliability in materials science, particularly in soft and organic materials research. Despite significant progress in publishing datasets through repositories such as GitHub and the major materials databases, most remain non-interoperable, with inconsistent metadata quality. The absence of standardized, machine-readable experimental data formats limits AI model validation, generalization, and integration into computational workflows. While databases such as NOMAD, the Materials Project, and the Open Materials Database have accelerated advances in materials informatics, they focus primarily on inorganic and computational data. As a result, experimental data, especially for polymeric materials and organic semiconductors, remain scarce, poorly standardized, and fragmented across publications and local repositories.

SemiOrg introduces a public, open-source, FAIR-aligned data infrastructure designed for polymer and organic semiconductor datasets. The platform integrates a data portal, an AI-compatible schema, a user informatics dashboard, and a synthetic data laboratory that emulates experimental workflows. SemiOrg’s architecture enables synthetic multimodal data, such as grazing-incidence wide-angle X-ray scattering (GIWAXS), UV-Vis absorption, and differential scanning calorimetry (DSC) traces, to be generated, ingested into a data lake, and processed within a structured data lakehouse. Both raw and processed data are publicly accessible through the SemiOrg portal, where users can visualize datasets, compare their own data against existing records, and explore cross-domain trends. An Open-Source Playground module is being developed to connect users with the broader ecosystem of materials and chemistry tools from organizations such as IBM, NVIDIA, and NIST.
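To make the idea of an AI-compatible, FAIR-aligned record concrete, the sketch below shows what a minimal machine-readable metadata record for one characterization measurement might look like: a persistent identifier paired with modality, protocol, and a pointer into the data lake. All field names here (`record_id`, `modality`, `data_uri`, and so on) are illustrative assumptions, not SemiOrg’s actual schema.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class CharacterizationRecord:
    """Hypothetical FAIR-style metadata record for one measurement."""
    modality: str    # e.g. "UV-Vis", "GIWAXS", "DSC"
    material: str    # polymer or small-molecule identifier
    protocol: dict   # instrument settings and calibration metadata
    data_uri: str    # location of the raw trace in the data lake
    # Persistent identifier, minted once per record (findability).
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        # Machine-readable serialization for ingestion and semantic indexing.
        return json.dumps(asdict(self), indent=2)

record = CharacterizationRecord(
    modality="UV-Vis",
    material="P3HT",
    protocol={"solvent": "chloroform", "path_length_cm": 1.0},
    data_uri="s3://example-bucket/traces/p3ht_uvvis_001.json",
)
print(record.to_json())
```

Because every record carries its protocol and calibration metadata alongside a stable identifier, downstream tools can validate, index, and compare measurements without consulting the originating lab.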
This environment allows researchers to apply external AI models, informatics packages, and simulation frameworks directly to FAIR-formatted datasets, supporting transparent and reproducible experimentation across the community.

The SemiOrg prototype is engineered for rapid data ingestion and cross-platform interoperability, linking web, database, and cloud components into a unified system. The front-end portal and dashboard are implemented in Next.js with TypeScript and Tailwind CSS, while the back end uses FastAPI and PostgreSQL, structured for scalable deployment in cloud or local research environments. The synthetic laboratory module incorporates literature ingestion, rulebook generation, geometry and trace JSON files, and a domain-aligned embedding space. These components generate statistically consistent datasets that emulate polymer characterization outputs, enabling end-to-end testing of the ingestion and visualization pipelines before experimental data are integrated.

SemiOrg enforces FAIR principles through persistent identifiers, standardized APIs, vectorized metadata for semantic search and benchmarking, and reproducible model training supported by data and model cards. Full experimental context, protocols, and calibration metadata are accessible alongside raw and processed data.

SemiOrg establishes a reproducible framework for bridging experimental polymer science and artificial intelligence. By creating interoperable, machine-readable datasets, it enables cross-study benchmarking, transparent model evaluation, and improved interpretability of structure–property relationships in soft materials. The platform serves as a testbed for autonomous experimentation, materials informatics, and FAIR data management within the organic semiconductor community.
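As a rough illustration of the synthetic laboratory’s trace-JSON output, the sketch below generates a statistically plausible UV-Vis absorption trace: a Gaussian absorption band plus seeded instrument noise, bundled with minimal provenance metadata. The peak position, bandwidth, noise level, and JSON keys are arbitrary choices made for this example, not values from SemiOrg’s rulebooks.

```python
import json
import math
import random

def synthetic_uv_vis(peak_nm: float = 550.0, width_nm: float = 60.0,
                     noise: float = 0.01, seed: int = 0) -> dict:
    """Emit a hypothetical synthetic UV-Vis trace as a JSON-ready dict."""
    rng = random.Random(seed)  # seeded so the trace is reproducible
    wavelengths = list(range(300, 801, 5))  # sampling grid in nm
    # Gaussian absorption band centered at peak_nm, plus Gaussian noise.
    absorbance = [
        math.exp(-((w - peak_nm) ** 2) / (2 * width_nm ** 2)) + rng.gauss(0, noise)
        for w in wavelengths
    ]
    return {
        "modality": "UV-Vis",
        "synthetic": True,          # flags emulated vs. experimental data
        "wavelength_nm": wavelengths,
        "absorbance": absorbance,
    }

trace = synthetic_uv_vis()
print(json.dumps(trace)[:80], "...")
```

Synthetic traces like this one let the ingestion and visualization pipelines be exercised end to end, with the `"synthetic": True` flag keeping emulated data cleanly separated from experimental records.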
Beyond its technical impact, SemiOrg demonstrates a scalable and ethical model for open, AI-driven materials research, laying the foundation for a community-maintained ecosystem where synthetic, experimental, and computational data coexist under shared standards to accelerate discovery while maintaining transparency and reproducibility.