In a breakthrough that could reshape agriculture and plant science, researchers have found that large language models (LLMs)—the artificial intelligence systems that power tools like ChatGPT—can accurately predict plant gene functions when trained on genomic data. The study, published in the journal Tropical Plants, reveals that these AI models can unlock complex genetic codes, paving the way for innovations in crop improvement, biodiversity preservation, and global food security.
The study was conducted by researchers Meiling Zou, Haiwei Chai, and Zhiqiang Xia from Hainan University. It marks one of the first successful applications of natural language processing (NLP) models in the field of plant genomics—a domain traditionally dominated by slower and more narrowly focused machine learning tools.
At the heart of the research lies an innovative idea: DNA sequences can be treated like language. Just as words are strung together to form meaningful sentences, the sequences of nucleotides in DNA carry functional messages that govern how plants grow, respond to stress, or resist disease. By leveraging the structural parallels between DNA and human language, LLMs are capable of interpreting the genetic “syntax” and predicting key elements such as gene functions and regulatory patterns.
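To make the "DNA as language" analogy concrete, here is a minimal sketch of one common way a nucleotide sequence can be turned into word-like tokens: splitting it into overlapping k-mers, the scheme used by the original DNABERT. The function names, the choice of k = 3, and the toy sequence are illustrative assumptions, not details taken from the study.

```python
from collections import Counter


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA string into k-mers, the 'words' a language model reads.

    With stride=1 the k-mers overlap; with stride=k they do not.
    (Hypothetical helper for illustration only.)
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign an integer id to each distinct k-mer.

    Ids are ordered by frequency, then alphabetically, so the mapping
    is deterministic for a given token list.
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


if __name__ == "__main__":
    toy_sequence = "ATGCGTAC"  # made-up sequence, not real genomic data
    tokens = kmer_tokenize(toy_sequence, k=3)
    vocab = build_vocab(tokens)
    print(tokens)
    print([vocab[t] for t in tokens])
```

Each k-mer plays the role of a word, and the integer ids are what a model would actually consume. Real genomic models may use other tokenizers, such as byte-pair encoding or single nucleotides, so this should be read as one possible scheme rather than the approach of any particular model named in the article.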
To test this hypothesis, the team trained several types of LLM architectures, including DNABERT (encoder-only), DNAGPT (decoder-only), and ENBED (encoder-decoder), on vast plant genome datasets. They then fine-tuned the models on smaller sets of annotated genetic information to improve accuracy. The models performed well at predicting gene expression, identifying promoter and enhancer elements, and inferring tissue-specific activity. Plant-specific LLMs such as AgroNT and FloraBERT outperformed more generalized models, highlighting the advantages of tailoring AI tools to particular biological domains.
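The difference between encoder-only and decoder-only models comes down largely to how training examples are framed. The sketch below shows the two framings on a list of tokens: masking some tokens and asking the model to recover them (the usual objective for encoder-style models), and predicting each token from the ones before it (the usual objective for decoder-style models). The function names, the 15% masking rate, and the `[MASK]` symbol are assumptions for illustration; the article does not describe the training details at this level.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Build a masked-language-modeling example (encoder-style).

    Returns the token list with some positions replaced by MASK, plus
    a dict mapping each masked position to the token the model should
    recover. At least one token is masked for any non-empty input.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}
    return masked, labels


def next_token_pairs(tokens):
    """Build next-token-prediction examples (decoder-style).

    Each pair is (context so far, token to predict next).
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    toy_tokens = ["ATG", "TGC", "GCG", "CGT", "GTA", "TAC"]  # made-up k-mers
    masked, labels = mask_tokens(toy_tokens, mask_rate=0.5, seed=1)
    print(masked, labels)
    for context, target in next_token_pairs(toy_tokens[:3]):
        print(context, "->", target)
```

In both cases the training signal comes from the sequence itself, which is why pre-training can use large unannotated genomes, while the later fine-tuning step relies on the smaller annotated datasets.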
Traditionally, one of the major challenges in plant genomics has been the volume and complexity of genetic data. Many plant species have large, repetitive, and poorly annotated genomes. This makes it difficult for conventional tools to generate accurate predictions or useful insights. The use of LLMs addresses these challenges by identifying subtle patterns and relationships that would otherwise remain hidden.
Importantly, the study also sheds light on a significant gap in current genomic AI models: most are based on animal or microbial data, which often lack the diversity and richness of plant genetic sequences. Despite this, the LLMs demonstrated adaptability and robustness when applied to diverse plant species.
The implications are far-reaching. With climate change, population growth, and declining arable land posing major threats to food production, the ability to rapidly interpret plant genomes could help scientists develop crops that are more resilient, nutritious, and sustainable.
In merging the fields of artificial intelligence and plant science, this study opens a new frontier—one where machines learn not just human language, but the language of life itself.