Seed World

AI Will Unravel Secrets of Non-Coding Genes

From smart chatbots to apps that can write entire articles, artificial intelligence (AI) is becoming an increasingly ubiquitous part of our lives. Michael Schon, a research associate at Wageningen University & Research, is designing an AI tool that can perform comparisons of non-coding RNA on plant genomes, according to a press release. The tool is expected to accelerate and simplify the future development of new plant varieties with greater resistance to drought or diseases, for example. Schon has received a Veni grant to support his research.

Proteins are the building blocks for cells in organisms. The release notes that the instructions for making these proteins are issued (coded) by RNA from genes. Alongside these coding RNAs, some genes can produce non-coding RNAs: in other words, RNA that doesn’t include instructions to make a protein. 

Schon said this type of RNA also plays an important role in the development of organisms.

“For example, they can activate genes, or do the opposite and switch them off. This will affect the appearance of a plant and the properties it has. Certain important non-coding RNAs also determine whether a plant reaches maturity at all.”

Relatives Within the Same Family

Non-coding RNA may also help explain why a plant species belongs to a particular family but exhibits different characteristics. In previous research, Schon identified non-coding RNAs in Arabidopsis thaliana (thale cress), a model organism widely used by plant scientists.

“Arabidopsis belongs to the Brassicaceae family, along with important crops like broccoli, cauliflower and kohlrabi,” Schon said. “This family is also known as the mustard or crucifer family. However, it’s difficult to compare non-coding RNAs of Arabidopsis with those of other plants in the mustard family because previous work in these species has focused mainly on protein-coding genes.”

Limited Annotation of Non-coding RNA

Because of this gap, non-coding RNA has to be annotated separately in each crop before plants can be compared. Through his Veni project, Schon is looking for new methods to identify non-coding RNAs by leveraging knowledge from related species.

“More than 200 genome sequences are available for plants within the mustard family. Each genome is stored as a large text file consisting of millions of letters that represent the bases of a DNA molecule (A, C, T and G). Because the non-coding bits aren’t catalogued (annotated) properly in these genomes, it’s impossible to compare all the non-coding genes scattered inside this mountain of data. We need new strategies and tools for that. I’m trying to develop those.”

A Small Part of Each Genome

The release states that the first problem is knowing where in the genome to look. One of the tools Schon is developing is something he calls GeneSketch. To find the corresponding parts of different genomes, he’s using a method called Minimizer Sketch. 

“The idea behind the Minimizer Sketch is that you only need to look at a small piece of DNA – a sketch – rather than the entire sequence,” says Schon. “That means you only have to pay attention to a few thousand characters per genome to perform a comparison, rather than millions. The Minimizer Sketch was previously used to build a tree of primate evolution, which includes humans and their closest relatives. It turned out that a very accurate family tree of our ancestors can be made from sketches made of less than 1% of the whole genomes. A minimizer sketch therefore is a very efficient way to estimate how similar pieces of DNA are to each other, so it should also be useful for comparing genomes within the mustard family.”
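The sketching idea Schon describes can be illustrated with a minimal Python example. This is a toy sketch under stated assumptions, not the GeneSketch implementation: the function names (`kmer_hash`, `minimizer_sketch`, `jaccard`), the word length `k`, the window size `w`, the choice of hash and the example sequences are all made up for illustration. It slides a window over the short DNA "words" (k-mers) in a sequence, keeps only the word with the lowest hash value in each window, and then compares two sequences by how many of those retained words they share.

```python
import hashlib


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (Python's built-in hash() is salted per run)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq, k=5, w=4):
    """Keep the lowest-hash k-mer from every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()  # sequence shorter than k: nothing to sketch
    hashes = [kmer_hash(km) for km in kmers]
    sketch = set()
    # If there are fewer than w k-mers, treat them all as a single window.
    n_windows = max(len(kmers) - w + 1, 1)
    for start in range(n_windows):
        end = min(start + w, len(kmers))
        best = min(range(start, end), key=hashes.__getitem__)
        sketch.add(kmers[best])
    return sketch


def jaccard(a, b):
    """Shared fraction of two sketches: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sequences: genome_b differs from genome_a by one letter,
# genome_c is unrelated.
genome_a = "ACGTTGCATGCCGTAAGCTAGCTTAGGCTA"
genome_b = "ACGTTGCATGCCGTAAGCTAGCTTAGGCTT"
genome_c = "TTTTGGGGCCCCAAAATTGGCCAATGCATG"

sketch_a = minimizer_sketch(genome_a)
sketch_b = minimizer_sketch(genome_b)
sketch_c = minimizer_sketch(genome_c)

print("a vs b:", jaccard(sketch_a, sketch_b))
print("a vs c:", jaccard(sketch_a, sketch_c))
```

Tools built for real genomes typically use much longer words and also account for the reverse DNA strand. The example leaves both out to keep the core idea visible: only a subset of the sequence is retained, yet every window still contributes one representative.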

Same Technology as ChatGPT

Once you know where to look, the next step is understanding what you are seeing. The technology Schon plans to use in GeneSketch is similar to that used in other AI tools, such as ChatGPT.

“It’s something called ‘transformer’ technology,” says Schon. “You can ask a transformer to fill in a missing word in a sentence, for example. Initially, the transformer gives you a random word because it has never seen words before. But if you train it on millions of example sentences, it slowly learns to guess the right words by paying attention to patterns in the text. After training, a large language model like ChatGPT becomes very good at certain tasks, like answering questions or translating from one language to another. A transformer can be trained to learn not just human languages, but also the language of DNA, which has its own distinct patterns. I am working on a model to detect patterns in the DNA of many different species and translate those patterns into a language that we as humans can understand.”
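The fill-in-the-missing-word exercise Schon describes can be made concrete with a short Python sketch of how such training examples might be prepared from DNA. This is an illustration under assumptions, not his method: the article does not say how GeneSketch turns DNA into "words", so the three-letter split in `tokenize`, the `[MASK]` placeholder and the example sequence are all hypothetical. No model is trained here. The code only shows the question a transformer would be asked (the sequence with one word hidden) and the answer it would be scored against.

```python
import random


def tokenize(seq, k=3):
    """Split a DNA string into non-overlapping k-letter 'words'.

    Any leftover letters at the end that do not fill a whole word are dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_one(tokens, rng, mask="[MASK]"):
    """Hide one randomly chosen word.

    Returns the masked word list, the hidden position and the hidden word.
    """
    pos = rng.randrange(len(tokens))
    masked = list(tokens)  # copy so the original list is left untouched
    target = masked[pos]
    masked[pos] = mask
    return masked, pos, target


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = tokenize("ATGGCGTACGTTAGC")
masked, pos, target = mask_one(tokens, rng)

print("model input: ", " ".join(masked))
print("hidden word: ", target, "at position", pos)
```

Repeating this over many sequences yields the "millions of example sentences" from the quote: each pair of masked input and hidden word is one guess the model can be corrected on.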

Model Must Be Trained

Schon will train the transformer for GeneSketch to focus on how genes, particularly non-coding genes, vary across different species. However, he anticipates encountering several challenges throughout this process.

Schon says that one important issue is reliability. The transformer is a relatively new technology, and it makes mistakes. 

“ChatGPT, for example, was trained on many different sources of text, but if you ask it about a topic it never saw during training, it needs to make something up. You hope that it makes up something reasonable based on the patterns it has seen, but this is never a guarantee. You obviously want to avoid nonsense output. The more you train a transformer, the less nonsense it produces, but training can cost a lot of time and money. Is it better to train the model completely from scratch or build on existing models? I am trying both approaches.”

Potential of GeneSketch

The project began in October 2023, and Schon aims to develop a prototype of GeneSketch by the end of its first year. He plans to use this tool to create gene annotations for the entire mustard family. According to Schon, the tool could benefit not only the research sector but also the agricultural industry.

“It could, for example, provide seed breeders with a quick way of understanding the DNA of a crop and its wild relatives. By learning more about how crops have been able to develop unique traits over the centuries, breeders could make more informed decisions for improving traits, such as making crops more resilient to climate change. So, the potential impact could be huge.”