DNA contains the genetic information that influences everything from eye color to susceptibility to diseases and disorders. Genes, roughly 20,000 stretches of DNA in the human genome, perform various vital tasks in our cells. Even so, these genes make up less than 2% of the genome. The remaining base pairs are called “non-coding”. They contain less well-understood instructions about when and where genes should be produced or expressed in the human body.
Previous work on predicting gene expression has typically used convolutional neural networks as key building blocks. However, the accuracy and usefulness of these models have been limited by the difficulty of modeling the influence of distal enhancers on gene expression. The new method builds on Basenji2, a model that can predict regulatory activity from DNA sequences of up to 40,000 base pairs.
The team concluded that a fundamental change in architecture was needed to capture longer sequences and to understand how regulatory elements in DNA affect expression over greater distances.
The new model, Enformer, is based on Transformers and uses self-attention mechanisms to integrate significantly more DNA context. Because Transformers are well suited to reading long passages of text, the team adapted them to “read” greatly extended DNA sequences. The architecture can capture the influence of critical regulatory regions, called enhancers, on gene expression from further away within the DNA sequence. It does this by processing sequences in a way that accounts for interactions at distances more than five times greater (i.e., 200,000 base pairs) than previous methods could handle.
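To make the idea of self-attention more concrete, here is a minimal single-head sketch in NumPy. It is not Enformer's implementation: the toy sizes, the random weight matrices, and the function name `self_attention` are all assumptions chosen for illustration. What it shows is the property the paragraph above relies on: each position in the sequence computes a weight for every other position, so distant positions can influence one another directly.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x: (length, d_model) array of per-position features.
    Returns (output, weights); weights[i, j] is how much position i attends to j.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v, weights

# Toy sizes and random weights (assumptions for illustration, not Enformer's values).
rng = np.random.default_rng(0)
length, d_model = 8, 4
x = rng.normal(size=(length, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1, and `weights[0, -1]` is nonzero, meaning the first position draws on information from the last one in a single step rather than through a stack of local filters.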
To better understand how Enformer interprets the DNA sequence to arrive at more accurate predictions, the researchers used contribution scores to highlight which parts of the input sequence were most influential for a prediction. The results suggest that the model paid attention to enhancers even when they were more than 50,000 base pairs from the gene, which matches biological intuition. Notably, Enformer’s contribution scores are comparable to those of existing techniques designed specifically for this task. Enformer also learned about insulator elements, which separate two independently regulated regions of DNA.
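Per-position contribution scores of the kind described above can be computed in several ways. The sketch below uses gradient × input, one common attribution approach; the article does not say which method was used, so this choice is an assumption. The model is also hypothetical: `toy_predict` is a small softplus function with hand-set weights standing in for a trained network, and the "enhancer" at position 2 is invented for the example.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def toy_predict(x, w):
    """Hypothetical stand-in for an expression model: softplus of a weighted sum."""
    return np.log1p(np.exp(np.sum(x * w)))

def toy_gradient(x, w):
    """Analytic gradient of toy_predict with respect to the input x."""
    z = np.sum(x * w)
    return w / (1.0 + np.exp(-z))  # d softplus(z)/dz = sigmoid(z)

def contribution_scores(x, w):
    """Gradient x input, summed over the four base channels: one score per position."""
    return np.sum(toy_gradient(x, w) * x, axis=1)

seq = "ACGTACGTACGTACGTACGT"
x = one_hot(seq)
w = np.zeros((len(seq), 4))
w[2, BASES.index("G")] = 3.0   # invented distal "enhancer" base with a strong weight
w[18, BASES.index("G")] = 0.5  # invented weaker signal elsewhere in the sequence

scores = contribution_scores(x, w)
top_position = int(np.argmax(scores))
```

In this toy setup the highest score lands on position 2, the position given the strongest weight, which mirrors how contribution scores are read in practice: high-scoring positions are the ones the prediction depends on most.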
Although it is now possible to sequence an organism's DNA in its entirety, understanding the genome still requires complex experiments. Despite extensive research, most of the ways in which DNA controls gene expression remain a mystery. Enformer partially recognizes the vocabulary of the DNA sequence, much as a spell checker does, and can thus point out edits that may change gene expression.
The main application of this new approach is to predict which changes to DNA letters, known as genetic variants, will affect the expression of a gene. Enformer outperforms previous models in predicting the impact of variants on gene expression, for both natural genetic variants and synthetic variants that alter critical regulatory regions. This capability helps interpret the growing number of disease-associated variants discovered in genome-wide association studies.
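A simple way to picture variant effect prediction is to score the same sequence twice, once with the reference base and once with the alternate base, and compare the two predictions. The sketch below illustrates that comparison only. `toy_predict` and its hand-set weights are hypothetical stand-ins for a trained model, and the sequence and variant positions are made up.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def toy_predict(seq, w):
    """Hypothetical stand-in for a trained model: maps a sequence to one expression value."""
    return float(np.sum(one_hot(seq) * w))

def variant_effect(ref_seq, position, alt_base, w):
    """Predicted expression for the alternate allele minus that for the reference."""
    alt_seq = ref_seq[:position] + alt_base + ref_seq[position + 1:]
    return toy_predict(alt_seq, w) - toy_predict(ref_seq, w)

ref_seq = "ACGTACGTACGT"
w = np.zeros((len(ref_seq), 4))
w[2, BASES.index("G")] = 2.0  # the toy model rewards a G at position 2

disruptive = variant_effect(ref_seq, 2, "A", w)  # G>A at the rewarded position
neutral = variant_effect(ref_seq, 7, "A", w)     # T>A where the toy model has no weight
```

Here `disruptive` is negative because the substitution removes the base the toy model rewards, while `neutral` is zero. With a real model, the size and sign of this difference is what would be used to rank variants by their likely effect on expression.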
Enformer is a major advancement in complex genome sequence studies. The team intends to work with other researchers and organizations interested in using computational models to solve the big problems of genomics.