Google researchers use machine learning approach to annotate protein domains


Proteins play an important role in the structure and function of all living organisms. Every protein consists of a chain of amino acid building blocks. Much like an image can contain multiple objects, a protein can contain multiple components, known as protein domains.

Researchers have grappled extensively with the challenging task of understanding the relationship between a protein’s amino acid sequence and its structure or function.

Many people are familiar with DeepMind’s AlphaFold, which uses computational methods to predict protein structure from amino acid sequences. While existing methods have successfully predicted the function of hundreds of millions of proteins, many more remain unannotated. Reliably predicting function for highly divergent sequences is becoming a more pressing problem as the volume and diversity of protein sequences in public databases grow rapidly.

The Google AI team introduces an ML technique to reliably predict protein function. Using it, the team added around 6.8 million entries to Pfam, the widely used protein family database that contains highly detailed computational annotations describing the function of protein domains. They will also release the model, ProtENN, in a form that lets users enter a sequence and get a predicted protein function in real time in the browser, with no setup required.

Researchers began by developing a protein domain classification model, as a first step toward classifying complete protein sequences. Given the amino acid sequence of a protein domain, they formulate the problem as a multi-class classification task, in which they predict a single label out of 17,929 classes (the classes in the Pfam database).
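The multi-class formulation can be illustrated with a minimal sketch: a sequence is encoded position by position, a model produces one score per class, and the highest-scoring class becomes the single predicted label. The one-hot encoding, the function names, and the class names below are assumptions made for illustration; the article does not describe the team's actual input pipeline.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a sequence as a list of 20-dimensional one-hot rows."""
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:  # unknown residues stay all-zero
            row[AA_INDEX[aa]] = 1.0
        rows.append(row)
    return rows


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, class_names):
    """Pick the single most probable class, as in multi-class classification."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical scores for three made-up classes (a real model would emit 17,929).
label, confidence = predict_label([0.1, 2.0, -1.0],
                                  ["family_a", "family_b", "family_c"])
```

In a real setting the scores would come from a trained network rather than being written by hand.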

The main disadvantage of current state-of-the-art methods is that they are based on linear sequence alignment and do not take into account interactions between amino acids in different parts of a protein sequence. Proteins, however, do not simply remain a linear chain of amino acids. They fold in on themselves, so that non-adjacent amino acids can interact strongly.

A fundamental step in current state-of-the-art approaches is to align a new query sequence with one or more sequences of known function. Because of this dependence on sequences with known function, it is difficult to predict the function of a new sequence that is very different from any of them. In addition, alignment-based approaches are computationally expensive, which makes applying them to large datasets such as the MGnify metagenomic database, which contains over a billion protein sequences, prohibitive.

The team proposes that dilated convolutional neural networks (CNNs) are well suited to modeling non-local pairwise amino acid interactions. In addition, they run efficiently on modern ML hardware such as GPUs. The team trains ProtCNN (a 1-dimensional CNN) and ProtENN (an ensemble of independently trained ProtCNN models) to predict the classification of protein sequences.
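To see why dilation helps with non-local interactions, consider the rough sketch below. A dilated 1-D convolution spaces its kernel taps several positions apart, so stacking a few layers lets one output position depend on residues far apart in the sequence. The kernel size, the dilation rates, and the majority-vote rule for combining ensemble members are all assumptions chosen for illustration; the article does not give ProtCNN's hyperparameters or say how ProtENN aggregates its members.

```python
from collections import Counter


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1-D dilated convolution (cross-correlation form)."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    out = []
    for start in range(len(signal) - span):
        out.append(sum(k * signal[start + j * dilation]
                       for j, k in enumerate(kernel)))
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output after stacked layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


def ensemble_vote(predictions):
    """Combine per-model labels by majority vote (assumed aggregation rule)."""
    return Counter(predictions).most_common(1)[0][0]


# With dilation 2, each output mixes positions that are two steps apart.
mixed = dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2)

# Four layers of kernel size 3 with doubling dilation cover 31 positions.
field = receptive_field(3, [1, 2, 4, 8])

# Three hypothetical models vote on a label for the same sequence.
winner = ensemble_vote(["family_a", "family_b", "family_a"])
```

The receptive field grows with the sum of the dilation rates, which is why doubling the rate per layer covers long ranges with few layers.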

Because proteins evolved from common ancestors, different proteins often share a large part of their amino acid sequence. Without sufficient care, the test set can end up dominated by samples that closely resemble the training data. This can reward models that simply “memorize” the training data rather than learning to generalize more broadly.

It is therefore important to test model performance under several different setups. For each evaluation, they stratify model accuracy as a function of the similarity between each held-out test sequence and the closest sequence in the training set.

The team first evaluates the model’s ability to generalize and make correct predictions for out-of-distribution data. To do this, they use a clustered split of the training and test sets, in which protein sequences are grouped into clusters according to their sequence similarity. Since entire clusters are assigned to either the train or the test set, each test example differs by at least 75% from every training example.
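The key property of a clustered split is that a cluster is never divided between train and test. The sketch below illustrates that idea under the assumption that the similarity clustering has already been done and is supplied as a mapping from sequence ID to cluster ID. The function name, the IDs, and the test fraction are hypothetical, not taken from the team's pipeline.

```python
import random


def clustered_split(cluster_of, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test.

    cluster_of: dict mapping sequence id -> cluster id (clustering is
    assumed to have been computed beforehand from sequence similarity).
    """
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train = [s for s, c in cluster_of.items() if c not in test_clusters]
    test = [s for s, c in cluster_of.items() if c in test_clusters]
    return train, test


# Hypothetical sequences grouped into four similarity clusters.
cluster_of = {
    "seq1": "c1", "seq2": "c1",
    "seq3": "c2",
    "seq4": "c3", "seq5": "c3",
    "seq6": "c4",
}
train_ids, test_ids = clustered_split(cluster_of, test_fraction=0.25)
```

Because membership is decided per cluster, no test sequence has a close relative from its own cluster in the training set.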

For the second evaluation, they use a randomly split training and test set and stratify examples by how difficult they are expected to be to classify. Two measures of difficulty are used: the similarity between a test example and the nearest training example, and the number of training examples from the true class.
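Stratifying accuracy by difficulty amounts to binning test examples by a difficulty measure and computing accuracy within each bin. The sketch below does this for similarity to the nearest training example. The bin edges, the record format, and the function name are assumptions for illustration, and the similarity values are made up.

```python
def stratified_accuracy(records, bin_edges):
    """Accuracy per similarity bin.

    records: iterable of (similarity, is_correct) pairs, one per test example.
    bin_edges: ascending edges; each bin is [lo, hi), except the last bin,
    which also includes its upper edge.
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for similarity, is_correct in records:
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == n_bins - 1
            if lo <= similarity < hi or (is_last and similarity == hi):
                totals[i] += 1
                hits[i] += int(is_correct)
                break
    return {
        (bin_edges[i], bin_edges[i + 1]):
            (hits[i] / totals[i] if totals[i] else None)  # None for empty bins
        for i in range(n_bins)
    }


# Made-up results: (similarity to nearest training example, prediction correct?)
records = [(0.1, False), (0.2, True), (0.6, True), (1.0, True)]
by_bin = stratified_accuracy(records, [0.0, 0.5, 1.0])
```

The same function works for the second difficulty measure if the first element of each record is the number of training examples in the true class and the bin edges are counts.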

They also test the most commonly used baseline models and evaluation setups, focusing on:

  • BLAST, a nearest-neighbor method that uses sequence alignment to measure distance and infer function
  • Profile hidden Markov models (TPHMM and phmmer).
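The nearest-neighbor idea behind the first baseline can be sketched in a few lines: find the labeled sequence most similar to the query and copy its label. This toy version is not BLAST. It substitutes Python's `difflib.SequenceMatcher` ratio for a real alignment score, and the sequences and family names are invented for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude string similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def nearest_neighbor_label(query, labeled):
    """Return the label of the most similar labeled sequence and its score.

    labeled: list of (sequence, label) pairs with known function.
    """
    best_seq, best_label = max(labeled,
                               key=lambda item: similarity(query, item[0]))
    return best_label, similarity(query, best_seq)


# Invented reference sequences with hypothetical family labels.
labeled = [
    ("MKTAYIAKQR", "family_a"),
    ("GSHMLEDPVD", "family_b"),
]
predicted, score = nearest_neighbor_label("MKTAYIAKQK", labeled)
```

This also shows the weakness described earlier in the article: a query that resembles none of the labeled sequences still receives the label of whichever one happens to score highest.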

The team worked with the Pfam team at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) to see whether their approach could be used to annotate real-world sequences. Combining ProtENN with alignment-based methods identified more sequences than either approach could alone. The result, Pfam-N, a collection of 6.8 million additional protein sequence annotations, has been made available. The results show that ProtENN learns information that is complementary to alignment-based methods.

Following the success of these methods on the classification tasks, they examined the networks to determine whether the learned embeddings were generally useful. To do this, they created an interactive article that allows users to explore the relationship between model predictions, embeddings, and input sequences. They found that similar sequences cluster together in the embedding space.

Furthermore, because they used a dilated CNN as the network architecture, they were able to use previously developed interpretability methods such as Class Activation Mapping (CAM) and Sufficient Input Subsets (SIS) to identify the subsequences responsible for the network’s predictions. Using these methods, they find that the network generally predicts the function of a sequence by focusing on its relevant elements.
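As a rough illustration of how Class Activation Mapping highlights important positions, the sketch below assumes a network whose last convolutional layer is followed by global average pooling and a linear classifier. The article does not say that ProtCNN is built this way, so this is the textbook CAM setting rather than the team's implementation. All numbers and names are made up.

```python
def class_activation_map(features, class_weights):
    """Per-position contribution to one class score.

    features: list of length L, each entry a list of K feature values
              from the last convolutional layer (one entry per residue).
    class_weights: the K linear-layer weights for the class of interest.
    """
    return [sum(w * f for w, f in zip(class_weights, position))
            for position in features]


def pooled_logit(features, class_weights):
    """Class score from global average pooling followed by a linear layer."""
    length = len(features)
    pooled = [sum(position[k] for position in features) / length
              for k in range(len(class_weights))]
    return sum(w * p for w, p in zip(class_weights, pooled))


# Three sequence positions, two feature channels (invented values).
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
weights = [2.0, -1.0]

cam = class_activation_map(features, weights)
most_important = max(range(len(cam)), key=cam.__getitem__)
```

Under this assumption the mean of the map equals the class score (ignoring any bias term), so positions with large values are the ones driving the prediction.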



