Title: Biological Language Model
Author: Qiwen Dong
Publisher: Ingram
Genre: Medicine
Series: East China Normal University Scientific Reports
ISBN: 9789811212963
Table 3-1 The hydrophobic properties of the side chains of the 20 standard amino acids.
Several physicochemical property-based amino acid encodings have been proposed in previous studies. Fauchère et al.18 established 15 physicochemical descriptors of side chains for 20 natural and 26 non-coded amino acids, which reflect the hydrophobic, steric, electronic, and other properties of the side chains. Radzicka and Wolfenden19 experimentally obtained numerical measures of the tendencies of amino acids to leave water and enter a truly nonpolar condensed phase. Lohman et al.20 represented amino acids by seven physicochemical properties (hydrophobicity, hydrophilicity, polarity, volume, surface area, bulkiness and refractivity) to predict transmembrane protein sequences. Atchley et al.15 used multivariate statistical analyses to produce multi-dimensional patterns of attribute covariation for the 20 standard amino acids, which reflect the polarity, secondary structure, molecular volume, codon diversity and electrostatic charge of amino acids.
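As a minimal sketch of a property-based encoding, the snippet below maps each residue of a sequence to a single physicochemical value. It assumes the Kyte-Doolittle hydropathy index as the property scale; that scale is chosen here only for illustration and is not one of the descriptor sets cited above. The names `KD_HYDROPATHY` and `encode_by_property` are hypothetical.

```python
# Kyte-Doolittle hydropathy index (assumption: used only as an example scale;
# it is not taken from the descriptor sets cited in the text).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_by_property(sequence, scale=KD_HYDROPATHY):
    """Return one property value per residue of the sequence.

    Raises KeyError for symbols outside the 20 standard amino acids.
    """
    return [scale[residue] for residue in sequence.upper()]


if __name__ == "__main__":
    # Hypothetical toy peptide.
    print(encode_by_property("AIKDL"))
```

An encoding built from several properties would follow the same pattern, with each residue mapped to a vector holding one value per property instead of a single number.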
3.2.3 Evolution-based encoding
The evolution-based encoding methods extract the evolutionary information of residues from sequence alignments or phylogenetic trees to represent amino acids, mainly by using the amino acid substitution probability. These evolution-based encoding methods can be categorized into two groups based on position relevance: position-independent methods and position-dependent methods.
The position-independent methods encode amino acids by using fixed encodings, regardless of the position of the amino acid in the sequence and of the amino acid composition of the sequence. The most commonly used position-independent encodings are the PAM matrices and the BLOSUM matrices, and a common flowchart is shown in Fig. 3-1. The point accepted mutation (PAM) matrices represent the probabilities of replacing one amino acid with another in homologous protein sequences,13 and focus on the evolutionary process of proteins. The PAM matrices are calculated from protein phylogenetic trees and related protein sequence pairs. They assume that an accepted mutation yields an amino acid similar in physical and chemical properties to the one it replaces, and that the likelihood of amino acid X replacing Y is the same as that of Y replacing X; thus, the PAM matrices are 20 × 20 symmetric matrices in which each row and each column represents one of the 20 standard amino acids. Different PAM matrices can be generated for different lengths of evolutionary time. The PAM250 matrix (250 PAMs), which describes the amino acid replacements expected after 250 evolutionary changes, was found by the authors to be an effective scoring matrix for detecting distant relationships,13 and it is now widely used in related research.21,22 The blocks amino acid substitution matrices (BLOSUM)23 are derived from conserved regions (blocks) constructed by PROTOMAT24 from non-redundant protein groups. The values in the BLOSUM matrices represent the probabilities that amino acid pairs will exchange places with each other. To reduce the contributions of the most closely related protein sequences, the sequences are clustered within blocks. Different BLOSUM matrices can be generated by using different percent-identity thresholds for clustering, and the BLOSUM62 matrix performed better overall.23
Figure 3-1 The flowchart of position-independent amino acid encoding methods. First, the target proteins are selected (step 1). Then, the sequence alignments are constructed based on some criteria (step 2). Finally, the mutation matrix is calculated and is regarded as the amino acid encoding (step 3).
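The three steps of Fig. 3-1 can be sketched in a few lines. The example below is a simplified, BLOSUM-style illustration and not the published procedure: it assumes the alignment is already given as ungapped blocks, skips the clustering of closely related sequences, and adds a pseudocount so that unobserved pairs still receive a finite score. All names (`pair_counts`, `log_odds_matrix`, `encode`) and the toy block are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(blocks):
    """Count unordered residue pairs in every column of every ungapped block."""
    counts = Counter()
    for block in blocks:
        for column in zip(*block):
            for a, b in combinations(column, 2):
                counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_matrix(blocks, pseudocount=1.0):
    """Return (alphabet, scores) with scores[(a, b)] = log2(observed / expected)."""
    counts = pair_counts(blocks)
    alphabet = sorted({res for block in blocks for seq in block for res in seq})
    pairs = [(a, b) for i, a in enumerate(alphabet) for b in alphabet[i:]]

    # Observed pair frequencies (pseudocount is an assumption of this sketch).
    total = sum(counts[pair] + pseudocount for pair in pairs)
    q = {pair: (counts[pair] + pseudocount) / total for pair in pairs}

    # Background frequency of each residue implied by the pair frequencies.
    p = dict.fromkeys(alphabet, 0.0)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Log-odds of the observed pair frequency against random pairing.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = scores[(b, a)] = math.log2(freq / expected)
    return alphabet, scores


def encode(sequence, alphabet, scores):
    """Encode each residue as its row of the substitution matrix."""
    return [[scores[(res, other)] for other in alphabet] for res in sequence]


if __name__ == "__main__":
    toy_blocks = [["ALKV", "ALRV", "AIKV"]]  # hypothetical aligned block
    alphabet, scores = log_odds_matrix(toy_blocks)
    print(alphabet)
    print(encode("AK", alphabet, scores))
```

Because the matrix is fixed once computed, every occurrence of a given residue receives the same row, whatever its position in the sequence. With a full 20-letter alphabet, each residue would be encoded as a 20-dimensional vector.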
Unlike the position-independent matrices, the position-dependent methods encode amino acids at different positions by using different encodings, even if the amino acid types are the same. The position-dependent encodings are deduced from the multiple sequence alignments (MSAs) of target sequences; the flowchart for this is shown in Fig. 3-2. The position-specific scoring matrix (PSSM) is the most widely used encoding method. The PSSM is also called the position weight matrix (PWM), which represents the log-likelihoods of the occurrence probabilities of all possible molecule types at each location in a given biological sequence.25 Generally, the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST)26 is used to execute the sequence alignment and generate the MSA for the target protein sequence. Then the corresponding PSSM is calculated from the MSA. For a protein sequence of length L, its PSSM is an L × 20 matrix, in which each row represents the log-likelihoods of the probabilities of the 20 amino acids occurring at the corresponding position. Besides PSI-BLAST, the HMM-HMM alignment algorithm HHblits is also widely used to generate the probability profile; it is more sensitive than the sequence-profile alignment algorithm PSI-BLAST, as demonstrated by Remmert et al.27
Figure 3-2 The flowchart of position-dependent amino acid encoding methods. First, the target protein sequence is selected (step 1). Then, multiple sequence alignments are constructed by searching the protein sequence database (steps 2 and 3). Finally, the position weight is calculated by columns and is regarded as the corresponding amino acid encodings (step 4).
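The final step of Fig. 3-2, turning the columns of an MSA into position weights, can be illustrated as follows. This is a minimal sketch under stated assumptions: the MSA is given directly as equal-length strings with `-` for gaps, the background distribution is uniform, and a fixed pseudocount is added to every count. Tools such as PSI-BLAST apply more elaborate weighting schemes, which are not reproduced here. The function name `pssm_from_msa` and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_msa(msa, background=None, pseudocount=1.0):
    """Return an L x 20 matrix of log2(column frequency / background frequency).

    msa: list of aligned sequences of equal length; gap symbols are ignored.
    background: residue -> frequency (assumption: uniform if not given).
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(msa[0])
    pssm = []
    for pos in range(length):
        # Count the standard residues observed in this column.
        counts = Counter(seq[pos] for seq in msa if seq[pos] in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = [
            math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in AMINO_ACIDS
        ]
        pssm.append(row)
    return pssm


if __name__ == "__main__":
    toy_msa = ["MKV", "MRV", "MKI", "M-V"]  # hypothetical alignment
    for row in pssm_from_msa(toy_msa):
        print([round(score, 2) for score in row])
```

Each row depends on the alignment column it was computed from, so two identical residues at different positions generally receive different 20-dimensional encodings.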
3.2.4 Structure-based encoding
The structure-based amino acid encoding methods, which can also be called statistical-based methods, encode amino acids by using structure-related statistical potentials, mainly the inter-residue contact energies.28 The basic assumption is that, in a large number of native protein structures, the average potentials of inter-residue contacts reflect the differences in interaction between residue pairs,29 which play an important role in the formation of protein backbone structures.28 The inter-residue contact energies of the 20 amino acids are usually estimated from amino acid pairing frequencies in native protein structures.28 The typical procedure for calculating the contact energies comprises three steps. First, a protein structure set is constructed from known native protein structures. Then, the inter-residue contacts of the 20 amino acids observed in those structures are counted. Finally, the contact energies are deduced from the amino acid contact frequencies by using a predefined energy function, and different contact energies reflect different contact potentials of amino acids in native structures.
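The three-step procedure above can be sketched on toy data. The text leaves the energy function unspecified, so this example assumes a simple inverse-Boltzmann form, e(a, b) = -ln(observed / expected), with the expected counts taken from random pairing of the residues that take part in contacts. The distance cutoff, the sequence-separation filter, the pseudocount, the function names and the coordinates are all hypothetical choices for illustration, not those of the cited studies.

```python
import math
from collections import Counter
from itertools import combinations


def count_contacts(structure, cutoff=6.5, min_separation=2):
    """Count residue-type pairs closer than `cutoff` in one structure.

    structure: list of (residue_type, (x, y, z)) in chain order.
    """
    counts = Counter()
    for (i, (res_i, xyz_i)), (j, (res_j, xyz_j)) in combinations(
        enumerate(structure), 2
    ):
        if j - i >= min_separation and math.dist(xyz_i, xyz_j) <= cutoff:
            counts[tuple(sorted((res_i, res_j)))] += 1
    return counts


def contact_energies(structures, cutoff=6.5, pseudocount=1.0):
    """Return energies[(a, b)] = -ln(observed / expected), in units of kT."""
    # Step 2: pool the contact counts over the structure set.
    observed = Counter()
    for structure in structures:
        observed.update(count_contacts(structure, cutoff))
    total_contacts = sum(observed.values())
    if total_contacts == 0:
        raise ValueError("no contacts found in the structure set")

    # How often each residue type appears at either end of a contact.
    ends = Counter()
    for (a, b), n in observed.items():
        ends[a] += n
        ends[b] += n
    alphabet = sorted(ends)

    # Step 3: compare observed counts with random pairing (assumed reference).
    energies = {}
    for i, a in enumerate(alphabet):
        for b in alphabet[i:]:
            fa = ends[a] / (2 * total_contacts)
            fb = ends[b] / (2 * total_contacts)
            expected = total_contacts * (fa * fb if a == b else 2 * fa * fb)
            ratio = (observed[(a, b)] + pseudocount) / (expected + pseudocount)
            energies[(a, b)] = energies[(b, a)] = -math.log(ratio)
    return energies


if __name__ == "__main__":
    # Step 1: a hypothetical one-structure "set" with made-up coordinates.
    toy = [
        ("L", (0, 0, 0)), ("K", (0, 0, 30)), ("L", (4, 0, 0)),
        ("K", (30, 0, 0)), ("L", (0, 4, 0)), ("K", (30, 4, 0)),
    ]
    print(contact_energies([toy]))
```

Pairs that occur in contact more often than random pairing predicts receive negative (favourable) energies, and pairs that occur less often receive positive ones. A row of the resulting matrix can then serve as the encoding of the corresponding amino acid.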
Many previous studies have focused on structure-based encodings. To account for the medium- and long-range interactions that determine protein folding conformations, Tanaka and Scheraga28 evaluated empirical standard free energies to formulate amino acid contacts from the contact frequencies. By employing the lattice model, Miyazawa and Jernigan29 estimated contact energies by using the quasi-chemical approximation with an approximate treatment of the effects of chain connectivity. Later, they reevaluated the contact energies based on a larger set of protein structures and also estimated an additional repulsive packing energy term to provide an estimate of the overall energies of inter-residue interactions.30 To investigate the validity of the quasi-chemical approximation, Skolnick et al.31 estimated the expected number of contacts by using two reference states, the first of which treats the protein as a Gaussian random coil polymer and the second of which includes the effects of chain connectivity, secondary structure and chain compactness. The comparison shows that the quasi-chemical approximation is, in general, sufficient for extracting the amino acid pair potentials. To recognize native-like protein structures, Simmons et al.32 used distance-dependent statistical contact