Title: Biological Language Model
Author: Qiwen Dong
Publisher: Ingram
Genre: Medicine
Series: East China Normal University Scientific Reports
ISBN: 9789811212963
Linguistic Feature Analysis of Protein Sequences
2.1 Motivation and Basic Idea
Proteins play an important role in the function of complex biological systems, yet the relationship between the primary sequences, three-dimensional structures and functions of proteins remains one of the most important unanswered questions in biology. With the completion of the Human Genome Project and continued work on determining biological sequences accurately, a large number of genomic and proteomic sequences are now available for different organisms. The exponential growth of these data provides an opportunity to attack the sequence–structure–function mapping problem with sophisticated data-driven methods. Such methods have been used successfully in natural language processing, and there are analogies between biological sequences and natural language. In linguistics, certain words and phrases combine to form a meaningful sentence; in biology, particular nucleotide sequences denote genes, and particular protein sequences determine the structure and function of the protein.1 But is there a “language” in biological sequences?
Mantegna2 analyzed the linguistic features of noncoding DNA and emphasized that there exists a “language” in noncoding DNA. Although that work has some shortcomings,3–5 many methods from natural language processing have since been applied to biological sequences. N-grams of DNA6 and protein7 have been extracted. A bio-dictionary has been built and used to annotate proteins.8 Latent semantic analysis has been used to characterize the secondary structure of proteins.9 Probabilistic models from speech recognition have been used to improve protein domain discovery.10
The n-gram analysis method is one of the most frequently used techniques in computational linguistics. It assumes that only the previous n − 1 words in a sentence affect the probability of the next word.11 It has been used successfully in automatic speech recognition, document classification, information extraction, statistical machine translation and other challenging natural language tasks. In this chapter, n-grams are extracted from whole-genome protein sequences, their agreement with Zipf’s law is analyzed, and several statistical features are derived from them.
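The n-gram assumption can be written compactly. The notation here is introduced only for illustration and does not come from the original text: w_i is the i-th word (for proteins, the i-th amino acid) and C(·) is the number of times a string occurs in the corpus. The second expression is the usual maximum-likelihood estimate, given as a sketch rather than as the estimator used in this chapter.

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}),
\qquad
\hat{P}(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  = \frac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})}
```

For n = 3, for example, the probability of an amino acid depends only on the two residues that precede it.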
2.2 Comparative n-gram Analysis
Amino acids are treated as words, since each amino acid carries a chemical “meaning”. To extract n-grams from whole-genome protein sequences, all the proteins of one organism were concatenated and separated by blanks, e.g. protein1 protein2 protein3, etc. Because the genomic data are large, a suffix array12,13 was used to reduce the computational cost. To extract the n-gram statistics, we developed a toolkit that carries out the following functions:
1. Count protein number and length.
2. Count n-grams and most frequent n-grams.
3. Count n-grams of specified length.
4. Determine relative frequencies of specific n-grams across organisms.
5. Assess the distribution of n-gram frequencies in a specific organism.
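The first few toolkit functions can be illustrated with a minimal Python sketch. It is not the toolkit described above: it replaces the suffix array with a plain hash-based counter, which is simpler but uses more memory on whole-genome data. The function names `count_ngrams` and `relative_frequencies` and the toy sequences are hypothetical.

```python
from collections import Counter


def count_ngrams(proteome: str, n: int) -> Counter:
    """Count amino-acid n-grams in a blank-separated string of protein sequences."""
    counts = Counter()
    for protein in proteome.split():
        # Slide a window of width n inside each protein only, so that no
        # n-gram spans the blank between two neighbouring proteins.
        for i in range(len(protein) - n + 1):
            counts[protein[i:i + n]] += 1
    return counts


def relative_frequencies(counts: Counter) -> dict:
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


# Toy input (hypothetical sequences, not real proteome data).
proteome = "MKVLAAGL MKVLPPP"
proteins = proteome.split()
print(len(proteins), [len(p) for p in proteins])  # protein number and lengths
trigrams = count_ngrams(proteome, 3)
print(trigrams.most_common(3))                    # most frequent 3-grams
print(relative_frequencies(trigrams)["MKV"])      # relative frequency of one 3-gram
```

Counting inside each protein separately is a design choice of this sketch; it keeps n-grams from straddling the blank that separates two proteins.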
The method was applied to protein sequences derived from the whole-genome sequences of 20 organisms. The protein sequence data were downloaded from the Swiss-Prot database.14 The number of proteins ranges from 484 (Mycoplasma genitalium) to 25612 (Human).
We developed a modification of Zipf-like analysis that can reveal differences in word usage between organisms. First, the amino acid n-grams of a given length were sorted in descending order of frequency for the organism of choice. Comparative n-gram plots, which compare the n-grams of one organism with those of the other organisms, were then drawn using the top 20 n-grams. Figure 2-1 shows the comparative n-gram analysis of Human (A) for n = 3 and R_norvegicus (B) for n = 4. The x-axis represents the ranked n-grams of the organism of choice, and the y-axis represents the corresponding frequency. The sorted n-grams of the organism of choice are shown as the bold line. Thin lines indicate the frequencies, in the other organisms, of the n-grams at each rank. Table 2-1 shows the 20 organisms used in this book.
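The data behind such a comparative plot can be assembled as in the following sketch. It is an illustration under stated assumptions, not the authors' implementation: the helper names `ngram_frequencies` and `comparative_ngrams`, the tie-breaking rule (alphabetical order among equally frequent n-grams) and the toy sequences are all hypothetical.

```python
from collections import Counter


def ngram_frequencies(proteins, n):
    """Relative frequency of every n-gram in a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def comparative_ngrams(reference, others, n, top=20):
    """Rank the reference organism's n-grams and look them up elsewhere.

    Returns the ranked n-grams (x-axis), the reference frequencies (the
    bold line) and one frequency list per other organism (the other lines).
    """
    ref_freq = ngram_frequencies(reference, n)
    # Descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(ref_freq, key=lambda g: (-ref_freq[g], g))[:top]
    ref_curve = [ref_freq[g] for g in ranked]
    other_curves = {}
    for name, proteins in others.items():
        freq = ngram_frequencies(proteins, n)
        # An n-gram absent from another organism has frequency 0 there.
        other_curves[name] = [freq.get(g, 0.0) for g in ranked]
    return ranked, ref_curve, other_curves


# Toy data (hypothetical sequences and organism labels).
reference = ["PPPGPPPP", "SSPPPP"]
others = {"organism_b": ["MKVLAAGL"], "organism_c": ["PPPAKL"]}
ranked, ref_curve, other_curves = comparative_ngrams(reference, others, n=3)
for rank, gram in enumerate(ranked, start=1):
    row = [round(curve[rank - 1], 3) for curve in other_curves.values()]
    print(rank, gram, round(ref_curve[rank - 1], 3), row)
```

Each printed row corresponds to one position on the x-axis: the rank, the n-gram, its frequency in the organism of choice, and its frequency in each of the other organisms.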
In natural language, some words are used frequently and others rarely; similarly, in proteins the 20 amino acids are used with different frequencies. In the uni-gram plots of the 20 organisms, Leucine was found to be one of the most frequent amino acids, ranking among the top three. Tryptophan and Cysteine, on the other hand, are the rarest amino acids and rank among the last three. In language, frequent words are usually not closely related to the actual meaning of a sentence, whereas rare words often are. The same may hold for the rare amino acids, which may be important for the structure and function of the protein.
Another statistical feature of n-grams is that there are organism-specific “phrases” in the protein sequences. Examples are shown in Fig. 2-1. In Human (Fig. 2-1(A)), the phrases “PPP”, “PGP” and “SSP” are among the top 20 most frequently used 3-grams, but they occur with very low frequencies in other organisms. Similarly, in R_norvegicus (Fig. 2-1(B)), such phrases are “HTGE”, “GEKP”, “CGKA”, “GKAF”, “IHTG” and “PYKC”. These highly idiosyncratic n-grams suggest that there are organism-specific usages of “phrases” in protein sequences.
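One way to flag such idiosyncratic n-grams automatically is sketched below. The text identifies them by inspecting the plots and gives no numerical criterion, so the rule used here is an assumption made for illustration: an n-gram counts as organism-specific if it is among the top n-grams of the reference organism and at least `ratio` times more frequent there than in any other organism. The function names, the `ratio` parameter and the toy sequences are hypothetical.

```python
from collections import Counter


def ngram_frequencies(proteins, n):
    """Relative frequency of every n-gram in a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def organism_specific_ngrams(reference, others, n, top=20, ratio=5.0):
    """Top reference n-grams that are much rarer in every other organism."""
    ref = ngram_frequencies(reference, n)
    ranked = sorted(ref, key=lambda g: (-ref[g], g))[:top]
    other_freqs = [ngram_frequencies(p, n) for p in others.values()]
    specific = []
    for gram in ranked:
        # Highest frequency of this n-gram in any other organism.
        highest_elsewhere = max((f.get(gram, 0.0) for f in other_freqs), default=0.0)
        # Assumed criterion: at least `ratio` times more frequent here.
        if ref[gram] >= ratio * highest_elsewhere:
            specific.append(gram)
    return specific


# Toy data (hypothetical sequences, not real proteomes).
reference = ["PPPPAAA"]
others = {"organism_b": ["AAAAKK"]}
print(organism_specific_ngrams(reference, others, n=3))
```

In the toy data, “AAA” is frequent in both organisms and is therefore not reported, while “PPP” does not occur in the second organism at all.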
Table 2-1 Organism names used in the plot.
Organism | Organism |
A_thaliana | Human |
Aeropyrum_pernix | Methanopyrus_kandleri |
arabidopsis | Streptomyces_avermitilis |
Archaeoglobus_fulgidus | Mycoplasma_genitalium |
Bacillus_anthracis_Ames | Neisseria_meningitidis_MC58 |
Bifidobacterium_longum | Pasteurella_multocida |
Borrelia_burgdorferi | R_norvegicus |
Buchnera_aphidicola_Sg | s_pombe |
Encephalitozoon_cuniculi | Worm |
Fusobacterium_nucleatum | Yeastpom |
Figure 2-1 Comparative n-gram analysis of Human (A) for n = 3 and R_norvegicus (B) for n = 4.
2.3 The Zipf Law Analysis
Claiming Zipf’s law in a data set seems to be simple enough: if n values, x_i (i = 1, 2, ..., n), are ranked so that x_1 ≥ x_2 ≥ ... ≥ x_r ≥ ... ≥ x_n, Zipf’s law15 states that