Human genome sequencing was completed in 2003, and the life sciences have since entered the post-genome era. The focus of research is gradually shifting from accumulating data to interpreting it, i.e. extracting structural and functional information from sequence data. Post-genome research includes comparative genomics, structural genomics, functional genomics, proteomics, holistic biology and pharmacogenomics.

The proteome¹ is a dynamic concept: it differs not only between the tissues and cells of the same organism, but also changes continually across that organism's developmental stages until its death. Complex patterns of gene expression give rise to a variety of complex life activities. In fact, every activity at each stage of life results from specific combinations of proteins appearing at particular times and places. The DNA sequence does not provide this information, so the language of nucleic acids alone is not sufficient to describe the whole of life activity. Studying the proteome, both as a whole and as it changes over time, is therefore a demanding task and an indispensable follow-up to genome research for elucidating the nature of life activities. Post-genome research, and proteome research in particular, will undoubtedly become a central task of the life sciences in the 21st century, carrying on from genome research.

The mapping between a biological sequence and its structure and function is similar to the mapping between words and their meaning in a language.² In linguistics, words are arranged into meaningful sentences; in biology, the arrangement of amino acids determines the structure and function of a protein. The arrangement of amino acids that forms a protein can therefore be regarded as analogous to a meaningful arrangement of words. Just as the words in a document map to its semantics and carry information about its topic, a protein sequence can be regarded as an original text that contains information about structure and function, which can be used to further understand the interactions between proteins.

As protein primary structure sequencing technology matures, the amount of genomic and proteomic sequence data continues to increase, as does the associated structural and functional data. These data will increase exponentially over the next decade, making it possible to use a data-driven approach to solve protein sequence–structure–function mapping problems. Data-driven methods have been successfully applied in many areas of natural language processing, such as speech recognition, text categorization, information extraction and machine translation.³

The emergence of a large number of corpora has promoted the development of computational linguistics. Similarly, the emergence of a large amount of protein sequence–structure–function data has enabled computational methods and information techniques to be applied in this field. Computational linguistic tools including statistical language models, text classification techniques, machine learning methods and higher-level language processing methods have been applied to understand the structure and function of proteins in cells. The purpose of this book is to introduce relevant techniques of biological language modeling in bioinformatics and promote the development of protein sequence–structure–function mapping.

1.2 Related Topics

1.2.1 Linguistic feature analysis of protein sequences

Protein sequences are similar to the sentences seen in natural language, as both are made up of linear arrangements of basic units. The mapping of sequences to the structures and functions of proteins is conceptually similar to the mapping of words to meanings. This analogy has been studied by a growing body of research,⁴ but are there any linguistic features in protein sequences? What are the basic units in protein sequence language?
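One simple way to start probing the question of basic units is to treat short runs of consecutive residues (n-grams) as candidate "words" and count how often each occurs. The following is a minimal sketch under that assumption, not a method taken from this chapter; the function name `ngram_counts` and the toy sequence are hypothetical.

```python
from collections import Counter


def ngram_counts(sequence: str, n: int = 2) -> Counter:
    """Count overlapping n-grams (runs of n consecutive residues)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty range and so an empty Counter.
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical toy sequence for illustration only, not a real protein.
toy_sequence = "MKVLAAGMKV"
counts = ngram_counts(toy_sequence, n=2)
print(counts.most_common(3))
```

Comparing such counts across many sequences, or against shuffled sequences, is one way to ask whether some residue combinations behave like recurring words.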

1.2.2 Amino acid encoding for protein sequence

In general, protein sequences are represented using the twenty letters of the amino acid alphabet. Such a representation cannot be processed directly by machine learning methods; it must first be converted to a digital (numerical) representation. Obtaining this representation for each amino acid⁵,⁶ is therefore the first step of machine-learning-based protein structure and function prediction methods, and an effective digital representation⁷ is crucial to their final success.
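As an illustration of what a digital representation can look like, the sketch below uses one-hot encoding, in which each residue becomes a 20-dimensional vector with a single 1. This is only one of many possible encodings and is shown here as an assumption, not as the scheme recommended by the text; the names `AMINO_ACIDS`, `INDEX` and `one_hot_encode` are hypothetical, and mapping unknown letters to an all-zero vector is a choice made for this example.

```python
# The 20 standard amino acids in one-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional 0/1 vector per residue in the sequence."""
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        position = INDEX.get(residue)
        # Assumption for this sketch: letters outside the 20 standard
        # residues (for example "X") are left as an all-zero vector.
        if position is not None:
            vector[position] = 1
        encoded.append(vector)
    return encoded


# Hypothetical toy peptide for illustration only.
matrix = one_hot_encode("ACDX")
for residue, row in zip("ACDX", matrix):
    print(residue, row)
```

One-hot vectors carry no information about the physicochemical similarity between residues, which is one reason the choice of encoding matters for downstream prediction methods.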

1.2.3 Remote homology detection

With the rapid growth in the number of completely sequenced genomes, a large amount of sequence data has been deposited in databases, and the structure and function of these sequences now need to be elucidated. In general, the easiest way to annotate newly sequenced proteins is to transfer annotations from well-characterized homologous proteins.⁸ The development of novel algorithms for protein homology detection is therefore of great importance,⁹,¹⁰ especially since remote homology detection (the detection of homologous relationships at low sequence identity) remains a challenging problem in computational biology.¹¹,¹²
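To make "sequence identity" concrete, the sketch below computes percent identity between two sequences that have already been aligned. It is not a homology detection method. The function name `percent_identity` is hypothetical, and the convention of skipping any column that contains a gap is an assumption; other definitions of the denominator are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        # Assumption for this sketch: columns with a gap are not counted.
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment for illustration only.
identity = percent_identity("MKV-LAAG", "MRVQLSAG")
print(f"{identity:.1f}% identity")
```

Remote homologs are hard to detect precisely because this value can be low even when two proteins share a common ancestor, so identity alone does not separate them from unrelated sequences.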

1.2.4 Structure prediction

With the success of a series of genome-sequencing projects, the number of known protein sequences has grown exponentially. The amount of sequence data in current molecular databases far exceeds the amount of structural data, yet structural information is very important for revealing the biological function of proteins. However, because structural biology experiments are technically difficult and laborious, the rate of protein structure determination lags far behind the growth in the number of sequences. Studying protein structure prediction¹³ has great theoretical and …

Biological Language Model. Qiwen Dong
