Title: Biological Language Model
Author: Qiwen Dong
Publisher: Ingram
Genre: Medicine
Series: East China Normal University Scientific Reports
ISBN: 9789811212963
Proteins play a key role in basic biological processes. As the material basis of life, they catalyze almost all chemical reactions in cells, regulate gene activity and participate in the formation of most cell structures. Given this central role, the study of protein structure and function has always been a focus of life science research.
Protein sequences resemble sentences in natural language: both are linear arrangements of basic units. The mapping of protein sequences to structures and functions is conceptually similar to the mapping of words to meanings. A growing body of research has explored this analogy, but two questions remain open: do protein sequences have linguistic features, and what are the basic units of a protein sequence language? Large amounts of genomic protein sequence data for Homo sapiens and other organisms have recently become available, together with a growing body of protein structure and function data. The expected exponential increase in the amount of data over the coming decade creates an opportunity to attack the sequence–structure–function mapping problem with sophisticated data-driven methods. Such methods have proven immensely successful in the domain of natural language.
The purpose of this book is to introduce the relevant techniques of biological language modeling into bioinformatics and to promote progress on protein sequence–structure–function mapping. To this end, the book first analyzes the linguistic features of protein sequences and explores several amino acid encoding schemes. It then investigates several research topics, including remote homology detection, protein structure prediction and protein function prediction, using biological language model approaches. Finally, it gives a brief summary and a perspective on future work. We hope that this book will be helpful for research in bioinformatics, especially on the mapping of protein sequences to their structure and function.
Qiwen Dong
Xiuzhen Hu
Xiaoyang Jing
Aoying Zhou
Acknowledgments
This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000905 and the National Natural Science Foundation of China under Grant Nos. U1401256, U1711262, U1811264, 61672234, 61961032, 31260203 and 61402177.
We would like to thank everyone who contributed to this book and offered valuable suggestions, especially Bin Liu, Ming Gao, Dingjiang Huang and Daocheng Hong. We would also like to express our sincere thanks and appreciation to the people at University Press for their generous help throughout the preparation of this publication.
Contents
East China Normal University Scientific Reports
1.3 Organization of the Book Content
2. Linguistic Feature Analysis of Protein Sequences
2.2 Comparative n-gram Analysis
2.4 Distinguishing the Organisms by Uni-Gram Model
3. Amino Acid Encoding for Protein Sequence
3.4 The Assessment of Encoding Methods for Protein Secondary Structure Prediction
3.5 Assessments of Encoding Methods for Protein Fold Recognition
4.2 Related Work
4.3 Latent Semantic Analysis
4.4 Auto-cross Covariance Transformation
4.5 Conclusions
References