Biological Language Model. Qiwen Dong
…people to systematically and completely understand the whole process by which biological information is transferred from DNA to biologically active proteins, and to clarify the central dogma more fully.[14] A deeper understanding of the various phenomena of the life process ultimately promotes the rapid development of the life sciences.[15] With regard to applications, it helps people analyze disease pathogenesis, find treatment methods and design proteins with novel biological functions, thereby promoting the rapid development of medicine, agriculture and animal husbandry. Thus, developing efficient computer-based algorithms to predict high-resolution 3D protein structures from their sequences is becoming increasingly important.[16,17]

       1.2.5 Function prediction

      Proteins are among the most important molecules in biology, as they play a role in many life processes such as transcription, metabolism and regulation. It is therefore of great importance to perform functional analysis of proteins to help understand the processes of life.[18] Because of the huge number of proteins present, it is difficult to verify the function of each and every protein, so computational approaches to function prediction are necessary to assist in the functional identification of the proteome.[19] The related research[9] includes such aspects as interaction prediction and ontology-based function prediction. Since proteins perform their functions by binding to other ligands, including proteins, metal ions, DNA, RNA, etc., it is essential to predict the binding sites of proteins in order to explore protein function in further detail.[20]

      The structure of this book is organized as follows. It begins with an introduction to the proteome, the biological language model and its applications. Several research topics of the biological language model are then presented, each with a detailed introduction to the background and a description of the methods: linguistic feature analysis of protein sequences, amino acid encoding for protein sequences, protein remote homology detection, protein structure prediction and protein function prediction.

      For the linguistic feature analysis of protein sequences, the n-grams of whole-genome protein sequences from 20 organisms were extracted to obtain statistical sequence analysis results for the large number of genomic and proteomic sequences available for different organisms. Their linguistic features were analyzed using two tests developed for the analysis of natural languages and symbolic sequences: Zipf's power law and Shannon's entropy. For amino acid encoding, a comprehensive review of the available methods is presented, and these methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding and machine-learning encoding. For protein remote homology detection, latent semantic analysis is used to extract and represent, through statistical computations, the contextual-usage meaning of the words of protein sequences, and the auto-cross covariance transformation is introduced to convert protein sequences into fixed-length vectors. For protein structure prediction, a novel profile-level index is presented for protein domain linker prediction, a building-block library-based method is presented to predict the local structures and folding fragments of proteins, conformational entropy is used as an indicator of protein flexibility, and a class of novel nonlinear knowledge-based mean-force potentials is presented. For protein function prediction, profile-level interface propensities are used for binding site prediction, sequence composition information is used for gene ontology-based protein function prediction, and the n-gram biological language model from natural language processing is used to filter missing proteins. Finally, conclusions and future perspectives are given.
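      To make the two linguistic tests mentioned above more concrete, the short Python sketch below counts overlapping n-grams in a protein sequence, computes the Shannon entropy of their frequency distribution and estimates the slope of the Zipf rank-frequency relationship. It is a minimal illustration rather than the code used in the book; the example sequence, the n-gram size and the helper-function names are illustrative assumptions.

```python
# Minimal sketch (not the book's code): n-gram extraction plus the two
# linguistic tests named in the overview, Zipf's rank-frequency law and
# Shannon's entropy. The sequence, n-gram size and function names below
# are illustrative assumptions.
from collections import Counter
import math


def ngram_counts(sequence, n=3):
    """Count overlapping n-grams ("words") in a protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the n-gram frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) versus log(rank).

    A slope near -1 indicates Zipf-like behaviour, as observed for word
    frequencies in natural-language texts.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Toy sequence; in practice the counts would be accumulated over all
    # protein sequences of an organism before computing the statistics.
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
    counts = ngram_counts(seq, n=2)
    print("distinct 2-grams:", len(counts))
    print("Shannon entropy (bits):", round(shannon_entropy(counts), 3))
    print("Zipf slope:", round(zipf_slope(counts), 3))
```

      For whole-proteome analysis of the kind described above, the same counts would simply be accumulated over all protein sequences of an organism before the entropy and slope are computed.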

      [1]Wasinger V.C. Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium. Electrophoresis, 1995, 16(7): 1090–1094.

      [2]Ganapathiraju M., Balakrishnan N., Reddy R., Klein-Seetharaman J. Computational biology and language. Ambient intelligence for scientific discovery. LNAI, 2005, 3345: 25–47.

      [3]Manning C.D., Schütze H. Foundations of Statistical Natural Language Processing. 1999. Cambridge, MA: MIT Press.

      [4]Ganapathiraju M., Weisser D., Rosenfeld R., Carbonell J., Reddy R., Klein-Seetharaman J. Comparative n-gram analysis of whole-genome protein sequences. In Proceedings of the Human Language Technologies Conference, San Diego, 2002, pp. 1367–1375.

      [5]Tanaka S., Scheraga H.A. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 1976, 9(6): 945–950.

      [6]Yang K.K., Wu Z., Bedbrook C.N., Arnold F.H. Learned protein embeddings for machine learning. Bioinformatics, 2018, 34(15): 2642–2648.

      [7]Asgari E., McHardy A.C., Mofrad M.R. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep, 2019, 9(1): 3577.

      [8]Moult J., Fidelis K., Kryshtafovych A., Schwede T., Tramontano A. Critical assessment of methods of protein structure prediction (CASP) — Round XII. Proteins: Structure, Function, and Bioinformatics, 2018, 86: 7–15.

      [9]Guo Y., Yu L., Wen Z., Li M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res, 2008, 36(9): 3025–3030.

      [10]Haandstad T., Hestnes A.J., Saetrom P. Motif kernel generated by genetic programming improves remote homology and fold detection. BMC Bioinformatics, 2007, 8(1): 23.

      [11]Lingner T., Meinicke P. Remote homology detection based on oligomer distances. Bioinformatics, 2006, 22(18): 2224–2231.

      [12]Yang Y., Tantoso E., Li K.B. Remote protein homology detection using recurrence quantification analysis and amino acid physicochemical properties. J Theor Biol, 2008, 252(1): 145–154.

      [13]Li J., Cai J., Su H., Du H., Zhang J., Ding S., Liu G., Tang Y., Li W. Effects of protein flexibility and active site water molecules on the prediction of sites of metabolism for cytochrome P450 2C19 substrates. Mol Biosyst, 2016, 12(3): 868–878.

      [14]Manoharan P., Chennoju K., Ghoshal N. Target specific proteochemometric model development for BACE1 — Protein flexibility and structural water are critical in virtual screening. Mol Biosyst, 2015, 11(7): 1955–1972.

      [15]Antunes D.A., Devaurs D., Kavraki L.E. Understanding the challenges of protein flexibility in drug design. Expert Opin Drug Discov, 2015, 10(12): 1301–1313.

      [16]Yang J., Wang Y., Zhang Y. ResQ: An approach to unified estimation of B-Factor and residue-specific error in protein structure prediction. J Mol Biol, 2016, 428(4): 693–701.

      [17]Sharma A., Manolakos E.S. Efficient multicriteria protein structure comparison on modern processor architectures. Biomed Res Int, 2015, 2015: 13.

      [18]Tetko I.V., Rodchenkov I.V., Walter M.C., Rattei T., Mewes H.W. Beyond the ‘best’ match: Machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics, 2008, 24(5): 621–628.

      [19]Kim M.-S., Pinto S.M., Getnet D., Nirujogi R.S., Manda S.S., Chaerkady R., Madugundu A.K., Kelkar D.S., Isserlin R., Jain S. A draft map of the human proteome. Nature, 2014, 509(7502): 575–581.

      [20]Nanni L., Lumini A. An ensemble of K-local hyperplanes for predicting protein-protein interactions. Bioinformatics, 2006, 22(10): 1207–1210.
