Title: Semantic Web for Effective Healthcare Systems
Author: Group of authors
Publisher: John Wiley & Sons Limited
Genre: Software
ISBN: 9781119764151
Figure 1.6 LDA framework.
For example, suppose word 2 is categorized under two different topics, say "topic 1" and "topic 2." The context of this word varies and is determined by the co-occurrence of other words. So word 2 used in the context of "topic 1" is more relevant to "Doc 1," and the same word used in the context of "topic 2" is more relevant to "Doc 2." Identifying latent concepts in this way improves the accuracy of feature categorization.
LDA is a matrix factorization technique. It reduces the term–document (TD) matrix into two low-dimensional matrices, M1 and M2. M1 is a document–topic matrix (N × K) and M2 is a topic–term matrix (K × M), where N is the number of documents, K is the number of topics, and M is the number of terms. LDA uses sampling techniques to improve these matrices. The model assumes that every word–topic assignment is correct except that of the current word. The technique iterates over each word "w" of each document "d" and adjusts the current topic assignment of "w" using the product of two probabilities p1 and p2, where p1 is p(topic t | document d), the proportion of words in document "d" currently assigned to topic "t," and p2 is p(word w | topic t), the probability of assigning topic "t" to word "w" over all the documents. A steady state is reached after a sufficient number of iterations, at which point the word–topic and topic–document distributions are reasonably good.
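A minimal sketch of this resampling step is given below (Python with NumPy). The count-matrix names (`doc_topic`, `topic_word`, `topic_totals`) and the smoothing with α and β are illustrative assumptions, not the book's notation:

```python
import numpy as np

def resample_word_topic(d, w, z_current, doc_topic, topic_word, topic_totals, alpha, beta, V):
    """One collapsed-Gibbs update: reassign word w in document d,
    assuming every other word-topic assignment is held fixed."""
    # Remove the current assignment of this word from the counts
    doc_topic[d, z_current] -= 1
    topic_word[z_current, w] -= 1
    topic_totals[z_current] -= 1

    # p1 ∝ p(topic t | document d), p2 ∝ p(word w | topic t)
    p1 = doc_topic[d] + alpha
    p2 = (topic_word[:, w] + beta) / (topic_totals + V * beta)
    probs = p1 * p2
    probs /= probs.sum()

    # Draw the new topic for word w and restore the counts
    z_new = np.random.choice(len(probs), p=probs)
    doc_topic[d, z_new] += 1
    topic_word[z_new, w] += 1
    topic_totals[z_new] += 1
    return z_new
```

Repeating this update for every word of every document over many passes is what drives the two matrices toward the steady state described above.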
For the LDA model, the number of topics K has to be fixed in advance. LDA assumes the following generative process for a document w = (w1, …, wN) of a corpus D containing N words drawn from a vocabulary of V different terms, wi ϵ {1, …, V} for all i ϵ {1, …, N}. LDA consists of the following steps [12]:
(1) For each topic k, draw a distribution over words Φ(k) ~ Dir(β).
(2) For each document d:
(a) Draw a vector of topic proportions θ(d) ~ Dir(α).
(b) For each word i:
(i) Draw a topic assignment zd,i ~ Mult(θd), zd,i ϵ {1, …, K}.
(ii) Draw a word wd,i ~ Mult(Φzd,i), wd,i ϵ {1, …, V}.
where α is a Dirichlet prior on the per-document topic distribution, and β is a Dirichlet prior on the per-topic word distribution. Let θtd be the probability of topic t for document d, zd,i be the topic assignment of word i, and let Φtw be the probability of word w in topic t. The probability of generating word w in document d is:

p(w | d) = Σt Φtw θtd   (1.2)
Equation 1.2 gives the weighted average of the per-topic word probabilities, where the weights are the per-document topic probabilities. The resulting distribution p(w|d) varies from document to document, as the topic weights change among documents. Corpus documents are fitted to the LDA model by inferring a collection of hidden variables. These variables are denoted by θ = {θtd}, the |K| × |D| matrix of per-document topic weights, and Φ = {Φtw}, the |K| × |N| matrix of per-topic word weights. Inference for LDA is the problem of determining the joint posterior distribution of θ and Φ after observing a corpus of documents, given the LDA parameters.
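To make Equation 1.2 concrete, the following sketch computes p(w|d) for a toy corpus; the matrix values are made up for illustration and are not taken from the book:

```python
import numpy as np

K, D, N = 2, 2, 4          # topics, documents, vocabulary terms (toy sizes)

# theta[t, d]: probability of topic t in document d (each column sums to 1)
theta = np.array([[0.8, 0.3],
                  [0.2, 0.7]])

# phi[t, w]: probability of term w under topic t (each row sums to 1)
phi = np.array([[0.5, 0.3, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])

# Equation 1.2: p(w|d) = sum over t of phi[t, w] * theta[t, d]
p_w_given_d = theta.T @ phi        # shape (D, N); row d is the distribution p(.|d)
print(p_w_given_d)
print(p_w_given_d.sum(axis=1))     # each row sums to 1, as a proper distribution
```

Because each document has its own column of θ, the mixture in each row of `p_w_given_d` differs, which is exactly how the word distribution varies from document to document.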
The simple LDA model gives term–topic probabilities for all terms under each topic. In the literature [59, 60], only the top 5 or 10 terms under each topic were selected for modeling. The CFSLDA model (Contextual Feature Selection LDA), however, selects the set of probable terms from the data set that represents the topic or concept of a domain. It builds the contextual model using LDA together with a correlation technique to select the list of probable and correlated terms under each topic (or feature) for the data set; an illustrative sketch of this selection step is given after Figure 1.7. The lists of terms represent the topic or concept of a domain and establish the context between the terms. The plate notation of CFSLDA topic modeling is shown in Figure 1.7. The notations used in the CFSLDA model are:
D—number of documents
N—number of words or terms
K—number of topics
α—a Dirichlet prior on the per-document topic distribution
β—a Dirichlet prior on the per-topic word distribution
θtd—probability of topic t for document d
Φtw—probability of word w in topic t
zd,i—topic assignment of term “i”
wd,i—word assignment of term “i”
C—correlation between the terms
Figure 1.7 Plate notation of CFSLDA model.
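The chapter describes the correlation-based selection only at a high level, so the following is one possible reading, sketched under assumptions: a term is kept for a topic when its topic probability exceeds a threshold and its occurrence profile across documents correlates with that of the topic's most probable term. The thresholds and the helper name `select_contextual_terms` are hypothetical, not part of the CFSLDA specification:

```python
import numpy as np

def select_contextual_terms(phi, td_matrix, prob_min=0.01, corr_min=0.3):
    """For each topic, keep terms that are both probable under the topic (phi)
    and correlated with the topic's most probable term across documents.
    phi:       K x N topic-term probability matrix from LDA
    td_matrix: N x D term-document frequency matrix
    Returns a dict mapping topic index -> list of selected term indices."""
    # Pearson correlation between term occurrence profiles (rows of td_matrix)
    term_corr = np.corrcoef(td_matrix)
    selected = {}
    for k, topic in enumerate(phi):
        top_term = int(np.argmax(topic))               # anchor term for topic k
        candidates = np.where(topic >= prob_min)[0]    # probable terms under topic k
        keep = [w for w in candidates
                if w == top_term or term_corr[top_term, w] >= corr_min]
        selected[k] = keep
    return selected
```

The intended effect is the one described above: each topic ends up with a list of probable, mutually related terms that establish the context of that topic, rather than only its top 5 or 10 terms.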
LDA associates documents with a set of topics, where each topic is a set of words. Using the LDA model, the next word is generated by first selecting a random topic from the set of topics T, and then choosing a random word from that topic's distribution over the vocabulary W. The hidden variables θ and Φ are determined by fitting the LDA model to a set of corpus documents. The CFSLDA model uses Gibbs sampling to perform topic modeling of text documents. Given values for the Gibbs settings (b, n, iter), the LDA hyper-parameters (α, β, and K), and the TD matrix M, a Gibbs sampler produces "n" random observations from the inferred posterior distribution of θ and Φ [60].
Here, the Gibbs parameters are "b," the number of burn-in iterations; "n," the number of samples to keep; and "iter," the sampling interval. The Gibbs sequences produce θ and Φ from the desired distribution, but only after a large number of iterations; for this reason, it is necessary to discard (or burn) the initial "b" observations [60]. The setting "n" determines how many observations from the two Gibbs sequences are kept, and "iter" specifies how many iterations the Gibbs sampler runs before returning the next useful observation [60]. The procedure of the CFSLDA model is shown next.
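The sketch below only illustrates how the settings b, n, and iter interact in such a sampler; it assumes a `gibbs_sweep` routine (for example, one built from the resampling step sketched earlier) and is not the CFSLDA procedure itself:

```python
def run_gibbs(gibbs_sweep, state, b=1000, n=50, iter=10):
    """Collect n posterior observations of (theta, phi) from a Gibbs sampler.
    gibbs_sweep(state) is assumed to perform one full pass over all words
    and return the current (theta, phi) estimates.
    b    : burn-in sweeps whose observations are discarded
    n    : number of observations to keep
    iter : sweeps between successive kept observations (sampling interval)"""
    samples = []
    # Burn-in: let the chain move away from its arbitrary initial state
    for _ in range(b):
        gibbs_sweep(state)
    # Keep every iter-th observation until n samples are collected
    while len(samples) < n:
        for _ in range(iter):
            theta, phi = gibbs_sweep(state)
        samples.append((theta, phi))
    return samples
```

Discarding the first b sweeps and spacing the kept observations iter sweeps apart reduces the influence of the starting point and the autocorrelation between successive samples, which is the role the text assigns to these settings.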
1.5 Ontology Development
Ontology development includes various approaches such as Formal Concept Analysis (FCA) and Ontology Learning. FCA applies a user-driven, step-by-step methodology for creating domain models, whereas Ontology Learning refers to the task of automatically creating a domain ontology by extracting concepts and relations from a given data set [27]. This chapter