Title: Deep Learning Approaches to Text Production
Author: Shashi Narayan
Publisher: Ingram
Genre: Programs
Series: Synthesis Lectures on Human Language Technologies
ISBN: 9781681738215
where ⊙ is the Hadamard product, followed by a sum over all elements, ReLU is the rectified linear activation, and b ∈ R is a bias term. The ReLU activation is often used because it is easier to train and often achieves better performance than the sigmoid or tanh functions [Krizhevsky et al., 2012]. Max-pooling over time [Collobert et al., 2011] is then applied to the feature map f to obtain f_max = max(f) as the feature corresponding to this particular filter K. Multiple filters K_h of width h are often used to compute a list of features f_{K_h}. In addition, filters of varying widths are applied to learn a set of feature lists.
Figure 3.3: Convolutional neural network for sentence encoding.
We describe in Chapter 5 how such convolutional sentence encoders can be useful for better input understanding for text production. Importantly, through the use of convolutional filters, CNNs facilitate sparse interactions, parameter sharing and equivariant representations. We refer the reader to Chapter 9 of Goodfellow et al. [2016] for more details on these properties.
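As a concrete illustration of these operations, the following minimal NumPy sketch slides filters of widths 2 and 3 over a matrix of word embeddings, combines each window with the filter via a Hadamard product and sum, adds a bias, applies a ReLU, and max-pools over time to keep one feature per filter. The dimensions, filter widths, and random values are hypothetical, and the code is only a sketch of the idea, not an implementation used in the book.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv_feature_map(X, K, b):
    """Slide filter K (width h x embedding dim d) over the sentence matrix X (n x d):
    Hadamard product with each window, sum over all elements, add bias, apply ReLU."""
    h = K.shape[0]
    n = X.shape[0]
    feats = [relu(np.sum(K * X[i:i + h]) + b) for i in range(n - h + 1)]
    return np.array(feats)              # feature map f for this filter

# Toy sentence of 5 tokens with hypothetical 4-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

sentence_features = []
for h in (2, 3):                        # filters of varying widths
    K = rng.normal(size=(h, 4))         # one filter of width h
    b = 0.1                             # bias term
    f = conv_feature_map(X, K, b)
    sentence_features.append(f.max())   # max-pooling over time
print(sentence_features)                # one feature per filter
```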
3.1.2 RECURRENT NEURAL NETWORKS
Feed-forward networks and CNNs fail to adequately represent the sequential nature of natural languages. In contrast, RNNs provide a natural model for them.
An RNN updates its state for every element of an input sequence. Figure 3.4 presents an RNN on the left and its application to the natural language text “How are you doing?” on the right. At each time step t, it takes as input the previous state s_{t-1} and the current input element x_t, and updates its current state as s_t = f(U x_t + V s_{t-1}), where U and V are model parameters and f is a non-linear activation function (e.g., tanh). At the end of the input sequence, the final state thus provides a representation encoding information from the whole sequence.
Figure 3.4: RNNs applied to a sentence.
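The state update just described can be sketched in a few lines of NumPy. The 4-dimensional embeddings, the sizes of U and V, and the choice of tanh as the non-linearity are illustrative assumptions, not the book's exact setup.

```python
import numpy as np

def rnn_encode(X, U, V, s0=None):
    """Unroll a vanilla RNN over a sequence: s_t = tanh(U x_t + V s_{t-1})."""
    d_hidden = V.shape[0]
    s = np.zeros(d_hidden) if s0 is None else s0
    for x_t in X:                          # one state update per input element
        s = np.tanh(U @ x_t + V @ s)
    return s                               # final state encodes the whole sequence

rng = np.random.default_rng(0)
tokens = ["How", "are", "you", "doing", "?"]
X = rng.normal(size=(len(tokens), 4))      # hypothetical 4-d word embeddings
U = rng.normal(size=(8, 4))                # input-to-state weights
V = rng.normal(size=(8, 8))                # state-to-state weights
print(rnn_encode(X, U, V).shape)           # (8,) -- sentence representation
```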
Figure 3.5: Long-range dependencies. The shown dependency tree is generated using the Stanford CoreNLP toolkit [Manning et al., 2014].
Most work on neural text production has used RNNs due to their ability to naturally capture the sequential nature of the text and to process inputs and outputs of arbitrary length.
3.1.3 LSTMS AND GRUS
RNNs naturally permit taking arbitrarily long context into account, and so implicitly capture long-range dependencies, a phenomenon frequently observed in natural languages. Figure 3.5 shows an example of long-range dependencies in the sentence “The yogi, who gives yoga lessons every morning at the beach, is meditating.” A good representation learning method should capture that “the yogi” is the subject of the verb “meditating” in this sentence.
In practice, however, as the length of the input sequence grows, RNNs are prone to losing information from the beginning of the sequence due to vanishing and exploding gradients [Bengio et al., 1994, Pascanu et al., 2013]. This is because, in the case of RNNs, backpropagation is applied through a large number of layers (the multiple layers corresponding to each time step). Since backpropagation updates the weights in proportion to the partial derivatives (the gradients) of the loss, and because the unrolled RNN multiplies matrices sequentially, the gradient may become either very large or, more commonly, very small, effectively causing the weights to either explode or never change at the lower/earlier layers. Consequently, RNNs fail to adequately model the long-range dependencies of natural languages.
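This behaviour can be simulated in a toy setting: as the network is unrolled, the gradient is repeatedly multiplied by (roughly) the recurrent weight matrix, so its norm shrinks or grows geometrically with the sequence length. The matrices below are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def gradient_norm_after(V, steps):
    """Norm of a product of 'steps' identical Jacobians, mimicking how the gradient
    is repeatedly multiplied by the recurrent matrix in an unrolled RNN."""
    J = np.eye(V.shape[0])
    for _ in range(steps):
        J = J @ V
    return np.linalg.norm(J)

contracting = 0.5 * np.eye(4)    # largest singular value < 1: gradient vanishes
expanding = 1.5 * np.eye(4)      # largest singular value > 1: gradient explodes
print(gradient_norm_after(contracting, 20))   # about 1.9e-06: vanishing
print(gradient_norm_after(expanding, 20))     # about 6.7e+03: exploding
```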
Figure 3.6: Sketches of LSTM and GRU cells. On the left, i, f, and o are the input, forget, and output gates, respectively. c and
Long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] and gated recurrent unit (GRU) [Cho et al., 2014] cells have been proposed as alternative recurrent units that are better suited to learning long-distance dependencies. These units are better at learning to memorise only the part of the past that is relevant for the future. At each time step, they dynamically update their states, deciding what to memorise and what to forget from the previous input.
The LSTM cell (shown in Figure 3.6, left) achieves this using input (i), forget (f), and output (o) gates with the following operations:
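In the standard formulation, sketched here with one weight matrix and one bias per gate applied to the concatenation [h_{t-1}; x_t], the operations are as follows; the book's exact parameterisation may differ, Eqs. (3.2)-(3.4) follow the gate references in the text, and the numbering of Eqs. (3.5)-(3.7) is assumed.

```latex
\begin{align*}
  f_t &= \sigma\!\left(W_f\,[h_{t-1}; x_t] + b_f\right) \tag{3.2}\\
  i_t &= \sigma\!\left(W_i\,[h_{t-1}; x_t] + b_i\right) \tag{3.3}\\
  o_t &= \sigma\!\left(W_o\,[h_{t-1}; x_t] + b_o\right) \tag{3.4}\\
  \tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}; x_t] + b_c\right) \tag{3.5}\\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{3.6}\\
  h_t &= o_t \odot \tanh(c_t) \tag{3.7}
\end{align*}
```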
where W* and b* are LSTM cell parameters. The input gate (Eq. (3.3)) regulates how much of the new cell state to retain, the forget gate (Eq. (3.2)) regulates how much of the existing memory to forget, and the output gate (Eq. (3.4)) regulates how much of the cell state should be passed forward to the next time step. The GRU cell (shown in Figure 3.6, right), on the other hand, achieves this using update (z) and reset (r) gates with the following operations:
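Again in the standard formulation (a sketch: bias terms are omitted to match the W*-only parameters noted below, and the numbering of Eqs. (3.10) and (3.11) is assumed):

```latex
\begin{align*}
  z_t &= \sigma\!\left(W_z\,[h_{t-1}; x_t]\right) \tag{3.8}\\
  r_t &= \sigma\!\left(W_r\,[h_{t-1}; x_t]\right) \tag{3.9}\\
  \tilde{h}_t &= \tanh\!\left(W_h\,[r_t \odot h_{t-1}; x_t]\right) \tag{3.10}\\
  h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{3.11}
\end{align*}
```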
where W* are GRU cell parameters. The update gate (Eq. (3.8)) regulates how much of the candidate activation to use in updating the cell state, and the reset gate (Eq. (3.9)) regulates how much of the cell state to forget. The LSTM cell has separate input and forget gates, while the GRU cell performs both of these operations together using its update gate.
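A minimal NumPy sketch of a single LSTM step and a single GRU step is given below. It follows the standard formulations sketched above rather than the book's exact equations, and all dimensions, weight initialisations, and function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step with forget (f), input (i), and output (o) gates."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to drop from the old cell state
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: how much new content to write
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: how much of the cell state to expose
    g = np.tanh(W["g"] @ z + b["g"])   # candidate cell state
    c = f * c_prev + i * g             # updated memory cell
    h = o * np.tanh(c)                 # updated hidden state
    return h, c

def gru_step(x_t, h_prev, W):
    """One GRU step with update (z) and reset (r) gates."""
    zi = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ zi)                                      # update gate
    r = sigmoid(W["r"] @ zi)                                      # reset gate
    h_cand = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                        # blend old and new state

# Toy dimensions: 4-d inputs, 8-d hidden states (hypothetical).
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_lstm = {k: rng.normal(size=(d_h, d_h + d_in)) for k in "fiog"}
b_lstm = {k: np.zeros(d_h) for k in "fiog"}
W_gru = {k: rng.normal(size=(d_h, d_h + d_in)) for k in "zrh"}

x_t = rng.normal(size=d_in)
h, c = lstm_step(x_t, np.zeros(d_h), np.zeros(d_h), W_lstm, b_lstm)
h2 = gru_step(x_t, np.zeros(d_h), W_gru)
print(h.shape, c.shape, h2.shape)      # (8,) (8,) (8,)
```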