Marcos

Attention models: a brief overview

Machine learning has revolutionized various fields over the past decade, including computer vision, natural language processing, and speech recognition. Attention models have emerged as a powerful technique in machine learning, enabling models to selectively focus on relevant parts of the input, which has resulted in significant performance improvements across a variety of tasks. Since their first proposal for neural machine translation, attention models have rapidly evolved and have become a key component of many state-of-the-art machine learning models. In this article, we provide a brief history of attention models in machine learning, including their evolution, major breakthroughs, current advancements, and their impact on the field. We focus mostly on temporal attention, but the core ideas are similar for other types, such as spatial attention.

We will start by briefly reviewing the foundations, such as the basic concepts of Recurrent Neural Networks (RNNs). Then we cover the history of the attention mechanism, its definition, and the different formulations proposed in the literature.

Recurrent Neural Networks

This neural network family is commonly adopted when dealing with sequential data x₁, ..., xₜ. The main idea is that the outputs from the previous step are fed to the current step, creating a recurrent dependency among the outputs. This means that, in theory, RNNs can “memorize” computations made over a long period to produce the current response, although this does not happen in practice. RNNs have proven effective in learning time-dependent signals whose structure varies over short periods. However, when there are long-term dependencies in the data, these methods suffer from the “vanishing gradient” problem, which occurs when gradients propagated through many steps tend to “vanish”, i.e., shrink close to zero [1]. The best-known architecture for addressing this problem is the Long Short-Term Memory (LSTM) network [2]. LSTMs follow a gated scheme that creates paths through time along which gradients can flow for long durations.
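
To make the recurrence concrete, here is a minimal sketch of a vanilla RNN step in NumPy (all dimensions and weights are hypothetical and randomly initialized, purely for illustration): the previous hidden state is fed back into the computation of the current one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
input_dim, hidden_dim, seq_len = 4, 8, 5

# Parameters of a vanilla RNN cell.
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (the recurrence)
b = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, input_dim))  # a toy sequence x_1, ..., x_T
h = np.zeros(hidden_dim)                   # initial hidden state

hidden_states = []
for t in range(seq_len):
    # The previous hidden state h is fed back into the current step.
    h = np.tanh(W_x @ x[t] + W_h @ h + b)
    hidden_states.append(h)

print(np.stack(hidden_states).shape)  # (5, 8): one hidden state per time step
```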

Attention Mechanism

RNNs can be used to map an input sequence to an output sequence, and the two usually have different lengths. This idea can be used in several applications, such as speech recognition, question answering, and machine translation. An RNN encoder processes the input and emits a context vector C, usually computed using an aggregation function over the encoder hidden states. Subsequently, an RNN decoder, based on the fixed-length vector C, generates the output sequence Y = (y₁, ..., yₜ). The major difference between this model and other architectures is that the input and output sizes can vary. Sutskever et al. [3] independently developed an encoder-decoder architecture that obtained state-of-the-art results on the English-to-French translation task from the Conference on Machine Translation (WMT) 2014 workshop.
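
The sketch below illustrates this encoder-decoder pattern with toy, untrained weights (the names, sizes, and the mean-pooling aggregation are assumptions for illustration, not the design of any specific paper): the encoder compresses the whole input into a single fixed-length vector C, and the decoder generates every output step from that same C.

```python
import numpy as np

rng = np.random.default_rng(1)
in_dim, hid, out_len = 4, 8, 3

# Toy encoder/decoder weights (illustrative only, not a trained model).
We_x, We_h = rng.normal(scale=0.1, size=(hid, in_dim)), rng.normal(scale=0.1, size=(hid, hid))
Wd_h, Wd_c = rng.normal(scale=0.1, size=(hid, hid)), rng.normal(scale=0.1, size=(hid, hid))
W_out = rng.normal(scale=0.1, size=(in_dim, hid))

def encode(x_seq):
    """Run the encoder RNN and aggregate its hidden states into one context vector C."""
    h = np.zeros(hid)
    states = []
    for x in x_seq:
        h = np.tanh(We_x @ x + We_h @ h)
        states.append(h)
    return np.mean(states, axis=0)  # one possible aggregation; the last state is another

def decode(C, steps):
    """Generate an output sequence conditioned only on the fixed-length vector C."""
    s = np.zeros(hid)
    outputs = []
    for _ in range(steps):
        s = np.tanh(Wd_h @ s + Wd_c @ C)  # every step sees the same context C
        outputs.append(W_out @ s)
    return outputs

src = rng.normal(size=(6, in_dim))        # source sequence of length 6
print(len(decode(encode(src), out_len)))  # output sequence of length 3 (sizes can differ)
```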

A potential issue with this encoder-decoder approach is that the neural network needs to encode all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the network to cope with long sentences, where the final RNN representation fails to capture important information from earlier parts of the sentence, partly due to problems such as vanishing or exploding gradients. A more effective approach is to read the whole sequence and then produce the translated words one at a time, each time focusing on different relevant parts of the input sequence [4]. This mechanism was proposed by Bahdanau et al. [5], who use the context vector to align source and target by “attending to” certain parts of the input. Figure 1 illustrates the proposed attention method.

Figure 1: Illustration of the attention mechanism, known as soft attention, proposed by Bahdanau et al. [5] for neural machine translation. The method works by computing a weighted average of all the hidden representations h, with attention weights α, to form the context vector c. The attention weights α(t) are continuous values in [0, 1] learned by a feedforward neural network that is jointly trained with the proposed model. Figure reproduced from Goodfellow et al. [4].

More specifically, the network learns the attention weights by incorporating an additional Feedforward Network (FFN), trained jointly with the main architecture, that produces the attention weights as a function of a candidate hidden state and a query state [6]. The whole idea is inspired by neuroscience, based on the observation that many animals focus on specific parts of their visual input to compute adequate responses [7]. The idea has been successfully translated into neural networks so that models focus their actions on relevant regions rather than using all available information.

Typically, there are three types of attention: (1) hard attention, (2) soft attention, and (3) self-attention. In soft attention, we compute a weight aᵢ for each input xᵢ and use these weights to calculate a weighted average over the xᵢ, which serves as the recurrent network input. The weights sum to 1 and can be interpreted as the probability that xᵢ is the area we should pay attention to. Hard attention instead employs a stochastic sampling process to focus on specific regions of the input. Self-attention, first introduced for machine reading by Cheng et al. [8], computes a representation of an input sequence by relating different positions of the sequence to each other (Figure 2). The authors observed that the self-attention mechanism enables the LSTM to learn the correlation between the current word and past words of the sentence. Xu et al. [9] explored soft and hard attention for the neural image captioning task, in which the model needs to generate captions given an image. The authors adopted an encoder-decoder design where a CNN (encoder) provides features to an LSTM (decoder) modified with an attention mechanism that allows the decoder to focus on relevant parts of the input. For this purpose, the decoder uses the previous hidden state, the previously generated word, and the context vector to generate the next word through an attention function.

Figure 2: Self-attention scores of the model proposed by Cheng et al. [8] for machine reading. The word currently being analyzed is shown in red, and the blue shading represents the importance of each word in the sentence.
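
Returning to the soft/hard distinction above, a rough sketch (toy inputs and scores; in practice the scores eᵢ are learned): soft attention takes a weighted average whose weights sum to 1, while hard attention stochastically samples a single position from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy inputs x_1, ..., x_5 and unnormalized relevance scores e_i (learned in a real model).
X = rng.normal(size=(5, 8))
e = rng.normal(size=5)

# Soft attention: weights a_i sum to 1 and act like a probability of attending to x_i.
a = np.exp(e) / np.exp(e).sum()
soft_context = a @ X          # weighted average of the inputs

# Hard attention: stochastically sample a single position according to the same distribution.
idx = rng.choice(len(X), p=a)
hard_context = X[idx]         # only one input is attended to

print(a.sum(), soft_context.shape, idx)
```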

In short, the attention mechanism can be seen as a dynamic pooling in which the weights are learned along with the training. In some cases, these weights can be used as a tool to provide interpretability to the model, although this interpretation does not hold in every scenario [10]. This is an important feature given the growing interest in fairness and transparency in machine learning. Doughty et al. [11] employed multiple attention filters to discover the essential parts of long videos for the skill determination task. They also introduced a loss function that encourages the attention filters to be complementary.

Another interesting and intuitive attention-based method was proposed by Hermann et al. [12] for reading comprehension of real documents. In addition to introducing a new supervised reading comprehension dataset, covering a gap in the literature, the authors built four deep learning models incorporating attention mechanisms in RNNs. The attention weights allow the models to focus on specific parts of the document to answer the question. Comparing the attention-based models to a range of baselines and traditional heuristic methods, the authors obtained state-of-the-art results on the proposed dataset. Figure 3 shows the attention heat maps obtained by one of the proposed attention-based methods.

Figure 3: Attention scores of one of the attention-based methods proposed by Hermann et al. [12] for the reading comprehension task. The model works by focusing on the specific parts of the document that best fit the question at the bottom. The crucial point here is that a large amount of text is ignored in this context.

Motivated by the computational inefficiency of RNNs, whose sequential computation inhibits parallelization, Vaswani et al. [13] introduced the Transformer network for machine translation. This architecture is built without recurrences or convolutions, using only attention blocks. The model consists of six encoder blocks and six decoder blocks, each of which is built from the same modules: a Position-wise Feed-Forward Network and Multi-Head Self-Attention. First, the Multi-Head Self-Attention layer helps the encoder look at other words in the input sentence as it encodes a specific word. This module is called Multi-Head because several attention layers are stacked in parallel, each operating on a different linear transformation of the same input [14]. The output of the Multi-Head Self-Attention step is then fed into the Position-wise Feed-Forward Network, which consists of two linear transformations with a ReLU activation in between. The Transformer achieved state-of-the-art performance on English-to-German and English-to-French translation while allowing significantly more parallelism, producing higher translation accuracy without using any recurrent component.

Figure 4 illustrates the whole architecture of the Transformer proposed by Vaswani et al.

Figure 4: Transformer network proposed by Vaswani et al. [13] for machine translation. The authors introduce a new encoder-decoder architecture. The encoder is composed of a stack of multi-head attention and feed-forward layers, each with a residual connection and a layer-normalization step. The decoder works similarly, with the addition of a third sub-layer that applies multi-head attention over the output of the encoder.
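
The two modules described above can be sketched as follows. This is a simplified, untrained illustration with toy dimensions (residual connections, layer normalization, masking, and positional encodings are omitted), not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, n_heads, seq_len, d_ff = 16, 4, 6, 32
d_k = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))  # one encoded sentence (toy values)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

def multi_head_self_attention(X):
    """Each head applies its own linear projections, attends, and the results are concatenated."""
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = softmax(Q @ K.T / np.sqrt(d_k))  # every position attends to every position
        heads.append(scores @ V)
    Wo = rng.normal(scale=0.1, size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

def position_wise_ffn(X):
    """Two linear transformations with a ReLU in between, applied to each position independently."""
    W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
    W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
    return np.maximum(0, X @ W1) @ W2

out = position_wise_ffn(multi_head_self_attention(X))
print(out.shape)  # (6, 16): same length as the input sequence
```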

Due to the success of the Transformer, many variants have been proposed in the literature, improving the original model in terms of computation and memory efficiency. The main focus of improvement is the self-attention module, which computes similarity scores for all pairs of sequence positions. The original formulation is O(n²) in both time and space, which degrades efficiency for longer input sequences. Most of the proposed methods are based on the concept of sparse attention, which applies attention over only a subset of the input sequence [15]. One of the first approaches to cope with this problem was the Image Transformer (Parmar et al. [16]). For a more detailed review of Transformer-based approaches focusing on efficiency improvements, please refer to Tay et al. [15].
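
One simple instance of the sparse-attention idea is local (windowed) attention, where each position only attends to a fixed-size neighbourhood rather than the full sequence. The sketch below is a generic illustration of that idea with toy values, not the method of any specific paper above.

```python
import numpy as np

rng = np.random.default_rng(4)
seq_len, d, window = 8, 16, 2  # each position attends only to +/- 2 neighbours

Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

out = np.zeros_like(V)
for i in range(seq_len):
    lo, hi = max(0, i - window), min(seq_len, i + window + 1)
    scores = Q[i] @ K[lo:hi].T / np.sqrt(d)          # O(window) scores per position,
    weights = np.exp(scores) / np.exp(scores).sum()  # instead of O(n) in full self-attention
    out[i] = weights @ V[lo:hi]

print(out.shape)  # (8, 16)
```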

Attention mechanisms are an effective way for neural networks to enhance their capability and interpretability [17]. In sequence learning, attention is broadly employed in encoder-decoder models to address the limitation of encoding the input sequence into one fixed-length vector from which every output time step must be decoded. To address this, attention mechanisms learn a vector of scores, one for each observed time step t, representing its relevance. The attention module tells the decoder to look more at targeted sub-components of the source to be translated. Pei et al. [17] combined gated RNNs and attention networks to detect the salient parts of a sequence and encode this information through a custom RNN. The temporal attention weights returned by this mechanism provide a meaningful measure of the salience of each time step in the sequence, which gives the model a higher degree of interpretability. They showed that the learned weights automatically filter out noisy parts of the sequence, producing a reasonable interpretation of each time step's relevance to the model (see Figure 5). One disadvantage is the use of the last hidden representation as input to the fully connected network: since it contains information from all previous frames, it retains high values throughout the video.

Figure 5: Overview of the Temporal Attention-Gated Model architecture [17]. At the bottom is the Temporal Attention Module, which generates attention weights for each frame. At the top, Recurrent Attention-Gated Units use these weights to refine the internal representation.

We can express an attention mechanism as a function that maps an input vector to an output based on a weight vector, where the weight vector represents the relevance of each input. Internally, the network learns to focus on specific salient “regions” and to capture somewhat global information rather than inferring solely from one hidden state. Focusing on one instance of attention that has been commonly used in machine translation [5], we can describe the attention mechanism for video classification by the following steps.

Given an input sequence h of length N, the attention mechanism produces a context vector c, computed as a weighted sum of the hᵢ, where the alignment scores are the weights:

c = ∑ᵢ₌₁ ᴺ aᵢhᵢ

Each weight aᵢ is computed by

aᵢ = exp(eᵢ) / ∑ⱼ₌₁ᴺ exp(eⱼ),

eᵢ = α(sₜ₋₁, hᵢ),

where α, known as the alignment model, is a single-layer neural network that computes a matching score between the inputs around the current position i. The network computes each weight based on the annotation vector hᵢ and the previous hidden state vector sₜ₋₁ of the RNN decoder. The annotation vectors are often referred to as keys, since the attention model seeks the weighted average of the input vectors that is most related to them. This network can be jointly trained with the other components of the RNN.
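
Putting the formulas above into code, here is a minimal sketch of this soft-attention step, assuming hypothetical sizes, a randomly initialized single-layer alignment network α, and no training loop:

```python
import numpy as np

rng = np.random.default_rng(5)
N, h_dim, s_dim, att_dim = 6, 8, 8, 10

h = rng.normal(size=(N, h_dim))  # encoder annotations h_1, ..., h_N (the "keys")
s_prev = rng.normal(size=s_dim)  # previous decoder hidden state s_{t-1} (the "query")

# Single-layer alignment model alpha(s_{t-1}, h_i), jointly trainable with the rest of the network.
W_s = rng.normal(scale=0.1, size=(att_dim, s_dim))
W_h = rng.normal(scale=0.1, size=(att_dim, h_dim))
v = rng.normal(scale=0.1, size=att_dim)

e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_i) for h_i in h])  # e_i = alpha(s_{t-1}, h_i)
a = np.exp(e) / np.exp(e).sum()                                     # softmax -> attention weights
c = a @ h                                                           # context vector c = sum_i a_i h_i

print(a.round(2), c.shape)
```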

Zadeh et al. [18] proposed a new neural architecture for multimodal sequential learning called the Memory Fusion Network (MFN), which explicitly accounts for both view-specific and cross-view interactions and continuously models them through time. This model has three modules: 1) System of LSTMs: multiple LSTMs, one for each view, encoding view-specific interactions; 2) Delta-memory Attention Network: a special attention mechanism to discover temporal interactions across the System of LSTMs; 3) Multi-view Gated Memory: storing cross-view interactions over time. With its best configuration, this approach achieved state-of-the-art results on six different multimodal datasets.
Girdhar et al. [19] extended the well-known Transformer network to action recognition and localization in videos. This architecture discards RNNs, replacing them with multi-head attention modules that learn to attend to the relevant regions of the frames. Figure 6 shows an overview of the architecture. Meng et al. [20] introduced a model that combines spatial and temporal attention simultaneously, employing a set of regularizers to force the attention models to attend to coherent regions of the videos. The model was evaluated for action classification, and also for action localization trained in a weakly supervised manner.

Figure 6: Diagram of the Action Transformer network for action localization in videos, proposed by Girdhar et al. [19].

In summary, the attention mechanism has been widely adopted in the literature for its simplicity (in the case of soft attention) and the benefits it brings. One of these benefits is the ordering of the inputs by relevance, which can implicitly help filter input noise. Another important feature is the interpretability of the weights associated with the inputs xᵢ, which can serve as a good indicator of the relevance of each time step to the scene. This may be useful in systems where not only the final prediction is needed, but also some other indicator that reinforces that decision. Finally, the mechanism is flexible enough to be added anywhere in the network, as long as it makes sense for the desired purpose.

And, yes, the cover image of this post was automatically generated by an attention-based deep neural network :)

References

[1] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
[6] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive survey of attention models. arXiv preprint arXiv:1904.02874, 2019.
[7] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[8] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Conference on Empirical Methods in Natural Language Processing, pages 551–561, 2016.
[9] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[10] Sofia Serrano and Noah A. Smith. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy, 2019.
[11] Hazel Doughty, Walterio Mayol-Cuevas, and Dima Damen. The pros and cons: Rank-aware temporal attention for skill determination in long videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7862–7871, 2019.
[12] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701, 2015.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
[14] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive survey of attention models. arXiv preprint arXiv:1904.02874, 2019.
[15] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020.
[16] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064, 2018.
[17] Wenjie Pei, Tadas Baltrušaitis, David M. J. Tax, and Louis-Philippe Morency. Temporal attention-gated model for robust sequence classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 820–829, 2017.
[18] Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927, 2018.
[19] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 244–253, 2019.
[20] Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Wei Sun, Frederick Tung, and Leonid Sigal. Interpretable spatio-temporal attention for video action recognition. In IEEE International Conference on Computer Vision Workshops, 2019.
