The post On the Attention Mechanism first appeared on Evangelos Sariyanidi.

Approaches just before the attention mechanism ^{[2]}^{[3]} were based on a fixed context vector $c$ that summarizes the entire content of the sentence in the input language. In contrast, the attention mechanism produces a separate context vector $c_t$ per translated word $t$. The disadvantage of having a fixed context vector $c$ is that we are asking too much from it: a single vector is supposed to encompass the entire meaning of the sentence and then provide the relevant information for each word that is being translated. As sentences get longer, the words get mushed together and translation performance is expected to drop. Imagine that we have a sentence of 50 words and that we are translating the 30th word. Most of the words in the original sentence are completely irrelevant for the translation of the 30th word, yet a single context vector $c$ uses all of them, and the few words that are actually relevant are drowned in this pool of irrelevant information. The attention mechanism prevents precisely this by placing more emphasis on the right words.

Bahdanau et al.^{[4]} paved the way for the advent of attention mechanisms. Their approach was based on an RNN-based encoder-decoder framework, which is summarized below. It is helpful to understand the approach of Bahdanau et al. because it is very intuitive and helps us understand more recent approaches.

Suppose that our goal is to translate a sentence of $T$ words, $x_1, x_2, \dots, x_T$, to another language, where the translation has $T'$ words, $y_1, y_2, \dots, y_{T'}$. The approach of Bahdanau et al. used two RNNs to produce the translated words; one for encoding and another for decoding the sequence. The hidden states of the encoder are denoted as $$h_1, h_2, \dots, h_T$$ and the hidden states of the decoder as $$s_1, s_2, \dots, s_{T'}.$$

The hidden states of the encoder are computed with a rather standard, bidirectional RNN. That is, each $h_t$ is a function of the input words $\{x_t\}_t$ as well as the other hidden states. The hidden states of the decoder are somewhat more complicated; they were a bit difficult for me to grasp at the beginning (in part because I think Fig. 1 in the article of Bahdanau et al. doesn't show all dependencies). The $t$th hidden state of the decoder network is computed as $$s_t = f (s_{t-1}, y_{t-1}, c_t),$$ which means that each output word depends on (i) the previous decoder hidden state $s_{t-1}$, (ii) the previous output word $y_{t-1}$, and (iii) the *context vector* $c_t$. (The initial hidden state $s_0$ is simply a function of $h_1$; see Appendix A.2.2 in the article.)

The crucial part here is the context vector $c_t$, which, as we mentioned in the beginning, is the key ingredient of the attention mechanism. In a few words, $c_t$ is responsible for looking at the input words $\{x_t\}_t$, finding those that are most relevant to the $t$th output $y_t$, and placing higher emphasis on them (in a "soft" way). This may look like a complicated task, but it is not: $c_t$ is nothing but a weighted average of the hidden states of the encoder, $h_j$: $$c_t = \sum\limits_{j=1}^{T_x} \alpha_{tj} h_j.$$ Clearly, the crucial task here is to determine the weights $\alpha_{tj}$. And this is where things get slightly, but not too, complicated; one simply needs to allow the time to digest. The weights $\alpha_{tj}$ create some dependencies that are not clear from Figure 1 of the article of Bahdanau et al., but we'll try to make them more explicit.

The weights $\alpha_{tj}$ are determined with the following softmax function to have a set of weights that sum to 1: $$\alpha_{tj} = \frac{\exp (e_{tj})}{\sum_k \exp(e_{tk})}$$

OK, now we need to understand what the $e_{tk}$ are, but once we do, we are almost done: we'll see all the dependencies, understand how the context vector is computed, and, more importantly, grasp the whole point of the attention mechanism. The $e_{tk}$ are the *alignment scores*; a high score $e_{tk}$ indicates that the $t$th word in the translation is highly related to the $k$th word in the original, input sentence. We need to repeat this, because it is truly the **heart** of the attention mechanism and what sets it apart from all previous approaches: a high alignment score $e_{tk}$ indicates that the $k$th word in the input sentence will have a high influence when deciding the $t$th word in the output sentence. This is precisely what we mean by dynamic weight allocation; we apply a different set of weights for each output word. Of course, the important question now is how the alignment scores $e_{tk}$ are computed.

Before we move on, it must be noted that we are now entering a point where attention mechanisms start to differentiate. In other words, what we explained up to this point is fairly common across different attention approaches, but the rest of this section provides some details specific to the approach of Bahdanau et al. The scores $e_{tj}$ are determined with a learning-based approach: a standard feed-forward network (an MLP) that uses the most recent decoder state and the $j$th state of the encoder, $$e_{tj} = a(s_{t-1}, h_j).$$ This makes sense; our goal is to find how the upcoming (i.e., $t$th) word in the translation is related to the $j$th word of the input sentence, and the MLP compares the most recent decoder state with the $j$th encoder state. The MLP $a(\cdot)$ is trained jointly with all other networks (i.e., the encoder and decoder RNNs).
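To make these equations concrete, here is a minimal NumPy sketch of the computation of $e_{tj}$, $\alpha_{tj}$ and $c_t$ for one decoding step. The additive parametrization $a(s, h) = v^\top \tanh(Ws + Uh)$ follows the form used by Bahdanau et al.; the dimensions and the random parameters below are toy placeholders, not trained values.

```python
import numpy as np

# Toy dimensions and random parameters (hypothetical placeholders)
rng = np.random.default_rng(0)
d_h, d_s, d_a, T = 6, 5, 4, 8
H = rng.standard_normal((T, d_h))    # encoder states h_1..h_T (as rows)
s_prev = rng.standard_normal(d_s)    # most recent decoder state s_{t-1}

# Parameters of the additive alignment model a(s, h) = v^T tanh(W s + U h)
W = rng.standard_normal((d_a, d_s))
U = rng.standard_normal((d_a, d_h))
v = rng.standard_normal(d_a)

e = np.tanh(W @ s_prev + H @ U.T) @ v   # alignment scores e_{t1}, ..., e_{tT}
alpha = np.exp(e - e.max())             # subtract max for numerical stability
alpha = alpha / alpha.sum()             # softmax -> weights alpha_{tj}, sum to 1
c = alpha @ H                           # context vector c_t (weighted average)
```

Note that the softmax guarantees the weights are non-negative and sum to one, so $c_t$ really is a convex combination of the encoder states.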

It is worth doing a recap to see the entire structure of dependencies in the output:

- The $t$th word $y_t$ depends on the *decoder* state $s_t$.
- The state $s_t$ depends on the previous decoder state $s_{t-1}$, the previous word $y_{t-1}$, and the current context vector $c_t$.
- The context vector $c_t$ depends on *all encoder states* $\{h_j\}_j$ and the weights $\{\alpha_{tj}\}_j$.
- The weights $\{\alpha_{tj}\}_j$ depend on the *alignment scores* $\{e_{tj}\}_j$ for the $t$th word.
- The alignment scores $\{e_{tj}\}_j$ depend on the most recent *decoder* state $s_{t-1}$ as well as all *encoder* states $\{h_j\}_j$.
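This dependency chain can be sketched as a decode loop. Everything below is schematic: the random linear maps and $\tanh$ are hypothetical stand-ins for the trained decoder cell, output layer and alignment MLP, meant only to show the order of the computations.

```python
import numpy as np

rng = np.random.default_rng(1)
T_in, T_out, d = 8, 5, 6
H = rng.standard_normal((T_in, d))   # encoder states {h_j}
Wf = rng.standard_normal((d, 3 * d)) # stand-in "decoder cell" parameters
Wg = rng.standard_normal((d, d))     # stand-in "output layer" parameters
wa = rng.standard_normal(2 * d)      # stand-in "alignment MLP" parameters

def softmax(e):
    w = np.exp(e - e.max())          # subtract max for numerical stability
    return w / w.sum()

s = H[0].copy()                      # s_0: a function of h_1
y = np.zeros(d)                      # no previous output word yet
for t in range(1, T_out + 1):
    e = np.array([wa @ np.concatenate([s, h]) for h in H])  # e_{tj} = a(s_{t-1}, h_j)
    alpha = softmax(e)                                      # weights {alpha_{tj}}_j
    c = alpha @ H                                           # context vector c_t
    s = np.tanh(Wf @ np.concatenate([s, y, c]))             # s_t = f(s_{t-1}, y_{t-1}, c_t)
    y = Wg @ s                                              # (embedding of) output word y_t
```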

While the approach of Bahdanau et al. uses an MLP for determining the alignment scores, more recent approaches in fact use simpler strategies based on inner products. In particular, the Transformer model of Vaswani et al.^{[5]}, which hugely boosted the popularity of attention mechanisms, relies on *scaled dot-product attention* (Section 3.2.1). Before we move on, it is worth spending some time to make sure that we fully grasp the meaning of terms that are now standard, namely the **Query**, **Key** and **Value**; these terms need to become second nature, otherwise we'll have difficulty understanding the operations of the Transformer model.

This terminology comes from the world of information retrieval (search engines, DB management etc.), where the goal is to find the values that match a given query by comparing the query with some keys. In the case of the attention mechanism, these terms can be thought of as below:

- **Query**: This is the entry for which we try to find some matches. For example, in the case of translation, the query is the entry that we use while translating the most recent word in a sentence. In the case of Bahdanau et al. above, this would be the hidden state vector $s_{t-1}$. Just like we have one query when we make a Google or database search, here we have one vector entry, $s_{t-1}$.
- **Key**: These are the entries that are matched against the query. That is, we compare all keys with the query and quantify the similarity between each query-key pair. In the case of Bahdanau et al., the keys are the encoder hidden state vectors $h_1, h_2, \dots, h_T$. When we do a Google search, we match the query against ~all entries (i.e., keys) in the database. That is why our keys are all the hidden states; we are trying to fetch the states that are most relevant to the query.
- **Value**: The value $v_j$ is the entry corresponding to the $j$th key. In the case of Bahdanau et al., it is once again the hidden state vectors $h_j$ that appear on the right-hand side of the equation $$c_t = \sum\limits_{j=1}^{T_x} \alpha_{tj} h_j. $$ (See also Figure 16.5 in Raschka.) Note that the values are not the outputs; the outputs are weighted sums of the values.

As seen above, in the case of Bahdanau et al., the keys and values are of the same kind (they are the encoder's hidden states $h_j$) but used for different purposes. When they act as keys, they are compared against the query (the decoder's hidden state $s_{t-1}$) to quantify similarity; when they act as values, they are the entries that we average over to produce the final context vector $c_t$. This does not need to be the case; keys and values can be different, as we'll see in other examples of attention mechanisms.

In the remainder of the post, we’ll denote the query with $q_t$, the keys with $k_j$ and the values corresponding to each key with $v_j$.

The Transformer model relies on attention mechanisms in three distinct places. To see where, we first need a better understanding of the Transformer model.

The Transformer model is similar to the approach of Bahdanau et al. in that it also relies on an encoder-decoder architecture. The main difference is that the RNNs of Bahdanau et al. are completely replaced with attention-based mechanisms. Hence the title of the paper of Vaswani et al.: "Attention is all you need".

The encoder is responsible for taking the $n$ input words $x_1, \dots, x_n$ and producing an encoded representation for each word, $z_1, z_2, \dots, z_n$. Then, the decoder takes these encoded representations and produces the translated output. While producing each translated word $y_t$, the Transformer uses all the encoded words $z_1, z_2, \dots, z_n$ and all the words that have been produced up to the moment $t$.

The attention mechanisms are then used in three different places, with different query-key-value combinations (Section 3.2.3):

- Between the encoder and decoder, where the queries come from the previous decoder layer and the keys and values come from the output of the encoder.
- (Self-attention) Within the encoder, where all the keys, values and queries come from the same place, namely the output of the previous layer of the encoder. Each position can attend to all positions.
- (Self-attention) Within the decoder, where all the keys, queries and values also come from the previous decoder layer, but each position attends only to positions up to and including the current one (to maintain the causality needed for the auto-regressive property).

Some more examples are helpful to further grasp the utility of these mechanisms. The attention between the encoder and the decoder is rather obvious, as we already discussed it with the network of Bahdanau et al.: The goal is to place higher emphasis on the input words that are more relevant to a particular output word. The second type is less obvious: Why do we need to apply self-attention between the input words? The examples in Figure 3 of Vaswani et al. are very helpful: The goal is to identify the words in the sentence that are related to one another.
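The causal constraint in the decoder's self-attention is typically implemented with a look-ahead mask: positions after the current one get a score of $-\infty$ before the softmax, so they receive exactly zero weight. A small NumPy sketch of the idea:

```python
import numpy as np

T = 5
scores = np.random.default_rng(0).standard_normal((T, T))  # toy attention scores

# Look-ahead mask: position t may attend only to positions <= t
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Row-wise softmax; exp(-inf) = 0, so future positions get zero weight
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
```

Row $t$ of `weights` places all of its mass on positions $0, \dots, t$; in particular, the first row attends only to itself.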

As we mentioned above, in the case of self-attention, the queries, keys and values all come from the exact same place. For example, they can be the hidden state of a word, or they can even be the word (embedding) itself. To take advantage of the learning that takes place in deep networks, we can always add some parameters that will be tuned from data. This is also a way to slightly differentiate between the queries, keys and values.

For example, if $x_i$ is the embedding or the hidden state vector of the $i$th word, then we can learn query, key and value matrices $U_q$, $U_k$ and $U_v$ that produce the query, key and value corresponding to this word as \begin{align}q_i &= U_q x_i \\ k_i &= U_k x_i \\ v_i &= U_v x_i. \end{align} Then, the alignment scores between the $i$th and the $j$th word can be computed as (sorry for the change of notation) $$\omega_{ij} = q_i^T k_j = x_i^T U_q^T U_k x_j.$$ This is still an inner product, but one that involves learned parameters for more flexibility. The context vector corresponding to the $i$th word is still computed with essentially the same formula as above, $$c_i = \sum_j \alpha_{ij} v_j, $$ where $\alpha_{ij}$ is once again computed via a softmax, but only after dividing by the square root of the key dimension $d_k$ (see the last paragraph of Section 3.2.1 of Vaswani et al.): $$\alpha_{ij} = \text{softmax}_j(\omega_{ij}/\sqrt{d_k}).$$
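Putting the pieces together, here is a minimal NumPy sketch of (single-head) scaled dot-product self-attention over $n$ words; the matrices $U_q$, $U_k$, $U_v$ below are random stand-ins for the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                   # toy sizes
X = rng.standard_normal((n, d_model))       # embeddings x_1..x_n (as rows)

# Random stand-ins for the learned projection matrices
U_q = rng.standard_normal((d_k, d_model))
U_k = rng.standard_normal((d_k, d_model))
U_v = rng.standard_normal((d_k, d_model))

Q, K, V = X @ U_q.T, X @ U_k.T, X @ U_v.T   # q_i = U_q x_i, etc. (as rows)
omega = Q @ K.T / np.sqrt(d_k)              # omega_{ij} = q_i^T k_j / sqrt(d_k)

A = np.exp(omega - omega.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)        # row-wise softmax -> alpha_{ij}
C = A @ V                                   # context vectors c_i (as rows)
```

Each row of `C` is the context vector of one word, a weighted average of the value vectors of all words.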

- [2] Sutskever, I.; Vinyals, O.; Le, Q. V. (2014): Sequence to Sequence Learning with Neural Networks. In: Advances in Neural Information Processing Systems, vol. 27, 2014.
- [3] Cho, K. et al. (2014): Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat], 2014.
- [4] Bahdanau, D.; Cho, K.; Bengio, Y. (2014): Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat], 2014.
- [5] Vaswani, A. et al. (2017): Attention Is All You Need. In: Advances in Neural Information Processing Systems, vol. 30, 2017.


The post LSTM Examples #1: Basic Time Series Prediction first appeared on Evangelos Sariyanidi.

Below we provide all the commands and the Python script needed to generate the slides above.

```
# Simplest case
python LSTMtutorial1D.py --data_types=triangle --num_timesteps=1 --use_conv=False
# Simplest case, but it doesn't work so well if we don't observe sufficient data in the given series
python LSTMtutorial1D.py --data_types=triangle --num_timesteps=1 --use_conv=False --init_percentage=0.07
# Scale+sign variation is also no problem
python LSTMtutorial1D.py --data_types=triangle --num_timesteps=1 --use_conv=False --scale_variation=True --sign_variation=True
# We add different types of waves too
python LSTMtutorial1D.py --data_types=triangle,sine --num_timesteps=1 --use_conv=False --scale_variation=True --sign_variation=True
# We even add the cube
python LSTMtutorial1D.py --data_types=triangle,sine,cube --num_timesteps=1 --use_conv=False --scale_variation=True --sign_variation=True
# With the cube added, a non-pre-defined shape is predicted, particularly when we predict before observing sufficient data
python LSTMtutorial1D.py --data_types=triangle,sine,cube --num_timesteps=1 --use_conv=False --scale_variation=True --sign_variation=True --init_percentage=0.07
# Closely inspect what happens above: what happens when we have only one non-zero entry (i.e., the first value of the wave)? The algorithm doesn't know what to do and completes in a "middle-of-the-way" manner
python LSTMtutorial1D.py --data_types=triangle,sine,cube --num_timesteps=1 --use_conv=False --scale_variation=True --sign_variation=True --num_epochs=50
# Performance somewhat improves when we use more model parameters
python LSTMtutorial1D.py --data_types=triangle,sine,cube --num_timesteps=1 --use_conv=False --scale_variation=True --sign_variation=True --num_epochs=50 --rnn_hidden_size=100
# We kind of break down when we add the box prediction
python LSTMtutorial1D.py --data_types=triangle,sine,cube,box --num_timesteps=1 --use_conv=False --scale_variation=True --sign_variation=True --num_epochs=100 --rnn_hidden_size=100
# Performance visibly improves when we use two consecutive time frames (it allows the model to capture the time derivative, which can uniquely determine the type of shape)
python LSTMtutorial1D.py --data_types=triangle,sine,cube,box --num_timesteps=2 --use_conv=True --scale_variation=True --sign_variation=True --num_epochs=100 --rnn_hidden_size=100
# Performance improves a little more when we use a more complicated model, but is still limited
python LSTMtutorial1D.py --data_types=triangle,sine,cube,box --num_timesteps=2 --use_conv=True --scale_variation=True --sign_variation=True --num_epochs=100 --rnn_hidden_size=150
```

```
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 27 18:34:08 2023
@author: v
"""
import os
import argparse
import random

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn


def str2bool(v):
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')


parser = argparse.ArgumentParser(
    prog='LSTM-sequence_prediction',
    description='Sequence prediction examples via LSTM')
parser.add_argument('--data_types', type=str, default='triangle', required=False,
                    help="""Which kind of sequences will be used during training & testing?
                    We have four types: triangle, sine, cube and box. You can use any subset
                    by separating with commas (e.g., "sine,box" or "triangle,sine,box")""")
parser.add_argument('--num_timesteps', type=int, default=1, required=False,
                    help="How many time steps will be used during prediction?")
parser.add_argument('--use_conv', default=False, type=str2bool, required=False,
                    help="Will a convolution layer be added before the LSTM?")
parser.add_argument('--batch_size', default=50, type=int, required=False,
                    help="Size of batch (determines num of training samples)")
parser.add_argument('--num_tra_batches', default=100, type=int, required=False,
                    help="Number of batches to use during training")
parser.add_argument('--num_epochs', default=50, type=int, required=False,
                    help="Number of training iterations (i.e., epochs)")
parser.add_argument('--seq_length', default=30, type=int, required=False,
                    help="Length of sequences")
parser.add_argument('--rnn_hidden_size', default=30, type=int, required=False,
                    help="Number of nodes in LSTM layer")
parser.add_argument('--scale_variation', default=False, type=str2bool, required=False,
                    help="Add scale variation to waves")
parser.add_argument('--sign_variation', default=False, type=str2bool, required=False,
                    help="Add sign variation to waves")
parser.add_argument('--learning_rate', default=0.001, type=float, required=False,
                    help="Learning rate for optimizer")
parser.add_argument('--init_percentage', default=0.3, type=float, required=False,
                    help="Fraction of the sequence observed before prediction begins")
args = parser.parse_args()

models_dir = 'models'
if not os.path.exists(models_dir):
    os.mkdir(models_dir)

figs_dir = 'figures'
if not os.path.exists(figs_dir):
    os.mkdir(figs_dir)

model_path = '%s/Ntra%d-Nep%d-Sc%d-sgn%d-conv%d-Nt%d-T%d-Q%d-%s' % (
    models_dir, args.num_tra_batches, args.num_epochs, args.scale_variation,
    args.sign_variation, args.use_conv, args.num_timesteps, args.seq_length,
    args.rnn_hidden_size, args.data_types)

wave_types = args.data_types.split(',')
T = args.seq_length
num_epochs = args.num_epochs
batch_size = args.batch_size
num_tra_batches = args.num_tra_batches
num_tes_batches = 20
device = 'cuda'
torch.cuda.set_per_process_memory_fraction(0.5, 0)

torch.manual_seed(1907)
random.seed(1907)


def create_single_wave(T, btype='sine'):
    Ta = round(T/2)
    if T % 2 == 0:
        Tb = round(T/2)-1
    else:
        Tb = round(T/2)
    if btype == 'triangle':
        a = torch.cat((torch.ones(Ta), -torch.ones(Tb)))
        x = torch.cumsum(a, 0)
    elif btype == 'sine':
        x = torch.sin(torch.arange(0, T)*(np.pi/(T-1)))
    elif btype == 'box':
        x = torch.ones(T)
    elif btype == 'cube':
        x = (-torch.abs(torch.arange(-Ta, Tb))**3).float()
        x = (x-x.min())/(x.max()-x.min())
    x = x/torch.norm(x, torch.inf)
    return x


def create_naturalistic_signal(T):
    Ta = T//3
    # accelerations 1, 2 and 3
    acc1 = create_single_wave(Ta, 'sine')
    acc2 = -2*create_single_wave(Ta, 'sine')
    acc3 = create_single_wave(Ta, 'sine')
    a = np.concatenate((acc1, acc2, acc3), axis=0)
    x = np.cumsum(a)
    return x/np.linalg.norm(x, np.inf)


def generate_shape_sequences(batch_size, wave_types=('triangle',), T=100, use_diff=False):
    Nchannels = 1
    if use_diff:
        Nchannels += 1
    data = torch.zeros(batch_size, T, Nchannels)
    for b in range(batch_size):
        wave_type = wave_types[random.randint(0, len(wave_types)-1)]
        # random onset and offset for the wave within the sequence
        t0 = torch.randint(0, int(T*.40), (1,))+1
        tf = T-torch.randint(0, int(T*.40), (1,))
        if (tf-t0) % 2 == 1:
            tf -= 1
        Tq = (tf-t0)[0].item()
        sign = 1
        if args.sign_variation:
            sign = 1 if torch.rand(1)[0].item() > 0.5 else -1
        scale = 1
        if args.scale_variation:
            scale = torch.rand(1)[0].item()+0.5
        wave = create_single_wave(Tq, wave_type)*scale*sign
        data[b, t0:t0+len(wave), 0] = wave
        if use_diff:
            data[b, :T-1, 1] = data[b, :, 0].diff()
    data = data.to(device)
    return data


class LSTM_seq_prediction(nn.Module):
    def __init__(self, input_dim=1, rnn_hidden_size=50, num_layers=1, use_conv=False):
        super().__init__()
        self.num_layers = num_layers
        self.use_conv = use_conv
        LSTM_in_channels = input_dim
        if self.use_conv:
            LSTM_in_channels = 4
            self.conv = nn.Conv1d(in_channels=1, out_channels=LSTM_in_channels,
                                  kernel_size=2, bias=False)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(LSTM_in_channels, rnn_hidden_size,
                           num_layers=self.num_layers,
                           batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, 1)

    def forward(self, x, hidden, cell):
        # x arrives as (batch, time, channels); time is typically the second
        # index, and we move it to the last index so that the convolution
        # (or the LSTM input dimension) runs over time
        x = x.permute(0, 2, 1)
        if self.use_conv:
            x = self.conv(x)
            x = x.permute(0, 2, 1)
        out, (hidden, cell) = self.rnn(x, (hidden, cell))
        out = self.fc(out)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(self.num_layers, batch_size, self.rnn_hidden_size, device=device)
        cell = torch.zeros(self.num_layers, batch_size, self.rnn_hidden_size, device=device)
        return hidden, cell


#%%
#
# GENERATE DATA HERE
#
use_diff = False

tra_batch_sets = []
for k in range(num_tra_batches):
    batch_data = generate_shape_sequences(batch_size, wave_types=wave_types, T=T, use_diff=use_diff)
    tra_batch_sets.append(batch_data.to(device=device))

tes_batch_sets = []
for k in range(num_tes_batches):
    batch_data = generate_shape_sequences(batch_size, wave_types=wave_types, T=T, use_diff=use_diff)
    tes_batch_sets.append(batch_data.to(device=device))

#%%
lr = args.learning_rate
model = LSTM_seq_prediction(input_dim=args.num_timesteps,
                            rnn_hidden_size=args.rnn_hidden_size, num_layers=2,
                            use_conv=args.use_conv)
model = model.to(device)
loss_fn = nn.MSELoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

tra_losses = []
tes_losses = []
t0 = args.num_timesteps

if not os.path.exists(model_path):
    for e in range(num_epochs):
        closses = []
        print(e)
        for k in range(num_tra_batches):
            hidden, cell = model.init_hidden(batch_size)
            data = tra_batch_sets[k]
            data = (data-data.mean(axis=1).unsqueeze(1))/data.std(axis=1).unsqueeze(1)
            optimizer.zero_grad()
            loss = 0
            for t in range(t0, data.shape[1]-1):
                output, hidden, cell = model(data[:, t-t0+1:t+1, :], hidden, cell)
                loss += loss_fn(output[:, :, 0], data[:, t+1, 0:1])
            loss.backward()
            optimizer.step()
            closses.append(loss.item())
        tra_losses.append(np.mean(closses))

        with torch.no_grad():
            closses = []
            for k in range(num_tes_batches):
                loss = 0
                hidden, cell = model.init_hidden(batch_size)
                data = tes_batch_sets[k]
                data = (data-data.mean(axis=1).unsqueeze(1))/data.std(axis=1).unsqueeze(1)
                for t in range(t0, data.shape[1]-1):
                    output, hidden, cell = model(data[:, t-t0+1:t+1, :], hidden, cell)
                    loss += loss_fn(output[:, :, 0], data[:, t+1, 0:1])
                closses.append(loss.item())
            tes_losses.append(np.mean(closses))

        if (e+1) % int(round(args.num_epochs/3)) == 0 or e == args.num_epochs-1:
            plt.subplot(121)
            plt.plot(tra_losses)
            plt.plot(tes_losses)
            plt.subplot(122)
            plt.semilogy(tra_losses)
            plt.semilogy(tes_losses)
    torch.save(model.state_dict(), model_path)
else:
    model.load_state_dict(torch.load(model_path))

#%%
data = tes_batch_sets[4].clone()  # (batch_size, T, Nchannels)
data = (data-data.mean(axis=1).unsqueeze(1))/data.std(axis=1).unsqueeze(1)
data_orig = data.clone()
M = int(args.init_percentage*T)
for b in range(batch_size):
    data[b, M:, :] = 0  # hide everything after the first M time points

with torch.no_grad():
    hidden, cell = model.init_hidden(batch_size)
    # warm up on the observed part of the sequence
    for t in range(t0, M):
        output, hidden, cell = model(data[:, t-t0:t, :], hidden, cell)
    # then predict auto-regressively, feeding predictions back as input
    for t in range(M, T):
        output, hidden, cell = model(data[:, t-t0:t, :], hidden, cell)
        data[:, t:t+1, 0:1] = output[:, :, 0:1]
        if data.shape[2] > 1:  # update the derivative channel, if present
            data[:, t:t+1, 1:2] = data[:, t:t+1, 0:1]-data[:, t-1:t, 0:1]

lw = 3
plt.figure(figsize=(20, 15))
for b in range(35):
    plt.subplot(7, 5, b+1)
    plt.plot(data[b, :, 0].to('cpu').squeeze(), linewidth=lw)
    plt.plot(data_orig[b, :, 0].to('cpu').squeeze(), ':', linewidth=lw)
    plt.plot(data_orig[b, :M, 0].to('cpu').squeeze(), 'g:', linewidth=lw*2)
    plt.axis('off')
plt.legend(['predicted', 'true', 'initialization'])

fig_path = '%s/Ntra%d-Nep%d-Sc%d-sgn%d-conv%d-Nt%d-T%d-Q%d-%s-%.3f.jpg' % (
    figs_dir, args.num_tra_batches, args.num_epochs, args.scale_variation,
    args.sign_variation, args.use_conv, args.num_timesteps, args.seq_length,
    args.rnn_hidden_size, args.data_types, args.init_percentage)
plt.savefig(fig_path, bbox_inches='tight')

if args.use_conv and args.num_timesteps == 2:
    print('Convolution parameters are')
    print('==========================')
    print(next(iter(model.conv.parameters())))
```


The post Self-supervision on Deep Nets first appeared on Evangelos Sariyanidi.

Here are some notes (mostly to myself) about self-supervision.

There are two standard ways to do self-supervision: auto-regression and de-noising. Auto-regression typically involves a causal model; i.e., we aim to predict the next word given the previous words, without using words "from the future" (e.g., if we predict the *n*th word of a sentence, we have access only to the first *n-1* words). De-noising typically does not rely on a causal model. An example of de-noising would be taking an image, hiding or adding noise to some part of it, and then predicting the noise-free (i.e., original) image.
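A toy sketch of the two target constructions on a list of token ids; the `MASK` value and the corruption pattern below are arbitrary choices for illustration:

```python
tokens = [11, 12, 13, 14, 15]

# Auto-regression (causal): predict token t from tokens < t
ar_pairs = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
# first pair: ([11], 12)

# De-noising: corrupt some positions, then predict the original sequence
MASK = -1
corrupted = [tok if i % 3 else MASK for i, tok in enumerate(tokens)]
# input: corrupted sequence; target: the original `tokens`
```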

More to be added to this post:

- Discrete vs. Continuous (i.e., the power of embedding)
- Conv-nets vs. Attention-based nets
- Probabilistic vs. deterministic
- Latent space vs. original space


The post State of the art in 3D face reconstruction may be wrong first appeared on Evangelos Sariyanidi.

Also, taking the liberty of this being my personal blog space, I want to share an even more concerning consequence, even though I don't have data for it (and it *is* difficult to obtain such data). Given that pretty much all recent 3D reconstruction methods are based on deep learning, it is possible to "hack" the Chamfer metric: to learn a network that minimizes the Chamfer error without necessarily reducing the true error. In any case, it is not unreasonable to say that the Chamfer error cannot be the sole authority for a method's superiority.

Our IJCB’23 paper presented a **meta-evaluation framework** for evaluating geometric error estimators. Given that the Chamfer approach is subpar, we believe that it is important to have criteria for determining which geometric error estimators are good.

You can read more about the study in the paper (and poster) below:


The post Our paper on limitations of 3D reconstruction *benchmark metrics* was presented at IJCB’23 first appeared on Evangelos Sariyanidi.

Here is my blog post with the paper link and a summary of the study.


The post 3DI: Face Reconstruction via Inequality Constraints first appeared on Evangelos Sariyanidi.

3DI is an optimization-based 3DMM fitting (3D reconstruction) method that enforces inequality constraints on 3DMM parameters and landmarks, thus significantly restricting the search space and ruling out implausible solutions (Figure 1). 3DI is not a learning-based method and is therefore more flexible; e.g., it can straightforwardly incorporate the camera matrix or be adapted to an arbitrary morphable model relatively easily.

**Code (github) of the method:** http://github.com/Computational-Psychiatry/3DI

**Project page:** http://computational-psychiatry.github.io/3DI/

The 3DI method can be used for a variety of facial analysis tasks including

- 3D face reconstruction
- Pose estimation
- Expression quantification
- 2D landmark detection
- 3D landmark detection
- In world coordinates (with pose and expression variation)
- Canonicalized (i.e., pose and/or identity effect removed)


The post SyncRef: Fast & Scalable Way to Find Synchronized Time Series first appeared on Evangelos Sariyanidi.

**Codebase of the method**: https://github.com/sariyanidi/SyncRef

**Paper of the method**: (CVPR’20 link)

Figure 1 illustrates how the method works in an example scenario.

**Figure 1**: How SyncRef finds a synchronized set of sequences. (a) A face video with frames of 100$\times$100 pixels. (b) Input to SyncRef: the set $X$ of 10,000 sequences, where each sequence corresponds to the optical flow magnitude of a pixel w.r.t. the first frame. (c) Illustration of the PCA representation of the sequences in (b); for visualization we use two PCA coefficients, $u_1$ and $u_2$. Each rectangular region defined by dashed lines (i.e., thresholds $\theta^j_k$) is a cluster; $C_j$ is the most populated cluster. (d) The $\epsilon$-expanded cluster $C^j_\epsilon$. (e) The identified synchronized set $\mathcal S$; all points within the circle are correlated at least by $\rho_\theta = 0.80$. (f) The synchronized set of sequences illustrated back in the time domain: these sequences correspond to pixels around the mouth region, activated by the smile in (a).

The method was validated on data from three different domains: Video data (pixel movements), stock market data, and brain data. It was shown that the method can find synchronized time series hundreds of times faster than an alternative method based on (approximate) discovery of maximal cliques on graphs.

More about the method and the experimental results can be read in the paper.


The post Is Pose & Expression Separable with WP Camera? first appeared on Evangelos Sariyanidi.

More about the study and the experiments can be found below:


The post Can I swap one matrix norm with another? first appeared on Evangelos Sariyanidi.

Thus, the following question becomes relevant: can we use some matrix norm in place of another? Fortunately, for some applications, we can.

First of all, for analyzing limiting behavior, it makes no difference which of the following norms we use (see Exercise 5.12.3 of Carl D. Meyer):

- 1-norm
- 2-norm
- $\infty$-norm
- Frobenius norm.

In other words, if a sequence of matrices $\{A_k\}_k$ converges w.r.t. one of these matrix norms, then it converges w.r.t. any of the norms above. Thus, for example, we do not have to rely on the computationally costly 2-norm but can use any other. This is possible because each matrix norm $||A||_i$ is bounded above by another norm multiplied by a constant coefficient. For example, for square matrices of size $n\times n$, it holds that (see p. 425, Exercise 5.12.3 of Carl D. Meyer for more):

- $||A||_1\leq \sqrt{n} ||A||_2$
- $||A||_1\leq n ||A||_\infty$
- $||A||_2\leq ||A||_F$
- $||A||_F\leq \sqrt{n}||A||_2$
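These inequalities are easy to check numerically; a quick NumPy sanity check on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))

n1 = np.linalg.norm(A, 1)          # max absolute column sum
n2 = np.linalg.norm(A, 2)          # largest singular value
ninf = np.linalg.norm(A, np.inf)   # max absolute row sum
nF = np.linalg.norm(A, 'fro')      # Frobenius norm

tol = 1e-12  # small slack for floating-point rounding
assert n1 <= np.sqrt(n) * n2 + tol
assert n1 <= n * ninf + tol
assert n2 <= nF + tol
assert nF <= np.sqrt(n) * n2 + tol
```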

This brings us to a second case where we can use one matrix norm in place of another: if we are just interested in whether the norm of a matrix exceeds a certain value or not, a cheaply computed upper bound (e.g., via the 1-norm, Frobenius norm or $\infty$-norm) may be all we need. For example, in a Branch & Bound optimization scheme, we can eliminate large sets of candidate solutions via upper and lower bounds. Admittedly, the bounds can be too relaxed for large $n$. If one is interested in computing sharper bounds for the 2-norm of a matrix (which is the computationally most involved in the list above), one should consider using the Gerschgorin circles.


The post Does it really take 10¹⁴¹ years to compute the determinant of a 100×100 matrix? first appeared on Evangelos Sariyanidi.

Let’s recall the definition first. The determinant of a matrix $\mathbf A$ is $$\text{det}(\mathbf A) = \sum\limits_p \sigma(p) a_{1p_1} a_{2p_2} \dots a_{np_n},$$ where the sum is taken over the $n!$ permutations $p=(p_1,p_2,\dots,p_n)$ of the numbers $1,2,\dots,n$. The total number of multiplications in this definition is $n!\,(n-1)$. Based on my quick estimations from this link, a modern CPU can perform approximately $6\times 10^{12}$ multiplications per second (which sounds quite insane, frankly). However, that is peanuts compared to what it takes to compute the determinant of an even moderately sized matrix using the definition above. Even for a $100\times 100$ matrix, the total number of multiplications is approximately $9\times 10^{159}$. If you do the math, it turns out that computing the determinant of a $100\times 100$ matrix by direct application of the definition **takes more than … 10¹⁴¹ years**! If this number didn’t impress you, let me tell you that it’s **well beyond the total number of atoms on earth**.

I hopefully got your attention. Now at least two questions should come to your mind:

- How can the determinant of a $100\times 100$ matrix then be computed in **millisecond(s)**? (Which is the case.)
- What is the point of even knowing this definition, since it is clearly not used during computations?

The answer to the first question is that we can compute the determinant of the matrix through its QR or LU decomposition, which is computationally very efficient (at least for a $100\times 100$ matrix). Specifically, we make use of the properties of determinants of upper/lower triangular or orthogonal matrices (it would be a simple exercise to show this).
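As a sketch of this idea, the determinant can be obtained as the product of the pivots of Gaussian elimination (i.e., via an LU factorization with partial pivoting), keeping track of the sign flips caused by row swaps. This costs $O(n^3)$ operations instead of the $O(n!\,n)$ sum over permutations; the helper below is an illustrative implementation, not tuned for production use.

```python
import numpy as np

def det_via_lu(A):
    """Determinant via Gaussian elimination (LU with partial pivoting)."""
    U = A.astype(float).copy()
    n = U.shape[0]
    det = 1.0
    for k in range(n):
        p = k + int(np.argmax(np.abs(U[k:, k])))  # partial pivoting
        if p != k:
            U[[k, p]] = U[[p, k]]
            det = -det                            # a row swap flips the sign
        if U[k, k] == 0.0:
            return 0.0                            # singular matrix
        det *= U[k, k]                            # accumulate the pivots
        U[k+1:, k:] -= np.outer(U[k+1:, k] / U[k, k], U[k, k:])
    return det

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
# Agrees with the built-in (LAPACK-based) determinant up to rounding
assert np.isclose(det_via_lu(A), np.linalg.det(A), rtol=1e-6)
```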

And the point of knowing the definition is that it is still useful for theoretical purposes. This stems from the fact that the determinant is a continuous function of the matrix entries, allowing us to show that the determinant or the *inverse of a matrix* varies continuously with the entries (p. 480), which, among other things, leads to the closed form of the derivative of a determinant (see p. 471 of Meyer).

${}^\dagger$ (This article is inspired by and derived from Exercise 6.1.21 of Carl D. Meyer.)

