
> The encoder is five-layer bidirectional LSTM (long short-term memory) network. In contrast with neural machine translation, we do not use an attention mechanism but instead have a 1,024-dimension fixed-size vector to represent the input sentence.

5 layers of 1024-cell bidirectional LSTMs (edit: actually 512-cells x2?)? Can consumer GPUs even fit that (+ the decoder) into RAM?
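For a rough sense of scale, here's a back-of-envelope parameter count using PyTorch's LSTM parameterization (the 320-dim input embedding is my assumption, it isn't stated in the quoted excerpt):

    def lstm_params(input_size, hidden_size):
        # PyTorch LSTM: 4 gates, weight_ih + weight_hh + two bias vectors per direction
        return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

    emb, hidden, layers = 320, 512, 5                             # emb=320 is assumed
    total = 2 * lstm_params(emb, hidden)                          # layer 1, both directions
    total += 2 * (layers - 1) * lstm_params(2 * hidden, hidden)   # layers 2-5 see the 1024-dim concat
    print(total, total * 4 / 2**20)                               # ~28.6M params, ~109 MiB in fp32

So the encoder weights alone are on the order of 100 MB in fp32; activations and the decoder add to that, but it should fit comfortably on a consumer GPU.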



Notwithstanding the memory footprint, I love that they reach such impressive results with relatively straightforward components: there's no attention mechanism, and encoding into the latent space is simply max pooling over the last LSTM layer (a minimal sketch of that is below).

I don't fully understand how the training step works, though. Do they train only on translation into English and Spanish?

Anyway very cool stuff.
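A minimal sketch of that encoder shape in PyTorch (vocabulary and embedding sizes here are placeholders, not the actual LASER config):

    import torch
    import torch.nn as nn

    class SentenceEncoder(nn.Module):
        """5-layer bidirectional LSTM, max-pooled over time into a fixed 1024-dim vector."""
        def __init__(self, vocab_size=50000, emb_dim=320, hidden=512, layers=5):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # no attention: the whole sentence is squeezed into one fixed-size vector
            self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers, bidirectional=True)

        def forward(self, tokens):             # tokens: (seq_len, batch)
            x = self.embed(tokens)             # (seq_len, batch, emb_dim)
            out, _ = self.lstm(x)              # (seq_len, batch, 2 * hidden) = (..., 1024)
            sent, _ = out.max(dim=0)           # max-pool over the time dimension
            return sent                        # (batch, 1024)

    # SentenceEncoder()(torch.randint(0, 50000, (12, 3))).shape -> torch.Size([3, 1024])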


It’s not 1024 cells. The size of the hidden vector is 1024, which is on the order of a megabyte per cell (a 1024x1024 matrix). Here they have a cell per word, which is reasonable.
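For scale, a quick check of that figure (assuming fp32 weights):

    # one 1024x1024 weight matrix: ~1M parameters, ~4 MiB in fp32 (~2 MiB in fp16)
    print(1024 * 1024 * 4 / 2**20)   # 4.0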


Max-pooling the output of a 1024-cell LSTM will result in a 1024-dimensional vector.

Looking at the code for the Encoder (https://github.com/facebookresearch/LASER/blob/fec5c7d63daa2...), each LSTM layer has the same number of hidden units (although the default parameters of that class don't quite match the ones used in the post, so I assume it's 512x2x5).


Yes, but they're not max-pooling over the last dimension. They're max-pooling over the sequence length [0] (the other way doesn't really make sense in this context).

The output size is 1024; the hidden vector size is 512, but they're using bidirectional LSTMs, which concatenate the outputs of the two directions -- so the total is 1024 [1].

[0] https://github.com/facebookresearch/LASER/blob/fec5c7d63daa2...

[1] https://pytorch.org/docs/stable/nn.html#lstm
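A quick shape check of what's being described, using PyTorch's default batch_first=False layout (the input size of 320 is just a placeholder):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=320, hidden_size=512, num_layers=5, bidirectional=True)
    x = torch.randn(12, 3, 320)     # (seq_len=12, batch=3, features=320)
    out, _ = lstm(x)
    print(out.shape)                # torch.Size([12, 3, 1024]) -- 2 directions x 512
    pooled, _ = out.max(dim=0)      # pool over the 12 time steps, not over the features
    print(pooled.shape)             # torch.Size([3, 1024])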


Gotcha, that makes sense. (I'm less familiar with dimension ordering in PyTorch)



