That's the paper that introduced the #Transformer architecture, dispensing with recurrence and convolutions entirely to achieve much faster training and better results on machine translation tasks.
Any good sources on what the outputs of the attention blocks in a transformer represent? I expected that for "The bank of the plane took it around the savings bank on the bank of the river", the vectors corresponding to "bank" would diverge -- "rotation things/money things/rivery things" -- but AFAICT that doesn't clearly happen. Here are the dot products of the normalized vectors (i.e. cosine similarities) among themselves after the embedding layer and after attention block 5: #ML #Transformer
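For anyone who wants to poke at this themselves, here's roughly how such a probe can be set up; my exact setup isn't shown above, so the bert-base-uncased model below is just an illustrative stand-in:

```python
# Probe per-layer representations of the repeated token "bank".
# The model name is purely illustrative, not necessarily the one used above.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

text = ("The bank of the plane took it around the savings bank "
        "on the bank of the river")
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# hidden_states[0] is the embedding-layer output; hidden_states[k] is after block k.
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
bank_positions = [i for i, t in enumerate(tokens) if t == "bank"]

for layer in (0, 5):
    h = out.hidden_states[layer][0]                      # (seq_len, hidden_dim)
    vecs = torch.nn.functional.normalize(h[bank_positions], dim=-1)
    print(f"layer {layer}:\n{vecs @ vecs.T}")            # cosine similarities
```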
Quite interesting but confusing, as I come from #backpropagation DL.
If I understood it correctly, the authors focus on showing how and why biological neural networks would benefit from being energy-based models doing predictive coding, rather than feedforward networks trained with backpropagation.
It took me a while to find where they explain how to optimize a ConvNet in PyTorch as an EB model, but they do: there is an algorithm and formulae. I'm still curious how long and how stable training is, and whether it all generalizes to typical computer vision architectures (ResNets, MobileNets, ViTs, ...).
Code is also #opensource at https://github.com/YuhangSong/Prospective-Configuration
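To fix the idea for myself -- this is *not* their prospective configuration algorithm (see the repo for that), just the textbook predictive-coding / energy-based loop it builds on: activities are relaxed to minimize a stack of prediction errors, and only then are the weights updated locally.

```python
# Generic predictive-coding sketch (NOT the paper's prospective configuration
# algorithm). Energy = sum of squared prediction errors between each layer's
# activity and the prediction from the layer below. Inference relaxes the
# hidden activities on this energy; learning then uses a local update.
import torch

torch.manual_seed(0)
dims = [10, 32, 32, 5]                                  # input, 2 hidden, output
Ws = [torch.randn(o, i) * 0.1 for i, o in zip(dims[:-1], dims[1:])]

def energy(acts):
    # E = sum_l || x_l - W_l x_{l-1} ||^2
    return sum(((acts[l + 1] - Ws[l] @ acts[l]) ** 2).sum() for l in range(len(Ws)))

x_in = torch.randn(dims[0])
target = torch.randn(dims[-1])

# Initialise activities with a feedforward pass, then clamp input and output.
acts = [x_in]
for W in Ws:
    acts.append(W @ acts[-1])
acts[-1] = target.clone()
for a in acts[1:-1]:
    a.requires_grad_(True)

# Inference phase: relax the hidden activities on the energy.
opt = torch.optim.SGD(acts[1:-1], lr=0.05)
for _ in range(50):
    opt.zero_grad()
    energy(acts).backward()
    opt.step()

# Learning phase: local weight update from the settled prediction errors.
lr_w = 0.01
with torch.no_grad():
    for l, W in enumerate(Ws):
        err = acts[l + 1] - W @ acts[l]
        W += lr_w * torch.outer(err, acts[l])
```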
I would like to sit at my laptop for a few hours and try to see and understand it better, but I think in the next few days I will move on to Modern #HopfieldNetworks. These too are energy-based, and their energy function is minimised by an update rule that matches the #transformer 's dot-product attention.
I think I got what attention does in Transformers, so I'm quite curious to see in what sense it's equivalent to consolidating/retrieving patterns in a Dense Associative Memory. In general, I think we're treating memory wrong with our deep neural networks. I see most of them as sensory processing, a shortcut to "reasoning" without any surrogate for short- or long-term memory, but I can see how some current features may serve similar purposes...
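For reference, the retrieval step I mean is just a softmax over dot products with the stored patterns (the update rule from "Hopfield Networks is All You Need"), which is exactly why it lines up with dot-product attention:

```python
# Modern Hopfield retrieval: new_state = softmax(beta * state @ patterns^T) @ patterns.
# With separate queries/keys/values this is scaled dot-product attention
# (where beta plays the role of 1/sqrt(d_k)).
import torch

def hopfield_retrieve(state, patterns, beta=8.0, steps=1):
    for _ in range(steps):
        state = torch.softmax(beta * state @ patterns.T, dim=-1) @ patterns
    return state

patterns = torch.randn(16, 64)                # stored patterns (the "memory")
query = patterns[3] + 0.3 * torch.randn(64)   # noisy cue
retrieved = hopfield_retrieve(query, patterns)
print(torch.cosine_similarity(retrieved, patterns[3], dim=0))  # ~1 if retrieval works
```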
Wild to see people on a friend's Facebook post throwing an absolute FIT over Hasbro "reissuing" HasLab stuff today when it's pretty clear from how fast they sold out that they were just clearing out a few spares. Unicron sold out in less than a minute. Victory Saber was gone in two. I'd be surprised if they had more than a dozen of either of them. But no, Hasbro LIED and REISSUED them. 😑
More efficient inference for #LLMs: #RecycleGPT: An Autoregressive Language Model with Recyclable Module
It trains a small student #RNN which takes the #Transformer decoder's hidden state and the embedding of its output token as input, and produces the next hidden state (which can be projected and sampled to produce the next output token).
It is not trained the way an RNN usually is, which would be inefficient because of the token-wise sequential dependencies; at training time it conditions on the hidden states the transformer has already produced for the whole sequence, so it can be trained efficiently in parallel.
At inference the two are interleaved so that the small student network produces every other output token cheaply, without significant quality degradation.
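Roughly how I read the recyclable module -- the names and sizes below are mine, not the paper's:

```python
# My rough reading of the recyclable module (names/sizes are mine, not the
# paper's): a small network maps (current hidden state, embedding of the token
# just sampled) -> an approximation of the next hidden state, so every other
# token can be decoded without a full Transformer pass.
import torch
import torch.nn as nn

class RecycleModule(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, hidden, tok_emb):
        # Predict the next decoder hidden state from the current one and the
        # embedding of the token that was just sampled.
        return self.net(torch.cat([hidden, tok_emb], dim=-1))

# Training: teacher-forced on hidden states the Transformer already produced
# for the whole sequence, so there is no sequential bottleneck.
# Inference (interleaved decoding), in pseudocode:
#   h_t    = transformer_step(context)        # full pass, sample token t
#   h_t+1  = recycle(h_t, embed(token_t))     # cheap pass, sample token t+1
#   ... then back to a full pass, and so on.
```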
Improvement suggestions from me:
This might benefit from adding routing that decides, at every token, whether to use the student model or the full model, based on another small model that predicts the quality degradation (rough sketch below).
The student model doesn't actually need to be small, either: it just needs to be cheaper at inference than the transformer, so it can be large enough to be competitive in quality while avoiding the quadratic cost over the sequence length.
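Roughly what I have in mind for the routing part -- entirely hypothetical, not from the paper:

```python
# Hypothetical per-token router (my suggestion above, not the paper's method):
# a tiny head predicts how much quality the recycled state would lose at this
# position; only if the predicted loss is small do we skip the full pass.
import torch
import torch.nn as nn

class RecycleRouter(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden, threshold=0.1):
        # Predicted degradation, e.g. trained to regress the divergence between
        # the full model's and the recycled model's next-token distributions.
        predicted_loss = torch.sigmoid(self.score(hidden))
        return predicted_loss < threshold      # True -> take the cheap recycled path
```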
How come #transformer models aren't made to go back and change their answer as they work? If you ask a human to write something, they will very rarely just spit out an entire document word for word and be done. Most human work involves revising your own output as you go. If you prompt an #LLM to do this, you get a better result, so why not build the model to do this from the get-go?
People working on #LLM #Transformer models are talking about making models deeper and/or longer.
Adding depth to a Transformer tends to bring more capabilities while increasing the model's inference time only linearly. The main requirement for that kind of scaling is the amount of good-quality training data available.
Making the models "longer", that is, increasing the context length, makes them more useful in situations that require more context. Currently GPT-4 supports a maximum context length of 32k tokens, which is more than enough for many valuable use cases. I have so far gotten by with GPT-3.5's 4,096-token context using some clever optimization methods.
Some use cases, such as maintaining huge existing codebases, would benefit from even larger context lengths, and larger contexts would also let companies use in-context learning instead of fine-tuning to customize the model for a specific use.
We can also make systems which selectively put only currently needed stuff into the context.
Scaling up the context length is more difficult than making the models deeper because of the Transformer's self-attention layers. Each output token attends to all of the tokens before it, so the cost per token grows linearly with context length and the cost of the whole sequence grows quadratically.
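The quadratic part is easy to see in the computation itself: the attention score matrix has one entry per (query, key) pair, so it grows as n² with context length n. A minimal sketch:

```python
# Why context length scales badly: causal self-attention builds an n x n score
# matrix (one entry per query/key pair), so memory and compute grow as n^2.
import torch

def causal_attention(q, k, v):
    n, d = q.shape
    scores = q @ k.T / d ** 0.5               # (n, n)  <- the quadratic part
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v  # (n, d)

n, d = 4096, 64
q = k = v = torch.randn(n, d)
out = causal_attention(q, k, v)               # the (n, n) scores dominate: 4096^2 entries
```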
There are methods to mitigate this scaling difficulty: for example attention-free Transformers, RNN-style models, memory tokens, state space models such as Hungry Hungry Hippos, and the Hyena model, which uses Fourier transforms and long convolutions in clever ways to grow the receptive field.
It seems like much of the self-attention layer's computational capacity is actually wasted, at least in inference (see the Hyena paper), so in principle there is a lot of algorithmic room for improvement. However, it is a common theme in deep neural networks that apparently wasted capacity is actually needed to train effectively, even if it isn't needed for the final inference.
It is still an open question how much more efficient we can make large Transformer-type models, but the work has barely even started.
[R] Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers (www.reddit.com)
[R] VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (www.reddit.com)
Arxiv...