albertcardona, to machinelearning
@albertcardona@mathstodon.xyz avatar

The arXiv as the public record and version control system of a scientific manuscript: 7 versions spanning 6 years.

"[Submitted on 12 Jun 2017 (v1), last revised 2 Aug 2023 (this version, v7)]"

"Attention Is All You Need" by Vaswani et al.
https://arxiv.org/abs/1706.03762

That's the paper that introduced the Transformer architecture, dispensing with recurrence and convolutions to achieve much faster training and higher performance on a machine translation task.

Lobrien, to ML

Any good sources on what the outputs of the attention blocks in a transformer represent? I expected that for "The bank of the plane took it around the savings bank on the bank of the river", the vectors corresponding to "bank" would diverge -- "rotation things/money things/rivery things" -- but AFAICT that doesn't clearly happen. Here are the dot products of the normalized vectors (aka "cosine similarity") of the tokens against each other, after the embedding layer and after attention block 5:

Heatmap showing identical vectors for identical word embeddings
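
A minimal sketch of one way to run that comparison, assuming the HuggingFace transformers library and bert-base-uncased as an illustrative model (not necessarily what was used for the heatmap above):

```python
# Compare contextual vectors for the three occurrences of "bank" at two layers.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

sentence = "The bank of the plane took it around the savings bank on the bank of the river"
enc = tok(sentence, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# hidden_states[0] is the embedding layer output, hidden_states[i] is after block i
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
bank_idx = [i for i, t in enumerate(tokens) if t == "bank"]

for layer in (0, 5):
    vecs = out.hidden_states[layer][0, bank_idx]           # (3, hidden_dim)
    vecs = torch.nn.functional.normalize(vecs, dim=-1)     # unit-normalize
    print(f"layer {layer} cosine similarities:\n{vecs @ vecs.T}")
```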

itnewsbot, to random
@itnewsbot@schleuss.online avatar

Parts We Miss: The Mains Transformer - About two decades ago there was a quiet revolution in electronics which went unnot... - https://hackaday.com/2024/02/14/parts-we-miss-the-mains-transformer/ #mainstransformer #originalart #powersupply #transformer #featured #interest #parts

bornach, to llm
@bornach@masto.ai avatar

[AI Coffee Break with Letitia] explains the Transformer architecture behind LLMs
https://youtu.be/ec9IQMiJBhs

fabrice13, to ArtificialIntelligence Italian
@fabrice13@neuromatch.social avatar

On backpropagation vs predictive coding and energy-based models
Just skimmed through "Inferring neural activity before plasticity as a foundation for learning beyond backpropagation" by Yuhang Song et al. https://www.nature.com/articles/s41593-023-01514-1

Quite interesting but confusing, as I come from DL.
If I got it right, the authors focus on showing how and why biological neural networks would benefit from being Energy Based Models for Predictive Coding, instead of Feedforward Networks employing backpropagation.
I struggled to reach the part where they explain how to optimize a ConvNet in PyTorch as an EB model, but they do: there are an algorithm and formulae. I'm curious how long and how stable training is, and whether it all generalizes to typical computer vision architectures (ResNets, MobileNets, ViTs, ...).
Code is also at https://github.com/YuhangSong/Prospective-Configuration

I would like to sit for a few hours at my laptop and try to see and understand this better, but I think in the next days I will move on to Modern Hopfield Networks. These too are energy-based, and there's an energy function that is optimised by the Transformer's dot-product attention.
I think I get what attention does in Transformers, so I'm quite curious to understand in what sense it's equivalent to consolidating/retrieving patterns in a Dense Associative Memory. In general, I think we're treating memory wrong with our deep neural networks. I see most of them as sensory processing, a shortcut to "reasoning" without short- or long-term memory surrogates, but I could see how some current features may serve similar purposes...
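
A tiny sketch of the Modern Hopfield retrieval step (the update rule from "Hopfield Networks is All You Need", written here in NumPy; beta, the dimensions and the random patterns are only illustrative), whose form is exactly softmax dot-product attention over the stored patterns:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_retrieve(X, query, beta=8.0, steps=3):
    """X: (d, N) stored patterns as columns; query: (d,) noisy probe."""
    xi = query.copy()
    for _ in range(steps):
        # One update: xi <- X softmax(beta * X^T xi), i.e. attention over stored patterns
        xi = X @ softmax(beta * (X.T @ xi))
    return xi

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))                  # 10 stored patterns of dim 64
noisy = X[:, 3] + 0.3 * rng.standard_normal(64)    # corrupted version of pattern 3
retrieved = hopfield_retrieve(X, noisy)
print("cosine to stored pattern 3:",
      retrieved @ X[:, 3] / (np.linalg.norm(retrieved) * np.linalg.norm(X[:, 3])))
```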

j_bertolotti, to random
@j_bertolotti@mathstodon.xyz avatar

Something that having a kid taught me is that Transformers nowadays are significantly worse toys than those I played with ~40 years ago.

trixter, to Transformers
@trixter@retro.pizza avatar

Wild to see people on a friend's Facebook post throwing an absolute FIT over Hasbro "reissuing" HasLab stuff today when it's pretty clear from how fast they sold out that they were just clearing out a few spares. Unicron sold out in less than a minute. Victory Saber was gone in two. I'd be surprised if they had more than a dozen of either of them. But no, Hasbro LIED and REISSUED them. 😑

#transformers #HasLab #Hasbro #ToyCollecting #transformer #unicron #VictorySaber

SinclairSpeccy, to random

Thinking about how great the wiki tfwiki.net is, particularly how they write their articles.

Take Starscream for example and some others 😂

Takara! Tomy! Superlink!
MY CROTCH! MY EVIL CROTCH!!!
"No! Please! I was only in the movie for 42 seconds!"

SinclairSpeccy, to random

I wish there were more decent Transformers games for the PC. War for Cybertron and Fall of Cybertron get boring after a while.

br00t4c, to random
@br00t4c@mastodon.social avatar

Fulcrum Point presents the midwest premiere of La Monte Young's The Second Dream of the High-Tension Line Stepdown Transformer

https://chicagoreader.com/music/fulcrum-point-la-monte-young/

itnewsbot, to random
@itnewsbot@schleuss.online avatar

Flipped Transformer Powers Budget-Friendly Vacuum Tube Amp - If you’ve ever wondered why something like a radio or a TV could command a hefty f... - https://hackaday.com/2023/11/07/flipped-transformer-powers-budget-friendly-vacuum-tube-amp/

davidaugust, to me
@davidaugust@mastodon.online avatar

A passageway, under the tracks and beyond the technology. It awaits us all.

📷 by

bortzmeyer, to llm French
@bortzmeyer@mastodon.gougere.fr avatar

A good article about how LLMs and the Transformer model work https://ig.ft.com/generative-ai/

50years_music, to transgender
@50years_music@mastodon.online avatar

"Walk on the Wild Side" is a song by American musician from his second solo studio album, (1972). It was produced by and and released as a with "". Known as a counterculture anthem, the song received wide radio coverage and became Reed's biggest hit and while touching on topics considered taboo at the time, such as people, , , and .
https://youtu.be/oG6fayQBm9w

tero, to LLMs
@tero@rukii.net avatar

More efficient inference for LLMs:
RecycleGPT: An Autoregressive Language Model with Recyclable Module

It trains a small student network which takes the full decoder hidden state and the embedding of its output token as input, and produces the next hidden state (which can be mapped to logits and sampled to produce the next output token).

Although it behaves like an RNN, it is not trained sequentially, which would be inefficient because of the token-wise dependencies; at training time it can condition on the previous hidden states already produced by the transformer, so it can be trained efficiently in parallel.

At inference time it is interlaced with the full model, so the small student network can produce every other output token cheaply, without significant quality degradation.
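
A rough sketch of that interlaced decoding loop (plain Python with placeholder callables, not the paper's actual API; KV-cache handling is elided):

```python
def generate_interlaced(prompt_ids, transformer_step, recycle_module,
                        embed, lm_head, sample, max_new_tokens=32):
    """transformer_step: full decoder pass over the sequence -> last hidden state.
    recycle_module: (hidden_state, token_embedding) -> predicted next hidden state.
    embed: token id -> embedding; lm_head: hidden state -> logits; sample: logits -> token id."""
    ids = list(prompt_ids)
    h = None
    for step in range(max_new_tokens):
        if step % 2 == 0 or h is None:
            h = transformer_step(ids)              # expensive: full transformer pass
        else:
            h = recycle_module(h, embed(ids[-1]))  # cheap: recycle the previous state
        ids.append(sample(lm_head(h)))             # map hidden state to the next token
    return ids
```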

Improvement suggestions from me:

This might benefit from adding routing: another small model could predict the expected quality degradation at each token and decide whether to use the student model or the full model.

The student model doesn't actually need to be small, either: it can be large enough to be competitive in quality while still being more efficient at inference than the transformer, because it doesn't suffer from quadratic complexity over the sequence length.

https://arxiv.org/abs/2308.03421

Tae156, to random
@Tae156@borahae.love avatar

So is actually training himself to be a tree?

itnewsbot, to hardware
@itnewsbot@schleuss.online avatar

AC-DC Converter is Reliable, Safe, and Efficient - When first starting an electronics project, it’s not uncommon to dive right in to ... - https://hackaday.com/2023/07/12/ac-dc-converter-is-reliable-safe-and-efficient/ #switchedmodepowersupply

hperrin, to ai

How come AI models aren't made to go back and change their answer as they work? If you ask a human to write something, they will very rarely just spit out an entire document word for word and be done. Most human work involves revising your own output as you go. If you prompt an LLM to do this, you will get a better result, so why not build the model to do this from the get-go?
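
A toy sketch of that revision loop done at the prompting level, where generate() is a hypothetical stand-in for whatever LLM call you use, not a real library function:

```python
def draft_and_revise(task, generate, rounds=3):
    # First pass: produce a draft in one go.
    text = generate(f"Write a first draft: {task}")
    for _ in range(rounds):
        # Ask the model to critique its own output, then rewrite using the critique.
        critique = generate(f"List concrete problems with this draft:\n{text}")
        text = generate(f"Rewrite the draft fixing these problems.\n"
                        f"Draft:\n{text}\nProblems:\n{critique}")
    return text
```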

(I revised this post 4 times before posting it.)

itnewsbot, to random
@itnewsbot@schleuss.online avatar

Don’t Let The Baluns Float Over Your Head - Most ham radio operators will build an antenna of some sort when they first start ... - https://hackaday.com/2023/04/29/dont-let-the-baluns-float-over-your-head/

tero, to random
@tero@rukii.net avatar

People working on large language models are talking about making the models deeper and/or longer.

Making a Transformer model deeper tends to add capabilities while increasing its inference time only linearly. The main requirement for scaling depth is the amount of good-quality training data available.

Making the models longer in time, that is, increasing the context length, makes them more useful in situations which require more context. Currently GPT-4 supports a maximum context length of 32k tokens, which is more than enough for many valuable use cases. I have so far gotten by with GPT-3.5's context length of 4,096 tokens and some clever optimization methods.

Some use cases, such as maintaining huge existing codebases, would benefit from even larger context lengths, and larger contexts would also allow companies to use in-context learning instead of fine-tuning to customize the model for a specific use.

We can also build systems which selectively put only the currently needed material into the context.
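
A toy sketch of that kind of selective context packing: score stored chunks against the query and keep only what fits a token budget (the embedding vectors and count_tokens function are hypothetical inputs, not any particular library's API):

```python
import numpy as np

def select_context(query_vec, chunks, chunk_vecs, count_tokens, budget=3000):
    """chunks: list of text snippets; chunk_vecs: (N, d) embeddings; query_vec: (d,)."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    picked, used = [], 0
    for i in np.argsort(-sims):            # most similar chunks first
        cost = count_tokens(chunks[i])
        if used + cost <= budget:          # greedily fill the token budget
            picked.append(chunks[i])
            used += cost
    return picked
```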

Scaling up the context length is more difficult than making the models deeper because of the Transformer's self-attention layers. Each output token is produced by a computation that attends over all previous input tokens, which makes the cost scale quadratically with sequence length.
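
A minimal NumPy sketch of the naive computation, just to make the quadratic part visible (the score matrix is n-by-n):

```python
import numpy as np

def naive_causal_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                     # (n, n): the quadratic part
    scores[np.triu_indices(n, k=1)] = -np.inf         # mask future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d)

n, d = 1024, 64
Q = K = V = np.random.default_rng(0).standard_normal((n, d))
out = naive_causal_attention(Q, K, V)
print(out.shape, "score matrix entries:", n * n)      # grows as n**2
```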

There are methods to mitigate this scaling difficulty: attention-free Transformers, RNN models, memory tokens, and state space models like Hungry Hungry Hippos and the Hyena model, which use Fourier transforms and convolutions in clever ways to increase the receptive fields of the convolutions.

It seems like much of the self-attention layer's computational capacity is actually wasted, at least at inference time (see the Hyena paper), so in principle there is a lot of algorithmic room for improvement. However, it is a common theme in deep neural networks that apparently wasted capacity is actually needed to train effectively, even if it isn't needed for the final inference.

It is still an open question how much more efficient we can make large Transformer-type models, but the work has barely even started.
