kromem,

OP, you do realize that this paper is about image generation and classification based on related data sets and only relates to the image processing features of multimodal models, right?

How do you see this research as connecting to the future scope of LLMs?

And why do you think the same leap we’ve now seen with synthetic text data transmitting abstract capabilities won’t also occur with images (and eventually video)?

Edit: Which LLMs do you see in the models they tested? From the paper:

> Models. We test CLIP [91] models with both ResNet [53] and Vision Transformer [36] architecture, with ViT-B-16 [81] and RN50 [48, 82] trained on CC-3M and CC-12M, ViT-B-16, RN50, and RN101 [61] trained on YFCC-15M, and ViT-B-16, ViT-B-32, and ViT-L-14 trained on LAION400M [102]. We follow open_clip [61], slip [81] and cyclip [48] for all implementation details.
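For anyone who wants to poke at these themselves, here's a minimal sketch of loading one of the listed checkpoints through the open_clip library and scoring an image against text prompts zero-shot. The pretrained tag and image path are illustrative placeholders, not the paper's exact training setup:

```python
import torch
from PIL import Image
import open_clip

# Load a ViT-B-16 CLIP model. "laion400m_e32" is one of the public
# open_clip pretrained tags (placeholder -- swap in whichever checkpoint you want).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings and turn cosine similarities into zero-shot probabilities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probabilities over the two text prompts
```

Point being: every model in that list is a contrastive image-text encoder, not a text-generating LLM.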
