Machine Learning

What's In My Big Data? (arxiv.org)

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen...
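One of the analyses such tooling enables is contamination checking. Below is a minimal sketch, not WIMBD's actual code, of flagging evaluation examples whose word n-grams appear verbatim in a training corpus; the 13-gram window and both inputs are illustrative assumptions.

```python
def ngrams(text, n=13):
    """Word-level n-grams; 13 is a window size some contamination
    analyses use, assumed here purely for illustration."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def contamination_rate(eval_examples, corpus_documents, n=13):
    """Fraction of eval examples sharing at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_documents:
        corpus_ngrams.update(ngrams(doc, n))
    hits = sum(any(g in corpus_ngrams for g in ngrams(ex, n))
               for ex in eval_examples)
    return hits / max(len(eval_examples), 1)
```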

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI (arxiv.org)

The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices, which threaten data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts...

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (aclanthology.org)

Vision and language (VL) models are known to exploit non-robust indicators in individual modalities (e.g., those introduced by distributional biases) instead of focusing on the relevant information in each modality. When a unimodal model achieves accuracy on a VL task similar to that of a multimodal one, this indicates that so-called unimodal collapse...
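The name points to Shapley values, so here is a hedged sketch of how a Shapley-style contribution can be estimated per input token by Monte Carlo sampling over reveal orderings, then aggregated per modality. `model_score` is a hypothetical interface, and the paper's exact estimator may differ.

```python
import random

def shapley_values(model_score, num_tokens, num_samples=200):
    """Monte Carlo Shapley estimate: average marginal change in the model's
    output when each token is revealed, over random reveal orderings.
    model_score(visible) maps a set of visible token indices to a scalar."""
    phi = [0.0] * num_tokens
    for _ in range(num_samples):
        order = random.sample(range(num_tokens), num_tokens)
        visible, prev = set(), model_score(set())
        for idx in order:
            visible.add(idx)
            cur = model_score(visible)
            phi[idx] += (cur - prev) / num_samples
            prev = cur
    return phi

def image_contribution(phi, image_token_idx):
    # MM-SHAP-style share: |Shapley mass| on image tokens over all tokens.
    # ~0.5 means both modalities contribute; near 0 or 1 signals collapse.
    img = sum(abs(phi[i]) for i in image_token_idx)
    total = sum(abs(p) for p in phi) or 1.0
    return img / total
```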

Demystifying CLIP Data (arxiv.org)

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP...

GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems (arxiv.org)

There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples, a widespread belief in their iterative self-critique capabilities persists. In this...
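The self-critique setup under study is easy to state in code. A minimal sketch, with `llm` as a hypothetical prompt-to-completion callable; note that no external verifier is involved, which is the setting the paper analyzes.

```python
def iterative_self_critique(llm, problem: str, rounds: int = 3) -> str:
    """The same model generates, critiques, and revises its own answer."""
    answer = llm(f"Solve the following problem:\n{problem}")
    for _ in range(rounds):
        critique = llm(
            f"Problem:\n{problem}\n\nProposed answer:\n{answer}\n\n"
            "Identify any errors in this answer."
        )
        answer = llm(
            f"Problem:\n{problem}\n\nPrevious answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nProvide a corrected answer."
        )
    return answer
```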

PaLI-3 Vision Language Models: Smaller, Faster, Stronger (arxiv.org)

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We...
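For readers unfamiliar with SigLIP: unlike CLIP's batch-softmax contrastive loss, it treats every image-text pair in the batch as an independent binary classification. A minimal sketch of that objective as described in the SigLIP paper, where `t` and `b` are its learned temperature and bias scalars:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    # Positive label (+1) on the diagonal (matched pairs), -1 off it;
    # each entry is scored with an independent sigmoid, not a softmax.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).mean()
```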

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (arxiv.org)

Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The...

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models (openaccess.thecvf.com)

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for...
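The title hints at the recipe: reuse the contrastive pretraining loss during finetuning, rendering class labels as captions. A hedged sketch of one such step, with a hypothetical CLIP-like `model` and a fixed temperature standing in for the learned one:

```python
import torch
import torch.nn.functional as F

def contrastive_finetune_step(model, images, label_texts, optimizer):
    # label_texts arrive rendered as captions, e.g. "a photo of a {class}".
    # Symmetric InfoNCE, exactly as in contrastive pretraining; 0.07 is an
    # assumed temperature, not a value from the paper.
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(label_texts), dim=-1)
    logits = img @ txt.t() / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```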

A Long Way to Go: Investigating Length Correlations in RLHF (arxiv.org)

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering,...
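The paper's core diagnostic is simple to reproduce: measure how strongly reward-model scores correlate with response length. A minimal sketch with hypothetical inputs:

```python
from statistics import correlation  # Pearson's r; Python 3.10+

def length_reward_correlation(responses, rewards):
    # How much of the reward model's score is explained by length alone?
    lengths = [len(r.split()) for r in responses]
    return correlation(lengths, rewards)
```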

Think before you speak: Training Language Models With Pause Tokens (arxiv.org)

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token?...
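Concretely, inference appends learnable pause tokens and only reads the next-token logits after they have been processed. A minimal sketch, with `model` as a hypothetical decoder mapping token ids to per-position logits:

```python
import torch

def generate_next_with_pauses(model, input_ids, pause_id, num_pauses=10):
    # Append <pause> tokens so the model computes num_pauses extra hidden
    # vectors per layer before committing to the next output token.
    # model: ids [batch, seq] -> logits [batch, seq, vocab] (assumed).
    pauses = torch.full((input_ids.size(0), num_pauses), pause_id,
                        dtype=input_ids.dtype, device=input_ids.device)
    logits = model(torch.cat([input_ids, pauses], dim=1))
    return logits[:, -1, :].argmax(dim=-1)  # read logits after the pauses
```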

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No (arxiv.org)

Out-of-distribution (OOD) detection refers to training a model on an in-distribution (ID) dataset so that it can classify whether input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD...
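The idea in the title can be sketched as follows: alongside the usual class prompts, CLIPN learns negated "no" prompts, and an image that no in-distribution class confidently claims is flagged OOD. The fusion rule below is a simplified stand-in for the paper's exact scoring:

```python
import torch
import torch.nn.functional as F

def clipn_style_ood_score(img_emb, yes_txt_emb, no_txt_emb, tau=0.01):
    # img_emb: [B, D]; yes/no text embeddings: [C, D], all L2-normalized.
    # p_yes: standard CLIP class posterior; sim_no: per-class probability
    # that a negated prompt (e.g. "a photo with no <c>") fits the image.
    p_yes = F.softmax(img_emb @ yes_txt_emb.t() / tau, dim=-1)   # [B, C]
    sim_no = torch.sigmoid(img_emb @ no_txt_emb.t() / tau)       # [B, C]
    p_id = (p_yes * (1.0 - sim_no)).sum(dim=-1)  # some ID class says "yes"
    return 1.0 - p_id                            # higher => more likely OOD
```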

Language Modeling Is Compression (arxiv.org)

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive...
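The classic construction behind this equivalence is arithmetic coding: the model's next-token distribution narrows an interval, and the final interval width determines the code length. A toy float-precision sketch (real coders use integer renormalization; `next_token_probs` is a hypothetical callable):

```python
import math

def arithmetic_code_length(tokens, next_token_probs):
    """Narrow [low, high) by the model's probability of each actual token;
    the final width is roughly 2^(-total log loss)."""
    low, high = 0.0, 1.0
    for i, tok in enumerate(tokens):
        probs = next_token_probs(tokens[:i])   # dict: token -> probability
        span = high - low
        cum = 0.0
        for t, p in probs.items():             # fixed iteration order assumed
            if t == tok:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return math.ceil(-math.log2(high - low))   # bits to name a point inside
```

Up to rounding, this code length equals the model's total negative log-likelihood in bits, which is why lower perplexity translates directly into better compression.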

Scaling Vision-Language Models with Sparse Mixture of Experts (arxiv.org)

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these...
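For context, the sparse mixture-of-experts pattern itself is generic: a router activates only k experts per token, so capacity grows with expert count while per-token compute stays roughly flat. A sketch of the technique, not this paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Top-k sparse mixture-of-experts layer (generic sketch)."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, dim]
        gates = F.softmax(self.router(x), dim=-1)         # [tokens, experts]
        topv, topi = gates.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)      # renormalize top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # sparse dispatch
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```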

Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks (arxiv.org)

Recently there has been a surge of multi-modal architectures based on Large Language Models, which leverage the zero-shot generation capabilities of LLMs, project image embeddings into the text space, and then use the auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We name these...
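The bridge pattern reduces to a small trained projection between two frozen models. A hedged sketch (module names are illustrative; `inputs_embeds` follows the Hugging Face convention for embedding-level inputs):

```python
import torch
import torch.nn as nn

class ProjectionBridge(nn.Module):
    """A frozen vision encoder's patch embeddings are linearly projected
    into the LLM's token-embedding space, prepended to the text embeddings,
    and decoded autoregressively by the frozen LLM."""
    def __init__(self, vision_encoder, llm, vis_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder      # frozen
        self.llm = llm                            # frozen
        self.proj = nn.Linear(vis_dim, llm_dim)   # the only trained part

    def forward(self, images, text_embeds):
        vis = self.vision_encoder(images)         # [B, patches, vis_dim]
        vis_tokens = self.proj(vis)               # [B, patches, llm_dim]
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)     # autoregressive decoding
```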

Foundational Models Defining a New Era in Vision: A Survey and Outlook (arxiv.org)

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and...

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training (aclanthology.org)

Multilingual Vision-Language Pre-training (VLP) is a promising but challenging topic due to the lack of large-scale multilingual image-text pairs. Existing works address the problem by translating English data into other languages, which is intuitive, but the generated data is usually limited in form and scale. In this paper, we...

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks (arxiv.org)

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of...

Vision Language Transformers: A Survey (arxiv.org)

Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling. Transformer models...

Hugging Face Releases IDEFICS: An Open-Access 80B Visual Language Model Replicating DeepMind's Flamingo

Hugging Face released IDEFICS, an 80B open-access visual language model replicating DeepMind’s unreleased Flamingo. Built entirely on public data, it’s the first of its size available openly. Part of its training utilized OBELICS, a dataset with 141M web pages, 353M images, and 115B text tokens from Common Crawl....

VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use (arxiv.org)

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models in real-world use. Our starting point is curating 70 'instruction families' that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like...

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback (arxiv.org)

A key technology for the development of large language models (LLMs) is instruction tuning, which helps align the models' responses with human expectations to realize impressive learning abilities. The two major approaches to instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human...
