Machine Learning

What's In My Big Data? (arxiv.org)

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen...
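
The kind of analysis WIMBD runs can be illustrated with a toy sketch: exact-hash duplicate counting and n-gram contamination checks over a miniature corpus. The corpus, benchmark items, and n-gram size below are made up for illustration; the actual toolkit operates at web scale.

# Toy sketch (not the WIMBD toolkit itself) of two corpus analyses the paper
# describes: duplicate-document counting and test-set contamination checks
# via exact n-gram overlap. Corpus and benchmark contents are illustrative.
from collections import Counter
import hashlib

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "language models are trained on large text corpora",
]
benchmark = ["large text corpora", "unrelated evaluation sentence"]

def doc_hash(doc: str) -> str:
    return hashlib.md5(doc.encode("utf-8")).hexdigest()

def ngrams(text: str, n: int):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# 1) General statistics: duplicate rate by exact document hash.
counts = Counter(doc_hash(d) for d in corpus)
dup_docs = sum(c - 1 for c in counts.values() if c > 1)
print(f"duplicate documents: {dup_docs} / {len(corpus)}")

# 2) Contamination: benchmark items whose n-grams appear verbatim in the corpus.
n = 3  # real contamination checks typically use longer n-grams; 3 keeps the toy visible
corpus_ngrams = set().union(*(ngrams(d, n) for d in corpus))
for item in benchmark:
    overlap = ngrams(item, n) & corpus_ngrams
    print(f"contaminated={bool(overlap)}  item={item!r}")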

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI (arxiv.org)

The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices, which threaten data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts...

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (aclanthology.org)

Vision and language models (VL) are known to exploit non-robust indicators in individual modalities (e.g., ones introduced by distributional biases) instead of focusing on the relevant information in each modality. The fact that a unimodal model can achieve accuracy on a VL task similar to that of a multimodal one indicates that so-called unimodal collapse...
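
A minimal sketch of the idea behind the metric (not the authors' implementation): estimate Shapley values per input token by Monte Carlo permutation sampling, then report the text modality's share of the total absolute attribution (the textual share, which the paper refers to as T-SHAP). The model function and token lists below are dummies standing in for a real vision-language model.

# Hedged sketch of the MM-SHAP idea with an additive dummy "model"; a real run
# would score masked inputs with an actual VL model instead.
import random

def model(active):                      # dummy VL score; replace with a real model call
    return sum(0.8 if t.startswith("txt") else 0.2 for t in active)

tokens = [f"txt{i}" for i in range(4)] + [f"img{i}" for i in range(8)]

def shapley_mc(tokens, model, samples=2000, seed=0):
    rng = random.Random(seed)
    phi = {t: 0.0 for t in tokens}
    for _ in range(samples):
        order = tokens[:]
        rng.shuffle(order)
        active, prev = [], model([])
        for t in order:                 # marginal contribution of t in this permutation
            active.append(t)
            cur = model(active)
            phi[t] += (cur - prev) / samples
            prev = cur
    return phi

phi = shapley_mc(tokens, model)
txt = sum(abs(v) for t, v in phi.items() if t.startswith("txt"))
img = sum(abs(v) for t, v in phi.items() if t.startswith("img"))
print(f"T-SHAP (text share of attribution): {txt / (txt + img):.2f}")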

Demystifying CLIP Data (arxiv.org)

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP...

GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems (arxiv.org)

There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered by a slew of counterexamples, a widespread belief in their iterative self-critique capabilities persists. In this...
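
The iterative-prompting setup under study can be sketched as a generate, self-critique, revise loop; the paper's concern is whether the model's own critique is a reliable stopping signal. The llm function below is a stub standing in for a real chat-completion client so the loop runs end to end; the prompts and stopping rule are illustrative.

def llm(prompt: str) -> str:
    # Stub standing in for a real chat-completion call; returns canned text so
    # the sketch runs end to end. Swap in an actual client for real experiments.
    if "Reply with CORRECT or INCORRECT" in prompt:
        return "CORRECT: the candidate satisfies the constraints."
    if "revised answer" in prompt:
        return "revised candidate answer"
    return "initial candidate answer"

def iterative_prompting(problem: str, rounds: int = 3) -> str:
    candidate = llm(f"Solve the following problem:\n{problem}")
    for _ in range(rounds):
        critique = llm(f"Problem:\n{problem}\nCandidate answer:\n{candidate}\n"
                       "Reply with CORRECT or INCORRECT and explain any error.")
        if critique.strip().upper().startswith("CORRECT"):   # the self-verdict the paper questions
            break
        candidate = llm(f"Problem:\n{problem}\nPrevious answer:\n{candidate}\n"
                        f"Critique:\n{critique}\nGive a revised answer.")
    return candidate

print(iterative_prompting("Color this graph with 3 colors: ..."))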

PaLI-3 Vision Language Models: Smaller, Faster, Stronger (arxiv.org)

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We...
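
The contrastive (SigLIP) pretraining mentioned here uses a pairwise sigmoid loss rather than a softmax over the batch; below is a compact sketch of that loss, with illustrative batch shapes and temperature/bias values rather than PaLI-3's actual settings.

# Sketch of a SigLIP-style pairwise sigmoid loss over a batch of matched
# image-text pairs; embeddings are random stand-ins for encoder outputs.
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """img_emb, txt_emb: (N, D) L2-normalised embeddings of matched pairs."""
    logits = t * img_emb @ txt_emb.T + b          # (N, N) pairwise similarities
    labels = 2 * np.eye(len(img_emb)) - 1         # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit), summed over all pairs, averaged over images
    return np.mean(np.sum(np.logaddexp(0.0, -labels * logits), axis=1))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = x + 0.05 * rng.normal(size=x.shape); y /= np.linalg.norm(y, axis=1, keepdims=True)
print(f"loss on near-matched pairs: {siglip_loss(x, y):.3f}")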

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (arxiv.org)

Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The...

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models (openaccess.thecvf.com)

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for...
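
As the title suggests, the method (FLYP) keeps CLIP's contrastive pretraining objective during finetuning, pairing each image with a text prompt built from its label instead of training a classification head. Below is a rough sketch of that objective with random stand-in embeddings; in a real run the embeddings would come from CLIP's encoders.

# Hedged sketch of contrastive finetuning: the same symmetric InfoNCE loss used
# in pretraining, applied to (image, class-prompt) pairs from the downstream task.
import numpy as np

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over an aligned batch of (image, class-prompt) pairs."""
    logits = img @ txt.T / temperature                       # (N, N) similarity matrix
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(len(img))
    return -(log_p_i2t[diag, diag].mean() + log_p_t2i[diag, diag].mean()) / 2

labels = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in labels]              # text side of each pair
rng = np.random.default_rng(0)
img = rng.normal(size=(3, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(3, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
# In a real run: img = image_encoder(batch), txt = text_encoder(prompts).
print(f"FLYP-style finetuning loss over {prompts}: {clip_contrastive_loss(img, txt):.3f}")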

A Long Way to Go: Investigating Length Correlations in RLHF (arxiv.org)

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering,...
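
The paper's central diagnostic can be reproduced in miniature: measure how strongly reward-model scores correlate with response length. The data below are synthetic, with a deliberately planted length bias; in the paper this analysis runs over real preference datasets, reward models, and RLHF policies.

# Toy length-bias check: correlation between response length and reward score.
import numpy as np

rng = np.random.default_rng(0)
lengths = rng.integers(20, 400, size=500)                       # tokens per response
rewards = 0.01 * lengths + rng.normal(scale=0.5, size=500)      # synthetic, length-biased reward

r = np.corrcoef(lengths, rewards)[0, 1]
print(f"Pearson correlation(length, reward) = {r:.2f}")
# A high correlation of this kind is the sort of evidence the paper uses to argue
# that part of the apparent RLHF gain comes from simply producing longer outputs.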

Think before you speak: Training Language Models With Pause Tokens (arxiv.org)

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token?...
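
A minimal sketch of the mechanism, assuming a toy causal transformer rather than the models and training recipe of the paper: append M <pause> tokens to the K-token prefix so that K+M hidden vectors are processed per layer, and read the next-token prediction only from the last pause position.

# Illustrative pause-token inference with a tiny causal transformer; token ids,
# sizes, and M are made up, and outputs at pause positions are simply ignored.
import torch
import torch.nn as nn

PAUSE_ID, VOCAB, M = 0, 100, 10

class TinyCausalLM(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):
        T = ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        return self.head(self.body(self.emb(ids), mask=causal))

prefix = torch.tensor([[17, 42, 5]])                                 # K = 3 real tokens
padded = torch.cat([prefix, torch.full((1, M), PAUSE_ID)], dim=1)    # K + M positions

model = TinyCausalLM()
logits = model(padded)
next_token = logits[0, -1].argmax()    # predict only after the last pause token
print(f"next token predicted from {padded.size(1)} hidden vectors per layer: {next_token.item()}")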

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No (arxiv.org)

Out-of-distribution (OOD) detection refers to training a model on an in-distribution (ID) dataset so that it can identify whether input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD...
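
A common zero-shot OOD score built from CLIP similarities is the maximum softmax probability over the in-distribution class prompts; CLIPN's contribution, per the title, is teaching CLIP to also say "no" via learned negative prompts. The sketch below shows only the generic max-softmax score, with random stand-ins for CLIP's encoder outputs and an illustrative temperature.

# Generic zero-shot OOD scoring from image-to-text similarities (not CLIPN itself).
import numpy as np

def zero_shot_ood_score(img_emb, class_txt_emb, temperature=0.07):
    """Higher score -> more likely in-distribution (max softmax probability)."""
    sims = img_emb @ class_txt_emb.T / temperature          # (num_classes,)
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return probs.max()

rng = np.random.default_rng(0)
classes = rng.normal(size=(10, 64))                          # stand-in class-prompt embeddings
classes /= np.linalg.norm(classes, axis=1, keepdims=True)

id_image = classes[3] + 0.05 * rng.normal(size=64)           # close to class 3
ood_image = rng.normal(size=64)                              # unrelated input
for name, img in [("ID-like", id_image), ("OOD-like", ood_image)]:
    img = img / np.linalg.norm(img)
    print(f"{name:8s} score = {zero_shot_ood_score(img, classes):.3f}")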

Language Modeling Is Compression (arxiv.org)

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive...
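
The equivalence the abstract refers to can be demonstrated in miniature: pair any predictive model with arithmetic coding and the resulting lossless code length is, up to a couple of bits, the model's negative log-likelihood of the data. The sketch below uses a tiny adaptive character-level model in place of an LLM and only tallies the code length; the text is illustrative and the decoder is assumed to share the alphabet and run the same adaptive model.

# Code length of an adaptive order-1 character model, i.e. roughly the bits an
# arithmetic coder driven by this model would emit for the same text.
import math
from collections import defaultdict

text = "the quick brown fox jumps over the lazy dog. the quick brown fox again."
alphabet = sorted(set(text))                      # decoder is assumed to share this
A = len(alphabet)

counts = defaultdict(lambda: defaultdict(int))    # counts[prev_char][next_char]
bits, prev = 0.0, ""
for ch in text:
    ctx = counts[prev]
    total = sum(ctx.values())
    p = (ctx[ch] + 1) / (total + A)               # Laplace-smoothed next-char probability
    bits += -math.log2(p)                         # Shannon code length for this prediction
    ctx[ch] += 1                                  # adaptive update: model improves as it reads
    prev = ch

print(f"model-based code length: {bits:.1f} bits ({bits / len(text):.2f} bits/char)")
print(f"raw 8-bit baseline:      {8 * len(text)} bits (8.00 bits/char)")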

Scaling Vision-Language Models with Sparse Mixture of Experts (arxiv.org)

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these...
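
Below is a hedged sketch of the sparse mixture-of-experts layer the title refers to: a router sends each token to its top-k expert MLPs, so only a fraction of the parameters is active per input. Sizes, k, and the routing details are illustrative, not the paper's configuration, and load-balancing losses are omitted.

# Minimal top-k routed MoE feed-forward layer over a batch of token states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                      # x: (num_tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.k, dim=-1)        # route each token to k experts
        weights = F.softmax(weights, dim=-1)                   # renormalise over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens whose slot-th choice is e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)                                   # e.g., image-patch + text-token states
print(SparseMoE()(tokens).shape)                               # torch.Size([16, 64])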
