Commented - Machine Learning

Machine Learning Beginner Info/Resources

MOOCs...

Language Modeling Is Compression (arxiv.org)

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive...
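
As a rough illustration of the predictor-to-compressor direction mentioned above: a model that assigns probability p to the next symbol can be paired with an arithmetic coder to spend about -log2(p) bits on that symbol. The sketch below only adds up those ideal code lengths; it does not implement the coder. The adaptive character-count model, the function name `ideal_code_length_bits`, and the 256-symbol alphabet are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def ideal_code_length_bits(text, alphabet_size=256):
    """Sum of -log2 p(symbol | history) under an adaptive count-based model.

    Assumption: Laplace (add-one) smoothing over a fixed alphabet size, so
    every symbol has non-zero probability before it has been seen.
    """
    counts = Counter()
    total_bits = 0.0
    for seen, ch in enumerate(text):
        # Probability the model assigns to `ch` given the `seen` symbols so far.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        total_bits += -math.log2(p)
        counts[ch] += 1  # update the model after "coding" the symbol
    return total_bits
```

The better the model predicts the text, the smaller the total. A highly repetitive string costs far fewer than 8 bits per character under this toy model.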

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing (arxiv.org)

Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and...

Extending Context Window of Large Language Models via Positional Interpolation (arxiv.org)

Interesting technique to increase the context window of language models by finetuning on a small number of samples after pretraining....
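
A minimal sketch of the idea as I understand it, assuming rotary position embeddings (RoPE): instead of feeding the model positions beyond those it was trained on, positions in the longer window are linearly scaled down into the trained range before the rotation angles are computed. The function names and the 2048/8192 lengths in the test values are hypothetical, chosen only for illustration.

```python
def interpolated_position(position, trained_len, target_len):
    """Linearly rescale a position from [0, target_len) into [0, trained_len)."""
    return position * trained_len / target_len


def rope_angles(position, dim, base=10000.0):
    """Rotation angles for one position: position * base^(-2i/dim), i < dim/2."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def interpolated_rope_angles(position, dim, trained_len, target_len):
    """RoPE angles computed from the rescaled (possibly fractional) position."""
    scaled = interpolated_position(position, trained_len, target_len)
    return rope_angles(scaled, dim)
```

With a trained length of 2048 and a target of 8192, position 4096 is mapped to 1024, so the angles stay inside the range seen during pretraining. The fine-tuning step mentioned above then adapts the model to the denser spacing.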

Retentive Network: A Successor to Transformer for Large Language Models (arxiv.org)

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations that retain the modeling capacity of Transformer models while allowing parallel training and efficient RNN-like inference (it doesn't use a softmax)....
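
To make the "parallel training, RNN-like inference" point concrete, here is a simplified sketch of a single retention head computed two ways. It assumes a scalar decay `gamma` and leaves out the paper's position rotation, scaling, and normalization, so treat it as an illustration of the equivalence rather than the paper's full layer. The function names are made up for this example.

```python
def retention_parallel(q, k, v, gamma):
    """o_n = sum over m <= n of gamma^(n-m) * (q_n . k_m) * v_m, with no softmax."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        row = [0.0] * d_v
        for m in range(n + 1):
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * v[m][j]
        out.append(row)
    return out


def retention_recurrent(q, k, v, gamma):
    """Same outputs from a running state: S_n = gamma * S_{n-1} + k_n^T v_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        # Decay the old state and add the outer product of the new key/value.
        state = [[gamma * state[a][b] + k_n[a] * v_n[b] for b in range(d_v)]
                 for a in range(d_k)]
        # Read out with the query: o_n = q_n S_n.
        out.append([sum(q_n[a] * state[a][b] for a in range(d_k))
                    for b in range(d_v)])
    return out
```

The recurrent form keeps only a fixed-size state per step, which is where the cheap inference comes from. The parallel form computes every position at once from the full sequence.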

PaLI-3 Vision Language Models: Smaller, Faster, Stronger (arxiv.org)

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We...

Think before you speak: Training Language Models With Pause Tokens (arxiv.org)

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token?...

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback (arxiv.org)

A key technology for the development of large language models (LLMs) involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches for instruction tuning characterize supervised fine-tuning (SFT) and reinforcement learning from human...

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks (arxiv.org)

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous...

The Curse of Recursion: Training on Generated Data Makes Models Forget (arxiv.org)

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic...

[D] Why do we need encoder-decoder models while decoder-only models can do everything? (www.reddit.com)

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training (aclanthology.org)

Multilingual Vision-Language Pre-training (VLP) is a promising but challenging topic due to the lack of large-scale multilingual image-text pairs. Existing works address the problem by translating English data into other languages, which is intuitive and the generated data is usually limited in form and scale. In this paper, we...

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks (arxiv.org)

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of...

Real-Time Radiance Field Rendering (huggingface.co)

Achieves SOTA on quality AND on training time AND renders in real-time (60fps+)

Universal and Transferable Attacks on Aligned Language Models (llm-attacks.org)

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (proceedings.mlr.press)

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large...
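
As the title says, the alternative to picking one winner is to average the weights of several fine-tuned models. A minimal sketch of a uniform average follows, assuming each model is represented as a dict that maps parameter names to flat lists of floats and that all models share the same architecture. The representation and the name `uniform_soup` are assumptions for this example, not the paper's code.

```python
def uniform_soup(state_dicts):
    """Element-wise mean of the parameters of several fine-tuned models."""
    if not state_dicts:
        raise ValueError("need at least one model to average")
    n = len(state_dicts)
    reference = state_dicts[0]
    soup = {}
    for name, values in reference.items():
        # Average each parameter entry across all models.
        soup[name] = [
            sum(sd[name][i] for sd in state_dicts) / n
            for i in range(len(values))
        ]
    return soup
```

The result is a single set of weights, so inference costs the same as for one model, unlike an ensemble that runs every model.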

NeurIPS 2023 Machine Unlearning Challenge (unlearning-challenge.github.io)

Deep neural networks are at the center of rapid progress in AI, with applications to computer vision, natural language processing, speech recognition and others. While this progress offers many exciting opportunities, it also introduces new challenges, as we researchers bear the responsibility to understand and mitigate the...

GitHub - mazzzystar/Queryable: Run CLIP on iPhone to Search Photos. (github.com)

The open source version of Queryable, an iOS app that runs the CLIP model to search the Photos album offline....

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training (arxiv.org)

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models (arxiv.org)

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive...

Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models (arxiv.org)

r/MachineLearning finally received a warning from u/ModCodeOfConduct

Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks (arxiv.org)

Large language models (LLMs) are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results....

Privacy Preserving Bayesian Federated Learning in Heterogeneous Settings (arxiv.org)

In several practical applications of federated learning (FL), the clients are highly heterogeneous in terms of both their data and compute resources, and therefore enforcing the same model architecture for each client is very limiting. Moreover, the need for uncertainty quantification and data privacy constraints are often...

Meta Releases Open Source MusicGen AI (ai.honu.io)

Code, paper, and online demos available