KingsmanVince

@KingsmanVince@kbin.social

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (aclanthology.org)

Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves accuracy similar to a multimodal one on a VL task indicates that so-called unimodal collapse...
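
As a rough illustration of the metric: assuming per-token Shapley values for one VL-model prediction have already been computed (the attribution step itself is omitted, and the numbers below are made up), each modality's score is just its share of the total absolute attribution, independent of whether the prediction was correct.

```python
import numpy as np

# Hypothetical per-token Shapley values for a single VL-model prediction;
# in practice these come from running a SHAP-style explainer over the
# model's text tokens and image patches.
phi_text = np.array([0.30, -0.05, 0.12])         # text-token attributions
phi_image = np.array([0.02, 0.01, -0.03, 0.04])  # image-patch attributions

# Modality contribution = share of the total absolute attribution.
contrib_text = np.abs(phi_text).sum()
contrib_image = np.abs(phi_image).sum()
total = contrib_text + contrib_image

t_shap = contrib_text / total    # textual degree
v_shap = contrib_image / total   # visual degree
print(f"T-SHAP = {t_shap:.2f}, V-SHAP = {v_shap:.2f}")  # the two sum to 1
```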

Demystifying CLIP Data (arxiv.org)

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP...

PaLI-3 Vision Language Models: Smaller, Faster, Stronger (arxiv.org)

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We...

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (arxiv.org)

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The...

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models (openaccess.thecvf.com)

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for...
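
The paper's core recipe is to finetune with the same contrastive image-text loss used at pretraining rather than with a classification head. Below is a minimal NumPy sketch of that symmetric loss, with random toy embeddings standing in for the image-encoder and prompt-encoder outputs; names and sizes are illustrative only.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings,
    i.e. the same objective CLIP is pretrained with."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(img))              # matching pairs on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    # image-to-text and text-to-image directions, averaged
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy batch: 4 images paired with 4 class-prompt texts ("a photo of a {class}"),
# both embedded into 8 dimensions here purely for illustration.
rng = np.random.default_rng(0)
print(clip_contrastive_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```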

Think before you speak: Training Language Models With Pause Tokens (arxiv.org)

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate, say, $K+10$ hidden vectors before it outputs the $(K+1)^{th}$ token?...
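
A minimal inference-time sketch of the idea, assuming a tokenizer and a causal LM with a `generate` method exist; the pause-token id, the count of ten, and the helper name are all made up for illustration (the paper also trains with pause tokens, which is not shown here).

```python
PAUSE_TOKEN_ID = 50257   # hypothetical id of a learned <pause> token
NUM_PAUSES = 10          # the "K + 10 hidden vectors" from the abstract

def generate_with_pauses(model, prompt_ids, num_pauses=NUM_PAUSES):
    # Append <pause> tokens so the model processes extra positions
    # (extra hidden vectors per layer) before committing to an answer.
    padded = list(prompt_ids) + [PAUSE_TOKEN_ID] * num_pauses
    output_ids = model.generate(padded)   # assumed to return prompt + continuation
    # Anything produced at the pause positions is ignored; only tokens after
    # the last pause are treated as the model's answer.
    return output_ids[len(padded):]
```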

KingsmanVince,

IIRC, DETR generates a sequence to predict object boxes. I think this paradigm could be applied to such models. "Think before you locate" could be a new path to explore.

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No (arxiv.org)

Out-of-distribution (OOD) detection refers to training a model on an in-distribution (ID) dataset so that it can tell whether input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD...
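
For context, a generic zero-shot OOD baseline with a plain CLIP model scores an image only against the in-distribution class prompts and flags low-confidence matches as unknown. The sketch below (random stand-ins for the embeddings, illustrative names) shows that baseline, not CLIPN's learned "no" prompts.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def ood_score(image_emb, id_text_embs, temperature=0.01):
    """Score an image against in-distribution class prompts only:
    a low maximum softmax probability suggests an unknown class."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = id_text_embs / np.linalg.norm(id_text_embs, axis=1, keepdims=True)
    probs = softmax(img @ txt.T / temperature)
    return 1.0 - probs.max()   # higher = more likely out-of-distribution

# Toy example with random vectors standing in for CLIP embeddings of an image
# and of prompts like "a photo of a {id_class}".
rng = np.random.default_rng(0)
print(ood_score(rng.normal(size=64), rng.normal(size=(10, 64))))
```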

Scaling Vision-Language Models with Sparse Mixture of Experts (arxiv.org)

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these...
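
The sparse mixture-of-experts idea itself is easy to sketch: a router scores a handful of expert feed-forward networks per token and only the top-k are evaluated. The toy NumPy version below is a generic top-k MoE layer for a single token, not the paper's exact VL-MoE architecture.

```python
import numpy as np

def sparse_moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Route one token to its top-k experts and mix their outputs,
    so only k of the E experts are actually evaluated."""
    gate_logits = gate_weights @ x                   # (E,) router scores
    top = np.argsort(gate_logits)[-top_k:]           # chosen experts
    gate = np.exp(gate_logits[top])
    gate = gate / gate.sum()                         # renormalised mixing weights
    return sum(g * (expert_weights[e] @ x) for g, e in zip(gate, top))

# Toy setup: 4 experts (each a single linear map) applied to an 8-dim token.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
experts = rng.normal(size=(4, 8, 8))    # (num_experts, d_out, d_in)
gates = rng.normal(size=(4, 8))         # (num_experts, d_in)
print(sparse_moe_layer(x, experts, gates).shape)     # -> (8,)
```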

Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks (arxiv.org)

In recent times there has been a surge of multi-modal architectures based on Large Language Models, which leverage the zero-shot generation capabilities of LLMs, project image embeddings into the text space, and then use the auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We name these...
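
The "bridge" itself is typically just a small projection that maps frozen image features into the LLM's token-embedding space so they can be consumed as a prefix. A minimal NumPy sketch of that wiring, with made-up dimensions and a random matrix standing in for the learned projector:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 512, 768

image_features = rng.normal(size=(32, d_vision))   # 32 patch features (stand-in)
text_embeddings = rng.normal(size=(16, d_llm))     # embedded prompt tokens (stand-in)

W_proj = rng.normal(size=(d_vision, d_llm)) * 0.02 # the learned bridge/projection

visual_prefix = image_features @ W_proj            # (32, d_llm)
llm_input = np.concatenate([visual_prefix, text_embeddings], axis=0)
print(llm_input.shape)   # (48, 768): sequence the (frozen) LLM attends over
```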

Foundational Models Defining a New Era in Vision: A Survey and Outlook (arxiv.org)

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and...

itnewsbot, to machinelearning
@itnewsbot@schleuss.online

OpenAI admits that AI writing detectors don’t work

Last week, OpenAI published tip... - https://arstechnica.com/?p=1966483

KingsmanVince,

@itnewsbot Good, now all teachers and professors who rely solely on these AI detector tools should be informed.

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training (aclanthology.org)

Multilingual Vision-Language Pre-training (VLP) is a promising but challenging topic due to the lack of large-scale multilingual image-text pairs. Existing works address the problem by translating English data into other languages, which is intuitive, but the generated data is usually limited in form and scale. In this paper, we...

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks (arxiv.org)

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture and further need adaptations for downstream tasks. We propose a novel paradigm of...

Vision Language Transformers: A Survey (arxiv.org)

Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling. Transformer models...

VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use (arxiv.org)

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction tuned vision-language models should be able to address. Extending beyond evaluations like...

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback (arxiv.org)

A key technology for the development of large language models (LLMs) involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches for instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human...

RecycleGPT: An Autoregressive Language Model with Recyclable Module (arxiv.org)

Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model with fast decoding speed by recycling pre-generated model states without running the whole model in multiple steps. Our approach relies on the observation that adjacent tokens...
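
An illustrative decoding loop, sketched from the abstract alone: run the full model for some steps, and in between let a lightweight module predict the next token from the cached state instead of re-running everything. The module names and the one-in-two schedule here are my own reading, not the paper's exact design.

```python
def decode(full_model, recycle_head, prompt_ids, num_tokens):
    """Alternate expensive full forward passes with cheap predictions made
    by reusing ("recycling") the most recent cached model state."""
    ids = list(prompt_ids)
    state = None
    for step in range(num_tokens):
        if step % 2 == 0 or state is None:
            token, state = full_model(ids)      # expensive: whole model runs
        else:
            token = recycle_head(state)         # cheap: reuse the cached state
        ids.append(token)
    return ids
```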

Multitask Pretraining with Structured Knowledge for Text-to-SQL Generation (aclanthology.org)

Many machine learning-based low-code or no-code applications involve generating code that interacts with structured knowledge. For example, one of the most studied tasks in this area is generating SQL code from a natural language statement. Prior work shows that incorporating context information from the database schema, such as...
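
A common way to expose the schema to such a model is simply to serialise table and column names into the input text. The sketch below shows one generic serialisation plus a prompt; the schema, question, and format are made up, and the paper's pretraining format may differ.

```python
def serialize_schema(tables):
    """Flatten a database schema into a textual context string so a
    text-to-SQL model can see table and column names."""
    return " | ".join(f"{table}({', '.join(cols)})" for table, cols in tables.items())

# Hypothetical schema and question.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "singer_id", "year"],
}
prompt = (
    f"Schema: {serialize_schema(schema)}\n"
    "Question: How many singers are from France?\n"
    "SQL:"
)
print(prompt)
# A model conditioned on this prompt should produce something like:
# SELECT count(*) FROM singer WHERE country = 'France'
```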

Retentive Network: A Successor to Transformer for Large Language Models (arxiv.org)

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations that retain the modeling capacity of Transformer models, while allowing parallel training and efficient RNN-like inference without the use of attention (it doesn't use a softmax)....
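
The key trick is that retention can be written in two equivalent ways: a parallel form for training and a recurrent form for cheap per-token inference. A simplified single-head NumPy check (no xPos rotation, multi-scale decay, or normalisation from the paper) showing the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                  # sequence length, head dimension
gamma = 0.9                  # decay factor (one head; RetNet uses several scales)
Q, K, V = rng.normal(size=(3, n, d))

# Parallel (training) form: (Q K^T ⊙ D) V with causal decay matrix D.
D = np.array([[gamma ** (i - j) if i >= j else 0.0 for j in range(n)]
              for i in range(n)])
out_parallel = (Q @ K.T * D) @ V

# Recurrent (inference) form: a single fixed-size state updated per step,
# with no attention over the whole history and no softmax.
S = np.zeros((d, d))
out_recurrent = np.zeros((n, d))
for t in range(n):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

print(np.allclose(out_parallel, out_recurrent))   # True: the two forms match
```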
