Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks

In recent times there has been a surge of multi-modal architectures based on Large Language Models (LLMs). These models leverage the zero-shot generation capabilities of LLMs by projecting image embeddings into the text space and then using the LLM's auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We call these "bridge architectures" because they project from the image space to the text space. They deviate from the traditional recipe for training transformer-based multi-modal models, which involves large-scale pre-training and complex multi-modal interactions through co-attention or cross-attention. However, the capabilities of bridge architectures have not been tested on complex visual reasoning tasks that require fine-grained analysis of the image. In this project, we investigate the performance of bridge architectures on the NLVR2 dataset and compare it to state-of-the-art transformer-based architectures. We first extend traditional bridge architectures for NLVR2 by adding object-level features to facilitate fine-grained object reasoning. Our analysis shows that adding object-level features to bridge architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2. We also present some initial results for a recent bridge architecture, LLaVA, in the zero-shot setting and analyze its performance.
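To make the "bridge" idea concrete, here is a minimal sketch of the projection step the abstract describes. It is not the paper's implementation: the single linear map, the toy dimensions, and the helper names `make_projection`, `project`, and `build_prefix` are all assumptions chosen for illustration. It only shows image embeddings being mapped into the text embedding space and placed ahead of the token embeddings that a language model would then consume.

```python
import random


def make_projection(d_img, d_txt, seed=0):
    """Random weights for a hypothetical linear image-to-text projection."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_img)] for _ in range(d_txt)]


def project(weights, image_embedding):
    """Map one image embedding (length d_img) into the text space (length d_txt)."""
    return [sum(w * x for w, x in zip(row, image_embedding)) for row in weights]


def build_prefix(weights, image_embeddings, text_embeddings):
    """Put the projected image vectors in front of the text token embeddings."""
    visual_tokens = [project(weights, e) for e in image_embeddings]
    return visual_tokens + text_embeddings


# Toy sizes, assumed for illustration only.
d_img, d_txt = 8, 4
weights = make_projection(d_img, d_txt)

# NLVR2 pairs two images with one sentence, so two image vectors are projected.
images = [[1.0] * d_img, [0.5] * d_img]
sentence = [[0.0] * d_txt for _ in range(3)]  # three placeholder token embeddings

sequence = build_prefix(weights, images, sentence)
print(len(sequence), len(sequence[0]))
```

In a real system the image vectors would come from a vision encoder and the projection would be learned, while the language model is typically kept frozen or only lightly tuned; none of that is modeled here.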


KingsmanVince

@KingsmanVince@kbin.social
