Take A Step Back: Rethinking the Two Stages in Visual Reasoning

1Shanghai Jiao Tong University, 2Zhejiang University

Our two-stage framework follows the Separated-Encoder-Shared-Reasoner design to achieve generalization.

Abstract

Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets and thus lack generalization ability.

Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner.

Given our findings, we establish design principles for visual reasoning frameworks following separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), covering both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.

Comparison between end-to-end models, human beings, and our framework. A specific end-to-end model is needed for each task, while our framework shares a logical reasoner, similar to human intelligence.

Our Two-Stage Perspective

As described above, visual reasoning can be divided into two stages: the symbolization stage extracts symbolic representations of the underlying data, and the reasoning stage performs logical reasoning over them. For humans, visual and auditory information collected by different sensory organs is converted into electrical signals through different pathways and then sent to the cerebral cortex for logical reasoning. Analogously, separated task-specific symbolizers and a shared, domain-independent reasoner are a reasonable choice for a general visual reasoning machine. Moreover, the reasoner should be capable of performing unified reasoning over input from various modalities; in other words, the essence of reasoning lies in its generalization ability.

Symbolization Stage. During the symbolization stage, we implement various task-oriented feature extraction networks. These networks employ symbol encoders tailored to each task, transforming multi-modal inputs (text, image, video) into symbol representations. Formally, suppose we have \(n\) tasks. For the \(i\)-th task, we have the input data \(x^i\), the task \(t^i\), and the task-oriented encoder \(E^i\). Then we obtain the symbol representation set \(f^i\) via: \[f^i = E^i(x^i \mid t^i)\]
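For concreteness, below is a minimal PyTorch-style sketch of a task-specific symbol encoder. The class name `SymbolEncoder`, the backbone choice, and the dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SymbolEncoder(nn.Module):
    """Task-specific symbolizer E^i: maps raw inputs x^i of task t^i to symbol representations f^i.
    Hypothetical sketch; in practice the backbone is chosen per task."""
    def __init__(self, backbone: nn.Module, feat_dim: int, symbol_dim: int):
        super().__init__()
        self.backbone = backbone                      # e.g. a CNN for 2D puzzles, a point-cloud net for 3D scenes
        self.to_symbol = nn.Linear(feat_dim, symbol_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # f^i = E^i(x^i | t^i): project backbone features into the shared symbol space
        return self.to_symbol(self.backbone(x))
```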

Reasoning Stage. The reasoner is fed the symbol representations of each specific task in order to capture a deeper and more comprehensive understanding of the underlying patterns and relationships embedded within the data. For the symbol representation sets \(\{f^i\}_{i=1}^{n}\) of all tasks, we send them into the reasoner \(R\) and obtain its reasoning result set \(\{c^i\}_{i=1}^{n}\) after the logic processing, facilitating problem-solving across various modalities: \[\{c^i\}_{i=1}^{n} = R(\{f^i\}_{i=1}^{n}).\]
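Continuing the sketch above, one plausible instantiation of the shared reasoner \(R\) is a small Transformer encoder over the symbol tokens; this architecture choice is an assumption for illustration, not necessarily the paper's reasoner.

```python
import torch.nn as nn

class SharedReasoner(nn.Module):
    """Domain-independent reasoner R, shared by all tasks and all encoders."""
    def __init__(self, symbol_dim: int, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=symbol_dim, nhead=n_heads, batch_first=True
        )
        self.core = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f):
        # f: (batch, num_symbols, symbol_dim) symbol representations of one task
        # returns c^i of the same shape after relational/logical processing
        return self.core(f)
```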

Task-specific Heads. The final part of our framework is the task-specific heads, which take the reasoning results from the reasoner as input and generate task-specific answers. For different tasks, we construct task-specific classification or regression heads \(H^i\) to obtain the final output \(s^i\). That is: \[s^i = H^i(c^i \mid t^i).\]
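Putting the three parts together, a hypothetical assembly of the Separated-Encoder-Shared-Reasoner design (building on the two sketches above) could look like the following.

```python
import torch.nn as nn

class TwoStageFramework(nn.Module):
    """Separated encoders and heads per task; one reasoner shared across all tasks."""
    def __init__(self, encoders: dict, reasoner: nn.Module, heads: dict):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # task name -> SymbolEncoder E^i
        self.reasoner = reasoner                 # shared SharedReasoner R
        self.heads = nn.ModuleDict(heads)        # task name -> classification/regression head H^i

    def forward(self, x, task: str):
        f = self.encoders[task](x)   # symbolization stage
        c = self.reasoner(f)         # shared reasoning stage
        return self.heads[task](c)   # s^i = H^i(c^i | t^i)
```

Note that only `self.reasoner` is shared: supporting a new task amounts to registering a new encoder-head pair under a new key, which is the design choice behind the framework's cross-domain generalization.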

Four types of disentanglement arrangements of encoders and reasoners. Through experiments, we demonstrate that Type 4, Separated-Encoder-Shared-Reasoner, is the most effective.

Reasoning with LLM-based Models

To analyze the performance of LLM-based models, we probe the tasks according to our two-stage framework design and examine three levels separately: (1) Symbolization: whether LLM-based models can recognize the elements of the problem. (2) Concept learning: whether LLM-based models can learn the specific concepts behind the tasks and reason about them. (3) Answer generation: whether LLM-based models can utilize the concepts they learn to solve problems. Using MiniGPT-4 as a representative, we summarize the typical responses of LLM-based models to these three levels of problems in RAVEN and Bongard-HOI.

We find that LLM-based models may exhibit hallucinations while solving visual reasoning tasks. As shown in the figure below, for the RAVEN problem, MiniGPT-4 succeeds at the first level of identifying the objects but fails at the second level of reasoning with the arrangement rule, unable to accurately identify the logical patterns. For the Bongard-HOI problem, MiniGPT-4 succeeds at the first level of recognizing the human activity and at the second level of grasping the underlying concept, yet it fails at the answer generation level and gets lost when applying the rules to answer the question. These cases reveal the shortcomings of LLM-based models on reasoning tasks: good concept comprehension, but insufficient logical reasoning and answer generation.

Failure case analysis of an LLM-based model. On RAVEN, it fails at the Symbolization level, while on Bongard-HOI, it fails at the Answer Generation level.

BibTeX

@inproceedings{zhang2024take,
  author    = {Zhang, Mingyu and Cai, Jiting and Liu, Mingyu and Xu, Yue and Lu, Cewu and Li, Yong-Lu},
  title     = {Take A Step Back: Rethinking the Two Stages in Visual Reasoning},
  booktitle = {ECCV},
  year      = {2024},
}