R-C2: Cross-Modal Cycle Consistency Rewards Improve Multimodal Reasoning

1Rutgers University 2Columbia University 3University of Chicago
Modality Gap: Given the same webpage, an MLLM produces different answers when queried on the screenshot versus the HTML view, motivating R-C2.

Abstract

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual versus textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which amplify systematic biases, we demonstrate that cross-modal inconsistency provides a rich, natural learning signal. We introduce R-C2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring the model to perform backward inference, switch modalities, and reliably reconstruct the answer via forward inference, we establish a dense, label-free reward. This cyclic constraint forces the model to autonomously align its representations. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not just from scaling data, but from enforcing a structurally consistent understanding of the world.

When Majority Voting Breaks

Cross-modal disagreements are common: the same input yields different answers from the screenshot and the HTML view. Simple majority voting over these inconsistent predictions can reinforce the wrong answer instead of correcting it.

Failure of multimodal majority voting when image and text predictions are inconsistent.

Modality Gap and Majority Voting

For the same webpage, screenshot and HTML views drive the MLLM to different answers. Aggregating image-only and text-only predictions by majority vote can still lock in the wrong label, especially when both modalities share a systematic bias—motivating consistency-based training instead of simple counting.
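To make the failure mode concrete, here is a minimal sketch of majority voting over per-modality predictions. The answer labels and vote counts are hypothetical, chosen only to illustrate how a shared bias lets the wrong label win:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common answer among the pooled predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical example: both views lean toward "B" due to a shared
# systematic bias, so voting locks in the wrong label ("A" is correct).
image_preds = ["B", "B", "A"]   # screenshot-view samples
text_preds  = ["B", "A", "B"]   # HTML-view samples
print(majority_vote(image_preds + text_preds))  # "B" wins 4 votes to 2
```

Counting alone cannot distinguish a confident correct answer from a shared bias, which is why R-C2 replaces it with a consistency-based training signal.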

R-C2: Cross-Modal Cycle Consistency Rewards

R-C2 turns cross-modal contradictions into rewards. From a candidate answer, the model performs backward reasoning to synthesize queries and then runs forward inference across text and image views, checking whether the cycle returns to the original answer.

Overview of the R-C2 cross-modal cycle consistency framework.

Turning Cycles into Rewards

Starting from a candidate answer, R-C2 runs answer-to-query backward inference in both text and image modalities, then closes four forward cycles: text→text, text→image, image→text, and image→image. Cycles that faithfully reconstruct the original answer are rewarded, while those that drift are penalized, providing dense label-free signals for GRPO training.
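The four-cycle reward described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: `backward_fn` and `forward_fn` are hypothetical stand-ins for the model's answer-to-query and query-to-answer passes, and the match check is shown as exact string equality:

```python
def cycle_reward(answer, backward_fn, forward_fn, modalities=("text", "image")):
    """Score a candidate answer by how often cross-modal cycles reconstruct it.

    Enumerates all backward/forward modality pairs (text->text, text->image,
    image->text, image->image), rewarding cycles that return the original
    answer and penalizing those that drift.
    """
    rewards = []
    for back in modalities:                    # answer -> query in this view
        query = backward_fn(answer, back)
        for fwd in modalities:                 # query -> answer in this view
            reconstructed = forward_fn(query, fwd)
            rewards.append(1.0 if reconstructed == answer else 0.0)
    return sum(rewards) / len(rewards)         # dense, label-free scalar in [0, 1]
```

The resulting scalar can serve directly as the per-sample reward in a GRPO-style policy update, with no ground-truth labels required.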

Answer-to-Query Backward Inference

Backward inference asks the model to justify its own answer: “for this answer to be correct, what query must have been asked?” R-C2 applies this in both the text and image views, using our reconstructed multiple-choice dataset built from VisualWebArena.
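One simple way to realize answward inference is a per-modality prompt template. The wording below is a hypothetical illustration of the idea, not the exact prompt used in the paper:

```python
def backward_prompt(answer, modality):
    """Build an answer-to-query prompt for the given modality view.

    Hypothetical template: asks the model to reconstruct the
    multiple-choice question that would make `answer` correct.
    """
    view = "the webpage screenshot" if modality == "image" else "the page HTML"
    return (
        f"You are looking at {view}. The correct answer is: {answer}\n"
        "For this answer to be correct, what multiple-choice question "
        "must have been asked? Write the question and its options."
    )
```

The synthesized query is then fed back through forward inference in each modality to close the cycle.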


Cross-Modal Case Studies Across Benchmarks

We evaluate R-C2 on six multimodal reasoning benchmarks and observe consistent gains in both accuracy and cross-modal consistency. The carousel below summarizes quantitative improvements and representative qualitative case studies.


Which Cycles Matter Most?

R-C2 supports many combinations of backward and forward modalities. Ablations reveal which paths contribute most to accuracy and self-consistency.

Heatmaps showing the impact of different backward/forward cycle paths.

Impact of Different Cycle Paths

The heatmaps compare accuracy and agreement when enabling different pairs of backward and forward modalities. Mixed cycles that traverse both text and image views consistently outperform single-modality cycles, confirming that explicitly enforcing cross-modal consistency is key to the gains of R-C2.

Learning from Cross-Modal Conflicts

Not all training examples are equally informative. Samples where image and text views strongly disagree turn out to be the most valuable for improving both accuracy and consistency.

Effect of training on increasingly inconsistent samples.

Training on Hard Inconsistencies

As R-C2 focuses training on samples with larger cross-modal disagreement, both task accuracy and agreement between modalities improve. The hardest conflicts—where text and image most disagree—provide the strongest self-supervised signal for aligning multimodal reasoning.
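Selecting these hard conflicts can be sketched as ranking samples by a cross-modal disagreement score. Total-variation distance between the two modalities' answer distributions is one simple choice used here for illustration; the paper's exact metric may differ:

```python
def disagreement(image_probs, text_probs):
    """Total-variation distance between two answer distributions
    (dicts mapping answer label -> probability)."""
    keys = set(image_probs) | set(text_probs)
    return 0.5 * sum(
        abs(image_probs.get(k, 0.0) - text_probs.get(k, 0.0)) for k in keys
    )

def hardest_conflicts(samples, k):
    """Keep the k samples whose image and text views disagree most."""
    return sorted(
        samples,
        key=lambda s: disagreement(s["image"], s["text"]),
        reverse=True,
    )[:k]
```

Training batches drawn from the top of this ranking concentrate the cycle-consistency reward on the examples where alignment has the most to gain.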

BibTeX

@article{rc22025cross,
  title   = {R-C2: Cross-Modal Cycle Consistency Rewards Improve Multimodal Reasoning},
  author  = {Zirui Zhang and Haoyu Dong and Kexin Pei and Chengzhi Mao},
  journal = {arXiv preprint},
  year    = {2025}
}

Acknowledgements

This work used Purdue Anvil GPU through allocation 250774 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by the U.S. National Science Foundation under grants 2138259, 2138286, 2138307, 2137603, and 2138296. We thank Guangxing Han for the insightful discussion.