Personalized Daily Arxiv Papers 07/15/2025

This project is adapted from tatsu-lab/gpt_paper_assistant. The source code of this project is at lxysl/gpt_paper_assistant

Topics

Paper selection prompt and criteria (jump to the section by clicking the link):

1. Multimodal Large Language Models

2. Unified Multimodal Large Language Models for Understanding and Generating

3. Large Language Models

4. Self-Supervised Learning and Vision-Language Pre-training

5. Image Generation (Diffusion, Autoregressive, Tokenizer, etc.)

6. Reinforcement Learning in Large or Multimodal Models & Reasoning During Inference

7. Evaluation Sets and Datasets for Multimodal Large Models

8. AI Agents & Embodied Intelligence (especially involving LLMs/MLLMs)

Go beyond


Today's Spotlight Papers

DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs [topic 1]

Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset [topic 7]

EmbRACE-3K: Embodied Reasoning and Action in Complex Environments [topic 7, topic 8]

Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning [topic 6]

A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images [topic 1]

IGD: Instructional Graphic Design with Multimodal Layer Generation [topic 2]

LLM-Stackelberg Games: Conjectural Reasoning Equilibria and Their Applications to Spearphishing [topic 8, topic 6]

EduFlow: Advancing MLLMs' Problem-Solving Proficiency through Multi-Stage, Multi-Perspective Critique [topic 1]

AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning [topic 5]

Scaling Laws for Optimal Data Mixtures [topic 1, topic 3]

ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments [topic 1, topic 7]

Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning [topic 1, topic 6]

Frequency Regulation for Exposure Bias Mitigation in Diffusion Models [topic 5]

Latent Diffusion Models with Masked AutoEncoders [topic 5]

GenAI-based Multi-Agent Reinforcement Learning towards Distributed Agent Intelligence: A Generative-RL Agent Perspective [topic 6, topic 8]

From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation [topic 5]

Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources [topic 4]

Memorization Sinks: Isolating Memorization during LLM Training [topic 3]

Lizard: An Efficient Linearization Framework for Large Language Models [topic 3]

MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models [topic 1]

Learning Diffusion Models with Flexible Representation Guidance [topic 5]

Behavioral Exploration: Learning to Explore via In-Context Adaptation [topic 8]

Warm Starts Accelerate Generative Modelling [topic 5]

Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [topic 3]

AirScape: An Aerial Generative World Model with Motion Controllability [topic 8]

ProactiveBench: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models [topic 7]

GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? [topic 7]

VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding [topic 7]

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models [topic 5, topic 4]

PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection [topic 1]

Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [topic 8]

Demonstrating the Octopi-1.5 Visual-Tactile-Language Model [topic 1, topic 8]

DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models [topic 3]

Rethinking Prompt Optimization: Reinforcement, Diversification, and Migration in Blackbox LLMs [topic 3, topic 6]

Quantize-then-Rectify: Efficient VQ-VAE Training [topic 5]

Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models [topic 1, topic 6]

wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models [topic 6]

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation [topic 3]

Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering? [topic 6]

Memory-Efficient Personalization of Text-to-Image Diffusion Models via Selective Optimization Strategies [topic 5]

Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding [topic 1]

Stable Score Distillation [topic 5]

Prompt Informed Reinforcement Learning for Visual Coverage Path Planning [topic 8]

Self-Improving Model Steering [topic 3]

OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation [topic 8]

Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLMs Without RL or Distillation [topic 3]

Graph World Model [topic 1, topic 8]

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [topic 3, topic 6]

VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization [topic 1]

Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces [topic 3]

Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers [topic 3, topic 1]

ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning [topic 3, topic 6]

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation [topic 7]


Topic 1: 1. Multimodal Large Language Models (23 papers)

ArXiv: 2507.10302 [page] [pdf] [kimi]

Authors: Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao

TLDR: DisCo:一种新颖的视频MLLM视觉封装方法,通过引入视觉概念判别器(VCD)和时序焦点校准器(TFC)模块,解决传统线性投影器在视频中引入的语义模糊和时间不连贯问题,显著提升视频理解基准测试性能,同时实现更高的token效率,代码已开源。

Abstract

arXiv:2507.10302v1 Announce Type: new Abstract: In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code: https://github.com/ZJHTerry18/DisCo.

Comment: Criterion: 1

Relevance: 10 Novelty: 9 Back to [topic] [top]

ArXiv: 2507.10202 [page] [pdf] [kimi]

Authors: Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seonjoo Kim

TLDR: ECP:一种训练无关、任务无关的框架,用于提升多模态大语言模型(MLLMs)在处理高分辨率图像时的性能。该方法利用MLLM在下采样图像上的隐式定位线索,通过先提取候选区域再进行预测的两阶段策略,有效解决了MLLMs在高分辨率图像上细粒度理解和推理的挑战,在4K GUI接地和4K、8K MLLM感知上分别带来+21.3%、+5.8%、+5.2%的绝对性能提升。代码已开源。

Abstract

arXiv:2507.10202v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned with fixed image resolution to align with the pre-trained image encoder used in MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images-although ensuring consistency-compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying candidate region using the coarse prediction and then predicting the final output based on candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K MLLM perception, achieving +21.3%, +5.8%, +5.2% absolute improvement compared to baseline respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.

Comment: Criterion: 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09374 [page] [pdf] [kimi]

Authors: Chenglin Zhu, Tao Zhang, Chong Li, Mingan Lin, Zenan Zhou, Jian Xie

TLDR: EduFlow:一种端到端框架,旨在通过多阶段、多视角批判提升多模态大语言模型(MLLMs)在科学问题解决上的能力。它引入了过程感知奖励模型EduPRM进行细粒度批判,并结合领域自适应搜索框架EduMCTS以促进反思性错误校正,显著增强了推理一致性和连贯性;构建了EduMCTS-160K大型教育推理轨迹数据集,代码和模型将开源。

Abstract

arXiv:2507.09374v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) still perform poorly on scientific tasks, particularly those requiring multi-step and interpretable reasoning. Their limitations include insufficient scientific reasoning patterns, lack of global coherence in multi-step inference, and the absence of reflective self-correction, making them unreliable in structured scientific contexts. We introduce EduFlow, the first end-to-end framework that covers the full pipeline of educational scientific reasoning, including data selection, MCTS-based trajectory construction, model training, and output optimization. At its core is EduPRM, a process-aware reward model that critiques reasoning steps with tags and justifications. EduPRM is trained via curriculum learning on three complementary supervision sources: MCTS-guided trajectories, error-injected critiques, and teacher-student dialogues, enabling dynamic adaptation to multi-stage problem solving and iterative refinement during inference. We further propose EduMCTS, a domain-adapted search framework that introduces bootstrapping actions specifically designed for educational reasoning, such as a self-reflection mechanism that promotes reflective error correction. It further leverages EduPRM's fine-grained feedback to guide the search toward higher-quality reasoning trajectories. By applying self-consistency and rejection sampling, we constructed EduMCTS-160K, a large-scale dataset of educational reasoning trajectories. Extensive experiments demonstrate that EduFlow enhances reasoning consistency and coherence. Code, data, and models will be released.

Comment: Criterion: 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09404 [page] [pdf] [kimi]

Authors: Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin

TLDR: 本文提出一种基于缩放定律的系统性方法,用于确定大型基础模型(包括LLM、NMM和LVM)的最佳数据混合比例,该方法能准确预测模型性能并从少量小规模训练中估计参数,为模型训练提供了一种原理性替代方案,避免了昂贵的试错过程。

Abstract

arXiv:2507.09404v1 Announce Type: new Abstract: Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision models (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws allow to derive the optimal domain weights for any target domain under a given training budget ($N$,$D$), providing a principled alternative to costly trial-and-error methods.

Comment: Criterion: 1, 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09693 [page] [pdf] [kimi]

Authors: Jiali Chen, Yujie Jia, Zihan Wu, Jinyu Yang, Jianpeng Chen, Xusen Hei, Jiayuan Xie, Yi Cai, Qing Li

TLDR: ExpStar:一种用于多学科科学实验自动评论生成的模型,通过构建首个包含7K步级评论的ExpInstruct数据集,并利用检索增强机制自适应地访问和利用外部知识。该模型在实验中显著超越了14个主流大型多模态模型(LMMs),在视频理解和精细评论生成方面表现出卓越能力,有望推动AI辅助科学教学。

Abstract

arXiv:2507.09693v1 Announce Type: new Abstract: Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) We construct \textit{ExpInstruct}, the first dataset tailored for experiment commentary generation, featuring over 7\textit{K} step-level commentaries across 21 scientific subjects from 3 core disciplines (\ie, science, healthcare and engineering). Each sample includes procedural descriptions along with potential scientific principles (\eg, chemical equations and physical laws) and safety guidelines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.

Comment: Criterion: 1, 7

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10007 [page] [pdf] [kimi]

Authors: Zijun Chen, Wenbo Hu, Richang Hong

TLDR: 本文提出一种新颖的方法,通过利用模型内在的真值编码来校准思维链(CoT)推理的准确性,发现特定注意力头激活能可靠反映CoT推理步骤的真实性,并据此训练置信度预测器,通过束搜索动态选择最合理路径。该方法在单模态和多模态设置下的数学、符号和常识推理任务中显著优于现有SOTA基线,提供了提升CoT推理可靠性的新途径。

Abstract

arXiv:2507.10007v1 Announce Type: new Abstract: Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities in both large language models (LLMs) and multimodal large language models (MLLMs). However, its reliability is often undermined by the accumulation of errors in intermediate steps. This paper introduces an novel approach to calibrate the CoT reasoning accuracy by leveraging the model's intrinsic veracity encoding. We discover that specific attention head activations reliably reflect the truthfulness of reasoning steps in CoT. Based on this insight, we train a confidence predictor to evaluate the correctness of each reasoning step using these truthfulness-sensitive activations, dynamically selecting the most plausible reasoning path via beam search. Experimental results demonstrate that our method significantly outperforms the state-of-the-art baselines (e.g., Few-Shot CoT, Self-Consistency, and Self-Evaluation Guided Beam Search) across the mathematical, symbolic, and commonsense reasoning tasks, exhibiting superior accuracy and reliability in both unimodal and multimodal settings. We further validate the approach on large reasoning models, confirming its applicability to specialized reasoning models. Additionally, we explore the role of the model's self-correction ability in CoT reasoning. This work provides a novel reliability improvement path for CoT reasoning with broad application potential.

Comment: Criterion: 1, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09184 [page] [pdf] [kimi]

Authors: Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang

TLDR: MCA-LLaVA:一种针对大型视觉语言模型(LVLMs)幻觉问题的新方法,通过揭示旋转位置编码(RoPE)的长期衰减导致的图像对齐偏差,提出基于曼哈顿距离的二维多向空间衰减机制,整合图像令牌的一维序列顺序与二维空间位置进行位置建模,显著缓解幻觉并增强多模态对齐。代码已开源。

Abstract

arXiv:2507.09184v1 Announce Type: new Abstract: Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in https://github.com/ErikZ719/MCA-LLaVA.

Comment: Criterion: 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.08979 [page] [pdf] [kimi]

Authors: Mahdiyar Molahasani, Azadeh Motamedi, Michael Greenspan, Il-Min Kim, Ali Etemad

TLDR: PRISM:一种新颖的无数据、任务无关的VLM偏见缓解方案。通过LLM引导生成场景描述以发现虚假关联,并采用对比式去偏损失学习嵌入投影,有效最小化虚假关联同时保持图像与文本对齐,在Waterbirds和CelebA数据集上性能超越现有方法。代码已开源。

Abstract

arXiv:2507.08979v1 Announce Type: new Abstract: We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings.Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets We make our code public at: https://github.com/MahdiyarMM/PRISM.

Comment: Criterion: 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09985 [page] [pdf] [kimi]

Authors: Samson Yu, Kelvin Lin, Harold Soh

TLDR: Octopi-1.5:该研究展示了最新的视觉-触觉-语言模型Octopi-1.5,它能处理多物体部分的触觉信号并结合RAG模块,显著提升了触觉推理与实时学习能力,通过TMI手持触觉界面无需机器人即可实时交互演示,代码已开源。

Abstract

arXiv:2507.09985v1 Announce Type: new Abstract: Touch is recognized as a vital sense for humans and an equally important modality for robots, especially for dexterous manipulation, material identification, and scenarios involving visual occlusion. Building upon very recent work in touch foundation models, this demonstration will feature Octopi-1.5, our latest visual-tactile-language model. Compared to its predecessor, Octopi-1.5 introduces the ability to process tactile signals from multiple object parts and employs a simple retrieval-augmented generation (RAG) module to improve performance on tasks and potentially learn new objects on-the-fly. The system can be experienced live through a new handheld tactile-enabled interface, the TMI, equipped with GelSight and TAC-02 tactile sensors. This convenient and accessible setup allows users to interact with Octopi-1.5 without requiring a robot. During the demonstration, we will showcase Octopi-1.5 solving tactile inference tasks by leveraging tactile inputs and commonsense knowledge. For example, in a Guessing Game, Octopi-1.5 will identify objects being grasped and respond to follow-up queries about how to handle it (e.g., recommending careful handling for soft fruits). We also plan to demonstrate Octopi-1.5's RAG capabilities by teaching it new items. With live interactions, this demonstration aims to highlight both the progress and limitations of VTLMs such as Octopi-1.5 and to foster further interest in this exciting field. Code for Octopi-1.5 and design files for the TMI gripper are available at https://github.com/clear-nus/octopi-1.5.

Comment: Criterion: 1, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09279 [page] [pdf] [kimi]

Authors: Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

TLDR: Prompt4Trust:首个面向多模态大语言模型(MLLM)置信度校准的强化学习提示增强框架,通过轻量级LLM生成上下文感知辅助提示,提升下游MLLM的预测准确性与置信度校准,在PMC-VQA医学视觉问答基准上实现SOTA性能,并展示对更大MLLM的零样本泛化能力,有效增强MLLM在安全关键场景下的可信赖性,代码已开源。

Abstract

arXiv:2507.09279v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/vccrl-llm.

Comment: Criterion: 1, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09334 [page] [pdf] [kimi]

Authors: Wencan Huang, Daizong Liu, Wei Hu

TLDR: Fast3D:针对3D多模态大语言模型(MLLMs)计算效率低的问题,提出首个即插即用型视觉Token剪枝框架,通过全局注意力预测(GAP)和样本自适应视觉Token剪枝(SAP)技术,有效加速3D MLLMs。代码已开源。

Abstract

arXiv:2507.09334v1 Announce Type: new Abstract: While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model. Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at https://github.com/wencan25/Fast3D

Comment: Criterion: 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10539 [page] [pdf] [kimi]

Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

TLDR: GWM:提出一种图世界模型,它能同时支持非结构化和图结构化数据,并融合多模态信息,将多样化任务表示为动作。GWM通过统一的多模态令牌或嵌入空间上的通用消息传递算法,在多模态生成与匹配、推荐、图预测、多智能体、检索增强生成及规划优化等六个任务上,性能超越或媲美领域特定基线,并展现出强大的零/少样本能力,代码已开源。

Abstract

arXiv:2507.10539v1 Announce Type: new Abstract: World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. Our code for GWM is released at https://github.com/ulab-uiuc/GWM.

Comment: Criterion: 1, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09531 [page] [pdf] [kimi]

Authors: Son Nguyen, Giang Nguyen, Hung Dao, Thao Do, Daeyoung Kim

TLDR: VDInstruct:一种用于零样本关键信息提取(KIE)的多模态大语言模型,引入内容感知视觉分词策略,将空间区域检测与语义特征提取分离,实现依据文档复杂性生成token,而非统一碎片化,显著减少图像tokens约3.6倍,并在KIE基准上达到SOTA,零样本性能超越DocOwl 1.5达5.5 F1点,代码已开源。

Abstract

arXiv:2507.09531v1 Announce Type: new Abstract: Key Information Extraction (KIE) underpins the understanding of visual documents (e.g., receipts and contracts) by extracting precise semantic content and accurately capturing spatial structure. Yet existing multimodal large language models (MLLMs) often perform poorly on dense documents and rely on vision tokenization approaches that scale with image size, leading to redundant computation and memory inefficiency. To address these challenges, we introduce VDInstruct, an MLLM that separates spatial region detection from semantic feature extraction. Central to our model is a content-aware tokenization strategy: rather than fragmenting the entire image uniformly, it generates tokens in proportion to document complexity, preserving critical structure while eliminating wasted tokens. Leveraging a three-stage training paradigm, our model achieves state-of-the-art (SOTA) results on KIE benchmarks, matching or exceeding the accuracy of leading approaches while reducing the number of image tokens by roughly 3.6x. In zero-shot evaluations, VDInstruct surpasses strong baselines-such as DocOwl 1.5-by +5.5 F1 points, highlighting its robustness to unseen documents. These findings show that content-aware tokenization combined with explicit layout modeling offers a promising direction forward for document understanding. Data, source code, and model weights will be made publicly available.

Comment: Criterion: 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09406 [page] [pdf] [kimi]

Authors: Santhosh Kumar Ravindran

TLDR: Adversarial Activation Patching:该研究提出了一种新颖的机械可解释性框架,即对抗性激活修补,用于检测和缓解安全对齐Transformer模型(包括LLMs和推及的多模态设置)中的紧急欺骗行为,通过激活修补模拟漏洞并量化欺骗率,为AI安全提供了重要见解和实证研究路线图。

Abstract

arXiv:2507.09406v1 Announce Type: new Abstract: Large language models (LLMs) aligned for safety through techniques like reinforcement learning from human feedback (RLHF) often exhibit emergent deceptive behaviors, where outputs appear compliant but subtly mislead or omit critical information. This paper introduces adversarial activation patching, a novel mechanistic interpretability framework that leverages activation patching as an adversarial tool to induce, detect, and mitigate such deception in transformer-based models. By sourcing activations from "deceptive" prompts and patching them into safe forward passes at specific layers, we simulate vulnerabilities and quantify deception rates. Through toy neural network simulations across multiple scenarios (e.g., 1000 trials per setup), we demonstrate that adversarial patching increases deceptive outputs to 23.9% from a 0% baseline, with layer-specific variations supporting our hypotheses. We propose six hypotheses, including transferability across models, exacerbation in multimodal settings, and scaling effects. An expanded literature review synthesizes over 20 key works in interpretability, deception, and adversarial attacks. Mitigation strategies, such as activation anomaly detection and robust fine-tuning, are detailed, alongside ethical considerations and future research directions. This work advances AI safety by highlighting patching's dual-use potential and provides a roadmap for empirical studies on large-scale models.

Comment: Criterion: 3, 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10300 [page] [pdf] [kimi]

Authors: Hatef Otroshi Shahreza, S'ebastien Marcel

TLDR: FaceLLM:首个专门用于面部理解的多模态大语言模型,提出一种新颖的弱监督流水线,利用ChatGPT生成高质量问答对以构建FairFaceGPT数据集,显著提升了MLLM在面部相关任务上的性能并达到SOTA,为构建可信赖的以人为中心的AI系统奠定基础,模型和数据集已开源。

Abstract

arXiv:2507.10300v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown remarkable performance in vision-language tasks. However, existing MLLMs are primarily trained on generic datasets, limiting their ability to reason on domain-specific visual cues such as those in facial images. In particular, tasks that require detailed understanding of facial structure, expression, emotion, and demographic features remain underexplored by MLLMs due to the lack of large-scale annotated face image-text datasets. In this work, we introduce FaceLLM, a multimodal large language model trained specifically for facial image understanding. To construct the training data, we propose a novel weakly supervised pipeline that uses ChatGPT with attribute-aware prompts to generate high-quality question-answer pairs based on images from the FairFace dataset. The resulting corpus, called FairFaceGPT, covers a diverse set of attributes including expression, pose, skin texture, and forensic information. Our experiments demonstrate that FaceLLM improves the performance of MLLMs on various face-centric tasks and achieves state-of-the-art performance. This work highlights the potential of synthetic supervision via language models for building domain-specialized MLLMs, and sets a precedent for trustworthy, human-centric multimodal AI systems. FairFaceGPT dataset and pretrained FaceLLM models are publicly available in the project page.

Comment: Criterion: 1

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09068 [page] [pdf] [kimi]

Authors: Dell Zhang, Xiangyu Chen, Jixiang Luo, Mengxi Jia, Changzhi Sun, Ruilong Ren, Jingren Liu, Hao Sun, Xuelong Li

TLDR: 无限视频理解:这是一篇前瞻性论文,提出了“无限视频理解”这一宏大研究目标,旨在解决多模态大语言模型(MLLMs)在处理任意时长视频内容时面临的计算、内存及时间连贯性挑战。它为多媒体和AI社区指明了未来研究方向,包括流式架构、持久记忆机制、分层自适应表征及事件中心推理。

Abstract

arXiv:2507.09068v1 Announce Type: new Abstract: The rapid advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have ushered in remarkable progress in video understanding. However, a fundamental challenge persists: effectively processing and comprehending video content that extends beyond minutes or hours. While recent efforts like Video-XL-2 have demonstrated novel architectural solutions for extreme efficiency, and advancements in positional encoding such as HoPE and VideoRoPE++ aim to improve spatio-temporal understanding over extensive contexts, current state-of-the-art models still encounter significant computational and memory constraints when faced with the sheer volume of visual tokens from lengthy sequences. Furthermore, maintaining temporal coherence, tracking complex events, and preserving fine-grained details over extended periods remain formidable hurdles, despite progress in agentic reasoning systems like Deep Video Discovery. This position paper posits that a logical, albeit ambitious, next frontier for multimedia research is Infinite Video Understanding -- the capability for models to continuously process, understand, and reason about video data of arbitrary, potentially never-ending duration. We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia, and the wider AI, research communities, driving innovation in areas such as streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, event-centric reasoning, and novel evaluation paradigms. Drawing inspiration from recent work on long/ultra-long video understanding and several closely related fields, we outline the core challenges and key research directions towards achieving this transformative capability.

Comment: Criterion: 1

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09071 [page] [pdf] [kimi]

Authors: Tharun Adithya Srikrishnan, Deval Shah, Steven K. Reinhardt

TLDR: BlindSight:本研究提出BlindSight,一种针对大型视觉-语言模型(VLM)推理效率优化的免训练方法!它通过深入分析VLM注意力模式的稀疏性,设计出输入模板感知的注意力稀疏掩码,平均FLOPs减少32%-41%,且准确率仅有微小变化,显著缓解长多模态提示带来的预填充瓶颈,在Qwen2-VL等模型上表现出色。

Abstract

arXiv:2507.09071v1 Announce Type: new Abstract: Large vision-language models (VLMs) enable the joint processing of text and images. However, the inclusion of vision data significantly expands the prompt length. Along with the quadratic complexity of the attention computation, this results in a longer prefill duration. An approach to mitigate this bottleneck is to leverage the inherent sparsity in the attention computation. In our analysis of attention patterns in VLMs, we observe that a substantial portion of layers exhibit minimal cross-image attention, except through attention-sink tokens per image. These sparse attention patterns fall into distinct categories: sink-only, document mask and a hybrid document-sink mask. Based on this, we propose BlindSight: a training-free approach to optimize VLM inference using a input template-aware attention sparsity mask. We utilize samples from a dataset to derive a prompt-agnostic sparsity categorization for every attention head. We evaluate the proposed technique using VLMs such as Qwen2-VL, Qwen2.5-VL and Gemma-3. BlindSight results in a 32%-41% reduction in FLOPs on average with -2%-+2% accuracy compared to the original model in most evaluated multi-image understanding benchmarks.

Comment: Criterion: 1

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10095 [page] [pdf] [kimi]

Authors: Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu

TLDR: FIX-CLIP:一种针对CLIP模型长文本理解能力不足而提出的新型双分支层级对比学习框架,通过双分支训练管线、可学习区域提示和层级特征对齐模块,并利用现有MLLM合成30M长文本图像说明进行训练,显著提升了模型在长短文本检索任务上的SOTA性能,其文本编码器可即插即用于扩散模型处理长文本输入。

Abstract

arXiv:2507.10095v1 Announce Type: new Abstract: CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on under-stream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input.

Comment: Criterion: 1

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09139 [page] [pdf] [kimi]

Authors: Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno

TLDR: PoseLLM:首个基于大语言模型的人体姿态估计框架,通过非线性 MLP 视觉-语言连接器增强跨模态特征融合,在 COCO 验证集上 AP 达到 77.8,超越 LocLLM 0.4 AP,并保持强大的零样本泛化能力。代码已开源。

Abstract

arXiv:2507.09139v1 Announce Type: new Abstract: Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.

Comment: Criterion: 1

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10442 [page] [pdf] [kimi]

Authors: Shivam Chandhok, Wan-Cyuan Fan, Vered Shwartz, Vineeth N Balasubramanian, Leonid Sigal

TLDR: 本文深入分析SoTA视觉语言模型(VLM)在基础视觉任务上的局限性,通过构建一系列超越传统基准的测试,并对比VLM最终响应与从视觉编码器、中间视觉-语言投影及LLM解码器输出的特征探测器性能,揭示了VLM在处理视觉信息时的不足与鲁棒性问题,旨在指导未来VLM改进。

Abstract

arXiv:2507.10442v1 Announce Type: new Abstract: Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable, but, at the same time, lacking some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of VLM response, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder, intermediate vision-language projection and LLM-decoder output. In doing so, we uncover shortcomings in VLMs and make a number of important observations about their capabilities, robustness and how they process visual information. We hope our insights will guide progress in further improving VLMs.

Comment: Criterion: 1, 7

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10213 [page] [pdf] [kimi]

Authors: Shicai Wei, Chunbo Luo, Yang Luo

TLDR: DGL:该论文提出一种名为解耦梯度学习(DGL)的框架,旨在解决多模态学习中模态编码器与模态融合模块之间的优化冲突问题,通过解耦它们的优化过程,避免了梯度干扰,使得多模态模型的各模态性能优于单模态模型,并在多种模态、任务和框架上展现了出色的有效性和通用性,代码已开源!

Abstract

arXiv:2507.10213v1 Announce Type: new Abstract: Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from unimodal loss. Besides, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types of modalities, tasks, and frameworks with dense cross-modal interaction demonstrate the effectiveness and versatility of the proposed DGL. Code is available at \href{https://github.com/shicaiwei123/ICCV2025-GDL}{https://github.com/shicaiwei123/ICCV2025-GDL}

Comment: Criterion: 1

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10015 [page] [pdf] [kimi]

Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto

TLDR: Hyma:该论文提出一种基于超网络的Hypernetwork Model Alignment (Hyma)框架,旨在解决多模态基础模型中单模态模型选择和连接器训练的计算昂贵问题,实现了对N×M种单模态模型组合的连接器联合训练,平均将最优单模态模型对搜索成本降低10倍,同时保持了匹配的性能,极大地提升了多模态模型构建效率!

Abstract

arXiv:2507.10015v1 Announce Type: new Abstract: Foundation multi-modal models are often designed by stitching of multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching process is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal models selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.

Comment: Criterion: 1

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09500 [page] [pdf] [kimi]

Authors: Yiwen Liang, Hui Chen, Yizhe Xiong, Zihan Zhou, Mengyao Lyu, Zijia Lin, Shuaicheng Niu, Sicheng Zhao, Jungong Han, Guiguang Ding

TLDR: ReTA:提出一种可靠的视觉语言模型(VLM)测试时自适应(TTA)方法,旨在解决分布偏移下的可靠性挑战。该方法通过一致性感知熵重加权(CER)构建高质量缓存,并利用多样性驱动分布校准(DDC)实现自适应决策边界,显著提升了VLM在真实世界分布偏移下的性能。

Abstract

arXiv:2507.09500v1 Announce Type: new Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs' performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under challenging real-world distribution shifts.

Comment: Criterion: 1

Relevance: 8 Novelty: 7 Back to [topic] [top]

Back to [top]


Topic 2: 2. Unified Multimodal Large Language Models for Understanding and Generating (1 papers)

ArXiv: 2507.09910 [page] [pdf] [kimi]

Authors: Yadong Qu, Shancheng Fang, Yuxin Wang, Xiaorui Wang, Zhineng Chen, Hongtao Xie, Yongdong Zhang

TLDR: IGD:提出首个面向指令式平面设计的多模态分层生成框架,利用多模态大语言模型(MLLM)进行属性预测、层级排序与布局,并结合扩散模型生成图像内容,实现参数化渲染和图像资产生成,可通过自然语言指令快速生成可编辑的多模态图层,克服现有方法在创意、智能化及文本可读性上的不足,支持端到端训练和复杂平面设计任务的扩展。

Abstract

arXiv:2507.09910v1 Announce Type: new Abstract: Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate multimodal layers with editable flexibility with only natural language instructions. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of MLLM to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. The superior experimental results demonstrate that IGD offers a new solution for graphic design.

Comment: Criterion: 2

Relevance: 9 Novelty: 8 Back to [topic] [top]

Back to [top]


Topic 3: 3. Large Language Models (24 papers)

ArXiv: 2507.09404 [page] [pdf] [kimi]

Authors: Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin

TLDR: 本文提出一种基于缩放定律的系统性方法,用于确定大型基础模型(包括LLM、NMM和LVM)的最佳数据混合比例,该方法能准确预测模型性能并从少量小规模训练中估计参数,为模型训练提供了一种原理性替代方案,避免了昂贵的试错过程。

Abstract

arXiv:2507.09404v1 Announce Type: new Abstract: Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision models (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws allow to derive the optimal domain weights for any target domain under a given training budget ($N$,$D$), providing a principled alternative to costly trial-and-error methods.

Comment: Criterion: 1, 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09937 [page] [pdf] [kimi]

Authors: Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan

TLDR: MemSinks:一种隔离大型语言模型(LLMs)记忆化行为的新范式,旨在解决模型记忆重复序列带来的隐私和版权问题。通过引入序列标识符激活独特的记忆神经元,实现记忆内容的隔离,使其在不损害通用语言能力的情况下更容易被移除。在数十亿参数和令牌规模上验证了其有效隔离和强大的泛化能力。代码已开源。

Abstract

arXiv:2507.09937v1 Announce Type: new Abstract: Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. A popular mitigation strategy is to remove memorized information from specific neurons post-hoc. However, such approaches have shown limited success so far. In a controlled setting, we show that the memorization of natural sequences (those that resemble linguistically plausible text) become mechanistically entangled with general language abilities, thereby becoming challenging to remove post-hoc. In this work, we put forward a new paradigm of MemSinks that promotes isolation of memorization by design. We leverage a sequence identifier that activates a unique set of memorization neurons for each sequence across repetitions. By analyzing the dynamics of learning and forgetting, we argue that MemSinks facilitates isolation of memorized content, making it easier to remove without compromising general language capabilities. We implement MemSinks at the billion-parameter and billion-token scale, and observe both effective isolation and strong generalization. To our knowledge, this is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable. We open-source our code at http://github.com/grghosal/MemSinks.

Comment: Criterion: 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09025 [page] [pdf] [kimi]

Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen

TLDR: Lizard:一个高效的线性化框架,将预训练的 Transformer-based 大型语言模型(LLMs)转化为亚二次方架构,实现无限上下文生成。它通过引入门控线性注意力与滑动窗口注意力的混合机制,解决了 Transformer 的内存和计算瓶颈,支持恒定内存推理和长程泛化,并包含硬件感知加速算法,在语言建模任务中实现接近无损的性能恢复,显著优于现有线性化方法。

Abstract

arXiv:2507.09025v1 Announce Type: new Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.

Comment: Criterion: 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09185 [page] [pdf] [kimi]

Authors: Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov

TLDR: 本文提出一种新颖的微调方法,通过集成梯度识别并剪枝LLM中与数据集特定机制相关的有害神经元,旨在增强泛化能力,在多项选择基准测试中显著超越现有适应方法,提升LLM对新任务和分布的鲁棒性。

Abstract

arXiv:2507.09185v1 Announce Type: new Abstract: Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.

Comment: Criterion: 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09955 [page] [pdf] [kimi]

Authors: Luolin Xiong, Haofen Wang, Xi Chen, Lu Sheng, Yun Xiong, Jingping Liu, Yanghua Xiao, Huajun Chen, Qing-Long Han, Yang Tang

TLDR: DeepSeek:该论文全面回顾了DeepSeek V3和R1系列大语言模型(LLM)的范式转变与技术演进,详细介绍了其低成本、高性能、开源等优势。文章深入探讨了DeepSeek引入的多头潜在注意力(MLA)、混合专家(MoE)、多令牌预测(MTP)和组相对策略优化(GRPO)等创新算法,并分析了其在LLM扩展、训练、推理和系统级优化架构方面的工程突破,展望了未来大模型技术的发展趋势。

Abstract

arXiv:2507.09955v1 Announce Type: new Abstract: DeepSeek, a Chinese Artificial Intelligence (AI) startup, has released their V3 and R1 series models, which attracted global attention due to their low cost, high performance, and open-source advantages. This paper begins by reviewing the evolution of large AI models focusing on paradigm shifts, the mainstream Large Language Model (LLM) paradigm, and the DeepSeek paradigm. Subsequently, the paper highlights novel algorithms introduced by DeepSeek, including Multi-head Latent Attention (MLA), Mixture-of-Experts (MoE), Multi-Token Prediction (MTP), and Group Relative Policy Optimization (GRPO). The paper then explores DeepSeek engineering breakthroughs in LLM scaling, training, inference, and system-level optimization architecture. Moreover, the impact of DeepSeek models on the competitive AI landscape is analyzed, comparing them to mainstream LLMs across various fields. Finally, the paper reflects on the insights gained from DeepSeek innovations and discusses future trends in the technical and engineering development of large AI models, particularly in data, training, and reasoning.

Comment: Criterion: 3

Relevance: 10 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09839 [page] [pdf] [kimi]

Authors: MohammadReza Davari, Utkarsh Garg, Weixin Cai, Eugene Belilovsky

TLDR: 该研究提出一种新颖的黑盒大语言模型(LLM)提示优化框架,通过引入正向强化和反馈多样化机制,增强LLM生成反馈的有效性和效率。此外,论文还首次形式化了持续提示优化(CPO)以应对跨模型版本或API提供商的提示迁移挑战。实验表明,该方法在准确性、收敛速度和计算成本方面显著优于基线。

Abstract

arXiv:2507.09839v1 Announce Type: new Abstract: An increasing number of NLP applications interact with large language models (LLMs) through black-box APIs, making prompt engineering critical for controlling model outputs. While recent Automatic Prompt Optimization (APO) methods iteratively refine prompts using model-generated feedback, textual gradients, they primarily focus on error correction and neglect valuable insights from correct predictions. This limits both their effectiveness and efficiency. In this paper, we propose a novel APO framework centered on enhancing the feedback mechanism. We reinterpret the textual gradient as a form of negative reinforcement and introduce the complementary positive reinforcement to explicitly preserve beneficial prompt components identified through successful predictions. To mitigate the noise inherent in LLM-generated feedback, we introduce a technique called feedback diversification, which aggregates multiple feedback signals, emphasizing consistent, actionable advice while filtering out outliers. Motivated by the rapid evolution and diversity of available LLMs, we also formalize Continual Prompt Optimization (CPO), addressing the practical challenge of efficiently migrating optimized prompts between different model versions or API providers. Our experiments reveal that naive prompt migration often degrades performance due to loss of critical instructions. In contrast, our approach consistently outperforms strong baselines, achieving significant accuracy improvements, faster convergence, and lower computational costs in both standard and migration scenarios.

Comment: Criterion: 3, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10524 [page] [pdf] [kimi]

Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun

TLDR: Mixture-of-Recursions (MoR):针对大型语言模型计算和内存开销,提出一种结合参数共享和自适应计算的新型统一框架,通过动态分配不同递归深度和选择性缓存KV对,显著降低了验证困惑度并提高少样本准确性,在相同训练FLOPs和更小模型尺寸下,实现更高吞吐量。

Abstract

arXiv:2507.10524v1 Announce Type: new Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.

Comment: Criterion: 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.08967 [page] [pdf] [kimi]

Authors: Rongyi Zhu, Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Ting Wang

TLDR: SIMS:《自改进模型操纵》:首个无需外部监督的自改进模型操纵框架,通过自主生成和优化对比样本实现迭代式自改进,显著提升了LLM在推理时的对齐效果和适应性,超越现有方法;该框架为未来推理期LLM对齐研究提供了新方向。

Abstract

arXiv:2507.08967v1 Announce Type: new Abstract: Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.

Comment: Criterion: 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09850 [page] [pdf] [kimi]

Authors: Wei Du, Branislav Kisacanin, George Armstrong, Shubham Toshniwal, Ivan Moshkov, Alexan Ayrapetyan, Sadegh Mahdavi, Dan Zhao, Shizhe Diao, Dragan Masulovic, ..., Sri Yanamandara, Mihir Tandon, Sriram Ananthakrishnan, Vedant Rathi, David Zhang, Joonseok Kang, Leon Luo, Titu Andreescu, Boris Ginsburg, Igor Gitman

TLDR: 论文《人工编写数据就够了吗?在无RL或蒸馏下教授LLM推理能力的挑战》:提出仅用少量(20个)高质量CoT人类编写数据,即可在不依赖强化学习或模型蒸馏的情况下,显著提升基础LLM(Qwen2.5-32B)的推理能力,甚至超越更大的模型(Qwen2.5-Math-72B-Instruct);研究分析了推理数据属性,并已开源人工创作数据集。

Abstract

arXiv:2507.09850v1 Announce Type: new Abstract: Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, previous works demonstrate that even short CoT prompting without fine-tuning is able to improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model \texttt{QwQ-32B-Preview}, we lightly fine-tune the base model \texttt{Qwen2.5-32B}. The resulting model outperforms the much larger \texttt{Qwen2.5-Math-72B-Instruct}, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human annotators, enhanced with prompt engineering, multi-pass editing, and structural guidance. However, neither matches the performance of reasoning model traces, suggesting that certain latent qualities of expert CoT are difficult to replicate. We analyze key properties of reasoning data, such as problem difficulty, diversity, and answer length, that influence reasoning distillation. While challenges remain, we are optimistic that carefully curated human-written CoT, even in small quantities, can activate reasoning behaviors in base models. We release our human-authored dataset across refinement stages and invite further investigation into what makes small-scale reasoning supervision so effective.

Comment: Criterion: 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10532 [page] [pdf] [kimi]

Authors: Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang

TLDR: 该研究揭示了大型语言模型(LLM)通过强化学习(RL)增强的推理能力可能因数据污染而产生不可靠结果;为此,提出了一个完全合成、无数据泄露的算术数据集RandomCalculation,并证明只有准确的奖励信号才能持续提升性能,呼吁在无污染基准上评估RL方法以确保结果可信。

Abstract

arXiv:2507.10532v1 Announce Type: new Abstract: The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.

Comment: Criterion: 3, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09709 [page] [pdf] [kimi]

Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi

TLDR: 这项大规模经验研究发现大型语言模型(LLMs)将高层语义信息编码在低维线性子空间中,并且在更深层或触发结构化推理的提示下这种可分离性更为明显;该几何特性支持开发基于潜在表示的工具,例如通过轻量级MLP分类器作为潜在空间护栏,以高精度检测有害或对抗性提示。

Abstract

arXiv:2507.09709v1 Announce Type: new Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. \baturay{However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors$\unicode{x2013}$even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.

Comment: Criterion: 3

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09406 [page] [pdf] [kimi]

Authors: Santhosh Kumar Ravindran

TLDR: Adversarial Activation Patching:该研究提出了一种新颖的机械可解释性框架,即对抗性激活修补,用于检测和缓解安全对齐Transformer模型(包括LLMs和推及的多模态设置)中的紧急欺骗行为,通过激活修补模拟漏洞并量化欺骗率,为AI安全提供了重要见解和实证研究路线图。

Abstract

arXiv:2507.09406v1 Announce Type: new Abstract: Large language models (LLMs) aligned for safety through techniques like reinforcement learning from human feedback (RLHF) often exhibit emergent deceptive behaviors, where outputs appear compliant but subtly mislead or omit critical information. This paper introduces adversarial activation patching, a novel mechanistic interpretability framework that leverages activation patching as an adversarial tool to induce, detect, and mitigate such deception in transformer-based models. By sourcing activations from "deceptive" prompts and patching them into safe forward passes at specific layers, we simulate vulnerabilities and quantify deception rates. Through toy neural network simulations across multiple scenarios (e.g., 1000 trials per setup), we demonstrate that adversarial patching increases deceptive outputs to 23.9% from a 0% baseline, with layer-specific variations supporting our hypotheses. We propose six hypotheses, including transferability across models, exacerbation in multimodal settings, and scaling effects. An expanded literature review synthesizes over 20 key works in interpretability, deception, and adversarial attacks. Mitigation strategies, such as activation anomaly detection and robust fine-tuning, are detailed, alongside ethical considerations and future research directions. This work advances AI safety by highlighting patching's dual-use potential and provides a roadmap for empirical studies on large-scale models.

Comment: Criterion: 3, 1

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09482 [page] [pdf] [kimi]

Authors: Changli Wang, Rui Wu, Fang Yin

TLDR: ViSP:该框架针对多模态讽刺文本生成,提出了首个多模态讽刺生成数据集M2SaG(包含图像和讽刺文本),并引入了融合PPO和对比学习的生成框架ViSP,通过奖励分数指导生成,提升了讽刺文本质量,并超越了包括大型语言模型在内的所有基线,生成文本具有更高的讽刺得分和事实不一致性。数据集和代码将开源。

Abstract

arXiv:2507.09482v1 Announce Type: new Abstract: Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. % The dataset and code will be publicly available. Our dataset and code will be released at \textit{https://github.com/wclapply/ViSP}.

Comment: Criterion: 3, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09751 [page] [pdf] [kimi]

Authors: Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth

TLDR: 该研究提出一种通过将LLM直接整合到非经典逻辑形式语义的解释函数中,实现了LLM驱动的神经符号推理方法,有效利用LLM的广泛知识同时保留了逻辑的完备性和可靠性,为解决LLM输出的逻辑不一致性提供了坚实的理论框架和实验证据。

Abstract

arXiv:2507.09751v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but they exhibit problems with logical consistency in the output they generate. How can we harness LLMs' broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We provide experimental evidence for the feasibility of the method by evaluating the function using datasets created from several short-form factuality benchmarks. Unlike prior work, our method offers a theoretical framework for neuro-symbolic reasoning that leverages an LLM's knowledge while preserving the underlying logic's soundness and completeness properties.

Comment: Criterion: 3

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10540 [page] [pdf] [kimi]

Authors: Tao Feng, Haozhen Zhang, Zijie Lei, Pengrui Han, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jiaxuan You

TLDR: FusionFactory:一种系统性的大语言模型(LLM)融合框架,旨在整合不同LLM的独特优势,通过利用LLM路由数据,提出查询级、思维级和模型级三种融合策略。该框架在包含14个任务和20个开源LLM的FusionBench路由基准上,一致超越了最佳单一LLM的表现,显著提升了复杂任务处理的性能与效率。

Abstract

arXiv:2507.10540v1 Announce Type: new Abstract: The rapid advancement of large language models (LLMs) has created a vibrant ecosystem of diverse architectures, each with unique strengths due to differences in design, training data, and objectives. However, most applications still rely on a single backend model, limiting coverage of capabilities and leading to inefficiencies in performance and token cost when tackling complex tasks. We highlight an underexploited opportunity: LLM routing data, produced when hosting platforms route diverse queries to different models, which can reveal comparative strengths across tasks. To address this, we propose FusionBench, a comprehensive routing benchmark covering 14 tasks across five domains with 20 open-source LLMs (8B to 671B parameters), capturing 103M tokens and summarizing reusable thought templates from top models. Building on this, we introduce FusionFactory, a systematic fusion framework with three levels: (1) query-level fusion, tailoring routers for each query using both direct responses and reasoning-augmented outputs; (2) thought-level fusion, leveraging abstract templates derived from top-performing LLMs' answers to similar queries; and (3) model-level fusion, transferring capabilities between models via distillation, using top responses or highest judge scores as training data. Experiments show FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with optimal fusion configurations varying by benchmark, demonstrating the value of systematic LLM fusion in harnessing complementary strengths and improving overall performance.

Comment: Criterion: 3

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09616 [page] [pdf] [kimi]

Authors: Ofir Gordon, Ariel Lapid, Elad Cohen, Yarden Yagil, Arnon Netzer, Hai Victor Habi

TLDR: MLoRQ:提出一种新型混合低秩与量化(Mixed Low-Rank and Quantization)方法,用于Transformer模型压缩。通过两阶段优化过程,智能分配每层最优位宽和秩,可与现有量化算法无缝集成。该方法在Vision Transformers上实现了高达15%的性能提升,达到SOTA,显著提高了部署于资源受限边缘设备的效率。

Abstract

arXiv:2507.09616v1 Announce Type: new Abstract: Deploying transformer-based neural networks on resource-constrained edge devices presents a significant challenge. This challenge is often addressed through various techniques, such as low-rank approximation and mixed-precision quantization. In this work, we introduce Mixed Low-Rank and Quantization (MLoRQ), a novel method that integrates both techniques. MLoRQ employs a two-stage optimization process to determine optimal bit-width and rank assignments for each layer, adhering to predefined memory constraints. This process includes: (i) an intra-layer optimization that identifies potentially optimal compression solutions out of all low-rank and quantization combinations; (ii) an inter-layer optimization that assigns bit-width precision and rank to each layer while ensuring the memory constraint is met. An optional final step applies a sequential optimization process using a modified adaptive rounding technique to mitigate compression-induced errors in joint low-rank approximation and quantization. The method is compatible and can be seamlessly integrated with most existing quantization algorithms. MLoRQ shows state-of-the-art results with up to 15% performance improvement, evaluated on Vision Transformers for image classification, object detection, and instance segmentation tasks.

Comment: Criterion: 3

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.08833 [page] [pdf] [kimi]

Authors: Seokmin Ko

TLDR: 本文深入分析LLM微调中广泛使用的LoRA技术,指出其在不同模型架构和训练设置下,速度提升并不总是一致,并探究了其速度限制的根本原因。基于这些发现,论文提出了几种更高效的LLM微调方法,在性能相当或更优的同时,实现更一致的训练速度提升,为优化资源受限下的LLM微调提供了宝贵见解。

Abstract

arXiv:2507.08833v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA) is one of the most widely used techniques for fine-tuning large language models (LLMs). By introducing a small number of trainable low-rank weight matrices, LoRA substantially reduces the number of parameters that need to be updated, offering significant advantages in memory consumption and computational efficiency compared to full fine-tuning. However, we observed that LoRA does not consistently provide speed improvements across all model architectures and training setups. Motivated by this inconsistency, we conduct a comprehensive analysis of LoRA's performance and investigate the underlying factors limiting its speedup. Based on our findings, we propose several methods for more efficient fine-tuning of LLMs. We empirically evaluate these methods and compare them to LoRA, demonstrating that our approach achieves comparable or superior performance while delivering more consistent training speed improvements. Our work offers valuable insights and practical guidelines for practitioners seeking to optimize LLM fine-tuning under resource constraints.

Comment: Criterion: 3

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09931 [page] [pdf] [kimi]

Authors: Yoon Pyo Lee

TLDR: 该论文提出一种通过LoRA微调Gemma-3-1b-it并结合神经元激活模式分析及静默技术,对LLM在核工程领域进行机械可解释性的新方法,成功识别并验证了编码领域知识的稀疏神经元组,显著提升了不透明黑盒模型的透明度,为实现核级AI保障提供了具体途径。

Abstract

arXiv:2507.09931v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) into safety-critical domains, such as nuclear engineering, necessitates a deep understanding of their internal reasoning processes. This paper presents a novel methodology for interpreting how an LLM encodes and utilizes domain-specific knowledge, using a Boiling Water Reactor system as a case study. We adapted a general-purpose LLM (Gemma-3-1b-it) to the nuclear domain using a parameter-efficient fine-tuning technique known as Low-Rank Adaptation. By comparing the neuron activation patterns of the base model to those of the fine-tuned model, we identified a sparse set of neurons whose behavior was significantly altered during the adaptation process. To probe the causal role of these specialized neurons, we employed a neuron silencing technique. Our results demonstrate that while silencing most of these specialized neurons individually did not produce a statistically significant effect, deactivating the entire group collectively led to a statistically significant degradation in task performance. Qualitative analysis further revealed that silencing these neurons impaired the model's ability to generate detailed, contextually accurate technical information. This paper provides a concrete methodology for enhancing the transparency of an opaque black-box model, allowing domain expertise to be traced to verifiable neural circuits. This offers a pathway towards achieving nuclear-grade artificial intelligence (AI) assurance, addressing the verification and validation challenges mandated by nuclear regulatory frameworks (e.g., 10 CFR 50 Appendix B), which have limited AI deployment in safety-critical nuclear operations.

Comment: Criterion: 3

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09846 [page] [pdf] [kimi]

Authors: Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun

TLDR: Schedule-Free方法:深入理解并改进无调度(Schedule-Free, SF)方法在语言模型训练中的应用,分析SF动态,揭示其隐式权重平均特性,无需额外内存开销,并提出改进版SF,增强对动量的鲁棒性,在大批量训练下表现更优,建立SF作为一种实用、可扩展且有理论基础的语言模型训练方法。

Abstract

arXiv:2507.09846v1 Announce Type: new Abstract: As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.

Comment: Criterion: 3

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10445 [page] [pdf] [kimi]

Authors: Chris Madge, Matthew Purver, Massimo Poesio

TLDR: 本文研究了大语言模型(LLMs)在任务型对话中提出澄清问题的能力,并与人类行为进行对比。研究利用新构建的Minecraft对话语料库,发现人类与LLMs在处理指代歧义和任务不确定性时的澄清行为存在显著差异和弱关联性;人类更侧重任务不确定性,LLMs则更倾向指代歧义。研究还表明,不同的推理方法能影响LLM提出问题的频率和相关性,为理解LLM对话推理提供了新视角。

Abstract

arXiv:2507.10445v1 Announce Type: new Abstract: In this work we examine LLMs' ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus -- one for reference and ambiguity in reference, and one for SDRT including clarifications -- into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs' ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.

Comment: Criterion: 3

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10326 [page] [pdf] [kimi]

Authors: Muzhaffar Hazman, Minh-Khoi Pham, Shweta Soundararajan, Goncalo Mordido, Leonardo Custode, David Lynch, Giorgio Cruciata, Yucheng Shi, Hongmeng Song, Wang Chao, Pan Yue, Aleksandar Milenovic, Alexandros Agapitos

TLDR: 本文提出一种基于语法引导进化搜索的两阶段离散提示优化方法,通过结合语法引导遗传编程和局部搜索,解决了现有方法在处理复杂提示和小型LLM时性能下降的问题。该方法在四项领域特定任务中,超越PromptWizard、OPRO和RL-Prompt等SOTA方法,显著提升了小型通用LLM的性能。

Abstract

arXiv:2507.10326v1 Announce Type: new Abstract: Prompt engineering has proven to be a crucial step in leveraging pretrained large language models (LLMs) in solving various real-world tasks. Numerous solutions have been proposed that seek to automate prompt engineering by using the model itself to edit prompts. However, the majority of state-of-the-art approaches are evaluated on tasks that require minimal prompt templates and on very large and highly capable LLMs. In contrast, solving complex tasks that require detailed information to be included in the prompt increases the amount of text that needs to be optimised. Furthermore, smaller models have been shown to be more sensitive to prompt design. To address these challenges, we propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes by searching the space of programmes populated by function compositions of syntactic, dictionary-based and LLM-based prompt-editing functions. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes in an attempt to further fine-tune their performance. Our approach outperforms three state-of-the-art prompt optimisation approaches, PromptWizard, OPRO, and RL-Prompt, on three relatively small general-purpose LLMs in four domain-specific challenging tasks. We also illustrate several examples where these benchmark methods suffer relatively severe performance degradation, while our approach improves performance in almost all task-model combinations, only incurring minimal degradation when it does not.

Comment: Criterion: 3

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10435 [page] [pdf] [kimi]

Authors: Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang

TLDR: ISF:本研究提出了一种诱导子结构过滤(ISF)的视角,深入分析了Transformer模型在文本描述中进行图结构推理的内部机制,揭示了其在处理分子图等复杂归因图时的通用性与高效性,为理解序列模型如何执行图数据子结构提取任务提供了新见解。

Abstract

arXiv:2507.10435v1 Announce Type: new Abstract: Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.

Comment: Criterion: 3

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09394 [page] [pdf] [kimi]

Authors: Nandan Kumar Jha, Brandon Reagen

TLDR: 本文运用随机矩阵理论分析了多头潜在注意力(MLA)在Transformer预训练过程中对模型内部容量的影响,通过Marchenko-Pastur诊断揭示了容量瓶颈和秩坍塌现象,并发现共享旋转嵌入组件可有效缓解谱碎片化并保持表示容量,强调了旋转嵌入应用方式的关键性!

Abstract

arXiv:2507.09394v1 Announce Type: new Abstract: In this work, we study how multi-head latent attention (MLA), a popular strategy for compressing key/value memory, affects a transformer's internal capacity during pretraining. Using a lightweight suite of Marchenko-Pastur (MP) diagnostics, we analyze the spectrum of the $W_{Q}W_{K}^\top$ gram matrix throughout training, comparing three variants: the standard multi-head attention (MHA) baseline, MLA-PreRoPE with rotary applied before compression, and MLA-Decoupled, which shares a single rotary sub-vector across all heads. Our random matrix analysis reveals \textbf{three key findings:} \textbf{ i)} capacity bottlenecks emerge locally: both MHA and MLA-PreRoPE exhibit sharp, early spikes in specific layers that persist and propagate, disrupting the balance between bulk and outlier directions; \textbf{ ii)} these spikes coincide with rank collapse, concentrating the model's expressivity into narrow subspaces; \textbf{ iii)} only the decoupled variant prevents this cascade, maintaining broad spectral support and suppressing outlier formation across layers. These results underscore that \emph{how} rotary embeddings are applied is just as critical as \emph{where} compression occurs. Sharing rotary components across heads mitigates spectral fragmentation and preserves representational capacity.

Comment: Criterion: 3

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10124 [page] [pdf] [kimi]

Authors: Thomas T. Hills

TLDR: 无(主旨是“Could you be wrong”元认知提示词):一种利用元认知提示词“Could you be wrong?”来消除大型语言模型(LLMs)偏见、提升其元认知能力的新策略,该提示词能促使LLMs在回应后产生额外信息,包括其答案原因、错误、偏见、矛盾证据和替代方案,从而实现有效的去偏和提高内容质量,为LLM对齐和推理提供新途径。

Abstract

arXiv:2507.10124v1 Announce Type: new Abstract: Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current models. Strategies developed for debiasing human decision making offer one promising approach as they incorporate an LLM-style prompt intervention designed to bring latent knowledge into awareness during decision making. LLMs trained on vast amounts of information contain information about potential biases, counter-arguments, and contradictory evidence, but that information may only be brought to bear if prompted. Metacognitive prompts developed in the human decision making literature are designed to achieve this, and as I demonstrate here, they show promise with LLMs. The prompt I focus on here is "could you be wrong?" Following an LLM response, this prompt leads LLMs to produce additional information, including why they answered as they did, errors, biases, contradictory evidence, and alternatives, none of which were apparent in their initial response. Indeed, this metaknowledge often reveals that how LLMs and users interpret prompts are not aligned. Here I demonstrate this prompt using a set of questions taken from recent articles about LLM biases, including implicit discriminatory biases and failures of metacognition. "Could you be wrong" prompts the LLM to identify its own biases and produce cogent metacognitive reflection. I also present another example involving convincing but incomplete information, which is readily corrected by the metacognitive prompt. In sum, this work argues that human psychology offers a new avenue for prompt engineering, leveraging a long history of effective prompt-based improvements to human decision making.

Comment: Criterion: 3, 6

Relevance: 8 Novelty: 7 Back to [topic] [top]

Back to [top]


Topic 4: 4. Self-Supervised Learning and Vision-Language Pre-training (4 papers)

ArXiv: 2507.10403 [page] [pdf] [kimi]

Authors: Daniele Rege Cambrin, Lorenzo Vaiani, Giuseppe Gallipoli, Luca Cagliero, Paolo Garza

TLDR: CLOSP:一种用于遥感图像检索的对比语言光学SAR预训练框架,通过文本作为桥梁将非配对光学和SAR图像对齐到统一嵌入空间。该研究引入了新的大型数据集CrisisLandMark(包含超过647,000张Sentinel-1 SAR和Sentinel-2多光谱图像),并在文本到图像检索任务上实现了新SOTA,nDGC比现有模型提升54%。此外,GeoCLOSP集成了地理坐标,进一步提升了检索特定地理特征的能力。

Abstract

arXiv:2507.10403v1 Announce Type: new Abstract: Retrieving relevant imagery from vast satellite archives is crucial for applications like disaster response and long-term climate monitoring. However, most text-to-image retrieval systems are limited to RGB data, failing to exploit the unique physical information captured by other sensors, such as the all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the spectral signatures in optical multispectral data. To bridge this gap, we introduce CrisisLandMark, a new large-scale corpus of over 647,000 Sentinel-1 SAR and Sentinel-2 multispectral images paired with structured textual annotations for land cover, land use, and crisis events harmonized from authoritative land cover systems (CORINE and Dynamic World) and crisis-specific sources. We then present CLOSP (Contrastive Language Optical SAR Pretraining), a novel framework that uses text as a bridge to align unpaired optical and SAR images into a unified embedding space. Our experiments show that CLOSP achieves a new state-of-the-art, improving retrieval nDGC by 54% over existing models. Additionally, we find that the unified training strategy overcomes the inherent difficulty of interpreting SAR imagery by transferring rich semantic knowledge from the optical domain with indirect interaction. Furthermore, GeoCLOSP, which integrates geographic coordinates into our framework, creates a powerful trade-off between generality and specificity: while the CLOSP excels at general semantic tasks, the GeoCLOSP becomes a specialized expert for retrieving location-dependent crisis events and rare geographic features. This work highlights that the integration of diverse sensor data and geographic context is essential for unlocking the full potential of remote sensing archives.

Comment: Criterion: 4

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09574 [page] [pdf] [kimi]

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu

TLDR: MENTOR:一种新颖的自回归(AR)框架,用于高效的多模态条件图像生成。该模型采用两阶段训练范式,实现多模态输入与图像输出之间的细粒度、令牌级对齐,无需辅助适配器或交叉注意力模块。MENTOR在DreamBench++基准测试中表现出色,在概念保留和提示遵循方面超越基线,并声称比扩散模型具有更优的图像重建保真度与训练效率。代码已开源。

Abstract

arXiv:2507.09574v1 Announce Type: new Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

Comment: Criterion: 5, 4

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09487 [page] [pdf] [kimi]

Authors: Changli Wang, Fang Yin, Jiafeng Liu, Rui Wu

TLDR: 本研究提出HMID-Net:首次在双曲空间中整合掩码图像建模(MIM)和知识蒸馏技术,并引入专门的蒸馏损失函数以促进有效知识迁移,在图像分类和检索等一系列下游任务中显著超越现有模型如MERU和CLIP。

Abstract

arXiv:2507.09487v1 Announce Type: new Abstract: Visual and semantic concepts are often structured in a hierarchical manner. For instance, textual concept `cat' entails all images of cats. A recent study, MERU, successfully adapts multimodal learning techniques from Euclidean space to hyperbolic space, effectively capturing the visual-semantic hierarchy. However, a critical question remains: how can we more efficiently train a model to capture and leverage this hierarchy? In this paper, we propose the \textit{Hyperbolic Masked Image and Distillation Network} (HMID-Net), a novel and efficient method that integrates Masked Image Modeling (MIM) and knowledge distillation techniques within hyperbolic space. To the best of our knowledge, this is the first approach to leverage MIM and knowledge distillation in hyperbolic space to train highly efficient models. In addition, we introduce a distillation loss function specifically designed to facilitate effective knowledge transfer in hyperbolic space. Our experiments demonstrate that MIM and knowledge distillation techniques in hyperbolic space can achieve the same remarkable success as in Euclidean space. Extensive evaluations show that our method excels across a wide range of downstream tasks, significantly outperforming existing models like MERU and CLIP in both image classification and retrieval.

Comment: Criterion: 4

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09537 [page] [pdf] [kimi]

Authors: Yangang Ren, Guojian Zhan, Chen Lv, Jun Li, Fenghua Liang, Keqiang Li

TLDR: Plan-MAE:一种针对自动驾驶车辆的预测与规划一体化自监督预训练框架,通过掩码自编码器学习空间关联、社会交互和目的地意图,并结合局部子规划任务,在大规模数据集上显著优于现有规划方法,可作为学习型运动规划器的重要预训练步骤。

Abstract

arXiv:2507.09537v1 Announce Type: new Abstract: Predicting the future of surrounding agents and accordingly planning a safe, goal-directed trajectory are crucial for automated vehicles. Current methods typically rely on imitation learning to optimize metrics against the ground truth, often overlooking how scene understanding could enable more holistic trajectories. In this paper, we propose Plan-MAE, a unified pretraining framework for prediction and planning that capitalizes on masked autoencoders. Plan-MAE fuses critical contextual understanding via three dedicated tasks: reconstructing masked road networks to learn spatial correlations, agent trajectories to model social interactions, and navigation routes to capture destination intents. To further align vehicle dynamics and safety constraints, we incorporate a local sub-planning task predicting the ego-vehicle's near-term trajectory segment conditioned on earlier segment. This pretrained model is subsequently fine-tuned on downstream tasks to jointly generate the prediction and planning trajectories. Experiments on large-scale datasets demonstrate that Plan-MAE outperforms current methods on the planning metrics by a large margin and can serve as an important pre-training step for learning-based motion planner.

Comment: Criterion: 4

Relevance: 8 Novelty: 7 Back to [topic] [top]

Back to [top]


Topic 5: 5. Image Generation (Diffusion, Autoregressive, Tokenizer, etc.) (20 papers)

ArXiv: 2507.09308 [page] [pdf] [kimi]

Authors: Zile Wang, Hao Yu, Jiabo Zhan, Chun Yuan

TLDR: AlphaVAE:提出首个用于RGBA图像重建与生成的统一端到端变分自编码器(VAE),通过扩展预训练RGB VAE并整合专用Alpha通道和复合优化目标,实现了透明图像的高保真合成。该模型仅使用8K图像训练,PSNR较LayerDiffuse提升4.9 dB,SSIM提升3.2%,并在潜扩散框架下展现出卓越生成能力。同时,推出了首个RGBA基准ALPHA。代码、数据和模型已开源。

Abstract

arXiv:2507.09308v1 Announce Type: new Abstract: Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA image) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha representations. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction. It also enables superior transparent image generation when fine-tuned within a latent diffusion framework. Our code, data, and models are released on https://github.com/o0o0o00o0/AlphaVAE for reproducibility.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10072 [page] [pdf] [kimi]

Authors: Meng Yu, Kun Zhan

TLDR: Frequency Regulation:本文提出一种无需训练且即插即用的频率调节机制,旨在缓解扩散模型中的曝光偏差,显著提升生成质量!通过观察到噪声图像能量在扩散过程中以不同频率模式下降的特性,该方法利用小波变换分别调整低频和高频子带,为多种扩散模型提供了鲁棒的曝光偏差解决方案。代码已开源!

Abstract

arXiv:2507.10072v1 Announce Type: new Abstract: Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of the predicted noisy images decreases during the diffusion process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) This energy reduction results in amplitude variations between the network-reconstructed clean data and the real clean data. Based on the first finding, we introduce a frequency-domain regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we provide a more accurate analysis of exposure bias in the two subbands. Our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and providing a robust solution to exposure bias across different model architectures. The source code is available at https://github.com/kunzhan/wpp.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09984 [page] [pdf] [kimi]

Authors: Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee

TLDR: LDMAEs:本文提出一种结合掩码自编码器(MAE)的潜在扩散模型(LDMAEs),通过深入分析自编码器在潜在扩散模型中的关键特性,设计了利用MAE层级特征的变分掩码自编码器(VMAEs)!该方法显著提升了图像生成质量和计算效率,为扩散模型的自编码器设计提供了重要突破!

Abstract

arXiv:2507.09984v1 Announce Type: new Abstract: In spite of remarkable potential of the Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoder. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Through comprehensive experiments, we demonstrate significantly enhanced image generation quality and computational efficiency.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10217 [page] [pdf] [kimi]

Authors: Jeongho Kim, Sunghyun Park, Hyoungwoo Park, Sungrack Yun, Jaegul Choo, Seokeon Cho

TLDR: Wardrobe Polyptych LoRA:一种新颖的部分级别可控人物图像生成模型,通过仅训练LoRA层、利用衣橱条件和空间参考、引入选择性主体区域损失,实现了高保真和一致性合成,推理阶段无需额外参数,在自定义数据集上显著优于现有技术,可实现逼真且身份保持的全身合成。

Abstract

arXiv:2507.10217v1 Announce Type: new Abstract: Recent diffusion models achieve personalization by learning specific subjects, allowing learned attributes to be integrated into generated images. However, personalized human image generation remains challenging due to the need for precise and consistent attribute preservation (e.g., identity, clothing details). Existing subject-driven image generation methods often require either (1) inference-time fine-tuning with few images for each new subject or (2) large-scale dataset training for generalization. Both approaches are computationally expensive and impractical for real-time applications. To address these limitations, we present Wardrobe Polyptych LoRA, a novel part-level controllable model for personalized human image generation. By training only LoRA layers, our method removes the computational burden at inference while ensuring high-fidelity synthesis of unseen subjects. Our key idea is to condition the generation on the subject's wardrobe and leverage spatial references to reduce information loss, thereby improving fidelity and consistency. Additionally, we introduce a selective subject region loss, which encourages the model to disregard some of reference images during training. Our loss ensures that generated images better align with text prompts while maintaining subject integrity. Notably, our Wardrobe Polyptych LoRA requires no additional parameters at the inference stage and performs generation using a single model trained on a few training samples. We construct a new dataset and benchmark tailored for personalized human image generation. Extensive experiments show that our approach significantly outperforms existing techniques in fidelity and consistency, enabling realistic and identity-preserving full-body synthesis.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.08980 [page] [pdf] [kimi]

Authors: Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola

TLDR: REED:一种用于扩散模型的灵活表征引导学习框架,通过联合多模态配对模型和优化训练课程来增强表征对齐,在ImageNet上实现23.3倍训练加速并超越SOTA方法REPA达4倍速度提升,同时在图像、蛋白质序列和分子生成任务上均表现出色。代码已开源。

Abstract

arXiv:2507.08980v1 Announce Type: new Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09733 [page] [pdf] [kimi]

Authors: Bradley Camburn

TLDR: Universal Physics Simulation:首个通用的基础AI物理模拟模型,采用草图引导的扩散Transformer方法,将物理模拟视为条件生成问题,直接从边界条件数据学习物理定律,跳过时间积分步骤,实现了从边界到平衡态的直接映射,SSIM>0.8,开创了AI发现物理的新范式。

Abstract

arXiv:2507.09733v1 Announce Type: new Abstract: We present the first foundational AI model for universal physics simulation that learns physical laws directly from boundary-condition data without requiring a priori equation encoding. Traditional physics-informed neural networks (PINNs) and finite-difference methods necessitate explicit mathematical formulation of governing equations, fundamentally limiting their generalizability and discovery potential. Our sketch-guided diffusion transformer approach reimagines computational physics by treating simulation as a conditional generation problem, where spatial boundary conditions guide the synthesis of physically accurate steady-state solutions. By leveraging enhanced diffusion transformer architectures with novel spatial relationship encoding, our model achieves direct boundary-to-equilibrium mapping and is generalizable to diverse physics domains. Unlike sequential time-stepping methods that accumulate errors over iterations, our approach bypasses temporal integration entirely, directly generating steady-state solutions with SSIM > 0.8 while maintaining sub-pixel boundary accuracy. Our data-informed approach enables physics discovery through learned representations analyzable via Layer-wise Relevance Propagation (LRP), revealing emergent physical relationships without predetermined mathematical constraints. This work represents a paradigm shift from AI-accelerated physics to AI-discovered physics, establishing the first truly universal physics simulation framework.

Comment: Criterion: 5

Relevance: 8 Novelty: 9 Back to [topic] [top]

ArXiv: 2507.09212 [page] [pdf] [kimi]

Authors: Jonas Scholz, Richard E. Turner

TLDR: 本文提出Warm-start模型:一种简单确定性模型,通过提供基于输入上下文的先验信息,大幅加速扩散和流匹配等迭代生成模型的条件生成过程,在图像修复等任务上,仅用11次函数评估即可达到与1000步DDPM基线相当的结果,代码已开源。

Abstract

arXiv:2507.09212v1 Announce Type: new Abstract: Iterative generative models, like diffusion and flow-matching, create high-fidelity samples by progressively refining a noise vector into data. However, this process is notoriously slow, often requiring hundreds of function evaluations. We introduce the warm-start model, a simple, deterministic model that dramatically accelerates conditional generation by providing a better starting point. Instead of starting generation from an uninformed N(0, I) prior, our warm-start model predicts an informed prior N(mu, sigma), whose moments are conditioned on the input context. This "warm start" substantially reduces the distance the generative process must traverse, particularly when the conditioning information is strongly informative. On tasks like image inpainting, our method achieves results competitive with a 1000-step DDPM baseline using only 11 total function evaluations (1 for the warm start, 10 for generation). A simple conditional normalization trick makes our method compatible with any standard generative model and sampler without modification, allowing it to be combined with other efficient sampling techniques for further acceleration. Our implementation is available at https://github.com/jonas-scholz123/warm-start-model.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09574 [page] [pdf] [kimi]

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu

TLDR: MENTOR:一种新颖的自回归(AR)框架,用于高效的多模态条件图像生成。该模型采用两阶段训练范式,实现多模态输入与图像输出之间的细粒度、令牌级对齐,无需辅助适配器或交叉注意力模块。MENTOR在DreamBench++基准测试中表现出色,在概念保留和提示遵循方面超越基线,并声称比扩散模型具有更优的图像重建保真度与训练效率。代码已开源。

Abstract

arXiv:2507.09574v1 Announce Type: new Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

Comment: Criterion: 5, 4

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10547 [page] [pdf] [kimi]

Authors: Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu

TLDR: Quantize-then-Rectify (ReVQ):本文提出一种高效的VQ-VAE训练框架,利用预训练VAE并集成通道多组量化与后校正器,将ImageNet图像压缩至512个token,同时将VQ-VAE训练成本降低两个数量级(单块NVIDIA 4090仅需约22小时),实现训练效率与重建质量的显著提升,为多模态大模型提供关键的视觉tokenizer。

Abstract

arXiv:2507.10547v1 Announce Type: new Abstract: Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present \textbf{Quantize-then-Rectify (ReVQ)}, a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating \textbf{channel multi-group quantization} to enlarge codebook capacity and a \textbf{post rectifier} to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10029 [page] [pdf] [kimi]

Authors: Seokeon Choi, Sunghyun Park, Hyoungwoo Park, Jeongho Kim, Sungrack Yun

TLDR: 该研究提出一种内存高效的文生图扩散模型个性化框架,通过根据扩散步长动态选择低分辨率反向传播(BP-low)和高分辨率零阶优化(ZO-high)策略,有效平衡效率与生成质量,实现高质量设备端个性化而无需增加推理延迟。

Abstract

arXiv:2507.10029v1 Announce Type: new Abstract: Memory-efficient personalization is critical for adapting text-to-image diffusion models while preserving user privacy and operating within the limited computational resources of edge devices. To this end, we propose a selective optimization framework that adaptively chooses between backpropagation on low-resolution images (BP-low) and zeroth-order optimization on high-resolution images (ZO-high), guided by the characteristics of the diffusion process. As observed in our experiments, BP-low efficiently adapts the model to target-specific features, but suffers from structural distortions due to resolution mismatch. Conversely, ZO-high refines high-resolution details with minimal memory overhead but faces slow convergence when applied without prior adaptation. By complementing both methods, our framework leverages BP-low for effective personalization while using ZO-high to maintain structural consistency, achieving memory-efficient and high-quality fine-tuning. To maximize the efficacy of both BP-low and ZO-high, we introduce a timestep-aware probabilistic function that dynamically selects the appropriate optimization strategy based on diffusion timesteps. This function mitigates the overfitting from BP-low at high timesteps, where structural information is critical, while ensuring ZO-high is applied more effectively as training progresses. Experimental results demonstrate that our method achieves competitive performance while significantly reducing memory consumption, enabling scalable, high-quality on-device personalization without increasing inference latency.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09168 [page] [pdf] [kimi]

Authors: Haiming Zhu, Yangyang Xu, Chenshu Xu, Tingrui Shen, Wenxi Liu, Yong Du, Jun Yu, Shengfeng He

TLDR: Stable Score Distillation(SSD):一种用于文本引导图像和3D编辑的简化框架,通过将单个分类器锚定到源提示并利用Classifier-Free Guidance方程实现跨提示对齐,同时引入一个常量空文本分支来稳定优化过程,并额外设置提示增强分支以提高编辑强度,从而在2D和3D编辑任务中实现了最先进的性能、更快的收敛和更低的复杂度。

Abstract

arXiv:2507.09168v1 Announce Type: new Abstract: Text-guided image and 3D editing have advanced with diffusion-based models, yet methods like Delta Denoising Score often struggle with stability, spatial control, and editing strength. These limitations stem from reliance on complex auxiliary structures, which introduce conflicting optimization signals and restrict precise, localized edits. We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process by anchoring a single classifier to the source prompt. Specifically, SSD utilizes Classifier-Free Guidance (CFG) equation to achieves cross-prompt alignment, and introduces a constant term null-text branch to stabilize the optimization process. This approach preserves the original content's structure and ensures that editing trajectories are closely aligned with the source prompt, enabling smooth, prompt-specific modifications while maintaining coherence in surrounding regions. Additionally, SSD incorporates a prompt enhancement branch to boost editing strength, particularly for style transformations. Our method achieves state-of-the-art results in 2D and 3D editing tasks, including NeRF and text-driven style edits, with faster convergence and reduced complexity, providing a robust and efficient solution for text-guided editing.

Comment: Criterion: 5

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10340 [page] [pdf] [kimi]

Authors: Hongjae Lee, Myungjun Son, Dongjea Kang, Seung-Won Jung

TLDR: QLIP:提出一种新颖的量化方法,利用文本嵌入指导文生图扩散模型中每层在每个时间步的比特精度选择,有效降低计算复杂度的同时提升生成图像质量,可无缝集成到现有量化方法中,显著提高扩散模型在资源受限环境下的部署效率。

Abstract

arXiv:2507.10340v1 Announce Type: new Abstract: Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.

Comment: Criterion: 5

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09595 [page] [pdf] [kimi]

Authors: Or Greenberg

TLDR: FLUX架构逆向工程:本报告对Black Forest Labs的SOTA文本到图像生成模型FLUX.1的架构进行了详尽的逆向工程分析。尽管FLUX.1性能卓越,但缺乏官方技术文档,此报告旨在通过开源代码解析其内部结构,为未来研究和开发提供基础,以支持其作为骨干模型的应用。

Abstract

arXiv:2507.09595v1 Announce Type: new Abstract: FLUX.1 is a diffusion-based text-to-image generation model developed by Black Forest Labs, designed to achieve faithful text-image alignment while maintaining high image quality and diversity. FLUX is considered state-of-the-art in text-to-image generation, outperforming popular models such as Midjourney, DALL-E 3, Stable Diffusion 3 (SD3), and SDXL. Although publicly available as open source, the authors have not released official technical documentation detailing the model's architecture or training setup. This report summarizes an extensive reverse-engineering effort aimed at demystifying FLUX's architecture directly from its source code, to support its adoption as a backbone for future research and development. This document is an unofficial technical report and is not published or endorsed by the original developers or their affiliated institutions.

Comment: Criterion: 5

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09915 [page] [pdf] [kimi]

Authors: Siyue Yao, Mingjie Sun, Eng Gee Lim, Ran Yi, Baojiang Zhong, Moncef Gabbouj

TLDR: Crucial-Diff:该研究提出一个领域无关的统一扩散模型Crucial-Diff,用于在数据稀缺场景下合成关键图像及注释,其通过场景无关特征提取器(SAFE)与弱点感知样本挖掘器(WASM)的结合,根据下游模型的反馈生成难以检测的样本,在MVTec数据集上像素级AP达到83.63%,在息肉数据集上mIoU达到81.64%。代码即将开源。

Abstract

arXiv:2507.09915v1 Announce Type: new Abstract: The scarcity of data in various scenarios, such as medical, industry and autonomous driving, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ the generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide "crucial information" that targets the downstream model's weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of downstream model, which is then fused with the output of SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code will be released after acceptance.

Comment: Criterion: 5

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10065 [page] [pdf] [kimi]

Authors: Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Honglei Yan, Katerina Fragkiadaki, Yadong Mu

TLDR: MoVieS:一种新颖的前馈模型,首次实现了运动感知4D动态新视图合成,通过像素对齐的高斯基元网格统一建模外观、几何和运动,能在1秒内完成动态场景渲染,并提供数个数量级的速度提升,同时支持场景流估计和运动目标分割等零样本应用。

Abstract

arXiv:2507.10065v1 Announce Type: new Abstract: We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion. This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.

Comment: Criterion: 5

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09052 [page] [pdf] [kimi]

Authors: Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang

TLDR: 本文引入两种新颖的对比损失函数,用于解决类别不平衡扩散模型中的生成多样性问题,通过 InfoNCE 损失和条件-无条件对齐,显著提升长尾类别图像的生成多样性,并在CIFAR10/100-LT等多个数据集上超越现有方法。

Abstract

arXiv:2507.09052v1 Announce Type: new Abstract: Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity of tail class images without compromising the fidelity and diversity of head class images. We achieve this by introducing two deceptively simple but highly effective contrastive loss functions. Firstly, we employ an unsupervised InfoNCE loss utilizing negative samples to increase the distance/dissimilarity among synthetic images, particularly for tail classes. To further enhance the diversity of tail classes, our second loss is an MSE loss that contrasts class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. Conditional-unconditional alignment has been shown to enhance the performance of long-tailed GAN. We are the first to adapt such alignment to diffusion models. We successfully leveraged contrastive learning for class-imbalanced diffusion models. Our contrastive learning framework is easy to implement and outperforms standard DDPM and alternative methods for class-imbalanced diffusion models across various datasets, including CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT.

Comment: Criterion: 5

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.08956 [page] [pdf] [kimi]

Authors: Zhenghan Fang, Mateo D'iaz, Sam Buchanan, Jeremias Sulam

TLDR: ProxDM:一种新型扩散模型,通过使用近端映射而非传统得分函数,实现了反向随机微分方程的离散化。该方法在理论上具有收敛保证,并在实践中显著加快采样速度,比传统得分匹配方法收敛更快。

Abstract

arXiv:2507.08956v1 Announce Type: new Abstract: Diffusion models have quickly become some of the most popular and powerful generative models for high-dimensional data. The key insight that enabled their development was the realization that access to the score -- the gradient of the log-density at different noise levels -- allows for sampling from data distributions by solving a reverse-time stochastic differential equation (SDE) via forward discretization, and that popular denoisers allow for unbiased estimators of this score. In this paper, we demonstrate that an alternative, backward discretization of these SDEs, using proximal maps in place of the score, leads to theoretical and practical benefits. We leverage recent results in proximal matching to learn proximal operators of the log-density and, with them, develop Proximal Diffusion Models (ProxDM). Theoretically, we prove that $\widetilde{O}(d/\sqrt{\varepsilon})$ steps suffice for the resulting discretization to generate an $\varepsilon$-accurate distribution w.r.t. the KL divergence. Empirically, we show that two variants of ProxDM achieve significantly faster convergence within just a few sampling steps compared to conventional score-matching methods.

Comment: Criterion: 5

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09573 [page] [pdf] [kimi]

Authors: Zhe Wang, Jingbo Zhang, Tianyi Wei, Wanchao Su, Can Wang

TLDR: WordCraft:提出一个交互式艺术字体设计系统,结合扩散模型,引入免训练区域注意力机制和噪声混合技术,支持局部编辑、迭代细化、多字符组合;集成大语言模型解析用户提示,显著增强艺术字体合成的交互性,实现高质量、风格化的多语言字体生成。

Abstract

arXiv:2507.09573v1 Announce Type: new Abstract: Artistic typography aims to stylize input characters with visual effects that are both creative and legible. Traditional approaches rely heavily on manual design, while recent generative models, particularly diffusion-based methods, have enabled automated character stylization. However, existing solutions remain limited in interactivity, lacking support for localized edits, iterative refinement, multi-character composition, and open-ended prompt interpretation. We introduce WordCraft, an interactive artistic typography system that integrates diffusion models to address these limitations. WordCraft features a training-free regional attention mechanism for precise, multi-region generation and a noise blending that supports continuous refinement without compromising visual quality. To support flexible, intent-driven generation, we incorporate a large language model to parse and structure both concrete and abstract user prompts. These components allow our framework to synthesize high-quality, stylized typography across single- and multi-character inputs across multiple languages, supporting diverse user-centered workflows. Our system significantly enhances interactivity in artistic typography synthesis, opening up creative possibilities for artists and designers.

Comment: Criterion: 5

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09524 [page] [pdf] [kimi]

Authors: Yunwei Lan, Zhigao Cui, Xin Luo, Chang Liu, Nian Wang, Menglin Zhang, Yanzhao Su, Dong Liu

TLDR: DehazeSB:一个基于薛定谔桥的新型无监督去雾框架,利用最优传输理论直接连接模糊和清晰图像的分布,实现在更少步骤内生成高质量去雾图像。模型引入了细节保留正则化和新颖的提示学习以利用预训练CLIP模型,并在多个真实世界数据集上展现出卓越性能。代码已开源。

Abstract

arXiv:2507.09524v1 Announce Type: new Abstract: Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator's limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schr"odinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality results. To ensure the consistency of structural information and details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning to leverage pre-trained CLIP models in distinguishing hazy images and clear ones, by learning a haze-aware vision-language alignment. Extensive experiments on multiple real-world datasets demonstrate our method's superiority. Code: https://github.com/ywxjm/DehazeSB.

Comment: Criterion: 5

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10347 [page] [pdf] [kimi]

Authors: Yan-Ting Chen, Hao-Wei Chen, Tsu-Ching Hsiao, Chun-Yi Lee

TLDR: 本文提出一种算法,通过将数值Picard迭代法应用于SO(3)流形上的扩散过程,实现了扩散模型采样的并行加速。该方法在解决姿态模糊问题的现有扩散模型上得到验证,实现了高达4.9倍的速度提升,显著降低了生成单个样本的延迟,且不影响任务奖励。

Abstract

arXiv:2507.10347v1 Announce Type: new Abstract: In this paper, we design an algorithm to accelerate the diffusion process on the $SO(3)$ manifold. The inherently sequential nature of diffusion models necessitates substantial time for denoising perturbed data. To overcome this limitation, we proposed to adapt the numerical Picard iteration for the $SO(3)$ space. We demonstrate our algorithm on an existing method that employs diffusion models to address the pose ambiguity problem. Moreover, we show that this acceleration advantage occurs without any measurable degradation in task reward. The experiments reveal that our algorithm achieves a speed-up of up to 4.9$\times$, significantly reducing the latency for generating a single sample.

Comment: Criterion: 5

Relevance: 8 Novelty: 7 Back to [topic] [top]

Back to [top]


Topic 6: 6. Reinforcement Learning in Large or Multimodal Models & Reasoning During Inference (17 papers)

ArXiv: 2507.10085 [page] [pdf] [kimi]

Authors: Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye

TLDR: CRFT:一种创新的参数高效微调方法,通过识别并优化大语言模型中的关键表征,显著增强其链式思考(CoT)推理能力。该方法在信息流分析基础上动态优化关键表征,冻结基础模型,并在八个算术和常识推理基准上验证了其有效性和效率,特别是在单次学习设置中将准确率提升了16.4%。

Abstract

arXiv:2507.10085v1 Announce Type: new Abstract: Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.

Comment: Criterion: 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09407 [page] [pdf] [kimi]

Authors: Quanyan Zhu

TLDR: LLM-Stackelberg Games:引入LLM-Stackelberg博弈框架,将大型语言模型(LLMs)融入领导者与追随者间的策略交互,提出推理与行为均衡、推测推理均衡等新概念,允许LLMs通过结构化提示进行推理和策略调整,揭示其在网络安全、虚假信息等领域的认知丰富性和对抗潜力。

Abstract

arXiv:2507.09407v1 Announce Type: new Abstract: We introduce the framework of LLM-Stackelberg games, a class of sequential decision-making models that integrate large language models (LLMs) into strategic interactions between a leader and a follower. Departing from classical Stackelberg assumptions of complete information and rational agents, our formulation allows each agent to reason through structured prompts, generate probabilistic behaviors via LLMs, and adapt their strategies through internal cognition and belief updates. We define two equilibrium concepts: reasoning and behavioral equilibrium, which aligns an agent's internal prompt-based reasoning with observable behavior, and conjectural reasoning equilibrium, which accounts for epistemic uncertainty through parameterized models over an opponent's response. These layered constructs capture bounded rationality, asymmetric information, and meta-cognitive adaptation. We illustrate the framework through a spearphishing case study, where a sender and a recipient engage in a deception game using structured reasoning prompts. This example highlights the cognitive richness and adversarial potential of LLM-mediated interactions. Our results show that LLM-Stackelberg games provide a powerful paradigm for modeling decision-making in domains such as cybersecurity, misinformation, and recommendation systems.

Comment: Criterion: 8, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10007 [page] [pdf] [kimi]

Authors: Zijun Chen, Wenbo Hu, Richang Hong

TLDR: 本文提出一种新颖的方法,通过利用模型内在的真值编码来校准思维链(CoT)推理的准确性,发现特定注意力头激活能可靠反映CoT推理步骤的真实性,并据此训练置信度预测器,通过束搜索动态选择最合理路径。该方法在单模态和多模态设置下的数学、符号和常识推理任务中显著优于现有SOTA基线,提供了提升CoT推理可靠性的新途径。

Abstract

arXiv:2507.10007v1 Announce Type: new Abstract: Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities in both large language models (LLMs) and multimodal large language models (MLLMs). However, its reliability is often undermined by the accumulation of errors in intermediate steps. This paper introduces an novel approach to calibrate the CoT reasoning accuracy by leveraging the model's intrinsic veracity encoding. We discover that specific attention head activations reliably reflect the truthfulness of reasoning steps in CoT. Based on this insight, we train a confidence predictor to evaluate the correctness of each reasoning step using these truthfulness-sensitive activations, dynamically selecting the most plausible reasoning path via beam search. Experimental results demonstrate that our method significantly outperforms the state-of-the-art baselines (e.g., Few-Shot CoT, Self-Consistency, and Self-Evaluation Guided Beam Search) across the mathematical, symbolic, and commonsense reasoning tasks, exhibiting superior accuracy and reliability in both unimodal and multimodal settings. We further validate the approach on large reasoning models, confirming its applicability to specialized reasoning models. Additionally, we explore the role of the model's self-correction ability in CoT reasoning. This work provides a novel reliability improvement path for CoT reasoning with broad application potential.

Comment: Criterion: 1, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09495 [page] [pdf] [kimi]

Authors: Hang Wang, Junshan Zhang

TLDR: Generative-RL Agent Perspective:本文提出一种基于生成式AI的多智能体强化学习新范式,旨在将智能体从反应式转变为主动式!该方法将智能体重塑为能够合成复杂多智能体动态、预测其他智能体行为并生成协调动作序列的生成模型,从而实现前瞻性决策、无缝协调与动态适应。有望为分布式智能、自主系统、机器人和人机协作带来突破!

Abstract

arXiv:2507.09495v1 Announce Type: new Abstract: Multi-agent reinforcement learning faces fundamental challenges that conventional approaches have failed to overcome: exponentially growing joint action spaces, non-stationary environments where simultaneous learning creates moving targets, and partial observability that constrains coordination. Current methods remain reactive, employing stimulus-response mechanisms that fail when facing novel scenarios. We argue for a transformative paradigm shift from reactive to proactive multi-agent intelligence through generative AI-based reinforcement learning. This position advocates reconceptualizing agents not as isolated policy optimizers, but as sophisticated generative models capable of synthesizing complex multi-agent dynamics and making anticipatory decisions based on predictive understanding of future interactions. Rather than responding to immediate observations, generative-RL agents can model environment evolution, predict other agents' behaviors, generate coordinated action sequences, and engage in strategic reasoning accounting for long-term dynamics. This approach leverages pattern recognition and generation capabilities of generative AI to enable proactive decision-making, seamless coordination through enhanced communication, and dynamic adaptation to evolving scenarios. We envision this paradigm shift will unlock unprecedented possibilities for distributed intelligence, moving beyond individual optimization toward emergent collective behaviors representing genuine collaborative intelligence. The implications extend across autonomous systems, robotics, and human-AI collaboration, promising solutions to coordination challenges intractable under traditional reactive frameworks.

Comment: Criterion: 6, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09839 [page] [pdf] [kimi]

Authors: MohammadReza Davari, Utkarsh Garg, Weixin Cai, Eugene Belilovsky

TLDR: 该研究提出一种新颖的黑盒大语言模型(LLM)提示优化框架,通过引入正向强化和反馈多样化机制,增强LLM生成反馈的有效性和效率。此外,论文还首次形式化了持续提示优化(CPO)以应对跨模型版本或API提供商的提示迁移挑战。实验表明,该方法在准确性、收敛速度和计算成本方面显著优于基线。

Abstract

arXiv:2507.09839v1 Announce Type: new Abstract: An increasing number of NLP applications interact with large language models (LLMs) through black-box APIs, making prompt engineering critical for controlling model outputs. While recent Automatic Prompt Optimization (APO) methods iteratively refine prompts using model-generated feedback, textual gradients, they primarily focus on error correction and neglect valuable insights from correct predictions. This limits both their effectiveness and efficiency. In this paper, we propose a novel APO framework centered on enhancing the feedback mechanism. We reinterpret the textual gradient as a form of negative reinforcement and introduce the complementary positive reinforcement to explicitly preserve beneficial prompt components identified through successful predictions. To mitigate the noise inherent in LLM-generated feedback, we introduce a technique called feedback diversification, which aggregates multiple feedback signals, emphasizing consistent, actionable advice while filtering out outliers. Motivated by the rapid evolution and diversity of available LLMs, we also formalize Continual Prompt Optimization (CPO), addressing the practical challenge of efficiently migrating optimized prompts between different model versions or API providers. Our experiments reveal that naive prompt migration often degrades performance due to loss of critical instructions. In contrast, our approach consistently outperforms strong baselines, achieving significant accuracy improvements, faster convergence, and lower computational costs in both standard and migration scenarios.

Comment: Criterion: 3, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09279 [page] [pdf] [kimi]

Authors: Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

TLDR: Prompt4Trust:首个面向多模态大语言模型(MLLM)置信度校准的强化学习提示增强框架,通过轻量级LLM生成上下文感知辅助提示,提升下游MLLM的预测准确性与置信度校准,在PMC-VQA医学视觉问答基准上实现SOTA性能,并展示对更大MLLM的零样本泛化能力,有效增强MLLM在安全关键场景下的可信赖性,代码已开源。

Abstract

arXiv:2507.09279v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/vccrl-llm.

Comment: Criterion: 1, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.08838 [page] [pdf] [kimi]

Authors: Xiaohang Tang, Rares Dolga, Sangwoong Yoon, Ilija Bogunovic

TLDR: wd1:针对基于扩散的语言模型(dLLM)的推理能力,提出一种新颖的加权策略优化方法,通过将目标函数重构为加权似然,仅需单次近似即可进行策略优化,在无SFT数据下,推理基准上性能优于现有RL方法高达16%,并实现训练时间缩短和函数评估次数减少。

Abstract

arXiv:2507.08838v1 Announce Type: new Abstract: Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of dLLMs likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and lead to potentially large bias -- particularly when approximation errors occur in the denominator of policy ratios used for importance sampling. To mitigate these issues, we introduce $\mathtt{wd1}$, a novel policy optimization approach that reformulates the objective as a weighted likelihood, requiring only a single approximation for the current parametrized policy likelihood. Experiments on widely used reasoning benchmarks demonstrate that $\mathtt{wd1}$, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs, achieving up to 16% higher accuracy. $\mathtt{wd1}$ delivers additional computational gains, including reduced training time and fewer function evaluations (NFEs) per gradient step. These findings, combined with the simplicity of method's implementation and R1-Zero-like training (no SFT), position $\mathtt{wd1}$ as a more effective and efficient method for applying RL to dLLMs reasoning.

Comment: Criterion: 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09638 [page] [pdf] [kimi]

Authors: Pawitsapak Akarajaradwong, Chompakorn Chaksangchaichot, Pirat Pothavorn, Attapol Thamrongrattanarit-Rutherford, Ekapol Chuangsuwanich, Sarana Nutanong

TLDR: GRPO:该论文提出组相对策略优化(GRPO)方法,利用BGE-M3嵌入作为高效语义相似度奖励,在NitiBench基准上将检索增强生成(RAG)系统在泰语法律问答中的引文F1分数提升90%,并比指令微调效果更优,显著增强复杂法律推理任务的鲁棒性,同时计算成本降低2.5倍。

Abstract

arXiv:2507.09638v1 Announce Type: new Abstract: The Retrieval-Augmented Generation (RAG) systems' performance on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach aligning LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, significantly reducing computational expenses up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains from the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.

Comment: Criterion: 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10532 [page] [pdf] [kimi]

Authors: Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang

TLDR: 该研究揭示了大型语言模型(LLM)通过强化学习(RL)增强的推理能力可能因数据污染而产生不可靠结果;为此,提出了一个完全合成、无数据泄露的算术数据集RandomCalculation,并证明只有准确的奖励信号才能持续提升性能,呼吁在无污染基准上评估RL方法以确保结果可信。

Abstract

arXiv:2507.10532v1 Announce Type: new Abstract: The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.

Comment: Criterion: 3, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09482 [page] [pdf] [kimi]

Authors: Changli Wang, Rui Wu, Fang Yin

TLDR: ViSP:该框架针对多模态讽刺文本生成,提出了首个多模态讽刺生成数据集M2SaG(包含图像和讽刺文本),并引入了融合PPO和对比学习的生成框架ViSP,通过奖励分数指导生成,提升了讽刺文本质量,并超越了包括大型语言模型在内的所有基线,生成文本具有更高的讽刺得分和事实不一致性。数据集和代码将开源。

Abstract

arXiv:2507.09482v1 Announce Type: new Abstract: Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. % The dataset and code will be publicly available. Our dataset and code will be released at \textit{https://github.com/wclapply/ViSP}.

Comment: Criterion: 3, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09662 [page] [pdf] [kimi]

Authors: Jason Zhu, Hongyu Li

TLDR: 综述《Towards Concise and Adaptive Thinking in Large Reasoning Models》:深入探讨大型推理模型(LRMs)中简洁自适应推理的最新进展,旨在解决长链式思维(CoT)推理冗余问题。该研究概述了现有方法、基准,并指明未来方向,对于提升LRM推理效率和实用性至关重要,帮助研究者快速理解并启发新思路。

Abstract

arXiv:2507.09662v1 Announce Type: new Abstract: Large reasoning models (LRMs) like OpenAI o1 and DeepSeek R1 have demonstrated impressive performance on complex reasoning tasks like mathematics and programming with long Chain-of-Thought (CoT) reasoning sequences (slow-thinking), compared with traditional large language models (fast-thinking). However, these reasoning models also face a huge challenge that generating unnecessarily lengthy and redundant reasoning chains even for trivial questions. This phenomenon leads to a significant waste of inference resources, increases the response time for simple queries, and hinders the practical application of LRMs in real-world products. To this end, it is crucial to shorten lengthy reasoning chains and learn adaptive reasoning between fast and slow thinking based on input difficulty. In this survey, we provide a comprehensive overview of recent progress in concise and adaptive thinking for efficient reasoning of LRMs, including methodologies, benchmarks, and challenges for future exploration. We hope this survey can help researchers quickly understand the landscape of this field and inspire novel adaptive thinking ideas to facilitate better usage of LRMs.

Comment: Criterion: 6

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09104 [page] [pdf] [kimi]

Authors: Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen

TLDR: CompassJudger-2:提出一种通用评判模型,通过任务驱动、多领域数据策略和可验证奖励,结合拒绝采样与边际策略梯度损失,显著提升了LLM评判模型的鲁棒性和泛化能力;同时推出JudgerBenchV2综合基准,其7B模型性能匹敌大型模型。

Abstract

arXiv:2507.09104v1 Announce Type: new Abstract: Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.

Comment: Criterion: 6

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09016 [page] [pdf] [kimi]

Authors: Karim Galliamov, Ivan Titov, Ilya Pershin

TLDR: 该研究通过引入人类注视建模来增强RLHF效率,提出注视感知奖励模型和基于注视的稀疏奖励分配策略,实验证明能加速收敛并保持或略微提升性能,从而显著降低RLHF的计算成本。

Abstract

arXiv:2507.09016v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences but is computationally expensive. We explore two approaches that leverage human gaze modeling to enhance RLHF: (1) gaze-aware reward models and (2) gaze-based distribution of sparse rewards at token level. Our experiments demonstate that gaze-informed RLHF achieves faster convergence while maintaining or slightly improving performance, thus, reducing computational costs during policy optimization. These results show that human gaze provides a valuable and underused signal for policy optimization, pointing to a promising direction for improving RLHF efficiency.

Comment: Criterion: 6

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09884 [page] [pdf] [kimi]

Authors: Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, Wentao Zhang

TLDR: VerifyBench:该研究提出VerifyBench,一个跨领域综合基准,用于系统评估大语言模型(LLM)推理验证器的性能,尤其关注其在增强LLM推理能力上的作用。它构建了包含4000个专家级问题的数据集,涵盖数学、物理、化学和生物等领域,揭示了专业验证器与通用LLM评判模型在准确性、召回率、输入结构敏感性及跨领域泛化能力上的权衡与局限性。

Abstract

arXiv:2507.09884v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency of model-generated responses and reference answers, since these responses are often lengthy, diverse, and nuanced. Rule-based verifiers struggle with complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research primarily focuses on building better verifiers, yet a systematic evaluation of different types of verifiers' performance across domains remains lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench--a cross-domain comprehensive benchmark for systematically evaluating verifiers. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses. The reliability of the evaluation is ensured through a rigorous annotation process conducted by a multidisciplinary expert team. We design a four-dimensional experimental framework to comprehensively compare the performance boundaries of specialized verifiers and general LLMs under combined conditions of extracted answers vs. complete responses, and short vs. long outputs. Our evaluation uncovers fundamental trade-offs in verifiers: while specialized verifiers achieve leading accuracy, they exhibit deficiencies in recall; general models show stronger inclusivity but unstable precision. More importantly, we discover verifiers' high sensitivity to input structure and inherent limitations in cross-domain generalization, providing critical insights into the bottlenecks of current verifier technology.

Comment: Criterion: 6, 7

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09477 [page] [pdf] [kimi]

Authors: Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu

TLDR: Towards Agentic RAG with Deep Reasoning:该综述论文深入探讨了结合检索增强生成(RAG)与深度推理的系统,以解决大型语言模型(LLMs)在多步推理和事实性方面的不足。它提出了推理增强RAG、RAG增强推理以及协同RAG-推理框架等分类,并重点关注代理LLM如何迭代地交织搜索与推理,为LLMs的智能体能力与推理提供了全面视角,未来将扩展到多模态适应性。

Abstract

arXiv:2507.09477v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different type supply missing premises and expand context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.

Comment: Criterion: 6, 8

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09854 [page] [pdf] [kimi]

Authors: Aniruddha Chattopadhyay, Raj Dandekar, Kaushik Roy

TLDR: Model-Grounded Symbolic AI Systems:提出一种将指令微调大语言模型重塑为模型接地符号AI系统的新范式,结合神经-符号AI机制,利用自然语言作为符号层,通过模型内部表示空间实现接地,旨在提升学习效率和推理可靠性,并在公理化演绎推理中进行初步评估。

Abstract

arXiv:2507.09854v1 Announce Type: new Abstract: Neurosymbolic artificial intelligence (AI) systems combine neural network and classical symbolic AI mechanisms to exploit the complementary strengths of large scale, generalizable learning and robust, verifiable reasoning. Numerous classifications of neurosymbolic AI illustrate how these two components can be integrated in distinctly different ways. In this work, we propose reinterpreting instruction tuned large language models as model grounded symbolic AI systems where natural language serves as the symbolic layer and grounding is achieved through the models internal representation space. Within this framework, we investigate and develop novel learning and reasoning approaches that preserve structural similarities to traditional learning and reasoning paradigms. Preliminary evaluations across axiomatic deductive reasoning procedures of varying complexity provide insights into the effectiveness of our approach in improving learning efficiency and reasoning reliability.

Comment: Criterion: 6

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10124 [page] [pdf] [kimi]

Authors: Thomas T. Hills

TLDR: 无(主旨是“Could you be wrong”元认知提示词):一种利用元认知提示词“Could you be wrong?”来消除大型语言模型(LLMs)偏见、提升其元认知能力的新策略,该提示词能促使LLMs在回应后产生额外信息,包括其答案原因、错误、偏见、矛盾证据和替代方案,从而实现有效的去偏和提高内容质量,为LLM对齐和推理提供新途径。

Abstract

arXiv:2507.10124v1 Announce Type: new Abstract: Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current models. Strategies developed for debiasing human decision making offer one promising approach as they incorporate an LLM-style prompt intervention designed to bring latent knowledge into awareness during decision making. LLMs trained on vast amounts of information contain information about potential biases, counter-arguments, and contradictory evidence, but that information may only be brought to bear if prompted. Metacognitive prompts developed in the human decision making literature are designed to achieve this, and as I demonstrate here, they show promise with LLMs. The prompt I focus on here is "could you be wrong?" Following an LLM response, this prompt leads LLMs to produce additional information, including why they answered as they did, errors, biases, contradictory evidence, and alternatives, none of which were apparent in their initial response. Indeed, this metaknowledge often reveals that how LLMs and users interpret prompts are not aligned. Here I demonstrate this prompt using a set of questions taken from recent articles about LLM biases, including implicit discriminatory biases and failures of metacognition. "Could you be wrong" prompts the LLM to identify its own biases and produce cogent metacognitive reflection. I also present another example involving convincing but incomplete information, which is readily corrected by the metacognitive prompt. In sum, this work argues that human psychology offers a new avenue for prompt engineering, leveraging a long history of effective prompt-based improvements to human decision making.

Comment: Criterion: 3, 6

Relevance: 8 Novelty: 7 Back to [topic] [top]

Back to [top]


Topic 7: 7. Evaluation Sets and Datasets for Multimodal Large Models (12 papers)

ArXiv: 2507.09650 [page] [pdf] [kimi]

Authors: Lily Hong Zhang (Wes), Smitha Milli (Wes), Karen Jusko (Wes), Jonathan Smith (Wes), Brandon Amos (Wes), Wassim (Wes), Bouaziz, Manon Revel, Jack Kussman, Lisa Titus, Bhaktipriya Radharapu, Jane Yu, Vidya Sarma, Kris Rose, Maximilian Nickel

TLDR: Community Alignment:新提出“Community Alignment”数据集,旨在解决LLM服务多元用户偏好的挑战!该工作通过大规模多语言人类研究揭示LLM偏好与人类差异,并引入负相关采样新范式显著提升模型对异构偏好的学习能力,数据集包含来自五国的近20万个比较,是迄今最大、最具代表性的多语言多轮偏好数据集,代码已开源。

Abstract

arXiv:2507.09650v1 Announce Type: new Abstract: How can large language models (LLMs) serve users with varying preferences that may conflict across cultural, political, or other dimensions? To advance this challenge, this paper establishes four key results. First, we demonstrate, through a large-scale multilingual human study with representative samples from five countries (N=15,000), that humans exhibit significantly more variation in preferences than the responses of 21 state-of-the-art LLMs. Second, we show that existing methods for preference dataset collection are insufficient for learning the diversity of human preferences even along two of the most salient dimensions of variability in global values, due to the underlying homogeneity of candidate responses. Third, we argue that this motivates the need for negatively-correlated sampling when generating candidate sets, and we show that simple prompt-based techniques for doing so significantly enhance the performance of alignment methods in learning heterogeneous preferences. Fourth, based on this novel candidate sampling approach, we collect and open-source Community Alignment, the largest and most representative multilingual and multi-turn preference dataset to date, featuring almost 200,000 comparisons from annotators spanning five countries. We hope that the Community Alignment dataset will be a valuable resource for improving the effectiveness of LLMs for a diverse global population.

Comment: Criterion: 7

Relevance: 9 Novelty: 9 Back to [topic] [top]

ArXiv: 2507.10548 [page] [pdf] [kimi]

Authors: Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi

TLDR: EmRACE-3K:提出首个用于评估VLM具身推理能力的EmRACE-3K数据集,包含3000多项语言引导任务,在Unreal Engine构建的逼真环境中涵盖导航、物体操作和多阶段目标执行,为具身智能研究提供重要基准,并展示了监督学习结合强化学习微调Qwen2.5-VL-7B的显著效果。

Abstract

arXiv:2507.10548v1 Announce Type: new Abstract: Recent advanced vision-language models(VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.

Comment: Criterion: 7, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09693 [page] [pdf] [kimi]

Authors: Jiali Chen, Yujie Jia, Zihan Wu, Jinyu Yang, Jianpeng Chen, Xusen Hei, Jiayuan Xie, Yi Cai, Qing Li

TLDR: ExpStar:一种用于多学科科学实验自动评论生成的模型,通过构建首个包含7K步级评论的ExpInstruct数据集,并利用检索增强机制自适应地访问和利用外部知识。该模型在实验中显著超越了14个主流大型多模态模型(LMMs),在视频理解和精细评论生成方面表现出卓越能力,有望推动AI辅助科学教学。

Abstract

arXiv:2507.09693v1 Announce Type: new Abstract: Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) We construct \textit{ExpInstruct}, the first dataset tailored for experiment commentary generation, featuring over 7\textit{K} step-level commentaries across 21 scientific subjects from 3 core disciplines (\ie, science, healthcare and engineering). Each sample includes procedural descriptions along with potential scientific principles (\eg, chemical equations and physical laws) and safety guidelines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.

Comment: Criterion: 1, 7

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09313 [page] [pdf] [kimi]

Authors: Yueqian Wang, Xiaojun Meng, Yifan Wang, Huishuai Zhang, Dongyan Zhao

TLDR: ProactiveBench:首个用于评估视频大语言模型中主动交互能力的综合基准,提出了PAUC指标以更准确地衡量实时响应的时间动态,通过用户研究证明其与人类偏好更一致,优于传统评估方法。代码已开源。

Abstract

arXiv:2507.09313v1 Announce Type: new Abstract: With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to be more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveBench, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveBench and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveBench

Comment: Criterion: 7

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09491 [page] [pdf] [kimi]

Authors: Yiyang Zhou, Linjie Li, Shi Qiu, Zhengyuan Yang, Yuyang Zhao, Siwei Han, Yangfan He, Kangqi Li, Haonian Ji, Zihao Zhao, Haibo Tong, Lijuan Wang, Huaxiu Yao

TLDR: GLIMPSE:一个新颖的基准测试,旨在评估大视觉语言模型(LVLMs)是否能真正进行视频深度理解而非仅依赖静态帧。该基准包含3,269个视频和4,342个视觉中心问题,涵盖轨迹分析、时间推理、取证检测等11个类别,旨在挑战模型进行全面视频上下文推理,揭示了当前LVLM在视频理解能力上的显著不足,即使GPT-o3也仅达到66.43%的准确率。

Abstract

arXiv:2507.09491v1 Announce Type: new Abstract: Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context-this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.

Comment: Criterion: 7

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09815 [page] [pdf] [kimi]

Authors: Younggun Kim, Ahmed S. Abdelrahman, Mohamed Abdel-Aty

TLDR: VRU-Accident:提出一个大型视觉-语言基准数据集,包含1K个真实世界行车记录仪事故视频、6K多选问答对和1K密集场景描述,专为评估多模态大语言模型在涉及易受伤害道路使用者事故场景中的理解与推理能力而设计,揭示现有MLLM在原因分析方面存在挑战。

Abstract

arXiv:2507.09815v1 Announce Type: new Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, is a critical challenge for autonomous driving systems, as crashes involving VRUs often result in severe or fatal consequences. While multimodal large language models (MLLMs) have shown promise in enhancing scene understanding and decision making in autonomous vehicles, there is currently no standardized benchmark to quantitatively evaluate their reasoning abilities in complex, safety-critical scenarios involving VRUs. To address this gap, we present VRU-Accident, a large-scale vision-language benchmark designed to evaluate MLLMs in high-risk traffic scenarios involving VRUs. VRU-Accident comprises 1K real-world dashcam accident videos, annotated with 6K multiple-choice question-answer pairs across six safety-critical categories (with 24K candidate options and 3.4K unique answer choices), as well as 1K dense scene descriptions. Unlike prior works, our benchmark focuses explicitly on VRU-vehicle accidents, providing rich, fine-grained annotations that capture both spatial-temporal dynamics and causal semantics of accidents. To assess the current landscape of MLLMs, we conduct a comprehensive evaluation of 17 state-of-the-art models on the multiple-choice VQA task and on the dense captioning task. Our findings reveal that while MLLMs perform reasonably well on visually grounded attributes, they face significant challenges in reasoning and describing accident causes, types, and preventability.

Comment: Criterion: 7

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09862 [page] [pdf] [kimi]

Authors: Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li

TLDR: SpeakerVid-5M:首个大规模、高质量的音视频数据集,专为双人交互式虚拟人生成设计,总时长超过8743小时,包含520万个视频片段。数据集按交互类型和质量分层,并提供基于AR的视频聊天基线模型及VidChatBench基准测试集。数据集和处理代码将公开。

Abstract

arXiv:2507.09862v1 Announce Type: new Abstract: The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/

Comment: Criterion: 7

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10449 [page] [pdf] [kimi]

Authors: Hongyong Han, Wei Wang, Gaowei Zhang, Mingjie Li, Yi Wang

TLDR: CoralVQA:首个大规模珊瑚礁视觉问答数据集,包含12,805张图像和277,653个问答对,旨在全面评估大型视觉语言模型在珊瑚礁图像理解中的生态及健康相关推理能力,并通过半自动化构建流程确保专业级质量,揭示了当前SOTA LVLM的局限性与机遇。

Abstract

arXiv:2507.10449v1 Announce Type: new Abstract: Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.

Comment: Criterion: 7

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10541 [page] [pdf] [kimi]

Authors: Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu

TLDR: REST:提出一个针对大型推理模型(LRMs)的压力测试框架,通过同时暴露模型于多个问题来评估其在上下文优先级分配、跨问题干扰抵抗和动态认知负载管理方面的能力,揭示了DeepSeek-R1等SOTA模型在压力下性能显著下降,且比现有基准具有更强判别力。

Abstract

arXiv:2507.10541v1 Announce Type: new Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting critical limitations: (1) vulnerability to data contamination and less challenging (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing costly and perpetual creation of new questions with large human efforts, (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that concurrently exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) the models trained with "long2short" technique preserve more accuracy of their single-problem performance under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.

Comment: Criterion: 7

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10106 [page] [pdf] [kimi]

Authors: Harshal Nandigramwar, Syed Qutub, Kay-Ulrich Scholl

TLDR: BlueGlass:该研究提出一个名为BlueGlass的复合AI安全框架,旨在整合和组合多种安全工具,支持对VLM进行安全评估,并展示了针对视觉语言模型在目标检测任务上的三种安全分析(分布评估、层动态分析、稀疏自编码器概念识别),为构建更稳健可靠的AI系统奠定基础。

Abstract

arXiv:2507.10106v1 Announce Type: new Abstract: As AI systems become increasingly capable and ubiquitous, ensuring the safety of these systems is critical. However, existing safety tools often target different aspects of model safety and cannot provide full assurance in isolation, highlighting a need for integrated and composite methodologies. This paper introduces BlueGlass, a framework designed to facilitate composite AI safety workflows by providing a unified infrastructure enabling the integration and composition of diverse safety tools that operate across model internals and outputs. Furthermore, to demonstrate the utility of this framework, we present three safety-oriented analyses on vision-language models for the task of object detection: (1) distributional evaluation, revealing performance trade-offs and potential failure modes across distributions; (2) probe-based analysis of layer dynamics highlighting shared hierarchical learning via phase transition; and (3) sparse autoencoders identifying interpretable concepts. More broadly, this work contributes foundational infrastructure and findings for building more robust and reliable AI systems.

Comment: Criterion: 7

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09884 [page] [pdf] [kimi]

Authors: Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, Wentao Zhang

TLDR: VerifyBench:该研究提出VerifyBench,一个跨领域综合基准,用于系统评估大语言模型(LLM)推理验证器的性能,尤其关注其在增强LLM推理能力上的作用。它构建了包含4000个专家级问题的数据集,涵盖数学、物理、化学和生物等领域,揭示了专业验证器与通用LLM评判模型在准确性、召回率、输入结构敏感性及跨领域泛化能力上的权衡与局限性。

Abstract

arXiv:2507.09884v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency of model-generated responses and reference answers, since these responses are often lengthy, diverse, and nuanced. Rule-based verifiers struggle with complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research primarily focuses on building better verifiers, yet a systematic evaluation of different types of verifiers' performance across domains remains lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench--a cross-domain comprehensive benchmark for systematically evaluating verifiers. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses. The reliability of the evaluation is ensured through a rigorous annotation process conducted by a multidisciplinary expert team. We design a four-dimensional experimental framework to comprehensively compare the performance boundaries of specialized verifiers and general LLMs under combined conditions of extracted answers vs. complete responses, and short vs. long outputs. Our evaluation uncovers fundamental trade-offs in verifiers: while specialized verifiers achieve leading accuracy, they exhibit deficiencies in recall; general models show stronger inclusivity but unstable precision. More importantly, we discover verifiers' high sensitivity to input structure and inherent limitations in cross-domain generalization, providing critical insights into the bottlenecks of current verifier technology.

Comment: Criterion: 6, 7

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10442 [page] [pdf] [kimi]

Authors: Shivam Chandhok, Wan-Cyuan Fan, Vered Shwartz, Vineeth N Balasubramanian, Leonid Sigal

TLDR: 本文深入分析SoTA视觉语言模型(VLM)在基础视觉任务上的局限性,通过构建一系列超越传统基准的测试,并对比VLM最终响应与从视觉编码器、中间视觉-语言投影及LLM解码器输出的特征探测器性能,揭示了VLM在处理视觉信息时的不足与鲁棒性问题,旨在指导未来VLM改进。

Abstract

arXiv:2507.10442v1 Announce Type: new Abstract: Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable, but, at the same time, lacking some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of VLM response, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder, intermediate vision-language projection and LLM-decoder output. In doing so, we uncover shortcomings in VLMs and make a number of important observations about their capabilities, robustness and how they process visual information. We hope our insights will guide progress in further improving VLMs.

Comment: Criterion: 1, 7

Relevance: 9 Novelty: 7 Back to [topic] [top]

Back to [top]


Topic 8: 8. AI Agents & Embodied Intelligence (especially involving LLMs/MLLMs) (17 papers)

ArXiv: 2507.10548 [page] [pdf] [kimi]

Authors: Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi

TLDR: EmRACE-3K:提出首个用于评估VLM具身推理能力的EmRACE-3K数据集,包含3000多项语言引导任务,在Unreal Engine构建的逼真环境中涵盖导航、物体操作和多阶段目标执行,为具身智能研究提供重要基准,并展示了监督学习结合强化学习微调Qwen2.5-VL-7B的显著效果。

Abstract

arXiv:2507.10548v1 Announce Type: new Abstract: Recent advanced vision-language models(VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.

Comment: Criterion: 7, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09407 [page] [pdf] [kimi]

Authors: Quanyan Zhu

TLDR: LLM-Stackelberg Games:引入LLM-Stackelberg博弈框架,将大型语言模型(LLMs)融入领导者与追随者间的策略交互,提出推理与行为均衡、推测推理均衡等新概念,允许LLMs通过结构化提示进行推理和策略调整,揭示其在网络安全、虚假信息等领域的认知丰富性和对抗潜力。

Abstract

arXiv:2507.09407v1 Announce Type: new Abstract: We introduce the framework of LLM-Stackelberg games, a class of sequential decision-making models that integrate large language models (LLMs) into strategic interactions between a leader and a follower. Departing from classical Stackelberg assumptions of complete information and rational agents, our formulation allows each agent to reason through structured prompts, generate probabilistic behaviors via LLMs, and adapt their strategies through internal cognition and belief updates. We define two equilibrium concepts: reasoning and behavioral equilibrium, which aligns an agent's internal prompt-based reasoning with observable behavior, and conjectural reasoning equilibrium, which accounts for epistemic uncertainty through parameterized models over an opponent's response. These layered constructs capture bounded rationality, asymmetric information, and meta-cognitive adaptation. We illustrate the framework through a spearphishing case study, where a sender and a recipient engage in a deception game using structured reasoning prompts. This example highlights the cognitive richness and adversarial potential of LLM-mediated interactions. Our results show that LLM-Stackelberg games provide a powerful paradigm for modeling decision-making in domains such as cybersecurity, misinformation, and recommendation systems.

Comment: Criterion: 8, 6

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09495 [page] [pdf] [kimi]

Authors: Hang Wang, Junshan Zhang

TLDR: Generative-RL Agent Perspective:本文提出一种基于生成式AI的多智能体强化学习新范式,旨在将智能体从反应式转变为主动式!该方法将智能体重塑为能够合成复杂多智能体动态、预测其他智能体行为并生成协调动作序列的生成模型,从而实现前瞻性决策、无缝协调与动态适应。有望为分布式智能、自主系统、机器人和人机协作带来突破!

Abstract

arXiv:2507.09495v1 Announce Type: new Abstract: Multi-agent reinforcement learning faces fundamental challenges that conventional approaches have failed to overcome: exponentially growing joint action spaces, non-stationary environments where simultaneous learning creates moving targets, and partial observability that constrains coordination. Current methods remain reactive, employing stimulus-response mechanisms that fail when facing novel scenarios. We argue for a transformative paradigm shift from reactive to proactive multi-agent intelligence through generative AI-based reinforcement learning. This position advocates reconceptualizing agents not as isolated policy optimizers, but as sophisticated generative models capable of synthesizing complex multi-agent dynamics and making anticipatory decisions based on predictive understanding of future interactions. Rather than responding to immediate observations, generative-RL agents can model environment evolution, predict other agents' behaviors, generate coordinated action sequences, and engage in strategic reasoning accounting for long-term dynamics. This approach leverages pattern recognition and generation capabilities of generative AI to enable proactive decision-making, seamless coordination through enhanced communication, and dynamic adaptation to evolving scenarios. We envision this paradigm shift will unlock unprecedented possibilities for distributed intelligence, moving beyond individual optimization toward emergent collective behaviors representing genuine collaborative intelligence. The implications extend across autonomous systems, robotics, and human-AI collaboration, promising solutions to coordination challenges intractable under traditional reactive frameworks.

Comment: Criterion: 6, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09041 [page] [pdf] [kimi]

Authors: Andrew Wagenmaker, Zhiyuan Zhou, Sergey Levine

TLDR: Behavioral Exploration:提出一种训练智能体通过上下文适应学习探索行为的新范式,利用长上下文生成模型预测专家动作,基于历史交互上下文实现快速在线适应和目标导向的专家式探索,在模拟和真实机器人操作任务中展现出强大的自适应探索能力。

Abstract

arXiv:2507.09041v1 Announce Type: new Abstract: Developing autonomous agents that quickly explore an environment and adapt their behavior online is a canonical challenge in robotics and machine learning. While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with such capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose behavioral exploration: training agents to internalize what it means to explore and adapt in-context over the space of expert'' behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how exploratory'' the expert's behaviors are relative to this context. This enables the model to not only mimic the behavior of an expert, but also, by feeding its past history of interactions into its context, to select different expert behaviors than what have been previously selected, thereby allowing for fast online adaptation and targeted, ``expert-like'' exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.

Comment: Criterion: 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.08885 [page] [pdf] [kimi]

Authors: Baining Zhao, Rongze Tang, Mingyuan Jia, Ziyou Wang, Fanghang Man, Xin Zhang, Yu Shang, Weichen Zhang, Chen Gao, Wei Wu, Xin Wang, Xinlei Chen, Yong Li

TLDR: AirScape:首个为六自由度空中智能体设计的生成式世界模型,旨在使其能够预测自身运动意图的未来观察结果,为此构建了包含11k视频-意图对的大规模数据集,并开发了两阶段训练方案,使其具备受运动意图控制并遵守物理时空约束的空间想象能力。

Abstract

arXiv:2507.08885v1 Announce Type: new Abstract: How to enable robots to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore more general spatial imagination capabilities, here we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct an dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase training schedule to train a foundation model -- initially devoid of embodied spatial knowledge -- into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints.

Comment: Criterion: 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09160 [page] [pdf] [kimi]

Authors: Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, Yang Gao

TLDR: Tactile-VLA:该研究提出一种新型框架Tactile-VLA,深度融合视觉、语言、动作和触觉传感,通过混合位置-力控制器和推理模块,将VLA模型意图转化为精确物理动作,激活VLM的物理知识以实现接触密集任务的零样本泛化,大幅提升了机器人通用性。

Abstract

arXiv:2507.09160v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown remarkable achievements, driven by the rich implicit knowledge of their vision-language components. However, achieving generalist robotic agents demands precise grounding into physical interactions, especially in contact-rich scenarios where fine-grained force control is essential. We advance VLAs' implicit knowledge beyond identifying what to do, towards guiding how to physically interact with real world. This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. This framework incorporates a hybrid position-force controller to translate the model's intentions into precise physical actions and a reasoning module that allows the robot to adapt its strategy based on tactile feedback. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects: (1) enabling tactile-aware instruction following, (2) utilizing tactile-relevant commonsense, and (3) facilitating adaptive tactile-involved reasoning. A key finding is that the VLM's prior knowledge already contains semantic understanding of physical interaction; by connecting it to the robot's tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks.

Comment: Criterion: 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09985 [page] [pdf] [kimi]

Authors: Samson Yu, Kelvin Lin, Harold Soh

TLDR: Octopi-1.5:该研究展示了最新的视觉-触觉-语言模型Octopi-1.5,它能处理多物体部分的触觉信号并结合RAG模块,显著提升了触觉推理与实时学习能力,通过TMI手持触觉界面无需机器人即可实时交互演示,代码已开源。

Abstract

arXiv:2507.09985v1 Announce Type: new Abstract: Touch is recognized as a vital sense for humans and an equally important modality for robots, especially for dexterous manipulation, material identification, and scenarios involving visual occlusion. Building upon very recent work in touch foundation models, this demonstration will feature Octopi-1.5, our latest visual-tactile-language model. Compared to its predecessor, Octopi-1.5 introduces the ability to process tactile signals from multiple object parts and employs a simple retrieval-augmented generation (RAG) module to improve performance on tasks and potentially learn new objects on-the-fly. The system can be experienced live through a new handheld tactile-enabled interface, the TMI, equipped with GelSight and TAC-02 tactile sensors. This convenient and accessible setup allows users to interact with Octopi-1.5 without requiring a robot. During the demonstration, we will showcase Octopi-1.5 solving tactile inference tasks by leveraging tactile inputs and commonsense knowledge. For example, in a Guessing Game, Octopi-1.5 will identify objects being grasped and respond to follow-up queries about how to handle it (e.g., recommending careful handling for soft fruits). We also plan to demonstrate Octopi-1.5's RAG capabilities by teaching it new items. With live interactions, this demonstration aims to highlight both the progress and limitations of VTLMs such as Octopi-1.5 and to foster further interest in this exciting field. Code for Octopi-1.5 and design files for the TMI gripper are available at https://github.com/clear-nus/octopi-1.5.

Comment: Criterion: 1, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10284 [page] [pdf] [kimi]

Authors: Venkat Margapuri

TLDR: PIRL:一种新颖的提示信息强化学习方法,通过集成大型语言模型(GPT-3.5)的零样本推理和上下文学习能力,动态塑造强化学习的奖励函数,以优化无人机视觉覆盖路径规划。该方法显著提升了OpenAI Gym中高达14%和Webots中高达27%的视觉覆盖率,同时提高了电池效率和降低了冗余,为机器人领域中RL与自然语言先验的融合指明了新方向。

Abstract

arXiv:2507.10284v1 Announce Type: new Abstract: Visual coverage path planning with unmanned aerial vehicles (UAVs) requires agents to strategically coordinate UAV motion and camera control to maximize coverage, minimize redundancy, and maintain battery efficiency. Traditional reinforcement learning (RL) methods rely on environment-specific reward formulations that lack semantic adaptability. This study proposes Prompt-Informed Reinforcement Learning (PIRL), a novel approach that integrates the zero-shot reasoning ability and in-context learning capability of large language models with curiosity-driven RL. PIRL leverages semantic feedback from an LLM, GPT-3.5, to dynamically shape the reward function of the Proximal Policy Optimization (PPO) RL policy guiding the agent in position and camera adjustments for optimal visual coverage. The PIRL agent is trained using OpenAI Gym and evaluated in various environments. Furthermore, the sim-to-real-like ability and zero-shot generalization of the agent are tested by operating the agent in Webots simulator which introduces realistic physical dynamics. Results show that PIRL outperforms multiple learning-based baselines such as PPO with static rewards, PPO with exploratory weight initialization, imitation learning, and an LLM-only controller. Across different environments, PIRL outperforms the best-performing baseline by achieving up to 14% higher visual coverage in OpenAI Gym and 27% higher in Webots, up to 25% higher battery efficiency, and up to 18% lower redundancy, depending on the environment. The results highlight the effectiveness of LLM-guided reward shaping in complex spatial exploration tasks and suggest a promising direction for integrating natural language priors into RL for robotics.

Comment: Criterion: 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.08851 [page] [pdf] [kimi]

Authors: Simon Schwaiger, Stefan Thalhammer, Wilfried W"ober, Gerald Steinbauer-Wagner

TLDR: OTAS:《开放词汇令牌对齐户外分割》:提出首个开放词汇令牌对齐方法,直接从预训练视觉模型输出令牌中提取语义结构,并通过语言进行接地,实现了零样本的户外开放词汇分割,解决了传统方法的局限性;该方法运行速度达17fps,在TartanAir数据集的3D分割上IoU提升高达151%,适用于机器人规划控制,代码即将开源。

Abstract

arXiv:2507.08851v1 Announce Type: new Abstract: Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Current vision-language mapping approaches rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct semantic class boundaries. We propose OTAS - an Open-vocabulary Token Alignment method for Outdoor Segmentation. OTAS overcomes the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pretrained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates zero-shot, without scene-specific fine-tuning, and runs at up to ~17 fps. OTAS provides a minor IoU improvement over fine-tuned and open-vocabulary 2D segmentation methods on the Off-Road Freespace Detection dataset. Our model achieves up to a 151% IoU improvement over open-vocabulary mapping methods in 3D segmentation on TartanAir. Real-world reconstructions demonstrate OTAS' applicability to robotic applications. The code and ROS node will be made publicly available upon paper acceptance.

Comment: Criterion: 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10539 [page] [pdf] [kimi]

Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

TLDR: GWM:提出一种图世界模型,它能同时支持非结构化和图结构化数据,并融合多模态信息,将多样化任务表示为动作。GWM通过统一的多模态令牌或嵌入空间上的通用消息传递算法,在多模态生成与匹配、推荐、图预测、多智能体、检索增强生成及规划优化等六个任务上,性能超越或媲美领域特定基线,并展现出强大的零/少样本能力,代码已开源。

Abstract

arXiv:2507.10539v1 Announce Type: new Abstract: World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. Our code for GWM is released at https://github.com/ulab-uiuc/GWM.

Comment: Criterion: 1, 8

Relevance: 9 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.10500 [page] [pdf] [kimi]

Authors: Kyungtae Han, Yitao Chen, Rohit Gupta, Onur Altintas

TLDR: SC-ADAS:一种模块化框架,通过集成大型语言模型(LLMs)、视觉到文本解释和结构化函数调用等生成式AI组件,赋能实时、可解释且自适应的场景感知对话式高级驾驶辅助系统(ADAS)。该系统在CARLA模拟器中实现多轮对话和基于视觉/传感器上下文的ADSL控制,支持自然语言推荐和驾驶员确认的ADAS控制指令执行,展示了LLM赋能具身智能在复杂驾驶环境中的潜力。

Abstract

arXiv:2507.10500v1 Announce Type: new Abstract: While autonomous driving technologies continue to advance, current Advanced Driver Assistance Systems (ADAS) remain limited in their ability to interpret scene context or engage with drivers through natural language. These systems typically rely on predefined logic and lack support for dialogue-based interaction, making them inflexible in dynamic environments or when adapting to driver intent. This paper presents Scene-Aware Conversational ADAS (SC-ADAS), a modular framework that integrates Generative AI components including large language models, vision-to-text interpretation, and structured function calling to enable real-time, interpretable, and adaptive driver assistance. SC-ADAS supports multi-turn dialogue grounded in visual and sensor context, allowing natural language recommendations and driver-confirmed ADAS control. Implemented in the CARLA simulator with cloud-based Generative AI, the system executes confirmed user intents as structured ADAS commands without requiring model fine-tuning. We evaluate SC-ADAS across scene-aware, conversational, and revisited multi-turn interactions, highlighting trade-offs such as increased latency from vision-based context retrieval and token growth from accumulated dialogue history. These results demonstrate the feasibility of combining conversational reasoning, scene perception, and modular ADAS control to support the next generation of intelligent driver assistance.

Comment: Criterion: 8

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09329 [page] [pdf] [kimi]

Authors: Matous Kozak, Roshanak Zilouchian Moghaddam, Siva Sivaraman

TLDR: 针对LLM编码智能体的安全性评估:首次系统性分析了LLM编码智能体在软件开发中的不安全行为,评估了GPT-4o等五种模型在93个真实世界任务上的表现,发现21%的轨迹包含不安全动作,并提出了高精度检测系统,GPT-4.1在缓解成功率上达96.8%。

Abstract

arXiv:2507.09329v1 Announce Type: new Abstract: LLM-based coding agents are rapidly being deployed in software development, yet their security implications remain poorly understood. These agents, while capable of accelerating software development, may inadvertently introduce insecure practices. We conducted the first systematic security evaluation of autonomous coding agents, analyzing over 12,000 actions across five state-of-the-art models (GPT-4o, GPT-4.1, Claude variants) on 93 real-world software setup tasks. Our findings reveal significant security concerns: 21% of agent trajectories contained insecure actions, with models showing substantial variation in security behavior. We developed a high-precision detection system that identified four major vulnerability categories, with information exposure (CWE-200) being the most prevalent one. We also evaluated mitigation strategies including feedback mechanisms and security reminders with various effectiveness between models. GPT-4.1 demonstrated exceptional security awareness with 96.8% mitigation success. Our work provides the first comprehensive framework for evaluating coding agent security and highlights the need for security-aware design of next generation LLM-based coding agents.

Comment: Criterion: 8

Relevance: 8 Novelty: 8 Back to [topic] [top]

ArXiv: 2507.09477 [page] [pdf] [kimi]

Authors: Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu

TLDR: Towards Agentic RAG with Deep Reasoning:该综述论文深入探讨了结合检索增强生成(RAG)与深度推理的系统,以解决大型语言模型(LLMs)在多步推理和事实性方面的不足。它提出了推理增强RAG、RAG增强推理以及协同RAG-推理框架等分类,并重点关注代理LLM如何迭代地交织搜索与推理,为LLMs的智能体能力与推理提供了全面视角,未来将扩展到多模态适应性。

Abstract

arXiv:2507.09477v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different type supply missing premises and expand context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.

Comment: Criterion: 6, 8

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.08831 [page] [pdf] [kimi]

Authors: Josh Qixuan Sun, Xiaoying Xing, Huaiyuan Weng, Chul Min Yeum, Mark Crowley

TLDR: VIL:提出一种用于连续环境中视觉-语言导航(VLNCE)的视角不变后训练策略。该方法利用对比学习构建稀疏且视角不变的特征,并引入教师-学生框架以增强现有导航策略的鲁棒性。VIL在V2-VLNCE场景下,R2R-CE和RxR-CE数据集的成功率提升8-15%,并在RxR-CE数据集上达到SOTA,可作为即插即用方法。

Abstract

arXiv:2507.08831v1 Announce Type: new Abstract: Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.

Comment: Criterion: 8

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10087 [page] [pdf] [kimi]

Authors: Muhammad Tayyab Khan, Ammar Waheed

TLDR: Foundation Model Driven Robotics:这篇综述深入探讨了大型语言模型(LLMs)和视觉语言模型(VLMs)在机器人领域的应用,涵盖了感知、规划、控制和人机交互的显著进展。论文通过对模拟驱动设计、开放世界执行和模拟到现实迁移等应用进行系统性综合,强调了集成系统级策略,并讨论了关键瓶颈和未来研究方向,旨在通过更鲁棒、可解释和具身化的模型弥合语义推理与物理智能之间的鸿沟。

Abstract

arXiv:2507.10087v1 Announce Type: new Abstract: The rapid emergence of foundation models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs), has introduced a transformative paradigm in robotics. These models offer powerful capabilities in semantic understanding, high-level reasoning, and cross-modal generalization, enabling significant advances in perception, planning, control, and human-robot interaction. This critical review provides a structured synthesis of recent developments, categorizing applications across simulation-driven design, open-world execution, sim-to-real transfer, and adaptable robotics. Unlike existing surveys that emphasize isolated capabilities, this work highlights integrated, system-level strategies and evaluates their practical feasibility in real-world environments. Key enabling trends such as procedural scene generation, policy generalization, and multimodal reasoning are discussed alongside core bottlenecks, including limited embodiment, lack of multimodal data, safety risks, and computational constraints. Through this lens, this paper identifies both the architectural strengths and critical limitations of foundation model-based robotics, highlighting open challenges in real-time operation, grounding, resilience, and trust. The review concludes with a roadmap for future research aimed at bridging semantic reasoning and physical intelligence through more robust, interpretable, and embodied models.

Comment: Criterion: 8

Relevance: 9 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.09174 [page] [pdf] [kimi]

Authors: Shuo Yang, Zijian Yu, Zhenzhe Ying, Yuqin Dai, Guoqing Wang, Jun Lan, Jinfeng Xu, Jinze Li, Edith C. H. Ngai

TLDR: RAMA:一种用于多模态事实核查的检索增强多智能体框架,通过战略性查询制定、跨验证证据聚合以及多模态大语言模型的多智能体集成架构,有效解决了多模态虚假信息检测中模棱两可或缺乏上下文的难题。实验证明其在基准数据集上表现优越,代码已开源。

Abstract

arXiv:2507.09174v1 Announce Type: new Abstract: The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.

Comment: Criterion: 8

Relevance: 8 Novelty: 7 Back to [topic] [top]

ArXiv: 2507.10134 [page] [pdf] [kimi]

Authors: Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida

TLDR: FRSICL:一种基于LLM的上下文学习方案,用于无人机辅助野火监测中的飞行资源分配,它利用LLM的上下文学习和自然语言指令进行实时飞行控制和数据收集调度,解决传统DRL方法的局限性,并在模拟中超越PPO和Nearest-Neighbor基线。

Abstract

arXiv:2507.10134v1 Announce Type: new Abstract: Unmanned Aerial Vehicles (UAVs) are vital for public safety, particularly in wildfire monitoring, where early detection minimizes environmental impact. In UAV-Assisted Wildfire Monitoring (UAWM) systems, joint optimization of sensor transmission scheduling and velocity is critical for minimizing Age of Information (AoI) from stale sensor data. Deep Reinforcement Learning (DRL) has been used for such optimization; however, its limitations such as low sampling efficiency, simulation-to-reality gaps, and complex training render it unsuitable for time-critical applications like wildfire monitoring. This paper introduces a new online Flight Resource Allocation scheme based on LLM-Enabled In-Context Learning (FRSICL) to jointly optimize the UAV's flight control and data collection schedule along the trajectory in real time, thereby asymptotically minimizing the average AoI across ground sensors. In contrast to DRL, FRSICL generates data collection schedules and controls velocity using natural language task descriptions and feedback from the environment, enabling dynamic decision-making without extensive retraining. Simulation results confirm the effectiveness of the proposed FRSICL compared to Proximal Policy Optimization (PPO) and Nearest-Neighbor baselines.

Comment: Criterion: 8

Relevance: 8 Novelty: 7 Back to [topic] [top]

Back to [top]


Go beyond (0 papers)

Back to [top]