Xingchen Wan

Senior Research Scientist, Google


1600 Amphitheatre Parkway

Mountain View, CA 94043

xingchenw[at]google.com

I am a Senior Research Scientist at Google based in the San Francisco Bay Area.

Research Interests

My research focuses on large language models (LLMs), with the goal of building systems that are more efficient, robust, and autonomous. My contributions span:

  • LLM Post-training (e.g., [1, 2, 3]);
  • Developing self-improving (multimodal) LLM agents (e.g., [4, 5, 6]);
  • Automated optimization techniques for LLMs and agents (e.g., [7, 8, 9]); and
  • Integrating GenAI with large-scale (unstructured) data systems (e.g., [10, 11]).

I have authored over 30 papers, including more than 20 published in top peer-reviewed conferences (e.g., NeurIPS, ICML, ICLR) and journals (e.g., JMLR, TACL) with more than 1200 citations as of Nov. 2025.

Previously, I completed my PhD in the Machine Learning Research Group, Department of Engineering Science, University of Oxford, where I worked on Bayesian optimization, AutoML, and machine learning on graphs.

Academic Services

Area chair/senior program committee member at NeurIPS (2024-25), ICML (2025), ACL ARR (2025-); Action editor at TMLR.

Reviewer/program committee member at ACL (2023-24), AutoML-Conf (2023-24), COLM (2024), CVPR (2024), ECCV (2024), EMNLP (2023-24), ICLR (2024-25), ICML (2023-24), JMLR, Machine Learning, NeurIPS (2022-23), WACV (2022-24), etc.

News

Oct 17, 2025 We present VISTA and Maestro, two self-improving multimodal generation agents for text-to-video and text-to-image generation, respectively.
May 19, 2025 We present Visual Planning, where we apply reinforcement learning post-training on pure-vision models to achieve state-of-the-art performance in visual reasoning tasks.

Selected Publications

  1. VISTA: A Test-Time Self-Improving Video Generation Agent
    Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, and Sercan Ö. Arık
    arXiv preprint arXiv:2510.15831, 2025
  2. Maestro: Self-Improving Text-to-Image Generation via Agent Orchestration
    Xingchen Wan, Han Zhou, Ruoxi Sun, Hootan Nakhost, Ke Jiang, Rajarishi Sinha, and Sercan Ö. Arık
    arXiv preprint arXiv:2509.10704, 2025
  3. Visual Planning: Let’s Think Only with Images
    Yi Xu*, Chengzu Li*, Han Zhou*, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vulić
    arXiv preprint arXiv:2505.11409, 2025. 🏆🄉 #3 paper of the day at HuggingFace 🤗
  4. Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
    Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö. Arık
    arXiv preprint arXiv:2502.02533, 2025
  5. ICLR 2025
    From Few to Many: Self-Improving Many-Shot Reasoners Through Iterative Optimization and Generation
    Xingchen Wan, Han Zhou, Ruoxi Sun, Hootan Nakhost, Ke Jiang, and Sercan Ö. Arık
    In The Thirteenth International Conference on Learning Representations, 2025
  6. ACL 2025
    Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
    Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan Ö. Arık
    In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025
  7. NeurIPS 2024
    Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization
    Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan Ö. Arık
    In Advances in Neural Information Processing Systems 37, 2024. ☁️ Powers the Google Cloud Vertex AI Prompt Optimizer