<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"><channel><title><![CDATA[RoboPapers]]></title><description><![CDATA[Chris Paxton & Michael Cho geek out over robotic papers with paper authors. <br/><br/><a href="https://itcanthink.substack.com?utm_medium=podcast">itcanthink.substack.com</a>]]></description><link>https://itcanthink.substack.com/podcast</link><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 21:15:01 GMT</lastBuildDate><atom:link href="https://api.substack.com/feed/podcast/2883266.rss" rel="self" type="application/rss+xml"/><author><![CDATA[Chris Paxton and Michael Cho]]></author><copyright><![CDATA[Chris Paxton]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[itcanthink@substack.com]]></webMaster><itunes:new-feed-url>https://api.substack.com/feed/podcast/2883266.rss</itunes:new-feed-url><itunes:author>Chris Paxton and Michael Cho</itunes:author><itunes:subtitle>Robotics and AI; the future we&apos;re building and how we&apos;ll get there</itunes:subtitle><itunes:type>episodic</itunes:type><itunes:owner><itunes:name>Chris Paxton and Michael Cho</itunes:name><itunes:email>itcanthink@substack.com</itunes:email></itunes:owner><itunes:explicit>No</itunes:explicit><itunes:category text="Technology"/><itunes:category text="Science"/><itunes:image href="https://substackcdn.com/feed/podcast/2883266/f928778615871704988ed827dec56683.jpg"/><item><title><![CDATA[RoboPapers Episode 4: Vision Language Models are In-Context Value Learners ]]></title><description><![CDATA[<p><em>Note: this is an old episode of  </em><a target="_blank" href="https://x.com/robopapers"><em>RoboPapers</em></a><em>. I am uploading old episodes of our podcast here to make it easier to find them and to find our reference materials. 
Please follow the podcast on </em><a target="_blank" href="https://www.youtube.com/@RoboPapers"><em>YouTube</em></a><em> or on </em><a target="_blank" href="https://open.spotify.com/show/3U0Ed7poaOElItEyUPkuto"><em>Spotify</em></a><em>.</em></p><p>Value prediction is an important robotics problem, wherein we can determine how useful a state is to the successful execution of a task in the future. The key insight that Jason and his co-authors showed in this paper was simple:</p><p>* Large vision-language models already encode a ton of information about task completion from being trained on a wide range of human image and video data</p><p>* We can use these <em>on robotics tasks</em> to estimate progress, which is a useful surrogate for the “value” of any individual state.</p><p>Here’s the abstract:</p><p>Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such a progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can scale and generalize. To address these challenges, we present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. 
Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and value-weighted regression -- all without any model training or finetuning.</p><p><a target="_blank" href="https://www.youtube.com/watch?v=EtwmJlDhHlY&#38;t=4s">Watch on YouTube</a></p><p><a target="_blank" href="https://open.spotify.com/episode/4SPFVlhIFdg4vsWXZFh3ke?si=nB24sGeyTJupWRvi9NrzOw">Listen on Spotify</a></p><p><a target="_blank" href="https://x.com/RoboPapers/status/1905379309939724339">Original post on Twitter/X</a></p><p><a target="_blank" href="https://generative-value-learning.github.io/">Project Site</a></p> <br/><br/>This is a public episode. 
If you would like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://itcanthink.substack.com?utm_medium=podcast&#38;utm_campaign=CTA_1">itcanthink.substack.com</a>]]></description><link>https://itcanthink.substack.com/p/robopapers-episode-4-vision-language</link><guid isPermaLink="false">substack:post:168346194</guid><dc:creator><![CDATA[Chris Paxton and Michael Cho]]></dc:creator><pubDate>Tue, 15 Jul 2025 00:02:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/168346194/4c1605586ec155cba569b4ccfc2521b1.mp3" length="69003526" type="audio/mpeg"/><itunes:author>Chris Paxton and Michael Cho</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4313</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/2883266/post/168346194/542d28d4a15e8d70176f0a4c9662e28d.jpg"/></item><item><title><![CDATA[RoboPapers Episode 3: Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids with Toru Lin]]></title><description><![CDATA[<p>Sim-to-real training is an <a target="_blank" href="https://itcanthink.substack.com/p/bringing-robot-skills-from-simulation">important part of the future of robot skill learning.</a> But a lot of sim-to-real work focuses on <a target="_blank" href="https://ovmm.github.io/">navigation</a>, grasping, or, more recently, non-interactive robot <a target="_blank" href="https://x.com/TheHumanoidHub/status/1891881636595159223">behaviors like dancing</a>. Training <em>dexterous</em> policies for humanoids is very different, because <a target="_blank" href="https://mtmason.com/the-unstable-queen/">manipulation is a very hard problem</a> and multi-finger dexterous hands are even more difficult.</p><p>Enter this cool work from Toru Lin and colleagues. 
Their goal is to do long-horizon manipulation with dexterous humanoid robots.</p><p>And they discuss what is actually necessary to make this work, which involves things like real-to-sim and reward engineering as well as neural network configuration.</p><p>Abstract</p><p>Reinforcement learning has delivered promising results in achieving human- or even superhuman-level capabilities across diverse problem domains, but success in dexterous robot manipulation remains limited. This work investigates the key challenges in applying reinforcement learning to solve a collection of contact-rich manipulation tasks on a humanoid embodiment. We introduce novel techniques to overcome the identified challenges with empirical validation. Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap. We show promising results on three humanoid dexterous manipulation tasks, with ablation studies on each technique. Our work presents a successful approach to learning humanoid dexterous manipulation using sim-to-real reinforcement learning, achieving robust generalization and high performance without the need for human demonstration.</p><p><a target="_blank" href="https://toruowo.github.io/recipe/">Project page</a></p><p><a target="_blank" href="https://arxiv.org/abs/2502.20396">ArXiv</a></p> <br/><br/>This is a public episode. 
If you would like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://itcanthink.substack.com?utm_medium=podcast&#38;utm_campaign=CTA_1">itcanthink.substack.com</a>]]></description><link>https://itcanthink.substack.com/p/robopapers-episode-3-sim-to-real</link><guid isPermaLink="false">substack:post:161177323</guid><dc:creator><![CDATA[Chris Paxton]]></dc:creator><pubDate>Mon, 14 Apr 2025 13:00:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/161177323/a78eb43a4db39ad1745054d3ec11f5da.mp3" length="49214840" type="audio/mpeg"/><itunes:author>Chris Paxton</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>3076</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/2883266/post/161177323/90d1a48b4824653d20ce762866c0b746.jpg"/></item><item><title><![CDATA[RoboPapers Episode 2: Robot Utility Models with Mahi Shafiullah]]></title><description><![CDATA[<p>One of the dreams of robotics research is being able to download and test out a model and have it *just work*. Mahi talks to us about “robot utility models,” which are essentially just this: models that you can download and test out to do useful things like opening a cabinet or picking up a can.</p><p>This work is extremely interesting and it shows up in my long post on <a target="_blank" href="https://itcanthink.substack.com/p/what-are-the-data-scaling-laws-for">scaling laws for imitation learning.</a> </p><p>Robot models, particularly those trained with large amounts of data, have recently shown a plethora of real-world manipulation and navigation capabilities. Several independent efforts have shown that given sufficient training data in an environment, robot policies can generalize to demonstrated variations in that environment. However, needing to finetune robot models to every new environment stands in stark contrast to models in language or vision that can be deployed zero-shot for open-world problems. 
In this work, we present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies that can directly generalize to new environments without any finetuning. To create RUMs efficiently, we develop new tools to quickly collect data for mobile manipulation tasks, integrate such data into a policy with multi-modal imitation learning, and deploy policies on-device on Hello Robot Stretch, a cheap commodity robot, with an external mLLM verifier for retrying. We train five such utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects. Our system, on average, achieves a 90% success rate in unseen, novel environments interacting with unseen objects. Moreover, the utility models can also succeed in different robot and camera set-ups with no further data, training, or fine-tuning. Primary among our lessons are the importance of training data over training algorithm and policy class, guidance about data scaling, necessity for diverse yet high-quality demonstrations, and a recipe for robot introspection and retrying to improve performance on individual environments. Our code, data, models, hardware designs, as well as our experiment and deployment videos are open sourced and can be found on our project website: <a target="_blank" href="https://robotutilitymodels.com">robotutilitymodels.com</a></p><p>You can also visit <a target="_blank" href="https://robotutilitymodels.com/">the project page</a> to learn more.</p><p>If you want to learn more about scaling in robotics broadly, you can read my post on that.</p> <br/><br/>This is a public episode. 
If you would like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://itcanthink.substack.com?utm_medium=podcast&#38;utm_campaign=CTA_1">itcanthink.substack.com</a>]]></description><link>https://itcanthink.substack.com/p/robopapers-episode-2-robot-utility</link><guid isPermaLink="false">substack:post:160469832</guid><dc:creator><![CDATA[Chris Paxton and Michael Cho]]></dc:creator><pubDate>Thu, 03 Apr 2025 02:37:26 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/160469832/40a659dafe67d55913ac6bde49030fa7.mp3" length="64591977" type="audio/mpeg"/><itunes:author>Chris Paxton and Michael Cho</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4037</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/2883266/post/160469832/880035721da36c6080d1f00d1c5137d8.jpg"/></item><item><title><![CDATA[RoboPapers Episode 1: SAM2Act with Jiafei Duan]]></title><description><![CDATA[<p>Join Chris Paxton & Michael Cho as we geek out over robotic papers with paper authors. First episode: Jiafei Duan talks about SAM2Act.</p><p>How do we build robots which can <em>remember</em> things and perform challenging, long-horizon manipulation tasks with objects they haven't seen before?</p><p><a target="_blank" href="https://sam2act.github.io/">Project Website</a> <a target="_blank" href="https://arxiv.org/abs/2501.18564">ArXiv</a> <a target="_blank" href="https://www.youtube.com/watch?v=BpUThCmcklM">YouTube</a> <a target="_blank" href="https://open.spotify.com/episode/3ha2a96mOx4bvjEVFPkPxo?si=defe64c0613c4e47">Spotify</a></p><p><strong>Abstract:</strong></p><p>Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. 
While significant progress has been made in robotic manipulation, existing approaches often fall short in generalization to complex environmental variations and addressing memory-dependent tasks. To bridge this gap, we introduce <strong>SAM2Act</strong>, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of <strong>86.8% across 18 tasks</strong> in the RLBench benchmark, and demonstrates robust generalization on <em>The Colosseum</em> benchmark, with only a <strong>4.3% performance gap</strong> under diverse environmental perturbations. Building on this foundation, we propose <strong>SAM2Act+</strong>, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce <em>MemoryBench</em>, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves <strong>competitive performance on <em>MemoryBench</em></strong>, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems.</p> <br/><br/>This is a public episode. 
If you would like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://itcanthink.substack.com?utm_medium=podcast&#38;utm_campaign=CTA_1">itcanthink.substack.com</a>]]></description><link>https://itcanthink.substack.com/p/robopapers-episode-1-sam2act-with</link><guid isPermaLink="false">substack:post:159722954</guid><dc:creator><![CDATA[Chris Paxton and Michael Cho]]></dc:creator><pubDate>Tue, 25 Mar 2025 14:29:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/159722954/c6736cbfb61fbd7def22c5e92cca0010.mp3" length="66229123" type="audio/mpeg"/><itunes:author>Chris Paxton and Michael Cho</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4139</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/2883266/post/159722954/880035721da36c6080d1f00d1c5137d8.jpg"/></item></channel></rss>