CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework
for Long-Tail Scenarios via Collect-and-Refine

Shiyu Fang1, Yiming Cui1, Haoyang Liang1, Chen Lv2, Peng Hang1, Jian Sun1

1Tongji University, 2Nanyang Technological University

Contact: fangshiyu@tongji.edu.cn

CoReVLA Framework Overview

Abstract

Autonomous Driving (AD) systems have made notable progress, but their performance in long-tail, safety-critical scenarios remains limited. These rare cases contribute a disproportionate number of accidents. Vision-Language-Action (VLA) models have strong reasoning abilities and offer a potential solution, but their effectiveness is limited by the lack of high-quality data and inefficient learning in such conditions. To address these challenges, we propose CoReVLA, a continual-learning, end-to-end autonomous driving framework that improves performance in long-tail scenarios through a dual-stage process of data Collection and behavior Refinement. First, the model is jointly fine-tuned on a mixture of open-source driving QA datasets, allowing it to acquire a foundational understanding of driving scenarios. Next, CoReVLA is deployed within the Cave Automatic Virtual Environment (CAVE) simulation platform, where driver takeover data is collected from real-time interactions. Each takeover indicates a long-tail scenario that CoReVLA fails to handle reliably. Finally, the model is refined via Direct Preference Optimization (DPO), allowing it to learn directly from human preferences and thereby avoid the reward hacking caused by manually designed rewards. Extensive open-loop and closed-loop experiments demonstrate that CoReVLA can accurately perceive driving scenarios and make appropriate decisions. On the Bench2Drive benchmark, CoReVLA achieves a Driving Score (DS) of 72.18 and a Success Rate (SR) of 50%, outperforming state-of-the-art methods by 7.96 DS and 15% SR under long-tail, safety-critical scenarios. Furthermore, case studies demonstrate the model's ability to continually improve its performance in similar failure-prone scenarios by leveraging past takeover experiences.

Key Contributions

  • Collection of visually grounded takeover data via HITL testing in the immersive CAVE platform. The CAVE platform reconstructs 3D scenarios from trajectories, enabling end-to-end AD testing. During testing, long-tail scenarios where the model underperforms are proactively taken over by human drivers, yielding valuable takeover data including visual context, driver behaviors, and real-time attention.
  • Introduction of the DPO approach for efficient behavior refinement from sparse takeover data. By contrasting the model's suboptimal pre-intervention behaviors with high-quality human takeovers, CoReVLA directly learns driver preferences, avoiding the pitfalls of indirect reward modeling and significantly improving learning efficiency.
  • Validation of CoReVLA in both open-loop and closed-loop settings. We demonstrate effective scene understanding and decision-making capabilities. On the Bench2Drive benchmark, CoReVLA achieves a Driving Score of 72.18 and a Success Rate of 50%, surpassing SOTA methods by 7.96 and 15% respectively in long-tail, safety-critical scenarios. Case studies further verify its potential for cross-scenario generalization.

Methodology

CoReVLA Model Overview

To improve AV performance in long-tail scenarios, we propose CoReVLA, as illustrated in the framework overview above. First, the Qwen2.5-VL-7B model is supervised fine-tuned (SFT) on a combination of open-source driving QA datasets to build a foundational understanding of driving tasks. It is then deployed in the CAVE platform, a closed-loop, HITL simulation environment, where long-tail failure cases requiring human takeovers are identified and collected. Finally, CoReVLA is refined via DPO using human feedback from takeover events, enabling the model to align with human preferences and improve its generalization in long-tail scenarios.


Open-loop fine-tuning with SFT

To adapt a general-purpose VLM to domain-specific reasoning tasks in autonomous driving, we perform supervised fine-tuning on the Qwen2.5-VL-7B model using the constructed dataset. Specifically, we apply Low-Rank Adaptation (LoRA) to two key components of the model: the vision projector and the LLM backbone. The former enhances the model's ability to align visual inputs with textual semantics, while the latter improves its capacity to understand and reason about driving-related questions.
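The sketch below shows how such a LoRA setup could be configured with the Hugging Face transformers and peft libraries. It is a minimal illustration, not the paper's exact training code: the checkpoint name, rank, and target module list are assumptions, and the precise modules corresponding to the vision projector and LLM backbone depend on the checkpoint.

```python
# Hedged sketch of the SFT stage: LoRA adapters on a Qwen2.5-VL-7B backbone.
# Checkpoint name, rank, and target_modules are illustrative assumptions.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

lora_cfg = LoraConfig(
    r=16,                       # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[            # attention / MLP projections in the LLM backbone;
        "q_proj", "k_proj",     # projector modules would be added analogously
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)   # only the LoRA weights remain trainable
model.print_trainable_parameters()
```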

CAVE Overview

Closed-loop fine-tuning with Collect and Refine

Stage 1: Takeover data Collection

To collect driving data from long-tail scenarios, we treat human takeover events as representative failure cases that expose the limitations of the current model. Each intervention marks the boundary of the model's capabilities and thus offers valuable guidance for enhancing robustness and safety.

In our experiments, CoReVLA is integrated into the CAVE platform, where it interacts with background vehicles in real time. Its performance is continuously monitored throughout each test case. When CoReVLA exhibits suboptimal behavior that leads to deadlock or collision, the system switches to replay mode. In this mode, a safety driver wears a VR headset to experience an immersive driving environment and closely supervises CoReVLA's behavior. If a hazardous situation arises, the driver performs a manual takeover.
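A minimal sketch of what one collected takeover sample might look like is given below. The field names and the helper are hypothetical and only illustrate the kind of visual context, model behavior, driver behavior, and attention data described above, together with how a sample could later be turned into a preference pair for refinement.

```python
# Hypothetical sketch of a takeover record logged when the safety driver
# intervenes in CAVE; field names are illustrative, not the platform's API.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TakeoverRecord:
    scenario_id: str                  # which CAVE / Bench2Drive scenario was running
    camera_frames: List[str]          # paths to frames around the takeover moment
    prompt: str                       # scene-description / action query given to the VLA
    model_action: str                 # suboptimal action produced before the intervention
    human_action: str                 # corrective action executed by the safety driver
    driver_attention: Dict[str, float] = field(default_factory=dict)  # gaze / attention cues

def to_dpo_pair(rec: TakeoverRecord) -> dict:
    """Convert a takeover record into a (chosen, rejected) preference pair."""
    return {
        "prompt": rec.prompt,
        "chosen": rec.human_action,    # preferred: human takeover behavior
        "rejected": rec.model_action,  # dispreferred: pre-intervention model behavior
    }
```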

Stage 2: VLA behavior Refinement

In Stage 2, CoReVLA is refined using takeover data collected from the CAVE platform. Each sample consists of an action pair: the suboptimal behavior previously generated by the model and the corrective behavior performed by the safety driver in the same scenario. These comparisons encode implicit human preferences and serve as supervision for learning more desirable driving policies. To align the model with human intent, we adopt DPO, which fine-tunes the policy to favor actions consistent with human takeovers, thereby reducing repeated failures in similar high-risk situations.
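For reference, the standard DPO objective (Rafailov et al.) that this preference-based refinement builds on can be written as follows, where x is the driving context, y_w is the safety driver's takeover behavior, y_l is the model's pre-intervention behavior, and the reference policy is the SFT-initialized model:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$

Maximizing this margin pushes the policy toward the human takeover action and away from the failed behavior, while the coefficient beta keeps it close to the SFT model.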

Compared to other Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO, DPO offers several advantages. It eliminates the need for an explicitly designed reward function, which is often difficult to define in complex long-tail scenarios. This avoids issues such as reward hacking and reduces reliance on manual reward engineering. Moreover, DPO can be trained directly on offline human demonstration data, substantially improving data efficiency. These properties make DPO particularly well-suited for learning from sparse long-tail events.
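As a hedged sketch under stated assumptions, the refinement step could be run with the TRL library's DPOTrainer on preference pairs assembled from takeover records (e.g., via the hypothetical to_dpo_pair helper above); argument names follow recent trl releases and may differ across versions, and this is not the paper's exact training code.

```python
# Hedged sketch of offline DPO refinement with trl; hyperparameters are illustrative.
from datasets import Dataset
from transformers import AutoProcessor
from trl import DPOConfig, DPOTrainer

# `preference_pairs`: list of {"prompt", "chosen", "rejected"} dicts, e.g.
# produced by the hypothetical to_dpo_pair() helper in the Stage 1 sketch.
train_dataset = Dataset.from_list(preference_pairs)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

args = DPOConfig(
    output_dir="corevla-dpo",
    beta=0.1,                        # strength of the implicit pull toward the SFT policy
    per_device_train_batch_size=1,
    learning_rate=5e-6,
)
trainer = DPOTrainer(
    model=model,                     # LoRA-adapted, SFT-initialized policy from the earlier sketch
    ref_model=None,                  # with PEFT adapters, trl recovers the reference by disabling them
    args=args,
    train_dataset=train_dataset,
    processing_class=processor,
)
trainer.train()
```

Because training runs directly on the offline takeover pairs, no reward model or online rollouts are needed, which is what makes the approach practical for sparse long-tail data.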

Experimental Results

Open-loop QA evaluations

To evaluate whether CoReVLA can understand complex scenarios and complete driving tasks, we conduct both open-loop and closed-loop experiments. First, we compare its performance with baselines using BLEU and ROUGE. Then, we integrate CoReVLA into the CAVE platform to identify failure cases and apply DPO for behavior refinement. Finally, we benchmark against SOTA methods under closed-loop settings using Bench2Drive, which consists of diverse and challenging long-tail scenarios.

Open-loop QA results (BLEU / ROUGE)

To assess the language understanding and reasoning capability of CoReVLA, we first conduct open-loop QA evaluations across three representative datasets: LingoQA, BDD, and HAD. As shown above, CoReVLA consistently achieves higher BLEU and ROUGE scores across all datasets, indicating that SFT enhances the model's ability to understand driving scenarios and make correct decisions, laying the groundwork for closed-loop evaluation.
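For concreteness, one common way to compute these open-loop metrics is sketched below, using the sacrebleu and rouge_score packages; the metric packages and aggregation choices are assumptions for illustration, not necessarily the exact evaluation protocol used here.

```python
# Hedged example of BLEU / ROUGE-L scoring for QA outputs.
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["the ego vehicle should brake because a pedestrian is crossing"]
references  = ["the car should brake since a pedestrian is crossing the road"]

# Corpus-level BLEU over all QA pairs.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# Sentence-level ROUGE-L F-measure, averaged over the dataset.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
) / len(predictions)
print(f"ROUGE-L: {rouge_l:.3f}")
```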

Closed-loop model performance evaluations

Closed-loop performance on Bench2Drive

The above table presents the performance of several representative methods from both small-scale task-specific models and large-scale pretrained models on the Bench2Drive benchmark. Compared to existing SOTA approaches, our proposed CoReVLA achieves the highest DS and SR, reaching 72.18 and 50.00%, respectively. This corresponds to an improvement of 7.96 points in DS and a 14.99% increase in SR over the second-best method.

While CoReVLA demonstrates significant improvements in DS and SR, it does not outperform all baseline models in terms of efficiency and comfort. This is mainly because CoReVLA focuses on high-risk, long-tail driving scenarios where safety is prioritized during model refinement. In the DPO-based HITL fine-tuning within the CAVE platform, drivers tend to exhibit cautious behavior, maintaining moderate speeds and carefully observing their surroundings, rather than accelerating quickly to exit potentially dangerous situations. Additionally, emergency braking is sometimes required for safety, which can negatively impact comfort-related metrics. This explains why, despite a significant increase in SR, the improvement in DS is relatively modest. A similar pattern is observed in DriveTransformer-Large, the second-best performing model.

Closed-loop model ability evaluations

Closed-loop ability results

In addition to macroscopic metrics such as the SR, we further evaluate the model's diverse driving capabilities using the capability assessment framework provided by Bench2Drive. Specifically, we test its performance across multiple dimensions, including Merging, Overtaking, Emergency Braking, Giving Way, and Traffic Sign compliance. As shown in the table above, our proposed CoReVLA achieves the highest overall capability score. Moreover, it outperforms existing methods in Merging, Overtaking, and Emergency Braking. Thanks to the inclusion of numerous traffic signal-related training samples during the Supervised Fine-Tuning (SFT) stage, CoReVLA also excels in Traffic Sign recognition.

However, a significant limitation of our approach lies in its complete lack of Give Way capability, indicating that CoReVLA fails to cooperate with other vehicles during driving and primarily focuses on its own gains. This shortcoming may be attributed to the predominance of safety-critical data used during the Direct Preference Optimization (DPO) fine-tuning stage, which leads the model to adopt an overly conservative driving strategy. Consequently, this capability gap results in failures in the Yield Special-Vehicle Case. Therefore, enhancing the model's ability to handle long-tail scenarios through cooperative behaviors and other diverse strategies remains a critical issue to be addressed in future work.

Case Study & Analysis

Rainy Cut-in

Stop-Sign Left-Turn

Four-Lane Cut-in

Yield Special-Vehicle

Blocked Intersection

Dense Left-Turn

To provide an intuitive understanding of CoReVLA's performance, we selected several representative cases that highlight both its strengths and remaining limitations. The **Rainy Cut-in Case** demonstrates the model's capability to accurately identify anomalous objects and execute emergency braking even under perceptually degraded conditions. The **Stop-Sign Left-Turn Case** illustrates its ability to identify appropriate gaps in traffic and complete left-turn maneuvers safely.

However, several issues persist. In the **Four-Lane Cut-in Case**, although CoReVLA avoided collisions and ultimately reached the destination, it exhibited excessive emergency braking behavior. This appears to stem from an overrepresentation of safety-critical scenarios during training, which may have induced an overly conservative driving policy, analogous to a form of "PTSD" regarding potential risks. The model tends to initiate emergency braking upon detecting even minimal indications of lane intrusion by neighboring vehicles. This case explains why, despite achieving the highest success rate, our method does not attain top performance in terms of efficiency and comfort.

In the **Yield Special-Vehicle Case**, the model failed to recognize that an emergency vehicle behind it was operating under priority conditions, resulting in a failure to yield. The **Blocked Intersection Case** led to a collision due to insufficient utilization of the left-front camera data, preventing timely detection of a disabled vehicle obstructing the exit lane. Finally, in the **Dense Left-Turn Case**, CoReVLA was unable to identify a safe merging gap amid intense oncoming traffic, resulting in task failure. Nevertheless, the model's reasoning output correctly indicated that left-turning vehicles should yield to oncoming traffic—suggesting that, although the maneuver failed, the decision-making logic reflected a reasonable level of intelligence.

Citation


            @misc{fang2025corevladualstageendtoendautonomous,
                  title={CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine}, 
                  author={Shiyu Fang and Yiming Cui and Haoyang Liang and Chen Lv and Peng Hang and Jian Sun},
                  year={2025},
                  eprint={2509.15968},
                  archivePrefix={arXiv},
                  primaryClass={cs.RO},
                  url={https://arxiv.org/abs/2509.15968}, 
            }