This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale.
We propose a novel approach that leverages state-of-the-art open-source VLMs—Gemma 3 and Qwen3-VL—to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations.
Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or fall back on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models’ reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
We utilize the Helios 3D procedural plant generation library to create a diverse range of synthetic cowpea plots. This allows us to generate ground-truth labels for plant locations, leaf optics, and environmental conditions that are often difficult or impossible to obtain in the field.
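To make the target output concrete, the sketch below shows the kind of JSON simulation configuration a VLM would be asked to emit, along with a round-trip serialization check of the sort used in our JSON-integrity evaluation. The field names (`plant_count`, `sun_azimuth_deg`, etc.) are illustrative assumptions, not the exact schema used in the paper.

```python
import json

# Hypothetical simulation configuration illustrating the parameters a VLM
# is prompted to generate from a drone orthophoto. Field names are
# illustrative assumptions, not the paper's actual schema.
config = {
    "plant_count": 24,
    "days_after_planting": 35,
    "sun_azimuth_deg": 142.5,
    "leaf_reflectance": 0.45,
    "plant_locations": [
        {"x": 0.5, "y": 1.2},
        {"x": 1.1, "y": 1.2},
    ],
}

# Round-trip through JSON to confirm the configuration is well-formed and
# serializable, mirroring the JSON-integrity category described above.
serialized = json.dumps(config)
restored = json.loads(serialized)
print(restored["plant_count"])  # 24
```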
Conversion from drone orthophotos to ground-truth 3D bounding box configurations.
Progression of plant growth (Days After Planting - DAP) simulated in Helios 3D.
An example interaction in which the Vision Language Model generates simulation configurations and returns plant locations based on the input image and prompting constraints.
We evaluated the ability of Vision Language Models to accurately predict complex simulated plant geometries from remote sensing data, testing state-of-the-art models (Gemma 3 and Qwen3-VL) with several in-context learning techniques.
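Two of the parameters discussed above, plant count and sun azimuth, admit simple error metrics; the sketch below shows one plausible way to score them, assuming a plain absolute error for counts and a circular angular error for azimuths (the paper's exact metric definitions may differ).

```python
# Minimal sketch of evaluation metrics for two predicted parameters:
# plant count (absolute error) and sun azimuth (circular angular error).
# These definitions are illustrative assumptions, not the paper's exact metrics.

def count_error(pred_count: int, true_count: int) -> int:
    """Absolute error in the predicted plant count."""
    return abs(pred_count - true_count)

def azimuth_error_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest angular difference between two azimuths, in degrees.

    Wrapping around 360 degrees matters: 350 deg vs. 10 deg should
    score as a 20-degree error, not 340 degrees.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

print(count_error(22, 24))             # 2
print(azimuth_error_deg(350.0, 10.0))  # 20.0
```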
@misc{yun2026usingvisionlanguagefoundation,
title={Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning},
author={Heesup Yun and Isaac Kazuo Uyehara and Earl Ranario and Lars Lundqvist and Christine H. Diepenbrock and Brian N. Bailey and J. Mason Earles},
year={2026},
eprint={2603.08930},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.08930},
}