This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale.
We propose a novel approach that leverages state-of-the-art open-source VLMs—Gemma 3 and Qwen3-VL—to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations.
Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or fall back on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models’ reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
We utilize the Helios 3D procedural plant generation library to create a diverse range of synthetic cowpea plots. This allows us to generate ground-truth labels for plant locations, leaf optics, and environmental conditions that are often difficult or impossible to obtain in the field.
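To make the target output concrete, the sketch below shows the kind of JSON simulation configuration a VLM would be asked to emit, along with a round-trip serialization check of the sort used in our JSON-integrity evaluation. The field names (`plant_count`, `sun_azimuth_deg`, etc.) are illustrative assumptions, not the exact schema used in the paper.

```python
import json

# Hypothetical simulation configuration illustrating the parameters a VLM
# is prompted to generate from a drone orthophoto. Field names are
# illustrative assumptions, not the paper's actual schema.
config = {
    "plant_count": 24,
    "days_after_planting": 35,
    "sun_azimuth_deg": 142.5,
    "leaf_reflectance": 0.45,
    "plant_locations": [
        {"x": 0.5, "y": 1.2},
        {"x": 1.1, "y": 1.2},
    ],
}

# Round-trip through JSON to confirm the configuration is well-formed and
# serializable, mirroring the JSON-integrity category described above.
serialized = json.dumps(config)
restored = json.loads(serialized)
print(restored["plant_count"])  # 24
```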
Conversion from drone orthophotos to ground-truth 3D bounding box configurations.
Progression of plant growth (Days After Planting - DAP) simulated in Helios 3D.
An example interaction in which the Vision Language Model generates simulation configurations and returns plant locations based on the input image and prompting constraints.
We evaluated the ability of Vision Language Models to accurately predict complex simulated plant geometries from remote sensing data, testing state-of-the-art models (Gemma 3 and Qwen3-VL) with several in-context learning techniques.
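Two of the parameters discussed above, plant count and sun azimuth, admit simple error metrics; the sketch below shows one plausible way to score them, assuming a plain absolute error for counts and a circular angular error for azimuths (the paper's exact metric definitions may differ).

```python
# Minimal sketch of evaluation metrics for two predicted parameters:
# plant count (absolute error) and sun azimuth (circular angular error).
# These definitions are illustrative assumptions, not the paper's exact metrics.

def count_error(pred_count: int, true_count: int) -> int:
    """Absolute error in the predicted plant count."""
    return abs(pred_count - true_count)

def azimuth_error_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest angular difference between two azimuths, in degrees.

    Wrapping around 360 degrees matters: 350 deg vs. 10 deg should
    score as a 20-degree error, not 340 degrees.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

print(count_error(22, 24))             # 2
print(azimuth_error_deg(350.0, 10.0))  # 20.0
```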
@misc{yun2026usingvisionlanguagefoundation,
title={Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning},
author={Heesup Yun and Isaac Kazuo Uyehara and Earl Ranario and Lars Lundqvist and Christine H. Diepenbrock and Brian N. Bailey and J. Mason Earles},
year={2026},
eprint={2603.08930},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.08930},
}