Adversarial Agents as Worker and Reviewer
Example prompts that use an agent as a worker and another as a reviewer, communicating through .md files
Reviewer Prompt
Evaluation metrics review prompt
Role: You are a Senior Perception Evaluation Engineer and Code Reviewer. You specialize in Radar/Lidar sensor validation, signal processing metrics, and ground truth (GT) analysis.
Objective:
Conduct a comprehensive audit of the provided codebase to verify the implementation of specific evaluation metrics. You must cross-reference the code logic with the requirements listed below and the definitions found in document/evaluation_api_categories.md.
Instructions:
- Context Analysis: First, read document/evaluation_api_categories.md to understand the mathematical definitions and edge cases for each metric.
- Code Tracing: Search the codebase for functions, classes, or scripts corresponding to the criteria list. Do not rely solely on function names; verify the logic (e.g., check whether “Cluster False Points” actually applies the 40 m distance filter as requested).
- Gap Analysis: Identify any metrics that are:
- Missing: No code found.
- Partial/Incorrect: Code exists but lacks specific logic (e.g., missing the “10th percentile” calculation for density).
- Implemented: Logic appears correct and complete.
- Language: Provide your final report in [English or Korean - choose your preference], but keep the Metric Names exactly as they appear in the criteria list for consistency.
Evaluation Criteria List (The Checklist):
- Detection + Ghost Filter Evaluation
- GT Object Detection Rate (P_det)
- Peak Detection Quality (SNR/RCS Peak within Object)
- Ghost Point Ratio/Density (Non-associated points)
- Continuity (Drop rate)
- First-detection Latency (Initial detection distance)
- Multipath - False Point Evaluation (False Positive)
- Freespace - False Point Evaluation (False Positive)
- Cluster - False Point Evaluation (Check if logic limits evaluation to < 40m range)
- Ghost Filter Performance (Pre/Post comparison)
- Detections Saturation Index
- Location Accuracy Evaluation
- Range Error (Point-wise)
- Azimuth/Elevation Angle Error
- Object Centroid Accuracy (Cluster vs. GT BBox Center)
- Heading Accuracy (Cluster vs. GT Heading)
- Static Structure (Plane) Alignment Error
- Doppler (Velocity) Evaluation
- Static Structure Doppler Accuracy (Consistency at 0 velocity)
- Dynamic Object Doppler Accuracy
- Doppler Variance within the same Object
- Doppler-Spatial Consistency
- RCS Evaluation
- RCS Glint Evaluation (Static Objects)
- RCS Glint Evaluation (Dynamic Objects)
- RCS-Range Consistency
- Density & Homogeneity Evaluation
- Points per Object (Median) - Grouped by Class/Range bin
- Lower Bound of Points per Object (10th percentile)
- Angular Cell-based Homogeneity (Cell Coverage)
- Object Surface Homogeneity
Required Output Format:
Please present your findings in a Markdown table followed by a summary:
| Category | Metric Name | Status (✅ Implemented / ⚠️ Partial / ❌ Missing) | File/Function Location | Logic Check / Notes |
|---|---|---|---|---|
| Detection | GT Object Detection Rate | ✅ Implemented | eval_metrics.py, calc_p_det() | Correctly matches GT IDs. |
| Detection | Cluster - False Point | ⚠️ Partial | cluster_eval.py | Issue: Logic analyzes all ranges; the < 40m filter is missing. |
Summary & Recommendations:
- List critical missing components.
- Highlight specific logic errors (e.g., where mathematical formulas from the .md file are not followed).
- Save the result as an .md file in reports/yyyymmdd_hhmmss_code_review.
Why this structure works better:
- Role Definition: Sets the expectation that the AI needs to understand sensor physics (SNR, RCS, Doppler), not just generic code.
- Logic Verification: The original prompt just asked “if it is implemented.” The new prompt specifically asks the AI to check conditions (like the “< 40m” note you included), ensuring it doesn’t just match keywords but actually checks the code’s behavior.
- Structured Output: By forcing a table format, you avoid getting a wall of text; you can immediately see what is Missing or Partial.
- Reference Integration: It explicitly forces the AI to use the .md file as the ground truth for how each metric should be calculated.
First, review the … section. (If there are many sections, divide the prompts.)
Follow-up prompt
Then, I need to assign an agent to correctly implement the missing metrics and fix the partial implementations. Give me a structured prompt for this job.
Worker Prompt (generated by the reviewer prompt)
Role
You are a Senior Perception Evaluation Engineer (Radar/LiDAR) and Python developer. Your job is to implement missing metrics and fix partial/incorrect implementations so the tool matches the definitions in document/evaluation_api_categories.md.
Repository
- Root: c:\Git\afi-evaluation-tool
- Runtime entry: src/evaluator.py
- Metric runner: src/afi_eval/evaluation/evaluator.py::AfiEvaluator
- Metrics: src/afi_eval/evaluation/metrics.py
- Current data model: src/data_structures.py::ScanData
- Reference-only implementations (do not assume wired): sample_code/afi920_ride_eval/*
Ground Truth for Definitions
- MUST follow: document/evaluation_api_categories.md
- If the code intentionally diverges, either:
  - align the code to the doc, or
  - update the doc and add an explicit rationale in code comments (choose one; the default is to align the code)
What Exists Today (Baseline)
- Implemented (but partial): Detection P_det, Ghost ratio (simplified), Position accuracy metrics, Dynamic doppler (simplified), RCS CV (generic), Density/Homogeneity (partial).
- Missing: Many Detection+Ghost metrics, plane-based metrics, static doppler, doppler variance, doppler spatial consistency, RCS static glint, RCS–range consistency, and spec-aligned object surface homogeneity.
Deliverables
- Code changes implementing/fixing all items in scope below.
- Unit tests under tests/ for every new metric and every changed formula/edge case.
- Updated report outputs: include new metric outputs in report.json and the summary where appropriate (src/report_generator.py).
- A new audit note in reports/<timestamp>_implementation_notes.md briefly describing what changed and why.
- Ensure python -m pytest passes.
Scope: Implement/Fix These Metrics (Use Metric Names Exactly)
A) Detection + Ghost Filter Evaluation
Implement in runtime (not just sample_code/):
- Peak Detection Quality (SNR/RCS Peak within Object)
  - Compute per-object: SNR_max, SNR_topK_mean (K=3), RCS_max, RCS_topK_mean
  - Aggregate by class × range bin × mode
- Ghost Point Ratio/Density (Non-associated points)
  - Must distinguish ghost vs. object/plane/freespace association per the doc
  - Add density outputs: pts/sec, pts/km, pts/m² (define required inputs or document assumptions)
- Continuity (Drop rate)
  - Implement the spec: drop event = consecutive miss ≥ K frames (default K=3)
  - Track-level stats: drop_rate, drop_events/km, longest_gap
- First-detection Latency (Initial detection distance)
  - Track-level: latency (ms), first_detect_range (m)
  - Requires an ROI entry time definition per the doc
- Multipath - False Point Evaluation (False Positive)
  - Plane-behind classification per the doc (signed distance behind the plane)
- Freespace - False Point Evaluation (False Positive)
  - Drivable/freespace grid-based FP density/ratio per the doc
- Cluster - False Point Evaluation (Check if logic limits evaluation to < 40m range)
  - DBSCAN on ghost points with explicit r ≤ 40 m gating
  - Outputs: FP_cluster/km, cluster size distribution, persistence (frames)
- Ghost Filter Performance (Pre/Post comparison)
  - Run two evaluations (OFF vs. ON), or accept two report inputs, and compute deltas: ΔRecall, ΔFP_*, etc.
- Detections Saturation Index
  - sat_ratio, sat_frames/km, a flag for N_all ≥ 0.95×2000, and correlation hooks vs. Recall/FP
Also fix partial/incorrect:
- GT Object Detection Rate (P_det): ensure logic matches doc (time sync, ROI, N_min, margins, and whether N_occ is part of spec).
- Time sync: enforce |Δt| ≤ 0.05 s, or explicitly justify why selected_frames.json guarantees it (and validate).
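The Continuity spec above (a drop event = ≥ K consecutive missed frames, default K=3) is compact enough to sketch directly. `continuity_stats` and `ContinuityStats` are hypothetical names for illustration, not the repo's actual API:

```python
from dataclasses import dataclass

@dataclass
class ContinuityStats:
    drop_events: int   # number of gaps of >= k_miss consecutive missed frames
    drop_rate: float   # missed frames / total frames
    longest_gap: int   # longest run of consecutive misses

def continuity_stats(detected, k_miss=3):
    """Scan a per-frame detection flag sequence for one GT track.

    A drop event is a run of >= k_miss consecutive misses, matching the
    default K=3 in the spec above. Hypothetical helper, not the repo's API.
    """
    drop_events = 0
    longest_gap = 0
    gap = 0
    for hit in detected:
        if hit:
            if gap >= k_miss:
                drop_events += 1
            longest_gap = max(longest_gap, gap)
            gap = 0
        else:
            gap += 1
    # Close out a trailing gap at the end of the sequence.
    if gap >= k_miss:
        drop_events += 1
    longest_gap = max(longest_gap, gap)
    misses = sum(1 for h in detected if not h)
    drop_rate = misses / len(detected) if detected else float("nan")
    return ContinuityStats(drop_events, drop_rate, longest_gap)
```

Per-kilometre normalization (drop_events/km) would then divide by the track's travelled distance, which needs ego odometry inputs.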
B) Location Accuracy Evaluation
Fix/implement:
- Range Error (Point-wise): remove undocumented gating or make it configurable + report outlier rate.
- Azimuth/Elevation Angle Error: same gating handling.
- Object Centroid Accuracy (Cluster vs. GT BBox Center): either implement clustering or rename/clarify; match spec conditions (e.g., N_obj ≥ 3).
- Heading Accuracy (Cluster vs. GT Heading): implement N_heading ≥ 5 (configurable), add outlier rejection before PCA, and clarify ambiguity resolution.
- Static Structure (Plane) Alignment Error: implement plane association + distance stats per the doc.
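The Range Error item above (a configurable gate plus a reported outlier rate, instead of silent undocumented gating) could look like this minimal sketch; `range_error_stats` and its output keys are illustrative, not the tool's actual interface:

```python
import math

def range_error_stats(errors_m, gate_m=None):
    """Point-wise range-error summary with an explicit, configurable gate.

    Points beyond gate_m are excluded from mean/RMSE but surfaced as an
    outlier rate, so the gating is visible in the report rather than
    silently discarding data. Hypothetical helper, not the repo's API.
    """
    if not errors_m:
        return {"n": 0, "mean_abs": float("nan"),
                "rmse": float("nan"), "outlier_rate": float("nan")}
    kept = list(errors_m) if gate_m is None else [e for e in errors_m
                                                  if abs(e) <= gate_m]
    outlier_rate = 1.0 - len(kept) / len(errors_m)
    if not kept:
        return {"n": 0, "mean_abs": float("nan"),
                "rmse": float("nan"), "outlier_rate": outlier_rate}
    mean_abs = sum(abs(e) for e in kept) / len(kept)
    rmse = math.sqrt(sum(e * e for e in kept) / len(kept))
    return {"n": len(kept), "mean_abs": mean_abs,
            "rmse": rmse, "outlier_rate": outlier_rate}
```

The same pattern (gate as a config parameter, outlier rate in the output) applies to the azimuth/elevation angle errors.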
C) Doppler (Velocity) Evaluation
Implement in runtime:
- Static Structure Doppler Accuracy (Consistency at 0 velocity)
- Doppler Variance within the same Object (MAD-based σ_v preferred)
- Doppler-Spatial Consistency (mismatch_ratio with v0=0.3 m/s, sign defined by median)
Fix partial:
- Dynamic Object Doppler Accuracy: incorporate ego compensation if ego inputs exist; otherwise clearly flag limitations and match doc as closely as possible.
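The MAD-based σ_v and the mismatch_ratio above admit short sketches. Note the mismatch definition here is only one plausible reading of “sign defined by median” (reference sign from the median Doppler, points with |v| ≤ v0 treated as neutral) and must be checked against the doc:

```python
from statistics import median

MAD_TO_SIGMA = 1.4826  # consistency constant for Gaussian data

def robust_sigma_v(dopplers):
    """MAD-based robust spread of per-point Doppler within one object."""
    m = median(dopplers)
    return MAD_TO_SIGMA * median(abs(v - m) for v in dopplers)

def mismatch_ratio(dopplers, v0=0.3):
    """Doppler-spatial consistency sketch (assumed interpretation).

    The object's reference sign comes from the median Doppler; a point
    counts as a mismatch when |v| > v0 yet its sign disagrees with that
    reference. v0 defaults to the 0.3 m/s from the spec above.
    """
    ref = median(dopplers)
    bad = sum(1 for v in dopplers
              if abs(v) > v0 and ref != 0 and (v > 0) != (ref > 0))
    return bad / len(dopplers)
```

The MAD estimator keeps a single multipath point from inflating the per-object variance the way a plain standard deviation would.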
D) RCS Evaluation
Implement/fix:
- RCS Glint Evaluation (Static Objects): static targets (plane recommended), CV/IQR over time, enforce N_frames≥20, TopK SNR selection.
- RCS Glint Evaluation (Dynamic Objects): track-based glint, class×range bin aggregation, spec gating.
- RCS-Range Consistency: Spearman ρ and/or regression slope; plane-based recommended.
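For the RCS-Range Consistency item, SciPy's `scipy.stats.spearmanr` would do; if the runtime should stay dependency-light, Spearman's ρ is small enough to implement directly as the Pearson correlation of average ranks. A sketch under that assumption:

```python
def _ranks(xs):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank over the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Applied to (range, RCS) pairs from a plane target, a strongly negative ρ would indicate the expected RCS fall-off with range.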
E) Density & Homogeneity Evaluation
Fix/implement:
- Points per Object (Median): class × range bin × mode outputs; optional N_obj/N_max.
- Lower Bound of Points per Object (10th percentile): add fail_rate vs. N_min and a sample-size guard.
- Angular Cell-based Homogeneity (Cell Coverage): add p10 coverage (and optional empty_ratio).
- Object Surface Homogeneity: compute CV/entropy/gini over full grid including zero cells (not occupied-only).
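The 10th-percentile lower bound with a sample-size guard and fail_rate vs. N_min might be sketched as below; `min_samples=10` is an assumed guard value, not a number from the doc, and the function name is invented:

```python
def percentile(xs, q):
    """Linear-interpolation percentile (numpy's default convention)."""
    s = sorted(xs)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def points_lower_bound(points_per_object, n_min, min_samples=10):
    """10th-percentile lower bound + fail_rate vs. n_min.

    Below min_samples objects the percentile is statistically unreliable,
    so return nan with an explicit reason string, matching the
    "return nan/empty with explicit reason strings" constraint below.
    min_samples is an assumed default, not a doc value.
    """
    n = len(points_per_object)
    if n < min_samples:
        return {"p10": float("nan"), "fail_rate": float("nan"),
                "reason": f"only {n} objects (< {min_samples})"}
    p10 = percentile(points_per_object, 10)
    fail_rate = sum(1 for p in points_per_object if p < n_min) / n
    return {"p10": p10, "fail_rate": fail_rate, "reason": ""}
```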
Required Architecture Changes (If Needed)
If the spec requires data not present in ScanData, you MUST extend the data model and parser:
- Add to ScanData:
  - lidar_planes (GT planes)
  - drivable_region / freespace grid
  - ego motion: ego_vel_xy, optional yaw_rate
  - timestamps for radar/lidar frames (for dt checks)
- Update src/data_parser.py to populate these fields from the available inputs (or add a TODO + graceful handling if the dataset lacks them).
Constraint
Do not break the current “sample_data” runnable flow. If some inputs aren’t available in sample_data, metrics should return nan/empty with explicit reason strings in the report.
Acceptance Criteria
- Every metric in the checklist is either:
- ✅ implemented per doc, or
- returns explicit “not computable” output with a clear reason (only acceptable when required inputs truly do not exist in provided data)
- All metrics appear under report.json with stable keys.
- Unit tests cover:
- numerical correctness on small synthetic scenes
- edge cases: no points, no objects, missing planes, low sample counts, dt violations, extreme angles
- No hard-coded magic thresholds unless they are in config (e.g., v0=0.3, K=3, 40m) and defaulted to doc values.
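The "no hard-coded magic thresholds" criterion suggests collecting the doc-defaulted values named throughout this prompt (v0=0.3, K=3, 40 m, |Δt| ≤ 0.05 s, the 2000-detection cap) into one config object. This dataclass is only a sketch; the field names are invented, not the repo's actual config schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    """Doc-defaulted thresholds; field names are illustrative."""
    v0_mps: float = 0.3          # Doppler-spatial mismatch threshold
    k_miss: int = 3              # consecutive misses forming a drop event
    topk_snr: int = 3            # K for SNR/RCS top-K means
    cluster_max_range_m: float = 40.0  # Cluster False Point gating
    dt_max_s: float = 0.05       # radar/lidar time-sync tolerance
    sat_limit: int = 2000        # detections-per-frame hardware cap
    sat_factor: float = 0.95     # N_all >= sat_factor * sat_limit flags saturation
```

Metric functions would then take an `EvalConfig` argument, so tests can override a single threshold without touching any formula.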
Work Plan (Expected)
- Read document/evaluation_api_categories.md and extract the exact formulas + gates.
- Map each metric to its runtime code path; do not rely on names.
- Implement the missing data plumbing (ScanData + parser) first.
- Implement metrics with config-driven thresholds.
- Add tests for each metric.
- Run pytest, then run python src/evaluator.py on sample_data and confirm the report contains the new keys.
Useful References in Repo
- Current implementations: src/afi_eval/evaluation/metrics.py (DetectionMetrics, PositionAccuracyMetrics, DopplerMetrics, RcsMetrics, DensityHomogeneityMetrics)
- Reference-only implementations (can be ported, but must be adapted to runtime types):
  - sample_code/afi920_ride_eval/sheet_detection_ghost_filter.py
  - sample_code/afi920_ride_eval/sheet_position_accuracy.py
  - sample_code/afi920_ride_eval/sheet_doppler_velocity.py
  - sample_code/afi920_ride_eval/sheet_rcs.py
  - sample_code/afi920_ride_eval/sheet_density_homogeneity.py
Output requested from you (the agent)
- A PR-ready implementation with tests and a short summary of changes + how to validate.