Claude Opus 4.7/4.8 on SWE-PRBench

We ran SWE-PRBench (eval_100 split, 100 PRs × 3 context configurations) on four agent configurations: Claude Opus 4.7 at max effort, and Claude Opus 4.8 at low, medium, and max effort. The judge model is Claude Sonnet 4.6.

config_A (diff only)

Agent            Overall  Detection  Hallucination  F1
opus_4_8_medium  0.479    0.665      0.082          0.517
opus_4_8_max     0.467    0.642      0.072          0.496
opus_4_8_low     0.466    0.633      0.080          0.500
opus_4_7_max     0.452    0.696      0.068          0.447

config_B (with file content)

Agent            Overall  Detection  Hallucination  F1
opus_4_8_max     0.416    0.564      0.059          0.446
opus_4_8_low     0.414    0.570      0.058          0.455
opus_4_8_medium  0.410    0.549      0.048          0.435
opus_4_7_max     0.394    0.613      0.055          0.395

config_C (full context)

Agent            Overall  Detection  Hallucination  F1
opus_4_8_low     0.430    0.593      0.053          0.471
opus_4_8_max     0.425    0.581      0.057          0.464
opus_4_8_medium  0.412    0.557      0.058          0.450
opus_4_7_max     0.387    0.606      0.055          0.391

Final scores (mean across A, B, C)

Rank  Agent            Overall
1     opus_4_8_low     0.437
2     opus_4_8_max     0.436
3     opus_4_8_medium  0.434
4     opus_4_7_max     0.411
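
The final scores can be re-derived from the Overall column of the three per-configuration tables. The following is a minimal Python sketch of that arithmetic, assuming the final score is an unweighted mean of the three Overall values, rounded to three decimals. The names OVERALL and final_scores are illustrative and are not taken from the benchmark harness, which may well average unrounded values instead.

```python
from statistics import mean

# Overall scores copied from the config_A, config_B, and config_C tables.
OVERALL = {
    "opus_4_8_low":    {"A": 0.466, "B": 0.414, "C": 0.430},
    "opus_4_8_max":    {"A": 0.467, "B": 0.416, "C": 0.425},
    "opus_4_8_medium": {"A": 0.479, "B": 0.410, "C": 0.412},
    "opus_4_7_max":    {"A": 0.452, "B": 0.394, "C": 0.387},
}


def final_scores(overall):
    """Return (agent, mean Overall) pairs sorted from best to worst.

    Assumption: each configuration carries equal weight in the mean.
    """
    scores = {
        agent: round(mean(per_config.values()), 3)
        for agent, per_config in overall.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Print rank, agent name, and mean Overall for each agent.
    for rank, (agent, score) in enumerate(final_scores(OVERALL), start=1):
        print(rank, agent, f"{score:.3f}")
```

Under that assumption the recomputed means match the final table, including the ordering of the three Opus 4.8 configurations, which are separated by at most 0.003.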

Source data and reproduction harness