Claude Opus 4.7/4.8 on SWE-PRBench

We ran the SWE-PRBench benchmark (eval_100 split, 100 PRs × 3 context configurations) on four agent configurations: Claude Opus 4.7 at max effort, and Claude Opus 4.8 at low, medium, and max effort. The judge model is Claude Sonnet 4.6.

config_A (diff only)

Agent	Overall	Detection	Hallucination	F1
`opus_4_8_medium`	0.479	0.665	0.082	0.517
`opus_4_8_max`	0.467	0.642	0.072	0.496
`opus_4_8_low`	0.466	0.633	0.080	0.500
`opus_4_7_max`	0.452	0.696	0.068	0.447

config_B (with file content)

Agent	Overall	Detection	Hallucination	F1
`opus_4_8_max`	0.416	0.564	0.059	0.446
`opus_4_8_low`	0.414	0.570	0.058	0.455
`opus_4_8_medium`	0.410	0.549	0.048	0.435
`opus_4_7_max`	0.394	0.613	0.055	0.395

config_C (full context)

Agent	Overall	Detection	Hallucination	F1
`opus_4_8_low`	0.430	0.593	0.053	0.471
`opus_4_8_max`	0.425	0.581	0.057	0.464
`opus_4_8_medium`	0.412	0.557	0.058	0.450
`opus_4_7_max`	0.387	0.606	0.055	0.391

Final scores (mean of Overall across config_A, config_B, config_C)

Rank	Agent	Overall
1	`opus_4_8_low`	0.437
2	`opus_4_8_max`	0.436
3	`opus_4_8_medium`	0.434
4	`opus_4_7_max`	0.411
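
The final ranking can be rechecked from the per-config tables. The sketch below assumes the final score is the unweighted arithmetic mean of the Overall column over the three configurations, rounded to three decimals; that assumption reproduces the values in the table above. The names `overall` and `final_scores` are illustrative and are not part of the benchmark harness.

```python
# Overall scores copied from the config_A, config_B, and config_C tables.
overall = {
    "opus_4_8_low":    {"config_A": 0.466, "config_B": 0.414, "config_C": 0.430},
    "opus_4_8_max":    {"config_A": 0.467, "config_B": 0.416, "config_C": 0.425},
    "opus_4_8_medium": {"config_A": 0.479, "config_B": 0.410, "config_C": 0.412},
    "opus_4_7_max":    {"config_A": 0.452, "config_B": 0.394, "config_C": 0.387},
}


def final_scores(scores):
    """Return (agent, mean Overall) pairs, best first.

    Assumption: each configuration carries equal weight.
    """
    means = {
        agent: round(sum(by_config.values()) / len(by_config), 3)
        for agent, by_config in scores.items()
    }
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)


ranking = final_scores(overall)
for rank, (agent, score) in enumerate(ranking, start=1):
    print(f"{rank}\t{agent}\t{score:.3f}")
```

The top three agents are separated by at most 0.003 in the mean, so their ordering depends on third-decimal differences in the per-config scores.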

Source data and reproduction harness