Model Compatibility & Benchmarks
This page provides detailed benchmark scores and version compatibility information for all reasoning models supported by Sema4.AI.
Benchmark Scores
Reasoning models combined with Sema4.AI's agent tuning deliver dramatically better accuracy on complex, multi-step tasks, reducing hallucinations and improving the reliability of your agent outcomes.
These improvements have been validated with the τ2-telecom benchmark, a widely recognized evaluation for measuring agent performance on long-running, multi-step tasks. OpenAI and Anthropic use this benchmark to validate their tool-calling accuracy because it tests an agent's ability to maintain context, follow complex instructions, and complete sequential operations without degradation.
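Benchmarks in the τ-bench family typically report reliability as pass^k: the probability that k independently sampled runs of a task all succeed, estimated from n trials per task. As a rough illustration (the exact scoring used for the numbers on this page is Sema4.AI-internal; this is a generic sketch of the pass^k estimator), the per-task estimate with c successes out of n trials is C(c, k) / C(n, k), averaged across tasks:

```python
from math import comb
from statistics import mean

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased pass^k estimate for one task: probability that k
    randomly chosen trials (out of n, with c successes) all succeed."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks; each entry is (n_trials, n_successes)."""
    return mean(pass_hat_k(n, c, k) for n, c in results)

# Hypothetical trial counts, for illustration only:
score = benchmark_pass_hat_k([(8, 6), (8, 8), (8, 3)], k=2)
```

Note that pass^k penalizes inconsistency: an agent that succeeds on 6 of 8 trials scores 15/28 ≈ 0.54 at k=2, well below its 0.75 single-trial rate, which is why long-running multi-step benchmarks expose reliability gaps that one-shot accuracy hides.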
Sema4.AI agent accuracy with τ2-telecom benchmark
Full Model Compatibility Matrix
Use this table to determine which version of Sema4.AI products you need for a specific model. The table shows when each model was first supported; newer versions continue to support all previously introduced models unless otherwise noted.
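Because support is cumulative, checking compatibility reduces to comparing your installed version against the version in which a model was first supported. A minimal sketch of that rule (the model names and version strings below are placeholders, not actual Sema4.AI releases; consult the table above for real values):

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Split a dotted version string like '2.3.0' into a comparable tuple.
    Assumes plain numeric components with the same number of parts."""
    return tuple(int(part) for part in v.split("."))

def is_supported(installed: str, first_supported: str) -> bool:
    # Newer versions keep supporting previously introduced models,
    # so any installed version >= the first-supported version works.
    return parse_version(installed) >= parse_version(first_supported)

# Placeholder matrix entry: model first supported in a hypothetical 2.1.0.
FIRST_SUPPORTED = {"example-reasoning-model": "2.1.0"}
ok = is_supported("2.3.0", FIRST_SUPPORTED["example-reasoning-model"])
```

A production check would use a real version library (e.g. one that handles pre-release suffixes) rather than naive tuple comparison, but the lookup logic is the same.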
For Team Edition users
- Reasoning models marked with an asterisk (*) are fully supported via AWS Bedrock, OpenAI, or Azure OpenAI. These models are not yet supported in the Sema4.AI platform through the Cortex API.
Footnotes
1 Internal Sema4.AI testing against the τ2-telecom benchmark
2 OpenAI GPT-5 Blog Post
3 τ2-telecom benchmark paper
4 Anthropic Claude Sonnet 4.5 announcement
5 Kimi K2 documentation
6 Anthropic Claude Haiku 4.5 announcement
7 Artificial Analysis: Gemini 3 Pro - Everything you need to know
8 Artificial Analysis: Gemini 3 Flash Intelligence, Performance & Price Analysis