BFCL — Information
Get to the top of the BFCL leaderboard
Created 9/9/2025
What is BFCL?
- The Berkeley Function-Calling Leaderboard (BFCL) evaluates how well an LLM uses tools/functions: choosing the right function, filling parameters precisely, handling single and parallel/multiple calls, and recovering from errors.
- Newer BFCL suites include agentic skills (e.g., web search with injected failures, lightweight memory/state, and format-sensitivity checks for prompt-only models).
- The public board reports Overall Accuracy and often surfaces cost and latency, rewarding models that are not only accurate but also efficient.
Why it matters
- It’s an executable, end-to-end benchmark for tool use (not just next-token prediction).
- It captures real agent behavior (multi-turn, retries, noisy web, state).
- It highlights practical trade-offs (accuracy vs. cost/latency) and makes strong fine-tunes stand out.
Tracks & Goals
- Open Track (any model size): place in the Top 20 overall on BFCL at the time you submit.
Prizes & Placements (Open Track)
- Place #1: #1 on the leaderboard by ≥ 2 pts margin → 100% of bounty
- Place #2: #1 on the leaderboard by any margin → 80% of bounty
- Place #3: Top 10 on the leaderboard → 25% of bounty
(Margin definition: absolute percentage-point lead in Overall Accuracy vs. the next-best model in the same track/bracket at the time of verification.)
How evaluation works
- Auto-evaluation: We pull your Hugging Face repo, load your `handler.py`, and run a pinned BFCL evaluator for reproducibility.
- First-pass → verify: We score automatically (CHUTES-only first). If you’re in range for placement, we re-verify with your declared settings to confirm parity with the public board.
- Leader hold: To claim a prize, your qualifying placement must hold on the public board for 7 consecutive days.
- Scoring & tie-breakers: Primary metric is Overall Accuracy. Ties break by lower cost, then lower latency, then earlier submission.
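The ordering above (accuracy first, then cost, latency, and submission time) can be expressed as a single sort key. This is an illustrative sketch, not the bounty's actual ranking code; the field names are assumptions:

```python
# Rank submissions per the stated tie-breakers: higher Overall Accuracy
# first, then lower cost, lower latency, and earlier submission time.
from dataclasses import dataclass

@dataclass
class Submission:
    name: str
    overall_accuracy: float  # percentage points
    cost_usd: float
    latency_s: float
    submitted_at: float      # epoch seconds; smaller = earlier

def rank(subs: list[Submission]) -> list[Submission]:
    # Negate accuracy so higher accuracy sorts first; the remaining
    # fields sort ascending, matching "lower cost, then lower latency,
    # then earlier submission".
    return sorted(
        subs,
        key=lambda s: (-s.overall_accuracy, s.cost_usd, s.latency_s, s.submitted_at),
    )

subs = [
    Submission("a", 90.0, 1.0, 2.0, 100),
    Submission("b", 90.0, 1.0, 1.5, 50),
    Submission("c", 91.0, 9.0, 9.0, 999),
]
print([s.name for s in rank(subs)])  # → ['c', 'b', 'a']
```

Note that accuracy dominates: a higher-accuracy model wins even at higher cost and latency; cost and latency only separate exact accuracy ties.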
What to submit
- Hugging Face repo URL containing:
  - Model weights (fine-tunes encouraged; disclose the base checkpoint in your model card/readme).
  - `handler.py` at repo root that matches the BFCL handler interface (see BFCL handler example).
- Declare your mode: `fc` (native function-calling) or `prompt` (no native FC).
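To make the submission shape concrete, here is a minimal `handler.py` sketch. The real handler interface is defined by the BFCL repository (see the linked example); the class and method names below are illustrative assumptions only, showing the general pattern of declaring a mode and decoding model output into a structured function call:

```python
# Hypothetical minimal handler.py sketch — NOT the official BFCL API.
# The authoritative interface is the BFCL handler example referenced above.
import json

class Handler:
    def __init__(self, model_name: str, mode: str = "fc"):
        # mode: "fc" (native function-calling) or "prompt" (no native FC),
        # matching the mode you declare with your submission.
        self.model_name = model_name
        self.mode = mode

    def decode(self, raw_output: str) -> dict:
        # Parse the model's raw output into a structured call.
        # A prompt-mode model might emit JSON such as:
        #   {"name": "get_weather", "arguments": {"city": "Berkeley"}}
        call = json.loads(raw_output)
        return {"name": call["name"], "arguments": call.get("arguments", {})}

handler = Handler("my-finetune", mode="prompt")
print(handler.decode('{"name": "get_weather", "arguments": {"city": "Berkeley"}}'))
```

Whatever the exact interface, the evaluator imports `handler.py` from the repo root, so keep it self-contained and free of machine-specific paths.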
Notes & definitions
- Overall Accuracy: unweighted average across BFCL sub-categories reported on the public board.
- Version pinning: evaluator versions are pinned per bounty window to ensure reproducibility.
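“Unweighted” means each sub-category counts equally, regardless of how many test cases it contains. A minimal sketch (the category names and scores are made up, not real BFCL numbers):

```python
# Overall Accuracy as the unweighted mean of per-category accuracies.
# Categories and values are illustrative only.
def overall_accuracy(per_category: dict[str, float]) -> float:
    # Each sub-category contributes equally, whatever its test-case count.
    return sum(per_category.values()) / len(per_category)

scores = {"simple": 95.0, "parallel": 85.0, "multi_turn": 60.0}
print(round(overall_accuracy(scores), 2))  # → 80.0
```

One consequence: a small, hard sub-category (like multi-turn) drags the overall score down as much as a large easy one pulls it up, so uneven models are penalized.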
Bounty Details
Created by
@user_iu2402
Created
9/9/2025
Accepted Submission Formats
URL Links