BFCL — Information
Get to the top of the BFCL leaderboard
Created 9/9/2025
What is BFCL?
- The Berkeley Function-Calling Leaderboard (BFCL) evaluates how well an LLM uses tools/functions: choosing the right function, filling parameters precisely, handling single and parallel/multiple calls, and recovering from errors.
- Newer BFCL suites include agentic skills (e.g., web search with injected failures, lightweight memory/state, and format-sensitivity checks for prompt-only models).
- The public board reports Overall Accuracy and often surfaces cost and latency, rewarding models that are not only accurate but also efficient.
Why it matters
- It’s an executable, end-to-end benchmark for tool use (not just next-token prediction).
- It captures real agent behavior (multi-turn, retries, noisy web, state).
- It highlights practical trade-offs (accuracy vs. cost/latency) and makes strong fine-tunes stand out.
Tracks & Goals
- Open Track (any model size): place in the Top 20 overall on BFCL at the time you submit.
Prizes & Placements (Open Track)
- Place #1: #1 on the leaderboard by ≥ 2 pts margin → 100% of bounty
- Place #2: #1 on the leaderboard by any margin → 80% of bounty
- Place #3: Top 10 on the leaderboard → 25% of bounty
(Margin definition: absolute percentage-point lead in Overall Accuracy vs. the next-best model in the same track/bracket at the time of verification.)
How evaluation works
- Auto-evaluation: We pull your Hugging Face repo, load your `handler.py`, and run a pinned BFCL evaluator for reproducibility.
- First-pass → verify: We score automatically (CHUTES-only first). If you’re in range for placement, we re-verify with your declared settings to confirm parity with the public board.
- Leader hold: To claim a prize, your qualifying placement must hold on the public board for 7 consecutive days.
- Scoring & tie-breakers: Primary metric is Overall Accuracy. Ties break by lower cost, then lower latency, then earlier submission.
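The ordering above (accuracy first, then cost, latency, and submission time) can be expressed as a single sort key. This is an illustrative sketch, not the bounty's actual ranking code; the field names are assumptions:

```python
# Rank submissions per the stated tie-breakers: higher Overall Accuracy
# first, then lower cost, lower latency, and earlier submission time.
from dataclasses import dataclass

@dataclass
class Submission:
    name: str
    overall_accuracy: float  # percentage points
    cost_usd: float
    latency_s: float
    submitted_at: float      # epoch seconds; smaller = earlier

def rank(subs: list[Submission]) -> list[Submission]:
    # Negate accuracy so higher accuracy sorts first; the remaining
    # fields sort ascending, matching "lower cost, then lower latency,
    # then earlier submission".
    return sorted(
        subs,
        key=lambda s: (-s.overall_accuracy, s.cost_usd, s.latency_s, s.submitted_at),
    )

subs = [
    Submission("a", 90.0, 1.0, 2.0, 100),
    Submission("b", 90.0, 1.0, 1.5, 50),
    Submission("c", 91.0, 9.0, 9.0, 999),
]
print([s.name for s in rank(subs)])  # → ['c', 'b', 'a']
```

Note that accuracy dominates: a higher-accuracy model wins even at higher cost and latency; cost and latency only separate exact accuracy ties.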
What to submit
- Hugging Face repo URL containing:
  - Model weights (fine-tunes encouraged; disclose the base checkpoint in your model card/readme).
  - `handler.py` at repo root that matches the BFCL handler interface (see BFCL handler example).
- Declare your mode: `fc` (native function-calling) or `prompt` (no native FC).
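To make the submission shape concrete, here is a minimal `handler.py` sketch. The real handler interface is defined by the BFCL repository (see the linked example); the class and method names below are illustrative assumptions only, showing the general pattern of declaring a mode and decoding model output into a structured function call:

```python
# Hypothetical minimal handler.py sketch — NOT the official BFCL API.
# The authoritative interface is the BFCL handler example referenced above.
import json

class Handler:
    def __init__(self, model_name: str, mode: str = "fc"):
        # mode: "fc" (native function-calling) or "prompt" (no native FC),
        # matching the mode you declare with your submission.
        self.model_name = model_name
        self.mode = mode

    def decode(self, raw_output: str) -> dict:
        # Parse the model's raw output into a structured call.
        # A prompt-mode model might emit JSON such as:
        #   {"name": "get_weather", "arguments": {"city": "Berkeley"}}
        call = json.loads(raw_output)
        return {"name": call["name"], "arguments": call.get("arguments", {})}

handler = Handler("my-finetune", mode="prompt")
print(handler.decode('{"name": "get_weather", "arguments": {"city": "Berkeley"}}'))
```

Whatever the exact interface, the evaluator imports `handler.py` from the repo root, so keep it self-contained and free of machine-specific paths.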
Notes & definitions
- Overall Accuracy: unweighted average across BFCL sub-categories reported on the public board.
- Version pinning: evaluator versions are pinned per bounty window to ensure reproducibility.
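“Unweighted” means each sub-category counts equally, regardless of how many test cases it contains. A minimal sketch (the category names and scores are made up, not real BFCL numbers):

```python
# Overall Accuracy as the unweighted mean of per-category accuracies.
# Categories and values are illustrative only.
def overall_accuracy(per_category: dict[str, float]) -> float:
    # Each sub-category contributes equally, whatever its test-case count.
    return sum(per_category.values()) / len(per_category)

scores = {"simple": 95.0, "parallel": 85.0, "multi_turn": 60.0}
print(round(overall_accuracy(scores), 2))  # → 80.0
```

One consequence: a small, hard sub-category (like multi-turn) drags the overall score down as much as a large easy one pulls it up, so uneven models are penalized.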
Bounty Details
Created by
@user_iu2402
Created
9/9/2025
Accepted Submission Formats
URL Links