Model-based scoring makes use of a model to score output from a solver.
- model_graded_qa() scores how well a solver answers a question/answer task.
- model_graded_fact() determines whether a solver includes a given fact in its response.
The two scorers are quite similar in their implementation, but use a different default template to evaluate correctness.
Usage
model_graded_qa(
  template = NULL,
  instructions = NULL,
  grade_pattern = "(?i)GRADE\\s*:\\s*([CPI])(.*)$",
  partial_credit = FALSE,
  scorer_chat = NULL
)

model_graded_fact(
  template = NULL,
  instructions = NULL,
  grade_pattern = "(?i)GRADE\\s*:\\s*([CPI])(.*)$",
  partial_credit = FALSE,
  scorer_chat = NULL
)
Arguments
- template
Grading template to use: a glue() string which will take the substitutions input, answer, criterion, and instructions.
- instructions
Grading instructions.
- grade_pattern
A regex pattern to extract the final grade from the judge model's response.
- partial_credit
Whether to allow partial credit.
- scorer_chat
An ellmer chat used to grade the model output, e.g. ellmer::chat_anthropic().
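To make the grade_pattern argument concrete, the following base R sketch applies the default pattern to a few made-up judge responses. The helper extract_grade() is hypothetical and not part of the package, and using regexec() with perl = TRUE is an assumption about how such a pattern might be applied; the scorers may extract the grade differently.

```r
# Default pattern, copied from the Usage section.
grade_pattern <- "(?i)GRADE\\s*:\\s*([CPI])(.*)$"

# Hypothetical helper (not part of the package): returns the captured grade
# letter in upper case, or NA when the response contains no grade.
extract_grade <- function(response, pattern = grade_pattern) {
  # perl = TRUE is an assumption, chosen so the inline (?i) flag is honoured.
  hit <- regmatches(response, regexec(pattern, response, perl = TRUE))[[1]]
  if (length(hit) == 0) NA_character_ else toupper(hit[2])
}

# Made-up judge responses, for illustration only.
extract_grade("The answer matches the target.\nGRADE: C")
extract_grade("Close, but incomplete. grade : p")
extract_grade("No verdict given.")
```

Because of the (?i) flag, the pattern is case-insensitive, and the \\s* pieces tolerate whitespace around the colon. A custom grade_pattern would presumably need to keep the grade letter in the first capture group, although that requirement is not stated on this page.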
Value
A function that will grade model responses according to the given instructions.
The functions that model_graded_qa() and model_graded_fact() return can be passed directly to $eval().
See the documentation for the scorer argument in Task for a description of the returned function and its return type.
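As a rough sketch of how the arguments fit together, the snippet below builds a scorer with a custom template, explicit instructions, and partial credit enabled. The template and instruction wording are hypothetical and only illustrate the four substitutions listed under Arguments; the example assumes the package providing model_graded_qa() is attached and that the judge model follows the requested "GRADE:" format.

```r
# Hypothetical grading template: the wording is illustrative only. The four
# placeholders are the substitutions listed for the template argument.
custom_template <- "
You are grading a submitted answer against a criterion.
[Task]: {input}
[Submission]: {answer}
[Criterion]: {criterion}
{instructions}
"

# Hypothetical instructions; the final line is written so that the default
# grade_pattern can find the grade letter.
custom_instructions <- paste(
  "Compare the submission with the criterion and explain your reasoning.",
  "Finish with a line of the form 'GRADE: C', 'GRADE: P', or 'GRADE: I'."
)

# Guarded like the examples on this page. The exists() check is an extra
# safeguard that skips the sketch when model_graded_qa() is not available.
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "") && exists("model_graded_qa")) {
  library(ellmer)

  scorer <- model_graded_qa(
    template = custom_template,
    instructions = custom_instructions,
    partial_credit = TRUE,
    scorer_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")
  )
}
```

The resulting scorer could then be supplied as the scorer argument of Task$new(), in the same way model_graded_qa() is used in the Examples section.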
See also
scorer_detect for string detection-based scoring.
Examples
# Question/answer --------------------------------
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  library(ellmer)
  library(tibble)

  simple_addition <- tibble(
    input = c("What's 2+2?", "What's 2+3?"),
    target = c("4", "5")
  )

  tsk <- Task$new(
    dataset = simple_addition,
    solver = generate(solver_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")),
    scorer = model_graded_qa()
  )

  tsk$eval()
}
#> ℹ Solving
#> ✔ Solving [879ms]
#>
#> ℹ Scoring
#> [working] (0 + 0) -> 1 -> 1 | ■■■■■■■■■■■■■■■■ 50%
#> [working] (0 + 0) -> 0 -> 2 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100%
#> ℹ Scoring
#> ✔ Scoring [9ms]
#>
# Factual response -------------------------------
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  library(ellmer)
  library(tibble)

  r_history <- tibble(
    input = c(
      "Who created the R programming language?",
      "In what year was version 1.0 of R released?"
    ),
    target = c("Ross Ihaka and Robert Gentleman.", "2000.")
  )

  tsk <- Task$new(
    dataset = r_history,
    solver = generate(solver_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")),
    scorer = model_graded_fact()
  )

  tsk$eval()
}
#> ℹ Solving
#> [working] (0 + 0) -> 1 -> 1 | ■■■■■■■■■■■■■■■■ 50%
#> [working] (0 + 0) -> 0 -> 2 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100%
#> ℹ Solving
#> ✔ Solving [2.5s]
#>
#> ℹ Scoring
#> [working] (0 + 0) -> 1 -> 1 | ■■■■■■■■■■■■■■■■ 50%
#> [working] (0 + 0) -> 0 -> 2 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100%
#> ℹ Scoring
#> ✔ Scoring [8ms]
#>