Model-based scoring uses a model to score the output from a solver.

  • model_graded_qa() scores how well a solver answers a question/answer task.

  • model_graded_fact() determines whether a solver includes a given fact in its response.

The two scorers are implemented similarly, but each uses a different default template to evaluate correctness.

Usage

model_graded_qa(
  template = NULL,
  instructions = NULL,
  grade_pattern = "(?i)GRADE\\s*:\\s*([CPI])(.*)$",
  partial_credit = FALSE,
  scorer_chat = NULL
)

model_graded_fact(
  template = NULL,
  instructions = NULL,
  grade_pattern = "(?i)GRADE\\s*:\\s*([CPI])(.*)$",
  partial_credit = FALSE,
  scorer_chat = NULL
)

Arguments

template

Grading template to use: a glue() string that accepts the substitutions input, answer, criterion, and instructions.
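As a rough illustration of how such a template is filled in, the sketch below calls glue() directly on a made-up template. The template text and the example values are hypothetical and are not the package's default; only the four substitution names come from the description above.

```r
library(glue)

# Hypothetical template, for illustration only; not the package's default.
template <- paste(
  "Question: {input}",
  "Submission: {answer}",
  "Criterion: {criterion}",
  "{instructions}",
  sep = "\n"
)

# Named arguments supply the values for the {placeholders}.
prompt <- glue(
  template,
  input = "What's 2+2?",
  answer = "2+2 equals 4.",
  criterion = "4",
  instructions = "End your reply with 'GRADE: C' or 'GRADE: I'."
)

print(prompt)
```

A custom template passed as the template argument would presumably need to use these same placeholder names so that each substitution has somewhere to land.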

instructions

Grading instructions.

grade_pattern

A regex pattern to extract the final grade from the judge model's response.
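To see what the default pattern captures, the sketch below applies it with base R to a few made-up judge responses. The helper extract_grade() is hypothetical and written only for this illustration; how the package applies the pattern internally is not shown here. The inline (?i) flag makes the match case-insensitive, so the sketch uses perl = TRUE.

```r
# Default pattern from the Usage section above.
grade_pattern <- "(?i)GRADE\\s*:\\s*([CPI])(.*)$"

# Hypothetical helper: return the captured grade letter, or NA if absent.
extract_grade <- function(response, pattern = grade_pattern) {
  m <- regmatches(response, regexec(pattern, response, perl = TRUE))[[1]]
  if (length(m) == 0) {
    return(NA_character_)
  }
  # m[1] is the full match; m[2] is the first capture group, the letter.
  toupper(m[2])
}

extract_grade("The answer matches the target.\nGRADE: C")  # "C"
extract_grade("Partly right.\ngrade : p")                  # "P"
extract_grade("No verdict given.")                         # NA
```

A custom grade_pattern would presumably need to keep the grade letter in the first capture group, as the default does.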

partial_credit

Whether to allow partial credit.

scorer_chat

An ellmer chat used to grade the model output, e.g. ellmer::chat_anthropic().

Value

A function that grades model responses according to the given instructions. The functions returned by model_graded_qa() and model_graded_fact() can be passed directly to $eval().

See the documentation for the scorer argument in Task for a description of the returned function and its return type.

See also

scorer_detect for string detection-based scoring.

Examples

# Question answering ----------------------------
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  library(ellmer)
  library(tibble)

  simple_addition <- tibble(
    input = c("What's 2+2?", "What's 2+3?"),
    target = c("4", "5")
  )

  tsk <- Task$new(
    dataset = simple_addition,
    solver = generate(
      solver_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")
    ),
    scorer = model_graded_qa()
  )

  tsk$eval()
}
#>  Solving [879ms]
#>  Scoring [9ms]

# Factual response -------------------------------
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  library(ellmer)
  library(tibble)

  r_history <- tibble(
    input = c(
      "Who created the R programming language?",
      "In what year was version 1.0 of R released?"
    ),
    target = c("Ross Ihaka and Robert Gentleman.", "2000.")
  )

  tsk <- Task$new(
    dataset = r_history,
    solver = generate(
      solver_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")
    ),
    scorer = model_graded_fact()
  )

  tsk$eval()
}
#>  Solving [2.5s]
#>  Scoring [8ms]