Evaluation Tasks

Tasks provide a flexible data structure for evaluating LLM-based tools.
Datasets contain a set of labelled samples. Datasets are just a tibble with columns input and target, where input is a prompt and target is either literal value(s) or grading guidance.

Solvers evaluate the input in the dataset and produce a final result.

Scorers evaluate the final output of solvers. They may use text comparisons (like detect_match()), model grading (like model_graded_qa()), or other custom schemes.
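For illustration, a minimal dataset might look like the one below (a hypothetical sketch; the target column here mixes a literal answer with grading guidance):

library(tibble)

# `input` holds prompts; `target` holds either literal answers or
# grading guidance for a model-graded scorer
arithmetic_and_primes <- tibble(
  input = c("What's 2+2?", "Name a prime number greater than 10."),
  target = c("4", "Any prime greater than 10, e.g. 11, 13, or 17.")
)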
The usual flow of LLM evaluation with Tasks calls $new() and then $eval(). $eval() just calls $solve(), $score(), $measure(), $log(), and $view() in order. The remaining methods are generally only recommended for expert use.
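To make that flow concrete, the two blocks below are roughly equivalent (assuming tsk is a Task created with Task$new(), as in the Examples section):

# the usual flow
tsk$eval()

# roughly the same thing, one step at a time (expert use)
tsk$solve()
tsk$score()
tsk$measure()
tsk$log()
tsk$view()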
See also generate() for the simplest possible solver, and scorer_model and scorer_detect for two built-in approaches to scoring.
Public fields
dir
  The directory where evaluation logs will be written to. Defaults to vitals_log_dir().

samples
  A tibble representing the evaluation. Based on the dataset, epochs may duplicate rows, and the solver and scorer will append columns to this data.

metrics
  A named vector of metric values resulting from $measure() (called inside of $eval()). Will be NULL if metrics have yet to be applied.
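After an evaluation, these fields can be inspected directly. For instance (assuming tsk is an evaluated Task as in the Examples section):

# per-sample data: dataset columns plus columns appended by the
# solver and scorer
tsk$samples

# named vector of metric values computed by $measure()
tsk$metrics

# where the logs were written
tsk$dir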
Methods
Method new()
The typical flow of LLM evaluation with vitals tends to involve first calling this method and then $eval() on the resulting object.
Usage
Task$new(
  dataset,
  solver,
  scorer,
  metrics = NULL,
  epochs = NULL,
  name = deparse(substitute(dataset)),
  dir = vitals_log_dir()
)
Arguments
dataset
  A tibble with, minimally, columns input and target.

solver
  A function that takes a vector of inputs from the dataset's input column as its first argument and determines values approximating dataset$target. Its return value must be a list with the following elements:

  - result - A character vector of the final responses, with the same length as dataset$input.
  - solver_chat - A list of ellmer Chat objects that were used to solve each input, also with the same length as dataset$input.

  Additional output elements can be included in a slot solver_metadata that has the same length as dataset$input, which will be logged in solver_metadata.

  Additional arguments can be passed to the solver via $solve(...) or $eval(...). See the definition of generate() for a function that outputs a valid solver that just passes inputs to ellmer Chat objects' $chat() method in parallel. A sketch of a hand-rolled solver follows this arguments list.

scorer
  A function that evaluates how well the solver's return value approximates the corresponding elements of dataset$target. The function should take in the $samples slot of a Task object and return a list with the following elements:

  - score - A vector of scores with length equal to nrow(samples). Built-in scorers return ordered factors with levels I < P (optionally) < C (standing for "Incorrect", "Partially Correct", and "Correct"). If your scorer returns this output type, the package will automatically calculate metrics.

  Optionally:

  - scorer_chat - If your scorer makes use of ellmer, also include a list of ellmer Chat objects that were used to score each result, also with length nrow(samples).
  - scorer_metadata - Any intermediate results or other values that you'd like to be stored in the persistent log. This should also have length equal to nrow(samples).

  Scorers will probably make use of samples$input, samples$target, and samples$result specifically. See model-based scoring for examples; a sketch of a hand-rolled scorer also follows this arguments list.

metrics
  A metric summarizing the results from the scorer.

epochs
  The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $score() will take precedence over the value in $new().

name
  A name for the evaluation task. Defaults to deparse(substitute(dataset)).

dir
  Directory where logs should be stored.
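As referenced in the solver and scorer descriptions above, here is a minimal sketch of a hand-rolled solver and scorer. The function names are hypothetical, the solver chats sequentially for simplicity (generate() handles this in parallel), and the scorer is a bare-bones text comparison; treat it as an illustration of the required return structure rather than a recommended implementation.

library(ellmer)

# a sketch of a solver: returns the required `result` and `solver_chat`
# elements, one entry per element of `inputs`
sequential_solver <- function(inputs, ...) {
  chats <- vector("list", length(inputs))
  results <- character(length(inputs))

  for (i in seq_along(inputs)) {
    ch <- chat_anthropic(model = "claude-3-7-sonnet-latest")
    results[[i]] <- ch$chat(inputs[[i]])
    chats[[i]] <- ch
  }

  list(result = results, solver_chat = chats)
}

# a sketch of a scorer: "C" if the target string appears in the result,
# "I" otherwise, returned as an ordered factor so metrics apply automatically
contains_target_scorer <- function(samples, ...) {
  correct <- mapply(
    function(target, result) grepl(target, result, fixed = TRUE),
    samples$target,
    samples$result
  )

  list(
    score = factor(ifelse(correct, "C", "I"), levels = c("I", "C"), ordered = TRUE)
  )
}

# wiring the sketches into a Task (reusing the simple_addition dataset
# from the Examples section)
tsk <- Task$new(
  dataset = simple_addition,
  solver = sequential_solver,
  scorer = contains_target_scorer
)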
Method eval()
Evaluates the task by running the solver, scorer, logging results, and viewing (if interactive). This method works by calling $solve(), $score(), $measure(), $log(), and $view() in sequence.

The typical flow of LLM evaluation with vitals tends to involve first calling $new() and then this method on the resulting object.
Usage
Task$eval(..., epochs = NULL, view = interactive())
Arguments
...
Additional arguments passed to the solver and scorer functions.
epochs
  The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $score() will take precedence over the value in $new().

view
  Automatically open the viewer after evaluation (defaults to TRUE if interactive, FALSE otherwise).
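For instance, to run each sample three times and skip opening the viewer (a usage sketch, assuming tsk is a Task created with Task$new()):

# repeat each sample three times; don't open the viewer afterwards
tsk$eval(epochs = 3, view = FALSE)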
Method solve()
Solve the task by running the solver.
Arguments
...
Additional arguments passed to the solver function.
epochs
  The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $score() will take precedence over the value in $new().
Method log()
Log the task to a directory.
Note that, if a VITALS_LOG_DIR envvar is set, this will happen automatically in $eval().
Usage
Task$log(dir = vitals_log_dir())
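For instance (a sketch; the directory path is illustrative, and tsk is a Task as in the Examples section):

# write this task's logs to an explicit directory
tsk$log(dir = "evals/logs")

# or set the envvar so that $eval() logs there automatically
Sys.setenv(VITALS_LOG_DIR = "evals/logs")
tsk$eval()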
Method view()
View the task results in the Inspect log viewer.
Method set_solver()
Set the solver function.
Arguments
solver
  A function that takes a vector of inputs from the dataset's input column as its first argument and determines values approximating dataset$target. Its return value must be a list with the following elements:

  - result - A character vector of the final responses, with the same length as dataset$input.
  - solver_chat - A list of ellmer Chat objects that were used to solve each input, also with the same length as dataset$input.

  Additional output elements can be included in a slot solver_metadata that has the same length as dataset$input, which will be logged in solver_metadata.

  Additional arguments can be passed to the solver via $solve(...) or $eval(...). See the definition of generate() for a function that outputs a valid solver that just passes inputs to ellmer Chat objects' $chat() method in parallel.
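For example, to swap in a solver backed by a different model and re-run the evaluation (a sketch; the model name is illustrative):

tsk$set_solver(generate(chat_anthropic(model = "claude-3-5-haiku-latest")))
tsk$eval()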
Method set_scorer()
Set the scorer function.
Arguments
scorer
  A function that evaluates how well the solver's return value approximates the corresponding elements of dataset$target. The function should take in the $samples slot of a Task object and return a list with the following elements:

  - score - A vector of scores with length equal to nrow(samples). Built-in scorers return ordered factors with levels I < P (optionally) < C (standing for "Incorrect", "Partially Correct", and "Correct"). If your scorer returns this output type, the package will automatically calculate metrics.

  Optionally:

  - scorer_chat - If your scorer makes use of ellmer, also include a list of ellmer Chat objects that were used to score each result, also with length nrow(samples).
  - scorer_metadata - Any intermediate results or other values that you'd like to be stored in the persistent log. This should also have length equal to nrow(samples).

  Scorers will probably make use of samples$input, samples$target, and samples$result specifically. See model-based scoring for examples.
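For example, to switch from model grading to a built-in text comparison and re-run the evaluation (a sketch; detect_match() is used here with its defaults):

tsk$set_scorer(detect_match())
tsk$eval()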
Method set_metrics()
Set the metrics that will be applied in $measure() (and thus $eval()).
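Assuming metrics are supplied as a named list of functions that take the vector of scores and return a single number (an assumption; see the package documentation for the exact interface), a sketch might look like:

# a hypothetical metric: proportion of fully correct ("C") scores
tsk$set_metrics(list(accuracy = function(scores) mean(scores == "C")))
tsk$eval()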
Examples
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  library(ellmer)
  library(tibble)

  simple_addition <- tibble(
    input = c("What's 2+2?", "What's 2+3?"),
    target = c("4", "5")
  )

  # create a new Task
  tsk <- Task$new(
    dataset = simple_addition,
    solver = generate(chat_anthropic(model = "claude-3-7-sonnet-latest")),
    scorer = model_graded_qa()
  )

  # evaluate the task (runs solver and scorer) and opens
  # the results in the Inspect log viewer (if interactive)
  tsk$eval()
}
#> ℹ Solving
#> ✔ Solving [2.7s]
#>
#> ℹ Scoring
#> ℹ Scoring
#> ✔ Scoring [9ms]
#>