Evaluation Tasks provide a flexible data structure for evaluating LLM-based tools.

  1. Datasets contain a set of labelled samples. A dataset is just a tibble with columns input and target, where input is a prompt and target is either literal value(s) or grading guidance.

  2. Solvers evaluate the input in the dataset and produce a final result.

  3. Scorers evaluate the final output of solvers. They may use text comparisons (like detect_match()), model grading (like model_graded_qa()), or other custom schemes.

The usual flow of LLM evaluation with Tasks is to call $new() and then $eval(). $eval() simply calls $solve(), $score(), $measure(), $log(), and $view() in order. The remaining methods are generally recommended only for expert use.

See also

generate() for the simplest possible solver, and scorer_model and scorer_detect for two built-in approaches to scoring.

Public fields

dir

The directory where evaluation logs will be written. Defaults to vitals_log_dir().

samples

A tibble representing the evaluation. Based on the dataset, epochs may duplicate rows, and the solver and scorer will append columns to this data.

metrics

A named vector of metric values resulting from $measure() (called inside of $eval()). Will be NULL if metrics have yet to be applied.

Methods


Method new()

The typical flow of LLM evaluation with vitals tends to involve first calling this method and then $eval() on the resulting object.

Usage

Task$new(
  dataset,
  solver,
  scorer,
  metrics = NULL,
  epochs = NULL,
  name = deparse(substitute(dataset)),
  dir = vitals_log_dir()
)

Arguments

dataset

A tibble with, minimally, columns input and target.

solver

A function that takes a vector of inputs from the dataset's input column as its first argument and determines values approximating dataset$target. Its return value must be a list with the following elements:

  • result - A character vector of the final responses, with the same length as dataset$input.

  • solver_chat - A list of ellmer Chat objects that were used to solve each input, also with the same length as dataset$input.

Additional output elements can be included in a slot solver_metadata that has the same length as dataset$input, which will be logged in solver_metadata.

Additional arguments can be passed to the solver via $solve(...) or $eval(...). See the definition of generate() for a function that outputs a valid solver that just passes inputs to ellmer Chat objects' $chat() method in parallel.
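For illustration only, a hand-written solver obeying this contract might look like the sketch below; the sequential loop and the chat_anthropic() model choice are assumptions for the example, not requirements (in practice, generate() is usually all you need).

my_solver <- function(inputs, ...) {
  results <- character(length(inputs))
  chats <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    # One fresh chat per input; the provider and model are illustrative only.
    chat <- ellmer::chat_anthropic(model = "claude-3-7-sonnet-latest")
    results[i] <- chat$chat(inputs[[i]])
    chats[[i]] <- chat
  }
  list(result = results, solver_chat = chats)
}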

scorer

A function that evaluates how well the solver's return value approximates the corresponding elements of dataset$target. The function should take in the $samples slot of a Task object and return a list with the following elements:

  • score - A vector of scores with length equal to nrow(samples). Built-in scorers return ordered factors with levels I < P (optionally) < C (standing for "Incorrect", "Partially Correct", and "Correct"). If your scorer returns this output type, the package will automatically calculate metrics.

Optionally:

  • scorer_chat - If your scorer makes use of ellmer, a list of the ellmer Chat objects that were used to score each result, also with length nrow(samples).

  • scorer_metadata - Any intermediate results or other values that you'd like to be stored in the persistent log. This should also have length equal to nrow(samples).

Scorers will probably make use of samples$input, samples$target, and samples$result specifically. See model-based scoring for examples.
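As a sketch of the expected return shape only, a hand-rolled exact-match scorer could look like the following; the built-in detection scorers (such as detect_match(), mentioned above) already cover this case.

my_scorer <- function(samples, ...) {
  correct <- trimws(samples$result) == trimws(samples$target)
  list(
    # An ordered I < C factor, so that metrics can be calculated automatically.
    score = factor(ifelse(correct, "C", "I"), levels = c("I", "C"), ordered = TRUE)
  )
}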

metrics

A named list of functions that take in a vector of scores (as in task$samples$score) and output a single numeric value, summarizing the results from the scorer.

epochs

The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $solve() will take precedence over the value in $new().

name

A name for the evaluation task. Defaults to deparse(substitute(dataset)).

dir

Directory where logs should be stored.


Method eval()

Evaluates the task by running the solver and scorer, logging the results, and opening the viewer (if interactive). This method works by calling $solve(), $score(), $measure(), $log(), and $view() in sequence.

The typical flow of LLM evaluation with vitals tends to involve first calling $new() and then this method on the resulting object.

Usage

Task$eval(..., epochs = NULL, view = interactive())

Arguments

...

Additional arguments passed to the solver and scorer functions.

epochs

The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $solve() will take precedence over the value in $new().

view

Automatically open the viewer after evaluation (defaults to TRUE if interactive, FALSE otherwise).

Returns

The Task object (invisibly)
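For instance, assuming a task object tsk created with $new() (as in the Examples below), each sample could be repeated three times without opening the viewer:

tsk$eval(epochs = 3, view = FALSE)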


Method solve()

Solve the task by running the solver.

Usage

Task$solve(..., epochs = NULL)

Arguments

...

Additional arguments passed to the solver function.

epochs

The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $solve() will take precedence over the value in $new().

Returns

The Task object (invisibly)


Method score()

Score the task by running the scorer and then applying metrics to its results.

Usage

Task$score(...)

Arguments

...

Additional arguments passed to the scorer function.

Returns

The Task object (invisibly)


Method measure()

Applies metrics to a scored Task.

Usage

Task$measure()


Method log()

Log the task to a directory.

Note that, if the VITALS_LOG_DIR environment variable is set, this will happen automatically in $eval().

Usage

Task$log(dir = vitals_log_dir())

Arguments

dir

The directory to write the log to.

Returns

The path to the logged file, invisibly.
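For example, to write the log to a temporary directory and capture the returned path (the directory choice here is purely illustrative):

log_path <- tsk$log(dir = tempdir())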


Method view()

View the task results in the Inspect log viewer.

Usage

Task$view()

Returns

The Task object (invisibly)


Method set_solver()

Set the solver function.

Usage

Task$set_solver(solver)

Arguments

solver

A function that takes a vector of inputs from the dataset's input column as its first argument and determines values approximating dataset$target. Its return value must be a list with the following elements:

  • result - A character vector of the final responses, with the same length as dataset$input.

  • solver_chat - A list of ellmer Chat objects that were used to solve each input, also with the same length as dataset$input.

Additional output elements can be included in a slot solver_metadata that has the same length as dataset$input, which will be logged in solver_metadata.

Additional arguments can be passed to the solver via $solve(...) or $eval(...). See the definition of generate() for a function that outputs a valid solver that just passes inputs to ellmer Chat objects' $chat() method in parallel.

Returns

The Task object (invisibly)
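For example, the same task might be re-run against a different model by swapping in a new solver; the chat_openai() call and model name below are illustrative assumptions, not recommendations.

tsk$set_solver(generate(ellmer::chat_openai(model = "gpt-4o")))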


Method set_scorer()

Set the scorer function.

Usage

Task$set_scorer(scorer)

Arguments

scorer

A function that evaluates how well the solver's return value approximates the corresponding elements of dataset$target. The function should take in the $samples slot of a Task object and return a list with the following elements:

  • score - A vector of scores with length equal to nrow(samples). Built-in scorers return ordered factors with levels I < P (optionally) < C (standing for "Incorrect", "Partially Correct", and "Correct"). If your scorer returns this output type, the package will automatically calculate metrics.

Optionally:

  • scorer_chat - If your scorer makes use of ellmer, a list of the ellmer Chat objects that were used to score each result, also with length nrow(samples).

  • scorer_metadata - Any intermediate results or other values that you'd like to be stored in the persistent log. This should also have length equal to nrow(samples).

Scorers will probably make use of samples$input, samples$target, and samples$result specifically. See model-based scoring for examples.

Returns

The Task object (invisibly)
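For example, to switch from model grading to simple string detection (detect_match() is one of the built-in scorer_detect scorers mentioned above):

tsk$set_scorer(detect_match())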


Method set_metrics()

Set the metrics that will be applied in $measure() (and thus $eval()).

Usage

Task$set_metrics(metrics)

Arguments

metrics

A named list of functions that take in a vector of scores (as in task$samples$score) and output a single numeric value.

Returns

The Task object (invisibly)
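As a sketch, assuming the scorer returns the standard ordered I/P/C factor described above, a custom metrics list could be:

tsk$set_metrics(list(
  accuracy = function(scores) mean(scores == "C"),
  partial_or_better = function(scores) mean(scores %in% c("P", "C"))
))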


Method clone()

The objects of this class are cloneable with this method.

Usage

Task$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples

if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  library(ellmer)
  library(tibble)

  simple_addition <- tibble(
    input = c("What's 2+2?", "What's 2+3?"),
    target = c("4", "5")
  )

  # create a new Task
  tsk <- Task$new(
    dataset = simple_addition,
    solver = generate(chat_anthropic(model = "claude-3-7-sonnet-latest")),
    scorer = model_graded_qa()
  )

  # evaluate the task (runs solver and scorer) and opens
  # the results in the Inspect log viewer (if interactive)
  tsk$eval()
}
#> Solving
#> Solving [2.7s]
#>
#> Scoring
#> Scoring [9ms]