vitals is a framework for large language model evaluation in R. It’s specifically aimed at ellmer users who want to measure the effectiveness of their LLM-based apps.
The package is an R port of the widely adopted Python framework Inspect. While the package doesn’t integrate with Inspect directly, it allows users to interface with the Inspect log viewer and provides an on-ramp to transition to Inspect if need be by writing evaluation logs to the same file format.
Important
🚧 Under construction! 🚧
vitals is highly experimental and much of its documentation is aspirational.
Installation
You can install the developmental version of vitals using:
pak::pak("tidyverse/vitals")
Example
LLM evaluation with vitals is composed of two main steps.
- First, create an evaluation task with the
Task$new()
method.
simple_addition <- tibble(
input = c("What's 2+2?", "What's 2+3?", "What's 2+4?"),
target = c("4", "5", "6")
)
tsk <- Task$new(
dataset = simple_addition,
solver = generate(chat_anthropic(model = "claude-3-7-sonnet-latest")),
scorer = model_graded_qa()
)
Tasks are composed of three main components:
-
Datasets are a data frame with, minimally, columns
input
andtarget
.input
represents some question or problem, andtarget
gives the target response. -
Solvers are functions that take
input
and return some value approximatingtarget
, likely wrapping ellmer chats.generate()
is the simplest scorer in vitals, and just passes theinput
to the chat’s$chat()
method, returning its result as-is. -
Scorers juxtapose the solvers’ output with
target
, evaluating how well the solver solved theinput
.
- Evaluate the task.
tsk$eval()
$eval()
will run the solver, run the scorer, and then situate the results in a persistent log file that can be explored interactively with the Inspect log viewer.
Any arguments to the solver or scorer can be passed to $eval()
, allowing for straightforward parameterization of tasks. For example, if I wanted to evaluate chat_openai()
on this task rather than chat_anthropic()
, I could write:
tsk_openai <- tsk$clone()
tsk_openai$eval(solver_chat = chat_openai(model = "gpt-4o"))
For an applied example, see the “Getting started with vitals” vignette at vignette("vitals", package = "vitals")
.