In the “Getting started with vitals” vignette, we introduced evaluation `Task`s with a basic analysis on “An R Eval.” Using the built-in solver `generate()`, we passed inputs to models and treated their raw responses as the final result. vitals supports supplying arbitrary functions as solvers, though. In this example, we’ll use a custom solver to explore the following question:

> Does equipping models with the ability to peruse package documentation make them better at writing R code?

We’ll briefly reintroduce the `are` dataset, define the custom solver, and then see whether Claude does any better on the `are` Task using the custom solver compared to plain `generate()`.
First, loading the required packages:
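```r
# packages used in this vignette; purrr and tidyr are `::`-qualified below,
# and ordinal is attached later when we model the results
library(vitals)
library(ellmer)
library(btw)
library(dplyr)
```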
## An R eval dataset
From the `are` docs:

> An R Eval is a dataset of challenging R coding problems. Each `input` is a question about R code which could be solved on first-read only by human experts and, with a chance to read documentation and run some code, by fluent data scientists. Solutions are in `target` and enable a fluent data scientist to evaluate whether the solution deserves full, partial, or no credit.
```r
glimpse(are)
#> Rows: 26
#> Columns: 7
#> $ id        <chr> "after-stat-bar-heights", "conditional-grouped-summ…
#> $ input     <chr> "This bar chart shows the count of different cuts o…
#> $ target    <chr> "Preferably: \n\n```\nggplot(data = diamonds) + \n …
#> $ domain    <chr> "Data analysis", "Data analysis", "Data analysis", …
#> $ task      <chr> "New code", "New code", "New code", "Debugging", "N…
#> $ source    <chr> "https://jrnold.github.io/r4ds-exercise-solutions/d…
#> $ knowledge <list> "tidyverse", "tidyverse", "tidyverse", "r-lib", "t…
```
At a high level:

- `id`: A unique identifier for the problem.
- `input`: The question to be answered.
- `target`: The solution, often with a description of notable features of a correct solution.
- `domain`, `task`, and `knowledge` are pieces of metadata describing the kind of R coding challenge.
- `source`: Where the problem came from, as a URL. Many of these coding problems are adapted “from the wild” and include the kinds of context usually available to those answering questions.
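To get a better sense of the problems themselves, it can help to print a single sample in full:

```r
# read the first problem and its reference solution in full
cat(are$input[1])
cat(are$target[1])
```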
In the introductory vignette, we just provided the `input`s as-is to an ellmer chat’s `$chat()` method (via the built-in `generate()` solver). In this vignette, we’ll compare those results to an approach that first gives the models a chance to read relevant R package documentation.
## Custom solvers
To help us better understand how solvers work, let’s look at the implementation of `generate()`. `generate()` is a function factory, meaning that the function itself outputs a function.
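If function factories are new to you, here’s a minimal example unrelated to vitals (`adder()` is a made-up name for illustration):

```r
# adder() returns a function that "remembers" n via its enclosing environment
adder <- function(n) {
  function(x) x + n
}

add_two <- adder(2)
add_two(3)
#> [1] 5
```

With that in mind, here’s the function that `generate()` manufactures: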
```r
sonnet_3_7 <- chat_anthropic(model = "claude-3-7-sonnet-latest")

generate(solver_chat = sonnet_3_7$clone())
#> function (inputs, ..., solver_chat = chat)
#> {
#>     check_inherits(solver_chat, "Chat")
#>     ch <- solver_chat$clone()
#>     res <- ch$chat_parallel(as.list(inputs), ...)
#>     list(result = purrr::map_chr(res, function(c) c$last_turn()@text),
#>         solver_chat = res)
#> }
#> <bytecode: 0x55e82092e4b0>
#> <environment: 0x55e8209312b0>
```
While, in documentation, I’ve mostly written “`generate()` just calls `$chat()`,” there are a couple extra tricks up `generate()`’s sleeves:

- The chat is first cloned with `$clone()` so that the original chat object isn’t modified once vitals calls it.
- Rather than `$chat()`, `generate()` calls `$chat_parallel()`. This is why `generate()` takes `inputs` (e.g. the whole `input` column) rather than a single `input`. `$chat_parallel()` will send off all of the requests in parallel (not in the usual R sense of multiple R processes, but in the sense of asynchronous HTTP requests), greatly reducing the amount of time it takes to run evaluations.
The function that `generate()` outputs takes in a list of `inputs` and outputs a list with two elements:

- `result` is the final result from the solver. In this case, `purrr::map_chr(res, function(c) c$last_turn()@text)` just grabs the text from the last assistant turn and returns it. `result` is a vector of length equal to `inputs`; it should be able to be tacked onto `task$samples` as another column.
- `solver_chat` is a list of chats that led to the `result`. This may seem redundant with `result`, but is important to supply to vitals so that the package can log what happened during the solving process and display it in the log viewer.
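To make that contract concrete, here’s a sketch of the simplest solver imaginable: a hypothetical `canned_solver()` that never contacts a model and returns the same string for every input. It’s useless for real evaluations, but it satisfies the interface that vitals expects.

```r
# hypothetical: a solver that returns a fixed answer for every input
canned_solver <- function(inputs, ..., solver_chat) {
  list(
    # one result per input
    result = rep("42", length(inputs)),
    # vitals also expects one chat per input for logging
    solver_chat = purrr::map(seq_along(inputs), function(i) solver_chat$clone())
  )
}
```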
We will work with a slightly modified version of `generate()` that equips models with the ability to peruse R package documentation before answering. We will do so using the btw package, available at https://posit-dev.github.io/btw/. Among other things, btw provides the function `btw_register_tools()`, which registers tools with an ellmer chat for exploring documentation.
```r
ch <- sonnet_3_7$clone()
btw_register_tools(ch, tools = "docs")
#> ✔ Registered tool btw_tool_docs_package_help_topics
#> ✔ Registered tool btw_tool_docs_help_page
#> ✔ Registered tool btw_tool_docs_available_vignettes
#> ✔ Registered tool btw_tool_docs_vignette
```
`btw_register_tools()` modifies the chat object in place, so I can use the chat’s `$chat()` method to interact with it like before. As an example, after registering btw’s tools with a Claude Sonnet 3.7 chat object, I can ask it about this package. At the time of writing, vitals is not on CRAN, and Claude Sonnet 3.7 completed training several months before I started working on the package. Nonetheless, the model is able to tell me about how custom solvers work:
ch$chat("How do custom solvers work in vitals?")
I’ll help you find information about custom solvers in the vitals package. Let me first check if the vitals package is installed and then explore its documentation.
The model lets us know it’s reading a bunch of documentation before summarizing what it found:

> To create a custom solver, you need to:
>
> - Define a function that:
>   - Takes a vector of inputs as its first argument
>   - Returns a list with two elements:
>     - `result`: A vector of final responses (same length as input)
>     - `solver_chat`: The list of ellmer chat objects used (same length as input)
> - Your custom solver can include additional logic to:
>   - Preprocess inputs before sending to the LLM
>   - Post-process model responses
>   - Implement special handling like multi-turn conversations
>   - Incorporate tools or external knowledge
>   - Add retry mechanisms or fallbacks

Yada yada yada.
Pretty spot-on. But will this capability help the models better answer questions about R? `are` is built from questions about base R and popular packages like dplyr and shiny, information about which is presumably baked into the models’ weights.
The implementation of our custom solver looks almost exactly like `generate()`’s, except that we add a call to `btw_register_tools()` before calling `$chat_parallel()`.
```r
btw_solver <- function(inputs, ..., solver_chat) {
  # clone the chat so the original object isn't modified
  ch <- solver_chat$clone()

  # register btw's documentation tools with the chat
  btw_register_tools(ch, tools = "docs")

  # send all inputs as parallel (asynchronous) requests
  res <- ch$chat_parallel(as.list(inputs), ...)

  list(
    result = purrr::map_chr(res, function(c) c$last_turn()@text),
    solver_chat = res
  )
}
```
This solver is ready to situate in a `Task` and get solving!
## Running the solver
Situate a custom solver in a `Task` in the same way you would a built-in one:
```r
are_btw <- Task$new(
  dataset = are,
  solver = btw_solver,
  scorer = model_graded_qa(partial_credit = TRUE, scorer_chat = sonnet_3_7$clone()),
  name = "An R Eval"
)

are_btw
```
As in the introductory vignette, we supply the `are` dataset and the built-in scorer `model_graded_qa()`. Then, we use `$eval()` to run our custom solver, score the results, and explore a persistent log of the results in the interactive Inspect log viewer.
```r
are_btw$eval(solver_chat = sonnet_3_7$clone())
```
Any arguments to solving or scoring functions can be passed directly to `$eval()`, allowing for quickly evaluating tasks across several parameterizations. We’ll just use Claude Sonnet 3.7 for now.
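For instance, to try the same task with a different model, one could pass a different chat; GPT-4o via ellmer’s `chat_openai()` is just an illustrative choice here:

```r
# not run: evaluate the same task with a different solver chat
are_btw$eval(solver_chat = chat_openai(model = "gpt-4o"))
```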
```r
are_btw_data <- vitals_bind(are_btw)

are_btw_data
#> # A tibble: 26 × 4
#>    task    id                          score metadata         
#>    <chr>   <chr>                       <ord> <list>           
#>  1 are_btw after-stat-bar-heights      I     <tibble [1 × 10]>
#>  2 are_btw conditional-grouped-summary P     <tibble [1 × 10]>
#>  3 are_btw correlated-delays-reasoning C     <tibble [1 × 10]>
#>  4 are_btw curl-http-get               C     <tibble [1 × 10]>
#>  5 are_btw dropped-level-legend        I     <tibble [1 × 10]>
#>  6 are_btw geocode-req-perform         C     <tibble [1 × 10]>
#>  7 are_btw ggplot-breaks-feature       I     <tibble [1 × 10]>
#>  8 are_btw grouped-filter-summarize    P     <tibble [1 × 10]>
#>  9 are_btw grouped-mutate              C     <tibble [1 × 10]>
#> 10 are_btw implement-nse-arg           P     <tibble [1 × 10]>
#> # ℹ 16 more rows
```
Claude answered fully correctly in 14 out of 26 samples, and partially correctly 5 times. Using the metadata, we can also determine how often the model consulted at least one help page before supplying an answer:
```r
# did any turn in the chat include a tool request?
chat_called_a_tool <- function(chat) {
  turns <- chat[[1]]$get_turns()
  any(purrr::map_lgl(turns, turn_called_a_tool))
}

# did a single turn contain a tool request among its contents?
turn_called_a_tool <- function(turn) {
  any(purrr::map_lgl(turn@contents, inherits, "ellmer::ContentToolRequest"))
}
```
```r
are_btw_data %>%
  tidyr::hoist(metadata, "solver_chat") %>%
  mutate(called_a_tool = purrr::map_lgl(solver_chat, chat_called_a_tool)) %>%
  pull(called_a_tool) %>%
  table()
#> .
#> FALSE  TRUE 
#>     2    24
```
When equipped with the ability to peruse documentation, the model almost always made use of it. To see the kinds of docs it decided to look at, click through a few samples in the log viewer for this eval:
Did perusing documentation actually help the model write better R code, though? In order to quantify “better,” we need to also run this evaluation without the model being able to access documentation.
Using the same code from the introductory vignette:
```r
are_basic <- Task$new(
  dataset = are,
  solver = generate(solver_chat = sonnet_3_7$clone()),
  scorer = model_graded_qa(partial_credit = TRUE, scorer_chat = sonnet_3_7$clone()),
  name = "An R Eval"
)

are_basic$eval()
```
(Note that we also could have used `are_btw$set_solver()` here if we didn’t want to recreate the task; a sketch follows below.)
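That approach might look something like the following, assuming `$set_solver()` accepts a solver just as `Task$new()` does (`are_basic_alt` is a hypothetical name):

```r
# not run: reuse the existing task rather than creating a new one
are_basic_alt <- are_btw$clone()
are_basic_alt$set_solver(generate(solver_chat = sonnet_3_7$clone()))
are_basic_alt$eval()
```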
```r
are_eval_data <-
  # the argument names will become the entries in `task`
  vitals_bind(btw = are_btw, basic = are_basic) %>%
  # and, in this case, different tasks correspond to different solvers
  rename(solver = task)

are_eval_data
#> # A tibble: 52 × 4
#>    solver id                          score metadata         
#>    <chr>  <chr>                       <ord> <list>           
#>  1 btw    after-stat-bar-heights      I     <tibble [1 × 10]>
#>  2 btw    conditional-grouped-summary P     <tibble [1 × 10]>
#>  3 btw    correlated-delays-reasoning C     <tibble [1 × 10]>
#>  4 btw    curl-http-get               C     <tibble [1 × 10]>
#>  5 btw    dropped-level-legend        I     <tibble [1 × 10]>
#>  6 btw    geocode-req-perform         C     <tibble [1 × 10]>
#>  7 btw    ggplot-breaks-feature       I     <tibble [1 × 10]>
#>  8 btw    grouped-filter-summarize    P     <tibble [1 × 10]>
#>  9 btw    grouped-mutate              C     <tibble [1 × 10]>
#> 10 btw    implement-nse-arg           P     <tibble [1 × 10]>
#> # ℹ 42 more rows
```
Is any difference in performance between the two solvers just a result of noise, though? We can supply the scores to an ordinal regression model to answer this question.
```r
library(ordinal)
#> 
#> Attaching package: 'ordinal'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
```
```r
are_mod <- clm(score ~ solver, data = are_eval_data)

are_mod
#> formula: score ~ solver
#> data:    are_eval_data
#> 
#>  link  threshold nobs logLik AIC    niter max.grad cond.H 
#>  logit flexible  52   -52.78 111.56 5(0)  1.22e-07 1.7e+01
#> 
#> Coefficients:
#> solverbtw 
#>    0.1661 
#> 
#> Threshold coefficients:
#>       I|P       P|C 
#> -0.821030  0.006203
```
The coefficient for `solver == "btw"` is 0.166, indicating that allowing the model to peruse documentation tends to be associated with higher grades. If a 95% confidence interval for this coefficient contains zero, though, we don’t have sufficient evidence to reject the null hypothesis that the two solvers perform identically on this eval at the 0.05 significance level.
```r
confint(are_mod)
#>               2.5 %   97.5 %
#> solverbtw -0.871408 1.210623
```

This interval does contain zero, so this eval doesn’t provide sufficient evidence that access to documentation changed performance.
If we had evaluated this model across multiple epochs, the question ID could become a “nuisance parameter” in a mixed model, e.g. with the model structure `ordinal::clmm(score ~ solver + (1 | id), ...)`.
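As a sketch, assuming `are_eval_data` contained one row per question per epoch, that call might look like:

```r
# hypothetical: question ID as a random intercept to absorb
# per-question difficulty across epochs
are_mixed <- ordinal::clmm(score ~ solver + (1 | id), data = are_eval_data)
summary(are_mixed)
```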
This article demonstrated using a custom solver by modifying the built-in `generate()` to allow models to make tool calls that peruse package documentation. With vitals, tool calling “just works”; calls to any tool registered with ellmer will automatically be incorporated into evaluation logs.