(R) BLADE: Benchmarking language model agents for data-driven science

Paper: https://arxiv.org/pdf/2408.09667

Abstract:

Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, for example about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways of expressing the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate versatile agent approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed accompanying computational methods to match different analysis representations to this ground truth. While language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents that are able to interact with the underlying data show improved, but still suboptimal, diversity in their analytical decision-making. Our work enables the evaluation of agents for data-driven science and provides researchers with deeper insights into agents’ analytical approaches.
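
To make the matching step concrete, here is a minimal sketch of what comparing an agent's analysis against expert ground truth could look like. The `AnalysisDecisions` structure, the decision categories, and the naive normalized string matching are illustrative assumptions only; BLADE's actual computational methods for matching different analysis representations are more sophisticated.

```python
# Illustrative sketch only: comparing an agent's analysis decisions against
# expert ground truth. The data structures and the naive string normalization
# are assumptions; BLADE's actual matching of analysis representations is richer.
from dataclasses import dataclass, field


@dataclass
class AnalysisDecisions:
    conceptual_variables: set[str] = field(default_factory=set)  # variables the analysis reasons about
    transformations: set[str] = field(default_factory=set)       # data transformations applied
    model_specs: set[str] = field(default_factory=set)           # statistical model specifications


def _normalize(decision: str) -> str:
    """Collapse case and whitespace so trivially different phrasings match."""
    return " ".join(decision.lower().split())


def matched(agent: set[str], truth: set[str]) -> set[str]:
    """Return the ground-truth decisions that the agent's analysis covers."""
    agent_norm = {_normalize(d) for d in agent}
    return {t for t in truth if _normalize(t) in agent_norm}


def match_report(agent: AnalysisDecisions, truth: AnalysisDecisions) -> dict:
    """Per-category matches, mirroring the decision types named in the abstract."""
    return {
        "conceptual_variables": matched(agent.conceptual_variables, truth.conceptual_variables),
        "transformations": matched(agent.transformations, truth.transformations),
        "model_specs": matched(agent.model_specs, truth.model_specs),
    }
```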

Highlights:

(W)e find that most LMs are quite good at distinguishing decisions and generating non-empty executable analyses. However, these analyses are basic and lack diversity. In particular, LMs’ ground truth coverage* for forming statistical models with conceptual variables is below 13%, and for operationalizing variables it is below 27%.
(* where coverage refers to capturing the several valid analysis approaches recorded in the ground truth)
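
One way to read this coverage metric, written out explicitly (an approximation of the paper's definition rather than a quote from it): with $G$ the set of ground-truth decisions and $M(A_i, G) \subseteq G$ the ground-truth decisions matched by the $i$-th generated analysis $A_i$,

$$\text{coverage@}k \;=\; \frac{\bigl|\,\bigcup_{i=1}^{k} M(A_i, G)\,\bigr|}{|G|}$$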

LMs have difficulty specifying statistical models and operationalizing conceptual variables in concrete terms. LMs perform relatively poorly in building statistical models with appropriate conceptual variables (precision below 35%) and operationalizing the variables (precision below 60%). Moreover, LMs perform even worse in terms of coverage for building statistical models with conceptual variables (coverage@10 below 13%) and operationalizing the variables (coverage@10 below 27%).
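
For concreteness, here is a rough sketch of how precision and coverage@k could be computed from such matches, reusing the hypothetical `matched()` and `_normalize()` helpers from the sketch above (again an approximation of the paper's metrics, not their exact implementation):

```python
# Rough sketch of precision and coverage@k, reusing the hypothetical matched()
# and _normalize() helpers above. Approximates the paper's metrics only.
def precision(agent_decisions: set[str], truth: set[str]) -> float:
    """Fraction of the agent's decisions that match some ground-truth decision."""
    if not agent_decisions:
        return 0.0
    agent_norm = {_normalize(d) for d in agent_decisions}
    truth_norm = {_normalize(t) for t in truth}
    return len(agent_norm & truth_norm) / len(agent_norm)


def coverage_at_k(runs: list[set[str]], truth: set[str]) -> float:
    """Fraction of ground-truth decisions covered by the union of k generated analyses."""
    covered: set[str] = set()
    for decisions in runs:
        covered |= matched(decisions, truth)
    return len(covered) / len(truth) if truth else 0.0
```

Under this reading, an agent can score decent precision by repeatedly making the same safe, basic decisions, while coverage@k only improves when repeated runs explore different valid decisions, which is exactly the gap the quoted numbers point to.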

…Moreover, we observe low coverage of the ground truth samples (Fig. 4 bottom row), particularly with respect to data transformations and specific model specifications. By qualitatively assessing a random sample of LM-generated analyses, we find that LMs are often limited to performing a basic analysis that can yield decent precision (i.e., matching on the basic decisions), but poor coverage across runs.

Visual highlights:

https://preview.redd.it/aojcyh1a8mkd1.png?width=1129&format=png&auto=webp&s=668d44e7a7ea47b6824f43bfb073ddeb960927bf

https://preview.redd.it/t5pqvwub8mkd1.png?width=1119&format=png&auto=webp&s=98a6ca17676fe930d2650464ae0138fd18db6dfa

Performance on the multiple-choice question part of the benchmark. Unfortunately, performance is almost saturated right from the start.

Performance on the analysis generation portion of the benchmark.

https://preview.redd.it/s6dxnya59mkd1.png?width=1151&format=png&auto=webp&s=755bc3aaddd34d11a61439a14d9eb45f26654667

https://preview.redd.it/pfoijafd9mkd1.png?width=1259&format=png&auto=webp&s=76e8dcfe28b961eb83585caeef4c38c7c09484cf

submitted by /u/StartledWatermelon