PLSemanticsBench: The first benchmark to evaluate LLMs' usability as programming-language interpreters

About

PLSemanticsBench is the first benchmark for evaluating LLMs as programming language interpreters. We introduce three tasks to evaluate this:

Task	Description
✨ PredState	Predicts the final program state
✨ PredRule	Predicts the ordered sequence of semantic rules needed to evaluate a program
✨ PredTrace	Predicts the step-by-step execution of a program

PLSemanticsBench is hosted on HuggingFace as the dataset EngineeringSoftware/PLSemanticsBench.

To evaluate your own model, subclass BaseRunner and implement its _query method. Two example implementations are provided: GPTRunner for OpenAI models and OllamaRunner for Ollama models.
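The signature of `_query` is not documented in this README, so the following is only a self-contained sketch of the subclassing pattern. `RunnerSketch` stands in for `BaseRunner`; the `(prompt) -> str` signature, the `run` method, and `EchoRunner` are all assumptions made for illustration. Consult the GPTRunner and OllamaRunner sources for the real interface.

```python
from abc import ABC, abstractmethod


class RunnerSketch(ABC):
    """Hypothetical stand-in for BaseRunner; not the library's class."""

    def run(self, prompts: list[str]) -> list[str]:
        # The base class drives the loop and delegates each model call.
        return [self._query(p) for p in prompts]

    @abstractmethod
    def _query(self, prompt: str) -> str:
        """Send one prompt to the model and return its raw text reply."""


class EchoRunner(RunnerSketch):
    """Toy backend that returns a canned answer instead of calling a model."""

    def _query(self, prompt: str) -> str:
        return f"<answer>{prompt.upper()}</answer>"


replies = EchoRunner().run(["x = 1"])
```

The point of the pattern is that only `_query` knows about the model backend, so swapping OpenAI for Ollama or a local model should not touch the experiment loop.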

Installation

System Requirements

Conda package management system
Python 3.11 or higher
OpenAI API key (for running experiments with OpenAI models)

Step-by-Step Installation

Create and activate the conda environment:

conda env create -f env.yaml
conda activate plsemanticsbench

Set up your OpenAI API key (only for OpenAI models):

export OPENAI_API_KEY='your-api-key-here'
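Before launching an OpenAI experiment, it can help to confirm that the variable is visible to Python. This is a small sketch using only the standard library; `openai_key_is_set` is a hypothetical helper name, not part of plsemanticsbench.

```python
import os


def openai_key_is_set(env=os.environ) -> bool:
    """Return True if OPENAI_API_KEY is present and non-empty."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; OpenAI experiments will likely fail.")
```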

Quick Start

We provide a bash script quick that:

Sets up the plsemanticsbench conda environment.
Pulls the DeepSeek-R1 1.5B model.
Evaluates the DeepSeek-R1 1.5B model on the PredState task with no-semantics and chain-of-thought prompting on the Human-Written dataset.
Prints the accuracy and malformed-count to screen.
Creates metrics-predstate-deepseek-r1:1.5b.json that contains the evaluation result.
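The metrics file written by the script can be inspected afterwards from Python. The sketch below assumes the JSON is a flat object with the same `accuracy` and `malformed-count` keys that the evaluator prints; the actual file layout may differ, and `summarize_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_metrics(path: str) -> str:
    """Read a metrics JSON file and format its two headline numbers."""
    metrics = json.loads(Path(path).read_text())
    # Key names are assumed to match the evaluator's printed output.
    accuracy = metrics.get("accuracy")
    malformed = metrics.get("malformed-count")
    return f"accuracy={accuracy} malformed-count={malformed}"


metrics_file = "metrics-predstate-deepseek-r1:1.5b.json"
if Path(metrics_file).exists():
    print(summarize_metrics(metrics_file))
```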

Detailed Usage

Basic Example

Here's a minimal example to get started:

from plsemanticsbench import GPTRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset
)

# Model name
model_name = "o3-mini"

# Experiment args: Run the PredState task on the IMP language with
# standard semantics formalized using SOS and with direct prompting
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name=model_name,
    prompt_strategy=PROMPT_STRATEGY.DA,
    num_datapoints_to_run=2, # Run just 2 datapoints (omit to run entire dataset)
)
                        
# Run inference using the OpenAI API
gpt_runner = GPTRunner(args=exp_args)

# Generation (generate LLM prediction on the predstate task)
predictions = gpt_runner.do_experiment() # path to dump results can be provided

# Evaluation (evaluate LLM prediction against ground-truth)
llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
evaluation_result = llm_eval.evaluate_from_list(results=predictions, model_name=model_name)
print(evaluation_result)

Expected Output

{
    'accuracy': 1,
    'malformed-count': 0,
}

Benchmark

The benchmark is hosted on HuggingFace as the dataset EngineeringSoftware/PLSemanticsBench.

Benchmark Access

You can load the dataset using the datasets library. Here is an example:

from datasets import load_dataset

# Load the PredState task with standard semantics (uk), K-semantics formalization (K), and the Human-Written dataset (human-written)
predstate_IMP_K_uk_human_written = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-K-uk-human-written")

# Load the PredRule task with nonstandard semantics (mk), SOS formalization (SOS), and the LLM-Translated dataset (llm-translated)
predrule_IMP_SOS_mk_llm_translated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predrule-IMP-SOS-mk-llm-translated")

# Load the PredState task with no semantics (nk) and the Fuzzer-Generated dataset (fuzzer-generated)
predstate_IMP_nk_fuzzer_generated = load_dataset("EngineeringSoftware/PLSemanticsBench", name="predstate-IMP-nk-fuzzer-generated")

Dataset Split

Task	Split	Description
✨ PredState (Final State Prediction)	predstate-IMP-nk-{dataset-name}	No semantics
	predstate-IMP-K-uk-{dataset-name}	Standard semantics with K-semantics formalization
	predstate-IMP-K-mk-{dataset-name}	Nonstandard semantics with K-semantics formalization
	predstate-IMP-SOS-uk-{dataset-name}	Standard semantics with SOS formalization
	predstate-IMP-SOS-mk-{dataset-name}	Nonstandard semantics with SOS formalization
✨ PredRule (Semantic Rule Prediction)	predrule-IMP-K-uk-human-written	Standard semantics with K-semantics formalization
	predrule-IMP-K-mk-human-written	Nonstandard semantics with K-semantics formalization
	predrule-IMP-SOS-uk-human-written	Standard semantics with SOS formalization
	predrule-IMP-SOS-mk-human-written	Nonstandard semantics with SOS formalization
✨ PredTrace (Execution Trace Prediction)	predtrace-IMP-K-uk-human-written	Standard semantics with K-semantics formalization
	predtrace-IMP-K-mk-human-written	Nonstandard semantics with K-semantics formalization
	predtrace-IMP-SOS-uk-human-written	Standard semantics with SOS formalization
	predtrace-IMP-SOS-mk-human-written	Nonstandard semantics with SOS formalization
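The configuration names in the table follow a regular pattern: task, language, optional formalization, semantics code, dataset name. A minimal sketch of a helper that assembles such a name is shown below; `config_name` is hypothetical and does not check that a given combination actually exists on HuggingFace.

```python
def config_name(task, semantics, dataset, formalization=None, language="IMP"):
    """Assemble a configuration name following the table's naming pattern."""
    if semantics == "nk":
        # No-semantics configurations carry no formalization segment.
        parts = [task, language, semantics, dataset]
    else:
        if formalization not in ("K", "SOS"):
            raise ValueError(
                "formalization must be 'K' or 'SOS' unless semantics is 'nk'"
            )
        parts = [task, language, formalization, semantics, dataset]
    return "-".join(parts)


name = config_name("predstate", "uk", "human-written", formalization="K")
```

The resulting string can be passed as the `name` argument of `load_dataset`, as in the loading examples above.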

Data Example

One example of the dataset is as follows:

{
  "program": "int ans; ans = 1; ...",
  "syntax": "<program> :: ...",
  "semantics": "ℤ := Set of integers ...",
  "mutated-program": "int ans; ans = 1; ...",
  "mutation-pattern": "KeyWordSwap",
  "exec-trace": [
    {
      "linenumber": 1,
      "rule": ["Rule 38", "Rule 39"],
      "state": {"ans": 1}
    }
  ],
  "ground-truth": "<answer>...</answer>"
}
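A record shaped like the example can be walked with plain dictionary access. The sketch below uses a trimmed copy of the example record rather than the real dataset, and assumes the fields arrive as native lists and dicts; if the dataset stores them as JSON strings, they would first need `json.loads`. Both helper names are hypothetical.

```python
record = {
    "program": "int ans; ans = 1; ...",
    "exec-trace": [
        {"linenumber": 1, "rule": ["Rule 38", "Rule 39"], "state": {"ans": 1}},
    ],
    "ground-truth": "<answer>...</answer>",
}


def rules_in_order(rec):
    """Flatten the per-step rule lists into one ordered sequence."""
    return [rule for step in rec["exec-trace"] for rule in step["rule"]]


def final_state(rec):
    """Return the state recorded at the last trace step, or {} if empty."""
    trace = rec["exec-trace"]
    return trace[-1]["state"] if trace else {}


print(rules_in_order(record))
print(final_state(record))
```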

Citation

@inproceedings{ThimmaiahETAL25PLSemanticsBench,
  title     = {LLMs Lean on Priors, Not Programming Language Semantics},
  author    = {Aditya Thimmaiah and Jiyang Zhang and Jayanth Srinivasa and Junyi Jessy Li and Milos Gligoric},
  year      = {2026},
  booktitle = {ICML}, 
}

License

This project is licensed under the CC BY 4.0 License.