GitHub - EngineeringSoftware/PLSemanticsBench: The first benchmark to evaluate LLMs' usability as programming-language interpreters
About
PLSemanticsBench is the first benchmark for evaluating LLMs as programming language interpreters. We introduce three tasks to evaluate this:
| Task | Description |
|---|---|
| ✨ PredState | Predicts the final program state |
| ✨ PredRule | Predicts the ordered sequence of semantic rules needed to evaluate a program |
| ✨ PredTrace | Predicts the step-by-step execution of a program |
PLSemanticsBench is hosted on HuggingFace: PLSemanticsBench.
To evaluate your own model, subclass `BaseRunner` and implement its `_query` method. We provide two example implementations: `GPTRunner` for OpenAI models and `OllamaRunner` for Ollama models.
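The runner pattern can be sketched roughly as follows. This is a self-contained illustration only, not the library's actual API: `SketchBaseRunner`, the `_query(prompt) -> str` signature, the `run` loop, and `CannedRunner` are all hypothetical stand-ins. Consult `BaseRunner`, `GPTRunner`, and `OllamaRunner` in the repository for the real interface.

```python
from abc import ABC, abstractmethod


class SketchBaseRunner(ABC):
    """Hypothetical stand-in for the library's BaseRunner (not the real class)."""

    @abstractmethod
    def _query(self, prompt: str) -> str:
        """Send one prompt to the model and return its raw text response."""

    def run(self, prompts: list[str]) -> list[str]:
        # Assumed driver loop: query the model once per prompt, in order.
        return [self._query(prompt) for prompt in prompts]


class CannedRunner(SketchBaseRunner):
    """Toy backend that returns a fixed response; swap in a real client call."""

    def __init__(self, response: str) -> None:
        self.response = response

    def _query(self, prompt: str) -> str:
        # A real runner would call its model API here.
        return self.response


runner = CannedRunner(response="<answer>placeholder</answer>")
outputs = runner.run(["prompt 1", "prompt 2"])
```

The point of the pattern is that only `_query` is model-specific, so the same experiment loop can drive any backend.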
Installation
System Requirements
- Conda package management system
- Python 3.11 or higher
- OpenAI API key (for running experiments with OpenAI models)
Step-by-Step Installation
- Create and activate the conda environment:
conda env create -f env.yaml
conda activate plsemanticsbench
- Set up your OpenAI API key (only for OpenAI models):
export OPENAI_API_KEY='your-api-key-here'
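If you drive experiments from Python, it can help to fail early when the key is missing. The helper below is a minimal sketch; `require_api_key` is a hypothetical name and not part of PLSemanticsBench.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error if unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running OpenAI models.")
    return key
```

Calling `require_api_key()` before constructing a runner turns a missing key into an immediate, readable error.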
Quick Start
We provide a bash script quick that:
- Sets up the `plsemanticsbench` conda environment.
- Pulls the `DeepSeek-R1 1.5B` model.
- Evaluates the `DeepSeek-R1 1.5B` model on the `PredState` task with `no-semantics` and `chain-of-thought` prompting on the `Human-Written` dataset.
- Prints the `accuracy` and `malformed-count` to the screen.
- Creates `metrics-predstate-deepseek-r1:1.5b.json`, which contains the evaluation result.
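The metrics file written by the script can be read back with the standard library. This is a minimal sketch that assumes the file is a flat JSON object with the same `accuracy` and `malformed-count` keys that the script prints; `summarize_metrics` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_metrics(path: str) -> str:
    """Load a metrics JSON file and format its two headline numbers."""
    metrics = json.loads(Path(path).read_text(encoding="utf-8"))
    # Key names are assumed to match the printed output.
    accuracy = metrics.get("accuracy")
    malformed = metrics.get("malformed-count")
    return f"accuracy={accuracy} malformed-count={malformed}"
```

For example, `print(summarize_metrics("metrics-predstate-deepseek-r1:1.5b.json"))` after the script has finished.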
Detailed Usage
Basic Example
Here's a minimal example to get started:
from plsemanticsbench import GPTRunner
from plsemanticsbench import ExperimentArgs, LLMEvaluator
from plsemanticsbench import (
    PROMPT_STRATEGY,
    Task,
    Formalization,
    Semantics_Type,
    Language,
    PLDataset,
)

# Model name
model_name = "o3-mini"

# Experiment args: Run the PredState task on the IMP language with
# standard semantics formalized using SOS and with direct prompting
exp_args = ExperimentArgs(
    dataset=PLDataset.Human_Written,
    task=Task.PredState,
    language=Language.IMP,
    formalization=Formalization.SOS,
    semantics_type=Semantics_Type.Standard,
    model_name=model_name,
    prompt_strategy=PROMPT_STRATEGY.DA,
    num_datapoints_to_run=2,  # Run just 2 datapoints (omit to run entire dataset)
)

# Run inference using the OpenAI API
gpt_runner = GPTRunner(args=exp_args)

# Generation (generate LLM prediction on the predstate task)
predictions = gpt_runner.do_experiment()  # path to dump results can be provided

# Evaluation (evaluate LLM prediction against ground-truth)
llm_eval = LLMEvaluator(task=exp_args.task, semantics_type=exp_args.semantics_type)
evaluation_result = llm_eval.evaluate_from_list(results=predictions, model_name=model_name)
print(evaluation_result)
Expected Output
{
'accuracy': 1,
'malformed-count': 0,
}
Benchmark
Our benchmark is hosted on HuggingFace: PLSemanticsBench.
Benchmark Access
You can load the dataset using the datasets library. Here is an example:
from datasets import load_dataset

# Load PredState task with standard semantics (uk) and K-semantics formalization (K)
# and with the Human Written (human-written) dataset
predstate_IMP_K_uk_human_written = load_dataset(
    "EngineeringSoftware/PLSemanticsBench",
    name="predstate-IMP-K-uk-human-written",
)

# Load PredRule task with nonstandard semantics (mk) and SOS formalization (SOS)
# and with the LLM Translated (llm-translated) dataset
predrule_IMP_SOS_mk_llm_translated = load_dataset(
    "EngineeringSoftware/PLSemanticsBench",
    name="predrule-IMP-SOS-mk-llm-translated",
)

# Load PredState task with no-semantics (nk)
# and with the Fuzzer Generated (fuzzer-generated) dataset
predstate_IMP_nk_fuzzer_generated = load_dataset(
    "EngineeringSoftware/PLSemanticsBench",
    name="predstate-IMP-nk-fuzzer-generated",
)
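Once a split is loaded, individual records can be inspected as plain dictionaries. The sketch below assumes the field names shown in the Data Example section (`exec-trace`, `ground-truth`) and that the answer is wrapped in `<answer>` tags. The helper names are hypothetical, the sample record holds placeholder values rather than real benchmark data, and treating the last trace step's state as the final state is an assumption.

```python
import re
from typing import Any, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(ground_truth: str) -> Optional[str]:
    """Return the text inside the first <answer>...</answer> pair, or None if absent."""
    match = ANSWER_RE.search(ground_truth)
    return match.group(1).strip() if match else None


def last_traced_state(record: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the state of the last exec-trace step, or None if the trace is missing or empty."""
    trace = record.get("exec-trace") or []
    return trace[-1]["state"] if trace else None


# Illustrative record only; the values are placeholders, not real benchmark data.
sample = {
    "exec-trace": [
        {"linenumber": 1, "rule": ["Rule 38", "Rule 39"], "state": {"ans": 1}},
    ],
    "ground-truth": "<answer>PLACEHOLDER</answer>",
}
print(extract_answer(sample["ground-truth"]))
print(last_traced_state(sample))
```

Both helpers return `None` rather than raising when a field is absent, since not every split necessarily carries every field.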
Dataset Split
| Task | Split | Description |
|---|---|---|
| ✨ PredState (Final State Prediction) | predstate-IMP-nk-{dataset-name} | No semantics |
| | predstate-IMP-K-uk-{dataset-name} | Standard semantics with K-semantics formalization |
| | predstate-IMP-K-mk-{dataset-name} | Nonstandard semantics with K-semantics formalization |
| | predstate-IMP-SOS-uk-{dataset-name} | Standard semantics with SOS formalization |
| | predstate-IMP-SOS-mk-{dataset-name} | Nonstandard semantics with SOS formalization |
| ✨ PredRule (Semantic Rule Prediction) | predrule-IMP-K-uk-human-written | Standard semantics with K-semantics formalization |
| | predrule-IMP-K-mk-human-written | Nonstandard semantics with K-semantics formalization |
| | predrule-IMP-SOS-uk-human-written | Standard semantics with SOS formalization |
| | predrule-IMP-SOS-mk-human-written | Nonstandard semantics with SOS formalization |
| ✨ PredTrace (Execution Trace Prediction) | predtrace-IMP-K-uk-human-written | Standard semantics with K-semantics formalization |
| | predtrace-IMP-K-mk-human-written | Nonstandard semantics with K-semantics formalization |
| | predtrace-IMP-SOS-uk-human-written | Standard semantics with SOS formalization |
| | predtrace-IMP-SOS-mk-human-written | Nonstandard semantics with SOS formalization |
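The split names in the table follow a regular pattern, which a small helper can assemble. This is a sketch under that assumption; `config_name` is a hypothetical helper, not part of the library, and it only builds the string without checking that the resulting config actually exists on HuggingFace.

```python
from typing import Optional

TASKS = {"predstate", "predrule", "predtrace"}
FORMALIZATIONS = {"K", "SOS"}
SEMANTICS = {"uk", "mk", "nk"}  # standard, nonstandard, no semantics


def config_name(task: str, semantics: str, dataset: str,
                formalization: Optional[str] = None,
                language: str = "IMP") -> str:
    """Build a config name such as 'predstate-IMP-K-uk-human-written'."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if semantics not in SEMANTICS:
        raise ValueError(f"unknown semantics type: {semantics}")
    if semantics == "nk":
        # No-semantics configs carry no formalization segment.
        if formalization is not None:
            raise ValueError("no-semantics configs take no formalization")
        parts = [task, language, "nk", dataset]
    else:
        if formalization not in FORMALIZATIONS:
            raise ValueError(f"unknown formalization: {formalization}")
        parts = [task, language, formalization, semantics, dataset]
    return "-".join(parts)
```

The result can be passed as the `name` argument of `load_dataset`, for example `config_name("predstate", "uk", "human-written", formalization="K")`.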
Data Example
One example of the dataset is as follows:
{
"program": "int ans; ans = 1; ...",
"syntax": "<program> :: ...",
"semantics": "ℤ := Set of integers ...",
"mutated-program": "int ans; ans = 1; ...",
"mutation-pattern": "KeyWordSwap",
"exec-trace": [
{
"linenumber": 1,
"rule": ["Rule 38", "Rule 39"],
"state": {"ans": 1}
}
],
"ground-truth": "<answer>...</answer>"
}
Citation
@inproceedings{ThimmaiahETAL25PLSemanticsBench,
  title = {LLMs Lean on Priors, Not Programming Language Semantics},
  author = {Aditya Thimmaiah and Jiyang Zhang and Jayanth Srinivasa and Junyi Jessy Li and Milos Gligoric},
  year = {2026},
  booktitle = {ICML},
}
License
This project is licensed under the CC BY 4.0 License.