DecodingTrust Benchmark

Trustworthiness Perspectives

⚠️ WARNING: our data contains model outputs that may be considered offensive.

DecodingTrust aims to provide a comprehensive trustworthiness evaluation on the recent large language model GPT-4, in comparison to GPT-3.5, from different perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness under different settings.

Examples of unreliable responses of GPT-4 from different trustworthiness perspectives. GPT-4 can generate undesirable or unreliable content given benign system prompts.