ProText: A Benchmark Dataset for Measuring (Mis)gendering in Long-Form Texts
Summary
A paper introducing ProText, a benchmark dataset for measuring misgendering and gender bias in LLMs' long-form text transformation tasks.
Key Points
- Apple ML Research team released the ProText dataset (including theme nouns such as names, occupations, titles, and kinship terms).
- The dataset spans 3 dimensions: theme noun type, theme gender category (male/female/neutral), and pronoun category (male/female/neutral/none).
- Designed to measure LLMs' gender bias in text transformation tasks such as summarization and text rewriting.
- Goes beyond traditional pronoun resolution benchmarks to include cases outside of the gender binary.
- Enables detailed insights into gender bias, stereotypes, and misgendering with just 2 prompts and 2 models.
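The three-dimensional structure described above can be sketched as a simple evaluation grid. This is an illustrative sketch only: the field names, category values, and schema below are assumptions for exposition, not ProText's actual format.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative assumptions -- not the actual ProText schema.
NOUN_TYPES = ["name", "occupation", "title", "kinship"]
NOUN_GENDERS = ["male", "female", "neutral"]
PRONOUN_CATEGORIES = ["male", "female", "neutral", "none"]

@dataclass(frozen=True)
class BenchmarkCell:
    noun_type: str         # e.g. "occupation"
    noun_gender: str       # gender signal carried by the theme noun
    pronoun_category: str  # pronouns appearing in the source text

def build_grid() -> list[BenchmarkCell]:
    """Enumerate every combination of the three dimensions."""
    return [BenchmarkCell(t, g, p)
            for t, g, p in product(NOUN_TYPES, NOUN_GENDERS, PRONOUN_CATEGORIES)]

grid = build_grid()
# 4 noun types x 3 noun genders x 4 pronoun categories = 48 cells
print(len(grid))  # 48
```

Crossing the dimensions this way is what lets a benchmark isolate, for example, how a model handles a gender-neutral occupation noun when the source text contains no pronouns at all.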
Notable Quotes & Details
- Nuanced insights can be derived with just 2 prompts and 2 models.
- Systematic gender bias emerges when explicit gender cues are absent or when the model defaults to heteronormative assumptions.
Intended Audience
AI/NLP researchers, and researchers working on LLM fairness and bias