What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models
Abstract
Interpretability methods can be used to understand the decisions made by black-box models, such as neural machine translation (NMT) systems or large language models (LLMs). Yet research in this area has been limited with respect to a manifest problem in these models: gender bias. In this work, we aim to move beyond simply measuring bias and toward exploring its origins. Working with gender-ambiguous natural source data, this exploratory study examines which context, in the form of input tokens in the source sentence (EN), influences (or triggers) the NMT model's choice of a particular gender inflection in the target languages (DE/ES). To analyse this, we compute saliency attribution based on contrastive translations. We first address the challenge posed by the lack of a scoring threshold, examining different levels of source-word attribution to the model's gender decisions in the translation. We then compare salient source words with human perceptions of gender and find a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, how these compare to human decisions, and how this information can be leveraged to mitigate gender bias.
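The core technique named in the abstract, saliency attribution over a contrastive pair of gendered translations, can be illustrated with a short, self-contained sketch. The code below is not the paper's implementation; it is a minimal gradient-based variant under stated assumptions. The checkpoint (Helsinki-NLP/opus-mt-en-de), the example sentence, the contrastive German articles "Die"/"Der", and the gradient-norm scoring are all illustrative choices, and the paper's actual pipeline may differ (libraries such as inseq implement contrastive attribution for NMT out of the box).

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

# Hypothetical setup: a public EN->DE checkpoint stands in for the paper's model.
name = "Helsinki-NLP/opus-mt-en-de"
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)
model.eval()

# Invented gender-ambiguous source sentence; "her" is a plausible trigger word.
src = "The doctor finished her night shift."
enc = tok(src, return_tensors="pt")

# Embed the source tokens ourselves so we can take gradients w.r.t. them.
# Marian scales embeddings inside the encoder, so we apply the same scale here.
encoder = model.get_encoder()
src_embeds = encoder.embed_tokens(enc.input_ids) * encoder.embed_scale
src_embeds.retain_grad()

# Look at the first decoding step, where the gendered article is chosen.
dec_in = torch.tensor([[model.config.decoder_start_token_id]])
out = model(inputs_embeds=src_embeds,
            attention_mask=enc.attention_mask,
            decoder_input_ids=dec_in)
logits = out.logits[0, -1]  # next-token logits over the target vocabulary

# Contrastive pair: feminine vs. masculine article (assumes each is one subword).
fem = tok(text_target="Die", add_special_tokens=False).input_ids[0]
masc = tok(text_target="Der", add_special_tokens=False).input_ids[0]

# Backpropagate the contrastive logit difference: what pushes "Die" over "Der"?
(logits[fem] - logits[masc]).backward()

# Per-token saliency: L2 norm of the gradient at each source embedding.
saliency = src_embeds.grad[0].norm(dim=-1)
for token, score in zip(tok.convert_ids_to_tokens(enc.input_ids[0].tolist()), saliency):
    print(f"{token:>12s}  {score.item():.3f}")
```

Note that the resulting saliency scores have no natural cutoff, which is the thresholding problem the abstract refers to: one must still decide how large a score has to be before a source word counts as a trigger of the gender decision.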