Findings of ACL 2026

A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

Soham Dan · Himanshu Beniwal · Thomas Hartvigsen

A survey of multilingual toxicity detection and detoxification for LLMs, organized around threat models, task setups, detection methods, mitigation strategies, and open evaluation challenges.

Abstract

Multilingual safety is not English safety translated.

Large language models (LLMs) are increasingly deployed across languages, but their safety behavior remains uneven across linguistic and cultural contexts. This survey synthesizes work on toxicity detection and detoxification for multilingual LLMs. We first catalogue threat models that exploit language choice, translation pivots, code-switching, orthographic variation, multi-turn interaction, and post-deployment fine-tuning to weaken safety alignment. We then organize task formulations, multilingual detection approaches, and mitigation strategies spanning data filtering, supervised and preference-based tuning, decoding-time steering, representation editing, and multilingual guardrails. Across these areas, we identify persistent challenges: uneven language coverage, culturally contingent definitions of harm, fragmented evaluation protocols, and the risk that detoxification suppresses legitimate dialectal or identity-related expression.

Paper Map

What the survey organizes

01

Threat Models

Language-shift, translation-mediated, code-switching, red-teaming, and post-deployment adaptation attacks.

02

Datasets and Metrics

Toxic-to-neutral rewriting, toxic text detection, toxic-generation evaluation, and preservation metrics.

03

Detection

Multilingual transformers, translation pipelines, representation-level probes, and LLM-based detectors.

04

Detoxification

Data filtering, supervised and preference tuning, decoding-time steering, representation editing, and guardrails.

05

Open Challenges

Cross-lingual gaps, cultural misalignment, fragmented evaluation, over-suppression, and code-switching.

Taxonomy

A compact view of multilingual toxicity work

Figure: A brief taxonomy of toxicity in multilingual LLMs, reproduced from the project source assets.

Takeaways

Research needs highlighted by the paper

Cross-lingual transfer remains unreliable

Safety methods that work in English can underperform in low-resource, morphologically rich, or culturally distant settings.

Evaluation needs cultural grounding

Benchmarks need to move beyond translated prompts and include community-aware definitions of harm.

Detoxification can erase legitimate expression

Mitigation should measure toxicity reduction alongside preservation of meaning, style, dialect, and identity-related speech.
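
As a rough illustration of the last takeaway, the sketch below reports toxicity reduction and content preservation side by side for a single rewrite. Everything in it is an assumption made for illustration: the toy lexicon `TOXIC_TERMS`, the helpers `toxicity_score` and `content_overlap`, and the `DetoxReport` container are hypothetical and are not taken from the survey. A real evaluation would swap in a multilingual toxicity classifier and a semantic similarity model, and would add separate checks for style, dialect, and identity-related speech.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon; a stand-in for a real multilingual toxicity classifier.
TOXIC_TERMS = {"idiot", "stupid"}


def _tokens(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [t.strip(".,!?") for t in text.lower().split()]


def toxicity_score(text: str) -> float:
    """Fraction of tokens found in the toy lexicon (illustrative only)."""
    tokens = _tokens(text)
    if not tokens:
        return 0.0
    return sum(t in TOXIC_TERMS for t in tokens) / len(tokens)


def content_overlap(source: str, rewrite: str) -> float:
    """Jaccard overlap of non-toxic tokens, a crude proxy for meaning preservation."""
    a = set(_tokens(source)) - TOXIC_TERMS - {""}
    b = set(_tokens(rewrite)) - TOXIC_TERMS - {""}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class DetoxReport:
    toxicity_reduction: float  # higher means more toxicity removed
    preservation: float        # higher means more source content kept


def evaluate(source: str, rewrite: str) -> DetoxReport:
    """Report both axes together so over-suppression is visible."""
    return DetoxReport(
        toxicity_reduction=toxicity_score(source) - toxicity_score(rewrite),
        preservation=content_overlap(source, rewrite),
    )


if __name__ == "__main__":
    source = "you are a stupid idiot"
    print(evaluate(source, "you are wrong"))  # keeps part of the content
    print(evaluate(source, ""))               # removes toxicity by deleting everything
```

In this toy setup, an empty rewrite earns the same toxicity reduction as a partial rewrite but a preservation score of zero, which is the over-suppression failure that a toxicity-only metric would miss.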
