Synthetic Datasets for Fine-Grained Fairness Analysis of Abusive Language Detection Systems

Three synthetic datasets, each covering a type of bias defined by its target: sexism, racism and ableism. Records are separated by abuse target because different phenomena of abusive language call for specialised datasets and a fine-grained approach. The resulting data contain no samples from licensed datasets, so the released contents are freely available. In brief, the sexism dataset contains 1,200 non-hateful and 4,423 hateful samples; the racism dataset contains 400 non-hateful and 1,500 hateful records; the ableism dataset contains 220 hateful sentences. The label distribution differs sharply from that of traditional abusive language datasets, where the non-hateful class is prevalent. This choice reflects our main focus on the phenomena surrounding social prejudices: we provide realistic and diverse examples in order to explore in depth the language used to convey biases.
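As a rough illustration of the label distribution described above, the following Python sketch computes the share of hateful samples per dataset from the stated counts and tallies labels from CSV text with a `text,label` header (the schema listed in this record). The function names, the label strings `hateful` and `non-hateful`, and the sample rows are assumptions made for illustration; the actual label encoding in the released files may differ. The zero non-hateful count for ableism is inferred from the stated size of 220.

```python
import csv
import io
from collections import Counter

# Counts as stated in the dataset description. The ableism entry assumes
# no non-hateful samples, since only 220 hateful sentences are listed.
STATED_COUNTS = {
    "sexism": {"non-hateful": 1200, "hateful": 4423},
    "racism": {"non-hateful": 400, "hateful": 1500},
    "ableism": {"non-hateful": 0, "hateful": 220},
}


def hateful_share(counts):
    """Return the fraction of samples labelled hateful."""
    total = sum(counts.values())
    return counts["hateful"] / total


def count_labels(csv_text):
    """Tally the `label` column of CSV text that has a `text,label` header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label"] for row in reader)


if __name__ == "__main__":
    for name, counts in STATED_COUNTS.items():
        total = sum(counts.values())
        print(f"{name}: {total} samples, {hateful_share(counts):.1%} hateful")

    # Hypothetical rows; the label strings are placeholders, not the real encoding.
    sample = "text,label\nexample one,hateful\nexample two,non-hateful\n"
    print(count_labels(sample))
```

To apply `count_labels` to a downloaded file, read the file contents into a string first and check which label values the file actually uses.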

Personal Data Attributes

Description: Personal Data related Information

Field Value
ChildrenData No
Cross Border Authorised Yes
General Data Yes
Personal Data No
Personal data was manifestly made public by the data subject No
Sensitive Data No
Additional Info
Field Value
Accessibility Both
Accessibility Mode OnLine Access
Attribution requirements Manerba, Marta Marchiori, and Sara Tonelli. "Fine-grained fairness analysis of abusive language detection systems with checklist." Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021.
Availability On-Line
Basic rights Modification
Creation Date 2021-12-22
Creator Marchiori Manerba, Marta
Dataset Citation Manerba, Marta Marchiori, and Sara Tonelli. "Fine-grained fairness analysis of abusive language detection systems with checklist." Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021.
Dataset Re-Use Safeguards /
Display requirements Paper citation
Field/Scope of use Any use
Format csv
Format Schema text, label
Group Societal Debates and Misinformation
Language eng, English
License term 2023-11-27 / 2026-11-27
Manifestation Type Original
Processing Degree Primary
Retention Period 2026-11-27 / 2026-11-30
Size 5623 (sexism); 1900 (racism); 220 (ableism)
SoBigData Node SoBigData EU
SoBigData Node SoBigData IT
Sublicense rights No
Territory of use World Wide
Thematic Cluster Text and Social Media Mining [TSMM]
system:type Dataset
Management Info
Field Value
Author MARCHIORI MANERBA MARTA
Maintainer MARCHIORI MANERBA MARTA
Version 1
Last Updated 9 December 2023, 11:31 (CET)
Created 9 December 2023, 11:31 (CET)