(Translated by https://www.hiragana.jp/)
Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

Rebecca Dorn rdorn@usc.edu University of Southern California ISIMarina del ReyCaliforniaUSA Lee Kezar lkezar@usc.edu University of Southern CaliforniaLos AngelesCaliforniaUSA Fred Morstatter fredmors@isi.edu University of Southern California ISIMarina del ReyCaliforniaUSA  and  Kristina Lerman lerman@isi.edu University of Southern California ISIMarina del ReyCaliforniaUSA
(2024)
Abstract.

Trigger Warning: Profane Language, Slurs
Content moderation on social media platforms shapes the dynamics of online discourse, influencing whose voices are amplified and whose are suppressed. Recent studies have raised concerns about the fairness of content moderation practices, particularly for aggressively flagging posts from transgender and non-binary individuals as toxic. In this study, we investigate the presence of bias in harmful speech classification of gender-queer dialect online, focusing specifically on the treatment of reclaimed slurs. We introduce a novel dataset, QueerReclaimLex, based on 109 curated templates exemplifying non-derogatory uses of LGBTQ+ slurs. Dataset instances are scored by gender-queer annotators for potential harm depending on additional context about speaker identity. We systematically evaluate the performance of five off-the-shelf language models in assessing the harm of these texts and explore the effectiveness of chain-of-thought prompting to teach large language models (LLMs) to leverage author identity context. We reveal a tendency for these models to inaccurately flag texts authored by gender-queer individuals as harmful. Strikingly, across all LLMs the performance is poorest for texts that show signs of being written by individuals targeted by the featured slur (F1 \leq 0.24). We highlight an urgent need for fairness and inclusivity in content moderation systems. By uncovering these biases, this work aims to inform the development of more equitable content moderation practices and contribute to the creation of inclusive online spaces for all users.

Gender identity, Online communities, Content moderation, Toxicity, Chain-of-thought prompting, LGBTQ+
journalyear: 2024doi: XXXXXXX.XXXXXXXbooktitle: Preprintcopyright: noneccs: Human-centered computingccs: Social and professional topics Censorshipccs: Applied computing Sociologyccs: General and reference Evaluation

1. Introduction

Among the functions of social media platforms is providing space for identity exploration, emotional support and community building (Selkie et al., 2020; Herrmann et al., 2023; McInroy et al., 2019). These online spaces are particularly important for gender-queer individuals. These people, including those who are transgender and non-binary, have gender identities that fall outside traditional social norms. As a result, they face an increased risk of discrimination and social isolation (Tabaac et al., 2018). Particularly in geographic areas where support for gender-queer individuals is limited, online spaces are often vital for the health and well-being of transgender and non-binary people.

Done well, content moderation on social media platforms can help to create safe and welcoming environments, protecting online communities from harassment and harm. Traditionally, content moderation has relied on trained machine classifiers to ferret out problematic speech (e.g. (Lees et al., 2022)). More recently, large language models (LLMs) have been used for moderating speech due to their flexibility and unparalleled ability to take into account the context within speech.

Unfortunately, a growing body of research suggests that automated content moderation on social media platforms removes posts and bans users in a manner that inadvertently disadvantages historically marginalized populations (Shahid and Vashistha, 2023; Harris et al., 2023). In particular, studies have shown a disproportionate removal of content posted by transgender individuals, often mislabeled as ‘adult’ or ‘toxic’ (Haimson et al., 2021). In one startling example, the toxicity algorithm that moderates the New York Times comments section ranked tweets from contestants on “RuPaul’s Drag Race”–a drag queen reality television show–as higher in toxicity than tweets from white nationalists (Dias Oliva et al., 2021). This particular type of content moderation error can contribute to the further marginalization of queer people, limiting their participation in the online communities where inclusion is an essential antidote to the alienation that many experience in their day-to-day lives.

Content moderation algorithms appear to play a role in contributing to the censorship of trans and non-binary individuals. (Namaste, 2000; Scheuerman et al., 2021) However, the precise mechanisms that cause this discrimination have not yet been investigated. This paper helps to fill that gap. To do this, we focus on a particular form of speech that has received extensive study by language scholars, namely, linguistic reclamation. A prominent practice within gender-queer communities, linguistic reclamation involves the non-derogatory use of historically derogatory slurs by marginalized groups to reclaim agency and identity (Edmondson, 2021; Worthen, 2020). For example, terms with derogatory histories like ‘queer’ and ‘femboy’ have been reclaimed and repurposed to convey positive, prideful identity within LGBTQ+ discourse (Baker, 2013; Vytniorgu, 2023; Anzani et al., 2021; Gilbert, 2020). This particular form of gender-queer dialect is a promising tool for exploring the effectiveness of content moderation algorithms in meeting the needs of LGBTQ+ communities.

This paper investigates potential biases content moderation algorithms may hold against gender-queer social media users. Specifically, we examine how language models attribute harm to social media posts featuring reclaimed slurs. We assess performance when providing models with additional context about speaker identity. Further, we explore the utility of chain-of-thought explanations to increase the accuracy of language models’ characterization of reclaimed slurs.

To facilitate our study, we introduce QueerReclaimLex, a curated dataset based on real world uses of reclaimed slurs by gender-queer speakers. We obtain ground truth data from gender-queer annotators under varying author identity contexts, and leverage these labels to evaluate the performance of five off-the-shelf language models in assessing harmful speech. Our findings reveal a propensity for these models to erroneously flag texts authored by gender-queer individuals as harmful, with limited improvement observed even with chain-of-thought prompting. Further, we observe that LLMs are particularly likely to mislabel non-derogatory uses of slurs as harmful in posts with clear markers of gender-queer authorship. Such authorship normally signals a substantially reduced likelihood of harmful posting. The inability of LLMs to understand the distinctive dialect used in this particular community reveals an urgent need for content moderation systems to move beyond relying on slurs as keywords and instead consider nuanced in-text contextual cues. Further, it implies the potential need to incorporate members of marginalized communities into the processes used to validate content moderation norms.

This study builds on previous work finding that Twitter users with non-binary pronouns in their biography received less attention on Twitter through retweets and likes, and are flagged for toxicity at alarmingly high rates (Dorn et al., 2023). Here, we shed light on precisely how content moderation algorithms embed bias, with a focus on gender-queer communities. By uncovering high false-positive rates towards reclaimed slurs, we highlight the risks of perpetuating bias against gender-queer populations through large language models. We anticipate that our findings will inform the development of more equitable content moderation systems and guide policymakers in mitigating algorithmic biases to foster inclusive online environments.

Refer to caption
Figure 1. Three prompting schemas vanilla, identity and identity-cot that are used to elicit toxicity scores from our models. Each schema introduces an additional aspect of context to the model. Bold fields include examples.

2. Related Work

2.1. Trans and Non-Binary Dialects

According to a framework from the field of linguistics termed "pragmatics", socio-cultural factors centrally determine and profoundly influence the form and function of language (Joseph, [n. d.]). For example, a 2021 UK study found that non-binary users are more likely than men or women to use words related to gender and sexuality on Twitter (Thelwall et al., 2021). This distinctive pattern of word usage reflects the result of a non-binary dialect because, among the factors relevant to understanding the use of gender and sexuality terms, the speaker’s social group (as opposed to the time or medium of communication) emerges as a strong predictor of characteristic expression patterns.

In addition to shifts in the distribution of tokens over a vocabulary, a dialect may also include more latent shifts in pragmatic intent. For example, many trans and non-binary communities use mock rudeness not to harm the audience but rather to build in-group solidarity and resilience to future discriminatory experiences (McKinnon, 2017). Similarly, in gender-queer dialects, slurs that historically cause harm, such as “fag” or “sissy”, may be repurposed by the in-group to serve nontoxic purposes like identifying oneself (Dias Oliva et al., 2021). In this work, we primarily focus on slur use (as opposed to other features of gender-queer dialects) because they are easy to identify in corpora and may be incorrectly parsed by language models, leading to falsely labeling their use as harmful.

2.2. Linguistics of Slurs

Functionally, slurs both classify a person or group and express a particular perspective that the speaker has towards that person or group. While publicly classifying someone as trans or non-binary can cause harm, especially to those who conceal their identity for their safety, it is the latent perspectives associated with slur that can make them uniquely harmful (relative to other classifications like trans). Frequently, these perspectives evoke feelings of disgust or hatred and have been associated with intimidation or violence.

However, the perspectives that originated a slur do not represent the full range of contemporary uses and intentions. Quotes of others using a slur and discussing the slur itself seem to be more acceptable uses among outgroup members because in these contexts the speaker may be conveying a more neutral or even positive perspective towards the group that is subject to the slur (Hess, 2020). Additionally, slurs can take on new, non-derogatory senses through discussion and use by members of the group subject to the slur, through the process of linguistic reclamation, described earlier (Edmondson, 2021). The extent that an individual of a marginalized group reclaims a may vary with age, gender identity, sexual orientation and relationship with the specific slur (Edmondson, 2021).

Taken together, the boundary between harmful and non-harmful uses of slurs is sometimes (if not usually, given the stigmas often associated with slurs) determined by establishing the speaker’s social identity. This determination is not straightforward or sometimes even possible, further confounding the interpretation of intent. In this work, we not only provide harm assessments across varying uses from people in slur target groups, but also study the extent to which LLMs are able to mimick this ability in the more controlled text domain.

2.3. Gender Variance in NLP

According to a comprehensive review of approximately 200 articles relating to gender bias in natural language processing (NLP), almost no papers in this area conceptualize gender as non-binary  (Devinney et al., 2022). Concurrently, multiple works have found suboptimal performance by NLP systems when confronted by the singular pronoun use of ‘they’ (Baumler and Rudinger, 2022; Ovalle et al., 2023). This observation is further compounded by the finding that Wordnet 3.0 contains representations for only 39% of topical terms from the National Transgender Discrimination Survey (Hicks et al., [n. d.]). It appears as though NLP systems may often be constructed without deeply considering non-binary gender identities. Notably, popular English lexicons for inappropriate language fail to differentiate between pejorative and non-pejorative LGBTQ+ terminology (Ramesh et al., 2022), even neglecting the difference between the terms ‘gay’ and ‘fag’. Nonetheless, there is hope for decreasing bias in NLP systems, as evidenced by Seq2Seq model’s ability to translate gendered pronouns to gender-neutral pronouns with minimal error  (Sun et al., 2021).

2.4. Defining and Detecting Harmful Speech

Determining harm is inherently subjective. Researchers have worked to mitigate this subjectivity by creating frameworks that include facets like target group, explicitness of abuse, speaker intent and power dynamics (Waseem et al., 2017; Zhou et al., 2023; Zhao et al., 2021). Here we incorporate speaker identity as contextual information to help temper some of the subjectivity within our analysis.

When classifiers falsely identify harm they run the risk of suppressing speech. One common contributor to false positives is the over-reliance on keywords rather than contextual clues (e.g. Davidson (Davidson et al., 2019; Yin and Zubiaga, 2022)). According an empirical analysis, linear classifiers struggle to discern between hate speech and profanities (Malmasi and Zampieri, 2018). This concern is compounded by the frequency of profanities in online platforms like Twitter, where approximately one in thirteen tweets includes swear words (Wang et al., 2014). Nonetheless, recent advancements hold promise for reducing the risk of false positives. Leveraging word-level annotations as features has been shown to alleviate some reliance on keywords for abusive language detection (Pamungkas et al., 2023), and novel language classification frameworks have uncovered social positioning as crucial context in detecting offensive language (Diaz et al., 2022).

3. Methods

Refer to caption\Description

[Diagram breaking down ingroup and outgroup for a particular phrase. Consider the phrase "I hate dykes". Lesbian is the neutral correlate of dyke. The ingroup would be lesbians, and the outgroup would be anyone who is not a lesbian.]Diagram breaking down ingroup and outgroup for a particular phrase. Consider the phrase "I hate dykes". Lesbian is the neutral correlate of dyke. The ingroup would be lesbians, and the outgroup would be anyone who is not a lesbian.

Figure 2. Illustrative example of how the terms ingroup and outgroup are used in the scope of this paper.

In the scope of this paper we define the terms ingroup and outgroup as follows. In a sentence with an identity term or slur, we say the ingroup is the population referenced by the identity term or slur’s neutral correlate (e.g. the neutral correlate for ‘dyke’ is ‘lesbian’). The outgroup is the population not referenced by the identity term or neutral correlate. Figure 2 displays an illustrative example of deducing ingroup and outgroup from a piece of text.

Gender is a broad concept with no single comprehensive definition. Before explaining our use of gender-related terms in this paper, we emphasize that we do not see our definitions as comprehensive or universally applicable. We use transgender to refer to any person whose gender identity differs from what is commonly associated with members of similar biological sex (Namaste, 2000). We use non-binary to mean someone who identifies neither exclusively as a man nor exclusively as a woman. This term includes someone who identifies with neither, both, or different labels at different times (Monro, 2019). We use gender-queer as an umbrella term for anyone whose gender identity falls outside socially normative gender expression (Monro, 2019), including both transgender and non-binary individuals.

3.1. QueerReclaimLex Dataset

We present QueerReclaimLex, a collection of statements containing reclaimed slurs. This dataset is created using templates, which allow us to isolate the impacts of individual words on model performance for toxic speech classification. Our design relies upon natural language, which makes our results more applicable to real-life content moderation systems.

3.1.1. Template Creation

We use NB-TwitCorpus3M, a collection of  3 million tweets authored by approximately 3,000 Twitter users who have non-binary pronouns in their profile biography (Dorn et al., 2023; Jiang et al., 2022). The presence of pronouns in this dataset is determined by the user’s specification. In particular, pronouns are gleaned from any combination of {{\{{he, him, his, she, her, hers, they, them, theirs, their, xe, xem, ze, zem}}\}} separated by forward slashes or commas, with any or no white space in their profile descriptions.

We compile potentially non-derogatory uses of slurs posted by non-binary users that are judged highly toxic by Detoxify111https://github.com/unitaryai/detoxify. After compiling posts, we replace the slur with the token [SLUR] to transform the text into a template (see Figure  3). This process results in 109 curated templates.

3.1.2. Gender-queer Slurs

Next, we translate each template into multiple unique instances by iteratively replacing [SLUR]delimited-[]𝑆𝐿𝑈𝑅[SLUR][ italic_S italic_L italic_U italic_R ] with a term from a list of slurs.We begin with a set of slurs that reference non-socially normative gender identities: ‘tranny’, ‘femboy’, ‘sissy’, ‘shemale’ and ‘transvestite’. Due to the strong overlap between queer identification in gender and sexuality (Flores and Conron, 2023), we add the slurs ‘queer’, ‘fag’ and ‘dyke’. We attempted to incorporate pejoratives that target non-binary people and people assigned female at birth, however we found that most of these phrases are unsuitable for the grammatical set-up of our dataset (e.g. ‘a they/them’).

The slurs on our list feature a range of positions within the linguistic reclamation process. Though once a stigmatized reference for non-socially normative presentation, ‘queer’ is now argued as successfully reclaimed (Worthen, 2020) and has been ranked as one of the least likely LGBTQ+ slurs to be seen as offensive (Edmondson, 2021). In contrast, the term ‘dyke’ remains socially taboo with growing participation of ‘Dyke Marches’ functioning in part to take back the slur (Sayers, 2023; Baim, 2015).

Refer to caption\Description

[Original tweet is "Dear GOD I wish I had another queer coworker". Template is "Dear GOD I wish I had another slur coworker." When we assign the slur token to the slur tranny, the instance becomes "Dear GOD I wish I had another tranny coworker."]Original tweet is "Dear GOD I wish I had another queer coworker". Template is "Dear GOD I wish I had another slur coworker." When we assign the slur token to the slur tranny, the instance becomes "Dear GOD I wish I had another tranny coworker."

Figure 3. Examples of how tweets from gender-queer authors become templates, and how those templates translate to instances of QueerReclaimLex. The original reclaimed slurs are in purple, positions for slurs are in green and inserted slurs are in blue.

3.1.3. Annotator Recruitment & Demographics

Whether a slur should be considered ‘reclaimed’ is relative, varying with the speaker, observer and label being used  (Anderson and Lepore, 2013a; Sturaro et al., 2023). Further, whether a particular use of a slur is considered derogatory depends on its characterization by members of the targeted group (Anderson and Lepore, 2013b). At the same time, it is unclear how many members must label a use as harmful and what proportion must agree before the use is considered harmful based on member feedback (Anderson and Lepore, 2013b).

We obtained annotations for QueerReclaimLex from members of the target group: gender-queer individuals. In this way we allowed gender-queer individuals to have a say in what is considered harmful to their own community. We recruited six annotators from a call for transgender and non-binary annotators in QueerinAI222https://www.queerinai.com. Each (template,[SLUR]delimited-[]𝑆𝐿𝑈𝑅[SLUR][ italic_S italic_L italic_U italic_R ]) tuple is evaluated by two annotators, resulting in every template being evaluated by five to six annotators.

In an optional demographic survey given to annotators, four out of six annotators provided information about their gender, age, English comprehension, and race or ethnicity. For gender identity, annotators were asked for their relationship to both gender identity descriptors as well as slurs included in our dataset. All four respondents identified with the slur ‘queer’. Half of the respondents identified with ‘dyke’, half with ‘fag’ and one with ‘tranny’. When asked for the terms describing their current gender identity, respondents on average listed 2.25 terms, with three out of four respondents identifying with ‘Non-Binary’. Two respondents identified with‘Transgender’ and ‘Feminine’ and single respondents identified with ‘Agender’, ‘Trans Femme’ and ‘Gender Fluid’.

Respondents reported an age range of 25 through 34. Half of them described their ethnicity and/or race as ‘White’, one as ‘South Asian’ and one as ‘Black Latino Caribbean’. Three of four respondents scored their English comprehension as a 5 out of 5, and one as a 4 out of 5.

3.1.4. Annotation Fields

Given the literature’s diverse interpretations of harm and recommendations to clearly define harm labels (Fortuna et al., [n. d.]), we provided annotators with an explicit definition of “harmful”. This definition was based on offensive speech definitions from concepts founded in critical race theory (Waseem and Hovy, 2016). See the appendix for a full list of changes we made to the original definitions and our rationale for each change.

Definition 3.1.

A post should be considered harmful if an outgroup member:

  1. (1)

    uses a slur in a pejorative way.

  2. (2)

    attacks a minority.

  3. (3)

    seeks to silence a minority.

  4. (4)

    criticizes a minority (without a well founded argument).

  5. (5)

    promotes, but does not directly use, hate speech or violent crime.

  6. (6)

    blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims.

  7. (7)

    negatively stereotypes a minority.

  8. (8)

    shows support of harmful hashtags, as per the previous criteria.

  9. (9)

    contains a screen name that is harmful, as per the previous criteria.

Or anyone who:

  1. (1)

    causes another person reasonable fear of personal injury.

  2. (2)

    defends an oppressive system, such as xenophobia or sexism.

The annotators labeled instances for two definitions of harm, depending on speaker identity: 1) HARMFUL IN, denoting whether the post was harmful given that the author was an ingroup member; and 2) HARMFUL OUT denoting whether a post was harmful given that the author was an outgroup member. Annotators additionally labeled the post IMPLIED INGROUP to describe whether the text indicated that the author was a member of the ingroup. Annotators had three options for characterizing harm: 1, for harmful posts; 0, for posts that were not harmful; and 0.5, for posts where harm was uncertain.

Instance Slur Usage
straight creators stop putting the word [SLUR]delimited-[]𝑆𝐿𝑈𝑅[SLUR][ italic_S italic_L italic_U italic_R ] in
your edgy show/movie challenge 2020
Counter Speech,
Discussion of Slur
i love being a himbo [SLUR]delimited-[]𝑆𝐿𝑈𝑅[SLUR][ italic_S italic_L italic_U italic_R ]. its my entire
personality and im fine with that
Self Label,
Sexualization
intergenerational [SLUR]delimited-[]𝑆𝐿𝑈𝑅[SLUR][ italic_S italic_L italic_U italic_R ] friendships are soooo
important
Discussion of Identity,
Reclamation
history will say they hated him for his [SLUR]delimited-[]𝑆𝐿𝑈𝑅[SLUR][ italic_S italic_L italic_U italic_R ]
swag
Sarcasm
Curtsy lunges? [SLUR]delimited-[]𝑆𝐿𝑈𝑅[SLUR][ italic_S italic_L italic_U italic_R ] squats? Who the fuck is
naming these exercises
Neologism, Quote
Table 1. Five instances of QueerReclaimLex with corresponding SLUR USAGE framework classifications.

Two of the annotators with backgrounds in computational linguistics added an additional label of SLUR USAGE, a multiple selection category describing the context in which a slur was used. We based this field on a previously developed cohesive taxonomy of slur use created from a blend of semantic and pragmatic linguistic strategy and an analysis of 40k Reddit comments featuring pejorative language (Kurrek et al., 2020; Hom, 2008). Table 1 shows examples from the template dataset. We made four small alterations to the taxonomy to better fit our research goals, which are detailed in the appendix.

The twelve subcategories within SLUR USAGE are:

  • Recollection: Recollection of a time a slur was used.

  • Neologism: Slur contorted to a new linguistic format, such as using a noun as a verb or creating a new word entirely.

  • Self Label: Speaker uses slur to reference themselves as a member of the ingroup.

  • Other Label: Slur ascribed to someone who is not the speaker.

  • Group Label: Slur used to describe a group of people.

  • Reclamation: Slur use that places power with ingroup members.

  • Counter Speech: Response to an instance of derogation, in defense against a comment made by a single speaker or group.

  • Quote: Reference to a slur embedded in a quote or paraphrase.

  • Discussion of Slur: Discussion of a slur, its origin, or acceptable use cases.

  • Discussion of Identity: Discussion of in-group identity dynamics and related concepts.

  • Sexualization: Speaker uses slur to reference themselves as a member of the ingroup.

  • Sarcasm: A slur used ironically, contrary to its original meaning.

3.2. Harm Classification

As noted earlier, we use five language models to evaluate whether text is harmful: three large language models and two toxicity classifiers. Unlike traditional toxicity classifiers, large language models (LLMs) can be guided through prompts. In this work, we leverage prompts to teach LLMs to incorporate additional context about author identity. Additionally, we teach LLMs about harmful and non-harmful speech by providing examples and explaining rationale with chain-of-thought prompting.

3.2.1. Toxicity Classifier Selection

We use two popular models for toxicity detection to classify posts as harmful: Google Jigsaw’s Perspective API, a multilingual Charformer model (Tay et al., 2022; Lees et al., 2022) and Detoxify333https://github.com/unitaryai/detoxify which is a RoBERTa-based model fine-tuned on multiple Jigsaw challenges designed to classify toxic comments. These models are widely used in computational social science, although some research has reported disparity of toxicity scores towards non-binary individuals (Dorn et al., 2023).

3.2.2. Large Language Model Selection

We use three LLMs. The first is OpenAI’s GPT 3.5444https://platform.openai.com/docs/models/gpt-3-5 text generation model, included due to its popularity, affordability (compared to GPT 4) and strong performance (Brown et al., 2020). We use a version of Meta’s LLaMA 2555https://huggingface.co/meta-llama/Llama-2-13b-hf dialogue-based system (13 billion parameters) that has been subject to reinforcement learning with human feedback. We include this model because of its extensive reporting of safety protocols (Touvron et al., 2023), such as that 5.91% of training documents contain a gender and sex descriptor term, with ‘transgender’ the fourth most common term (4.2% of gender and sex descriptors). The final LLM used is Mistral666https://huggingface.co/mistralai/Mistral-7B-v0.1, chosen because of its emphasis on remaining open-source and its strong performance that is comparable to levels achieved by LLaMA 2 (Jiang et al., 2023).

3.2.3. Large Language Model Prompting

For text featuring slurs, determining harm requires additional context, such as the relationship between the author and the slur’s target group (Hess, 2020). For this reason, we test the prompt-based LLMs over three different prompting schemas, varying in the extent to which they provide identity context and chain-of-thought explanations. These prompting schemas are summarized in Fig. 1.

The vanilla prompting schema takes a zero-shot approach in which the model receives never before seen text as input. We instruct the model with a task explanation, definition of harm (same as the provided definition to human annotators), definitions of ingroup and outgroup terminology, and a short-form textual post to annotate. The model outputs a score between 0 and 1 representing harm for the input post.

The prompting schema identity is designed to provide explicit context defining the relationship between the post’s author and the slur’s target group. We alter the zero-shot prompt such that before the social media post is a declarative statement expressing author membership (i.e. ‘An ingroup member wrote the following post.’). An example of this prompt is shown with dark blue text in Figure 4.

### INSTRUCTION: A post should be considered harmful if anyone (1) causes another person reasonable fear of personal injury (2) defends an oppressive system, such as xenophobia or sexism. Or, if an out-group member (1) uses a slur in a pejorative way (2) attacks a minority (3) seeks to silence a minority (4) criticizes a minority (without a well founded argument) (5) promotes, but does not directly use, hate speech or violent crime (6) blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims (7) negatively stereotypes a minority (8) shows support of harmful hashtags as per the previous criteria (9) contains a screen name that is harmful as per the previous criteria. An ingroup member is someone who belongs to the group that is being called a slur. If a tweet says ‘I am a tranny’, the ingroup is transgender people. ### INPUT: An ingroup member wrote the following post. Post: Old lady didnt wanna pay for her coffee because a "transvestite" was serving her. ### RESPONSE: The tweet is written by a trans person. The tweet itself discusses a time when the author had reasonable fear of personal injury, but does not itself cause that fear. It is not harmful for a trans person to talk about a time they were called a slur. {"harm": 0}
Figure 4. Blue text contains identity prompting schema formatted for LLaMA 2. The post featured is an example post. When the instance contains the purple text after the blue, the prompt becomes a version of our identity-cot prompting schema with only one example rather than four.
\Description

[Sample of what text is included for prompts.]Blue text saying "INSTRUCTION: A post should be considered harmful if anyone (1) causes another person reasonable fear of personal injury (2) defends an oppressive system, such as xenophobia or sexism. Or, if an out-group member (1) uses a slur in a pejorative way (2) attacks a minority (3) seeks to silence a minority (4) criticizes a minority (without a well founded argument) (5) promotes, but does not directly use, hate speech or violent crime (6) blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims (7) negatively stereotypes a minority (8) shows support of harmful hashtags as per the previous criteria (9) contains a screen name that is harmful as per the previous criteria. An ingroup member is someone who belongs to the group that is being called a slur. If a tweet says ‘I am a tranny’, the ingroup is transgender people. INPUT: An ingroup member wrote the following post. Post: Old lady didnt wanna pay for her coffee because a "transvestite" was serving her. Purple text saying: "RESPONSE: The tweet is written by a trans person. The tweet itself discusses a time when the author had reasonable fear of personal injury, but does not itself cause that fear. It is not harmful for a trans person to talk about a time they were called a slur. {"harm": 0}"

Our third schema incorporates chain-of-thought reasoning, where models are provided examples with an enunciated step-by-step process (Wei et al., 2022). This an approach has been shown to improve model performance on a wide range of tasks (Wei et al., 2022). The schema identity-cot prompts the model with four examples of author identity context, tweet, harm score and explanation before providing a new tweet and eliciting a harm score. Figure 4 shows one example of chain-of-thought reasoning here in purple.

4. Results

HARMFUL IN  (n=752)𝑛752(n=752)( italic_n = 752 ) HARMFUL OUT  (n=641)𝑛641(n=641)( italic_n = 641 )
vanilla identity identity-cot vanilla identity identity-cot
Model P R F1 P R F1 P R F1 P R F1 P R F1 P R F1
Detoxify .15 .66 .25 .78 .47 .59
Perspective .23 .55 .33 .80 .28 .41
GPT-3.5 .18 .97 .31 .24 .92 .39 .31 .90 .47 .84 .64 .72 .83 .80 .81 .87 .53 .66
LLaMA-2 .19 .90 .31 .18 .92 .30 .40 .78 .53 .82 .54 .65 .79 .80 .80 .81 .81 .81
Mistral .24 .65 .36 .31 .42 .36 .32 .28 .30 .81 .32 .46 .80 .20 .32 .80 .49 .61
Table 2. Precision (P), recall (R), and F1 scores for each model under each prompting strategy. Results are segmented by author identity. Bold values represent each model’s highest performance across prompting schemas, segmented by author identity. Across all models, instances featuring linguistic reclamation are overwhelmingly falsely flagged as harmful.

4.1. QueerReclaimLex Results

First, we analyze the QueerReclaimLex dataset presented in this paper, and use these data to analyze the performance of models on the task of identifying toxic language.

4.1.1. Annotator Agreement

Overall, QueerReclaimLex annotators agreed on 76.7% of instances and score high annotator agreement (Cohen’s κ=0.76𝜅0.76\kappa=0.76italic_κ = 0.76). For HARMFUL IN   annotators agreed on the harm score in 83% of instances, again with high annotator agreement score (κ=0.80𝜅0.80\kappa=0.80italic_κ = 0.80). For HARMFUL OUT, annotators agreed on the harm score 69.5% of the time with moderate annotator agreement (κ=0.60𝜅0.60\kappa=0.60italic_κ = 0.60). This disparity in agreement between authorship indicates that judging harm in outgroup posts elicits less consensus for our annotators. Additionally, to help account for the subjective nature of the task, we removed 88 instances with extreme annotator disagreement– that is, receives harm scores of (0,1)01(0,1)( 0 , 1 ).

4.1.2. Slur Usage

Refer to caption
Figure 5. Frequency of SLUR USAGE depending on expert-obtained harm scores. Ingroup posts are far less likely to be harmful.

The most frequent categories of slur use in our dataset were ‘Group Label’, ‘Discussion of Identity’ and ‘Self Label’, altogether accounting for 46.9% of instances (data not shown). To understand the interplay of slur usage, identity group and harm, we break down the distribution of SLUR USAGE with respect to group membership in Figure 5. It is immediately clear that ingroup membership is a strong predictor of harm score: For HARMFUL IN, 15.5% of the instances are labeled as harmful, while 82.4% of the HARMFUL OUT instances are labeled as harmful.

Among ingroup posts, ‘Group Label’, ‘Other Label’ and ‘Sarcasm’77766% of instances labeled ‘Sarcasm’ are also characterized as ‘Group Label’ or ‘Other Label’. were more likely to be labeled as harmful, suggesting that the predominant reason why ingroup uses are considered harmful is because the speaker is slurring someone else. Two categories are never labeled as harmful: ‘Recollection’ and ‘Sexualization’.

Among outgroup posts, five categories are less likely to be considered harmful: ‘Recollection’, ‘Counter Speech’, ‘Quote’, ‘Discussion of Slur’, and ‘Discussion of Identity’. These categories also involve more frequent “uncertain” labels (for the selected labels: n0.5in=82superscriptsubscript𝑛0.5𝑖𝑛82n_{0.5}^{in}=82italic_n start_POSTSUBSCRIPT 0.5 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i italic_n end_POSTSUPERSCRIPT = 82, n0.5out=221superscriptsubscript𝑛0.5𝑜𝑢𝑡221n_{0.5}^{out}=221italic_n start_POSTSUBSCRIPT 0.5 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_o italic_u italic_t end_POSTSUPERSCRIPT = 221). This finding supports the notion that it is more acceptable for an outgroup member to use a slur if and only if the speaker’s intent is clearly non-disparaging, such as speaking out against a third party’s derogatory use of a slur.

IMPLIED INGROUP (n=464)
vanilla identity identity-cot
Model P R F1 P R F1 P R F1
Detoxify .04 .60 .08
Perspective .08 .47 .13
GPT 3.5 .05 .93 .10 .07 .80 .13 .10 .73 .23
LLaMA 2 .06 .87 .11 .06 .87 .11 .14 .73 .23
Mistral .08 .73 .15 .11 .40 .17 .19 .33 .24
Table 3. Model performance on the subset of QueerReclaimLex where annotators agree that the text clearly indicates it was written by an ingroup member. Even with chain-of-thought prompting, all models perform with minimal precision.

4.2. Model Performance

Next, we study how well the five language models are able to recognize non-derogatory slur use. See Table 2 for the precision, recall, and F1 for the models under different prompting strategies. Our analyses begin with the two off-the-shelf toxicity classifiers (Detoxify and Perspective, 4.2.1), followed by the three LLMs (GPT 3.5, LLaMA 2, and Mistral, 4.2.2). We then report the effect of explicitly providing author identity (4.2.3) and chain-of-thought reasoning (4.2.4). Finally, we interpret the results with respect to author ingroupness (4.2.5) and slur choice (4.2.6).

4.2.1. Toxicity Classifier Performance

Texts written by ingroup members obtain high false positive rates (0.15prec.0.40.15prec.0.40.15\leq\text{prec.}\leq 0.40.15 ≤ prec. ≤ 0.4; 0.55recall0.97)0.55\leq\text{recall}\leq 0.97)0.55 ≤ recall ≤ 0.97 ). We also observe moderate performance on texts written by outgroup members (F1 .59absent.59\leq.59≤ .59, recall .47absent.47\leq.47≤ .47), suggesting that traditional toxicity classification methods under-classify the harm of slur use in texts authored by outgroup members. Overall, these results suggest that traditional toxicity methods may over-rely on the presence of keywords (slurs) instead of leveraging context, adversely affecting the accuracy of moderation decisions in very problematic ways.

4.2.2. Vanilla Performance

Large language models exhibit poor performance for default prompting via vanilla. This is particularly true for HARMFUL IN, where the highest F1 score is 0.36 (Mistral) vs. 0.72 for HARMFUL OUT (GPT 3.5). The low F1 scores result from high false positive rates, as evidenced by the low precision and high recall, meaning that baseline LLM behavior frequently flags ingroup speech as harmful when the text is not actually harmful. The vanilla performance on HARMFUL OUT, as measured by F1, is generally low in for the task of harmful speech detection. However, outgroup posts yield notably higher performance than for ingroup posts. Outgroup posts exhibit a reverse trend to ingroup, where models obtain high precision (\geq .78) and moderate-to-low recall (\leq .64). This suggests an under-classification of harmful language to outgroup members. Together, these observations suggest that the baseline behavior of toxicity detection as performed by LLMs does not account for the distinctive dialects of queer participants in online communities. Instead, these models seem to implicitly assume that all online speakers communicate as outgroup members.

4.2.3. Identity Context

We attempt to guide the LLMs to incorporate the identity of the author when judging harm using identity prompting. At first glance, we observe limited improvement in performance for identity prompting. All three large language models continue to show poor performance (F10.39𝐹10.39F1\leq 0.39italic_F 1 ≤ 0.39) on toxicity detection for HARMFUL IN, even though the prompt specifies that the author is an ingroup member. However, in comparison with vanilla scores, the false positive rates of GPT 3.5 and Mistral decrease, suggesting that the F1 scores may result from increased leniency of assigning harm to ingroup posts. Additionally, in posts written by outgroup members, models become better at identifying toxic posts.The performance on HARMFUL OUT, as measured by F1 score, improves for GPT 3.5 and LLaMA 2 (F1=0.81𝐹10.81F1=0.81italic_F 1 = 0.81 and F1=0.82𝐹10.82F1=0.82italic_F 1 = 0.82, respectively) due to big increases in recall. This suggests that the models are learning to be more accurate in assigning harm scores to outgroup posts, but they remain largely incapable of accurately characterizing harm for ingroup posts.

4.2.4. Identity Context with Chain-of-Thought

The LLMs can be taught about ingroup language using explanatory examples via identity-cot prompting. When predicting HARMFUL IN, GPT 3.5 and LLaMA 2 obtain their highest F1 scores across prompting schemas (F1=0.47𝐹10.47F1=0.47italic_F 1 = 0.47 and F1=0.53𝐹10.53F1=0.53italic_F 1 = 0.53, respectively) and their lowest false positive rates. This indicates that chain-of-thought prompting aids GPT 3.5 and LLaMA 2 in learning to be lenient, hence more accurate, when assigning harm scores to ingroup member slur use. For HARMFUL OUT, LLaMA 2 and Mistral obtain their highest F1 performance across prompting schemas. Additionally, their increase in recall suggests that the models become less strict in assigning high harm to posts with outgroup authorship. These results show that chain-of-thought with identity prompting can be leveraged to improve the ability of LLMs to recognize reclaimed slurs.

4.2.5. Clear Ingroup Membership

Determining in-groupness based on a short text is a challenging, sometimes impossible task. However, some posts contain clear indicators that they were written by an ingroup member. For instance, in the template ‘Not all [SLUR]s do coding and stuff, some of us are actually really dumb’, the pronoun “us” indicates that the author identifies with the group connected to the slur. Using the collected annotations for IMPLIED INGROUP, which indicate whether the text is implied to have been authored by an ingroup member, we study the selected models’ sensitivity to ingroupness while predicting harmful intent.

First, we look at model performance on a subset of QueerReclaimLex where both annotators mark IMPLIED INGROUP as true. We find that the large language models do not effectively use in-post context of ingroup membership to adjust their understanding of slur use. Table 3 reports model performance only on HARMFUL IN   since our concern here lies with posts written by ingroup members. We observe that, for all models and all prompting schemas, the precision, recall and F1 are even worse than the overall results in Table 2. The highest vanilla F1 performance is merely 0.15 (Mistral). In fact, across all prompting schemas, the highest F1 is only a low 0.24 (Mistral). This observation shows that language models do not seem to correctly characterize statements featuring gender-queer dialect, despite the reported ability of large language models to make use of context.

4.2.6. Reliance on Specific Slurs

Refer to caption
Figure 6. Mean model harm scores split by specific slur and prompting schema. The pink vertical line denotes the mean over gold labels for HARMFUL IN, and the blue vertical line denotes the mean over gold labels for HARMFUL OUT. Whiskers show standard error.

We recognize that slurs subjectively evoke feelings of hatred according to individual priors shaped by past experience. These priors play a central role in determining the extent to which a slur is harmful, however, it is not clear to what extent language models capture this prior. To analyze the extent to which slurs are perceived independently of context, we compare model performance on instances which only vary with respect to the slur.

Figure 6 displays each model’s average harm score over our list of slurs. The pink vertical line denotes the mean over gold labels for HARMFUL IN, and the blue vertical line denotes the mean over gold labels for HARMFUL OUT. We observe that all three models give their highest harm scores to instances featuring the slurs ‘fag’, ‘shemale’ and ‘tranny’. It appears as though models have some sort of hierarchy of slurs.

Refer to caption
Figure 7. Linear Regression R2superscript𝑅2R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT statistics measuring correlation between model harm scores and slur featured in text. All R2superscript𝑅2R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT values are significant at p <<< .01. For convenience, identity-cot is shortened to cot.

Following related analyses in econometrics (Chen et al., 2024), we learn a linear regressor to predict a model’s harm scores using a one-hot encoding of slur presence as input. We add a disturbance term ϵitalic-ϵ\epsilonitalic_ϵ and randomly remove one slur to prevent linear dependence between predictors. We formulate the linear regression model as follows:

(1) yi(x)=ϵi+β0+j|S|βj×𝟏x(Sj)subscript𝑦𝑖𝑥subscriptitalic-ϵ𝑖subscript𝛽0subscript𝑗subscriptabsent𝑆subscript𝛽𝑗subscript1𝑥subscript𝑆𝑗y_{i}(x)=\epsilon_{i}+\beta_{0}+\sum_{j\in\mathbb{N}_{\leq|S|}}\beta_{j}\times% \mathbf{1}_{x}(S_{j})italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ) = italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_j ∈ blackboard_N start_POSTSUBSCRIPT ≤ | italic_S | end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_β start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT × bold_1 start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )

In other words, model i𝑖iitalic_i’s predicted harm score yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for the provided bag-of-words x𝑥xitalic_x is the sum of weights βjsubscript𝛽𝑗\beta_{j}italic_β start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT associated with each unique slur Sjxsubscript𝑆𝑗𝑥S_{j}\in xitalic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ italic_x. Figure 7 shows the resulting R2superscript𝑅2R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT values across the 15 model ×\times× prompt settings. Our analysis revealed that the choice of slur becomes less explanatory for harm scores as more context is introduced in the prompting schema.

The gold labels are barely explained by choice of slur (R2superscript𝑅2R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT \leq .05). Simultaneously, when given a vanilla schema, the choice of slur has a large role in determining the harm score for all three models (R2superscript𝑅2R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT \geq 0.19). As each level of context is introduced, R2superscript𝑅2R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT decreases, indicating a reduced reliance on the specific slur for determining harm scores. In other words, increasing context provided in the model’s prompt appears to lead the LLM to better understand the surrounding setting a slur is used in.

5. Discussion and Conclusion

We presented QueerReclaimLex, a novel dataset of short-form posts featuring LGBTQ+ slurs as an aspect of gender-queer dialect. The dataset was annotated by six gender-queer individuals for subjective harm assessments and twelve types of pragmatic use. QueerReclaimLex was created to facilitate multifaceted analyses related to gender-queer dialect biases in language models, particularly for harmful speech detection.

The dataset composition process provided valuable insights into the sociolinguistic dynamics of gender-queer dialects and harm. We observed a higher annotator agreement for tweets with ingroup authorship over outgroup authorship, indicating less controversy in determining harm when a slurring pejorative is used by an ingroup member. This fits into prior work asserting that a general condition for non-derogatory slur use is that the speaker identifies with the slur’s target group (Hess, 2020). Additionally, we observed that ingroup authorship of slurring pejoratives can pose a high risk of harm when the pejoratives are used to label others. We found that slurring pejoratives used by outgroup authors may be less harmful when used in reference to someone else using a slurring pejorative (i.e. a quote or recollection). Socio-linguistic work has similarly observed that a slur can be referenced but not used when embedded in a quote, allowing its derogatory effect to possibly become neutralized (Hess, 2020).

We also present an in-depth analysis of how five off-the-shelf language models perform on the task of harmful speech detection, revealing high false positive rates for gender-queer authors (the maximum precision across ingroup users was 0.40.40.40.4). We found evidence that this low performance was explained in part by an inability to leverage relevant social context. Even when explicitly told relevant social context—that the author is an ingroup member—the tested models decreased in F1 by, on average, 19.8%. This implies that these models further marginalize queer users by wrongly labeling their posts as harmful and removing them from online discussions.

This work emphasizes the need to consider gender minority groups in the creation and deployment of large language models for harm detection. Previous work has found that toxicity detection performed by language models struggles to understand minority dialects, due in large part to spurious correlations (Xu et al., 2021). A possible strategy for removing these spurious correlations may involve supplementing datasets with ingroup language of historically marginalized groups.

We hope to build upon this work by retrieving annotations for QueerReclaimLex from popular annotation channels like Amazon Mechanical Turk. This could help clarify the perspectives which underlay the annotations used to train machine learning classifiers. Additionally, we are interested in analyzing model performance on these templates but with neutral identity terms (e.g., ‘gay’) rather than slurs. At a more mundane level, we hope to replicate this work by using a larger number of annotators from the queer community, to help gain a sense of the number of participants from marginalized communities whose input is needed to assess LLM model performance. Finally, we hope to experiment with LLM alignment, the task of training an LLM to align with some specific identity, to understand how models make sense of LGBTQ+ communities.

Ethics and Limitations

Template Format. Template datasets often capture non-natural language. In this work, we deliberately create templates which are based on real-life speech. Template datasets are inherently limited to a finite set of sentence structures. Unfortunately, due to the manual labor required in compiling natural language templates, the dataset we curate is fairly small. A valuable avenue for future work lies in expanding the dataset to include additional templates.

Conceptualization of Gender. In this work, we group different gender-queer identities together under the umbrella term ‘gender-queer’ and treat them as somewhat synonymous. This is a flawed conceptualization. Though gender-queer identities (e.g. ‘non-binary’, ‘transgender’) are not mutually exclusive, many individuals identify with only a subset of label(s). However, given the limited personally-expressed data for people with non-socially normative gender identities, we see this work as an opportunity to improve content moderation for many gender-queer communities. As the concept of gender continues to shift and grow, we look forward to seeing how research moves with it.

Subjectivity of Harm. Gold labels for harm are inherently affected by the subjectivity of the task. To promote reliability of annotations, we analyze model performance only on those instances with some consistency between annotations. In the future we would like to expand the number of annotators assigned to each instance.

Generalization. In this work we highlight how an aspect of gender-queer dialect is treated by one function of large language models. We recognize that these findings may not generalize towards other model functions, as the slightest perturbations in large language model prompts have been shown to alter classification scores (Salinas and Morstatter, 2024). Additionally, these findings may not generalize towards other historically marginalized communities. We would be interested to see how model harm scores vary on dialect aspects from other groups.

English Only. In this work we only consider English tweets and pronouns. Obviously, most all languages contain speakers with gender-queer identities. We hope to see this work repeated with languages other than English.

Dataset Availability. The dataset is available on GitHub at https://github.com/rebedorn/QueerReclaimLex, along with its corresponding data sheet.

Acknowledgements

We sincerely thank our annotators, including Umut Pajaro Velasquez as well as those who wish to remain anonymous, without whom this work could not exist. We also thank the QueerinAI community for their valuable insights and support in shaping this work. Thank you to Negar Mokhberian for providing valuable feedback throughout the process. Additionally, we would like to thank the SoCalNLP conference participants for their helpful suggestions.

References

  • (1)
  • Anderson and Lepore (2013a) Luvell Anderson and Ernie Lepore. 2013a. Slurring words. Noûs 47, 1 (2013), 25–48.
  • Anderson and Lepore (2013b) Luvell Anderson and Ernie Lepore. 2013b. What did you call me? Slurs as prohibited words setting things up. Analytic Philosophy 54, 3 (2013), 350–63.
  • Anzani et al. (2021) Annalisa Anzani, Louis Lindley, Giacomo Tognasso, M Paz Galupo, and Antonio Prunas. 2021. “Being talked to like I was a sex toy, like being transgender was simply for the enjoyment of someone else”: Fetishization and sexualization of transgender and nonbinary individuals. Archives of Sexual Behavior 50, 3 (2021), 897–911.
  • Baim (2015) Tracy Baim. 2015. Chicago Dyke March. , 33 pages. Copyright - Copyright Windy City Media Group Jul 1, 2015; Document feature - Photographs; Last updated - 2023-09-25.
  • Baker (2013) Paul Baker. 2013. From gay language to normative discourse: A diachronic corpus analysis of Lavender Linguistics conference abstracts 1994–2012. Journal of Language and Sexuality 2, 2 (2013), 179–205.
  • Baumler and Rudinger (2022) Connor Baumler and Rachel Rudinger. 2022. Recognition of They/Them as Singular Personal Pronouns in Coreference Resolution. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 3426–3432. https://doi.org/10.18653/v1/2022.naacl-main.250
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
  • Chen et al. (2024) Yi Chen, Manming Fang, Yi Zhao, and Zibo Zhao. 2024. RECOVERING OVERLOOKED INFORMATION IN CATEGORICAL VARIABLES WITH LLMS: AN APPLICATION TO LABOR MARKET MISMATCH. National Bureau of Economic Research (2024).
  • Davidson et al. (2019) Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial Bias in Hate Speech and Abusive Language Detection Datasets. In Proceedings of the Third Workshop on Abusive Language Online. Association for Computational Linguistics, Florence, Italy, 25–35. https://doi.org/10.18653/v1/W19-3504
  • Devinney et al. (2022) Hannah Devinney, Jenny Björklund, and Henrik Björklund. 2022. Theories of “Gender” in NLP Bias Research. arXiv:2205.02526 (May 2022). http://arxiv.org/abs/2205.02526 arXiv:2205.02526 [cs].
  • Dias Oliva et al. (2021) Thiago Dias Oliva, Dennys Marcelo Antonialli, and Alessandra Gomes. 2021. Fighting Hate Speech, Silencing Drag Queens? Artificial Intelligence in Content Moderation and Risks to LGBTQ Voices Online. Sexuality and Culture 25, 2 (Apr 2021), 700–732. https://doi.org/10.1007/s12119-020-09790-w
  • Diaz et al. (2022) Mark Diaz, Razvan Amironesei, Laura Weidinger, and Iason Gabriel. 2022. Accounting for Offensive Speech as a Practice of Resistance. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics, Seattle, Washington (Hybrid), 192–202. https://doi.org/10.18653/v1/2022.woah-1.18
  • Dorn et al. (2023) Rebecca Dorn, Julie Jiang, Jeremy Abramson, and Kristina Lerman. 2023. Non-Binary Gender Expression in Online Interactions. arXiv preprint arXiv:2303.04837 (2023).
  • Edmondson (2021) Daniel Edmondson. 2021. Word norms and measures of linguistic reclamation for LGBTQ+ slurs. Pragmatics and Cognition 28, 1 (Dec 2021), 193–221. https://doi.org/10.1075/pc.00023.edm
  • Flores and Conron (2023) Andrew Flores and Kerith Conron. 2023. Adult LGBT Population in the United States. Williams Institute (2023).
  • Fortuna et al. ([n. d.]) Paula Fortuna, Juan Soler, and Leo Wanner. [n. d.]. Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets. ([n. d.]).
  • Gilbert (2020) Aster Gilbert. 2020. Sissy Remixed: Trans* Porno Remix and Constructing the Trans* Subject. Transgender Studies Quarterly 7, 2 (2020), 222–236.
  • Haimson et al. (2021) Oliver L. Haimson, Daniel Delmonaco, Peipei Nie, and Andrea Wegner. 2021. Disproportionate Removals and Differing Content Moderation Experiences for Conservative, Transgender, and Black Social Media Users: Marginalization and Moderation Gray Areas. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (Oct 2021), 1–35. https://doi.org/10.1145/3479610
  • Harris et al. (2023) Camille Harris, Amber Gayle Johnson, Sadie Palmer, Diyi Yang, and Amy Bruckman. 2023. “Honestly, I Think TikTok has a Vendetta Against Black Creators”: Understanding Black Content Creator Experiences on TikTok. Proceedings of the ACM on Human-Computer Interaction 7, CSCW2 (Sept. 2023), 1–31. https://doi.org/10.1145/3610169
  • Herrmann et al. (2023) Lena Herrmann, Carola Bindt, Sarah Hohmann, and Inga Becker-Hebly. 2023. Social media use and experiences among transgender and gender diverse adolescents. International Journal of Transgender Health (2023), 1–14.
  • Hess (2020) Leopold Hess. 2020. Practices of Slur Use. Grazer Philosophische Studien 97, 1 (Mar 2020), 86–105. https://doi.org/10.1163/18756735-09701006
  • Hicks et al. ([n. d.]) Amanda Hicks, Michael Rutherford, Christiane Fellbaum, and Jiang Bian. [n. d.]. An Analysis of WordNet’s Coverage of Gender Identity Using Twitter and The National Transgender Discrimination Survey. ([n. d.]).
  • Hom (2008) Christopher Hom. 2008. The semantics of racial epithets. The Journal of Philosophy 105, 8 (2008), 416–440.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
  • Jiang et al. (2022) Julie Jiang, Emily Chen, Luca Luceri, Goran Murić, Francesco Pierri, Ho-Chun Herbert Chang, and Emilio Ferrara. 2022. What are Your Pronouns? Examining Gender Pronoun Usage on Twitter. arXiv:2207.10894 (Oct. 2022). http://arxiv.org/abs/2207.10894 arXiv:2207.10894 [cs].
  • Joseph ([n. d.]) John E Joseph. [n. d.]. Historical perspectives on language and identity. ([n. d.]).
  • Kurrek et al. (2020) Jana Kurrek, Haji Mohammad Saleem, and Derek Ruths. 2020. Towards a Comprehensive Taxonomy and Large-Scale Annotated Corpus for Online Slur Usage. In Proceedings of the Fourth Workshop on Online Abuse and Harms. Association for Computational Linguistics, Online, 138–149. https://doi.org/10.18653/v1/2020.alw-1.17
  • Lees et al. (2022) Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. 2022. A new generation of perspective api: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3197–3207.
  • Malmasi and Zampieri (2018) Shervin Malmasi and Marcos Zampieri. 2018. Challenges in discriminating profanity from hate speech. Journal of Experimental and Theoretical Artificial Intelligence 30, 2 (March 2018), 187–202. https://doi.org/10.1080/0952813X.2017.1409284
  • McInroy et al. (2019) Lauren B McInroy, Rebecca J McCloskey, Shelley L Craig, and Andrew D Eaton. 2019. LGBTQ+ youths’ community engagement and resource seeking online versus offline. Journal of Technology in Human Services 37, 4 (2019), 315–333.
  • McKinnon (2017) Sean McKinnon. 2017. “Building a thick skin for each other”: The use of ‘reading’ as an interactional practice of mock impoliteness in drag queen backstage talk. Journal of Language and Sexuality 6, 1 (Jun 2017), 90–127. https://doi.org/10.1075/jls.6.1.04mck
  • Monro (2019) Surya Monro. 2019. Non-binary and genderqueer: An overview of the field. International Journal of Transgenderism 20, 2-3 (2019), 126–131.
  • Namaste (2000) Viviane Namaste. 2000. Invisible lives: The erasure of transsexual and transgendered people. University of Chicago Press.
  • Ovalle et al. (2023) Anaelia Ovalle, Palash Goyal, Jwala Dhamala, Zachary Jaggers, Kai-Wei Chang, Aram Galstyan, Richard Zemel, and Rahul Gupta. 2023. “I’m fully who I am”: Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation. In 2023 ACM Conference on Fairness, Accountability, and Transparency. 1246–1266. https://doi.org/10.1145/3593013.3594078 arXiv:2305.09941 [cs].
  • Pamungkas et al. (2023) Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2023. Investigating the role of swear words in abusive language detection tasks. Language Resources and Evaluation 57, 1 (Mar 2023), 155–188. https://doi.org/10.1007/s10579-022-09582-8
  • Ramesh et al. (2022) Krithika Ramesh, Sumeet Kumar, and Ashiqur Khudabukhsh. 2022. Revisiting Queer Minorities in Lexicons. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics, Seattle, Washington (Hybrid), 245–251. https://doi.org/10.18653/v1/2022.woah-1.23
  • Salinas and Morstatter (2024) Abel Salinas and Fred Morstatter. 2024. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. arXiv preprint arXiv:2401.03729 (2024).
  • Sayers (2023) William Sayers. 2023. The Etymology of Dyke and Bull-dyke. ANQ: A Quarterly Journal of Short Articles, Notes and Reviews 36, 3 (2023), 307–308. https://doi.org/10.1080/0895769X.2021.1980717
  • Scheuerman et al. (2021) Morgan Klaus Scheuerman, Madeleine Pape, and Alex Hanna. 2021. Auto-essentialization: Gender in automated facial analysis as extended colonial project. Big Data Society 8, 2 (2021), 20539517211053712.
  • Selkie et al. (2020) Ellen Selkie, Victoria Adkins, Ellie Masters, Anita Bajpai, and Daniel Shumer. 2020. Transgender adolescents’ uses of social media for social support. Journal of Adolescent Health 66, 3 (2020), 275–280.
  • Shahid and Vashistha (2023) Farhana Shahid and Aditya Vashistha. 2023. Decolonizing Content Moderation: Does Uniform Global Community Standard Resemble Utopian Equality or Western Power Hegemony?. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. ACM, Hamburg Germany, 1–18. https://doi.org/10.1145/3544548.3581538
  • Sturaro et al. (2023) Samuel Sturaro, Caterina Suitner, and Fabio Fasoli. 2023. When is Self-Labeling Seen as Reclaiming? The Role of User and Observer’s Sexual Orientation in Processing Homophobic and Category Labels’ use. Journal of Language and Social Psychology 42, 4 (Sep 2023), 464–475. https://doi.org/10.1177/0261927X231173147
  • Sun et al. (2021) Tony Sun, Kellie Webster, Apu Shah, William Yang Wang, and Melvin Johnson. 2021. They, Them, Theirs: Rewriting with Gender-Neutral English. arXiv:2102.06788 (Feb. 2021). http://arxiv.org/abs/2102.06788 arXiv:2102.06788 [cs].
  • Tabaac et al. (2018) Ariella Tabaac, Paul B Perrin, and Eric G Benotsch. 2018. Discrimination, mental health, and body image among transgender and gender-non-binary individuals: constructing a multiple mediational path model. Journal of gay lesbian social services 30, 1 (2018), 1–16.
  • Tay et al. (2022) Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. 2022. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. arXiv:2106.12672 [cs.CL]
  • Thelwall et al. (2021) Mike Thelwall, Saheeda Thelwall, and Ruth Fairclough. 2021. Male, Female, and Nonbinary Differences in UK Twitter Self-descriptions: A Fine-grained Systematic Exploration. Journal of Data and Information Science 6, 2 (April 2021), 1–27. https://doi.org/10.2478/jdis-2021-0018
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Vytniorgu (2023) Richard Vytniorgu. 2023. Effeminate gay bottoms in the West: Narratives of pussyboys and boiwives on tumblr. Journal of Homosexuality 70, 10 (2023), 2113–2134.
  • Wang et al. (2014) Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2014. Cursing in English on twitter. In Proceedings of the 17th ACM conference on Computer supported cooperative work and social computing. ACM, Baltimore Maryland USA, 415–425. https://doi.org/10.1145/2531602.2531734
  • Waseem et al. (2017) Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics, Vancouver, BC, Canada, 78–84. https://doi.org/10.18653/v1/W17-3012
  • Waseem and Hovy (2016) Zeerak Waseem and Dirk Hovy. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In Proceedings of the NAACL Student Research Workshop. Association for Computational Linguistics, San Diego, California, 88–93. https://doi.org/10.18653/v1/N16-2013
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
  • Worthen (2020) Meredith GF Worthen. 2020. Queers, bis, and straight lies: An intersectional examination of LGBTQ stigma. Routledge.
  • Xu et al. (2021) Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, and Dan Klein. 2021. Detoxifying language models risks marginalizing minority voices. arXiv preprint arXiv:2104.06390 (2021).
  • Yin and Zubiaga (2022) Wenjie Yin and Arkaitz Zubiaga. 2022. Hidden behind the obvious: Misleading keywords and implicitly abusive language on social media. Online Social Networks and Media 30 (July 2022), 100210. https://doi.org/10.1016/j.osnem.2022.100210
  • Zhao et al. (2021) Zhixue Zhao, Ziqi Zhang, and Frank Hopfgartner. 2021. SS-BERT: Mitigating Identity Terms Bias in Toxic Comment Classification by Utilising the Notion of" Subjectivity" and" Identity Terms". arXiv preprint arXiv:2109.02691 (2021).
  • Zhou et al. (2023) Xuhui Zhou, Hao Zhu, Akhila Yerukola, Thomas Davidson, Jena D. Hwang, Swabha Swayamdipta, and Maarten Sap. 2023. COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements. arXiv:2306.01985 (June 2023). http://arxiv.org/abs/2306.01985 arXiv:2306.01985 [cs].

Appendix

Changes made to offensive definition. We change the definition of toxicity as detailed in (Waseem and Hovy, 2016) in the following ways:

  • The definition is separated into two parts to account for speaker identity’s influence on whether something is derogatory.

  • The enumeration of oppressive systems is generalized.

  • The straw man argument is consolidated with ‘criticizing a minority without a well-founded argument’.

  • A harassment clause is added. The definition of harassment was altered from Cornell Law School’s Legal Information Institute: "when [someone] intentionally and repeatedly harasses another person by by following such person in or about a public place or places or by engaging in a course of conduct or by repeatedly committing acts which places such person in reasonable fear of physical injury" 888https://www.law.cornell.edu/wex/harassment. We leave out intent and repetition due to the inability of portrayal for singular social media posts.

Changes made to SLUR USAGE taxonomy. We update the taxonomy in the following ways:

  • Removal of overarching categories that describe whether subcategories are offensive

  • Creation of a new category for discussions of identity

  • Expansion of the direct quote category to include paraphrases

  • Expansion of the definition of sexualization to include non-demeaning