Insight Culture: Detecting homophobic speech patterns in under-resourced languages online

Submitted on Monday, 16/09/2024

The anonymity and social distance provided by internet platforms have created a safe space for individuals to abuse others with impunity.

Homophobia and transphobia refer to abusive, hateful, threatening or discriminatory acts that target gay, lesbian, bisexual or transgender individuals. Identifying such acts on social media is known as homophobia and transphobia detection.

Identifying homophobic and transphobic content for under-resourced languages is a challenging task.

Natural language processing (NLP) techniques are increasingly being used to automate the detection of abusive language online. Strategies for automatic homophobia and transphobia detection in social media text have proliferated recently. However, working with social media text is challenging because people mix languages and use spellings and words that may not be found in any standard dictionary.
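To make the spelling problem concrete, here is a minimal Python sketch. It is not taken from the paper: the example words and the helper names `char_ngrams` and `jaccard` are illustrative assumptions. It shows why an exact dictionary lookup misses an informal spelling, while a comparison based on character sequences still registers the similarity.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical pair: a dictionary spelling and an informal social media variant.
standard = "please"
variant = "pleease"

# Treated as whole words, the two spellings share nothing.
word_overlap = jaccard({standard}, {variant})

# Treated as character trigrams, most of the pieces still match.
char_overlap = jaccard(char_ngrams(standard), char_ngrams(variant))

print(f"word-level overlap: {word_overlap:.3f}")
print(f"character-level overlap: {char_overlap:.3f}")
```

In this toy case the word-level overlap is zero, while the character-level overlap stays well above zero. This is one common reason text classifiers for noisy, multilingual text often rely on sub-word features.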

This is especially true for under-resourced languages like Tamil, Malayalam and Hindi, where LGBTQ+ issues are often treated as taboo within the culture itself. Although millions of people speak these languages, the tools and resources needed to build robust NLP applications for them remain underdeveloped.

A paper published in the International Journal of Data Science and Analytics in December 2023 by Insight member Bharathi Raja Chakravarthi of University of Galway presents a new high-quality dataset for detecting homophobia and transphobia in Malayalam and Hindi.

The dataset consists of 5,193 comments in Malayalam and 3,203 comments in Hindi. The team also reported experiments with traditional machine learning and transformer-based deep learning models on the Malayalam, Hindi, English, Tamil and Tamil-English datasets.

The work was also published in the Natural Language Processing Journal.
