Project Overview

This project set out to analyse the time users spend in the discussion fora attached to articles on derStandard.at, and how that dwell time relates to the user-generated content (forum postings) on those pages. One aim was to understand how various linguistic and semantic aspects of the user-generated content might influence the time users spend in the fora, either following the discussions or posting themselves.

A particular focus lay on female posters. While the share of female and male readers of derStandard.at is nearly equal (45% to 55%), there is a large gender gap among active posters: only 20% of those who write in the fora are female. Important goals of the project were therefore to find reasons for this gender disparity and to take measures that encourage female contributions. On the one hand, methods of data science and natural language processing were employed to identify correlations between forum dwell time and linguistic and semantic properties of forum contributions, including gender-fair language use. On the other hand, forum users were asked in a survey what encourages or hinders them from actively taking part in forum discussions. Both regular male and female posters stated that a constructive discussion atmosphere is important for them to post, and both gender groups stated that they become reluctant to post after seeing harsh, embarrassing, and derogatory replies to other people's posts. This silencing effect was larger for women than for men; in this respect, misogynist postings are a factor that hinders women from taking part in forum discussions.

To support forum moderators in counteracting misogyny in forum discussions, a classifier was developed and trained on examples of misogynist and non-sexist postings from the fora. In a first step, annotation guidelines were developed, taking into account work from gender studies, current research on machine-learning-based sexism classification, and existing guidelines for forum moderation at derStandard. Based on these guidelines, 6600 postings (plus 1400 in a second round) were independently annotated by three trained annotators (from a pool of 8 annotators, male and female, forum moderators and a language technology expert) as non-misogynist (0) or mildly (1) to severely (4) misogynist. See an example of the annotation interface. At the end of the process, each posting carried three labels, one per annotator. This approach reflects the inherent fuzziness and a certain level of subjectivity in judging whether an utterance is sexist. Data set and machine learning model were developed in parallel in a bootstrapping approach in which the data set was enlarged and the model training improved step by step (a deepset/gbert Transformer model with a custom classification head, trained for binary labels (misogynist/non-misogynist) and multi-class labels (0 … 4)). A scientific article discussing data sampling and the details of the classifier development is available here.
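To illustrate how three per-annotator severity labels can feed both the binary and the multi-class training targets, here is a minimal sketch. The aggregation scheme (majority vote for the binary label, median severity for the multi-class label) is an assumption for illustration; the project's actual aggregation method is described in the linked article.

```python
# Hypothetical aggregation of the three annotator labels per posting:
# 0 = non-misogynist, 1-4 = mildly to severely misogynist.
# Majority voting and median severity are illustrative assumptions,
# not the project's documented procedure.
from statistics import median


def aggregate_labels(labels: list[int]) -> tuple[int, int]:
    """Return (binary_label, multiclass_label) for one posting.

    binary_label: 1 if at least two of the three annotators saw
                  any misogyny (label > 0), else 0.
    multiclass_label: median severity, rounded to an integer in 0..4.
    """
    assert len(labels) == 3 and all(0 <= lab <= 4 for lab in labels)
    binary = int(sum(1 for lab in labels if lab > 0) >= 2)  # majority vote
    severity = int(round(median(labels)))                   # median severity
    return binary, severity


print(aggregate_labels([0, 0, 0]))  # (0, 0): all agree non-misogynist
print(aggregate_labels([1, 2, 0]))  # (1, 1): two of three see mild misogyny
print(aggregate_labels([4, 4, 3]))  # (1, 4): clear, severe case
```

Keeping all three raw labels alongside the aggregate, as the project did, preserves the annotator disagreement that the aggregation step necessarily discards.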

The classifier, integrated into the moderation interface, is a useful support tool for the everyday work of the moderators, as it helps them (i) identify potentially misogynist postings within the large stream of postings written by forum users, and (ii) be alerted when the discussion in a particular forum becomes increasingly misogynist. Thus, they can direct their moderation efforts towards supporting a constructive discussion. This potential earned the classifier the renowned media prize “Medienlöwe” (article in German).

Project participants:

This project is supported through the Call FemPower IKT 2018.