Subtask 2

IMPORTANT: please note that there is a closed and an open track for this subtask!

In subtask 2, the goal is to predict, for each text in a dataset, a distribution that is derived from the original distribution of labels assigned by several human annotators.
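As a rough illustration of what a distribution over annotator labels can look like, the following sketch turns one text's annotator labels into relative frequencies. It assumes that the target is the plain share of annotators per label, which may differ from the organizers' exact derivation. The label names and the helper `label_distribution` are hypothetical placeholders.

```python
from collections import Counter


def label_distribution(labels, classes):
    """Return the relative frequency of each class among the given labels."""
    if not labels:
        raise ValueError("at least one annotator label is required")
    counts = Counter(labels)
    total = len(labels)
    # Classes that no annotator chose get probability 0.0.
    return {c: counts[c] / total for c in classes}


# Hypothetical label names; the real label set is defined by the task.
classes = ["0", "1", "2", "3", "4"]

# Four annotators labelled the same text.
dist = label_distribution(["0", "0", "1", "3"], classes)
print(dist)
```

The resulting values sum to 1, so they can be read as a probability distribution over the labels.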

The human annotators assigned (according to the annotation guidelines) the strength of misogyny/sexism present in the given text via the following labels:

While the annotation guidelines define what kind of sexism/misogyny should be annotated, no attempt has been made to give rules for deciding on the strength. For this reason, if an annotator decided that sexism/misogyny is present in a text, the strength assigned is a matter of personal judgement.

The distributions to predict in subtask 2 are

Data

For the trial phase of subtask 2, we provide a small dataset, containing

For the development phase of subtask 2, we provide all participants with the following data:

For the competition phase of subtask 2, we provide

All five files are in JSONL format (one JSON-serialized object per line), where each object is a dictionary with the following fields:

You can download the data for each phase as soon as the corresponding phase starts.
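To illustrate the JSONL format described above, here is a minimal reading sketch. The field names `id` and `text` in the sample are assumptions made for illustration only, and `read_jsonl` is a hypothetical helper, not part of any provided tooling.

```python
import io
import json


def read_jsonl(stream):
    """Parse one JSON object per non-empty line of a text stream."""
    records = []
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# In-memory stand-in for a downloaded file; field names are hypothetical.
sample = io.StringIO(
    '{"id": "a1", "text": "first example"}\n'
    '{"id": "a2", "text": "second example"}\n'
)
records = read_jsonl(sample)
print(len(records), records[0]["id"])
```

For a real file, the same function can be called on a handle opened with `open(path, encoding="utf-8")`.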

Submission

Your submission must be a file in TSV (tab separated values) format which contains the following columns in any order:

Note that how you derive those values is up to you (as long as the rules for the closed or open track are followed):

To submit your predictions to the competition:

Submission errors and warnings

Phases

Evaluation

System performance on subtask 2 is evaluated using the Jensen-Shannon distance for both (i) the prediction of the binary distribution and (ii) the prediction of the multi score distribution. We chose the Jensen-Shannon distance because it is a standard way of measuring how far apart two probability distributions are, and because it is a proper distance metric that lies between 0 and 1 when computed with base-2 logarithms. It is the square root of the Jensen-Shannon divergence, which is based on the Kullback-Leibler divergence.

The overall score used for ranking the submissions is calculated as the unweighted average of the two JS distances.
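The evaluation described above can be sketched as follows for a single text. This is an illustration under stated assumptions, not the official scorer: it uses base-2 logarithms so that the distance lies between 0 and 1, and it does not show how scores are aggregated over all texts. The function names and the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical gold and predicted distributions for one text.
binary_dist = js_distance([0.5, 0.5], [0.5, 0.5])             # identical
multi_dist = js_distance([1, 0, 0, 0, 0], [0, 1, 0, 0, 0])    # disjoint

# Unweighted average of the two JS distances.
overall = (binary_dist + multi_dist) / 2
print(binary_dist, multi_dist, overall)
```

Identical distributions give a distance of 0 and distributions with no overlap give 1, so lower values indicate better predictions.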