Understanding the Low Inter-Rater Agreement on Aggressiveness on the Linux Kernel Mailing List
Technical, source-code-related communication among software developers plays an essential role in open-source software (OSS) projects. Unsurprisingly, previous studies have shown that the conversational tone and, in particular, aggressiveness influence the participation of developers in OSS projects. We therefore set out to study aggressive behavior on the Linux Kernel Mailing List (LKML), which is known for technical, source-code-related discussions and for the aggressiveness of some of its contributors. To this end, we conducted a human annotation study in which multiple annotators rated the aggressiveness of 720 e-mails from the LKML, with the goal of selecting a suitable sentiment analysis tool. The results of our annotation study revealed substantial disagreement among the human annotators, which uncovers a deeper methodological challenge of studying aggressiveness (and emotions, in general) in the software-engineering domain. Adjusting our focus, we investigated why the agreement among humans is generally low, based on a manual investigation of ambiguously rated technical e-mails. Our results illustrate that human perception is individual and context-dependent. Although we identified multiple potential causes for disagreement using an open coding approach, we did not find a general theme besides the fact that different individuals may perceive aggressiveness in technical discussions differently. Thus, when identifying aggressiveness in software-engineering texts, it is not sufficient to rely on aggregated measures of human annotations. Consequently, sentiment analysis tools specifically trained on human-annotated data do not necessarily match human perception of aggressiveness, and corresponding results need to be taken with a grain of salt.
Our findings suggest that research in the software-engineering domain needs to differentiate between specific forms of aggressiveness that can be identified with less ambiguity and that depend less on the personality and context of the person rating a text. By reporting our results and experience, we want to draw attention to the methodological challenge of studying aggressiveness (and sentiment, in general) in the software-engineering domain; addressing this challenge should become an important part of future research.