Literate programming environments like Jupyter and R Markdown notebooks, coupled with easy-to-use languages like Python and R, put a plethora of statistical methods right at a data analyst’s fingertips. But are these methods being used correctly? Statistical methods make statistical assumptions about the samples being analyzed, and in many cases they produce reasonable-looking results even when those assumptions are not met.
We propose an approach in which library developers annotate functions with statistical assumptions. The approach phrases these assumptions as hypotheses about the data and inserts hypothesis tests investigating the likelihood that each assumption is met. As a proof of concept, we implement this approach in two tools: prob-check-py for Python, and prob-check-r for R. To evaluate these tools, we identify common hypothesis testing and statistical modeling functions in Python and R, annotate them with the relevant statistical assumptions, and run 128 Kaggle notebooks that use those methods to identify misuses. Our investigation reveals that at least one statistical assumption was violated in 84.38% of surveyed notebooks, and that assumptions were violated in 53.36% of calls to annotated functions. Moreover, had the appropriate hypothesis testing method been chosen given the characteristics of the data, a different conclusion would have been drawn in 11.51% of cases.
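To make the idea of annotating a function with an assumption more concrete, here is a minimal sketch using only the Python standard library. It is not the prob-check-py API, which the abstract does not show: the decorator name `assumes_normal`, the helper `jarque_bera_pvalue`, and the example function `welch_t` are all hypothetical. The sketch phrases "each sample is normally distributed" as a null hypothesis, tests it with an asymptotic Jarque-Bera test (an approximation that is rough for small samples), and warns when the hypothesis is rejected.

```python
import functools
import math
import statistics
import warnings


def jarque_bera_pvalue(sample):
    """Asymptotic p-value of the Jarque-Bera normality test.

    Assumes a non-constant sample; a constant one has zero variance.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Central moments (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Under normality, JB is approximately chi-square with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    return math.exp(-jb / 2)


def assumes_normal(alpha=0.05):
    """Hypothetical annotation: every positional argument is a sample
    that the wrapped function assumes to be normally distributed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*samples, **kwargs):
            for i, sample in enumerate(samples):
                p = jarque_bera_pvalue(sample)
                if p < alpha:
                    warnings.warn(
                        f"{func.__name__}: normality rejected for "
                        f"argument {i} (p={p:.3g})"
                    )
            # The wrapped function still runs; the check only reports.
            return func(*samples, **kwargs)
        return wrapper
    return decorator


@assumes_normal(alpha=0.05)
def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )
```

Calling `welch_t` on a symmetric sample such as `[-2, -1, -1, 0, 0, 0, 1, 1, 2]` passes silently, while passing a strongly skewed sample such as `[2 ** k for k in range(20)]` triggers a warning. Issuing a warning rather than raising an exception is a design choice of this sketch, so that an existing notebook keeps running while the violation is reported.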
Wed 25 Jun. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
Cosmos 3C is the third room in the Cosmos 3 wing.
When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.