
I just can't with this one

As scientists, we have ethical responsibilities related to the uses of our work, and sometimes these are not entirely obvious. We have to at least pretend to think about them. If our work is “basic” research, we can sometimes claim moral ambiguity (whether such claims are justified is an empirical question). If our work is “applied”, we have little cover.

When I read articles about uncritical applications of “machine learning” to “social issues”, I just can’t handle it. Kudos to the engineer who pushed back against the presenter (he sounds very good) and the others who questioned it. Regardless of the application, any uncritical use of existing labels is going to reproduce the biases of the system that produced them. When dealing with something as fraught as gang membership, especially when using geography as one of the few inputs (understand the context of your application, even just a little!), doing so is almost certainly going to reinforce systemic biases. This does not require special knowledge of the field (though it helps); it is part of our training as statisticians or data scientists or whatever else we want to call people creating these models. We are supposed to understand biased labels and consider the cost of misclassification even before we consider the specific details of the application.
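The label-bias point is easy to see in a toy simulation. The sketch below is purely illustrative and has nothing to do with the actual system discussed above: the neighborhoods, rates, and variable names are all invented. The underlying behavior has the same base rate everywhere, but the recorded label depends on how heavily each neighborhood is surveilled, and a model trained on those labels with geography as a feature dutifully reproduces the disparity.

```python
# Toy sketch (invented numbers): labels recorded under uneven surveillance
# get reproduced by a model that uses geography as a feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Two neighborhoods with the SAME underlying rate of the behavior we care about.
neighborhood = rng.integers(0, 2, size=n)      # 0 = lightly policed, 1 = heavily policed
true_behavior = rng.random(n) < 0.10           # identical 10% base rate everywhere

# The *label* is "was recorded in the system", which depends on surveillance,
# not just behavior: heavy policing records 80% of cases, light policing 20%.
detection_rate = np.where(neighborhood == 1, 0.8, 0.2)
label = true_behavior & (rng.random(n) < detection_rate)

# Fit a model on the biased labels, with geography as the only feature.
X = neighborhood.reshape(-1, 1)
model = LogisticRegression().fit(X, label)

# The model "learns" that neighborhood 1 is about four times riskier, even
# though the underlying behavior is identical; the label bias is now baked in.
print(model.predict_proba([[0], [1]])[:, 1])   # roughly [0.02, 0.08]
```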

However, as much as I try to avoid it (numbers are fun and learning is boring! That is false, of course; I’m a humanist who only does numbers because I’m better at it), whenever I have consulted as a statistician on a project, I have had to learn something about the data-generating process and the context of the application in order to do my work effectively. So I just can’t get past this kind of blindness. Perhaps I have a poor imagination or perhaps I’m doing too much moral grandstanding here. We will all be tested and we will all fail, hopefully only in minor ways. But, dear Lord, we have to at least try.