Tomas Tokar 1, Igor Jurisica 1, 2, 3
1 Krembil Research Institute, University Health Network, Toronto, ON, Canada
2 Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
3 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
Over the last few years, the application of machine learning in healthcare has started to gain broader acceptance. This area of artificial intelligence is expected to revolutionize medicine by automating diagnostics and optimizing therapeutic procedures, increasing the quality of care while reducing costs. While there are numerous examples of machine learning algorithms significantly outperforming expert physicians in terms of diagnostic accuracy or therapeutic adequacy, their use in clinical practice remains rare.
One of the major reasons is that it is difficult to obtain an unbiased estimate of the reliability of each individual prediction. In many sectors, machine learning algorithms’ performance is assessed across a large number of past predictions and is assumed to be indicative of all future predictions. However, in medicine (and other risk-sensitive applications) this frequentist approach is not sufficient, as it fails to assess how much we can rely on a given individual prediction.
Here, we summarize several distinct methods for evaluating the reliability of individual predictions. We also demonstrate their application across a set of publicly available medical datasets. This work highlights the need for novel industry standards for assessing the performance of machine learning algorithms, designed specifically for application in medicine.