As always, your clarity is fantastic - and scares me. We were already complacent about bias-reinforcing, over-simplified statistics, and it will just get worse by the looks of it. Keep up the good work!
Okay, now I feel dumb. I had assumed that the presented output from these probability models would be related to the expected value over a reasonably large number of model runs.
Students would fail graduate courses for skipping this step in cost-effectiveness modelling.
Is it too computationally expensive to do this?
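For what it's worth, the averaging in question is cheap to sketch, if not cheap to run at scale. A minimal Python illustration, where `query_model` is an invented stand-in for a temperature > 0 model call (every name here is hypothetical, not a real API):

```python
import random
import statistics

def query_model(prompt: str) -> float:
    """Hypothetical stand-in for a stochastic model call. A real version
    would hit an LLM API with temperature > 0; here we just simulate a
    noisy numeric answer around a 'true' value of 42."""
    return 42.0 + random.gauss(0, 3)

def expected_answer(prompt: str, n_runs: int = 100) -> tuple[float, float]:
    """Estimate the expected value of the model's answer over n_runs
    independent runs, plus the standard error of that estimate."""
    samples = [query_model(prompt) for _ in range(n_runs)]
    mean = statistics.fmean(samples)
    se = statistics.stdev(samples) / n_runs ** 0.5
    return mean, se

mean, se = expected_answer("What is the answer?", n_runs=200)
print(f"estimate: {mean:.2f} +/- {2 * se:.2f} (roughly a 95% interval)")
```

The cost is simply linear in the number of runs - 200 samples means 200 inference calls per question - which is presumably part of the practical objection.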
I suspect it’s partly cultural - a lot of AI engineering is about maximising model performance against a particular metric, rather than estimating a value with appropriate uncertainty.
As you say in your article, there is a well-developed set of statistical tools for addressing this. I wonder whether outputs should include some 'stability' index with each 'answer', based on one or more of these. Perhaps users could include requests for this in their prompts, rather than waiting for the companies to provide them.
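One crude version of such a stability index could be approximated today just by re-prompting: re-ask the same question many times and report how often the modal answer recurs. A hedged sketch, with `query_model` again an invented stand-in rather than any real API:

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a stochastic model call; here it just
    draws from a fixed toy answer distribution."""
    return random.choice(["A", "A", "A", "B"])

def answer_with_stability(prompt: str, n_runs: int = 50) -> tuple[str, float]:
    """Ask the same question n_runs times; return the modal answer plus a
    simple stability score: 1.0 means every run agreed, values near 1/k
    mean the answers were spread roughly evenly over k alternatives."""
    answers = [query_model(prompt) for _ in range(n_runs)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n_runs

answer, stability = answer_with_stability("Pick A or B")
print(f"answer: {answer}, stability index: {stability:.2f}")
```

Agreement rate is the bluntest possible instrument - entropy over the answer distribution would be a natural refinement - but even this would flag unstable answers.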
This reminds me of the early days of simulation research in the 1980s/1990s. Some researchers were finding odd, idiosyncratic results until they realized that some compilers shipped very rudimentary random number generators: fixed sequences that only appeared random, and that repeated if you ran a large enough number of replications or cycles. Some projects ended up with completely spurious results that depended largely on the random number seed because of this repetition. (This was a bit before my time, but I remember my thesis advisors warning me not to use the random numbers from certain compilers.)
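That failure mode is easy to reproduce with a toy low-period linear congruential generator (the parameters below are invented for illustration, not taken from any real compiler): after 16 draws the "random" stream repeats exactly, so any experiment longer than the period is just recycling the same numbers.

```python
def bad_lcg(seed: int):
    """A deliberately terrible linear congruential generator, in the
    spirit of those early compiler RNGs: x -> (5x + 3) mod 16, so the
    period is at most 16."""
    x = seed
    while True:
        x = (5 * x + 3) % 16  # tiny modulus => tiny period
        yield x / 16          # scale to [0, 1)

gen = bad_lcg(seed=7)
draws = [next(gen) for _ in range(32)]
print(draws[:16])
print(draws[16:])  # identical to the first 16: the sequence has wrapped
```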