The era of the statisticAIn?
Putting Claude Code to work on data analysis
Reasoning models are getting better at maths exams, but how good are emerging tools at statistical tasks? I recently gave Claude Code a real-life behavioural dataset and asked whether there were any interesting patterns. A few minutes and a few thousand tokens later, it had a clear answer for me:
I ran an exploratory analysis across all pairwise correlations and group comparisons in the 100-person behavioural dataset. The strongest statistically significant finding was: Higher education level is associated with fewer monthly leisure activities (Spearman rho = -0.23, p = 0.019).
The pattern shows a clear gradient. The contrast between the extremes is particularly stark: people with no formal education report nearly 2.5x more leisure activities than those with undergraduate degrees (Welch’s t = 2.68, p = 0.007).
The slight uptick at postgraduate level (with wide confidence intervals due to smaller n) could reflect more flexible schedules at that career stage, but the overall downward trend is statistically significant.
At first glance, it seemed impressive. A full, coded-from-scratch data analysis and a confident conclusion, all automated. It even included a graph to illustrate the findings:
Too good to be true
There was just one catch: the dataset wasn’t real. I’d simulated 100 individuals with 5 demographic characteristics and 6 behavioural indicators that were all just random noise.
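For readers who want to reproduce the effect, here is a minimal Python sketch of this kind of experiment. It is not the simulation used for this post: the column count matches the 11 variables described above, but treating every column as an independent standard normal draw, the seed, and the function names are all assumptions made for illustration. The p-values come from the Fisher z approximation rather than an exact test.

```python
import itertools
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def scan_noise(n_people=100, n_vars=11, alpha=0.05, seed=1):
    """Generate pure-noise columns and test every pair of them."""
    rng = random.Random(seed)
    # Assumption: every variable is continuous noise; the real dataset
    # also had categorical demographics.
    columns = [[rng.gauss(0, 1) for _ in range(n_people)]
               for _ in range(n_vars)]
    results = []
    for i, j in itertools.combinations(range(n_vars), 2):
        r = pearson_r(columns[i], columns[j])
        results.append((i, j, r, approx_p_value(r, n_people)))
    hits = [res for res in results if res[3] < alpha]
    return results, hits


results, hits = scan_noise()
print(f"{len(hits)} of {len(results)} noise pairs have p < 0.05")
```

Changing the seed changes which pairs look "significant", which is the point: an uncorrected scan of enough noise pairs will usually turn up something.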
So I asked Claude: “Are you sure this is robust analysis? You’ve compared a lot of things.”
I soon received an updated – and totally different – conclusion:
You’re right to question it. After correcting for the ~25 tests I ran:
- Bonferroni-adjusted p = 0.475 — not significant
- Benjamini-Hochberg adjusted p = 0.475 — not significant
The raw p = 0.019 looked promising, but with that many comparisons across correlations and group tests, it’s entirely consistent with chance. The Welch t-test (p = 0.007) was also post-hoc — I picked the two most extreme groups after seeing the pattern, so it doesn’t count as independent evidence.
Honest conclusion: with n = 100 and ~25 exploratory tests, nothing in this dataset survives multiple comparison correction.
In other words, there was no signal in the data, exactly as we’d expect from simulated noise.
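The arithmetic behind that correction can be sketched in a few lines of Python. This is an illustrative implementation of the standard Bonferroni and Benjamini-Hochberg adjustments, not the code Claude ran. Only the smallest p-value (0.019) and the test count (25) come from the output above; the other 24 p-values are placeholders invented so the example runs, and the function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


# 0.019 is the reported raw p-value; the remaining 24 values are
# placeholders, NOT the real p-values from the analysis.
placeholder_p = [0.019] + [k / 25 for k in range(2, 26)]

print(f"Bonferroni: {bonferroni(placeholder_p)[0]:.3f}")
print(f"Benjamini-Hochberg: {benjamini_hochberg(placeholder_p)[0]:.3f}")
print(f"FWER for 25 tests: {family_wise_error_rate(25):.2f}")
```

With these placeholders, both adjustments come out at about 0.475 for the smallest p-value, in line with the figures Claude reported. If the 25 tests were independent (an assumption, since tests on the same dataset rarely are), the chance of at least one p < 0.05 from pure noise would be roughly 72%.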
Another chance
Perhaps I had caught Claude at a bad moment. If I started a fresh session, would it come up with something more rigorous?
I pointed a new session at the data, and again asked if there were any interesting patterns. This time it came back with a different answer, but one that was just as confident:
1. Women report twice as many leisure activities as men
Women averaged 12.4 monthly leisure activities vs 5.8 for men (p = 0.0001) -- the strongest signal in the dataset. Women also reported higher weekly internet use (24.6 vs 17.6 hours, p = 0.04) and trended toward more screen time and public transport use.
Why it matters: This is a large, statistically robust gap. It could reflect genuine differences in how men and women spend discretionary time, but it could also reflect reporting differences -- women may categorise more activities as “leisure”, or men may undercount. Either interpretation is useful for anyone designing surveys or targeting leisure services.
2. The screen time-physical activity relationship reverses with age
For 18-30 year-olds, more screen time is associated with fewer daily steps (r = -0.26). For over-70s, the relationship flips positive (r = +0.17). The gradient is monotonic across all four age groups.
Why it matters: It challenges the blanket narrative that screen time is sedentary. For older adults, higher screen use may be a marker of general engagement and capability rather than inactivity. Public health messaging that treats screen time as uniformly bad may miss this nuance.
3. Occupation shapes whether screen time is “good” or “bad” for activity
Professionals who spend more time on screens also walk more (r = +0.44). For unemployed individuals, the pattern reverses (r = -0.31). Office, manual, and service workers fall in between with near-zero correlations.
Why it matters: Screen time likely means different things in different contexts. For professionals it may accompany an active, structured lifestyle; for unemployed individuals it may substitute for activity. This suggests that interventions to reduce sedentary behaviour need to be occupation-aware rather than one-size-fits-all.
You’ll notice the justifications after every result. Despite the dataset being pure noise, Claude has produced a nice ‘just-so’ story for each of the spurious findings. As before, it also produced some compelling graphs:
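Part of the reason the subgroup findings look persuasive is that small groups produce large correlations by chance. The sketch below estimates how big a correlation has to be before it reaches p < 0.05 at a given sample size, using the Fisher z approximation. The subgroup sizes are not reported in the output above, so the 20 and 25 used here are assumptions (100 people split roughly evenly across five occupations or four age groups), and `critical_r` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def critical_r(n, alpha=0.05):
    """Smallest |r| reaching two-sided significance at sample size n.

    Uses the Fisher z approximation, so values are approximate.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z_crit / math.sqrt(n - 3))


# Assumed subgroup sizes (20, 25) alongside the full sample (100).
for n in (20, 25, 100):
    print(f"n = {n:3d}: |r| must exceed about {critical_r(n):.2f}")
```

Under those assumed group sizes the threshold sits at roughly 0.4, so correlations such as -0.26 or +0.17 are well within what noise produces, and even +0.44 is borderline before accounting for how many subgroups were examined.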
Beware the statisticAIn
In recent years, I’ve increasingly noticed students asking GPT or Claude to suggest statistical directions: what method to use, what test to run. With the emergence of agentic tools, we will no doubt see the implementation and validation of those methods also increasingly outsourced.
If you know what sort of data analysis you want to run, and know how to check it, tools like Claude Code can be valuable for templating code and plots. But if you don’t, you may well just get a flawed result faster.
It’s a good reminder that with user-pleasing AI agents getting faster at tasks – and potentially subject to less user scrutiny – there’s more need for statistical thinking than ever.




The bottom line here is the classic finding related to AI: it increases the need for experts rather than reducing it.
Thanks for this. I just shared with my final year economics students.