i_am_proteus 5 days ago

>These 54 participants were between the ages of 18 to 39 years old (age M = 22.9, SD = 1.69) and all recruited from the following 5 universities in greater Boston area: MIT (14F, 5M), Wellesley (18F), Harvard (1N/A, 7M, 2 Non-Binary), Tufts (5M), and Northeastern (2M) (Figure 3). 35 participants reported pursuing undergraduate studies and 14 postgraduate studies. 6 participants either finished their studies with MSc or PhD degrees, and were currently working at the universities as post-docs (2), research scientists (2), software engineers (2)

I would describe the study size and composition as a limitation, and a reason to pursue a larger and more diverse study for confirmation (or lack thereof), rather than a reason to expect an "uphill battle" for replication and so forth.

tomrod 5 days ago | parent | next [-]

> I would describe the study size and composition as a limitation, and a reason to pursue a larger and more diverse study for confirmation, rather than a reason to expect an "uphill battle" for replication and so forth.

Maybe. I believe we both agree it is a critical gap in the research as-is, but whether it is a neutral item or an albatross is an open question. Much of psychology and neuroscience research doesn't replicate, often because of limited sample size / composition as well as unrealistic experimental design. Your approach of deepening and broadening the demographics would address generalizability, but not necessarily replication.

My prior is that this faces an uphill battle.

genewitch 5 days ago | parent [-]

Do you feel this way about every study with N ≈ 54? For instance, the GLP-1 brain cancer one?

tomrod 5 days ago | parent [-]

You'll need to specify the study; I see several candidates in my search, some of which are quite old.

Generally, yes, low N is unequivocally worse than high N in supporting population-level claims, all else equal. With fewer participants or observations, a study has lower statistical power, meaning it is less able to detect true effects when they exist. This increases the likelihood of both Type II errors (failing to detect a real effect) and unstable effect size estimates. Small samples also tend to produce results that are more vulnerable to random variation, making findings harder to replicate and less generalizable to broader populations.

In contrast, high-N studies reduce sampling error, provide more precise estimates, and allow for more robust conclusions that are likely to hold across different contexts. This is why, in professional and academic settings, high-N studies are generally considered more credible and influential.

In summary, you really need a large effect size for low-N studies to be high quality.
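
To make that concrete, here's a minimal simulation sketch (not from the study; it assumes a two-proportion comparison with a hypothetical moderate effect of 25% vs 50% success rates) showing how detection rates fall off at small N:

    import numpy as np
    from scipy.stats import fisher_exact

    rng = np.random.default_rng(0)

    def simulated_power(n_per_group, p_control, p_treatment,
                        alpha=0.05, trials=2000):
        # Fraction of simulated studies that reach significance,
        # i.e. an empirical estimate of statistical power.
        hits = 0
        for _ in range(trials):
            a = rng.binomial(n_per_group, p_control)
            b = rng.binomial(n_per_group, p_treatment)
            _, p = fisher_exact([[a, n_per_group - a],
                                 [b, n_per_group - b]])
            hits += p < alpha
        return hits / trials

    # A moderate true effect (25% vs 50%):
    print(simulated_power(27, 0.25, 0.50))   # ~0.4 -- most small studies miss it
    print(simulated_power(200, 0.25, 0.50))  # ~1.0 -- large studies reliably find it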

sarchertech 5 days ago | parent [-]

The need for a large sample size is dependent on effect size.

The study showed that 0 of the AI users could recall a quote correctly while more than 50% of the non-AI users could.

A sample of 54 is far, far larger than is necessary to say that an effect that large is statistically significant.

There could be other flaws, but given the effect size you certainly cannot say this study was underpowered.
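
For a rough check (a sketch, not the paper's analysis: it assumes roughly 18 people per cohort, with 0 of 18 LLM users and 9 of 18 non-AI users recalling a quote correctly, counts inferred from this thread rather than taken from the paper), a Fisher exact test on those numbers:

    from scipy.stats import fisher_exact

    # Rows: LLM group, non-AI group; columns: recalled, did not recall.
    table = [[0, 18],
             [9, 9]]
    _, p_value = fisher_exact(table)
    print(p_value)  # ~0.001 -- significant well below the 0.01 level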

tomrod 5 days ago | parent [-]

You would need the following cohort sizes per alpha level (each cohort is currently 18 people) at a power level of 80% with an effect size of 50%:

0.05: 11 people per cohort

0.01: 16 people per cohort

0.001: 48 people per cohort

So they do clear the effect size bar for that particular finding at the 99% level, though not quite the 99.9% level. Further, selection effects matter -- are there any school-cohort effects? Is there a student bias (i.e., would a working person at the same age, or someone from a different culture or background, see the same effect)? Were the control and test groups truly randomized? etc. -- all of which would need a larger N to overcome.

So for students from the handful of colleges they surveyed, they identified the effect, but again, it's not bulletproof yet.
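
(For reference, a minimal sketch of that kind of power calculation with statsmodels, assuming a two-proportion test and Cohen's h for a 0% vs 50% recall gap; the required cohort sizes depend heavily on which test and effect-size convention you pick, so this won't necessarily reproduce the table above.)

    import math
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # One reading of "an effect size of 50%": Cohen's h for 0% vs 50%.
    h = proportion_effectsize(0.50, 0.0)

    analysis = NormalIndPower()
    for alpha in (0.05, 0.01, 0.001):
        n = analysis.solve_power(effect_size=h, alpha=alpha, power=0.80)
        print(f"alpha={alpha}: {math.ceil(n)} per cohort")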

sarchertech 4 days ago | parent [-]

With the effect significant at better than the 99% level, I wouldn't expect this to be difficult to reproduce.

But it turns out I misread the paper. It was actually an 80% effect size, so it's significant at better than the 99.9% level.
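
Re-running the earlier back-of-the-envelope Fisher check with an ~80-point gap (hypothetical counts again: 0 of 18 vs 14 of 18):

    from scipy.stats import fisher_exact

    # Hypothetical counts for an ~80-point recall gap, 18 per cohort.
    _, p_value = fisher_exact([[0, 18], [14, 4]])
    print(p_value)  # ~2e-6 -- far below the 0.001 threshold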

Of course it could be the case that there is something different about young college students that makes them react very, very differently to LLM usage, but I wouldn't bet on it.

hedora 5 days ago | parent | prev | next [-]

The experimental setup is hopelessly flawed. It assumes that people’s tasks will remain unchanged in the presence of an LLM.

If the computer writes the essay, then the human that’s responsible for producing good essays is going to pick up new (probably broader) skills really fast.

efnx 5 days ago | parent [-]

Sounds like a hypothesis! You should do a study on that.

stackskipton 5 days ago | parent | prev | next [-]

I'd love to see a much more diverse selection of schools. All of these schools are extremely selective, so you are looking at an extremely selective slice of the population.

sarchertech 4 days ago | parent [-]

Is your hypothesis that very smart people are much, much less likely to be able to remember quotes from essays they wrote with LLM assistance than dumber people?

I wouldn’t bet on that being the case.

stackskipton 4 days ago | parent [-]

No, my hypothesis is that this is happening to people at very selective schools, and that the damage it's doing at less selective schools is much, much greater.

jdietrich 5 days ago | parent | prev [-]

Most studies don't replicate. Unless a study is exceptionally large and rigorous, your expectation should be that it won't replicate.

sarchertech 4 days ago | parent [-]

That isn't correct. Whether a study replicates has to do with the likelihood that the effect it found was actually just random chance, and the sample size and the effect size are equally important there.

This study showed an enormous effect size for some effects, so large that it's significant at the 99.9% level.