Tuesday, 19 September 2023

Data is Expensive, Conclusions Are Cheap: How To Fix Research Fraud

It's probably just my echo chamber, but I've seen a number of YouTube videos on scientific fraud recently. This does not shake my faith in Quantum Mechanics, because this isn't happening in real science. It's happening in psychology (evolutionary or otherwise), behavioural economics and other such pseudo-subjects with lousy replicability, and a tendency to pass off small samples of undergraduates as sufficient data. I've read my share of pop-science from these people, and while I've been amused and intrigued, I've never been convinced. The samples are too small. The conclusions are too darn cute, and fit way too well into the current academic Goodthink. Also a lot of it is just plain wrong.

What does one do about all this nonsense research?

Realise that statistical analyses, summaries, graphics, and conclusions are cheap.

It's the data that matters.

Any research project funded by the taxpayer must make its raw data publicly available, along with a detailed description of how the data was obtained.

With no controls over access. In CSV format so we don't have to write complicated scripts to read it.

And at no charge. We already paid with our taxes.
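To show how low the bar is when the data comes as CSV, here is a minimal sketch using nothing but Python's standard library. The column names and the three rows are made up for the example; they are not from any real study.

```python
import csv
import io

# Hypothetical raw file contents; a real study would publish an actual .csv file.
RAW = """household_id,household_type,subject,exam_board,grade
H001,single,Maths,AQA,7
H001,single,English,OCR,6
H002,dual,Maths,Edexcel,5
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(RAW)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Swap `io.StringIO(text)` for an opened file and the same few lines read the published data. No complicated scripts required.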

Give us the data, and we will draw our own conclusions, thank you. Research will become valuable because it produces data that people use.

Not because some publicity-savvy academic produces an eye-catching claim.

The infamous thirty-undergraduate sample will simply vanish.

Researchers who provide lots of dimensions of analysis that can be correlated with ONS data will get readers; those who use only a few, which maybe can't be matched against anything else, will be passed by.

It works like this.

Hypothesis: children from single-parent families do better at school than children from two-parent families. 'Do better' means more and better grades at GCSE. So get a sample of single-parent households with kids who just did their GCSEs and another of dual-parent households with kids who just did their GCSEs. Same size, as there are plenty of both.

Recognise that the initial question is attractive but silly. It's the kind of question a single-purpose charity might ask and, if it liked the answer, use in its next fund-raising round.

"Single-parent homes" are not all the same. Neither are "dual-parent homes". Families are all different. And they are an effect, not a cause. Parental behaviour, sibling examples, household economics, the location, the religion and the culture are causes.

Here's your chance to get some data-kudos.

Get a decent sample size. 10,000 or so of each.

Get the results for the kids. Grade by subject. With the exam board. No summarising or grouping. I've got a computer to do that if I want it.
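This is what "I've got a computer to do that" looks like in practice: a rough sketch of grouping ungrouped rows any way the reader pleases. The field names and grades are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ungrouped records: one row per child per exam.
results = [
    {"household_type": "single", "subject": "Maths", "exam_board": "AQA", "grade": 7},
    {"household_type": "single", "subject": "Maths", "exam_board": "AQA", "grade": 5},
    {"household_type": "dual", "subject": "Maths", "exam_board": "AQA", "grade": 6},
    {"household_type": "dual", "subject": "English", "exam_board": "OCR", "grade": 4},
]

def mean_grade_by(rows, *keys):
    """Average grade for each combination of the given columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["grade"])
    return {group: mean(grades) for group, grades in groups.items()}

# The same raw rows, cut two different ways.
by_type_subject = mean_grade_by(results, "household_type", "subject")
by_board = mean_grade_by(results, "exam_board")
```

Grouping is a one-way street. Given the raw rows I can build any summary I like; given only the researcher's summary, I can't get the rows back.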

And get the number of GCSEs the kids were entered for, because Head Teachers game the stats like crazy. While you're doing that, find out how else the Heads game the stats.

Get the details about those households. Age, religion, nationality, gender, political allegiance if any, car owner, rent / mortgage, highest level of education reached by parent(s), subject of degree, employed / self-employed / unemployed / retired / not able to work, occupation if working, postcode (all of it), place of work, large or small employer, private or public sector. Income and sources, expenses and spending patterns. Savings. Help from relatives. Drug use. Exercise regimes.

How long had the parents been divorced before the GCSE exams? How long had they been co-habiting or married? What are the childcare arrangements? What are the visitation rights? How often are these denied? Has the divorced partner lost touch with their children? Are the divorced parents still co-operating with each other over raising the children? Has the custodial parent moved home? How far away are the parents from each other? Was a family member in jail when the kids were taking the exams? Is the father in the dual-parent household away a lot? Do any of the parents work unsociable hours? Do they use daycare?

See how that data could be interesting to certain groups? Even if they weren't interested in GCSE results?

Did the parents arrange private tutoring? Help their children with their homework? Do the children have long-term health problems? Did they have health problems at the time of the exams? Were they able to revise? What is the school's record in the league tables?

You get the idea. Ask a wide range of detailed questions to cover the vast complexity of human life. Notice when a colleague demurs at something that allows the data to show the influence of (enter taboo subject here). Find somewhere else they can be useful and send them there. Do the same to yourself. The question you resist the most is the one everyone wants answered.

Test the questions. Test the interview process and the online questionnaire (if you must). Do A/B layout and question-order tests. Learn and make adjustments.

Now go out and ask the questions. Tabulate the answers. No leaving anyone out because they missed a bunch of answers. I can deal with that in my analysis. No corrections for this or that. No leaving out the answers to some questions because of "sensitivity" or "mis-interpretation".

That's where you put in the effort. If too many people give incomplete answers, go recruit some more people. Compare those who gave more complete answers to those who didn't, to see if there's a pattern.
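A minimal sketch of that comparison, assuming skipped questions are recorded as empty values rather than quietly dropped. The respondents, questions and numbers are all made up.

```python
from collections import defaultdict

# Hypothetical responses; None marks a question the respondent skipped.
responses = [
    {"household_type": "single", "income": 24000, "tutoring": "no"},
    {"household_type": "single", "income": None, "tutoring": "yes"},
    {"household_type": "dual", "income": 41000, "tutoring": "yes"},
    {"household_type": "dual", "income": 38000, "tutoring": "no"},
]
QUESTIONS = ["income", "tutoring"]

def is_complete(row):
    """True if the respondent answered every question."""
    return all(row[q] is not None for q in QUESTIONS)

def completion_rate_by(rows, key):
    """Share of fully complete responses within each value of `key`."""
    counts = defaultdict(lambda: [0, 0])  # [complete, total]
    for row in rows:
        counts[row[key]][0] += is_complete(row)
        counts[row[key]][1] += 1
    return {group: done / total for group, (done, total) in counts.items()}

rates = completion_rate_by(responses, "household_type")
```

A big gap between groups would suggest the missing answers are not random, which is worth knowing before anyone analyses the rest.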

Put the raw results up on GitHub or wherever. Along with the questionnaire, the times and dates of each interview, and a video of the whole thing if possible. I want to see their body language to judge which questions are likely to have, uh, aspirational answers. (Okay, that's asking a lot.)

I'll do my own analyses.

The researchers can publish a summary and conclusion if they want. With a keep-it-simple press release for the science journalists. The rest of us will dig into the data and draw our own conclusions.

The people who don't do data analysis can get some popcorn and follow the disputes.

Data financed by private money? Make it public or we get to treat it as self-serving.

Faced with some conclusion about medicines or human behaviour, ask if the raw data and research protocols are publicly available. If the answer is NO, or "you have to pay", dismiss the conclusion, because there is no evidence that you can judge for yourself. Without the data, we have to take their word for it or not, which means we need to judge their competence, honesty and career pressures. That makes it about the researchers, and it isn't. They may be insightful and honest, or they may be academic hacks. You can't judge that either. What you can judge is whether they are hiding their data. If they are, it fails the smell test.
