New blog post on the NeurIPS'21 experiment re authors' perceptions of their own papers!

https://blog.ml.cmu.edu/2022/11/22/neurips2021-author-perception-experiment/

Key findings:

1) Authors significantly overestimate their papers' chances of acceptance. By like a LOT.

2) Miscalibration is lower for more "senior" authors ("seniority" measured by their role in the conference), and slightly higher for women (note also that women are less likely to be senior in this data/definition, but we controlled for this in the analysis).

3) For authors who submit more than one paper, when asked to RANK their papers by scientific merit, their ranking agrees with their estimated probabilities of acceptance most of the time (93%). In the remaining 7% of cases, the paper they rank as having higher merit is the one they say has a lower chance of acceptance. (A tiny sketch of this consistency check follows the findings below.)


4) (AND I FOUND THIS MOST FASCINATING!) The amount of disagreement between CO-AUTHORS in terms of the perceived relative scientific contribution of their papers is SIMILAR to the amount of disagreement between authors and reviewers.

That is - even though we worry a lot about REVIEWER disagreement, there seems to be just about as much AUTHOR disagreement about the same paper!


5) About half of authors report that their perception of their own paper changed after seeing the initial reviews. Additionally, among both accepted and rejected papers, over 30% of authors report that their perception became more positive.

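(Re finding 3, here's a minimal sketch, in Python, of the consistency check involved - hypothetical paper names and probabilities, not the study's data: does an author's merit ranking agree with their stated acceptance probabilities?)

```python
# Minimal sketch of the finding-3 consistency check on hypothetical inputs
# (not the study's data): a ranking is "consistent" iff the author's stated
# acceptance probabilities are non-increasing down the merit ranking.

def ranking_consistent(papers_by_merit, prob_of_acceptance):
    """papers_by_merit: paper ids ordered best-first.
    prob_of_acceptance: paper id -> author's stated acceptance probability."""
    probs = [prob_of_acceptance[p] for p in papers_by_merit]
    return all(a >= b for a, b in zip(probs, probs[1:]))

# Hypothetical author with two submissions:
print(ranking_consistent(["paper_A", "paper_B"], {"paper_A": 0.8, "paper_B": 0.5}))  # True  (the 93% case)
print(ranking_consistent(["paper_A", "paper_B"], {"paper_A": 0.4, "paper_B": 0.6}))  # False (the 7% case)
```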

Conclusions:

Vast overestimates of the probability of acceptance suggest we should recalibrate expectations (one way or the other). (A small sketch of what such a calibration check looks like follows the conclusions below.)

Disagreements around paper quality suggest that assessing paper quality is not only extremely noisy, but also lacks an objective right answer.
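(On the recalibration point: a minimal sketch, in Python, of the kind of check involved - made-up numbers, not the study's data - comparing authors' stated probabilities with the observed acceptance rate, overall and within bins.)

```python
# Calibration-check sketch on made-up data (NOT the study's numbers).
# Each entry: (author's stated acceptance probability, whether it was accepted).
import random

random.seed(0)
# Hypothetical pattern loosely like the one discussed in the thread:
# stated probabilities skew high, while roughly a quarter of papers get in.
papers = [(random.uniform(0.3, 1.0), random.random() < 0.25) for _ in range(2000)]

mean_stated = sum(p for p, _ in papers) / len(papers)
observed_rate = sum(acc for _, acc in papers) / len(papers)
print(f"mean stated probability: {mean_stated:.2f}, observed acceptance rate: {observed_rate:.2f}")

# Binned reliability curve: stated probability vs. observed acceptance rate.
for lo in [x / 10 for x in range(10)]:
    bin_outcomes = [acc for p, acc in papers if lo <= p < lo + 0.1]
    if bin_outcomes:
        print(f"stated {lo:.1f}-{lo + 0.1:.1f}: observed {sum(bin_outcomes) / len(bin_outcomes):.2f} (n={len(bin_outcomes)})")
```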


It was super fun working with a host of great people on this experiment:

NeurIPS 2021 Program Chairs Alina Beygelzimer, Yann N. Dauphin, Percy Liang and Jennifer Wortman Vaughan...

... and colleagues Charvi Rastogi, Ivan Stelmakh, Zhenyu Xue, Emma Pierson, and Nihar Shah.

/end

@hal Reading the blog post, this point really struck me. I would never assume there could be an objectively right answer for assessing paper quality, but are there those in the NeurIPS community who think there could be?
@JoFrhwld Honestly IDK. To me, it's kinda obvious that there's no right answer. (We iterated MANY times even on how to ask the question and it's certainly imperfect.) BUT at the same time when I hear "grumble reviewers grumble", and when I myself grumble, I often skim over this point, perhaps too much.
@hal In other words -- graduate students shouldn't be discouraged because, on average, they aren't going to get a paper into NeurIPS until their fourth try... It's a crapshoot. A game of chance for all but the worst papers, which are properly filtered out by quality.
@tedpavlic i definitely agree that students (and others) shouldn’t be discouraged. the cited neurips’2020 experiment talks explicitly about randomness in the review process and i think is a much better study to base this type of conclusion on - it specifically looked at randomness in reviews. i don’t quite conclude this from ours which is much more about author *impressions* of the process rather than something about how random the process is itself
@hal Interesting. My mentor had an idea to calibrate the author-opinion part wrt experience, e.g., if students A and B have divergent opinions, can the advisor do better and settle the dispute?
@x the follow up post shows more “senior” folks do indeed produce better calibrated responses - but not hugely so
@hal Alternative interpretation: Above a relatively low threshold, acceptance is randomized due to lack of space, leading to a loss of predictive power in that regime.

@ted_dunning I may be missing something (correct me!), but I think in order to get that, most respondents would've had to interpret the question as about the QUALITY of the paper, rather than its CHANCE of acceptance.

It's entirely possible that what you're saying is true - in which case, if one believed their paper was "good enough" they should have answered ~30% - but that's not what happened, which at least suggests people don't *think* that's the case.

Who knows what lurks in the hearts of authors?

@hal I have never been sure about how people truly interpret questions. My users have confounded me far too many times for me to have illusions that the question asked is the question answered.

@ted_dunning Yup, that's entirely possible. We hoped that giving them the past rate would help the interpretation, but it's definitely possible they misinterpreted.

If that's the case, there's still a big gap, because if we really believe everything over a threshold is random, then no one should be saying anything over, say, 50% - but clearly a lot of people are.

@hal There is definitely a gap, but my first interpretation is that people are complicated and they assume that any questioner is complicated. And then they estimate what you really meant by your question in some complicated way based on their estimate of your estimate of their mental state.

I still love y'all's work here and the graph speaks volumes. It also makes me re-think what I think about publications. That's probably true of others as well.

@ted_dunning "people are complicated" --- something about truer words... :)

But yes, I agree there should be a lot of room for interpretation given how these darned complicated people interpreted things :)

@hal My coming of age moment in this respect was when I was first analyzing the behavior of people relative to music.

I found that if you looked at how much of a song people let play before hitting skip, versus our estimate of how much they liked the song, the behavior was very non-intuitive.

Skipping after less than 15 seconds generally seemed to indicate radical dislike of an entire genre. Country music for a heavy metal fan. Or metal for a classical music listener.

1/2

@hal That made sense. People can determine rough genre in a few hundred milliseconds.

But people skipped their absolute favorite songs frequently after about 30-60 seconds had played. Quizzing users about this indicated that they weren't even quite aware of doing this, but it seemed that they knew the songs well enough that this was enough to get the high.

This behavior had clear ramifications for building a recommender.

And none of it carried over much to video watching.

2/2

there's nowt so queer as folk

@ted_dunning that’s an amazing example! both surprising but also i can totally see how it’s true

@hal Are we looking at another study that corroborates Dunning-Kruger?

I'm thinking of a lot of recent PhDs submitting their theses and young Assistant Professors trying to learn how to publish.

Of course, I would have to dig into the data to see if my casual speculations hold any water.

@vicuzumeri The second result is suggestive of this, though even the "senior" people (who have arguably become part of the "NeurIPS community") are pretty poorly calibrated too.

@hal Odd that the lower 1/3rd tracks the yellow dotted line, but above that it's basically horizontal.

It could also be said that publication venues are only slightly okay at rejecting the bottom quarter of papers, and that for the better papers it's just a ~30% tossup.

@HenkPoley @hal yeah, i had this observation too — it looks like about 30pts absolute of the 70% rejection is corroborated by authors' own opinions, but outside of that, there's bad correlation with authors' own estimates
@trochee @HenkPoley yup just to be clear these are - if people answered the question they were asked - their opinion *of whether it was likely to get in* not their opinion of *whether it should get in*.
@hal @HenkPoley and it sounds like many authors approximated "will get in" with "deserves to get in"
@trochee @HenkPoley why do you conclude that?

@hal @HenkPoley that's what the down elbow suggests to me — overall, if you think you're not in the bottom 30%, you have a poorly calibrated estimator for acceptance

Maybe that's not the same thing

@hal Isn't it possible that authors are pretty good at guessing the quality of their papers, but the limited capacity of NeurIPS (and the wide net they cast to find any reviewers) creates a "Type-II functional response" where reviews become equivalent to a coin flip for all papers above a saturation threshold?

If your paper is good enough, acceptance is basically random. Only really poor papers actually get filtered by relative quality.

@tedpavlic it's possible but i think it would require one of two things to be true:
1) people systematically misinterpreted the question to be about quality not chance of acceptance (see this thread https://mastodon.social/@ted_dunning/109390402626776481)

OR

2) what you’re saying is true BUT people don’t believe it - else no one would report a probability over 30%

would also need to square this with result #4 which is about merit not chance
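(To make the "random above a threshold" arithmetic concrete, a rough back-of-the-envelope sketch in Python, with illustrative numbers rather than NeurIPS's actual figures: if the bottom chunk of submissions were filtered on quality and the remaining slots filled by lottery, every surviving paper would have the same acceptance probability, well below the 70-100% figures many authors reported.)

```python
# Back-of-the-envelope sketch of the "lottery above a quality threshold" model,
# using illustrative numbers (not NeurIPS's actual figures).
overall_acceptance_rate = 0.26  # assumed overall acceptance rate
filtered_on_quality = 0.30      # assumed fraction rejected purely on quality

# Under this model, acceptance is uniform among above-threshold papers:
p_if_above_threshold = overall_acceptance_rate / (1.0 - filtered_on_quality)
print(f"implied probability for any above-threshold paper: {p_if_above_threshold:.2f}")
# ~0.37 -- so an author who truly believed this model (and that their paper
# cleared the bar) should report something near 35-40%, not 70-100%.
```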

@hal The one quick mention of the *median* prediction being 70% made me boggle, so I went looking for this graph in the paper. Almost 10% of answers thought their paper was *certain* to get in!
@hal interesting stuff. Question though: how do you determine the empirical acceptance rate for a paper? (I must have missed something?)