Automatic reviewers can miss fundamental reasoning errors.
LLM-generated reviews may look convincing, but how reliable are they in practice?
In our recent TACL paper, we introduce a controlled counterfactual evaluation framework to systematically test automatic reviewers.
What we find:
They rely heavily on surface-level signals
⚠️ They often miss mismatches between claims and actual results
