Humans learn language by acting in the world. Can RL agents do the same? lilGym is a new benchmark πŸ‹οΈ for RL + natural language + visual reasoning

https://arxiv.org/abs/2211.01994
https://lil.nlp.cornell.edu/lilgym/

Chief RL trainer: @[email protected], in collaboration with @[email protected] and @[email protected]

lilGym: Natural Language Visual Reasoning with Reinforcement Learning

We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.

The agent's goal is to modify the environment so the grounded truth-value of the given statement matches a target boolean. The language in lilGym is semantically rich and human-written, covering set reasoning, spatial relations, cardinality, and more.
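To make the goal concrete, here's a minimal sketch of the target-boolean setup. Everything below (the state encoding, the function names) is my own illustration, not the actual lilGym API:

```python
# Sketch only: the state encoding and names are assumptions, not the real lilGym API.
# Example statement: "there are exactly two yellow circles" (cardinality reasoning).
def statement_holds(state):
    """state: list of item dicts, e.g. {"color": "yellow", "shape": "circle"}."""
    yellow_circles = [
        item for item in state
        if item["color"] == "yellow" and item["shape"] == "circle"
    ]
    return len(yellow_circles) == 2

def success(state, target: bool) -> bool:
    # The agent succeeds when the statement's truth value in the final
    # state matches the target boolean it was asked to achieve.
    return statement_holds(state) == target

state = [
    {"color": "yellow", "shape": "circle"},
    {"color": "yellow", "shape": "circle"},
    {"color": "blue", "shape": "square"},
]
print(success(state, True))   # True: exactly two yellow circles
print(success(state, False))  # False: statement holds, but target was False
```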
This level of reasoning is missing from most RL benchmarks, which use simplified synthetic πŸ€– language. No real human language πŸ—£οΈ Why? Because computing rewards πŸ₯‡ requires resolving language meaning -> it's a real πŸ”+πŸ₯š situation!
How do we compute reward? We annotate all statements with Python 🐍 programs -> we can test every possible state πŸš€. This is a hard semantic parsing annotation problem, so we built an interactive platform with auto-validation against hidden examples πŸ™ˆ
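The auto-validation idea can be sketched like this. It's a toy version; the platform's real interface isn't shown in the thread, so all names and the data format here are hypothetical:

```python
# Toy sketch of validating an annotated program against hidden examples.
# Names and data format are hypothetical, not the real annotation platform.
def validate_annotation(program, hidden_examples):
    """program: callable state -> bool, written by the annotator.
    hidden_examples: (state, expected_truth_value) pairs the annotator
    never sees; the annotation is accepted only if the program agrees
    with every hidden label."""
    return all(program(state) == label for state, label in hidden_examples)

# Example statement: "all items are red" (set reasoning).
program = lambda state: all(item["color"] == "red" for item in state)

hidden = [
    ([{"color": "red"}, {"color": "red"}], True),
    ([{"color": "red"}, {"color": "blue"}], False),
]
print(validate_annotation(program, hidden))  # True: program matches every hidden label
```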
lilGym includes two types of environments that vary πŸ€Έβ€β™‚οΈπŸš΄β€β™€οΈβ›ΉοΈβ€β™€οΈπŸ€ΊπŸ§˜β€β™€οΈ the difficulty of the task, plus flexible simplification controls to tune complexity.
Strong baselines -> a lot of room for improvement. Even with extra help, they fall far short of expert performance. And because the environment is so rich, a conventional reward without our annotation approach -> learning is hopeless -> flat (zero!) learning curve!