Humans learn language by acting in the world. Can RL agents do the same? lilGym is a new benchmark πŸ‹οΈ for RL + natural language + visual reasoning

https://arxiv.org/abs/2211.01994
https://lil.nlp.cornell.edu/lilgym/

Chief RL trainer: @[email protected], in collaboration with @[email protected] and @[email protected]

lilGym: Natural Language Visual Reasoning with Reinforcement Learning

We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.

The agent's goal is to modify the environment so the grounded truth-value of the given statement matches a target boolean. The language in lilGym is semantically rich and human-written, covering set reasoning, spatial relations, cardinality, and more.
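To make the goal concrete, here's a minimal sketch of the target-boolean setup. Everything below (the state encoding, the function names) is my own illustration, not the actual lilGym API:

```python
# Sketch only: the state encoding and names are assumptions, not the real lilGym API.
# Example statement: "there are exactly two yellow circles" (cardinality reasoning).
def statement_holds(state):
    """state: list of item dicts, e.g. {"color": "yellow", "shape": "circle"}."""
    yellow_circles = [
        item for item in state
        if item["color"] == "yellow" and item["shape"] == "circle"
    ]
    return len(yellow_circles) == 2

def success(state, target: bool) -> bool:
    # The agent succeeds when the statement's truth value in the final
    # state matches the target boolean it was asked to achieve.
    return statement_holds(state) == target

state = [
    {"color": "yellow", "shape": "circle"},
    {"color": "yellow", "shape": "circle"},
    {"color": "blue", "shape": "square"},
]
print(success(state, True))   # True: exactly two yellow circles
print(success(state, False))  # False: statement holds, but target was False
```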
This level of reasoning is missing from most RL benchmarks, which use simplified synthetic πŸ€– language. No real human language πŸ—£οΈ Why? Because computing rewards πŸ₯‡ requires resolving language meaning -> it's a real πŸ”+πŸ₯š situation!
How do we compute reward? We annotate all statements with Python 🐍 programs -> we can test every possible state πŸš€. This is a hard semantic parsing annotation problem, so we built an interactive platform with auto-validation against hidden examples πŸ™ˆ
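The auto-validation idea can be sketched like this. It's a toy version; the platform's real interface isn't shown in the thread, so all names and the data format here are hypothetical:

```python
# Toy sketch of validating an annotated program against hidden examples.
# Names and data format are hypothetical, not the real annotation platform.
def validate_annotation(program, hidden_examples):
    """program: callable state -> bool, written by the annotator.
    hidden_examples: (state, expected_truth_value) pairs the annotator
    never sees; the annotation is accepted only if the program agrees
    with every hidden label."""
    return all(program(state) == label for state, label in hidden_examples)

# Example statement: "all items are red" (set reasoning).
program = lambda state: all(item["color"] == "red" for item in state)

hidden = [
    ([{"color": "red"}, {"color": "red"}], True),
    ([{"color": "red"}, {"color": "blue"}], False),
]
print(validate_annotation(program, hidden))  # True: program matches every hidden label
```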
lilGym includes two types of environments that vary πŸ€Έβ€β™‚οΈπŸš΄β€β™€οΈβ›ΉοΈβ€β™€οΈπŸ€ΊπŸ§˜β€β™€οΈ the difficulty of the task, plus flexible simplification controls to tune complexity.
Strong baselines -> a lot of room for improvement. Even with extra help, they fall far short of expert performance. And because the environment is so rich, a conventional reward without our annotation approach -> learning is hopeless -> flat (zero!) learning curve!