Mahi Shafiullah

8 Followers
14 Following
14 Posts

Teaching robots how to do all my chores @NYU @MetaAI.

Previously @MIT.

🏛️New York University
🏫Massachusetts Institute of Technology
🎓https://scholar.google.com/citations?user=vAOw6aQAAAAJ&hl=en
🌐https://mahis.life/

Hi 🐘.

If you are looking for a winter break project, here is the full collection of ML/coding puzzles.

* https://github.com/srush/tensor-puzzles
* https://github.com/srush/gpu-puzzles
* https://github.com/srush/autodiff-puzzles
* https://github.com/srush/raspy

Hi All! I just migrated to the sigmoid.social server. It definitely feels more focused compared to mastodon.social.

Say Hi if you are interested in #robotics or #machinelearning for decision making!

#Introduction I'm Mahi, a third-year PhD student at NYU and visiting researcher at FAIR, working at the intersection of #robotics and #machinelearning! Since CLIP-Fields recently won the outstanding paper award at the #CoRL22 LangRob workshop, it felt right to start by importing this post over from Twitter.

Also check out https://mahis.life for my other research projects!

(Coming from hashtags? Click on this post to see the full thread)

Thanks to my advisors and collaborators @cpaxton @lerrel @soumith and Arthur Szlam, and finally Meta AI for an amazing internship!

Paper: http://arxiv.org/abs/2210.05663
More video/demos: http://mahis.life/clip-fields/
Code and interactive tutorials: http://github.com/notmahi/clip-fields

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

We propose CLIP-Fields, an implicit scene model that can be used for a variety of tasks, such as segmentation, instance identification, semantic search over space, and view localization. CLIP-Fields learns a mapping from spatial locations to semantic embedding vectors. Importantly, we show that this mapping can be trained with supervision coming only from web-image and web-text trained models such as CLIP, Detic, and Sentence-BERT; and thus uses no direct human supervision. When compared to baselines like Mask-RCNN, our method outperforms on few-shot instance identification or semantic segmentation on the HM3D dataset with only a fraction of the examples. Finally, we show that using CLIP-Fields as a scene memory, robots can perform semantic navigation in real-world environments. Our code and demonstration videos are available here: https://mahis.life/clip-fields
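The core mechanism — a queryable map from 3D coordinates to a shared semantic embedding space — can be sketched in a few lines. This is a toy illustration only, not the released code: random vectors stand in for CLIP/Sentence-BERT embeddings, and a lookup table stands in for the learned neural field.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # CLIP-style embedding dimension

# Toy "scene field": a lookup from discretized 3D coordinates to semantic
# embedding vectors. The real CLIP-Field is a neural field over continuous xyz.
scene_field = {
    (0, 0, 0): rng.standard_normal(DIM),  # e.g. a spot near the couch
    (1, 2, 0): rng.standard_normal(DIM),  # e.g. a spot near the bookshelf
}

def embed_text(query: str) -> np.ndarray:
    # Stand-in for a CLIP / Sentence-BERT text encoder (random but
    # deterministic per query string within a run).
    seed = abs(hash(query)) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

def locate(query: str) -> tuple:
    """Return the scene coordinate whose embedding best matches the query."""
    q = embed_text(query)
    q = q / np.linalg.norm(q)
    def cosine(v: np.ndarray) -> float:
        return float(v @ q / np.linalg.norm(v))
    return max(scene_field, key=lambda xyz: cosine(scene_field[xyz]))

best = locate("blue book with a house on the cover")
print(best)  # one of the stored scene coordinates
```

Once trained, the same field answers arbitrary open-vocabulary queries by ranking locations against the query embedding — no per-task retraining.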
We can train a CLIP-Field from scratch in under an hour, including automated labeling, thanks to advances in the NeRF literature such as Instant-NGP. Our trained model can then be used on a robot to find a "blue book with a house on the cover" or a place to "throw out my trash".
For real-world experiments, we collect RGB-D data using an iPhone 13 Pro and pre-process it using open-label detection/segmentation models like Detic and LSeg.
We then convert the data to world coordinates and semantic/visual representations by running Sentence-BERT and CLIP on the labels and bounding boxes.
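The geometric half of that preprocessing — lifting an RGB-D pixel into world coordinates — is standard pinhole back-projection. A minimal sketch, with made-up intrinsics (this is not the released pipeline):

```python
import numpy as np

def pixel_to_world(u, v, depth, K, cam_to_world):
    """Back-project one RGB-D pixel into world coordinates.

    u, v: pixel coordinates; depth: metric depth at that pixel (meters);
    K: 3x3 camera intrinsics; cam_to_world: 4x4 camera pose matrix.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole model: lift the pixel to a 3D point in the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    p_cam = np.array([x, y, depth, 1.0])
    # Transform camera-frame point into the world frame.
    return (cam_to_world @ p_cam)[:3]

# Example intrinsics (illustrative values, not the iPhone's actual calibration).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# A pixel at the image center with an identity pose maps to (0, 0, depth).
print(pixel_to_world(320, 240, 2.0, K, np.eye(4)))  # → [0. 0. 2.]
```

Each back-projected point then gets paired with the CLIP/Sentence-BERT embedding of whatever label or crop covered that pixel, forming the field's training data.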
At the core of CLIP-Fields lies a neural field that maps real-world coordinates to the semantic representation spaces underlying pretrained models like CLIP and Sentence-BERT. This mapping enables our model to respond to open-ended semantic or visual queries.
Since then, we've also developed Conditional-BeT (C-BeT), a way to train goal-conditioned BeT from fully uncurated data. C-BeT makes sense of "play"-style robot demos with no labels and no RL, extracting conditional policies on real robots from 4.5 hours of play data!
https://twitter.com/LerrelPinto/status/1582774757157896193
Lerrel Pinto on Twitter

“Almost ♾ unlabeled data is the “secret sauce” for today's ML, but how do we use uncurated datasets in robot learning? Conditional Behavior Transformer makes sense of "play" style robot demos w/ no labels and no RL to extract conditional policies! https://t.co/2uCJtol5Wt 🧵”

I'll be at NeurIPS presenting Behavior Transformers -- find us at Hall J #110 on Tuesday morning at the very first session! Feel free to hit me up on DM/email if you want to grab ☕ and chat about robot learning for household robots, or just catch up.
https://twitter.com/LerrelPinto/status/1540357198009798656
Lerrel Pinto on Twitter

“Robotic imitation often relies on curated, task-specific expert data. But human data is neither expert nor unimodal! Behavior Transformers can model task-agnostic play data, capture their underlying modes, and solve tasks through unconditional rollouts. Lead by @notmahi. (1/n)”

To recap, Behavior Transformer (BeT) is a new architecture for behavior cloning that can model task-agnostic multi-modal play data, capture their underlying modes, and solve tasks through unconditional rollouts.
📄+🎥+ code https://mahis.life/bet/
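The trick BeT uses to handle multi-modal actions is to factor each continuous action into a discrete bin, found by clustering the dataset's actions, plus a small continuous offset; the transformer then classifies the bin and regresses the offset. A toy sketch of that factorization on synthetic two-mode data (illustrative only, not the released code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-modal action data: two distinct modes (e.g. reach left vs. right).
actions = np.concatenate([
    rng.normal(loc=[-1.0, 0.0], scale=0.05, size=(100, 2)),
    rng.normal(loc=[+1.0, 0.0], scale=0.05, size=(100, 2)),
])

def kmeans(x, k, iters=20):
    """Tiny k-means with deterministic, spread-out initialization."""
    centers = x[:: max(1, len(x) // k)][:k].copy()
    for _ in range(iters):
        dists = ((x[:, None] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for i in range(k):
            if (labels == i).any():  # keep old center if a cluster empties
                centers[i] = x[labels == i].mean(0)
    return centers, labels

# BeT-style factorization: action = discrete bin center + continuous offset.
centers, labels = kmeans(actions, k=2)
offsets = actions - centers[labels]

# A transformer head would predict the bin (classification) and the offset
# (regression); reconstruction is the chosen bin center plus the offset.
reconstructed = centers[labels] + offsets
print(np.allclose(reconstructed, actions))  # → True
```

Averaging over modes is what makes plain regression fail on play data; classifying over bins lets the model commit to one mode at a time.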
Behavior Transformers: Cloning k modes with one stone