A beginner in the field of AI/ML. Interested in working on CV algorithms.
Website: https://theprojectsguy.github.io
GitHub: https://github.com/TheProjectsGuy

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Wearable glasses with stereo cameras, IMUs, mics, GPS, and eye-gaze tracking. Useful for egocentric research, SLAM, etc. (the backend processing all runs at Meta).

My summary on HFPapers: https://huggingface.co/papers/2308.13561#65294ae125168ba8451b2579

arXiv: https://arxiv.org/abs/2308.13561

#arxiv #meta #slam

Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding

Builds a 3D scene graph from CLIP features; the same features are used for image retrieval, alongside AKAZE local features for loop closures. In short: SLAM built on CLIP (an LLM/VLM).
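
A rough sketch of the image-retrieval idea: rank stored keyframes by cosine similarity to a query embedding. The short toy vectors stand in for real CLIP embeddings, and the names `cosine_similarity` and `retrieve` are hypothetical, not taken from the LEXIS code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, database, top_k=1):
    """Return the names of the top_k stored embeddings most similar to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy 3-D "embeddings"; real CLIP features have hundreds of dimensions.
keyframes = {
    "kitchen": [1.0, 0.0, 0.0],
    "office": [0.0, 1.0, 0.0],
    "corridor": [0.7, 0.7, 0.0],
}
best_matches = retrieve([0.9, 0.1, 0.0], keyframes, top_k=2)
```

In a loop-closure setting, the top matches would only be candidates; a local-feature check (AKAZE in the paper) would still be needed to confirm them.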

My summary on HFPapers: https://huggingface.co/papers/2309.15065#652946a97d6f8e0bf0e5c120

arXiv: https://arxiv.org/abs/2309.15065

#SLAM #robotics #LLM #arxiv

Mistral 7B

Grouped-query attention and sliding-window attention improve the efficiency and performance of LLMs. The source of the training data is not disclosed, but the model is released under a permissive license.
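
To make the two attention tricks concrete, here is a minimal pure-Python sketch (not Mistral's code). The helper names `sliding_window_mask` and `kv_head_for_query_head` are hypothetical, the window and head counts are toy values, and the exact window boundary convention is an assumption.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (j <= i) and limited to the most recent `window` positions,
    counting position i itself (an assumed convention).
    """
    return [
        [(j <= i) and (j > i - window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Grouped-query attention: consecutive query heads share one key/value head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


mask = sliding_window_mask(seq_len=5, window=3)
head_map = [kv_head_for_query_head(h, n_query_heads=8, n_kv_heads=2) for h in range(8)]
```

With a fixed window, each row of the mask has at most `window` True entries, so attention cost per token stops growing with sequence length; sharing key/value heads shrinks the KV cache by the group size.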

My summary on HFPapers: https://huggingface.co/papers/2310.06825#6527a2be0ef49cfb784b936f

arXiv: https://arxiv.org/abs/2310.06825

#paper #newpaper #llm #arxiv

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

An X2T-centric architecture with w2v-BERT; speech synthesis using T2U (text-to-unit) and then HiFi-GAN. Many languages, huge datasets, and thorough comparisons.

My summary on HFPapers: https://huggingface.co/papers/2308.11596#6526ab95a7925a43f6024fbf

arXiv: https://arxiv.org/abs/2308.11596
HFSpace (cool translation app): https://huggingface.co/spaces/facebook/seamless_m4t

#arxiv #paper #meta #nlp #translation

Objaverse-XL: A Universe of 10M+ 3D Objects

Crawl the web for 10M+ 3D assets, then filter them using CLIP features from multiple views. Training on this data improves Zero123 and PixelNeRF.

My summary on HFPapers: https://huggingface.co/papers/2307.05663#6512be82f60393414aecfd75
arXiv: https://arxiv.org/abs/2307.05663

#data #arxiv #paper

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

Structuring LLM reasoning as a graph can be a very powerful (and sophisticated) prompting technique, and it is cheaper than Tree of Thoughts (ToT).
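
A toy sketch of the graph idea on a sorting task: split a problem into sub-thoughts, solve each, then aggregate several thoughts into one (the step a tree cannot express). Plain Python functions stand in for LLM calls here, and the names `Thought`, `generate`, and `aggregate` are hypothetical, not the paper's API.

```python
from dataclasses import dataclass, field
from heapq import merge


@dataclass
class Thought:
    content: list                                 # partial solution held by this node
    parents: list = field(default_factory=list)   # incoming edges in the thought graph


def generate(parent, n_parts):
    """Split the parent's list into chunks and sort each (stand-in for an LLM call)."""
    size = max(1, -(-len(parent.content) // n_parts))  # ceiling division
    return [
        Thought(sorted(parent.content[k:k + size]), [parent])
        for k in range(0, len(parent.content), size)
    ]


def aggregate(thoughts):
    """Merge several sorted thoughts into one node with multiple parents."""
    merged = list(merge(*(t.content for t in thoughts)))
    return Thought(merged, list(thoughts))


root = Thought([5, 2, 9, 1, 7, 3])
parts = generate(root, n_parts=3)
final = aggregate(parts)
```

The `final` node has three parents, which is what makes the structure a graph rather than a chain or a tree.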

My summary on HFPapers: https://huggingface.co/papers/2308.09687#650ef7ba1765bd51f4ad15bc
arXiv: https://arxiv.org/abs/2308.09687

#arXiv #paper

3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

Encode scene semantics in a hierarchy built from RGB (registered panoramas) and a mesh.

My summary on HFPapers: https://huggingface.co/papers/1910.02527#650b278bf795a59f49fd7050
arXiv: https://arxiv.org/abs/1910.02527

#arXiv #paper

Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Swin Transformer backbone for 3D point clouds (with some quirks in the self-attention mechanism).

My summary on HFPapers: https://huggingface.co/papers/2304.06906#650b26f11f2949b99e500f06
arXiv: https://arxiv.org/abs/2304.06906

#arXiv #paper

Unified Visual Relationship Detection with Vision and Language Models

A VLM for scene understanding via visual relationship detection (VRD). Uses a DETR-like object detector (with bounding-box prediction) and a Perceiver Resampler as the relationship decoder.

My summary on HFPapers: https://huggingface.co/papers/2303.08998#64ff22002597506d5adf7966
arXiv: https://arxiv.org/abs/2303.08998

#arxiv #paper #FoundationModels

Sparse 3D Topological Graphs for Micro-Aerial Vehicle Planning

Build a scene-graph-like sparse graph for fast MAV planning. Voronoi diagrams, connected neighbors, and efficient pruning and joining methods can take you very far.

My summary on HFPapers: https://huggingface.co/papers/1803.04345#64fdcff6dc46569735b670d5
arXiv: https://arxiv.org/abs/1803.04345

#arXiv #paper #planning
