Enjoy the Image Matching Challenge 2023 recap:

https://ducha-aiki.github.io/wide-baseline-stereo-blog/2023/07/05/IMC2023-Recap.html

tl;dr:
- SfM is not solved
- global descriptor similarity is hard
- orientation invariance for SuperGlue is easy
- PixSfM is a good idea, but needs follow-ups
- KeyNetAffNet @kornia_foss rocks

#IMC2023 #CVPR2023

@cvpr

#computervision #deeplearning
#dmytrotweetsaboutDL

Image Matching Challenge 2023: The Unbearable Weight of the Bundle Adjustment

3D reconstruction is harder than two-view matching

Wide baseline stereo meets deep learning

DäRF: Boosting Radiance Fields from Sparse Inputs with Monocular Depth Adaptation

Jiuhn Song, Seonghoon Park, Honggyu An, Seokju Cho, Min-Seop Kwak, Sungjin Cho, Seungryong Kim

tl;dr: feed NeRF-rendered novel views into a monodepth network, and optimize for consistency
https://arxiv.org/abs/2305.19201

#computervision #deeplearning
#dmytrotweetsaboutDL

DäRF: Boosting Radiance Fields from Sparse Inputs with Monocular Depth Adaptation

Neural radiance fields (NeRF) shows powerful performance in novel view synthesis and 3D geometry reconstruction, but it suffers from critical performance degradation when the number of known viewpoints is drastically reduced. Existing works attempt to overcome this problem by employing external priors, but their success is limited to certain types of scenes or datasets. Employing monocular depth estimation (MDE) networks, pretrained on large-scale RGB-D datasets, with powerful generalization capability would be a key to solving this problem: however, using MDE in conjunction with NeRF comes with a new set of challenges due to various ambiguity problems exhibited by monocular depths. In this light, we propose a novel framework, dubbed DäRF, that achieves robust NeRF reconstruction with a handful of real-world images by combining the strengths of NeRF and monocular depth estimation through online complementary training. Our framework imposes the MDE network's powerful geometry prior to NeRF representation at both seen and unseen viewpoints to enhance its robustness and coherence. In addition, we overcome the ambiguity problems of monocular depths through patch-wise scale-shift fitting and geometry distillation, which adapts the MDE network to produce depths aligned accurately with NeRF geometry. Experiments show our framework achieves state-of-the-art results both quantitatively and qualitatively, demonstrating consistent and reliable performance in both indoor and outdoor real-world datasets. Project page is available at https://ku-cvlab.github.io/DaRF/.
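The "patch-wise scale-shift fitting" from the abstract is, at its core, a 1-D least-squares alignment between monocular and NeRF depths. A minimal sketch (my own toy version, not the paper's code; `fit_scale_shift` is a name I made up), using the closed-form regression solution:

```python
def fit_scale_shift(mono_depth, nerf_depth):
    """Least-squares (s, t) so that s * mono_depth + t ~= nerf_depth.

    Closed-form 1-D linear regression; in DäRF this kind of fit is
    done per patch to absorb the scale/shift ambiguity of monodepth.
    """
    n = len(mono_depth)
    mx = sum(mono_depth) / n
    my = sum(nerf_depth) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(mono_depth, nerf_depth))
    var = sum((x - mx) ** 2 for x in mono_depth)
    s = cov / var
    t = my - s * mx
    return s, t

# Synthetic check: monodepth off by scale 2 and shift 0.5
mono = [1.0, 2.0, 3.0, 4.0]
nerf = [2.0 * d + 0.5 for d in mono]
s, t = fit_scale_shift(mono, nerf)
```

After the fit, the aligned monodepth can supervise NeRF geometry at both seen and unseen views.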

arXiv.org

D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers

Jianfeng He, Yuan Gao, Tianzhu Zhang, Zhe Zhang, Feng Wu

tl;dr: no idea how that works, hierarchical attention something. No eval on #IMC

https://openaccess.thecvf.com/content/CVPR2023/papers/He_D2Former_Jointly_Learning_Hierarchical_Detectors_and_Contextual_Descriptors_via_Agent-Based_CVPR_2023_paper.pdf
#CVPR2023
#computervision #deeplearning
#dmytrotweetsaboutDL

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence

Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, Trevor Darrell

tl;dr: diffusion features are good descriptors for semantic correspondences, if aggregated across timesteps.

https://arxiv.org/abs/2305.14334

#computervision #deeplearning
#dmytrotweetsaboutDL

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence

Diffusion models have been shown to be capable of generating high-quality images, suggesting that they could contain meaningful internal representations. Unfortunately, the feature maps that encode a diffusion model's internal information are spread not only over layers of the network, but also over diffusion timesteps, making it challenging to extract useful descriptors. We propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors that can be used for downstream tasks. These descriptors can be extracted for both synthetic and real images using the generation and inversion processes. We evaluate the utility of our Diffusion Hyperfeatures on the task of semantic keypoint correspondence: our method achieves superior performance on the SPair-71k real image benchmark. We also demonstrate that our method is flexible and transferable: our feature aggregation network trained on the inversion features of real image pairs can be used on the generation features of synthetic image pairs with unseen objects and compositions. Our code is available at \url{https://diffusion-hyperfeatures.github.io}.
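The core idea, aggregating per-timestep feature maps into one per-pixel descriptor, can be sketched as a softmax-weighted sum (a crude stand-in for the paper's learned aggregation network; all names here are mine):

```python
import math

def aggregate_hyperfeatures(feature_maps, mixing_logits):
    """Collapse per-timestep feature maps into one descriptor map
    via softmax-weighted summation.

    feature_maps: list of T maps, each a list of per-pixel feature vectors.
    mixing_logits: T scalars, one (learnable) weight per timestep.
    """
    exps = [math.exp(l) for l in mixing_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    n_pix = len(feature_maps[0])
    dim = len(feature_maps[0][0])
    out = [[0.0] * dim for _ in range(n_pix)]
    for w, fmap in zip(weights, feature_maps):
        for p in range(n_pix):
            for d in range(dim):
                out[p][d] += w * fmap[p][d]
    return out
```

With equal logits this reduces to plain averaging; the paper instead learns the mixing over both timesteps and layers.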


DiffHand: End-to-End Hand Mesh Reconstruction via Diffusion Models

Lijun Li, Li'an Zhuo, Bang Zhang, Liefeng Bo, Chen Chen

tl;dr: diffusion models can do mesh reconstruction.
https://arxiv.org/abs/2305.13705

#computervision #deeplearning
#dmytrotweetsaboutDL

DiffHand: End-to-End Hand Mesh Reconstruction via Diffusion Models

Hand mesh reconstruction from the monocular image is a challenging task due to its depth ambiguity and severe occlusion; there remains a non-unique mapping between the monocular image and hand mesh. To address this, we develop DiffHand, the first diffusion-based framework that approaches hand mesh reconstruction as a denoising diffusion process. Our one-stage pipeline utilizes noise to model the uncertainty distribution of the intermediate hand mesh in a forward process. We reformulate the denoising diffusion process to gradually refine noisy hand mesh and then select mesh with the highest probability of being correct based on the image itself, rather than relying on 2D joints extracted beforehand. To better model the connectivity of hand vertices, we design a novel network module called the cross-modality decoder. Extensive experiments on the popular benchmarks demonstrate that our method outperforms the state-of-the-art hand mesh reconstruction approaches by achieving 5.8mm PA-MPJPE on the Freihand test set and 4.98mm PA-MPJPE on the DexYCB test set.
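Treating mesh vertices as the diffused variable boils down to the standard DDPM identities: noise the clean coordinates in the forward process, then invert given a noise estimate. A toy 1-D sketch of those two identities (not DiffHand's network, which predicts the noise from the image):

```python
import math, random

def ddpm_forward(x0, eps, alpha_bar_t):
    """q(x_t | x_0): scale clean vertex coords and add Gaussian noise."""
    return [math.sqrt(alpha_bar_t) * v + math.sqrt(1.0 - alpha_bar_t) * e
            for v, e in zip(x0, eps)]

def predict_x0(xt, eps_pred, alpha_bar_t):
    """Invert the forward process given a noise estimate eps_pred."""
    return [(v - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
            for v, e in zip(xt, eps_pred)]

random.seed(0)
x0 = [0.1, -0.3, 0.7]                 # toy 1-D "vertex" coordinates
eps = [random.gauss(0.0, 1.0) for _ in x0]
xt = ddpm_forward(x0, eps, alpha_bar_t=0.5)
x0_rec = predict_x0(xt, eps, alpha_bar_t=0.5)  # oracle noise -> exact recovery
```

In the real pipeline the noise estimate comes from the image-conditioned network, and the refinement is iterated over timesteps.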


VanillaNet: the Power of Minimalism in Deep Learning

Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao

tl;dr: 4x4conv/4 -> n x {1x1conv -> seriesAct -> MaxPool2x2}.

seriesAct = stack of BN(ReLU(BN(ReLU)))

https://arxiv.org/abs/2305.12972
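The stage layout in the tl;dr can be sanity-checked by tracing spatial shapes: the 4x4 stride-4 stem and each 2x2 max-pool shrink the map, while 1x1 convs and the series activation keep its size (a shape-only sketch, ignoring channel widths; function name is mine):

```python
def vanillanet_shapes(h, w, n_stages):
    """Trace spatial sizes through the VanillaNet-style layout:
    4x4 stride-4 stem conv, then n stages of
    {1x1 conv -> series activation -> 2x2 max-pool}.
    """
    shapes = [(h, w)]
    h, w = h // 4, w // 4          # 4x4 conv, stride 4 (stem)
    shapes.append((h, w))
    for _ in range(n_stages):
        # 1x1 conv + activation preserve spatial size
        h, w = h // 2, w // 2      # 2x2 max-pool, stride 2
        shapes.append((h, w))
    return shapes
```

E.g. a 224x224 input with three pooled stages ends at 7x7 before the classifier head.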

#computervision #deeplearning
#dmytrotweetsaboutDL

VanillaNet: the Power of Minimalism in Deep Learning

At the heart of foundation models is the philosophy of "more is different", exemplified by the astonishing success in computer vision and natural language processing. However, the challenges of optimization and inherent complexity of transformer models call for a paradigm shift towards simplicity. In this study, we introduce VanillaNet, a neural network architecture that embraces elegance in design. By avoiding high depth, shortcuts, and intricate operations like self-attention, VanillaNet is refreshingly concise yet remarkably powerful. Each layer is carefully crafted to be compact and straightforward, with nonlinear activation functions pruned after training to restore the original architecture. VanillaNet overcomes the challenges of inherent complexity, making it ideal for resource-constrained environments. Its easy-to-understand and highly simplified architecture opens new possibilities for efficient deployment. Extensive experimentation demonstrates that VanillaNet delivers performance on par with renowned deep neural networks and vision transformers, showcasing the power of minimalism in deep learning. This visionary journey of VanillaNet has significant potential to redefine the landscape and challenge the status quo of foundation model, setting a new path for elegant and effective model design. Pre-trained models and codes are available at https://github.com/huawei-noah/VanillaNet and https://gitee.com/mindspore/models/tree/master/research/cv/VanillaNet.


MFT: Long-Term Tracking of Every Pixel

Michal Neoral, Jonáš Šerých, Jiří Matas

tl;dr: RAFT for consecutive frames + logarithmically spaced frame pairs. Also propagate uncertainty and occlusion: generate several hypotheses, select the least uncertain.
https://arxiv.org/abs/2305.12998

#computervision #deeplearning
#dmytrotweetsaboutDL

MFT: Long-Term Tracking of Every Pixel

We propose MFT -- Multi-Flow dense Tracker -- a novel method for dense, pixel-level, long-term tracking. The approach exploits optical flows estimated not only between consecutive frames, but also for pairs of frames at logarithmically spaced intervals. It selects the most reliable sequence of flows on the basis of estimates of its geometric accuracy and the probability of occlusion, both provided by a pre-trained CNN. We show that MFT achieves competitive performance on the TAP-Vid benchmark, outperforming baselines by a significant margin, and tracking densely orders of magnitude faster than the state-of-the-art point-tracking methods. The method is insensitive to medium-length occlusions and it is robustified by estimating flow with respect to the reference frame, which reduces drift.
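The selection rule can be sketched in a few lines: build flow-chain hypotheses from consecutive and logarithmically spaced frame offsets, reject occluded ones, and keep the least uncertain (a simplification of MFT's per-pixel logic; the 0.5 occlusion threshold and function names are my assumptions):

```python
def log_spaced_deltas(t, max_delta):
    """Reference-frame offsets 1, 2, 4, 8, ... used to build hypotheses."""
    d, deltas = 1, []
    while d <= min(t, max_delta):
        deltas.append(d)
        d *= 2
    return deltas

def select_flow_hypothesis(hypotheses):
    """Pick the flow chain with the lowest predicted uncertainty.

    hypotheses: list of (flow_vector, uncertainty, occlusion_prob).
    Occluded candidates are rejected; among the rest the least
    uncertain one wins.
    """
    visible = [h for h in hypotheses if h[2] < 0.5]
    if not visible:
        return None                     # pixel lost: no reliable chain
    return min(visible, key=lambda h: h[1])[0]
```

Chaining through a log-spaced offset lets a pixel "jump over" a medium-length occlusion instead of drifting through it frame by frame.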


Fast Monocular Scene Reconstruction with Global-Sparse Local-Dense Grids

Wei Dong, Chris Choy, Charles Loop, Or Litany, Yuke Zhu, Anima Anandkumar

tl;dr: monodepth + SfM to init the non-empty voxels of the grid, then densify and refine -> ScanNet scene in <30 min

https://arxiv.org/abs/2305.13220

#computervision #deeplearning
#dmytrotweetsaboutDL

Fast Monocular Scene Reconstruction with Global-Sparse Local-Dense Grids

Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstructions. The reliance on Multilayer Perceptrons (MLP), however, significantly limits speed in training and rendering. In this work, we propose to directly use signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits surfaces' spatial sparsity, enables cache-friendly queries, and allows direct extensions to multi-modal data such as color and semantic labels. To apply this representation to monocular scene reconstruction, we develop a scale calibration algorithm for fast geometric initialization from monocular depth priors. We apply differentiable volume rendering from this initialization to refine details with fast convergence. We also introduce efficient high-dimensional Continuous Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects. Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving comparable accuracy to state-of-the-art neural implicit methods.
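"Globally sparse, locally dense" means a hash map from block coordinates to small dense arrays: memory is only spent near surfaces, while lookups inside a block stay contiguous. A minimal SDF-only sketch (my own toy data structure, not the paper's CUDA implementation):

```python
BLOCK = 8  # voxels per side in a dense local block

class SparseDenseGrid:
    """Hash map from block coords to dense BLOCK^3 SDF arrays.
    Color or semantic labels would be extra channels in the same blocks.
    """
    def __init__(self):
        self.blocks = {}

    def _split(self, x, y, z):
        key = (x // BLOCK, y // BLOCK, z // BLOCK)
        local = (x % BLOCK, y % BLOCK, z % BLOCK)
        return key, local

    def set_sdf(self, x, y, z, value):
        key, (i, j, k) = self._split(x, y, z)
        block = self.blocks.setdefault(key, [0.0] * BLOCK ** 3)
        block[(i * BLOCK + j) * BLOCK + k] = value

    def get_sdf(self, x, y, z, default=1.0):
        key, (i, j, k) = self._split(x, y, z)
        block = self.blocks.get(key)
        if block is None:
            return default        # empty space: truncated SDF
        return block[(i * BLOCK + j) * BLOCK + k]
```

The monodepth + SfM initialization decides which blocks get allocated at all; refinement then only touches those.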


NeRFuser: Large-Scale Scene Representation by NeRF Fusion

Jiading Fang, Shengjie Lin, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Adrien Gaidon, Gregory Shakhnarovich, Matthew R. Walter

tl;dr: render->SuperGlue registration->weighted blend
https://arxiv.org/abs/2305.13307

#computervision #deeplearning
#dmytrotweetsaboutDL

NeRFuser: Large-Scale Scene Representation by NeRF Fusion

A practical benefit of implicit visual representations like Neural Radiance Fields (NeRFs) is their memory efficiency: large scenes can be efficiently stored and shared as small neural nets instead of collections of images. However, operating on these implicit visual data structures requires extending classical image-based vision techniques (e.g., registration, blending) from image sets to neural fields. Towards this goal, we propose NeRFuser, a novel architecture for NeRF registration and blending that assumes only access to pre-generated NeRFs, and not the potentially large sets of images used to generate them. We propose registration from re-rendering, a technique to infer the transformation between NeRFs based on images synthesized from individual NeRFs. For blending, we propose sample-based inverse distance weighting to blend visual information at the ray-sample level. We evaluate NeRFuser on public benchmarks and a self-collected object-centric indoor dataset, showing the robustness of our method, including to views that are challenging to render from the individual source NeRFs.
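The "sample-based inverse distance weighting" step can be sketched as a one-liner blend: each NeRF's contribution at a sample is weighted by the inverse of a distance term (a simplified scalar version of the paper's per-ray-sample rule; the function name and the exact distance definition here are my assumptions):

```python
def idw_blend(samples, eps=1e-6):
    """Inverse-distance-weighted blend of per-sample radiance from
    several NeRFs.

    samples: list of (value, distance) pairs, one per source NeRF;
    smaller distance -> larger weight.
    """
    weights = [1.0 / (d + eps) for _, d in samples]
    z = sum(weights)
    return sum(w * v for w, (v, _) in zip(weights, samples)) / z
```

Equidistant sources average evenly; a source far from the sample contributes almost nothing, so each region is dominated by the NeRF that actually observed it.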


Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, Chunhua Shen

tl;dr: segmented reference image of the same class -> use semantic correspondences to segment target image.
https://arxiv.org/abs/2305.13310

#computervision #deeplearning
#dmytrotweetsaboutDL

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild. Our code can be found at https://github.com/aim-uofa/Matcher.
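The correspondence-to-segmentation idea boils down to: for each target pixel, find its nearest reference pixel in feature space and copy that pixel's mask label. A bare-bones sketch (nearest-neighbor only; Matcher adds foundation-model features and SAM-based mask refinement on top):

```python
def transfer_mask(ref_feats, ref_mask, tgt_feats):
    """One-shot segmentation by feature matching: each target pixel
    takes the mask label of its nearest reference pixel in feature space.

    ref_feats/tgt_feats: lists of per-pixel feature vectors.
    ref_mask: per-pixel labels for the reference image.
    """
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    out = []
    for f in tgt_feats:
        j = min(range(len(ref_feats)), key=lambda i: d2(f, ref_feats[i]))
        out.append(ref_mask[j])
    return out
```

With good semantic features, nearest neighbors land on the same object part, so the reference mask transfers without any training.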
