A timeline of the latest AI models for audio generation, starting in 2023
github: https://github.com/archinetai/audio-ai-timeline

RT @[email protected]
Text-to-motion is a thing now 🤯 https://huggingface.co/spaces/vumichien/generate_human_motion
This demo is built as part of our community sprint, where we build demos for cutting-edge models. You can join us on Discord here 👉 http://hf.co/join/discord
🐦🔗: https://twitter.com/mervenoyann/status/1620387099672473600

Generate Human Motion - a Hugging Face Space by vumichien

AK on Twitter
“Looped Transformers as Programmable Computers
abs: https://t.co/wZTUGiY7vk”
RT @[email protected]
Thank you, AK!
The contrastive language-audio pretraining (CLAP) latents enable AudioLDM to learn to regenerate audio during training while performing text-to-audio generation at sampling time. AudioLDM is trained on a single GPU and is advantageous in sample quality and audio manipulation. https://twitter.com/_akhaliq/status/1620239832856363009
🐦🔗: https://twitter.com/ZehuaChenICL/status/1620258287987077121

AK on Twitter
“AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
abs: https://t.co/G6568wgwky
project page: https://t.co/L1jLVcPTdz”
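The training/sampling swap described in the thread — condition on CLAP audio embeddings during training, then swap in CLAP text embeddings from the same shared space at sampling time — can be illustrated with a deliberately tiny toy. Here the "denoiser" is just a linear least-squares map and the shared-space encoders are faked; none of the names or shapes below are AudioLDM's actual architecture.

```python
import numpy as np

def fit_toy_denoiser(latents, audio_conds, sigma=1.0, seed=0):
    """Caricature of AudioLDM's training trick: recover clean audio latents
    from noisy ones, conditioned on CLAP *audio* embeddings, so no text
    captions are needed at training time. The 'denoiser' here is a linear
    least-squares map; real AudioLDM uses a latent diffusion model."""
    rng = np.random.default_rng(seed)
    noisy = latents + sigma * rng.normal(size=latents.shape)
    X = np.concatenate([noisy, audio_conds], axis=1)
    W, *_ = np.linalg.lstsq(X, latents, rcond=None)
    return W

def toy_sample(W, text_cond, latent_dim, seed=1):
    """At sampling time, swap in a CLAP *text* embedding from the shared
    space and 'denoise' pure noise into a latent."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=latent_dim)
    return np.concatenate([z, text_cond]) @ W
```

The only point of the toy is the conditioning swap: training never sees text, yet text works at sampling time because both encoders land in one embedding space.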
Max Woolf on Twitter
“@_akhaliq 16GB of memory ought to be enough for anybody.”
when you get "Memory limit exceeded (16G)" 😢
Sample Efficient Deep Reinforcement Learning via Local Planning
abs: https://arxiv.org/abs/2301.12579

Sample Efficient Deep Reinforcement Learning via Local Planning
The focus of this work is sample-efficient deep reinforcement learning (RL)
with a simulator. One useful property of simulators is that it is typically
easy to reset the environment to a previously observed state. We propose an
algorithmic framework, named uncertainty-first local planning (UFLP), that
takes advantage of this property. Concretely, in each data collection
iteration, with some probability, our meta-algorithm resets the environment to
an observed state which has high uncertainty, instead of sampling according to
the initial-state distribution. The agent-environment interaction then proceeds
as in the standard online RL setting. We demonstrate that this simple procedure
can dramatically reduce the sample cost of several baseline RL algorithms on
difficult exploration tasks. Notably, with our framework, we can achieve
super-human performance on the notoriously hard Atari game, Montezuma's
Revenge, with a simple (distributional) double DQN. Our work can be seen as an
efficient approximate implementation of an existing algorithm with theoretical
guarantees, which offers an interpretation of the positive empirical results.
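The reset-to-uncertain-state loop described in the abstract can be sketched as follows. The toy chain environment, the count-based uncertainty bonus, and all function names are illustrative stand-ins, not the authors' implementation.

```python
import random

class ChainEnv:
    """Toy 10-state chain simulator; the goal is the rightmost state."""
    def __init__(self, n=10):
        self.n, self.s = n, 0
    def reset(self):
        self.s = 0                      # initial-state distribution (here: fixed)
        return self.s
    def reset_to(self, state):
        self.s = state                  # simulators can restore any saved state
        return self.s
    def step(self, action):             # action is -1 (left) or +1 (right)
        self.s = max(0, min(self.n - 1, self.s + action))
        done = self.s == self.n - 1
        return self.s, float(done), done

def uflp_episode(env, policy, observed, uncertainty, reset_prob=0.5, max_steps=50):
    """One UFLP data-collection episode (a sketch, not the authors' code):
    with probability reset_prob, restart from the most uncertain previously
    observed state; otherwise restart from the initial-state distribution."""
    if observed and random.random() < reset_prob:
        env.reset_to(max(observed, key=uncertainty))
    else:
        env.reset()
    traj = []
    for _ in range(max_steps):
        s = env.s
        a = policy(s)
        s2, r, done = env.step(a)
        traj.append((s, a, r, s2))
        observed.append(s2)             # grow the pool of resettable states
        if done:
            break
    return traj
```

Here `uncertainty` could be any bonus, e.g. an inverse visit count; the paper pairs the meta-algorithm with standard value-based agents such as distributional double DQN.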

A theory of continuous generative flow networks
Generative flow networks (GFlowNets) are amortized variational inference
algorithms that are trained to sample from unnormalized target distributions
over compositional objects. A key limitation of GFlowNets to date has been
that they are restricted to discrete spaces. We present a theory for
generalized GFlowNets, which encompasses both existing discrete GFlowNets and
ones with continuous or hybrid state spaces, and perform experiments with two
goals in mind. First, we illustrate critical points of the theory and the
importance of various assumptions. Second, we empirically demonstrate how
observations about discrete GFlowNets transfer to the continuous case and show
strong results compared to non-GFlowNet baselines on several previously studied
tasks. This work greatly widens the perspectives for the application of
GFlowNets in probabilistic inference and various modeling settings.

Adaptive Computation with Elastic Input Sequence
Humans have the ability to adapt the type of information they use, the
procedure they employ, and the amount of time they spend when solving problems.
However, most standard neural networks have a fixed function type and
computation budget regardless of the sample's nature or difficulty. Adaptivity
is a powerful paradigm as it not only imbues practitioners with flexibility
pertaining to the downstream usage of these models but can also serve as a
powerful inductive bias for solving certain challenging classes of problems. In
this work, we introduce a new approach called AdaTape, which allows for dynamic
computation in neural networks through adaptive tape tokens. AdaTape utilizes
an elastic input sequence by equipping an architecture with a dynamic
read-and-write tape. Specifically, we adaptively generate input sequences using
tape tokens obtained from a tape bank which can be either trainable or derived
from input data. We examine the challenges and requirements to obtain dynamic
sequence content and length, and propose the Adaptive Tape Reading (ATR)
algorithm to achieve both goals. Through extensive experiments on image
recognition tasks, we show that AdaTape achieves better performance at a
comparable computational cost. To facilitate further research, we have
released code at https://github.com/google-research/scenic.
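The elastic-input idea — append tape tokens from a bank until a halting criterion fires — can be sketched in a few lines. This greedy loop and its sigmoid halting score are an illustration only; the paper's Adaptive Tape Reading algorithm differs in detail, and every name below is hypothetical.

```python
import numpy as np

def adaptive_tape_reading(x, tape_bank, budget=8, threshold=0.9):
    """Sketch of ATR-style adaptive token selection (illustrative, not the
    paper's algorithm). Greedily appends the bank token most similar to a
    summary of the current sequence, accumulating a sigmoid halting score,
    so easy inputs stop early and hard ones use more tape tokens."""
    seq = [row for row in x]                    # input tokens (vectors)
    score = 0.0
    for _ in range(budget):
        query = np.mean(seq, axis=0)            # summary of the sequence so far
        sims = tape_bank @ query                # similarity to each bank token
        best = int(np.argmax(sims))
        seq.append(tape_bank[best])             # elastic input: append a tape token
        score += 1.0 / (1.0 + np.exp(-sims[best]))
        if score >= threshold:                  # adaptive halting
            break
    return np.stack(seq)
```

In the sketch the tape bank is just a matrix of vectors; in AdaTape it can be trainable or derived from the input itself.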

SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation
abs: https://arxiv.org/abs/2301.13156
github: https://github.com/fudan-zvg/SeaFormer

SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation
Since the introduction of Vision Transformers, the landscape of many computer
vision tasks (e.g., semantic segmentation), long dominated by CNNs, has
recently been significantly reshaped. However, the computational cost and
memory requirements render these methods unsuitable for mobile devices,
especially for the high-resolution per-pixel semantic segmentation task. In
this paper, we introduce a new method, the squeeze-enhanced Axial Transformer
(SeaFormer), for mobile semantic segmentation. Specifically,
we design a generic attention block characterized by squeeze Axial attention
and detail enhancement. It can be further used to create a family of
backbone architectures with superior cost-effectiveness. Coupled with a light
segmentation head, we achieve the best trade-off between segmentation accuracy
and latency on ARM-based mobile devices on the ADE20K and Cityscapes
datasets. Critically, we beat both the mobile-friendly rivals and
Transformer-based counterparts with better performance and lower latency
without bells and whistles. Beyond semantic segmentation, we further apply the
proposed SeaFormer architecture to the image classification problem,
demonstrating its potential to serve as a versatile mobile-friendly backbone.
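The squeeze-then-attend idea behind the block can be illustrated with a toy single-head version: pool the feature map along each axis, attend over the cheap 1-D sequences, and broadcast back. The mean pooling and missing learned projections are simplifications of mine, not SeaFormer's actual block.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squeeze_axial_attention(feat):
    """Toy single-head squeeze-axial attention over an (H, W, C) feature map.
    Mean-pool the map into an (H, C) column and a (W, C) row, run plain 1-D
    self-attention on each, and broadcast the results back: roughly
    O(H^2 + W^2) work instead of O((HW)^2). Illustration of the squeeze idea
    only; SeaFormer adds learned projections, multiple heads, and a
    convolutional detail-enhancement branch."""
    H, W, C = feat.shape
    col = feat.mean(axis=1)                     # vertical squeeze: (H, C)
    row = feat.mean(axis=0)                     # horizontal squeeze: (W, C)
    def attend(x):                              # plain dot-product self-attention
        return softmax(x @ x.T / np.sqrt(C)) @ x
    return feat + attend(col)[:, None, :] + attend(row)[None, :, :]
```

The payoff is the complexity: for a 64×64 map, full attention mixes 4096 tokens pairwise, while the squeezed version attends over two sequences of 64.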