You can run a transcription model and a language model (the AI you talk to) locally, but you will need a beefy GPU, especially if you want to run the larger models for better results.
OpenAI’s Whisper is open source and handles transcription, and you can run inference locally on language models like LLaMA (and its variants) or GPT4All. To store information long term (“AI memory”), you could use an open-source vector database, though I don’t have experience with this.
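To give a sense of what the “AI memory” piece does, here is a toy in-memory sketch of the core idea behind a vector database: store (embedding, text) pairs and retrieve the most similar entry by cosine similarity. The embeddings below are made-up two-dimensional vectors purely for illustration; a real setup would use a proper vector database and embeddings produced by a model.

```python
import math

class MemoryStore:
    """Toy in-memory 'vector database' sketch.

    Stores (embedding, text) pairs and retrieves the most similar
    entries by cosine similarity. A real project would use an actual
    vector database and model-generated embeddings instead.
    """

    def __init__(self):
        self.entries = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.entries.append((vector, text))

    def query(self, vector, k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0

        ranked = sorted(self.entries, key=lambda e: cosine(e[0], vector), reverse=True)
        return [text for _, text in ranked[:k]]

store = MemoryStore()
# These embeddings are fabricated for the example, not real model output.
store.add([1.0, 0.0], "user prefers short answers")
store.add([0.0, 1.0], "user's dog is named Rex")

# A query vector close to the first embedding retrieves the first memory.
print(store.query([0.9, 0.1]))  # → ['user prefers short answers']
```

The language model would then get the retrieved text pasted into its prompt as context, which is how most “long-term memory” setups work in practice.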
This is a camera shake effect that has been added before the speed ramping. Most video editors can do this effect.

Take a look at around 0:15 of the video: the pumping motion of the camera, synced to the beat, is a result of the speed ramping. When the clip is slowed down, you can still see the camera shakes, but they’re less apparent because they’re slow. Where the clip is sped up, the camera motion becomes much more apparent, because the speed ramping packs more camera movement (shake) into the same period of time.

I believe this is what’s happening in your reference clip. The camera shake looks synchronized to the music because the speed ramps up on every beat, making the shake more pronounced.

If you added the shake after the speed ramping, the “intensity” (really the frequency) of the motion would remain constant, and it wouldn’t look like the reference video.

Of course, there is no right or wrong order. You could also get there by adjusting your shake keyframes to match the beat, but if you’re looking to recreate that effect, shake-before-ramp is my best guess.
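The frequency argument can be sketched numerically: model the baked-in shake as a simple sinusoid, then “play back” the footage at different speeds and count how fast the shake appears to oscillate in the output. The specific numbers (2 Hz shake, 30 fps, 3x ramp) are made up for illustration.

```python
import math

SHAKE_HZ = 2.0  # hypothetical shake frequency baked into the source footage
FPS = 30        # hypothetical output frame rate

def shake(t):
    """Horizontal camera offset at source time t (a simple sinusoid)."""
    return math.sin(2 * math.pi * SHAKE_HZ * t)

def play(speed, seconds):
    """Sample the shaky footage at a given playback speed."""
    frames = []
    src_t = 0.0
    for _ in range(int(seconds * FPS)):
        frames.append(shake(src_t))
        src_t += speed / FPS  # higher speed -> source time advances more per output frame
    return frames

def zero_crossings(frames):
    """Count sign changes: a rough proxy for how fast the shake appears to oscillate."""
    return sum(1 for a, b in zip(frames, frames[1:]) if a * b < 0)

normal = play(speed=1.0, seconds=2.0)
ramped = play(speed=3.0, seconds=2.0)  # a 3x speed-up, as on a beat
print(zero_crossings(normal), zero_crossings(ramped))
```

The ramped segment shows roughly three times as many oscillations per output second, which is why shake added before the ramp “pumps” on the beat, while shake added after the ramp stays constant.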