Just implemented something crazy, No clue how to make it converge or whether it's even feasible 😅

Env: Cartpole + continuous action space + image observation + Frame stacking

Policy: Use CNN encoded features and predicts the state, and weights of a LQR system (A, B, Q, R) with MLP -> solves the LQR over a horizon using differentiable convex optimization with constraint on actions, output is the first optimal action.

Loss: Policy gradient on rewards

#ReinforcementLearning #OptimalControl