Just implemented something crazy. No clue how to make it converge, or whether it's even feasible 😅
Env: CartPole + continuous action space + image observations + frame stacking
Policy: A CNN encodes the stacked frames, then an MLP predicts the state and the matrices of an LQR system (A, B, Q, R) -> solve the LQR over a horizon using differentiable convex optimization with constraints on the actions; the output is the first optimal action.
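Not my actual implementation — the real thing would solve the constrained QP with a differentiable convex-optimization layer (e.g. cvxpylayers) so gradients flow through the solve. But here's a minimal numpy sketch of the LQR piece: a finite-horizon backward Riccati recursion that returns the first action, with the action bound approximated by a plain clip. The dynamics matrices below are made up for illustration, not CartPole's linearization.

```python
import numpy as np

def lqr_first_action(A, B, Q, R, x0, horizon=20, u_max=1.0):
    # Finite-horizon LQR via backward Riccati recursion.
    # P is the cost-to-go matrix; K is the feedback gain at the
    # earliest stage after the loop finishes.
    P = Q.copy()
    K = None
    for _ in range(horizon):
        # K_t = (R + B'PB)^{-1} B'PA, then propagate cost-to-go.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    u0 = -K @ x0
    # Crude stand-in for the hard action constraint the convex
    # solver would enforce exactly.
    return np.clip(u0, -u_max, u_max)

# Hypothetical linearized dynamics: 4 states, 1 action.
A = np.eye(4) + 0.02 * np.diag([1.0, 1.0, 1.0], k=1)
B = 0.02 * np.array([[0.0], [1.0], [0.0], [1.0]])
Q = np.eye(4)
R = 0.1 * np.eye(1)
x0 = np.array([0.1, 0.0, 0.05, 0.0])

u = lqr_first_action(A, B, Q, R, x0)
```

In the full pipeline, A, B, Q, R and x0 would all come out of the CNN/MLP heads instead of being hand-set like this.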
Loss: policy gradient on the episode rewards
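Since the LQR output is deterministic, policy gradient needs some noise around it — one common choice (an assumption here, not stated in the post) is a Gaussian centered on the controller's action. A tiny numpy sketch of the resulting REINFORCE score-function gradient with respect to the action mean:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(mean_actions, taken_actions, returns, sigma=0.1):
    # Score-function (REINFORCE) gradient of expected return w.r.t.
    # the policy mean, for a Gaussian policy N(mean, sigma^2):
    #   grad_mean log pi(a | mean) = (a - mean) / sigma^2
    score = (taken_actions - mean_actions) / sigma**2
    # Weight each step's score by its return, average over the batch.
    return (score * returns[:, None]).mean(axis=0)

# Toy batch: 8 steps, 1-D actions sampled around the LQR output.
means = np.zeros((8, 1))            # would be the LQR first actions
acts = means + 0.1 * rng.standard_normal((8, 1))
rets = rng.standard_normal(8)       # would be discounted returns

g = reinforce_grad(means, acts, rets)
```

That gradient would then be backpropagated through the LQR solve into the CNN/MLP — which is exactly where the differentiable convex layer earns its keep.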