As some of my other tests have hinted at, the #LX2160a in the #HoneyCombLX2 has reduced memory bandwidth when using threads on the same cluster. Max memory bandwidth is achieved at 1 thread per cluster.