Writing properly parallelized code is still a barrier for most data scientists out there, but in some cases it is a must. We have to apply a doubly constrained aggregate MNL destination choice model for 1,250 demand segments on a 12,000-zone model, which is completely out of the ordinary (and looks pretty ludicrous at first glance).
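For context, the core of the computation is just a row-wise logit: for each demand segment, the probability that a trip produced in origin zone i ends in destination zone j is exp(V_ij), optionally weighted by a destination size term, normalised over all destinations (the doubly constrained part then adds iterative balancing on top of that). A minimal NumPy sketch of that probability step, with hypothetical names for the utility matrix and size term (not my actual code), looks roughly like this:

```python
import numpy as np

def mnl_probabilities(utility, size_term=None):
    """Row-wise MNL destination choice probabilities for one demand segment.

    `utility` is a (zones, zones) systematic utility matrix and `size_term`
    an optional destination attraction vector -- names are illustrative only.
    """
    v = utility - utility.max(axis=1, keepdims=True)   # max-shift to stabilise exp()
    ev = np.exp(v)
    if size_term is not None:
        ev = ev * size_term[np.newaxis, :]
    return ev / ev.sum(axis=1, keepdims=True)
```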
Just computing the probability matrices takes about 3 hours in pure Python with a lot of NumPy magic, while the latest version of Larch does it in 5 hours. A first stab at implementing it in Cython brings that down to 1 hour on a modern 6-core laptop, and to about 30 minutes on a proper workstation. I knew it was going to be good, but the results still blew me away a little... Does anyone have interest in that code? I could look into making it open source.
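To give a flavour of where the Cython speed-up comes from (this is a sketch along those lines, not the actual code): the probability rows are independent, so the loop over origin zones can drop the GIL and be spread across cores with `prange`. Something like the following, compiled with OpenMP enabled:

```cython
# cython: boundscheck=False, wraparound=False
# Compile with OpenMP (e.g. -fopenmp) so prange actually runs in parallel.
from cython.parallel import prange
from libc.math cimport exp

def mnl_probabilities(double[:, ::1] utility, double[:, ::1] out):
    """Fill `out` with row-wise MNL probabilities; origin rows run in parallel."""
    cdef Py_ssize_t n = utility.shape[0], m = utility.shape[1]
    cdef Py_ssize_t i, j
    cdef double vmax, denom
    for i in prange(n, nogil=True, schedule='static'):
        vmax = utility[i, 0]
        for j in range(1, m):
            if utility[i, j] > vmax:
                vmax = utility[i, j]          # max-shift for numerical stability
        denom = 0.0
        for j in range(m):
            out[i, j] = exp(utility[i, j] - vmax)
            denom = denom + out[i, j]         # plain assignment keeps denom thread-private
        for j in range(m):
            out[i, j] = out[i, j] / denom
```

The per-segment matrices are independent too, so on a workstation you can additionally farm segments out across processes while each one uses the threaded kernel above.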