You need to multiply two large matrices, A and B, where A is M x N and B is N x P, so the product C = AB is M x P. Multiple machines are available for distributed computing. Which approach would likely yield the BEST performance improvement?
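To make the setup concrete, here is a minimal sketch of one candidate approach: splitting A into row blocks, having each worker multiply its block by the full B, and stacking the partial results. It is an illustration under stated assumptions, not a claim about which answer is best. The names `matmul`, `split_rows`, and `distributed_matmul` are hypothetical, and local threads stand in for separate machines, so the sketch shows the partitioning logic only and will not show a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain product of a (m x n) and b (n x p), both given as lists of rows."""
    n, p = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(n)) for j in range(p)] for row in a]


def split_rows(a, num_workers):
    """Split the rows of a into at most num_workers contiguous, near-equal blocks."""
    base, extra = divmod(len(a), num_workers)
    blocks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        if size:  # skip empty blocks when there are more workers than rows
            blocks.append(a[start:start + size])
        start += size
    return blocks


def distributed_matmul(a, b, num_workers=4):
    """Each worker multiplies one row block of a by the full b.

    Assumption: threads stand in for machines. On a real cluster, each
    block of a (and a copy of b) would be sent to a different node.
    """
    blocks = split_rows(a, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in input order, so stacking is safe.
        partials = list(pool.map(lambda blk: matmul(blk, b), blocks))
    return [row for part in partials for row in part]


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]        # M x N = 3 x 2
    B = [[7, 8, 9], [10, 11, 12]]       # N x P = 2 x 3
    print(distributed_matmul(A, B, num_workers=2))
```

Each row block of the result depends only on the matching rows of A and on all of B, so the workers need no communication with each other during the computation. The cost is that every worker needs a full copy of B, which is the trade-off to weigh against schemes that also partition B.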