The thing is, it's a common misconception that neural networks are somehow intrinsically related to linear algebra. Matrix multiplications are just a convenient way to build functions with lots of tuneable degrees of freedom. Just as in finite element simulations, sparse matrices tend to hint that the underlying problem is more graph-based in nature. While I have no way to prove this, I've strongly suspected for a while that most of the weights in dense-matrix deep learning models don't actually have an effect on the output, and that we've been unnecessarily burning cycles to compute their products. The trouble, of course, is figuring out which ones are useful and which ones aren't.

This is the idea behind "Learning both Weights and Connections for Efficient Neural Networks" (2015), and yes, it works: https://arxiv.org/abs/1506.02626

How to figure out which ones are useful and which ones aren't? Why, you can try the simplest thing that could possibly work. Quoting the paper: "All connections with weights below a threshold are removed from the network". Is that all? Yes it is.
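To make the quoted rule concrete, here is a rough sketch of just the thresholding step in plain Python. The function names, the toy 3x3 weight matrix and the 0.05 threshold are all made up for illustration and are not taken from the paper. Note that the paper also retrains the surviving connections after pruning, which this sketch leaves out.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below `threshold`."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


def matvec(weights, x):
    """Dense matrix-vector product, used to compare outputs before and after pruning."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical layer weights: a few large entries and several near-zero ones.
weights = [
    [0.80, -0.02, 0.01],
    [0.03, -0.65, 0.004],
    [-0.01, 0.02, 0.90],
]

pruned = prune_by_magnitude(weights, threshold=0.05)  # threshold is an arbitrary choice

x = [1.0, 1.0, 1.0]
print("sparsity after pruning:", sparsity(pruned))
print("dense output: ", matvec(weights, x))
print("pruned output:", matvec(pruned, x))
```

On a real model the zeroed entries only save compute if the matrix is then stored and multiplied in a sparse format; zeroing them in a dense array, as above, just shows which connections would be dropped.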

Interesting research; I would have suspected a snowball effect where those minuscule weights add up to significant changes in the end, but it seems not.
