The Neuron Explanation project aims to understand what deep neural networks (e.g., CNNs and LLMs) learn during training by analyzing which concepts (e.g., cat, building) individual neurons recognize. Specifically, we associate with each neuron a logical rule (e.g., (Cat OR Dog) AND NOT Person) that expresses the (spatial) alignment between the neuron's activations and the locations of the concepts. These rules are typically extracted by combining search algorithms, clustering algorithms, and statistical analysis of the neuron's activations.
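To make the alignment idea concrete, here is a minimal sketch of how a logical rule over concept masks can be scored against a neuron's activations via intersection over union (IoU). The percentile threshold, the tuple encoding of formulas, and the toy data are illustrative assumptions, not the project's actual pipeline or data structures.

```python
import numpy as np

def binarize_activations(activations, percentile=99.5):
    """Binarize a neuron's activation maps with a high-percentile threshold
    (the specific percentile is an assumption for this sketch)."""
    threshold = np.percentile(activations, percentile)
    return activations > threshold

def formula_mask(concept_masks, formula):
    """Evaluate a logical formula over binary concept masks.
    `formula` is a nested tuple, e.g. ("AND", ("OR", "cat", "dog"), ("NOT", "person")).
    This encoding is illustrative only."""
    if isinstance(formula, str):
        return concept_masks[formula]
    op, *args = formula
    if op == "NOT":
        return ~formula_mask(concept_masks, args[0])
    left = formula_mask(concept_masks, args[0])
    right = formula_mask(concept_masks, args[1])
    return (left | right) if op == "OR" else (left & right)

def iou(neuron_mask, concept_mask):
    """Intersection over union between the neuron's firing regions
    and the regions covered by the concept formula."""
    intersection = np.logical_and(neuron_mask, concept_mask).sum()
    union = np.logical_or(neuron_mask, concept_mask).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: 10 images with 7x7 spatial maps for one neuron.
rng = np.random.default_rng(0)
activations = rng.random((10, 7, 7))
concept_masks = {
    name: rng.random((10, 7, 7)) > 0.7  # fake segmentation masks
    for name in ("cat", "dog", "person")
}
neuron_mask = binarize_activations(activations)
rule = ("AND", ("OR", "cat", "dog"), ("NOT", "person"))
score = iou(neuron_mask, formula_mask(concept_masks, rule))
print(f"IoU((cat OR dog) AND NOT person) = {score:.3f}")
```

In practice, a search procedure (e.g., beam search over candidate formulas) would compute such a score for many rules and keep the ones that best align with the neuron's activations.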
Previous Publications
[1] La Rosa, Biagio, Leilani Gilpin, and Roberto Capobianco. “Towards a fuller understanding of neurons with Clustered Compositional Explanations.” NeurIPS 2023. (LINK)
Related Readings
[1] Mu, Jesse, and Jacob Andreas. “Compositional explanations of neurons.” Advances in Neural Information Processing Systems 33 (2020): 17153-17163. (LINK)
[2] Bau, David, et al. “Network dissection: Quantifying interpretability of deep visual representations.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. (LINK)