Anthropic LLM Interpretability Paper – Abstract Concepts as Feature Sets
So the black-box problem of NLP models is beginning to be probed here, with some interesting results. This post simply stores my notes on the paper and on a video by bycloud that walks through this interpretability work.
Paper:
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Video: