LDA Topic Modelling
LDA (Latent Dirichlet Allocation) modelling is a way to summarize text or to detect the topics that run through a body of text.
“Topic modeling algorithms are statistical methods that analyze the words of the original text to discover the themes that run through them, how those themes are connected to each other, and how they change over time.”
Understanding:
Think of how a computer envisions how documents are written.
Right-hand side → the words that are included in each topic box
Left-hand side → the percentages that the computer believes each of these topics should make up in the documents themselves
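This "topic boxes plus percentages" picture is exactly LDA's generative story: each topic is a probability distribution over words, each document is a mixture of topics, and a document's words are drawn by repeatedly picking a topic box and then a word from it. A minimal sketch of that story (the vocabulary, topic distributions, and all numbers below are made-up toy values, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and K = 2 illustrative topics.
vocab = ["data", "number", "computer", "brain", "neuron", "nerve"]
# Each topic is a probability distribution over the vocabulary (rows sum to 1).
topics = np.array([
    [0.30, 0.30, 0.30, 0.04, 0.03, 0.03],   # a "technology"-like topic
    [0.03, 0.04, 0.03, 0.30, 0.30, 0.30],   # a "nervous system"-like topic
])

# Generate one document: draw its topic mixture (the left-hand-side
# percentages), then draw each word by first picking a topic box and
# then picking a word from that box (the right-hand side).
doc_topic_mix = rng.dirichlet(alpha=[1.0, 1.0])
words = []
for _ in range(8):
    t = rng.choice(2, p=doc_topic_mix)        # pick a topic box
    w = rng.choice(len(vocab), p=topics[t])   # pick a word from that box
    words.append(vocab[w])
print(doc_topic_mix, words)
```

Fitting LDA is the inverse problem: we only observe the words and must recover the topic boxes and mixtures, which is what the steps below do.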
Steps:
1. Go through each document, and randomly assign each word in the document to one of the K topics. Say we have 3 topics. So we go through each document and randomly assign each word to one of these 3 topics.
2. Note that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
3. To improve on these assignments, i.e. to make sure that the words are actually in the right topic, for each document d:
a) Go through each word w in d (assuming all other topic assignments are correct).
For each topic t, compute two things:
i) X = p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t
ii) Y = p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w
b) Reassign w to a new topic, choosing topic t with probability X*Y = the probability that topic t generated word w
4. In the previous step, we assume that all topic assignments except for the current word in question are correct, and then we update the assignment of the current word using our model of how documents are generated.
5. After repeating the previous step a large number of times, we eventually reach a rough steady state where the assignments are pretty good.
6. The human still needs to determine what the topics are
Data, number, computer → technology
Brain, neuron, nerve → nervous system
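Steps 1–5 above can be sketched as a minimal collapsed Gibbs sampler. This is an illustration, not a production implementation: the function and variable names are my own, and the `alpha`/`beta` smoothing terms are standard Dirichlet-prior constants that the notes do not mention (without them a count of zero would make some probabilities exactly zero):

```python
import numpy as np

def lda_gibbs(docs, K, vocab_size, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of integer word ids.
    Returns (doc-topic counts, topic-word counts).
    """
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each word in each document to one of K topics.
    z = [rng.integers(K, size=len(d)) for d in docs]
    ndk = np.zeros((len(docs), K))    # words in doc d assigned to topic k
    nkw = np.zeros((K, vocab_size))   # assignments of word w to topic k
    nk = np.zeros(K)                  # total words assigned to topic k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    # Steps 3-5: repeatedly resample each word's topic, holding all
    # other assignments fixed.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Temporarily remove the current word's assignment.
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # X = p(topic t | doc d), Y = p(word w | topic t);
                # sample the new topic with probability proportional to X*Y.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# Toy corpus over the 6-word vocabulary above (word ids 0-5).
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1], [4, 5, 3]]
ndk, nkw = lda_gibbs(docs, K=2, vocab_size=6)
```

After sampling, `nkw` gives each topic's word counts (step 6's human then reads off the top words per topic and names them) and `ndk` gives each document's topic proportions.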