N-Gram - Frequency Count and Phrase Mining
Counting approach on pre-processed text
Resulting set of words from pre-processing: “love” “friday” “hate” “monday” “monday” “turn” “21”
Resulting frequency counts:
“monday” = 2
“love” = 1
“friday” = 1
“hate” = 1
“turn” = 1
“21” = 1
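The frequency counts above can be reproduced in a few lines; this is a minimal sketch using Python's standard library (the token list is the pre-processed word set from above):

```python
from collections import Counter

# Resulting set of words from pre-processing
tokens = ["love", "friday", "hate", "monday", "monday", "turn", "21"]

# Count how often each token occurs
counts = Counter(tokens)
print(counts.most_common())
# [('monday', 2), ('love', 1), ('friday', 1), ('hate', 1), ('turn', 1), ('21', 1)]
```

`Counter.most_common()` sorts by frequency, which is why "monday" (count 2) comes first.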
N-grams
N-grams – multi-word phrases (they can also be multi-character, etc.)
An n-gram is a sequence of n consecutive tokens. Instead of stopping at whitespace tokenization, you slide a window of n words across the text, so the resulting tokens are multi-word phrases (for a bigram, two-word phrases).
Unigram (one-gram) – love
Bigram (two-gram) – love friday
Trigram (three-gram) – love friday hate
4-gram – love friday hate monday
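The sliding-window idea behind these examples can be sketched in a few lines of Python (the helper name `ngrams` is our own; libraries such as NLTK provide an equivalent):

```python
def ngrams(tokens, n):
    """Slide a window of size n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["love", "friday", "hate", "monday"]
print(ngrams(tokens, 1))  # unigrams: [('love',), ('friday',), ('hate',), ('monday',)]
print(ngrams(tokens, 2))  # bigrams: [('love', 'friday'), ('friday', 'hate'), ('hate', 'monday')]
print(ngrams(tokens, 4))  # the single 4-gram: [('love', 'friday', 'hate', 'monday')]
```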
Cons: How we choose "n" is important → a given value of n may work well for one set of words but won't work for another.
Solution → Phrase mining
Phrase mining
Phrase mining refers to the process of automatically extracting high-quality phrases (e.g., scientific terms and general entity names) from a given corpus (e.g., research papers or news). Representing the text with quality phrases instead of n-grams can improve computational models for applications such as information extraction/retrieval, taxonomy construction, and topic modeling.
POS-guided phrasal segmentation – Part of speech
POS-guided phrasal segmentation takes actual sentences from, say, news articles and segments them using part-of-speech information.
The first example sentence is "US President Barack Obama speaks at a town hall meeting with CNN's Anderson Cooper." Computationally, the method uses part-of-speech (POS) taggers, which are very common in data science: they identify, within a sentence, which word is a noun, which is a verb, which is an adjective, et cetera. Phrase mining, at this stage without consulting Wikipedia, segments the sentence by parts of speech and then combines what it finds, weighting nouns very highly, e.g., "US President Barack Obama" and "Anderson Cooper."
It then mathematically combines this (we won't go too far into the mathematics) with a positive pool of phrases drawn from Wikipedia entries, and produces a confidence score, shown in the middle of the figure in the box labeled "robust positive-only distant training." For example, it might say with 0.9999 (99.99%) confidence that "US President" is a quality phrase, with 98% confidence that "Anderson Cooper" is a quality phrase, and so on; whereas "speaks at" might get only 0.2–0.3 (20–30%) confidence.
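The combination step can be illustrated with a toy sketch. This is not the actual phrase-mining algorithm; the POS weights, the pool membership check, and the 50/50 mixing are all illustrative assumptions, and the "positive pool" here is a tiny stand-in for real Wikipedia entries:

```python
# Illustrative POS weights: nouns/proper nouns weighted very highly,
# verbs and prepositions much lower (assumed values, not from the lecture).
POS_WEIGHT = {"NOUN": 1.0, "PROPN": 1.0, "VERB": 0.3, "ADP": 0.1}

# Stand-in for the positive pool of known quality phrases (Wikipedia entries).
POSITIVE_POOL = {"us president", "anderson cooper"}

def phrase_score(words_with_tags):
    """Combine an average POS weight with positive-pool membership."""
    pos_part = sum(POS_WEIGHT.get(tag, 0.2) for _, tag in words_with_tags) / len(words_with_tags)
    phrase = " ".join(word.lower() for word, _ in words_with_tags)
    pool_part = 1.0 if phrase in POSITIVE_POOL else 0.0
    return 0.5 * pos_part + 0.5 * pool_part  # assumed equal mixing

print(phrase_score([("US", "PROPN"), ("President", "PROPN")]))  # high score: quality phrase
print(phrase_score([("speaks", "VERB"), ("at", "ADP")]))        # low score: not a quality phrase
```

The point of the sketch is only the shape of the computation: noun-heavy candidates that also appear in the positive pool score near the top, while verb/preposition fragments like "speaks at" score near the bottom.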
This phrase mining exercise shows that it is a cutting-edge technique: it gives a much better picture than choosing an arbitrary value of n and using n-grams.