Text Summarization
Summarizing text for topics → short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgements.
Text pre-processing
It can be defined as bringing “your text into a form that is predictable and analysable for your task.”
Steps to pre-process text:
- Collect text data
Private/public databases
Internet Social Data APIs
- Tokenize the text
For example:
Sentence: I love Fridays and hate Mondays, but this Monday I turn 21!
White-space word tokenization: “I” “love” “Fridays” “and” “hate” “Mondays,” “but” “this” “Monday” “I” “turn” “21!”
- Normalize the tokens
Normalized version (lowercased, punctuation stripped): “i” “love” “fridays” “and” “hate” “mondays” “but” “this” “monday” “i” “turn” “21”
- Remove stop words
Example: “love” “fridays” “hate” “mondays” “monday” “turn” “21”
- Stem/Lemmatize tokens
For example: running, ran, runs → run, run, run
Lemmatization takes stemming one step further by recognizing that several different words, including irregular forms, share the same root word (the lemma).
For example: better, best, good → good, good, good
Choosing stemming for our previous example: “love” “friday” “hate” “monday” “monday” “turn” “21”
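The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stop-word list is a tiny hand-picked set chosen to match the example, and toy_stem is a hypothetical stand-in that only strips a trailing plural "s". A real stemmer or lemmatizer from an NLP library would handle far more cases, such as "running" or "ran".

```python
import string

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "and", "but", "this"}


def tokenize(text):
    """White-space tokenization: split on runs of whitespace."""
    return text.split()


def normalize(tokens):
    """Lowercase each token and strip leading/trailing punctuation."""
    cleaned = (t.lower().strip(string.punctuation) for t in tokens)
    return [t for t in cleaned if t]  # drop tokens that were only punctuation


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Run tokenize -> normalize -> remove stop words -> stem."""
    tokens = tokenize(text)
    tokens = normalize(tokens)
    tokens = remove_stop_words(tokens)
    return [toy_stem(t) for t in tokens]


sentence = "I love Fridays and hate Mondays, but this Monday I turn 21!"
print(preprocess(sentence))
```

Each function corresponds to one step, so a single stage (for example, the stemmer) can be swapped for a library implementation without touching the rest of the pipeline.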