This is a demonstration of the various tokenizers provided by NLTK 2.0.4.
Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in the tokenize module. This demo shows how 5 of them work.
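To make the three levels of granularity concrete, here is a minimal sketch that uses only Python's standard library, not NLTK. The sample text and the splitting rules are assumptions chosen for illustration; the sentence rule in particular is deliberately naive and does not handle abbreviations the way a trained sentence tokenizer would.

```python
import re

# Hypothetical sample text: two paragraphs, three sentences.
text = "Tokenizers split text. Some split it into sentences.\n\nOthers split it into words."

# Paragraph tokens: split on blank lines.
paragraphs = re.split(r"\n\s*\n", text.strip())

# Sentence tokens: naive split after '.', '!' or '?' followed by whitespace.
sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p)]

# Word tokens: split on whitespace only, so punctuation stays attached.
words = [w for s in sentences for w in s.split()]

print(len(paragraphs), len(sentences), len(words))
print(words[:3])
```

Note that whitespace splitting leaves the period attached to the last word of each sentence, which is one of the differences a dedicated word tokenizer is meant to address.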
The text is first tokenized into sentences using the PunktSentenceTokenizer. Then each sentence is tokenized into words using 4 different word tokenizers:
The pattern tokenizer performs its own sentence and word tokenization, and is included to show how the pattern library tokenizes text before further parsing.
The initial example text provides two sentences that demonstrate how each word tokenizer handles non-ASCII characters and the simple punctuation of contractions.
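The kind of difference to look for can be sketched with plain regular expressions. The example below does not call NLTK. It contrasts whitespace splitting with the pattern `\w+|[^\w\s]+`, which NLTK's documentation gives as the regular expression behind its word/punctuation tokenizer. The sample sentence is hypothetical, and the ASCII-only variant is an assumed stand-in for a tokenizer that is not Unicode-aware.

```python
import re

# Hypothetical sample sentence with a non-ASCII character and a contraction.
sentence = "Café owners can't agree."

# Whitespace splitting: contractions and trailing punctuation stay attached.
whitespace_tokens = sentence.split()

# Word/punctuation splitting: runs of word characters and runs of
# punctuation become separate tokens, so "can't" is broken into three.
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

# Same pattern restricted to ASCII: "é" no longer counts as a word
# character, so "Café" is split into two tokens.
ascii_tokens = re.findall(r"\w+|[^\w\s]+", sentence, flags=re.ASCII)

print(whitespace_tokens)
print(wordpunct_tokens)
print(ascii_tokens)
```

Comparing the three lists shows why the choice of tokenizer matters for text with accented characters or contractions: the same sentence yields four, seven, or eight tokens depending on the rule.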