Word Tokenization with Python NLTK

This is a demonstration of the various tokenizers provided by NLTK 2.0.4.

How Text Tokenization Works

Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in the tokenize module. This demo shows how 5 of them work.

The text is first tokenized into sentences using the PunktSentenceTokenizer. Then each sentence is tokenized into words using 4 different word tokenizers:

TreebankWordTokenizer
WordPunctTokenizer
PunktWordTokenizer
WhitespaceTokenizer
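
The two-stage flow described above (sentences first, then words) can be sketched with the standard library alone. This is only an illustration, not the demo's actual code. split_sentences is a hypothetical, naive stand-in for the PunktSentenceTokenizer, and the two word functions only approximate the WhitespaceTokenizer and WordPunctTokenizer. The sample sentence is made up for this sketch.

```python
import re

def split_sentences(text):
    """Naive stand-in for a sentence tokenizer (an assumption for this sketch):
    split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def whitespace_words(sentence):
    """Approximates a whitespace tokenizer: split on runs of whitespace."""
    return sentence.split()

def wordpunct_words(sentence):
    """Approximates a word/punctuation tokenizer: runs of word characters
    and runs of punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", sentence)

# Hypothetical registry of word tokenizers to compare.
WORD_TOKENIZERS = {
    "whitespace": whitespace_words,
    "wordpunct": wordpunct_words,
}

def tokenize(text):
    """Split text into sentences once, then run every word tokenizer
    over each sentence."""
    sentences = split_sentences(text)
    return {
        name: [tokenizer(sentence) for sentence in sentences]
        for name, tokenizer in WORD_TOKENIZERS.items()
    }

sample = "Good muffins cost $3.88 in New York. Please buy me two!"
demo = tokenize(sample)

for name, tokenized in demo.items():
    print(name, tokenized)
```

The whitespace version keeps "$3.88" and "York." as single tokens, while the word/punctuation version breaks them apart at every punctuation boundary.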

The tokenizer from the pattern library performs its own sentence and word tokenization, and is included to show how that library tokenizes text before further parsing.

The initial example text contains 2 sentences that demonstrate how each word tokenizer handles non-ASCII characters and the simple punctuation of contractions.
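
To make the two cases concrete, here is a small standard-library sketch of how a contraction and an accented character can come out differently depending on the tokenization rule. The sample sentence is hypothetical (it is not the demo's example text), and the regular expressions only approximate whitespace and word/punctuation tokenization. The ASCII-only variant is included as an assumption about what can go wrong when a tokenizer is not Unicode-aware.

```python
import re

# Hypothetical sample with one accented word and one contraction.
sample = "The caf\u00e9 can't open."

# Whitespace splitting leaves the contraction and trailing period attached.
whitespace_tokens = sample.split()

# Word/punctuation splitting with Unicode-aware \w (the Python 3 default
# for str patterns): the accented letter stays inside its word.
unicode_tokens = re.findall(r"\w+|[^\w\s]+", sample)

# Same rule restricted to ASCII: the accented letter no longer counts as
# a word character and is split off as if it were punctuation.
ascii_tokens = re.findall(r"\w+|[^\w\s]+", sample, flags=re.ASCII)

print(whitespace_tokens)
print(unicode_tokens)
print(ascii_tokens)
```

Both regex variants split "can't" into three tokens at the apostrophe. NLTK's TreebankWordTokenizer follows Penn Treebank conventions instead and splits such a contraction into "ca" and "n't".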
