mwTokenizer
Python library for multilingual tokenization, developed for the Wikimedia Foundation.
mwTokenizer performs language-agnostic tokenization. Researchers can start with a Wikipedia article (wikitext or HTML), strip the syntax to leave just paragraphs of plaintext, and then tokenize that text into sentences and words for input into models. Install from PyPI: `pip install mwtokenizer`.
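The plaintext → sentences → words pipeline described above can be sketched with a minimal, stdlib-only illustration. The function names and regex heuristics here are illustrative stand-ins, not mwTokenizer's actual API; the library itself applies language-aware rules (e.g. handling abbreviations and scripts without whitespace) that a plain regex cannot:

```python
import re

def sentence_tokenize(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # mwTokenizer's real sentence rules are language-aware.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def word_tokenize(sentence):
    # Naive word splitter: runs of Unicode word characters.
    return re.findall(r'\w+', sentence)

paragraph = "Wikipedia is a free encyclopedia. It exists in many languages!"
sentences = sentence_tokenize(paragraph)
words = [word_tokenize(s) for s in sentences]
```

For real use, consult the package's own documentation for the supported languages and tokenizer interface.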