elgeopaso.utils.text_toolbelt module

Text-processing tools.

class elgeopaso.utils.text_toolbelt.TextToolbelt[source]

Bases: object

Tools to manipulate text: tokenize, clean, etc.

classmethod remove_html_markups(html_text, cleaner='bs-lxml')[source]

Very basic cleaner that removes HTML markup from a text.

Parameters
  • html_text (str) – text to clean

  • cleaner (str) – which library to use to clean the text:
      - "bs-lxml": BeautifulSoup4 + LXML (default)
      - "psl-only": Python Standard Library only (html + regex)

Returns

cleaned text

Return type

str
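
The "psl-only" option is described as relying on the standard library's html module plus a regular expression. A minimal sketch of that idea follows. It is not the elgeopaso implementation: the function name strip_html_psl, the tag pattern, and the whitespace collapsing are assumptions made for illustration.

```python
import html
import re

# Hypothetical pattern (an assumption): anything between "<" and ">" is a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_psl(html_text: str) -> str:
    """Strip tags with a regex, then decode HTML entities (stdlib only)."""
    # Replace each tag with a space so adjacent words do not get glued together.
    without_tags = TAG_PATTERN.sub(" ", html_text)
    # Turn entities such as "&amp;" back into plain characters.
    unescaped = html.unescape(without_tags)
    # Collapse the whitespace runs left behind by the removed tags.
    return " ".join(unescaped.split())


print(strip_html_psl("<p>GIS &amp; <b>Python</b> developer</p>"))
```

A regex-based cleaner like this is fragile on malformed or script-heavy HTML, which is probably why a parser-backed option ("bs-lxml") is the default.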

classmethod tokenize(input_content)[source]

Extracts the words mentioned in the offers so that a semantic analysis can be performed on them. Mainly based on NLTK: https://www.nltk.org/.

Parameters

input_content (str) – input text to parse and tokenize

Returns

list of tokenized words

Return type

list
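
To illustrate the kind of output tokenize is documented to return, here is a rough stand-in that uses only the standard library. The documented method relies mainly on NLTK, so this sketch swaps in a plain regular expression. The name simple_tokenize, the lowercasing, and the tiny stop-word set are all assumptions, not behavior confirmed by the documentation.

```python
import re

# Hypothetical stop-word list for illustration; a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

# Runs of letters (Unicode-aware), optionally joined by "-" or "'".
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def simple_tokenize(input_content: str) -> list:
    """Split text into lowercase words and drop the stop words."""
    words = WORD_PATTERN.findall(input_content.lower())
    return [word for word in words if word not in STOP_WORDS]


print(simple_tokenize("Looking for a GIS developer with Python and PostGIS skills."))
```

The result is a flat list of strings, matching the documented return type, and it can be fed to a frequency counter such as collections.Counter.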