Natural Language Processing (NLP) requires text to be transformed into numbers before a machine can analyze it. To extract useful information, the text is converted into a set of real numbers, or vectors. This process of converting strings or text into a meaningful array of real numbers is called vectorization.

Text vectorization maps words or phrases from a vocabulary to corresponding vectors of real numbers, which can then be used to predict words and measure similarities.
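
The mapping from vocabulary words to numbers can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (lowercasing, whitespace tokenization, raw word counts) and is not a description of how Rubiscape works internally. The function names build_vocabulary, vectorize, and cosine_similarity and the sample sentences are hypothetical.

```python
import math

def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def vectorize(document, vocabulary):
    """Return a vector whose i-th entry counts the i-th vocabulary word."""
    vector = [0.0] * len(vocabulary)
    for tok in document.lower().split():
        if tok in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[tok]] += 1.0
    return vector

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical sample texts, chosen only for illustration.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stocks fell sharply today",
]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]

print(cosine_similarity(vectors[0], vectors[1]))  # shared words, high similarity
print(cosine_similarity(vectors[0], vectors[2]))  # no shared words, zero similarity
```

Once each text is a vector, similarity becomes an ordinary numeric comparison: the first two sentences share several words and score close to 0.75, while the first and third share none and score 0.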

Text vectorization in NLP helps to perform the following textual analysis tasks:

  • Extract features for text classification.
  • Compute the occurrence of similar words.
  • Compute the probability of occurrence of similar words.
  • Compute the relevance of features in a text.
  • Predict the next words in a sequence of words.

In Rubiscape, two Text Vectorization algorithms are available:

  • CountVectorizer
  • TF-IDF (Term Frequency-Inverse Document Frequency)
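
As a rough illustration of how the two approaches differ, the sketch below computes raw term counts and then TF-IDF weights for a small set of texts. It assumes whitespace tokenization and the textbook weighting tf × log(N / df); actual implementations, including the one in Rubiscape, may use different tokenization, smoothing, or normalization. The function names count_vectorize and tfidf_vectorize and the sample texts are hypothetical.

```python
import math
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, rows) where each row holds raw term counts."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        freq = Counter(toks)
        rows.append([freq[term] for term in vocabulary])
    return vocabulary, rows

def tfidf_vectorize(documents):
    """Return (vocabulary, rows) weighted by tf * log(N / df).

    Assumption: the unsmoothed textbook formula; other variants exist.
    """
    vocabulary, counts = count_vectorize(documents)
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
    idf = [math.log(n_docs / d) for d in df]
    rows = []
    for row in counts:
        total = sum(row)  # document length in tokens
        rows.append([(c / total) * w if total else 0.0 for c, w in zip(row, idf)])
    return vocabulary, rows

# Hypothetical sample texts, chosen only for illustration.
docs = ["data science is fun", "data science is hard", "pizza is tasty"]

vocab, counts = count_vectorize(docs)
_, weights = tfidf_vectorize(docs)

for term, c, w in zip(vocab, counts[0], weights[0]):
    print(f"{term:8s} count={c} tfidf={w:.3f}")
```

In the count vectors, every word that appears once in a text gets the same value. With TF-IDF, a word such as "is" that occurs in every text receives a weight of 0, while a word such as "fun" that occurs in only one text receives the highest weight, which is why TF-IDF is often preferred when the relevance of a feature matters more than its raw frequency.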

To open Text Vectorization, in the task pane, click Textual Analysis, and then click Text Vectorization.

For more information, refer to Text Vectorization.