Word Vectorization

Module Import:

from deeprai.embedding import word_vectorize

Class Definition:

class WordVectorizer:

Initialization:

The WordVectorizer is initialized with an optional corpus, which is used for TF-IDF computations.

def __init__(self, corpus=None):

Parameters:

  • corpus (list of str, optional): List of words that forms the basis for the term frequency-inverse document frequency (TF-IDF) calculations. Required only by tfidf_vectorize; the other methods work without it.

Methods:

1. One-Hot Vectorization:

Converts a given word into a one-hot encoded matrix, with one row per character.

def one_hot_vectorize(self, word) -> np.ndarray:

Parameters:

  • word (str): The word to vectorize.

Returns:

  • numpy.ndarray: One-hot encoded matrix representation of the word.

2. Continuous Vectorization:

Encodes each character of a given word as a continuous value.

def continuous_vectorize(self, word) -> np.ndarray:

Parameters:

  • word (str): The word to vectorize.

Returns:

  • numpy.ndarray: Continuous valued representation of the word.

3. Binary Vectorization:

Converts each character of a word into its binary ASCII representation.

def binary_vectorize(self, word) -> np.ndarray:

Parameters:

  • word (str): The word to vectorize.

Returns:

  • numpy.ndarray: Binary ASCII representation of the word.

4. Frequency Vectorization:

Encodes the word based on the frequency of each letter normalized by word length.

def frequency_vectorize(self, word) -> np.ndarray:

Parameters:

  • word (str): The word to vectorize.

Returns:

  • numpy.ndarray: Frequency-based representation of the word.

5. N-gram Vectorization:

Breaks the word into n-grams of size n and vectorizes them.

def ngram_vectorize(self, word, n=2) -> np.ndarray:

Parameters:

  • word (str): The word to vectorize.
  • n (int, default=2): The size of the n-grams.

Returns:

  • numpy.ndarray: N-gram based vector representation of the word.

6. TF-IDF Vectorization:

Vectorizes a word based on term frequency-inverse document frequency, computed against the corpus supplied at initialization.

def tfidf_vectorize(self, word) -> np.ndarray:

Parameters:

  • word (str): The word to vectorize.

Returns:

  • numpy.ndarray: TF-IDF representation of the word.

Raises:

  • ValueError: If the WordVectorizer was not initialized with a corpus.

Description:

The WordVectorizer class in the deeprai.embedding.word_vectorize module provides six ways to represent a word as a vector: one-hot, continuous, binary, frequency-based, n-gram-based, and TF-IDF encoding. Only the TF-IDF method requires a corpus, which must be passed when the class is initialized.

Examples:

Module Import and Initialization:

First, let's import the necessary module and initialize our WordVectorizer. For methods that require a corpus (like TF-IDF), we'll provide a sample corpus.

from deeprai.embedding import word_vectorize

corpus = ["apple", "banana", "cherry", "date", "fig", "grape"]
vectorizer = word_vectorize.WordVectorizer(corpus=corpus)

1. One-Hot Vectorization:

This method will transform a word into a matrix where each row is a one-hot encoded representation of a character in the word.

word = "apple"
one_hot_encoded = vectorizer.one_hot_vectorize(word)
print(one_hot_encoded)
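
The exact shape and alphabet used by one_hot_vectorize are not stated on this page. As a rough illustration of per-character one-hot encoding, here is a minimal NumPy sketch. The 26-letter lowercase alphabet and the name one_hot_sketch are assumptions made for this example, not part of the deeprai API.

```python
import string

import numpy as np

# Assumed vocabulary: the 26 lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def one_hot_sketch(word: str) -> np.ndarray:
    """Return a (len(word), 26) matrix with a single 1 per row."""
    word = word.lower()
    matrix = np.zeros((len(word), len(ALPHABET)), dtype=int)
    for row, char in enumerate(word):
        col = ALPHABET.find(char)
        if col >= 0:  # characters outside the alphabet stay all-zero
            matrix[row, col] = 1
    return matrix


encoded = one_hot_sketch("apple")
print(encoded.shape)  # (5, 26)
```

Each row contains a single 1 in the column of its letter, so "apple" yields five rows, with the two "p" rows identical.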

2. Continuous Vectorization:

This method will transform a word into a vector of continuous values.

word = "apple"
continuous_vector = vectorizer.continuous_vectorize(word)
print(continuous_vector)
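
This page does not say how characters are mapped to continuous values. One plausible scheme, sketched below purely as an assumption, scales each letter's position in the alphabet into the range (0, 1]. The function name continuous_sketch is hypothetical.

```python
import string

import numpy as np

ALPHABET = string.ascii_lowercase  # assumed vocabulary


def continuous_sketch(word: str) -> np.ndarray:
    """Map each letter to (position in alphabet) / 26; unknown characters map to 0.0."""
    word = word.lower()
    # str.find returns -1 for a missing character, so unknown characters become 0 / 26.
    values = [(ALPHABET.find(char) + 1) / len(ALPHABET) for char in word]
    return np.array(values, dtype=float)


continuous = continuous_sketch("apple")
print(continuous)
```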

3. Binary Vectorization:

This will convert each character of the word into its 8-bit ASCII representation.

word = "apple"
binary_vector = vectorizer.binary_vectorize(word)
print(binary_vector)
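
To make the 8-bit ASCII idea concrete, the following sketch builds one row of eight bits per character. It assumes ASCII-only input and most-significant-bit-first ordering; binary_sketch is a hypothetical name and may not match how the library lays out its output.

```python
import numpy as np


def binary_sketch(word: str) -> np.ndarray:
    """Return a (len(word), 8) matrix of bits, most significant bit first."""
    # encode("ascii") raises UnicodeEncodeError for non-ASCII characters.
    rows = [[int(bit) for bit in format(byte, "08b")] for byte in word.encode("ascii")]
    return np.array(rows, dtype=int).reshape(len(rows), 8)


print(binary_sketch("a").tolist())  # [[0, 1, 1, 0, 0, 0, 0, 1]]
bits = binary_sketch("apple")
print(bits.shape)  # (5, 8)
```

The letter "a" has ASCII code 97, which is 01100001 in binary.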

4. Frequency Vectorization:

This method vectorizes a word based on the normalized frequency of each letter in it.

word = "apple"
frequency_vector = vectorizer.frequency_vectorize(word)
print(frequency_vector)
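
As an illustration of letter frequency normalized by word length, the sketch below counts each letter and divides by the number of characters. The fixed 26-element output and the name frequency_sketch are assumptions; the library's output layout may differ.

```python
import string

import numpy as np

ALPHABET = string.ascii_lowercase  # assumed vocabulary


def frequency_sketch(word: str) -> np.ndarray:
    """Return a 26-element vector of letter counts divided by word length."""
    word = word.lower()
    counts = np.array([word.count(char) for char in ALPHABET], dtype=float)
    # Guard against division by zero for an empty word.
    return counts / len(word) if word else counts


frequencies = frequency_sketch("apple")
print(frequencies[ALPHABET.index("p")])  # 0.4
```

In "apple", the letter "p" appears twice out of five characters, giving 0.4, while "a", "l", and "e" each get 0.2.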

5. N-gram Vectorization:

This method will break the word into n-grams and vectorize them. For this example, we'll use n=2 (bigrams).

word = "apple"
bigram_vector = vectorizer.ngram_vectorize(word, n=2)
print(bigram_vector)
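
How ngram_vectorize turns n-grams into numbers is not described here. The sketch below shows the n-gram extraction step, followed by one simple and assumed vectorization: counts over the word's own sorted n-grams. Both function names are hypothetical.

```python
from collections import Counter

import numpy as np


def char_ngrams(word: str, n: int = 2) -> list:
    """Return all overlapping character n-grams of the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_count_sketch(word: str, n: int = 2):
    """Return (sorted n-gram vocabulary, count vector) for a single word."""
    counts = Counter(char_ngrams(word, n))
    vocabulary = sorted(counts)
    vector = np.array([counts[gram] for gram in vocabulary], dtype=int)
    return vocabulary, vector


print(char_ngrams("apple"))  # ['ap', 'pp', 'pl', 'le']
vocabulary, vector = ngram_count_sketch("banana")
print(vocabulary, vector.tolist())  # ['an', 'ba', 'na'] [2, 1, 2]
```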

6. TF-IDF Vectorization:

This method requires a corpus to compute the inverse document frequency. It will then vectorize a word based on its term frequency and the inverse document frequency from the corpus.

word = "apple"
tfidf_vector = vectorizer.tfidf_vectorize(word)
print(tfidf_vector)
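
The page does not define what counts as a "term" or a "document" for a single word. One possible reading, sketched below as an assumption, treats each character as a term and each corpus word as a document, with the smoothed IDF form ln((1 + N) / (1 + df)) + 1. The names char_idf and tfidf_sketch are hypothetical, and the ValueError mirrors the documented behavior only loosely.

```python
import math
import string

import numpy as np

ALPHABET = string.ascii_lowercase  # assumed vocabulary


def char_idf(corpus: list) -> np.ndarray:
    """Smoothed inverse document frequency of each letter across the corpus words."""
    n_docs = len(corpus)
    idf = np.zeros(len(ALPHABET))
    for i, char in enumerate(ALPHABET):
        doc_freq = sum(1 for doc in corpus if char in doc.lower())
        idf[i] = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return idf


def tfidf_sketch(word: str, corpus: list) -> np.ndarray:
    """Return a 26-element vector of per-letter TF multiplied by IDF."""
    if not corpus:
        raise ValueError("a corpus is required for TF-IDF")
    word = word.lower()
    tf = np.array([word.count(char) for char in ALPHABET], dtype=float)
    tf /= max(len(word), 1)  # normalize by word length
    return tf * char_idf(corpus)


corpus = ["apple", "banana", "cherry", "date", "fig", "grape"]
tfidf = tfidf_sketch("apple", corpus)
print(tfidf[ALPHABET.index("p")] > tfidf[ALPHABET.index("a")])  # True
```

Under these assumptions, "p" scores higher than "a" for "apple" because it occurs twice in the word and appears in fewer corpus words.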