How to remove all words with less than 1 occurrence in the wordlist

lolol · June 2023

Hi everyone, I need help here. How do I remove all tokenized words with less than 1 occurrence in the wordlist?

Image: https://us.v-cdn.net/6030995/uploads/editor/kz/zbbfkwn2i5t0.png

Image: https://us.v-cdn.net/6030995/uploads/editor/2q/ge22jz2bil01.png

vivek101 · June 2023

Hey,

You can remove tokenized words that occur fewer than 1 time in the wordlist with the following steps:

1. To determine how frequently each word in the wordlist occurs, create a frequency dictionary. Words will serve as the dictionary's keys, and their associated frequencies will serve as their values.

2. Iterate over the wordlist, updating the frequency dictionary as you go. If a word is already in the dictionary, increase its frequency by 1. If it isn't there yet, add it with a frequency of 1.

3. Build a new wordlist without the words whose frequency is below 1. To do this, iterate over the original wordlist again and keep only the words whose frequency in the dictionary is greater than or equal to 1.

Here's a Python code example to demonstrate this process:

from collections import defaultdict

def remove_infrequent_words(wordlist):
    # Steps 1 and 2: Count how often each word occurs
    frequency_dict = defaultdict(int)
    for word in wordlist:
        frequency_dict[word] += 1

    # Step 3: Create new wordlist
    new_wordlist = []
    for word in wordlist:
        if frequency_dict[word] >= 1:
            new_wordlist.append(word)

    return new_wordlist

# Example usage
wordlist = ["apple", "banana", "apple", "orange", "grape"]
filtered_wordlist = remove_infrequent_words(wordlist)
print(filtered_wordlist)

Output:

['apple', 'banana', 'apple', 'orange', 'grape']

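In this example, every tokenized word occurs at least once, so the filtered wordlist is identical to the input. Any word that is in the list has a frequency of at least 1, so to actually drop rare words you need to raise the threshold in the comparison (for example, >= 2). To filter your own data, replace the "wordlist" variable with your own list of tokenized words.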
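If the goal is to drop words that occur only once (or fewer than some other minimum), the same idea can be sketched with a configurable threshold. This is only a minimal sketch: the function name remove_rare_words and the min_count parameter are made up for illustration, and it assumes your tokens are already in a plain Python list.

```python
from collections import Counter

def remove_rare_words(wordlist, min_count=2):
    """Keep only tokens that occur at least min_count times (hypothetical helper)."""
    # Counter maps each word to the number of times it appears in wordlist
    counts = Counter(wordlist)
    # Preserve the original order and duplicates of the words that are kept
    return [word for word in wordlist if counts[word] >= min_count]

# Example usage
wordlist = ["apple", "banana", "apple", "orange", "grape"]
print(remove_rare_words(wordlist, min_count=2))  # ['apple', 'apple']
```

With min_count=1 this returns the list unchanged, which matches the example above.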

Kind Regards
Vivek Garg

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

How to remove all words with less than 1 occurrence in the wordlist

Answers