How to remove all words with less than 1 occurrence in the wordlist

lolol Member Posts: 5 Learner I
Hi everyone, I need help here. How do I remove all tokenized words with less than 1 occurrence in the wordlist?


Answers

  • vivek101 Member Posts: 8 Contributor II
    edited June 2023
    Hey,

    You can remove tokenized words that fall below a minimum number of occurrences in the wordlist as follows:

    1. Create a frequency dictionary to count how often each word occurs in the wordlist. The words are the dictionary's keys and their counts are the values.

    2. Iterate over the wordlist and update the frequency dictionary. If a word is already in the dictionary, increase its count by one. If it is not, add it with a count of 1.

    3. Build a new wordlist by iterating over the original wordlist again and keeping only the words whose count in the frequency dictionary is greater than or equal to the threshold (1 in this case).

    Here's a Python code example to demonstrate this process:
    from collections import defaultdict

    def remove_infrequent_words(wordlist, min_count=1):
        # Steps 1 and 2: count how often each word occurs
        frequency_dict = defaultdict(int)
        for word in wordlist:
            frequency_dict[word] += 1

        # Step 3: keep only words that occur at least min_count times
        # (min_count defaults to 1, the threshold asked about in the question)
        new_wordlist = []
        for word in wordlist:
            if frequency_dict[word] >= min_count:
                new_wordlist.append(word)

        return new_wordlist

    # Example usage
    wordlist = ["apple", "banana", "apple", "orange", "grape"]
    filtered_wordlist = remove_infrequent_words(wordlist)
    print(filtered_wordlist)
    Output:

    ['apple', 'banana', 'apple', 'orange', 'grape']

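    In this example every tokenized word occurs at least once, so the resulting wordlist is unchanged. A threshold of 1 can never remove anything, because every word that appears in the list has a count of at least 1. To actually drop rare words, raise the threshold in the frequency check (for example, keep only words that occur at least twice). To apply this to your own data, replace the "wordlist" variable with your own list of tokenized words.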
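
For comparison, here is a minimal sketch of the same idea with a threshold that does remove something. It uses `collections.Counter` from the standard library instead of a hand-built dictionary. The function name `filter_by_min_count` and the parameter `min_count` are made up for this illustration and are not part of any library.

```python
from collections import Counter

def filter_by_min_count(tokens, min_count=2):
    """Return the tokens that occur at least min_count times, in original order."""
    # Counter builds the word -> frequency mapping in a single pass
    counts = Counter(tokens)
    # Keep every occurrence of a token whose total count meets the threshold
    return [tok for tok in tokens if counts[tok] >= min_count]

# Example usage with the same list as above
wordlist = ["apple", "banana", "apple", "orange", "grape"]
filtered = filter_by_min_count(wordlist, min_count=2)
print(filtered)  # ['apple', 'apple']
```

With `min_count=2`, only "apple" survives because it is the only word that occurs twice. With `min_count=1` the list comes back unchanged, which matches the output shown earlier.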

    Kind Regards
    Vivek Garg