
What is the best number of topics for LDA?

elena2020chao Member Posts: 13 Learner III
edited December 2018 in Help
Hello,
I want to identify the topics in my data.
I used LDA and plotted the likelihood for different numbers of topics. How do I know which number of topics is best?
As I increase the number of topics, the likelihood decreases, but the analysis also gets harder as the number of topics grows. Is there a way to know what the likelihood should be?
Thanks

Answers

  • elena2020chao Member Posts: 13 Learner III

    Can anyone help me?
    What is the best value for the likelihood in each of these charts?
    Thanks a lot

    [Attached image: 444.JPG]

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi
    From my point of view, I would use LDA inside an Optimize operator.

    Regards

    Lionel
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    In my view there is no simple answer to this question. In general, the more topics you allow, the better your performance metrics look. But as you noted, having more topics increases the complexity of your analysis, so you have to make a tradeoff decision. I don't think there is any single way to find the "best" number. @mschmitz is the architect of the LDA extension, so I would be interested in his thoughts on this.
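
One way to make the tradeoff described above concrete is an elbow-style rule: keep adding topics only while the fit metric still improves noticeably. The sketch below is just an illustration of that idea, not something the LDA extension provides. The function name `pick_num_topics`, the `min_rel_gain` threshold, and the log-likelihood values are all made up for the example.

```python
def pick_num_topics(log_likelihoods, min_rel_gain=0.05):
    """Return the smallest topic count after which the fit stops improving much.

    log_likelihoods: dict mapping number of topics -> log-likelihood
                     (higher, i.e. closer to zero, is better).
    min_rel_gain:    hypothetical threshold; stop as soon as moving to the
                     next topic count improves the log-likelihood by less
                     than this fraction of its current magnitude.
    """
    ks = sorted(log_likelihoods)
    for current, nxt in zip(ks, ks[1:]):
        gain = (log_likelihoods[nxt] - log_likelihoods[current]) / abs(log_likelihoods[current])
        if gain < min_rel_gain:
            return current
    # Every step still helped by at least min_rel_gain: take the largest count.
    return ks[-1]


# Made-up values, for illustration only.
example = {5: -12000.0, 10: -10500.0, 20: -9800.0, 40: -9700.0, 80: -9690.0}
chosen = pick_num_topics(example)
print(chosen)
```

With these example numbers the rule stops at 20 topics, because going from 20 to 40 improves the log-likelihood by only about 1%. The threshold is a judgment call, which is really the point made above: the "best" number depends on how much extra complexity you are willing to accept.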

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

    For me, finding the optimal number of topics is very similar to choosing k in k-means: there is no easy quantity to optimize. The next toolbox version will include "Perplexity", which is the common measure.

    Here is a nice read on the topic: LDA Best Practices

     

    BR,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • elena2020chao Member Posts: 13 Learner III

    Hello

    Thank you both, dear professors.
    You mentioned perplexity: where does it come from, and what is its purpose?
    For alpha, is it better to derive the value from the data, or to use the heuristics setting?
    Is there a criterion for assessing the goodness of an LDA model under different alpha and beta parameters? If so, how?


    Thank you
    With respect

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi @elena2020chao,

    Perplexity is defined as

    exp(-LLH/#tokens)

    and is thus derived directly from the LLH (log-likelihood). It will be available in the next release. It is simply more common to report this measure than the LLH.
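
As a small illustration of the formula above, here is a minimal sketch that converts a log-likelihood into perplexity. It assumes the LLH is a natural-log likelihood summed over all tokens; the function name and the example numbers are made up, and the toolbox's own implementation may differ in detail.

```python
import math


def perplexity(log_likelihood, num_tokens):
    """Perplexity = exp(-LLH / #tokens), assuming a natural-log LLH."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(-log_likelihood / num_tokens)


# Made-up numbers: a corpus of 1000 tokens with a total LLH of -7000.
value = perplexity(-7000.0, 1000)
print(value)  # equals exp(7)
```

Because the mapping is monotonic, a higher (less negative) LLH on the same corpus always gives a lower perplexity, so both measures rank models identically. Perplexity is just normalized by the number of tokens, which makes it easier to compare across corpora of different sizes.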

     

    For alpha/beta: I would go for Heuristics + Optimize Hyperparameters. This adjusts the values automatically during the fitting process.

     

    BR,

    Martin

  • elena2020chao Member Posts: 13 Learner III

    Hello
    Thank you very much for your reply.
    I could not find the Optimize Hyperparameters operator, though.
    Also, what is the likelihood based on?
    Thank you in advance for your answer.
    With regards

     

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    Optimize Hyperparameters is a setting of the LDA operator, not an operator itself.

     

    The LLH is the log-likelihood of the underlying model. See: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

    It can be interpreted like a "goodness of fit" measure in other models.
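
To give a feel for what such a log-likelihood measures, here is a simplified sketch: given fixed document-topic and topic-word distributions, each token's probability is a mixture over topics, and the LLH is the sum of the log probabilities of all tokens. This is an illustration under those assumptions only. The function name and the toy distributions are made up, and the extension's actual computation may be different.

```python
import math


def corpus_log_likelihood(docs, doc_topic, topic_word):
    """Sum of log p(word | document) over every token in the corpus.

    docs:       list of documents, each a list of word tokens
    doc_topic:  doc_topic[d][k] = probability of topic k in document d
    topic_word: topic_word[k][w] = probability of word w under topic k
    """
    total = 0.0
    for d, tokens in enumerate(docs):
        for w in tokens:
            # Mixture over topics: p(w | d) = sum_k p(k | d) * p(w | k)
            p = sum(doc_topic[d][k] * topic_word[k][w] for k in range(len(topic_word)))
            total += math.log(p)
    return total


# Toy example with two topics and a three-word vocabulary (made-up values).
docs = [["cat", "dog"], ["stock"]]
doc_topic = [[0.9, 0.1], [0.2, 0.8]]
topic_word = [
    {"cat": 0.5, "dog": 0.4, "stock": 0.1},
    {"cat": 0.1, "dog": 0.1, "stock": 0.8},
]
llh = corpus_log_likelihood(docs, doc_topic, topic_word)
print(llh)
```

The result is always negative, since every token probability is below 1. Values closer to zero mean the model assigns higher probability to the observed words, which is why the LLH can be read as a goodness-of-fit measure.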

     

    BR,

    Martin

  • elena2020chao Member Posts: 13 Learner III

    Hi, thank you very much.
    I understand now.
    I am waiting impatiently for the new version of the program ...
    How can I find out what each topic is about? Do I have to work it out myself by reading through each topic?
    Is it possible to use LDA to determine the content of each k-means cluster? I could not get anywhere with it ...
    Thanks again.
