
Topic Modeling for PDF files

Karissa Member Posts: 3 Learner I
Hello everyone,

I want to read several PDF files (business reports) and analyze them. So far I have been using the Read Document operator, because I haven't found a better one yet.
I want to run topic modeling on the files to find the relevant topics. Pre-processing is done with the operators Tokenize, Transform Cases, Filter Stopwords, Filter Tokens by Length and Stem. For the topic modeling itself I have found two operators: Extract Topics from Documents (LDA) and Extract Topics from Data (LDA). Unfortunately, I can't get either of them to work properly.
Extract Topics from Documents (LDA) needs a collection as input, and I don't know how to get one.
Extract Topics from Data (LDA) needs a text attribute, and again I don't know how to get one.
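
For readers who want to see what the pre-processing chain above does to a text, here is a minimal Python sketch of the same five steps. It is not RapidMiner's implementation: the function names, the tiny stopword list, the length limits and the crude suffix stripper (a stand-in for a real stemmer) are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for", "on"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_len=3, max_len=25):
    """Apply the five steps in order and return the remaining tokens."""
    tokens = re.findall(r"[A-Za-z]+", text)              # Tokenize on non-letters
    tokens = [t.lower() for t in tokens]                 # Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]   # Filter Stopwords
    tokens = [t for t in tokens
              if min_len <= len(t) <= max_len]           # Filter Tokens by Length
    return [crude_stem(t) for t in tokens]               # Stem


print(preprocess("The company reported rising revenues in 2023."))
# prints ['company', 'report', 'ris', 'revenu']
```

Note that an empty input text yields an empty token list, so a topic model fed with such documents has nothing to work with.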

Accordingly, I have these two questions:
1) Is there an operator I can use to read in multiple PDF files?
2) What is the best operator for Topic Modeling and how do I implement it?

I have created the process below. It runs, but I only get null values as results. Does anyone have a tip for me?

Many thanks for the help

Best Answer

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Solution Accepted
    Hi,
    most likely the texts are empty for some reason?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hey,
    I think what you want is Loop Files: loop over your files and use Read Document inside the loop. What you get back is a collection of documents, which you can then process as needed.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Karissa Member Posts: 3 Learner I
    Thank you @MartinLiebig. The Loop Files operator worked.
    The process runs through, but all results are zero/null. What could be the reason for this?


    Many thanks

  • Karissa Member Posts: 3 Learner I
    I have changed the process and now I get a result. Many thanks.
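
For anyone who runs into the same null results: the two hints in this thread (loop over the files to build a collection, then check whether the extracted texts are empty) can be sketched outside RapidMiner roughly as follows. This is only an illustration under stated assumptions. `collect_documents` and `find_empty` are hypothetical helper names, and `read_text` is a placeholder for whatever text extractor you use; no specific PDF library is implied.

```python
from pathlib import Path


def collect_documents(folder, pattern, read_text):
    """Loop over matching files and return {file name: extracted text}.

    `read_text` is any callable that takes a Path and returns a string
    (hypothetical placeholder for a real PDF text extractor).
    """
    docs = {}
    for path in sorted(Path(folder).glob(pattern)):
        docs[path.name] = read_text(path)
    return docs


def find_empty(docs):
    """Return the names of documents whose text is missing or only whitespace."""
    return [name for name, text in docs.items() if not text or not text.strip()]
```

Running `find_empty` on the collection before any topic modeling shows quickly whether the extraction step returned text at all. One possible cause of empty texts is a PDF that contains only scanned page images and no text layer.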
