
Text classification into topics

CJBcht Member Posts: 3 Learner I
Hello,
Apologies for the super-beginner question, but I am a super-beginner.

I have 4 documents in Spanish that contain about 600 words altogether. I would like RapidMiner to scan each document and classify the words it contains according to the topics they relate to. I know roughly which words to expect, so I could even feed it a list of words that I consider related to one another and that belong in a particular class, if that helps the classification!
Ideally, I would like to compare the topic classes found in the different documents, and the prominence of each class as a proportion of total words.
I have tried Naive Bayes, K-Means, K-Medoids and Extract Topics from Document (LDA), but despite my best efforts and a lot of reading about these operators (including on this forum), I still cannot figure out which tool is best for my case or how to do this simple text classification.

Please help me. My thesis is at stake.

Thank you very much!

Attached are one of my attempts and the 4 data files.

Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Are you trying to classify the documents themselves, or the words within the documents?
    The first three operators you list work at the level of the documents themselves.
    Only the LDA topic extraction will do something like what you have described without a lot of reworking of your data structure.
    You may need to experiment with that operator's parameters to get results you are satisfied with. LDA can be very sensitive.
    You should also do some preprocessing to simplify the text (map/replace similar tokens, exclude stopwords, use stemming, etc.). Text mining is complicated and is not generally considered a great starting point for beginners, although TurboPrep and AutoModel now have some nice built-in assistance if you have access to those wizards.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
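
For readers who want to see the word-list idea from the question outside RapidMiner, here is a minimal Python sketch. It is not the LDA operator and not anything taken from the attached process. It assumes you supply your own topic word lists (the `TOPIC_WORDS` entries and the sample sentences below are made-up placeholders), and it reports each topic's share of all words in a document.

```python
import re
from collections import Counter

# Hypothetical topic word lists; replace them with your own Spanish vocabulary.
TOPIC_WORDS = {
    "economía": {"mercado", "precio", "empleo"},
    "salud": {"hospital", "médico", "vacuna"},
}


def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def topic_shares(text, topic_words=TOPIC_WORDS):
    """Return each topic's share of all tokens in the text."""
    tokens = tokenize(text)
    # Invert the lists so every listed word points at its topic.
    word_to_topic = {w: t for t, words in topic_words.items() for w in words}
    counts = Counter(word_to_topic[tok] for tok in tokens if tok in word_to_topic)
    total = len(tokens)
    return {t: (counts[t] / total if total else 0.0) for t in topic_words}


# Made-up sample documents, only to show the call.
docs = {
    "doc1": "El precio sube en el mercado y el empleo baja.",
    "doc2": "El hospital recibe la vacuna; el médico la aplica.",
}
for name, text in docs.items():
    print(name, topic_shares(text))
```

The shares here are computed against all tokens, stopwords included. If you filter stopwords first, the denominators shrink and the proportions change, so apply the same choice to every document you compare.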
  • CJBcht Member Posts: 3 Learner I
    Hi
    Thank you for your quick reply. I have already done a lot of preprocessing (tokenizing, stemming, filtering stopwords, filtering by length, etc.). Can you recommend any material (videos, tutorials) that could help me use the LDA operator smoothly?
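
For reference, the preprocessing steps mentioned in this thread (tokenize, map similar tokens, filter stopwords, filter by length) can be sketched in a few lines of Python. This is only an illustration of the order of operations, not a reproduction of the RapidMiner operators. The stopword list and the `REPLACEMENTS` map are tiny made-up examples, and stemming is left out because it would need an external library.

```python
import re

# Tiny illustrative stopword list; a real Spanish list would be much longer.
STOPWORDS = {"el", "la", "los", "las", "de", "en", "y", "que", "un", "una"}

# Hypothetical map that merges variants of a word into one token.
REPLACEMENTS = {"precios": "precio", "mercados": "mercado"}


def preprocess(text, min_len=3, max_len=25):
    """Tokenize, merge variants, drop stopwords, then filter by token length."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    tokens = [REPLACEMENTS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(preprocess("Los precios de los mercados y el precio de la vacuna"))
```

With only about 600 words across 4 documents, aggressive filtering can leave very little text behind, so it is worth inspecting the output of each step before passing it to a topic model.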
  • CJBcht Member Posts: 3 Learner I
    Sorry for double posting; I thought attaching the process might help. Basically, it would be really helpful to get a better grip on the parameters. When I run the attached process, it gives me this error message:
    • Exception: java.lang.NumberFormatException
    • Message: null
    • Stack trace:
    • sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    • sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    • sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    • java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    • java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
    • java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1005)
    • com.rapidminer.studio.concurrency.internal.AbstractConcurrencyContext.collectResults(AbstractConcurrencyContext.java:206)
    • com.rapidminer.studio.concurrency.internal.StudioConcurrencyContext.collectResults(StudioConcurrencyContext.java:33)
    • com.rapidminer.studio.concurrency.internal.AbstractConcurrencyContext.call(AbstractConcurrencyContext.java:141)
    • com.rapidminer.studio.concurrency.internal.StudioConcurrencyContext.call(StudioConcurrencyContext.java:33)
    • com.rapidminer.Process.executeRootInPool(Process.java:1349)
    • com.rapidminer.Process.execute(Process.java:1314)
    • com.rapidminer.Process.run(Process.java:1291)
    • com.rapidminer.Process.run(Process.java:1177)
    • com.rapidminer.Process.run(Process.java:1130)
    • com.rapidminer.Process.run(Process.java:1125)
    • com.rapidminer.Process.run(Process.java:1115)
    • com.rapidminer.gui.ProcessThread.run(ProcessThread.java:65)

    • Cause
    • Exception: java.lang.NumberFormatException
    • Message: empty String
    • Stack trace:
    • sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
    • sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    • java.lang.Double.parseDouble(Double.java:538)
    • com.rapidminer.extension.operator.text_processing.modelling.mallet.PerplexityCalculator.evaluate(PerplexityCalculator.java:60)
    • com.rapidminer.extension.operator.text_processing.modelling.mallet.PerplexityCalculator.estimatePerplexity(PerplexityCalculator.java:42)
    • com.rapidminer.extension.operator.text_processing.modelling.mallet.LDAModel.calculatePerplexity(LDAModel.java:345)
    • com.rapidminer.extension.operator.text_processing.modelling.mallet.LDA.doWork(LDA.java:184)
    • com.rapidminer.operator.Operator.execute(Operator.java:1031)
    • com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:77)
    • com.rapidminer.operator.ExecutionUnit$2.run(ExecutionUnit.java:812)
    • com.rapidminer.operator.ExecutionUnit$2.run(ExecutionUnit.java:807)
    • java.security.AccessController.doPrivileged(Native Method)
    • com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:807)
    • com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:423)
    • com.rapidminer.operator.Operator.execute(Operator.java:1031)
    • com.rapidminer.Process.executeRoot(Process.java:1372)
    • com.rapidminer.Process.lambda$executeRootInPool$5(Process.java:1351)
    • com.rapidminer.studio.concurrency.internal.AbstractConcurrencyContext$AdaptedCallable.exec(AbstractConcurrencyContext.java:328)
    • java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    • java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    • java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    • java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)



  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @mschmitz, do you have any recommended resources for using the LDA topic extraction operator? I know that is your baby :-)