The Altair Community is migrating to a new platform to provide a better experience for you. In preparation for the migration, the Altair Community is on read-only mode from October 28 - November 6, 2024. Technical support via cases will continue to work as is. For any urgent requests from Students/Faculty members, please submit the form linked here
Answers
Best,
~Marius
-> I want to classifiy PDF documents into multiple categories by there text content.
2. If you are working with data, give a detailed description of your data (number of examples and attributes, attribute types, label type etc.).
-> There exists a database with multiple PDF documents already classified. hundreds of scanned PDF documents with OCR. attributes/categories 150 (cutomer A, B, C..., topics printer, monitor, laptop, etc...)
3. Describe which results or actions you are expecting.
-> The programm should learn the classes and then apply this model to the unknown documents. Finally, it should perform some actions like renaming and moving the files from the income directory into the database. (sequence: learn pdfs -> classify into multiple categories -> analyse unknown pdfs -> perform action depending on the predictions for each category)
4. Describe which results you actually get.
-> load pdf content into example set
-> train, verify performance
-> apply model to "unknown" labeled exampleset
-> RESULT: show data view table and the prediction(label) column) as result
-> HOW TO?: perform action (my solution: export an excel table with the results, then use a short java programm to rename the files)
now that's actually an excellent problem description!
You can use Loop Examples, Extract Macro and Execute Program to solve your problem. Please have a look at the attached process. You have to adapt the command line in Execute Program to your operating system.
If you are not familiar with macros in RapidMiner and have problems understanding the process, don't hesitate to ask again!
Note: you could actually also use Execute Script to execute javascript, but that requires knowledge of the java api.
Note2: you can search in the operator list. If you enter e.g. "execute" you see all operators with execute in their name. That way you can search for operators even if you don't know where they are located in the hierarchy.
Best,
~Marius
I want to use "Neural Net" instead of "Naive Bayes". The reason for this is that with the "documents from file" implementation I get better accuracy and confidence levels...
EDIT: I found differences in the example sets:
- the exampleset from the "documents from file" has a "label"-attribute of role "label" and type "polynominal"
- the exampleset from the "loop files, process documents and set role" has a "metadata_file"-attribute of role "label" and type "nominal"
How to change the type?
why do you want to use a neural net if naive bayes works also? Have you tried to use a SVM? Most of the time it gives better results for text processing than other algorithms.
Nominal and polynominal should not create any errors.
Do you mean "How to Extend RapidMiner 5.0"? You should only buy it if you want to create a new extension for yourself.
Best,
Nils
Support Vector Machine cannot handle polynominal label.
EDIT: Working solution: I embedded the SVM into a "Classification by Regression" operator. Now it works fine. even with a few examples the classification is correct. That's great!