Mining a PDF document
I'm new to RapidMiner. I would like to mine a PDF to create a word and number vector. I am using the following operators:
1. Read Document (content type: PDF, encoding: SYSTEM)
2. Process Documents from Data (prune method: absolute, datamanagement: double_sparse_array)
Inside Process Documents from Data:
2.a Extract Information (query type: string matching)
2.b Tokenize (mode: non letters)
2.c Transform Cases (transform to: lower case)
Error Message: com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet
Stack trace:
------------
Exception: java.lang.ClassCastException
Message: com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet
Stack trace:
com.rapidminer.operator.text.io.ExampleSetDocumentInputOperator.getTextObjects(ExampleSetDocumentInputOperator.java:110)
com.rapidminer.operator.text.io.AbstractDocumentInputOperator.doWork(AbstractDocumentInputOperator.java:224)
com.rapidminer.operator.Operator.execute(Operator.java:833)
com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:379)
com.rapidminer.operator.Operator.execute(Operator.java:833)
com.rapidminer.Process.run(Process.java:925)
com.rapidminer.Process.run(Process.java:848)
com.rapidminer.Process.run(Process.java:807)
com.rapidminer.Process.run(Process.java:802)
com.rapidminer.Process.run(Process.java:792)
com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)
Hi Neil. I'm getting "com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet". The sequence is: 1. Read Document (PDF) ---> 2. Process Documents from Data, containing 2.a Tokenize and 2.b Transform Cases. I'm trying to create a word vector. Thank you for your assistance.
Answers
The Read Document operator outputs a Document object, whereas Process Documents from Data expects an ExampleSet on its input port, which is why the cast fails.
One option is to insert a Documents to Data operator between them.
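To make the type mismatch concrete, here is a minimal Python sketch of the idea. It is not RapidMiner code: the `Document` and `ExampleSet` classes and the three functions are hypothetical stand-ins named after the operators in this thread, and the real operators do considerably more.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a single text document."""
    text: str


@dataclass
class ExampleSet:
    """Hypothetical stand-in for a table with one row per example."""
    rows: list = field(default_factory=list)


def process_documents_from_data(data):
    """Accept only a table, loosely mirroring the failing type check."""
    if not isinstance(data, ExampleSet):
        raise TypeError(f"expected ExampleSet, got {type(data).__name__}")
    # Placeholder for the inner text processing: lower-case each text cell.
    return [row["text"].lower() for row in data.rows]


def documents_to_data(documents):
    """Wrap each document's text in a table row (the conversion step)."""
    return ExampleSet(rows=[{"text": d.text} for d in documents])


doc = Document("Mining a PDF document")

# Passing the document directly fails, as in the question.
try:
    process_documents_from_data(doc)
except TypeError as err:
    mismatch = str(err)

# Converting first gives the consumer the type it expects.
converted = process_documents_from_data(documents_to_data([doc]))
```

The point of the sketch is only that the consumer checks its input type up front, so the fix has to happen before the data reaches it, not inside the subprocess.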
A better option would be to use the Read Documents from Files operator.
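Once the input type is right, the inner steps from the question (tokenize on non-letters, convert to lower case, prune by absolute count) build the word vector. The following Python sketch shows roughly what those steps compute. It is an illustration under assumptions, not RapidMiner's implementation: the function names and the `min_docs` parameter are made up here, and plain term counts are used as the vector values.

```python
import re
from collections import Counter


def tokenize_non_letters(text):
    """Return runs of letters; every non-letter character is a separator."""
    return re.findall(r"[^\W\d_]+", text)


def word_vectors(texts, min_docs=1):
    """Build term-count vectors, keeping words found in >= min_docs texts."""
    tokenized = [[t.lower() for t in tokenize_non_letters(x)] for x in texts]
    # Document frequency: in how many texts does each word occur?
    doc_freq = Counter(word for toks in tokenized for word in set(toks))
    vocab = sorted(w for w, n in doc_freq.items() if n >= min_docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


texts = ["Invoice 42 for ACME", "ACME invoice paid"]
vocab, vectors = word_vectors(texts)
pruned_vocab, pruned_vectors = word_vectors(texts, min_docs=2)
```

Note that in this sketch the token "42" never reaches the vocabulary, because splitting on non-letters treats digits as separators. If numbers should appear in the vector, as the question asks, a different tokenization rule would be needed.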
regards
Andrew