Loading multiple pdf files
I am trying to load a corpus of several PDF files into RM. I selected 'Process Documents from Files' from the text processing menu and selected the directory with the PDF files. But when I run this process, it gives me the following error message:
Feb 28, 2012 2:30:04 AM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 28, 2012 2:30:04 AM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
==> +- Process Documents from Files[1] (Process Documents from Files)
subprocess 'Vector Creation'
Feb 28, 2012 2:30:04 AM SEVERE: java.lang.ArrayIndexOutOfBoundsException
Can you please help?
-Stephan
Answers
The information you provide is a bit sparse. Can you please post your process setup? Did you try another folder with another set of PDF files? You can find some useful hints on what to include in your questions here.
Kind regards,
Marius
* the key concepts that emerge from various subsets of this corpus (which are in separate - labeled - subfolders)
* the various n-grams that contain certain words in them (e.g. every n-gram with the word 'security' in it): which combinations occur most frequently in the text
* co-occurrences of various words within certain 'windows' (say, 2 sentences) throughout the text.
* automatic clustering of all pdfs
* ...
All of this after having run the usual text-mining processes of course (tokenization, stemming, etc.). But it seems to me that with the available information, we should be able to set up that entire process. All the help I am asking for is the very first step: to get the PDFs into RapidMiner.
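For reference, two of the goals listed above (n-grams containing a given word, and co-occurrences within a window of sentences) can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the idea, not how RapidMiner's operators work internally; the function names, the simple regex tokenizer, and the sentence splitter are all assumptions made for the example, and it expects plain text that has already been extracted from the PDFs.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams_with_keyword(text, keyword, n=2):
    """Count every n-gram (a tuple of n consecutive tokens) that contains `keyword`."""
    tokens = tokenize(text)
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(g for g in grams if keyword in g)

def cooccurrences(text, window=2):
    """Count, for each pair of distinct words, the number of sliding windows of
    `window` consecutive sentences in which both words appear."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [tokenize(s) for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter()
    for i in range(max(1, len(sentences) - window + 1)):
        words = sorted({w for s in sentences[i:i + window] for w in s})
        counts.update(combinations(words, 2))  # pairs are stored in alphabetical order
    return counts
```

Calling `ngrams_with_keyword(text, "security").most_common(10)` would then list the ten most frequent bigrams that contain 'security'. Note that overlapping windows mean a pair inside a single sentence can be counted more than once by `cooccurrences`.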
I am trying to follow the Vancouver Data video on 'loading text into RapidMiner'. As explained there, I click on 'Process Documents from Files' in the 'Text Processing' operators section. At that point my screen already looks different from the video: the 'exa' and 'wor' handles are automatically connected to two 'res' handles on the right. I still click on 'Text Directories', and I input the folder where I have the first set of PDFs. I accept the suggested
Mar 11, 2012 3:38:04 AM INFO: Loading initial data.
Mar 11, 2012 3:38:05 AM SEVERE: Process failed: operator cannot be executed (4). Check the log messages...
Mar 11, 2012 3:38:05 AM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
==> +- Process Documents from Files[1] (Process Documents from Files)
subprocess 'Vector Creation'
Mar 11, 2012 3:38:05 AM SEVERE: java.lang.ArrayIndexOutOfBoundsException: 4
Thanks for any help.
-Stephan
Thanks for your detailed description. Unfortunately I still can't guess where the error comes from.
Which RapidMiner version do you use? If it is not the latest version (5.2.002), please update.
If the error still occurs, can you please post your process setup?
Did you try another folder with different pdf files as input to check if it is caused by a corrupted pdf file?
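A quick way to look for a damaged file before re-running the process is a small script outside RapidMiner. The following is a rough sketch in Python, not a RapidMiner feature: the function names and the 1024-byte search window are assumptions for the example, and the check is only a heuristic (a '%PDF-' header near the start of the file and an '%%EOF' marker near the end). A file can pass this check and still fail to parse.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Rough heuristic: '%PDF-' near the start and '%%EOF' near the end of the file."""
    data = Path(path).read_bytes()
    return b"%PDF-" in data[:1024] and b"%%EOF" in data[-1024:]

def find_suspect_pdfs(folder):
    """Return the sorted names of *.pdf files in `folder` that fail the heuristic."""
    return sorted(p.name for p in Path(folder).glob("*.pdf") if not looks_like_pdf(p))
```

Any file names returned by `find_suspect_pdfs` are candidates to move out of the input folder before trying the process again. If nothing is flagged, splitting the folder in half and running each half separately is another way to narrow down a problematic file.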
Best, Marius
I haven't seen the video yet, so I can't tell you whether anything differs from it.
Best,
Marius