"Loading Adobe/Word into Rapidminer"
Hi All,
I want to load some Adobe documents into Rapidminer so I can calculate word frequencies. I am able to do this with Excel sheets but can't seem to load the Adobe doc into it. Please let me know what operators I need to load either Adobe or Word docs into Rapidminer to calculate word frequencies.
Thanks.
Answers
You can load PDF, TXT, HTML, and XML files only. DOCX is not supported.
Thank you, that is helpful. Can you tell me what operators I will need to make this work?
Sure. If the files are in a directory, then use the Process Documents from Folders operator. This operator is found in the Text Processing extension available on the marketplace.
Actually reading DOCX is supported as well. Please see this sample process.
Huh, will you look at that. You taught me a new trick @JEdward! Thanks!
MSOffice documents are actually just zip files.
It also works with PPTX documents, but you need to change the Loop Zip Files from my example to loop through each slide, as slides are stored in separate XML documents.
Have fun!
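For anyone who wants to see the "just zip files" point outside of RapidMiner, here is a minimal Python sketch using only the standard library. It is not the sample process linked above, only an illustration of the same idea: open the archive, read the XML parts, pull out the text, and count words. The function names (docx_text, pptx_text, word_frequencies) are made up for this example, and the tokenizer is a deliberately simple assumption (lowercase runs of letters and apostrophes).

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace: text runs in word/document.xml are <w:t> elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
# DrawingML namespace: text runs in PPTX slides are <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml")


def docx_text(source):
    """Return the body text of a DOCX (path or file-like), one line per paragraph."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join them back together.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def pptx_text(source):
    """Return the text of a PPTX (path or file-like), one line per slide."""
    slides = []
    with zipfile.ZipFile(source) as archive:
        # Each slide lives in its own XML part, so loop over all of them.
        matches = [SLIDE_NAME.fullmatch(name) for name in archive.namelist()]
        matches = [m for m in matches if m]
        # Sort numerically so slide10 comes after slide2.
        for m in sorted(matches, key=lambda m: int(m.group(1))):
            root = ET.fromstring(archive.read(m.group(0)))
            slides.append(" ".join(t.text or "" for t in root.iter(A_NS + "t")))
    return "\n".join(slides)


def word_frequencies(text):
    """Count lowercase word occurrences using a very simple tokenizer."""
    return Counter(re.findall(r"[a-z']+", text.lower()))
```

Typical use would be something like word_frequencies(docx_text("report.docx")).most_common(20), where report.docx is a placeholder file name. Note that this only reads the main document body; headers, footers, and footnotes are stored in other XML parts of the archive.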