Input file format for the Process Documents from Files operator
Does anyone know what text structure is expected or can be parsed using the Process Documents from Files operator? I am working on Ch 15 of the book written by Markus Hofmann and Ralf Klinkenberg. They use the Process Documents from Files operator to loop over a bunch of text files containing hotel rating data. An entry for a single hotel looks like this:
<Author>everywhereman2
<Content>Truncated for brevity....
<Date>Jan 6, 2009
<Rating>5 5 5 5 5 5 5 5
What irks me is that there is absolutely nothing in the documentation for this operator telling me that this is an acceptable text structure that can be parsed. Does anyone happen to know more about this operator?
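For reference, the entry layout shown above can be read outside RapidMiner with a few lines of Python. This is only a minimal sketch, assuming each field sits on a single line that starts with a `<Tag>` marker; the names `FIELD_LINE` and `parse_entry` are hypothetical, and nothing here describes how the Process Documents from Files operator itself handles the files.

```python
import re

# Hypothetical pattern: a line such as "<Author>everywhereman2".
# Assumes one field per line, with the tag at the start of the line.
FIELD_LINE = re.compile(r"^<(\w+)>(.*)$")


def parse_entry(text):
    """Return a dict mapping tag names to their values for one entry."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            tag, value = match.groups()
            fields[tag] = value.strip()
    return fields


# The sample entry from the post, with the content left truncated as given.
sample = """<Author>everywhereman2
<Content>Truncated for brevity....
<Date>Jan 6, 2009
<Rating>5 5 5 5 5 5 5 5"""

entry = parse_entry(sample)

# The rating line is a space-separated list of integers.
ratings = [int(r) for r in entry["Rating"].split()]
```

Splitting the fields out this way makes it easy to check what the files actually contain before handing them to a text-processing workflow.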
Best Answers
Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
The Text Processing extension is a bit sparse on operator reference.
What I would do is review the Text Analytics KB and watch these videos on how to properly load and parse text data and build models from it. I will be recording a very detailed and updated Text Mining in RapidMiner video over the next few weeks.
ccricha Member Posts: 9 Contributor II
Are there plans to update the documentation for this extension? Even just some JavaDoc would be better than nothing.