Preprocessing texts
Hello everyone!
I'm working with RapidMiner to preprocess texts: starting from the raw texts, I want to obtain a matrix whose rows are my texts and whose columns are the features. I feed the process a folder containing all my texts, and I want to be sure of one detail: in what order are the texts stored in the final matrix?
Suppose my folder contains:
a.txt
b.txt
c.txt
In the matrix I'll have three rows:
1 -> a.txt
2 -> b.txt
3 -> c.txt
Is this correct? Can I be sure that the rows of my matrix will correspond exactly to the documents in alphabetical order?
Thanks for your cooperation
Lorenzo
Answers
As far as I know, the files are processed by default in the same order in which the text files are listed in the document folder, i.e. they should be processed in alphabetical order. You can, however, make sure by setting the "id_attribute_type" parameter of the TextInput operator to either "short" (filenames) or "long" (filepaths). Each example will then get a human-readable ID instead of a simple number.
Cheers,
Ingo
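If you want to check the expected row order outside RapidMiner, the following is a minimal Python sketch, not part of RapidMiner itself. It lists the .txt files of a folder in sorted order and numbers them from 1, which is the mapping described in the question. The function names rows_by_filename and example are hypothetical, and the sketch only assumes that the files are read in sorted filename order; whether the operator really does so is best confirmed with the ID attribute mentioned above.

```python
import os
import tempfile


def rows_by_filename(folder):
    """Return (row_number, filename) pairs for the .txt files in folder.

    os.listdir makes no promise about ordering, so the names are sorted
    explicitly. Row numbers start at 1, as in the question.
    """
    names = sorted(
        name
        for name in os.listdir(folder)
        if name.endswith(".txt") and os.path.isfile(os.path.join(folder, name))
    )
    return list(enumerate(names, start=1))


def example():
    """Create a.txt, b.txt and c.txt in a temporary folder and map them to rows."""
    with tempfile.TemporaryDirectory() as folder:
        # Write the files out of order on purpose; the result must not depend on it.
        for name in ("c.txt", "a.txt", "b.txt"):
            with open(os.path.join(folder, name), "w") as handle:
                handle.write("sample text")
        return rows_by_filename(folder)


if __name__ == "__main__":
    for row, name in example():
        print(row, "->", name)
```

Note that sorted compares strings by character code, so uppercase names sort before lowercase ones (B.txt comes before a.txt). If your filenames mix cases, the order may differ from what a file browser shows.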