Loading Folder Names for Text Processing
Suppose we have a bunch of files spread across different folders. When we bring them into RapidMiner for analysis, the folder name may be important, either as an input or simply as an identifying piece of information, so we want to read it in as data.
I admit this is probably one of those things one never expects to have to do, and yet I had to do it for a customer. Learning to do it deepens one's skills and showcases the flexibility of RapidMiner.
If you follow this article carefully, you can use the attached process to repeat what we have done here. No data files are attached, because yours will be different and will live in different places in your file system. Here is what I am using and why:
- The Text Processing extension installed.
- Two empty text files called applezz.txt and orangezz.txt.
- One folder (delete) in my Documents folder containing two subfolders named apple and orange which, in turn, contain their respective text files mentioned above. To make things clear:
C:\Users\KonstantinosBonikos\Documents\delete\apple\applezz.txt
C:\Users\KonstantinosBonikos\Documents\delete\orange\orangezz.txt
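- The number 46. This number will be different for you and is derived by counting the number of characters before the name of the folder we want. In this case, those characters are C:\Users\KonstantinosBonikos\Documents\delete\ (46 characters, including the final backslash).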
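If you would rather not count by hand, the offset can be checked with a few lines of Python. This is only a sketch for verifying the number; prefix_length is a hypothetical helper name, and the script is not part of the RapidMiner process.

```python
# Hypothetical helper (not part of RapidMiner): count how many characters
# come before the folder name inside a Windows-style path.
def prefix_length(path: str, folder: str) -> int:
    # Look for the folder name surrounded by backslashes so that we match
    # the folder "apple" and not the file "applezz.txt".
    marker = "\\" + folder + "\\"
    # index() gives the zero-based position of the leading backslash;
    # adding 1 counts that backslash as part of the prefix.
    return path.index(marker) + 1


path = r"C:\Users\KonstantinosBonikos\Documents\delete\apple\applezz.txt"
print(prefix_length(path, "apple"))  # 46
```

Substitute one of your own file paths and folder names to get the number that applies to your system.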
1. First we place a Loop Files subprocess operator onto the Process area. Make sure to tick both the recursive and enable macros tickboxes.
Don't worry if enable parallel execution is not an option in the version of RapidMiner you are using. It is not important here.
Make sure to point the directory parameter to the location where your folders are saved.
2. Double-click the Loop Files subprocess operator and place a Read Document and a Process Documents operator inside it, leaving the default values and connecting them as normal.
The inside of the Process Documents operator is empty apart from a through connection.
What we are doing here is reading the applezz.txt and orangezz.txt files as documents and by processing them, we are importing their path name as metadata.
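For readers who think in code, the effect of steps 1 and 2 can be approximated in Python. This is an analogy only, not how RapidMiner implements these operators. collect_documents is a hypothetical name, and the sketch assumes that the folder_name macro holds the full path of the containing folder, which is what the length arithmetic in step 4 implies.

```python
from pathlib import Path


def collect_documents(root):
    """Return one record per .txt file under root, searching subfolders too.

    Rough analogy of Loop Files (recursive) + Read Document +
    Process Documents; the field names mirror the article, but the
    function itself is illustrative and not a RapidMiner API.
    """
    records = []
    for file_path in sorted(Path(root).rglob("*.txt")):
        records.append({
            # The document content (empty for the example files).
            "text": file_path.read_text(encoding="utf-8"),
            # Full path of the file, kept as metadata.
            "metadata_path": str(file_path),
            # Assumption: full path of the folder that contains the file.
            "folder_name": str(file_path.parent),
        })
    return records
```

Calling collect_documents on the delete folder from the example would yield two records, one per text file, each carrying its own path and containing folder.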
3. We now take the data that is produced, which includes the metadata_path of each file.
This is where the counting becomes important. We are going to create a couple of attributes next based on the metadata_path.
4. Connect the data output to a Generate Attributes operator and create the following attributes using formulas.
- The ClassName attribute is set to whatever the folder_name value is, using the expression %{folder_name}
Remember folder_name was set as a macro by the Loop Files operator when we selected enable macros in step 1.
- The FolderName attribute is set by using cut(Nominal text, Numeric start, Numeric length).
- Nominal text is the folder name as represented by %{folder_name}
- Numeric start is the position where the folder name starts in the path name. In my case, it was at position 46.
- Numeric length is how many characters we take. It has to be a number, and it varies with the folder name, so we compute it: we take the length of the total folder name and subtract the position where the name we want starts, using length(%{folder_name})-46.
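The arithmetic behind the FolderName expression can be checked with a small Python sketch. The cut function below is a stand-in written for illustration; it assumes that RapidMiner's cut counts the start position from zero, and that folder_name holds the full path of the folder, as in my example.

```python
# Stand-in for the expression function cut(text, start, length).
# Assumption: start is zero-based, so start=46 skips the first 46 characters.
def cut(text: str, start: int, length: int) -> str:
    return text[start:start + length]


# Assumed value of the folder_name macro for the first example file.
folder_name = r"C:\Users\KonstantinosBonikos\Documents\delete\apple"
start = 46

# The whole string is 51 characters long, so the length argument is 51 - 46 = 5.
print(cut(folder_name, start, len(folder_name) - start))  # apple
```

Because the length is computed from each value rather than hard-coded, the same expression returns orange for the other folder without any changes.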
5. Run the process. The results give us the folder names as data.
Feel free to download the attached process as an .rmp file. It can be imported via File > Import Process.