Plain Text Classification/Clustering
Hi all,
This is the scenario.
I have an input text file containing many thousands of paragraphs of comments made by different people in plain English. Each person's comment or statement is basically one paragraph, separated by a \n of course.
I want to read in this single file and have RapidMiner assign each paragraph within the file to a particular cluster or topic. I am aware that RapidMiner will expect me to specify up front how many clusters or unique classifications I want. This is fine, although ideally I would like RapidMiner to determine this for me based on the input file.
I have installed the text plugin for RapidMiner and am using the TextInput operator to read the single input file. However, I am having difficulty getting RapidMiner to treat each paragraph within the file as one example. Any ideas on how this can be done?
Secondly, I would like to know which type of learning is most suitable for my problem above: unsupervised or supervised?
Finally, once the type of learning best suited to this task is decided, can somebody suggest which algorithms work best for classifying natural English language text?
My plan is to create a learner (model) that can then easily be applied to future comments as and when they occur.
Thanks in advance for your time.
Ritesh
Answers
For tasks like this you can probably use the "Segmenter" operator, which is also part of the text plugin.
Cheers,
Ingo
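To make the idea of segmenting concrete outside of RapidMiner, here is a minimal Python sketch of splitting one text file into one example per paragraph. It is not the Segmenter operator itself, only an illustration of the same step. The function name `split_paragraphs` and the sample text are made up for this example, and it assumes each comment sits on its own line as described in the question.

```python
def split_paragraphs(text):
    """Return one string per non-empty line, treating each line as one comment."""
    # splitlines() handles \n as well as \r\n; blank lines are skipped.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample standing in for the contents of the input file.
sample = "I like the new layout.\nDelivery was late again.\n\nSupport was helpful.\n"

examples = split_paragraphs(sample)
for index, paragraph in enumerate(examples):
    print(index, paragraph)
```

Each element of `examples` then plays the role of one example (one row) for whatever learner is applied afterwards.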
When you say 'segmentation' are you referring to the problem of reading in the text file itself, or is this the actual learning you are referring to?
Cheers,
Ingo
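On the learning side of the question, here is a rough Python sketch of one unsupervised option, assuming no labelled comments are available. It uses a simple single-pass "leader" clustering over bag-of-words vectors with cosine similarity, so the number of clusters falls out of a similarity threshold rather than being fixed up front. This is not what RapidMiner does internally. The names `bag_of_words`, `cosine`, `leader_cluster`, the threshold value and the sample comments are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in a paragraph."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def leader_cluster(paragraphs, threshold=0.5):
    """Assign each paragraph to the most similar cluster leader.

    A paragraph that is not at least `threshold` similar to any existing
    leader starts a new cluster, so the cluster count is not set in advance.
    """
    leaders = []  # bag-of-words vector of the first member of each cluster
    labels = []
    for paragraph in paragraphs:
        vec = bag_of_words(paragraph)
        best, best_sim = None, threshold
        for idx, leader in enumerate(leaders):
            sim = cosine(vec, leader)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            leaders.append(vec)
            best = len(leaders) - 1
        labels.append(best)
    return labels


# Hypothetical comments standing in for paragraphs from the input file.
comments = [
    "delivery was late",
    "late delivery again",
    "great support team",
    "support team was great",
]
labels = leader_cluster(comments)
print(labels)
```

The result depends on the order of the comments and on the threshold, so this is only a starting point. If labelled examples exist, a supervised classifier trained on the same kind of word vectors would be the alternative to compare against.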