"Multi-label text classification problem"
I'm attempting to set up a multi-label (not just multi-class!) text classification experiment. To give you an idea: I have a data set of text documents, and each document can belong to one or more classes. Think blog posts with multiple topic tags. I would like to train and evaluate a machine learner on this data set.
My documents are stored in directories named after all applicable labels, much like below:
sports_events
> article1.txt
> article2.txt
politics_events
> article3.txt
politics
> article4.txt
...

So far, I've managed to turn my input documents into word vectors using "Process Documents from Files" and a combination of tokenization, stemming and filtering. But I have several questions:
1. How do I make sure RapidMiner understands that the labels I enter in the "text directories" list (in the "Process Documents from Files" block) are multiple labels, and not just one big agglutinated label? The "sports, events" label should become "sports" AND "events". Just using commas in the class name apparently doesn't work.
Disregarding this problem for a while, I also tried exporting the generated feature vectors into a sparse format I can feed to libSVM externally. Which brings me to question 2:
2. Using the "Write Special" block, I'm using the following format to attempt to write sparse vectors:
$l $s[ ][:]

However, the label in the output is the nominal label, not the integer mapping that libSVM would require. How do I write the integer instead of the nominal label?
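For comparison, here is a minimal Python sketch of the output this question is after, done outside RapidMiner. It assumes the nominal labels and dense feature vectors are already available as plain Python lists; the names `build_label_map` and `to_libsvm_line` are made up for illustration and are not RapidMiner or libSVM functions.

```python
def build_label_map(labels):
    """Give each distinct nominal label an integer id, in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
    return mapping


def to_libsvm_line(label_id, vector):
    """Format one example as '<label> <index>:<value> ...'.

    Indices are 1-based and ascending; zero-valued features are skipped,
    which is what makes the format sparse.
    """
    pairs = [f"{i}:{v!r}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(label_id)] + pairs)


# Toy data (assumed, not taken from a real export)
labels = ["politics", "sports", "politics"]
vectors = [[0.0012, 0.0, 0.031], [0.0, 0.0008, 0.002], [0.5, 0.0, 0.0]]

label_map = build_label_map(labels)
lines = [to_libsvm_line(label_map[l], v) for l, v in zip(labels, vectors)]
print("\n".join(lines))
```

The label map would have to be saved alongside the data file so that predictions can be translated back to the nominal labels later.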
And finally:
3. I would like to write the wordlist resulting from all the tokenization, stemming and filtering etc. to a file. This file should include at least the feature index and the matching realization. So for instance:
1: germany
2: bankers
3: a
...

Even better would be to write a kind of extended sparse feature vector, where each index:value pair is preceded by its realization in the text:

politics,events germany 1:0.0012 a 3: 0.0310 ...
politics germany bankers 2: 0.0008 a 3: 0.0020 ...

Is it possible to do this? If so, how? The only way I've been able to store the wordlist is with the "Write" block, which produces an unwieldy XML file...
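To make the two requested formats concrete, here is a small Python sketch that produces both outside RapidMiner. It assumes the vocabulary (in feature-index order) and the word vectors have already been exported into Python lists; `wordlist_lines` and `extended_sparse_line` are hypothetical helper names.

```python
def wordlist_lines(vocabulary):
    """One 'index: term' line per feature, with 1-based indices."""
    return [f"{i}: {term}" for i, term in enumerate(vocabulary, start=1)]


def extended_sparse_line(labels, vector, vocabulary):
    """Comma-joined labels, then 'term index:value' for every non-zero feature."""
    parts = [",".join(labels)]
    for i, (term, value) in enumerate(zip(vocabulary, vector), start=1):
        if value != 0:
            parts.append(f"{term} {i}:{value!r}")
    return " ".join(parts)


# Toy data (assumed): three features and one document with two labels
vocabulary = ["germany", "bankers", "a"]
wordlist = wordlist_lines(vocabulary)
line = extended_sparse_line(["politics", "events"], [0.0012, 0.0, 0.031], vocabulary)

print("\n".join(wordlist))
print(line)
```

Note that the extended lines are for human inspection only; libSVM itself would not accept the extra term tokens.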
Any help from more experienced RapidMiners would be greatly appreciated!
Answers
I managed to set multiple labels by using the "Split" block to split the label attributes on the comma, and by then setting the role of each of the new label columns to label1, label2, etc. So my data set is pretty much ready.
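For anyone preparing the data outside RapidMiner, the same comma split can be sketched in Python. Note that this is a variant, not the label1/label2 layout described above: it builds one 0/1 column per class, which is the shape most one-classifier-per-label setups expect. The function names are made up for illustration.

```python
def split_labels(raw_label, sep=","):
    """Split an agglutinated label such as 'sports, events' into its parts."""
    return [part.strip() for part in raw_label.split(sep) if part.strip()]


def to_indicator_rows(raw_labels, sep=","):
    """Return the sorted class list and one 0/1 row per document."""
    label_sets = [set(split_labels(raw, sep)) for raw in raw_labels]
    classes = sorted(set().union(*label_sets))
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, rows


classes, rows = to_indicator_rows(["sports, events", "politics, events", "politics"])
print(classes)
for row in rows:
    print(row)
```

With directory names like sports_events, passing `sep="_"` would do the same job, assuming no single label contains an underscore.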
Now I'm trying to set up the classification. Basically, what I want is to train an SVM classifier for each label in the training set, and each instance in the test set should be evaluated against each SVM model (of course using the appropriate label).
In a final phase, I want to set a threshold on the output probability of each label so I can determine which labels should be included in the final output.
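The thresholding step itself is simple once the per-label probabilities exist. A minimal Python sketch, assuming each document comes with a dict of label to probability (one value per per-label model); the 0.5 cut-off and the name `select_labels` are placeholders, and the threshold would normally be tuned on held-out data.

```python
def select_labels(probabilities, threshold=0.5):
    """Keep every label whose probability reaches the threshold, in sorted order."""
    return [label for label, p in sorted(probabilities.items()) if p >= threshold]


# Toy per-label probabilities for one document (assumed values)
probs = {"sports": 0.81, "events": 0.55, "politics": 0.12}
selected = select_labels(probs, threshold=0.5)
print(selected)
```

A document for which no label reaches the threshold ends up with an empty label list here; whether to fall back to the single most probable label in that case is a design decision this sketch leaves open.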
I'm already stuck at that first step: I've been able to build models for each label using the "Loop Labels" block containing a "Discretization" and a "libSVM" block. This returns a collection of models.
I can also make a collection of test example sets using "Loop Labels".
My question now is: how do I evaluate each test set on its corresponding model? ExampleSet_Collection[0] should be run through Model_Collection[0], ExampleSet_Collection[1] through Model_Collection[1] etc. (Kind of like the zip() operator in Python, if anyone's familiar with it.)
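To pin down what is being asked, this is the pairing expressed in Python. It is only an illustration of the intended logic, not a RapidMiner process: the "models" are hypothetical stand-in functions that map a feature dict to 0 or 1, and each example set is a list of (features, true label) pairs.

```python
def evaluate_pairwise(models, example_sets):
    """Apply the i-th model to the i-th example set; return one accuracy per pair.

    Assumes every example set is non-empty.
    """
    if len(models) != len(example_sets):
        raise ValueError("need exactly one example set per model")
    accuracies = []
    for model, examples in zip(models, example_sets):
        correct = sum(1 for features, truth in examples if model(features) == truth)
        accuracies.append(correct / len(examples))
    return accuracies


# Hypothetical stand-in models, one per label
models = [
    lambda f: 1 if f.get("goal", 0) > 0 else 0,      # "sports" model
    lambda f: 1 if f.get("election", 0) > 0 else 0,  # "politics" model
]

# One example set per label, in the same order as the models
example_sets = [
    [({"goal": 0.3}, 1), ({"election": 0.2}, 0)],
    [({"goal": 0.3}, 0), ({"election": 0.2}, 1), ({"goal": 0.1}, 1)],
]

accuracies = evaluate_pairwise(models, example_sets)
print(accuracies)
```

The length check matters because `zip()` silently stops at the shorter collection, which would hide a missing model or test set.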
Here's my unfinished setup as it is. I'd be grateful if someone could help me complete it: