"How to improve Classification in Text Mining"
I'm doing classification (15 classes) of technical papers using their abstracts.
My processes are simple.
Learning:
+ TextInput
+ String Tokenizer
+ English StopwordFilter
+ TokenLengthFilter
+ Binary2MultiClassLearner
+ LibSVMLearner
+ ModelWriter
Applying:
+ TextInput
+ String Tokenizer
+ English StopwordFilter
+ TokenLengthFilter
+ ModelLoader
+ ModelApplier
+ ExcelExampleSetWriter
I get results, but I'm not satisfied with them. How can I improve them?
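For readers coming from outside RapidMiner, the training pipeline above can be sketched in roughly equivalent scikit-learn terms. This is only an illustrative analogue, not the poster's actual process: the sample abstracts and labels are invented, and `TfidfVectorizer` options stand in for the StopwordFilter and TokenLengthFilter operators.

```python
# Hypothetical scikit-learn analogue of the RapidMiner learning pipeline:
# tokenize, drop English stopwords, keep tokens of length 3-25, then train
# a linear SVM wrapped for multi-class (Binary2MultiClassLearner + LibSVM).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data standing in for paper abstracts and their classes
abstracts = [
    "support vector machines for text categorization",
    "deep neural networks in image recognition",
    "database indexing and query optimization",
]
labels = ["ml", "ml", "db"]

pipeline = make_pipeline(
    # token_pattern mimics TokenLengthFilter; stop_words mimics StopwordFilter
    TfidfVectorizer(stop_words="english", token_pattern=r"(?u)\b\w{3,25}\b"),
    OneVsRestClassifier(LinearSVC()),
)
pipeline.fit(abstracts, labels)
print(pipeline.predict(["query planning in relational databases"]))
```

The "Applying" half of the process then corresponds to calling `predict` on new abstracts with the fitted pipeline.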
I've been searching the forum and have seen that feature selection is one way. There are lots of examples of FeatureSelection operator usage, but I couldn't find one that writes to a model file. One example from the installer is shown below, but I couldn't figure out where to add the ModelWriter. Or am I thinking about this the wrong way?
....
+ FeatureSelection
+ XValidation
+ NearestNeighbors
+ OperatorChain
+ ModelApplier
+ Performance
+ ProcessLog
I'm also thinking of forcing some attributes to have bigger weights. Is this a good thing to do, and how do I do it?
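Mechanically, "forcing attributes with bigger weights" just means scaling feature columns before training so the learner sees some attributes amplified. A minimal sketch with invented weight values:

```python
# Sketch of attribute weighting: multiply each feature column by a weight
# before training. The matrix and weights here are invented for illustration.
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
weights = np.array([2.0, 1.0, 0.5])   # boost column 0, damp column 2

X_weighted = X * weights              # NumPy broadcasts the weights row-wise
print(X_weighted)
# [[2.  2.  1.5]
#  [8.  5.  3. ]]
```

Whether this helps depends on the learner; an SVM with suitable regularization can often learn such scaling on its own.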
thanks,
Matthew
Answers
Regarding the feature selection: what you probably want is not a ModelApplier, but rather to save the attribute weights (AttributeWeightsWriter) and apply them later (AttributeWeightsApplier).
Regarding the optimization of the setup: there is no general answer. Try optimizing the parameters of the SVM and of the text input, try adding term n-grams, maybe add a dictionary of synonyms, and so on. It depends very much on your texts.
Cheers,
Simon
Sometimes it is tempting to tweak the answer and forget to ask whether the question makes any sense. Fifteen classes? Think about how many examples would be necessary to represent that problem space.
Then what is the ideal number of classes for text classification? And how do you solve the problem of classifying technical documents into many categories? Is data mining not the solution?
Matthew
Also, where would you add the AttributeWeightsWriter operator in this example?
+ FeatureSelection
+ XValidation
+ NearestNeighbors
+ OperatorChain
+ ModelApplier
+ Performance
+ ProcessLog
thanks,
Matthew
Of course, Data Mining is the solution ;D
Regarding the number of classes: what haddock meant was that you need a lot of examples/documents per category to (a) have enough information to distinguish the classes and (b) make any statistically reliable performance estimates. So, how many do you have?
Low performance values are an indication that the classes cannot easily be distinguished. Here are some rough ideas:
- If the classes are the leaves of a hierarchy, try going up the hierarchy and merging classes (e.g. merge "network administration" and "software engineering" into "computer science") to see whether the results improve. Performing feature selection on the different levels and comparing the results manually may give you a better feeling for where the problem is located.
- Merge classes iteratively and perform a one-vs-all classification. During scoring, aggregate the confidence values from the different models (e.g. take the maximum; the AttributeConstruction operator can implement that strategy).
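The max-aggregation step in the second idea can be sketched in a few lines. The class names and confidence values below are invented; each column stands for the confidence output of one one-vs-all model:

```python
# Aggregating one-vs-all confidences with a maximum: for each document,
# pick the class whose one-vs-all model is most confident. Values invented.
import numpy as np

classes = ["cs", "bio", "math"]
# rows = documents, columns = confidence from each one-vs-all model
confidences = np.array([[0.2, 0.7, 0.1],
                        [0.6, 0.3, 0.4]])
predicted = [classes[i] for i in confidences.argmax(axis=1)]
print(predicted)   # ['bio', 'cs']
```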
Regarding the posted process: after the FeatureSelection operator.

regards,
Steffen
For each category I have close to 100 examples. BTW, what is the ideal number of examples? I'm only working on the abstract section of the documents.
You're right. One reason my classification did not get good results was overlap between the categories; there are categories that I should have combined. But is it possible to do hierarchical categorization in RapidMiner? Sort of a superclass for some groups of classes, so that when the program cannot decide between two classes, it chooses their superclass. Do you have an example of this?
Last question: what exactly do attribute weights do? From what I understand, you apply attribute weights to an ExampleSet to change the values of the attributes. What else are they used for?
thanks a lot.
Matthew
RapidMiner offers the standard t-test ... but before we start testing, let's see whether we can achieve any improvements at all. As Haddock once said (oh, I should add this one to my signature), "RapidMiner is like Lego": you can achieve nearly anything with the right combination of operators. I will give you some hints:
- AttributeConstruction in combination with ChangeAttributeRole or ExchangeAttributeRoles to aggregate labels
- ProcessBranch to realize an if-else-statement
- ValueIterator allows you to iterate over the values of your label attribute
- ProcessLog to log the performance
It is quite hard to create an automatic process which finds the optimal merge of categories for your problem. Indeed, it would take an hour or more even for an experienced user, so I suggest that you try manual combinations (including the domain knowledge you have) to get a better feeling for which classes to merge. Please understand that I cannot provide a complete process here. Play around, and I guarantee that you will appreciate RapidMiner more and more.

An attribute weight is an indication of how important the attribute is for distinguishing the classes. In the case of FeatureSelection it is always 1 or 0 (use it or don't); other operators (like InformationGainWeighting) provide a less crisp evaluation. Use the AttributeWeightSelection operator to filter the attributes and remove redundant or (worse) disturbing information. As I said above, the optimal feature set may well depend on the current "merge situation" of your categories.
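The InformationGainWeighting/AttributeWeightSelection combination has a rough scikit-learn counterpart: score each attribute by mutual information with the label and keep only the strongest ones. The data below is synthetic, built so that exactly one column actually carries label information:

```python
# Weight attributes by mutual information with the label, then keep the
# top-scoring ones. One informative column plus pure-noise columns (invented).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                  # binary label
informative = y + rng.normal(0, 0.1, size=200)    # strongly correlated column
noise = rng.normal(0, 1.0, size=(200, 4))         # uninformative columns
X = np.column_stack([informative, noise])

selector = SelectKBest(
    lambda X, y: mutual_info_classif(X, y, random_state=0), k=1
).fit(X, y)
print(selector.get_support())   # only the informative column survives
```

As Steffen notes, rerunning this after every merge of categories may select a different feature set, because the weights are computed against the current label.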
I wish you success
regards,
Steffen
PS: If it won't work, try this: http://www.youtube.com/watch?v=egfCXLHfw-M ; (cannot get rid of this song)
Last question: for the feature selection, do you apply FeatureSelection to one class only or to more than one class? What I mean is: how many classes do you input in the TextInput operator? I tried both. The feature selection with one class runs fast, but the one with many classes failed; the error message shows "OutOfMemoryError: Java heap space". Is it OK to run feature selection separately for each class and then combine the attribute weight results later on?
thanks,
Matthew
I suppose that by "one class" you mean "one class vs. all other classes"; otherwise it makes no sense. As mentioned above, FeatureSelection tries to find a feature set which contains exactly the information (limited to the information available in the data) needed to separate the classes of the current classification problem, i.e. the current label.
That means the feature set will most probably change when you change the label. So there is no single correct strategy; the question is what you want to achieve and (as we have seen above) what can actually be learned.
If you have memory problems, try the GeneticAlgorithm operator instead, which delivers comparable results.
regards,
Steffen
PS: I have the slight feeling that you are missing some data mining basics. I suggest this book. RapidMiner is a tool for applying a science, so it is better to learn the science first and the tool afterwards. No offense.
For FeatureSelection you will need all classes of your classification task, because the selection optimizes the feature set for exactly this classification task. That's why there is a learner and a cross-validation inside: to estimate the performance of the current attribute set on this classification task.
If your data set contains only one class, you don't need any features at all, hence the forward selection is very fast: the performance is simply always 100%, with or without features.
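The learner-plus-cross-validation loop Sebastian describes is what scikit-learn's `SequentialFeatureSelector` does; a hedged sketch on synthetic data (not the poster's abstracts), using a nearest-neighbour learner as in the posted process:

```python
# Forward selection: grow the attribute set one feature at a time, scoring
# each candidate set with a cross-validated learner. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           random_state=0)
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=3,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask over the 8 features, 3 selected
```

The memory blow-up Matthew saw comes from the many candidate feature sets such a loop evaluates on a large text feature space, which is why a genetic or streaming variant can help.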
If you need forward selection and the genetic selection doesn't fit your needs, we provide a plugin with an improved and very memory-efficient version of FeatureSelection. You might ask for a quote if you are interested.
Greetings,
Sebastian
I think Rapid-I should publish a book on data mining with RapidMiner. The content of this forum is more than enough to fill a book.
thanks,
Matthew
You won't believe it, but we are working on a book...
Greetings,
Sebastian
Matthew
That depends on our workload for other projects and such. A first introductory part should be published together with the final release. Let's hope we get it done by then...
Greetings,
Sebastian