A question about naive bayes based text classification
Hi,
I am testing Naive Bayes (NB) for text classification. As I understand it, the result should not be affected by the TF-IDF vectors of the texts, because NB considers the frequency of each term t in each category c, i.e., p(t | c), and this information is stored in the WordList, not in the term vectors (i.e., the ExampleSet). Is that right?
However, when I change the TF-IDF values in the ExampleSet, for example by multiplying them by a weight x with 0 < x < 1, the accuracy changes, and by a different amount for each weight x. Why?
Sincerely yours,
gfyang
Answers
Naive Bayes is a general-purpose learning algorithm that works on tables. You can use it for text classification, but it is applicable to any other problem as well.
Although the original TF-IDF values of the documents were calculated using the word list, Naive Bayes never sees the word list. It only takes the example set into consideration.
On the other hand, if you apply the same weight transformation to all examples of the example set, the Naive Bayes result should not differ: the attributes are treated as independent of each other, and a common factor rescales the distribution estimated for every class in the same way. There may be some numerical problems at the limits of the computer's precision, though, which can cause slightly different results.
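To illustrate the point about uniform scaling, here is a minimal sketch of a Gaussian Naive Bayes written with NumPy. It is not RapidMiner's implementation: the function names `fit_gaussian_nb` and `predict_gaussian_nb` and the random data are assumptions made for this example only. It fits a per-class mean and variance for every attribute, then compares the predictions on the original table with those on the same table multiplied by a weight.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-attribute means/variances for each class."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class maximizing log p(c) + sum_j log N(x_j; mean_cj, var_cj)."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]          # shape (examples, classes, attributes)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Hypothetical data: two classes, three continuous attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.1, (50, 3)), rng.normal(0.5, 0.2, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

base = predict_gaussian_nb(fit_gaussian_nb(X, y), X)
for w in (0.5, 0.1):
    scaled = predict_gaussian_nb(fit_gaussian_nb(w * X, y), w * X)
    print(w, np.array_equal(base, scaled))
```

In this idealized setting, multiplying every value by w multiplies each class-conditional density by the same factor 1/w per attribute, so the ranking of the classes is unchanged. An implementation that adds anything scale-dependent on top of this could behave differently.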
Greetings,
Sebastian
Thank you for the reply.
I ran several experiments. For example, I multiplied all the TF-IDF values by the same weight, then changed the weight and applied it to all the TF-IDF values again. The results show that this weight adjustment really does change the accuracy, even though all the TF-IDF values are scaled by exactly the same weight. The results are: The differences seem too large to be ignored, so they are probably not caused by limited computer precision.
So I guess that when RM performs NB classification, the algorithm really does read the ExampleSet and bases some important calculations on it, which directly affects the classification results.
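The thread does not establish why RM 4.5 behaves this way. As one possible mechanism, and purely as an assumption, the sketch below shows how a Gaussian Naive Bayes that clamps variances to an absolute minimum stops being scale-invariant on sparse, TF-IDF-like data. The constant `MIN_VAR`, the helper names, and the toy data are hypothetical and are not taken from RapidMiner.

```python
import numpy as np

MIN_VAR = 1e-3  # hypothetical absolute variance floor, NOT a RapidMiner setting

def fit(X, y, min_var=MIN_VAR):
    """Gaussian NB estimates; variances below min_var are clamped to min_var."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, np.maximum(variances, min_var)

def predict(model, X):
    """Pick the class with the highest log prior plus summed log densities."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Toy one-attribute data: class 0 never contains the term (zero variance),
# class 1 contains it in half of its documents.
X_train = np.array([[0.0], [0.0], [0.0], [0.0], [0.0], [1.0], [0.0], [1.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_test = np.array([[0.0], [0.2], [0.9]])
y_test = np.array([0, 1, 1])

def accuracy(w):
    """Scale training and test values by the same weight w, then evaluate."""
    model = fit(w * X_train, y_train)
    return float(np.mean(predict(model, w * X_test) == y_test))

for w in (1.0, 0.1, 0.01):
    print(w, accuracy(w))
```

With small weights the true variances fall below the fixed floor, so the clamped variances no longer shrink with the data and the class ranking can change. Whether RM 4.5 does anything like this would have to be checked against its source code.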
Sincerely yours,
gfyang
Which version of RapidMiner do you use?
By the way, there are many methods in the RapidMiner API that would make your life simpler...
Greetings,
Sebastian
I am using RM version 4.5.
I am developing a new idea to adjust the text vector, and I want to test this idea on several classic classification methods. I will try the other methods later. Thank you for the help.
Sincerely yours,
gfyang