A question about naive bayes based text classification

gfyang · October 2009

Hi,

I am testing Naive Bayes (NB) for text classification. As I understand it, the result should not be affected by the TF-IDF vector of the text, because NB considers the frequency of each term (t) in each category (c), i.e., p(t | c), and this information is stored in the WordList, not in the term vectors (i.e., the ExampleSet). Is that right?

However, after I changed the TF-IDF values in the ExampleSet, for example by multiplying them by a weight x, 0 < x < 1, the accuracy changed, and it changed by a different amount for each value of x. Why?

Sincerely yours,
gfyang

land · October 2009

Hi,
Naive Bayes is a general learning algorithm that works on tables. You can use it for text classification, but it is applicable to all other problems, too.
Although the original TF-IDF values of the documents were calculated using the word list, Naive Bayes knows nothing about the word list. It only takes the example set into consideration.
On the other hand, if you apply the same weight transformation to all examples in the example set, the Naive Bayes result shouldn't differ, because it treats all attributes as independent of each other. There might be some numerical problems at the limits of the computer's precision, though, causing slightly different results.
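
To make the scaling argument concrete, here is a minimal sketch of a toy Gaussian Naive Bayes. It assumes each numeric attribute is modeled with one normal distribution per class, estimated from the example set; that model, the class name, and the data are illustrative assumptions, not RapidMiner's actual implementation. Multiplying every value by the same weight w scales each mean by w and each variance by w^2, so every class score shifts by the same constant and the predicted class stays the same.

```java
import java.util.Arrays;

/** Toy Gaussian naive Bayes; an illustration only, not RapidMiner's implementation. */
final class GaussianNbScaleDemo {

    // Hypothetical training data: TRAIN[c] holds the rows of class c (two attributes each).
    static final double[][][] TRAIN = {
        {{0.9, 0.1}, {1.1, 0.3}, {1.0, 0.2}},
        {{0.2, 0.8}, {0.4, 1.0}, {0.3, 1.2}}
    };
    // Hypothetical examples to classify.
    static final double[][] QUERIES = {{0.8, 0.3}, {0.35, 0.9}, {0.6, 0.6}};

    /** Returns the class index with the highest log posterior for x. */
    static int predict(double[][][] train, double[] x) {
        int total = 0;
        for (double[][] rows : train) total += rows.length;
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < train.length; c++) {
            double[][] rows = train[c];
            double score = Math.log((double) rows.length / total); // log prior
            for (int j = 0; j < x.length; j++) {
                double mean = 0.0;
                for (double[] row : rows) mean += row[j];
                mean /= rows.length;
                double var = 0.0;
                for (double[] row : rows) var += (row[j] - mean) * (row[j] - mean);
                var /= rows.length; // maximum-likelihood variance, non-zero for this data
                double diff = x[j] - mean;
                // log of the normal density of x[j] under class c
                score += -0.5 * Math.log(2.0 * Math.PI * var) - diff * diff / (2.0 * var);
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    /** Returns a copy of v with every entry multiplied by w. */
    static double[] scaled(double[] v, double w) {
        double[] r = new double[v.length];
        for (int j = 0; j < v.length; j++) r[j] = v[j] * w;
        return r;
    }

    /** Scales all training rows and all queries by w, then classifies the queries. */
    static int[] predictAllScaled(double w) {
        double[][][] scaledTrain = new double[TRAIN.length][][];
        for (int c = 0; c < TRAIN.length; c++) {
            scaledTrain[c] = new double[TRAIN[c].length][];
            for (int i = 0; i < TRAIN[c].length; i++) {
                scaledTrain[c][i] = scaled(TRAIN[c][i], w);
            }
        }
        int[] out = new int[QUERIES.length];
        for (int q = 0; q < QUERIES.length; q++) {
            out[q] = predict(scaledTrain, scaled(QUERIES[q], w));
        }
        return out;
    }

    public static void main(String[] args) {
        for (double w : new double[] {1.0, 0.5, 0.05}) {
            System.out.println("weight " + w + " -> " + Arrays.toString(predictAllScaled(w)));
        }
    }
}
```

The predictions are identical for every non-zero weight in this sketch. A weight of exactly 0 is a degenerate case: all values collapse to 0, every variance becomes 0, and the model carries no information.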

Greetings,
Sebastian

gfyang · October 2009

Hi, Sebastian,

Thank you for the reply.

I ran several experiments. For example, I multiplied all the TF-IDF values by the same weight, then changed the weight and applied it to all the original TF-IDF values again. The results show that such a weight adjustment really does change the accuracy, even though all the TF-IDF values are adjusted by exactly the same weight.


double precision=0.0;

Iterator<Attribute> attributeIterator; // the iterator for all attributes
Iterator<Example> exampleIterator; // the iterator for all examples

// save the text vector into array
double text_array[][] = new double [num_exp][num_att-2];
exampleIterator = exampleSet.iterator(); // move the iterator to the beginning
for(int i=0; i<num_exp; i++)
{
	Example example = exampleIterator.next(); // read one example
	attributeIterator = attributes.allAttributes(); // build the iterator for the attributes
	for(int j=0; j<num_att-2; j++) // read all the attributes except the last two
	{
		Attribute att = attributeIterator.next();
		text_array[i][j] = example.getValue(att); // read the TF-IDF value into the array
	}
}

// adjust TF-IDF with weights
double fWeight = 0;
for(int i=0; i<=20; i++) // 21 weights: 0.0, 0.05, ..., 1.0
{
	exampleIterator = exampleSet.iterator(); // move the iterator to the beginning
	for(int i2=0; i2<num_exp; i2++)
	{
		Example example = exampleIterator.next();
		attributeIterator = attributes.allAttributes();
		for(int j=0; j<num_att-2; j++)
		{
			Attribute att = attributeIterator.next();
			double val = text_array[i2][j] * fWeight; // adjust the original TF-IDF by the weight
			example.setValue(att, val); // save the adjusted TF-IDF into the ExampleSet
		}
	}

	precision = my_validate_classiciation(); // do classification by naive bayes based on the adjusted TF-IDF
	System.out.println("(" + fWeight + "): " + precision + " ");

	fWeight += 0.05; // increase the weight
	fWeight = roundTwoDecimals(fWeight); // keep two places behind the decimal point
}
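
The helper roundTwoDecimals is called above but not shown in the post. A minimal sketch of what it might look like follows; the body and the wrapping class name are guesses, and only the method name comes from the post. It stops the repeated `+= 0.05` from accumulating floating-point drift in the printed weights.

```java
/** Possible implementation of the helper used in the post; the body is an assumption. */
final class RoundingSketch {

    /** Rounds value to two places after the decimal point. */
    static double roundTwoDecimals(double value) {
        return Math.round(value * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        double fWeight = 0.0;
        for (int i = 0; i < 20; i++) {
            fWeight = roundTwoDecimals(fWeight + 0.05); // same update as in the loop above
        }
        System.out.println(fWeight);
    }
}
```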

The results are:


(weight): precision
(0.0): 0.0 
(0.05): 0.3875 
(0.1): 0.3125 
(0.15): 0.3125 
(0.2): 0.3125 
(0.25): 0.2875 
(0.3): 0.275 
(0.35): 0.2625 
(0.4): 0.2625 
(0.45): 0.2625 
(0.5): 0.25 
(0.55): 0.25 
(0.6): 0.25 
(0.65): 0.2375 
(0.7): 0.2375 
(0.75): 0.2375 
(0.8): 0.2375 
(0.85): 0.2375 
(0.9): 0.2375 
(0.95): 0.2375 
(1.0): 0.2375

It seems that the differences in the results are too large to be ignored, so they are probably not caused by limits of computer precision.

So I guess that when RM does NB classification, the algorithm really reads the ExampleSet and performs some important calculations on it, which directly affect the precision.
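
One mechanism that would produce this kind of scale dependence is a lower bound on the per-class variance. This is purely an assumption for illustration and is not confirmed for RM 4.5; the class name, the statistics, and the floor value below are all hypothetical. The sketch uses a single attribute and equal class priors. Once the weight shrinks the values enough, the floor replaces the true variance of the tight class, and the predicted class flips.

```java
/** One-attribute Gaussian naive Bayes with an optional variance floor (hypothetical). */
final class VarianceFloorSketch {

    // Hypothetical per-class statistics on the unscaled data: {mean, variance}.
    static final double[][] CLASS_STATS = {
        {1.0, 0.01}, // class 0: tight cluster
        {2.0, 1.0}   // class 1: wide cluster
    };

    /**
     * Classifies the unscaled value x after everything is multiplied by w.
     * Equal class priors are assumed, so they are left out of the score.
     */
    static int predict(double x, double w, double varianceFloor) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < CLASS_STATS.length; c++) {
            double mean = w * CLASS_STATS[c][0];    // scaling by w scales the mean by w
            double var = w * w * CLASS_STATS[c][1]; // and the variance by w^2
            var = Math.max(var, varianceFloor);     // the assumed floor; 0.0 disables it
            double diff = w * x - mean;
            double score = -0.5 * Math.log(2.0 * Math.PI * var) - diff * diff / (2.0 * var);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double x = 1.3;
        for (double w : new double[] {1.0, 0.5, 0.05}) {
            System.out.println("weight " + w
                + ": with floor -> class " + predict(x, w, 0.001)
                + ", without floor -> class " + predict(x, w, 0.0));
        }
    }
}
```

Without the floor, the prediction is the same for every weight; with the floor, it changes at the smallest weight. Whether anything like this happens inside RM would have to be checked against its source.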

Sincerely yours,
gfyang

land · October 2009

Hi,
Which version of RapidMiner are you using?

By the way, there are many methods in the RapidMiner API that would make your life simpler...

Greetings,
Sebastian

gfyang · October 2009

Hi,

I am using RM 4.5.

I am developing a new idea to adjust the text vector, and I want to test this idea on several classic classification methods. I will try the other methods later.

Thank you for the help.

Sincerely yours,
gfyang
