How to analyze negative attribute values
Hi,
I am trying to build a prediction model that predicts the category of a case from its description. I have two training data sets. The first contains the case ID, description, and category:
ID Description Category
1 "some txt" A
2 "some text 2" B
The second data set contains rows like the following, which tell me which category a case should not fall into:
ID Description Category
1 "some other txt" notA
2 "some other text 2" notB
I want to train my model using both data sets, but I am having trouble feeding the second one to the model. I want to feed it in such a way that it gives the model correct information. Any help would be great. Thanks!
Best Answers
atul_kotwale Member Posts: 5 Contributor II
Hi @kypexin
Thanks for the reply. I am also considering not including the negative results, but I have one more thought: what if I convert the negative data set to the format below, assigning 0 to the category that is not possible and 1 to all categories that are still possible?
ID Description A B C
1 "some other txt" 0 1 1
2 "some other text 2" 1 0 1
and similarly convert the positive data set to the format below:
ID Description A B C
1 "some txt" 1 0 0
2 "some text 2" 0 1 0
If I feed the above data to my model, would it confuse the model?
Thanks
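For reference, the conversion described in this post can be sketched in a few lines of Python. This is only an illustration: the category list A, B, C is taken from the tables above, the `encode` helper is a hypothetical name, and it assumes that negative labels are always written as "not" followed by the category name.

```python
# Assumed full label set, taken from the tables in the post.
CATEGORIES = ["A", "B", "C"]

def encode(label, categories=CATEGORIES):
    """Return a {category: 0/1} mapping for a label such as "A" or "notA"."""
    if label.startswith("not"):
        excluded = label[len("not"):]
        # Negative label: every category except the excluded one stays possible.
        return {c: int(c != excluded) for c in categories}
    # Positive label: exactly one category is set.
    return {c: int(c == label) for c in categories}

# Example rows copied from the thread.
positive = [(1, "some txt", "A"), (2, "some text 2", "B")]
negative = [(1, "some other txt", "notA"), (2, "some other text 2", "notB")]

rows = [
    {"Description": text, **encode(label)}
    for _, text, label in negative + positive
]
for row in rows:
    print(row)
```

Note that a negative row ends up with more than one 1, so the two kinds of rows do not carry the same meaning even though they share the same columns.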
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
Hi @atul_kotwale
I am afraid you have to think about reformulating the task. You cannot have 'negative' labels like the ones you described.
For example, if "some other text 2" = notB, then it is either A or notA, and in the latter case it must be the third category, C.
On the other hand, "some txt" = A is also obviously notB.
So an example can only be labeled as belonging to some category; you cannot label an example as not belonging to some category.
Vladimir
http://whatthefraud.wtf
Hi @atul_kotwale
Yes.
Your first example is marked as both B and C, which again is not possible in terms of ML data.
There should be only one "1" in each row if you want to predict category A, B, or C for any given description.
But this is a slightly different task from what you initially had in mind: this way you just categorize each text separately, and not much more. For example, both "some other text 2" and "some txt" are from category A (as I understood, that is not what you want to achieve).
More generally speaking, you cannot feed the model two different data sets with different meanings of categories.
The model should still work with a single data set, in our case this one, where all examples are actually different:
ID Description A B C
1 "some other txt" 0 1 1
2 "some other text 2" 1 0 1
3 "some txt" 1 0 0
4 "some text 2" 0 1 0
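The "only one 1 per row" rule can be checked mechanically before training. The following is a minimal sketch using the four rows from the table above; the `is_single_label` helper is a hypothetical name introduced only for this illustration.

```python
# The combined table from the post: (ID, Description, indicator columns).
table = [
    (1, "some other txt",    {"A": 0, "B": 1, "C": 1}),
    (2, "some other text 2", {"A": 1, "B": 0, "C": 1}),
    (3, "some txt",          {"A": 1, "B": 0, "C": 0}),
    (4, "some text 2",       {"A": 0, "B": 1, "C": 0}),
]

def is_single_label(flags):
    """True when exactly one category indicator is set to 1."""
    return sum(flags.values()) == 1

ambiguous = [row_id for row_id, _, flags in table if not is_single_label(flags)]
usable = [row_id for row_id, _, flags in table if is_single_label(flags)]

print("rows with more than one category:", ambiguous)  # [1, 2]
print("rows with exactly one category:", usable)       # [3, 4]
```

Rows 1 and 2 come from the negative data set, which is why they fail the check.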
Vladimir
http://whatthefraud.wtf
@kypexin Thanks. I got it now.
Hi @atul_kotwale,
One idea for using it is to build a "Not_A model". You then score the other data set with it and use confidence(not_a) as a new variable for further modelling.
BR,
Martin
Dortmund, Germany
Hi @mschmitz,
Thanks for the reply. If I understand correctly, you mean I should build a model using the negative data set and then apply it to the positive data set. The output will contain three new columns (confidence(not_a), confidence(not_b), confidence(not_c)), and I should include these new columns for further training?
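One possible reading of the workflow discussed above, sketched in plain Python rather than as a RapidMiner process. Everything here is an assumption made for illustration: `TinyNaiveBayes` is a hypothetical stand-in for whatever learner is actually used, the rows are the toy examples from this thread, and the column names simply mirror the confidence(not_a) naming.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (stand-in for any learner)."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def confidences(self, text):
        """Return {label: probability}; the values sum to 1."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.labels:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {l: v / norm for l, v in exp_scores.items()}

# Toy rows copied from the thread.
negative = [("some other txt", "notA"), ("some other text 2", "notB")]
positive = [("some txt", "A"), ("some text 2", "B")]

# Step 1: train the "not_X" model on the negative data set only.
neg_model = TinyNaiveBayes().fit([t for t, _ in negative], [l for _, l in negative])

# Step 2: score the positive data set and append the confidences as new columns.
augmented = []
for text, label in positive:
    row = {"Description": text, "Category": label}
    for neg_label, conf in neg_model.confidences(text).items():
        row[f"confidence({neg_label})"] = conf
    augmented.append(row)

for row in augmented:
    print(row)
```

The augmented rows would then be the training input for the final model that predicts Category. Only negative labels that actually occur in the negative data set produce a column, so with the toy rows above there is no confidence(notC) column.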