
"[SOLVED] the Curse of Text High Dimensional nature"

siamak_want Member Posts: 98 Contributor II
edited June 2019 in Help
Hi experts,

My question concerns a text classification problem in a real-world project:

I have built a classification model from a relatively large labeled dataset. The model has about 50,000 attributes. Now I want to apply the model to new, unseen data, and this is where the problem arises.
My test data has about 2,000 attributes, and about 500 of them do not exist in the model at all. That is, the model never saw these attributes at training time because they did not occur in the training set. Can the model still classify documents containing such unseen features accurately? Please explain if you have any ideas about this challenge.

Thanks a lot.

Answers

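  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    This is not the classical curse of dimensionality, but there is a solution nevertheless: connect the "wor" (word list) output of the training Process Documents operator to the "wor" input of the Process Documents operator in the apply branch. The apply branch then builds its attributes from the training word list, so words that were not seen during training are ignored and the test data ends up with exactly the attributes the model expects. Please have a look at this thread: http://rapid-i.com/rapidforum/index.php/topic,4802.0.html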

    Best, Marius
  • siamak_want Member Posts: 98 Contributor II
    Thanks a lot Marius.
    You were right.
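
For readers outside RapidMiner, the idea behind the accepted answer (reusing the training word list when vectorizing new documents) can be sketched in plain Python. This is only an illustration under stated assumptions, not what the Process Documents operator does internally: the whitespace tokenizer, the raw term counts, the example documents, and the helper names `tokenize`, `build_word_list`, and `vectorize` are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_word_list(train_docs):
    """Collect the sorted vocabulary seen at training time."""
    return sorted({tok for doc in train_docs for tok in tokenize(doc)})


def vectorize(doc, word_list):
    """Count only tokens from the training word list; unseen tokens are ignored."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in word_list]


# Hypothetical training documents; the word list is fixed from these alone.
train_docs = ["cheap pills online", "meeting agenda attached"]
word_list = build_word_list(train_docs)

# A new document containing words the model never saw ("flights", "booking").
new_doc = "cheap flights online booking"
vector = vectorize(new_doc, word_list)

# Tokens that are dropped because they are absent from the training word list.
unseen = sorted(set(tokenize(new_doc)) - set(word_list))

print(word_list)
print(vector)
print(unseen)
```

The new document is always mapped onto the same columns as the training data, so the model receives a vector of the expected length. The unseen words contribute nothing to the prediction; if many informative words fall into that group, retraining on more recent data may be worth considering.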