binary text classification test-set problem
Hey,
I created a process to classify two categories of documents. Everything works fine as long as I reduce the test set (which comes from a different database / domain) to only one class (recall 99%). If I remove the filter so that both classes are in the test set, the whole process doesn't work anymore. I don't think it's a problem of overfitting, since the test data comes from another database. Currently my setup looks like this:
DB-Training -> Process Documents (TF/IDF) -> Train libSVM --------------------------V
DB-Test (different db) -> Filter Class 1 -> Process Documents (TF/IDF) -> Apply Svm -> Performance (Recall of Class 2 = 99%)
I did NOT connect the wordlist of the training-db "Process Documents" operator to the test-db "Process Documents" operator. If I do so, the recall drops to 0%. Am I doing something wrong in the Process Documents step of the training data, or am I missing something?
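To make the wordlist question concrete, here is a minimal sketch in plain Python (not RapidMiner) of what I understand the two wirings to do. The helper names `build_wordlist` and `vectorize` and the toy documents are made up for illustration, and the TF-IDF formula is a simplified assumption, not necessarily what Process Documents computes internally.

```python
import math
from collections import Counter


def build_wordlist(docs):
    """Learn a sorted vocabulary and IDF weights from whitespace-tokenized docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    vocab = sorted(doc_freq)
    # Simplified IDF (an assumption for this sketch).
    idf = {term: math.log(n_docs / doc_freq[term]) + 1.0 for term in vocab}
    return vocab, idf


def vectorize(doc, vocab, idf):
    """Map one document onto a fixed wordlist; terms not in the wordlist are dropped."""
    counts = Counter(doc.split())
    return [counts[term] * idf[term] for term in vocab]


# Hypothetical toy data standing in for DB-Training and DB-Test.
train_docs = ["cheap loan offer", "meeting agenda notes", "loan offer today"]
test_docs = ["agenda for loan meeting", "holiday notes"]

train_vocab, train_idf = build_wordlist(train_docs)
test_vocab, test_idf = build_wordlist(test_docs)

# Wordlist connected: test vectors use the columns and weights learned on training data.
shared = [vectorize(d, train_vocab, train_idf) for d in test_docs]

# Wordlist not connected: the test set gets its own columns and its own IDF weights.
separate = [vectorize(d, test_vocab, test_idf) for d in test_docs]

print("training wordlist:", train_vocab)
print("test wordlist:    ", test_vocab)
```

In this sketch the two wordlists differ in both length and term order, so column i of a separately built test vector generally stands for a different term than column i did during training. Whether that explains the 99% and 0% recall figures in my process is exactly what I am unsure about.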