Text Mining classification problem with two data sets
mschmidkon
Member Posts: 2 Contributor I
Hey!
I have an issue with text mining and classification according to keywords with two datasets. The goal is to classify products according to textual description.
INITIAL SITUATION:
I've got two data sets. The first contains a unique identifier (a number representing a product) and four columns of text describing that product (short/long text description, etc.). The second contains two columns: text describing a classification label, and the corresponding classification code. The goal is to classify the products from data set 1 according to data set 2. To do that, identical word occurrences have to be identified, and the classification code with the most occurrences of similar words should be chosen. The process should take one product from the first data set and check it against all labels from the second data set to find the best-fitting label.
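To make the intended matching rule concrete, here is a minimal Python sketch of it outside RapidMiner. It is only an illustration under assumptions: the function names, the sample labels, and the codes `C100`/`C200` are hypothetical, and the tokenizer is a plain lowercase split without stemming or stopword filtering.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def classify(product_texts, labels):
    """Pick the code whose label words occur most often in the product texts.

    product_texts: list of description strings for one product
    labels: list of (label_text, code) pairs
    Returns (code, score); code is None if no word matches at all.
    """
    # Count every token across all text columns of the product.
    product_counts = Counter(tok for t in product_texts for tok in tokenize(t))
    best_code, best_score = None, 0
    for label_text, code in labels:
        # Score = total occurrences in the product of each distinct label word.
        score = sum(product_counts[tok] for tok in set(tokenize(label_text)))
        if score > best_score:
            best_code, best_score = code, score
    return best_code, best_score

# Hypothetical sample data, not taken from the real files.
labels = [("office chair furniture", "C100"), ("laser printer toner", "C200")]
product = ["Ergonomic office chair", "Adjustable chair with armrests", "", ""]
print(classify(product, labels))  # ('C100', 3)
```

Ties go to the first label encountered here; a real process would need an explicit rule for ties and for products with no matching words.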
CURRENT SITUATION:
I created a RapidMiner process that reads the two CSV files separately and converts the input with 'Process Documents from Data', including Tokenizing, Filter Stopwords, Stem and Generate n-Grams. The result set contains the occurrences of the tokenized words. Now I want to compare the result sets of the two data sets (they don't have the same number of attributes in the same order, but some attributes are identical) with the goal of finding 'similar' words and classifying the product. Does anybody know how to compare these two data sets with a RapidMiner operator, and how to classify the products?
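As an illustration of the comparison being asked for, the sketch below treats each row of the two word-occurrence tables as a sparse word-to-count mapping, so differing attribute sets and column order do not matter, and scores a product against each label with cosine similarity. This is not a RapidMiner operator: cosine similarity is swapped in as one common choice, and all names and sample values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse word-count dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

def best_label(product_vec, label_vecs):
    """Return the classification code whose word vector is closest to the product."""
    return max(label_vecs, key=lambda code: cosine(product_vec, label_vecs[code]))

# Hypothetical stemmed word counts for one product and two labels.
product_vec = {"offic": 1, "chair": 2, "ergonom": 1}
label_vecs = {
    "C100": {"offic": 1, "chair": 1, "furnitur": 1},
    "C200": {"laser": 1, "printer": 1, "toner": 1},
}
print(best_label(product_vec, label_vecs))  # C100
```

Using dicts keyed by word sidesteps the attribute-alignment problem, because only the words both sides share enter the score.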
Thank you very much!
Michael
Best Answer
rfuentealba
RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hey @mschmidkon,
Would you mind sharing your process with us, so that we can provide you better guidance?
All the best,
Rod.
Answers