How to connect the Set Role operator and the Apply Model operator
hi
I have two questions and would appreciate your guidance.
1- I have a dataset of 5000 samples that have no labels. I also have a second dataset of 100 labeled samples, none of which appear in the 5000-sample dataset. Is it reasonable to remove the labels from the 100 samples, cluster them with a clustering algorithm, then add the labels back and check how many samples were clustered correctly? And then, if the clustering accuracy is high, cluster the 5000 samples with the same algorithm?
2- I built the scenario from my first question in RapidMiner, but I do not know how to connect two of the operators. Does anyone know how to connect the Set Role operator and the Apply Model operator? I am attaching the related file and hope you can help.
Also, the dataset is available at the link below:
https://drive.google.com/drive/folders/1t2qEnc7K35IHKfDVvG2dqEHZ_lNHZBis
Answers
Hi,
Regarding your first question:
Even if your approach is theoretically valid, there is no need to remove the label before clustering. If you set the role of that attribute to "label", the clustering algorithm will ignore it.
But if you have that label, why not use a supervised learning algorithm to train a model that predicts it directly?
You can then apply this model to your second dataset (the one without labels). The only potential issue I see there is that the training size
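As a rough illustration of the "train on the labeled set, apply to the unlabeled set" idea, here is a minimal Python sketch. It is not the RapidMiner process from this thread: it swaps in a very simple nearest-centroid classifier as a stand-in for whatever learner you would actually use, and the toy rows, labels, and function names are all hypothetical.

```python
from collections import defaultdict
from math import dist

def train_nearest_centroid(rows, labels):
    """Return {label: centroid} computed from labeled feature rows."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(model, rows):
    """Assign each row the label of its closest centroid."""
    return [min(model, key=lambda label: dist(model[label], row))
            for row in rows]

# Hypothetical toy data: two numeric features per sample.
labeled_rows = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
labels = ["One", "One", "Two", "Two"]
unlabeled_rows = [[0.9, 1.1], [4.9, 5.1]]  # stands in for the 5000 samples

# The label is only needed for training; prediction uses features alone.
model = train_nearest_centroid(labeled_rows, labels)
predictions = predict(model, unlabeled_rows)
print(predictions)
```

The point of the sketch is only that the label column is required at training time, not at application time, which is the same split that a learner followed by Apply Model expresses.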
To connect the two operators, left-click one of the ports you want to connect, then move to the other port and click again (see this tutorial video for an example: https://youtu.be/ophGqpUexKI?t=2m14s).
Best,
David
hi
About your first suggestion: a supervised learning algorithm such as Decision Tree requires a label column to be specified in its input, so it cannot be trained on the 5000 unlabeled samples. What I want is to measure the accuracy of a clustering algorithm such as K-means on the 170 labeled samples and then, if that accuracy is high, cluster the 5000 samples with the same algorithm.
About your second suggestion: as you can see, the Apply Model operator needs a model as input, and when I connect the exa port of the Set Role operator to the mod port of Apply Model, an error is shown. I need both operators but I cannot connect them.
Best regards,
Mina
Hi @m_gholami1991, Hi @David_A,
Sorry, @m_gholami1991, I come with questions rather than answers:
I played with your data and built a "classic process" with a Decision Tree.
The resulting model is the following:
or, in another form:
If I understand correctly, the model is not able to predict (label = One)? However:
When the model is applied to the training set (output of a Cross Validation), (label = One) is predicted by the model in some cases:
Another case that is not intuitive to me is the following:
According to the model, (label = Two) is predicted only if Marital > 2.5; however, there are cases where
(label = Two) is predicted with Marital <= 2.5 (Marital = 2) with a confidence of 1:
Can you enlighten me on these cases, which are not intuitive to me?
Thank you for your answers,
Regards,
Lionel
NB : The process :
hi @lionelderkrikor
Thanks for your attention, but note that this dataset is only a sample.
Please look at the picture; I want to explain the different steps.
Step1: The 100 labeled samples are read in (with the Label column deleted) and normalized; clustering is then performed on the number of attributes specified with the Select by Weights operator.
Step2: Four evaluation criteria are applied to each feature, and the features are then ranked according to their importance.
Step3: After clustering finishes, a new column is added to the feature columns showing which cluster each sample belongs to. Then, with the Map operator, we can specify the correspondence between the cluster names and the priorities. (The priorities are the labels that were originally given to the samples.) After that, we can use a decision tree to model the output. (Many people tell me that a decision tree is not needed at this stage and that using it is wrong.)
Step4: The 100 samples are read in again with their labels and, with the help of the Apply Model operator, passed through the decision tree, so that the label column can be compared with the clustering results. The final accuracy is determined by the Performance operator.
My question is about Step3: is using a decision tree wrong? And if the connection is wrong, which operator should be used?
Hi @m_gholami1991,
I have a question: how do you establish the correspondence between the clustering results (cluster_0, cluster_1, cluster_2) and the label values (priority = One/Two/Three)?
To answer your question: a priori, I don't know whether using a Decision Tree is wrong. I recommend following the "classic methodology", that is, performing a Cross Validation with several models and selecting the best-performing one.
Regards,
Lionel
Hi again @m_gholami1991,
OK, after reading your process again, I understand the "philosophy" of your process and what you want to do (excuse me, but here in France it's late in the evening and I am less efficient...).
Indeed, you want to compare your clustering results to your labelled data, don't you? So there is no need for a Decision Tree.
You can take inspiration from this sample process:
However, I would try to establish "manually" the correspondence between the clustering results (cluster_0, cluster_1, cluster_2) and your labelled data (Priority = One/Two/Three) at the final step (if such a correspondence exists).
NB: For example, with your sample data, the correspondence is not obvious...
1. Labelled data :
2. Clustering results :
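One way to inspect "manually" how clusters and labels line up is a simple cross-tabulation. The Python sketch below is only an illustration outside RapidMiner; the label and cluster values are made-up toy data, and `contingency_table` is a hypothetical helper name.

```python
from collections import Counter

def contingency_table(labels, clusters):
    """Count how often each label value falls into each cluster."""
    counts = Counter(zip(labels, clusters))
    label_values = sorted(set(labels))
    cluster_values = sorted(set(clusters))
    return {lab: {clu: counts[(lab, clu)] for clu in cluster_values}
            for lab in label_values}

# Hypothetical toy data, one entry per sample.
labels = ["One", "One", "Two", "Two", "Three", "Three"]
clusters = ["cluster_0", "cluster_0", "cluster_1",
            "cluster_2", "cluster_2", "cluster_2"]

table = contingency_table(labels, clusters)
for lab, row in table.items():
    print(lab, row)
```

If each row of the table has most of its count in a single column, a correspondence between clusters and priorities exists; if the counts are spread out, as seems to be the case with the sample data, the clusters do not reflect the labels.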
I hope it helps,
Regards,
Lionel
hi @lionelderkrikor
Yes, you understood exactly what I mean.
I copied the code you provided and looked at the process. You marked the priority column in the first Read operator.
But I think it is better not to mark this column in the first Read operator, because the clustering algorithm may take it into account when clustering. For that reason, in Step 4, I read the dataset in again and selected this column there.
On the other hand, if you run my XML file, select this column in the first Read operator, and disable these operators (Step3: Set Role and Decision Tree | Step4: TrainData_WithLabel (Read operator), Normalize and Apply Model), an error appears at the Performance operator: "Input ExampleSet does not have a label".
hi
Is there anyone who can help me? I really need your help. My thesis presentation is very close. Please....
Hi @m_gholami1991,
Here is a working process with the Decision Tree model:
I hope it helps,
Regards,
Lionel
Hi @m_gholami1991,
And here is a simplified process without a Decision Tree:
As I said in a previous post, the data are simply clustered (after performing feature selection) and then
compared to the labeled data.
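For readers outside RapidMiner, the "cluster, then compare to the labels" idea can be sketched in plain Python. This is an assumption-laden illustration, not the process from this thread: it uses a bare-bones k-means with a deterministic farthest-point initialization, maps each cluster to its majority label (one possible mapping rule among several), and runs on made-up toy data. Normalization and feature selection are omitted.

```python
from collections import Counter, defaultdict
from math import dist

def nearest(centroids, row):
    """Index of the centroid closest to row."""
    return min(range(len(centroids)), key=lambda c: dist(centroids[c], row))

def init_centroids(rows, k):
    """Deterministic start: first row, then repeatedly the farthest row."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(rows, k, iterations=20):
    """Plain Lloyd iterations; returns one cluster index per row."""
    centroids = init_centroids(rows, k)
    for _ in range(iterations):
        assignment = [nearest(centroids, row) for row in rows]
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(centroids, row) for row in rows]

def majority_mapping(clusters, labels):
    """Map each cluster index to the most frequent label inside it."""
    votes = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        votes[cluster][label] += 1
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in votes.items()}

# Hypothetical toy data; the labels are NOT passed to the clustering.
rows = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8],
        [9.0, 1.0], [9.2, 1.2], [5.1, 4.9]]
labels = ["One", "One", "Two", "Two", "Three", "Three", "One"]

clusters = kmeans(rows, k=3)
mapping = majority_mapping(clusters, labels)
predicted = [mapping[c] for c in clusters]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print(mapping, round(accuracy, 3))
```

Note that an accuracy computed with a majority-vote mapping is optimistic by construction, since the mapping is chosen using the same labels it is scored against, and it tends to grow with the number of clusters.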
The process :
I hope it helps, too.
Regards,
Lionel