Testing/Training data separation
nawafpower
Member Posts: 34 Contributor II
Hi All,
Does anybody know how RapidMiner separates the data fed to it into training and testing sets? I fed in my data in folders and put a label on each folder path; now how does the data inside the folders get separated into training and testing? Any help is appreciated.
Answers
Usually there are several ways to use test and training data. First, if your data is already split into two different resources, you can create two repository entries for them and connect the corresponding Retrieve operators to a Model Applier operator. Second, and this is the most common case, let RapidMiner do the work for you: just use an operator like XValidation. This operator automatically splits your data into the subsets needed for cross-validation. In the Repositories view you will find a folder called Samples, which contains some sample processes that use XValidation for performance measurement. Simply open one of them and double-click on the XValidation operator to see how things are connected internally.
Greetings,
Helge
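For readers who prefer to see the idea in code, here is a minimal sketch of what a cross-validation does conceptually, written in Python with scikit-learn rather than in RapidMiner itself; the iris dataset and the decision tree are just placeholders for your own data and learner:

```python
# A minimal sketch of k-fold cross-validation: the data is split into k folds,
# and each fold serves once as the test set while the rest is used for training.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)      # stand-in for your own repository data
model = DecisionTreeClassifier()       # stand-in for your own learner

# 10-fold cross-validation: trains and evaluates the model ten times
scores = cross_val_score(model, X, y, cv=10)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```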
I appreciate any help.
Of course you can keep things a little simpler. Use the Split Data operator to split your data into test and training partitions, connect the training data output to a learner operator, and feed the test data into an Apply Model operator. Finally, connect the model output of your learner to the applier, and the applier's labeled-data output to one of the main result ports of your process. Now you should receive a data view of your test partition, enriched by a column called prediction that shows how your model classified each example. You can add a Performance operator on that last connection to see the confusion matrix for this particular job.

If you change the partition or sampling type parameters of the Split Data operator, you will get different learning results. This is why XValidation, with its cross-validation ability, is often used to get more reliable performance values.
Greetings,
Helge
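As a hedged illustration of the same Split Data -> learner -> Apply Model -> Performance wiring, here is a rough Python/scikit-learn equivalent; the dataset, the 70/30 split, and the classifier are placeholder choices, not anything prescribed by RapidMiner:

```python
# Conceptual equivalent of Split Data -> learner -> Apply Model -> Performance
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

# "Split Data": 70% training partition, 30% test partition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# learner: train on the training partition only
model = DecisionTreeClassifier().fit(X_train, y_train)

# "Apply Model": produces the prediction column for the test partition
predictions = model.predict(X_test)

# "Performance": confusion matrix comparing true labels vs. predictions
print(confusion_matrix(y_test, predictions))
```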
I did what you said and it works fine, but my problem remains: how can I know which specific file was misclassified? Say I have 30 files in folder A that are supposed to come out as class A; the model now splits the files and shows me the ratio I selected in the confusion matrix, but what about the testing files? Where are they, and how can I check their classification? Is there a way to show a detailed list of each file with its class? I know that's a lot of questions, and anyone who knows any answer is welcome to reply here, with my full appreciation.
Is it possible for RapidMiner to enter an infinite loop? I did what you told me about connecting the second output of the Performance operator (exa) to the (res) output, and I added a Split Data operator so the training data goes to the classifier and the testing data goes to the (unl) port of Apply Model. Up to this moment RapidMiner has been running for 2 days, 2 hours and 30 minutes (50 hours and 30 minutes), so I am worried: should I wait, and for how long, or just terminate? Please help. I appreciate all your help so far.
I can't say much about this. Runtimes of days, weeks or even months can easily occur if you are training a computationally complex model, say neural nets or SVMs, on huge data sets.
So, without having the process that shows me what you actually did, and without the data specs, I have no clue whether it's a good idea to wait or not.
In general it's a good idea to take a look at the status bar, where each operator is shown with its number of executions and its runtime. Is there just one single operator running for all four days? Or is it the 1,000,000th execution of the same operator?
Greetings,
Sebastian
The run that took 4 days of processing, which I canceled eventually, used a different setting: I manually set some folders for training and other folders for testing, fed these folders into two Process Documents from Files operators, and removed the Split Data operator.
I wonder why the SVM did so badly in this setting. Any feedback is appreciated.
And my issue remains: if I need to test a specific extra file, one that has never been in the data set, how can I test that file to see where it fits? This is my BIGGEST issue now.
Thanks
But now to your last question: what do you mean by testing an extra file? Does this file contain another data set? If that data set has the same structure, you can use the model you have already trained. There are operators that allow you to write a model to disk (Write Model & Read Model) and others to store it in your repository (Store & Retrieve). Simply build a process which loads the data, retrieves the model and applies the latter to the former. Connect the (lab) output of the applier and you should receive your prediction values.
Cheers,
Helge
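For illustration only, here is a small Python sketch of the same store-then-apply idea using joblib; it is an analogy to Write Model / Read Model, not RapidMiner's own mechanism, and the file name is made up:

```python
# Conceptual equivalent of Write Model / Read Model applied to new data
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier().fit(X, y)

# "Write Model": persist the trained model to disk
joblib.dump(model, "trained_model.joblib")

# ... later, in a separate process ...
# "Read Model": load the stored model and apply it to unseen data
loaded = joblib.load("trained_model.joblib")
new_example = X[:1]   # stand-in for the extra file's feature vector
print("prediction:", loaded.predict(new_example))
```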
Thanks for your reply, but I didn't quite catch it. I now have 10 folders containing texts by different authors; each folder should be classified to its author if the model is correct. I read these folders and supply the output, after preprocessing, to Split Data so I can create training and testing portions. I send the training output of Split Data to the classifier and the other output, as testing data, to the (unl) input of Apply Model; the output of the classifier goes to the (mod) input of Apply Model. Then I send the (lab) output of Apply Model to Performance and the (mod) output of Apply Model to the (res) output.
I have now added a Write Model operator to this setting: the classifier's (mod) output goes to the (inp) of Write Model, and the (thr) output of Write Model goes to the (mod) input of Apply Model. This writes the model somewhere I can decide.
Now, in detail, what did you mean about saving my data to the repository? And if I need to test a text file that is not in my data set, or more than one text file, to see who the author of this/these text(s) is? I already tried to read the model, input one text and apply the model; it is actually still running from 5 days ago. So where did I go wrong? Does Read Model get the model trained on my data set? Please help.
Regards and thanks for your time.
Yes, in fact I have been on vacation, and after coming back an overwhelming number of open threads welcomed me. Following my heuristic that a thread with 17 replies should already be solved, I didn't take a closer look here in my first sweep.
So, could you please summarize the problem and the current state? It would help me a lot if I don't have to read all the previous posts...
Greetings,
Sebastian
Your last post made things look a bit clearer. You want to classify text data by predicting the author who might have written it. Such an approach needs text mining techniques to work with unstructured data. Please make sure that you have installed the Text Processing Extension (via Help -> Update RapidMiner). Unfortunately, the analysis of unstructured data like texts is a more complex task. Therefore some preliminary steps are needed before you can start with things like learning models or validating results. Maybe it is a good idea to take a closer look at some useful introduction videos dealing with this topic. A video which shows how to classify texts dealing with different topics can be found here:
http://rapidminerresources.com/index.php?page=text-mining-3
In addition, Neil McGuigan produced a great series of five videos dealing with RapidMiner and text mining, which are available via his blog:
http://vancouverdata.blogspot.com/2010_11_01_archive.html
Greetings,
Helge
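To make those preliminary steps concrete, here is a rough Python sketch of the usual text-classification pipeline (vectorize the documents, then learn on the vectors); the documents, labels, and the choice of TF-IDF plus a linear SVM are placeholder assumptions, not the exact setup from the videos:

```python
# Rough analogue of the text-mining setup: documents must be turned into
# a word-vector representation before a model can learn from them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training documents, one author label per document
docs   = ["text by author A ...", "text by author B ...", "more text by A ..."]
labels = ["A", "B", "A"]

# "Process Documents": build the wordlist and turn each text into a vector
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs)

# learner: train a classifier on the word vectors
model = LinearSVC().fit(X_train, labels)

# apply to an unseen text, transformed with the same wordlist
print("predicted:", model.predict(vectorizer.transform(["unseen text ..."])))
```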
Then, when you are predicting authorship, you use the old wordlist instead of training a new one.
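A minimal sketch of that idea, again as a Python analogy rather than the RapidMiner operators themselves: the fitted vectorizer plays the role of the wordlist, stored once and only re-used (never re-fitted) at prediction time. All file names and texts below are made up:

```python
# Sketch of "use the old wordlist": persist the fitted vectorizer alongside
# the model, then only transform (never re-fit) new texts with it.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# training stage: fit wordlist and model once, then store both
docs   = ["text by author A ...", "text by author B ..."]   # placeholder texts
labels = ["A", "B"]
vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(docs), labels)
joblib.dump(vectorizer, "wordlist.joblib")   # the "old wordlist"
joblib.dump(model, "model.joblib")

# prediction stage (a separate process): load both, transform only
vectorizer = joblib.load("wordlist.joblib")
model = joblib.load("model.joblib")
print(model.predict(vectorizer.transform(["text of unknown authorship ..."])))
```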
Your suggestion is to save the word list, but how, and at what stage? Do I have to insert a block like Write Model? And when I use this word list in the next stage, will that be without Read Model, or what? Sorry for my many questions.
Any update regarding my last post? I haven't heard a response from Neil or anyone. Is there anything that needs to be clarified in my post? Let me know; I appreciate any help.
Hi, I read this whole topic to see if I could find an answer to my problem, but I have to describe my problem anyway; I hope you can provide me the answer.
First, about nawafpower's problem, I think you can find your answer in http://www.youtube.com/watch?v=9I0BcMuhPe8
I'm also doing text classification. I have stored a model and a wordlist, and I used them to classify new documents. For example, I have 1000 test documents; I can see what the model predicts for them, but I cannot save the results in a file to see what is predicted for a certain document.
I want to save a file in which one entry is the name of the document and another entry is the predicted label.
I really appreciate your time.
Please help me.
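In RapidMiner itself, connecting the (lab) output of Apply Model to a Write CSV operator should produce such a file. As a minimal sketch of what that file would contain, here is a short Python example; the document names and predictions are placeholders standing in for the real model output:

```python
# Sketch: write one row per test document with its name and predicted label.
import csv

# assumption: two parallel lists, file names and the model's predictions
doc_names   = ["doc_001.txt", "doc_002.txt"]   # placeholder names
predictions = ["authorA", "authorB"]           # placeholder predictions

with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["document", "predicted_label"])
    writer.writerows(zip(doc_names, predictions))
```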