Is it possible to get 100% prediction?
Joannach0ng
Member Posts: 7 Learner I
in Help
Hi everyone, I was told by my tutor to get a 100% accuracy prediction with my split validation, so I was wondering if that is even possible. I have tried split ratios from 0 to 1 but could not reach 100%. Can adding some operator get me to 100%? Thank you!
Comments
I would be very concerned if I got 100% accuracy on almost any real data set. The first thing I would do is try to figure out where I made a mistake. Randomness can also distort your perception of a model: even if you do hit 100% once, it could still be by chance.
Getting higher accuracy is not bad, and yes, you might get 100 percent. What most of us are trying to tell you is that you should be extra cautious and investigate why your model is 100 percent accurate. Remember the saying "nothing's perfect": accuracies like these are rare in real-world scenarios, which is why most of us are surprised to see them. There can be multiple reasons for this.
1. No difference between training and testing datasets: If your training and testing datasets are the same, you might get 100 percent accuracy. This is because the model has already seen the data and is making predictions on exactly the examples it was trained on.
2. Highly correlated column: This is one case I regularly see with highly accurate models. Your dataset might contain a feature/attribute that correlates very strongly with the target variable (label). The issue is that some complex models easily pick up on this and simply use that one column to make their predictions.
3. Confounding relations: Some datasets contain confounding (hidden) relationships between samples, which can also lead to very high accuracy. I encountered this myself: I ran cross-validation on a dataset with multiple samples per subject, so some of a subject's samples landed in training and some in testing. The model learned to recognize the subjects rather than the task and gave 99.99 percent accuracy. Something felt odd, so I retested with leave-one-subject-out cross-validation, and the model performed only just above chance. This matters because the earlier high-accuracy results were misleading.
4. Type of validation: It is very important to select the right validation scheme for your model. There are many good practices, but one common method is to split the data into a 90:10 or 80:20 ratio, apply k-fold cross-validation on the 90 percent, and then apply the resulting model to the remaining 10 percent hold-out data. You can then verify your performance from both the cross-validation and the 10 percent hold-out dataset.
5. Finally, I recommend you try Auto Model in RapidMiner, which will help you compare different models and also provides good validation out of the box.
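To illustrate point 1 above with a toy sketch (plain Python, made-up data): a model that simply memorizes its training rows scores 100% when "tested" on those same rows, which tells you nothing about how it handles new data.

```python
def train_memorizer(rows):
    """'Train' by memorizing (features -> label) pairs."""
    return {tuple(x): y for x, y in rows}

def accuracy(model, rows):
    hits = sum(1 for x, y in rows if model.get(tuple(x)) == y)
    return hits / len(rows)

train = [((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "yes")]
test_same = train                               # same rows as training
test_new = [((0, 0), "no"), ((1, 0), "no")]     # unseen / conflicting rows

model = train_memorizer(train)
print(accuracy(model, test_same))  # 1.0 -- perfect, but meaningless
print(accuracy(model, test_new))   # 0.0 on rows it never memorized
```

A real learner is less extreme than this lookup table, but the principle is the same: testing on the training data rewards memorization, not generalization.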
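For point 2, a quick sanity check you can run outside RapidMiner is to compute each feature's correlation with the label (plain Python, hypothetical data; the "leaked_id" column is an invented example of a column that secretly encodes the answer):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

label = [0, 1, 0, 1, 1, 0]
features = {
    "score":     [2.1, 3.9, 3.8, 2.2, 3.4, 2.5],  # a genuine, noisy feature
    "leaked_id": [0, 1, 0, 1, 1, 0],              # identical to the label!
}

for name, col in features.items():
    r = pearson(col, label)
    flag = "  <-- suspicious, investigate!" if abs(r) > 0.95 else ""
    print(f"{name}: r = {r:.2f}{flag}")
```

A correlation with |r| near 1 does not prove leakage, but it is exactly the kind of column worth investigating before trusting a near-perfect model.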
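Point 3 can be reproduced on synthetic data. The sketch below (scikit-learn, assumed installed; the data is invented) gives each "subject" several near-identical samples with a randomly assigned label, so there is nothing real to learn. Plain k-fold still scores near 100% because the model recognizes subjects across the train/test split, while leave-one-subject-out drops to roughly chance:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, samples_per_subject = 20, 5

# Each subject has a fixed "signature" vector plus tiny noise; the label
# is assigned at random per subject, so it is genuinely unlearnable.
signatures = rng.normal(size=(n_subjects, 4))
labels_per_subject = rng.integers(0, 2, size=n_subjects)

X, y, groups = [], [], []
for s in range(n_subjects):
    for _ in range(samples_per_subject):
        X.append(signatures[s] + rng.normal(scale=0.01, size=4))
        y.append(labels_per_subject[s])
        groups.append(s)
X, y, groups = np.array(X), np.array(y), np.array(groups)

clf = KNeighborsClassifier(n_neighbors=1)
naive = cross_val_score(clf, X, y,
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))
grouped = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())

print("naive k-fold accuracy: ", naive.mean())    # typically near 1.0 -- misleading
print("leave-one-subject-out:", grouped.mean())   # typically near chance
```

The same idea applies in RapidMiner: make sure all samples from one subject end up on the same side of every split.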
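The split-then-cross-validate recipe in point 4 looks like this as a sketch (scikit-learn, assumed installed, using one of its bundled demo datasets): hold out 10% first, run k-fold cross-validation on the remaining 90%, then confirm on the untouched hold-out set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 90:10 split -- the 10% hold-out is never touched during model building.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(clf, X_dev, y_dev, cv=10)  # k-fold on the 90%
clf.fit(X_dev, y_dev)
holdout = clf.score(X_hold, y_hold)                    # final sanity check

print(f"10-fold CV accuracy: {cv_scores.mean():.3f}")
print(f"hold-out accuracy:   {holdout:.3f}")
```

If the cross-validation score and the hold-out score roughly agree, you can be more confident the estimate is honest; a large gap is another red flag.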
I also see that you asked about ways to improve accuracy. There are different ways you can try.
1. Model selection: Identify the model that best suits your data. You can do this with the help of Auto Model in RapidMiner, by trial and error with different models, or by visualizing your raw data and identifying patterns to decide whether linear or nonlinear models are more appropriate.
2. Feature selection: Not all features in your dataset may be useful for prediction. To pick the features/attributes that are actually relevant, you can apply different feature selection techniques such as forward selection, backward elimination, automatic feature engineering, etc.
3. Hyperparameter tuning: This is important; most models have several hyperparameter settings. For example, a random forest has the number of trees, pruning settings, etc. It is important to test different hyperparameter settings and see whether performance improves or degrades. You can do this with the Optimize Parameters operator to find the best model parameters for your dataset.
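For point 1, here is a rough sketch of the trial-and-error approach (scikit-learn, assumed installed; Auto Model does something similar from the GUI): score a few candidate model families on the same cross-validation and compare.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "naive bayes":         GaussianNB(),
}

# Same 5-fold CV for every candidate, so the comparison is fair.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```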
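The forward selection mentioned in point 2 can be sketched like this (scikit-learn, assumed installed): start from zero features and greedily add whichever feature improves cross-validated accuracy most.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Greedy forward selection: grow the feature set until 2 features remain.
selector = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward")
selector.fit(X, y)

X_reduced = selector.transform(X)
print("selected feature mask:", selector.get_support())
print("reduced shape:", X_reduced.shape)  # (150, 2): 2 of 4 features kept
```

Backward elimination is the same call with `direction="backward"`; in RapidMiner the corresponding operators are Forward Selection and Backward Elimination.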
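And for point 3, a grid search is the code-level analogue of the Optimize Parameters operator (scikit-learn, assumed installed; the parameter values below are just an illustrative grid, not a recommendation):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

param_grid = {
    "n_estimators": [10, 50],   # number of trees
    "max_depth": [2, None],     # tree depth (a pruning-like control)
}

# Try every combination, each evaluated with 5-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

Be aware that the best CV score found this way is slightly optimistic, which is one more reason to keep a hold-out set as described earlier in the thread.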
These are some points that came to my mind. There can be many other aspects as well.
Hope this helps.
PS: One suggestion about posting on the community: please don't post multiple questions on the same topic; you can continue the discussion in a single thread. This recommendation is based on the thread you already created, linked below. This answer covers both threads.
https://community.rapidminer.com/discussion/55923/is-it-possible-to-get-100-for-split-validation-accuracy#latest
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing