
Workflow to predict complex dataset - Best Practices

wirtcal Member Posts: 16 Maven
edited November 2018 in Help

Hello community,


What are the best practices for exploring a complex, unfamiliar dataset and accurately predicting a numeric value? By "complex" I mean that the dataset contains more than 100 columns, including integer attributes, real-valued attributes, and at least 10 polynominal columns.


>>> I have created a repository and loaded the trainning_data and test_data, setting the correct data type for each column (integer, real, polynominal, and label).

>>> I am using the Sample operator to reduce the amount of data to process and save time while modeling. Which other techniques can make me more productive when dealing with large datasets that take a long time to process?

>>> Then I started trying the learners and realized that I don't know which one is the most applicable. The polynominal attributes make this especially difficult. When I tried a polynominal-to-binominal conversion, the process ran out of memory.

>>> Since converting the polynominal attributes to binominal runs out of memory, I have split the data (using Select Attributes) so that one part goes to learners that work with polynominal attributes and the rest goes to a different learner, which is definitely not the correct way!
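
To see why the binominal conversion can exhaust memory, a rough size estimate helps. The sketch below is only an illustration: the row count, the distinct-value counts, and the 8 bytes per cell (dense storage) are all assumptions, since the thread gives none of these figures, and `dummy_coded_cells` is a hypothetical helper name.

```python
# Rough size estimate for dummy coding, where every distinct value of a
# nominal attribute becomes its own binary column.
# All numbers below are hypothetical; the thread does not state them.

def dummy_coded_cells(n_rows, distinct_values_per_attribute):
    """Return the number of cells created by dummy coding the given attributes."""
    new_columns = sum(distinct_values_per_attribute)
    return n_rows * new_columns

# Hypothetical dataset: 500,000 rows and 10 nominal attributes.
distinct_counts = [12, 40, 150, 200, 300, 450, 600, 800, 900, 1200]
cells = dummy_coded_cells(500_000, distinct_counts)

BYTES_PER_CELL = 8  # assumption: dense storage, one double per cell
print(f"new columns: {sum(distinct_counts)}")
print(f"approx. memory: {cells * BYTES_PER_CELL / 1e9:.1f} GB")
```

The estimate grows with the product of rows and distinct values, so either sampling rows or dropping (or re-encoding) the attributes with the most distinct values reduces it quickly.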


My *dream* plan is:

--> Load database

------> Set variables type

---------->Run some kind of correlation matrix (but there are also polynominal fields) and weighting

---------------> Select the most relevant and important attributes for learning

------------------> Use sample operator to increase performance when modeling

---------------------> Include a Validation Operator

------------------------->Use performance operator to improve parameters. 
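
The sampling step in the plan above can be sketched outside RapidMiner as follows. This is a minimal illustration in plain Python, not the Sample operator itself; `sample_rows`, the seed, and the toy rows are hypothetical. Fixing the seed keeps the subset identical between runs, so changes in model performance come from the model and not from a different sample.

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Return a reproducible random subset of rows, drawn without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    rng = random.Random(seed)  # local generator, leaves global state untouched
    k = min(len(rows), max(1, round(len(rows) * fraction)))
    return rng.sample(rows, k)

# Toy data standing in for the real example set.
rows = [{"id": i, "value": i * 0.5} for i in range(1000)]
subset = sample_rows(rows, 0.1)
print(len(subset))
```

Once the process design is stable, the final model should still be trained and validated on the full data, because a sample can hide rare values of the nominal attributes.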



  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Your dream plan looks good to me. 


    Have a look at the Weight by operators. Weight by Gini Index and Weight by Information Gain in particular might be helpful for your polynominal attributes.



    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • wirtcal Member Posts: 16 Maven

    Thanks Martin!


    Good to know that my plan is OK!


    I have checked out the Weight By __ operators that you suggested, but neither of them can handle a numeric label. In my tests, only Weight by Relief seemed able to weight numerical and polynominal attributes with a numeric label.

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    You can also use Correlation for numerical attributes and Gini Index for polynominal attributes. You can use Select Attributes with the value_type option to split the data into numerical and polynominal attributes.
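
The idea of weighting the two attribute types separately can be sketched in plain Python. Two caveats apply. First, this is not the code behind any RapidMiner operator. Second, for the nominal attributes the sketch swaps in the correlation ratio (eta squared) instead of the Gini index, because eta squared is defined for a numeric label; that substitution is an assumption of this example, not something proposed in the thread. The function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def pearson_weight(xs, ys):
    """Absolute Pearson correlation between a numeric attribute and the label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return abs(sxy) / math.sqrt(sxx * syy)

def correlation_ratio(categories, ys):
    """Eta squared: share of the label variance explained by per-category means."""
    my = sum(ys) / len(ys)
    groups = defaultdict(list)
    for c, y in zip(categories, ys):
        groups[c].append(y)
    between = sum(len(g) * (sum(g) / len(g) - my) ** 2 for g in groups.values())
    total = sum((y - my) ** 2 for y in ys)
    return between / total if total else 0.0

# Toy example set: one numeric attribute, two nominal attributes, numeric label.
label = [10.0, 12.0, 20.0, 22.0, 30.0, 32.0]
size = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
color = ["a", "a", "b", "b", "c", "c"]
noise = ["x", "y", "x", "y", "x", "y"]

weights = {
    "size": pearson_weight(size, label),
    "color": correlation_ratio(color, label),
    "noise": correlation_ratio(noise, label),
}
for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {w:.3f}")
```

Eta squared rises with the number of distinct values (an attribute that is unique per row scores 1.0), so weights for attributes with hundreds of categories should be read with caution or computed on held-out data.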



    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • wirtcal Member Posts: 16 Maven

    I have followed the workflow I planned plus your suggestions, yet my predictions are still far from "acceptable" (judging by r2).


    The last thing I tried was:

    0) Sample the data


    1) Split data in nominal and numeric attributes

    - for numerical --> Weight by Correlation -> Filter Numerical Attributes by Weight

    - for nominal --> Select 2 (of 8*) attributes --> Convert to binominal -> Convert to numerical -> Weight by Relief -> Filter Nominal Attributes by Weight


    2) Join the "most relevant" attributes in a new table

     - I have manually tried different setups to define the "most relevant" attributes based on performance tests, and I also tried different weighting operators


    3) Connect this new table to the Forward Selection operator

     - Inside it I'm splitting my data into 70% for modeling/learning and 30% for performance testing


    4) Change parameters and test different regression operators.


    5) Get bad predictions =( 
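
The 70/30 split and the r2 check from the steps above can be sketched as follows. The sketch assumes that "r2" means the coefficient of determination, 1 - SS_res / SS_tot; if the performance criterion in use is a squared correlation instead, the numbers will differ. `split_train_test`, `r_squared`, and the toy data are hypothetical.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=1):
    """Shuffle a copy of rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("r_squared is undefined for a constant label")
    return 1.0 - ss_res / ss_tot

# Toy data: the label depends linearly on x.
rows = [(float(x), 3.0 * x + 1.0) for x in range(100)]
train, test = split_train_test(rows)

# Baseline "model": always predict the mean label of the training part.
train_mean = sum(y for _, y in train) / len(train)
actual = [y for _, y in test]
baseline = r_squared(actual, [train_mean] * len(test))
print(f"train={len(train)} test={len(test)} baseline r2={baseline:.3f}")
```

A mean-only baseline scores at most 0 on the test part, so it gives a floor for comparison: a learner whose r2 is not clearly above it has not picked up any signal from the selected attributes.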


    * I selected the ones with fewer than 200 distinct values. There are 6 other polynominal attributes that I don't know how to use to predict a numeric label. They have hundreds of distinct values, and converting them to binominal demands memory and processing power that I don't have =/


    How could I take advantage of these 6 extra nominal fields to predict a number, given the memory limitation? What improvements or changes should I make to the process? Should I start from scratch (again)?
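
One memory-friendly option for nominal attributes with hundreds of distinct values is mean (target) encoding: each category is replaced by a smoothed average of the label, so the attribute stays a single numeric column instead of hundreds of binary ones. This technique is not proposed anywhere in the thread and is offered only as a sketch; the function names, the smoothing value, and the toy data are hypothetical.

```python
from collections import defaultdict

def fit_mean_encoding(categories, labels, smoothing=10.0):
    """Learn one number per category: the label mean, shrunk toward the global mean."""
    global_mean = sum(labels) / len(labels)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    # Rare categories are pulled strongly toward the global mean.
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean

def apply_mean_encoding(categories, encoding, global_mean):
    """Replace each category by its learned value; unseen categories get the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

# Fit on training data only, then apply the same mapping to the test data.
train_city = ["rome", "rome", "oslo", "oslo", "oslo", "lima"]
train_label = [10.0, 14.0, 30.0, 34.0, 32.0, 50.0]
encoding, global_mean = fit_mean_encoding(train_city, train_label, smoothing=1.0)

test_city = ["oslo", "lima", "cairo"]  # "cairo" never appears in training
print(apply_mean_encoding(test_city, encoding, global_mean))
```

The mapping must be learned from the training part only. Fitting it on all rows leaks the label into the attribute and makes validation results look better than they are.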




  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist



    Have you tried Gradient Boosted Trees as the learner? They are pretty nice because you do not need Nominal to Numerical to do regression.



    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • wirtcal Member Posts: 16 Maven

    @mschmitz wrote:


    have you tried the gradient boosted trees as learners? (...)

    Not really...


    Until now I have been modeling with RapidMiner 5.4 (my favorite) and 6.X (which often crashes on OSX), both of which have been installed on my computer for years.


    I am downloading the latest version of RapidMiner Studio right now to check it out. It looks like Gradient Boosted Trees was released in 7.x, right? I hope it (or another new learner) helps me get "acceptable" predictions.


    Thank you Martin!



  • IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder



    Yes, Gradient Boosted Trees have been added in RapidMiner 7.2 along with some other nice new learners (incl. Deep Learning, a new Logistic Regression, and Generalized Linear Models).  They all delivered very good results in the projects we have used them for.



