
Correlate data against criteria

adamc79 Member Posts: 1 Learner III
edited November 2018 in Help

Hi guys - I am a newbie and I am looking for some advice on a specific analysis I would like to perform.

 

I have a pile of fluid sample data, which has been graded by a laboratory as 'Normal', 'Abnormal' or 'Caution'. I would like to correlate the remaining data against those grades, with a view to understanding the reason for each grading.

 

So, I have given up trying to use text analysis as an input to the correlation matrix - that would have been too good!

 

I have achieved some results by making three columns of zeros and ones. For example, a 'Caution' column where all caution rows are populated with a one and all other rows with a zero, and similar columns for 'Abnormal' and 'Normal'.
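The indicator-column approach described above can be sketched roughly as follows. This is only an illustration in plain Python, not the RapidMiner process used in the thread; the attribute names (`iron_ppm`, `viscosity`) and all values are made up.

```python
import math

# Hypothetical lab samples: (iron_ppm, viscosity, grade). All values are invented.
samples = [
    (12.0, 46.1, "Normal"),
    (15.0, 45.8, "Normal"),
    (48.0, 44.9, "Caution"),
    (55.0, 45.2, "Caution"),
    (91.0, 39.5, "Abnormal"),
    (87.0, 40.3, "Abnormal"),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

iron = [s[0] for s in samples]
correlations = {}
for grade in ("Normal", "Caution", "Abnormal"):
    # One zero/one indicator column per grade, as described in the post.
    indicator = [1.0 if s[2] == grade else 0.0 for s in samples]
    correlations[grade] = pearson(iron, indicator)

for grade, r in correlations.items():
    print(f"iron_ppm vs {grade}: {r:+.2f}")
```

With this invented data the 'Caution' indicator correlates only weakly with `iron_ppm`, because its values sit in the middle of the range. That is one reason a linear correlation against zero/one columns may not tell the whole story.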

 

While the above has yielded an interesting result, I am certain I could be doing this in a better way?

 

Any assistance appreciated, thanks

Best Answer

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Solution Accepted

    Dear Adam,

     

    You can clearly use text analytics here. The trick is to prepare the texts in the right way. You can use the text extension to create a bag of words. The bag of words creates a lot of attributes that "count" (or weight by TF-IDF) how often a word occurs in your documents.
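As a rough illustration of what a bag of words looks like, here is a minimal Python sketch using plain term counts only. It is not the RapidMiner text extension, and the example documents are invented.

```python
from collections import Counter

# Hypothetical free-text lab comments; the wording is invented for illustration.
documents = [
    "high iron wear debris",
    "normal wear",
    "water contamination high",
]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# One attribute (column) per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, holding how often each vocabulary word occurs in it.
bag_of_words = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)
for row in bag_of_words:
    print(row)
```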

     

    Afterwards you can use various feature selection or weighting methods to figure out how important each word is. This also works for your other data. The key operators are Weight by XXX, where XXX is the desired criterion. Since you have a nominal label, it is most likely better to use Gini Index or Information Gain than Correlation.
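To make the idea of weighting attributes against a nominal label more concrete, here is a small Python sketch of information gain. It is not the Weight by Information Gain operator itself: splitting each numeric attribute at its median is a simplifying assumption, and the attribute names and values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from one binary split of a numeric attribute.

    Assumption: the split point is the upper median of the attribute.
    """
    threshold = sorted(values)[len(values) // 2]
    low = [lab for v, lab in zip(values, labels) if v < threshold]
    high = [lab for v, lab in zip(values, labels) if v >= threshold]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder

# Hypothetical data: one informative attribute and one that is pure noise.
labels = ["Normal", "Normal", "Caution", "Caution", "Abnormal", "Abnormal"]
attributes = {
    "iron_ppm": [12.0, 15.0, 48.0, 55.0, 91.0, 87.0],
    "noise": [3.0, 9.0, 4.0, 8.0, 5.0, 7.0],
}

weights = {name: information_gain(vals, labels) for name, vals in attributes.items()}
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name}: {weight:.3f}")
```

Unlike a correlation, this weight does not need the label to be turned into zero/one columns first; it works on the three grades directly.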

     

    Another, closely related approach would be to use feature selection methods such as forward selection. The principal idea is to build an algorithm that predicts Normal, Abnormal and Caution, and then to try out (or ask the algorithm) which attributes help it predict well.
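The forward selection idea can be sketched in Python as below. This is only an illustration under stated assumptions: a 1-nearest-neighbour classifier scored by leave-one-out accuracy stands in for "the algorithm", the attributes are not rescaled, and all names and values are hypothetical.

```python
def loo_accuracy(rows, labels, columns):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `columns`."""
    correct = 0
    for i, row in enumerate(rows):
        # Nearest other row by squared Euclidean distance on the chosen columns.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in columns),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, names):
    """Greedily add the attribute that improves accuracy most; stop when none does."""
    selected, best_score = [], 0.0
    remaining = list(names)
    while remaining:
        score, name = max(
            (loo_accuracy(rows, labels, selected + [n]), n) for n in remaining
        )
        if score <= best_score:
            break
        selected.append(name)
        remaining.remove(name)
        best_score = score
    return selected, best_score

# Hypothetical samples: one informative attribute and one noise attribute.
labels = ["Normal", "Normal", "Caution", "Caution", "Abnormal", "Abnormal"]
rows = [
    {"iron_ppm": 12.0, "noise": 3.0},
    {"iron_ppm": 15.0, "noise": 9.0},
    {"iron_ppm": 48.0, "noise": 4.0},
    {"iron_ppm": 55.0, "noise": 8.0},
    {"iron_ppm": 91.0, "noise": 5.0},
    {"iron_ppm": 87.0, "noise": 7.0},
]

selected, score = forward_selection(rows, labels, ["iron_ppm", "noise"])
print(selected, score)
```

On real data the attributes would usually be normalized first, and the accuracy would be estimated on more than a handful of rows.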

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany