subsetting in execute R process
Hi, some basic subsetting of a data frame does not seem to work when executing R in Rapidminer.
I ran this R code in Rstudio and it yields the correct dimensions after subsetting:
cat('dimension of training x:', dim(as.data.frame(dat[train, x])), '\n')
cat('dimension of training y:', dim(as.data.frame(dat[train, y])), '\n')
# dimension of training x: 138 60
# dimension of training y: 138 1
Then I pasted the same R code into an Execute R operator in a Rapidminer process, and it yields:
Jul 7, 2016 4:57:43 PM INFO: dimension of training x: 62 1
Jul 7, 2016 4:57:43 PM INFO: dimension of training y: 62 1
I printed many other things to debug, and everything else is the same in both cases (Rstudio vs Rapidminer). See the complete output below.
Sonar data (208 rows * 61 columns) is used in both cases.
--------
Complete R code:
library(mlbench)
rm_main = function(dat, in_rapidminer = T){
  cat('Starting R script now ...\n')
  # find columns of x (attribute) and y (response) ####
  if(in_rapidminer){
    meta = melt(metaData)
    meta$L1 = NULL
    names(meta) = c('value', 'variable', 'column')
    meta = dcast(meta, formula = column ~ variable)
    print(meta)
    y_name = meta[meta$role %in% 'label', 'column']
    x_name = meta[meta$role %in% 'attribute', 'column']
    y = names(dat) %in% y_name
    x = names(dat) %in% x_name
  } else {
    # in R manually specify it:
    y_name = 'Class'
    y = names(dat) %in% y_name
    x = ! names(dat) %in% c(y_name, 'pred_prob', 'pred')
  }
  cat('y column:', which(y), '\n')
  cat('x column(s):', which(x), '\n')
  cat('dimension of data:', dim(dat), '\n')
  # encode y (only work for binary) ####
  f1 = paste0('~', y_name, '- 1')
  dat[[y_name]] = model.matrix(as.formula(f1), data = dat)[ , 1]
  # ####
  n_row = nrow(dat)
  n_fold = 3
  set.seed(123)
  group = (seq_len(n_row) - 1) %% n_fold + 1
  group = sample(group) # random permutation
  print(table(group))
  # n_fold CV ####
  for(ii in seq_len(n_fold)){
    cat('CV round', ii, '\n')
    train = group != ii
    cat('dimension of data:', dim(dat), '\n')
    cat('how many rows in training set:', sum(train), '\n')
    cat('dimension of training x:', dim(as.data.frame(dat[train, x])), '\n')
    cat('dimension of training y:', dim(as.data.frame(dat[train, y])), '\n')
  }
  return(1)
}
data(Sonar)
dat = Sonar
rm_main(dat, in_rapidminer = F)
Complete output:
Starting R script now ...
y column: 61
x column(s): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
dimension of data: 208 61
group
1 2 3
70 69 69
CV round 1
dimension of data: 208 61
how many rows in training set: 138
dimension of training x: 138 60
dimension of training y: 138 1
CV round 2
dimension of data: 208 61
how many rows in training set: 139
dimension of training x: 139 60
dimension of training y: 139 1
CV round 3
dimension of data: 208 61
how many rows in training set: 139
dimension of training x: 139 60
dimension of training y: 139 1
Complete Rapidminer code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
<context>
<input>
<location>//_your_path_/Sonar</location>
</input>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="112" y="120"/>
<operator activated="true" class="r_scripting:execute_r" compatibility="6.4.000" expanded="true" height="76" name="CV" width="90" x="313" y="165">
<parameter key="script" value="rm_main = function(dat, in_rapidminer = T){ cat('Starting R script now ...\n') # find columns of x (attribute) and y (response) #### if(in_rapidminer){ meta = melt(metaData) meta$L1 = NULL names(meta) = c('value', 'variable', 'column') meta = dcast(meta, formula = column ~ variable) print(meta) y_name = meta[meta$role %in% 'label', 'column'] x_name = meta[meta$role %in% 'attribute', 'column'] y = names(dat) %in% y_name x = names(dat) %in% x_name } else { # in R manually specify it: y_name = 'Class' y = names(dat) %in% y_name x = ! names(dat) %in% c(y_name, 'pred_prob', 'pred') } cat('y column:', which(y), '\n') cat('x column(s):', which(x), '\n') cat('dimension of data:', dim(dat), '\n') # encode y (only work for binary) #### f1 = paste0('~', y_name, '- 1') dat[[y_name]] = model.matrix(as.formula(f1), data = dat)[ , 1] # #### n_row = nrow(dat) n_fold = 3 set.seed(123) group = (seq_len(n_row) - 1) %% n_fold + 1 group = sample(group) # random permutation print(table(group)) # n_fold CV #### for(ii in seq_len(n_fold)){ cat('CV round', ii, '\n') train = group != ii cat('dimension of data:', dim(dat), '\n') cat('how many rows in training set:', sum(train), '\n') cat('dimension of training x:', dim(as.data.frame(dat[train, x])), '\n') cat('dimension of training y:', dim(as.data.frame(dat[train, y])), '\n') } return(1) }"/>
</operator>
<connect from_port="input 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="result 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="CV" to_port="input 1"/>
<connect from_op="CV" from_port="output 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="90"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="yellow" colored="false" height="58" resized="true" width="214" x="250" y="243">This R code works with binary (two-class) response only</description>
</process>
</operator>
</process>
Complete log in Rapidminer:
Jul 7, 2016 5:15:25 PM INFO: Starting R script now ...
Jul 7, 2016 5:15:25 PM INFO: column role type
Jul 7, 2016 5:15:25 PM INFO: 1 attribute_1 attribute real
Jul 7, 2016 5:15:25 PM INFO: 2 attribute_10 attribute real
Jul 7, 2016 5:15:25 PM INFO: 3 attribute_11 attribute real
<I omit some lines here for clarity>
Jul 7, 2016 5:15:25 PM INFO: 58 attribute_7 attribute real
Jul 7, 2016 5:15:25 PM INFO: 59 attribute_8 attribute real
Jul 7, 2016 5:15:25 PM INFO: 60 attribute_9 attribute real
Jul 7, 2016 5:15:25 PM INFO: 61 class label nominal
Jul 7, 2016 5:15:25 PM INFO: y column: 61
Jul 7, 2016 5:15:25 PM INFO: x column(s): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61
Jul 7, 2016 5:15:25 PM INFO: group
Jul 7, 2016 5:15:25 PM INFO: 1 2 3
Jul 7, 2016 5:15:25 PM INFO: 70 69 69
Jul 7, 2016 5:15:25 PM INFO: CV round 1
Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61
Jul 7, 2016 5:15:25 PM INFO: how many rows in training set: 138
Jul 7, 2016 5:15:25 PM INFO: dimension of training x: 61 1
Jul 7, 2016 5:15:25 PM INFO: dimension of training y: 61 1
Jul 7, 2016 5:15:25 PM INFO: CV round 2
Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61
Jul 7, 2016 5:15:25 PM INFO: how many rows in training set: 139
Jul 7, 2016 5:15:25 PM INFO: dimension of training x: 61 1
Jul 7, 2016 5:15:25 PM INFO: dimension of training y: 61 1
Jul 7, 2016 5:15:25 PM INFO: CV round 3
Jul 7, 2016 5:15:25 PM INFO: dimension of data: 208 61
Jul 7, 2016 5:15:25 PM INFO: how many rows in training set: 139
Jul 7, 2016 5:15:25 PM INFO: dimension of training x: 61 1
Jul 7, 2016 5:15:25 PM INFO: dimension of training y: 61 1
Jul 7, 2016 5:15:25 PM INFO: Saving results.
Jul 7, 2016 5:15:25 PM INFO: Process
Any help's appreciated. Thanks-
Best Answer
awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
If you add a line like this near the beginning of the code, it seems to fix the issue.
dat <- as.data.frame(dat)
It has something to do with the type of the dat variable. In pure R, it's a data.frame. In RapidMiner it's also a data.table, and data.table's `[` does not treat the column argument the way a plain data.frame does.
Andrew
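For later readers, here is a minimal sketch of where that conversion line would sit. It is not the original script: the toy data, its column names and the fixed train vector are made up for illustration, and the only assumption about RapidMiner is the one reported in this thread, namely that dat arrives as a data.table.

```r
# Toy stand-in for the RapidMiner input; the column names are made up.
toy = data.frame(a1 = c(0.1, 0.4, 0.7, 0.2),
                 a2 = c(1.5, 2.5, 3.5, 4.5),
                 class = c('M', 'R', 'M', 'R'))

rm_main = function(dat) {
  # Convert first: inside RapidMiner, dat is reported to arrive as a data.table.
  dat = as.data.frame(dat)
  x = names(dat) != 'class'            # logical selector for the attribute columns
  train = c(TRUE, TRUE, FALSE, TRUE)   # logical selector for the training rows
  # drop = FALSE keeps a data.frame even when only one column is selected
  train_x = dat[train, x, drop = FALSE]
  cat('dimension of training x:', dim(train_x), '\n')
  return(train_x)
}

res = rm_main(toy)
```

After the conversion, dat[train, x] follows base R rules again, so a logical x selects columns. Using drop = FALSE is an optional alternative to wrapping the result in as.data.frame().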
1
Answers
<I deleted this reply>
Hello
What is the variable metaData in the R script? If you do str(metaData), what do you get?
regards
Andrew
Thanks Andrew. metaData is a list of lists of lists.
(In case you're not familiar with metaData, it holds the role (label/id/attribute/...) and the datatype of each column. For example, you can use the `set role` operator in Rapidminer to control the role, or set the role in the import data wizard.)
The official doc of the `execute R` operator talks about handling metadata: <http://docs.rapidminer.com/studio/operators/utility/scripting/execute_r.html>
If you go to the HELP of `execute R` in Rapidminer, there are links to the examples in the doc.
It's easier to see after flattening and reorganization:
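The poster's flattened output is not preserved here. As a rough illustration only, the sketch below builds a hypothetical metaData with the nesting described above (input name, then column name, then role and type) and flattens it with base R instead of melt/dcast. The exact structure RapidMiner supplies is an assumption, and the column names are invented.

```r
# Hypothetical stand-in for metaData: input name -> column name -> list(role, type).
metaData = list(data = list(
  attribute_1 = list(role = 'attribute', type = 'real'),
  attribute_2 = list(role = 'attribute', type = 'real'),
  class       = list(role = 'label',     type = 'nominal')
))

cols = metaData[[1]]   # metadata of the first input
meta = data.frame(
  column = names(cols),
  role   = vapply(cols, function(m) m$role, character(1)),
  type   = vapply(cols, function(m) m$type, character(1)),
  row.names = NULL, stringsAsFactors = FALSE
)
print(meta)

# Column names by role, as the script in the question derives them
y_name = meta$column[meta$role == 'label']
x_name = meta$column[meta$role == 'attribute']
```

This gives one row per column with its role and type, which is the same shape as the column/role/type table printed in the Rapidminer log above.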
Hello
I know what roles are but I wasn't aware of the specific addition of an R variable that describes it.
Andrew
Hi Andrew,
Your solution fixed the problem. I also reproduced the mistake by doing dat = data.table(dat) in Rstudio.
Thank you so much!
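A quick way to check which kind of object you are holding is sketched below, assuming the data.table package is installed (the block skips itself otherwise). The toy data frame is made up.

```r
df = data.frame(a = 1:3, b = c('u', 'v', 'w'))

if (requireNamespace('data.table', quietly = TRUE)) {
  dt = data.table::as.data.table(df)
  print(class(dt))                    # "data.table" "data.frame"
  print(inherits(dt, 'data.table'))   # TRUE
  back = as.data.frame(dt)            # back to a plain data.frame
  print(class(back))                  # "data.frame"
}
```

Because a data.table also inherits from data.frame, is.data.frame() returns TRUE for both, so inherits(dat, 'data.table') is the more telling check.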
Just commenting so that future users are aware:
The Rapidminer Execute R operator passes data.table objects, not plain data.frames.
So in the function rm_main(data1, data2, data3), data1, data2 and data3 are all data.tables, not data.frames.
Your best bet is to convert each table to a data.frame if you are more comfortable with data frames.
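With several inputs, the conversion can be done in one place. This is only a sketch: as_plain_frames is a hypothetical helper name, and the two-input rm_main and its toy arguments are invented for illustration.

```r
# Hypothetical helper: convert every input to a plain data.frame,
# whatever class it arrived with.
as_plain_frames = function(...) lapply(list(...), as.data.frame)

rm_main = function(data1, data2) {
  frames = as_plain_frames(data1 = data1, data2 = data2)
  # From here on, base R indexing rules apply to frames$data1 and frames$data2.
  vapply(frames, nrow, integer(1))
}

rows = rm_main(data.frame(a = 1:2), data.frame(b = 1:5))
print(rows)
```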