EXPORT Sparse Data
Hello....
I am rather new to RapidMiner, so my apologies if this question is too basic.
I am trying to do some text mining on a relatively large dataset (>100 MB) with RapidMiner, and I would like to export the resulting TF-IDF matrix (after applying a tokenizer, a stemmer, and stop-word removal). The problem is that when I use the "CSV export" or "ARFF export" operators, the file I get is very large (>5 GB), even though the data is very sparse.
I am not sure whether sparse data can be written to CSV, but WEKA writes sparse data in the ARFF file format, and RapidMiner can read sparse data.
My question is: is it possible to instruct RapidMiner to make use of the sparsity of the data when exporting it to a file?
Cheers
Answers
Of course this is possible (this is my default answer for all "is X possible" questions ;D )
The operator "Write Special Format" is your friend. For example, try the special format "$s[;][:]" if you want to separate the columns with ";" and each attribute index from its value with ":". The "$s" means "sparse format". You can find more information in the help text of the operator.
Here is a simple example process: Have fun!
Ingo
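To make the "$s[;][:]" idea above a bit more concrete, here is a minimal Python sketch of what an "index:value" sparse row looks like. It is only an illustration written outside RapidMiner, not the operator's actual implementation. The function name sparse_row is made up, and the 1-based column index is an assumption; check the operator's help text for the exact numbering it uses.

```python
def sparse_row(values, sep=";", index_sep=":", first_index=1):
    """Render one example as 'index:value' pairs, skipping zero entries.

    first_index=1 is an assumption for illustration; the real operator
    may number columns differently.
    """
    parts = []
    for i, v in enumerate(values, start=first_index):
        if v != 0:  # only non-zero attributes are written
            parts.append(f"{i}{index_sep}{v}")
    return sep.join(parts)


if __name__ == "__main__":
    # A TF-IDF row with five terms, only two of which occur in the document.
    print(sparse_row([0, 0.5, 0, 0, 1.25]))
```

Since only the non-zero entries are written, the file size grows with the number of terms that actually occur in each document rather than with the size of the whole dictionary.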
The solution you provided writes the data without the attribute names (there is an option $v[name], but I am not sure how to use it).
What should I replace "name" with? And if it is the name of an attribute (a column of the TF-IDF matrix), how do I fill in this field without knowing in advance what the attribute names (the terms in the dictionary) are, and how many of them there are?
I want to produce a sparse ARFF file that contains the attribute names (similar to the one produced by Weka). I had thought that I could connect the output of an ARFF file operator to the input of the Export Special operator, or the other way around (mimicking a Unix pipe), but that does not produce the required output format.
Any advice for a novice user would be much appreciated and would help me get going with RM.
Cheers
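For reference, the sparse ARFF layout asked about above (a header with one @attribute line per term, then data rows written as {index value, ...} with 0-based indices) can be sketched in a few lines of Python. This is a workaround outside RapidMiner, not a RapidMiner feature. The names to_sparse_arff and quote_name are hypothetical, and the sketch assumes every attribute is numeric and that the term list and the rows are already available in memory.

```python
import re


def quote_name(name):
    """Quote an attribute name unless it is a plain identifier-like token."""
    if re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        return name
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_sparse_arff(relation, attribute_names, rows):
    """Build a sparse ARFF document as a string.

    Assumes all attributes are numeric (e.g. TF-IDF weights).
    Sparse ARFF data rows use 0-based attribute indices.
    """
    lines = [f"@relation {quote_name(relation)}", ""]
    for name in attribute_names:
        lines.append(f"@attribute {quote_name(name)} numeric")
    lines += ["", "@data"]
    for row in rows:
        # Keep only the non-zero entries of each row.
        entries = [f"{i} {v}" for i, v in enumerate(row) if v != 0]
        lines.append("{" + ", ".join(entries) + "}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    terms = ["appl", "banana split"]
    matrix = [[0, 0.5], [0.25, 0], [0, 0]]
    print(to_sparse_arff("tfidf", terms, matrix), end="")
```

The attribute names are taken from whatever term list is passed in, so nothing has to be known in advance about which terms the dictionary contains or how many there are.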