[SOLVED] Data ordering
Good afternoon,
I’m a new user of RapidMiner. I would like to build the following process with RapidMiner, and the data ordering is proving quite complicated for me.
I explain below what I would like to do:
1. The final goal of my RapidMiner process is to plot curves and correlation matrices between different attributes over a period of time.
2. I will explain the source of the data later. I would like to obtain an “example set” with the following structure:
Pressure Flow Temperature
dd/mm/yy hh:mm P1 F1 T1
dd/mm/yy hh:mm P2 F2 T2
dd/mm/yy hh:mm P3 F3 T3
…. …. … ….
(table 1)
From this table I would be able to plot curves, correlation matrices, etc.
3. The problem is that I don’t have the data stored in such a simple format.
4. I have the data stored in .txt files (>1000 rows per file). I can read them with the “Read CSV” operator.
5. But the data are stored as follows:
dd/mm/yy hh:01 Pressure P1
dd/mm/yy hh:02 Flow F1
dd/mm/yy hh:02 Pressure P1
dd/mm/yy hh:03 Temperature T1
dd/mm/yy hh:03 Temperature T1
…. …. ….
(table 2)
So, a very different layout from what I would like to have.
6. I need to build an “example set” by reordering the data (from table 2 to table 1). How can I do that?
7. Besides that, I would like to reduce the final number of examples by keeping only one example per 5 minutes instead of one per minute. Therefore I would need an example set whose examples are averages of several examples from another “example set”. Again, I didn’t find any operator to do that.
I hope I explained my problem clearly.
Thank you very much for your help.
Best regards! Rosana
Answers
Your data is in a so-called relational format, and you have to use the Pivot operator to convert it to a row-based format.
The Pivot operator groups the data by a single attribute (in your case it would be the timestamp) and creates one column for each value of the "index" attribute (in your data, the attribute-name column).
The Create Transactional Data operator creates data with a similar structure to your data. In the attached process, I pivoted it, so you can adapt the process to your data (the Id attribute corresponds to your timestamp).
Once you managed to pivot your data, please come back here and we will deal with your second problem (5 minutes averages).
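Outside RapidMiner, the same long-to-wide pivot can be sketched in plain Python (the timestamps and values below are made up for illustration, not taken from your files):

```python
from collections import defaultdict

# Long-format rows as in table 2: (timestamp, attribute name, value).
rows = [
    ("01/01/12 10:01", "Pressure", 2.1),
    ("01/01/12 10:02", "Flow", 0.8),
    ("01/01/12 10:02", "Pressure", 2.2),
    ("01/01/12 10:03", "Temperature", 21.5),
]

# Group by timestamp; one column per value of the "index" attribute
# (Pressure, Flow, Temperature), as the Pivot operator does.
pivoted = defaultdict(dict)
for ts, name, value in rows:
    pivoted[ts][name] = value

for ts in sorted(pivoted):
    print(ts, pivoted[ts])
```

Timestamps with no reading for some attribute simply have no entry for that column, which is what shows up as a missing value ("?") in the pivoted example set.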
Best, Marius
Thank you very much!!! I managed to pivot the data, so we can start with my second problem (5-minute averages).
Before that, I think I should also point out that I don't have a value for every attribute at every time step. For example, at time step 10:00:01 I may have the value of pressure but not the values for "flow" and "temperature". Therefore a "?" appears in the new pivoted table. I think this may be important when calculating the 5-minute averages?
Finally, I set the "date format" in the "Read CSV" operator to "dd/MM/yy hh:mm:ss" (because the measurements are taken on a seconds scale)... but only the day is displayed in the table: dd/MM/yy... why?
Again, thank you very much for your help!!
Best, Rosana
Instead, use Nominal to Date and select date_time so you get both date and time.
Best regards,
Wessel
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="390" width="359">
<operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="128" y="134">
<list key="attribute_values">
<parameter key="myDate" value=""23/12/2012 14:00:55""/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_date" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Date" width="90" x="239" y="133">
<parameter key="attribute_name" value="myDate"/>
<parameter key="date_type" value="date_time"/>
<parameter key="date_format" value="dd/MM/yyyy HH:mm:ss"/>
<parameter key="keep_old_attribute" value="true"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Date" to_port="example set input"/>
<connect from_op="Nominal to Date" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
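The same distinction shows up in any date-parsing library. As a sketch in Python (not RapidMiner itself), the 24-hour hour code is what lets the time survive parsing; note that Java-style patterns use HH for 24-hour and hh for 12-hour, while Python uses %H and %I:

```python
from datetime import datetime

# The same string as in the example process above.
s = "23/12/2012 14:00:55"

# %H is the 24-hour hour code; %I (12-hour) could not represent hour 14.
dt = datetime.strptime(s, "%d/%m/%Y %H:%M:%S")
print(dt)  # full date *and* time are preserved
```

If the format string covers only the date part, the time of day is silently dropped at parse time, which matches the symptom of seeing only dd/MM/yy in the table.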
I have improved my process with your recommendation!
Best regards, Rosana
To get the average of an interval you have to use the Aggregate operator. The problem, however, is that you first have to create an attribute which defines to which 5-minute interval a row belongs. We can do that with Generate Attributes and the function date_diff, which delivers the time between two dates in milliseconds. What I did in Generate Attributes was to calculate the time between 01-01-1970 and the current date in milliseconds, and then divide that value by the number of milliseconds per 5 minutes (5*60*1000). If you round that down with the floor() function, you get the number of 5-minute intervals since 1970 for the current date. Just have a look at the formula in the process.
In the Aggregate operator, you can group by that new attribute and select one or more aggregation attributes. In your case that would be the average of your sensor values. Additionally, you may want to select the minimum of the original date attribute to get the first date of the current 5-minute interval.
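The interval trick plus the aggregation can be sketched in plain Python (the timestamps and pressure values are made up; a missing reading is simply absent from its bucket, like a "?" cell):

```python
from collections import defaultdict
from datetime import datetime

# Made-up pivoted readings: (timestamp, pressure).
readings = [
    (datetime(2012, 12, 23, 14, 0, 55), 2.1),
    (datetime(2012, 12, 23, 14, 2, 10), 2.3),
    (datetime(2012, 12, 23, 14, 6, 5), 2.7),
]

FIVE_MIN_MS = 5 * 60 * 1000  # milliseconds per 5-minute interval

# Step 1 (Generate Attributes): milliseconds since 1970-01-01,
# floor-divided by the interval length, gives the interval number.
buckets = defaultdict(list)
for ts, pressure in readings:
    epoch_ms = int(ts.timestamp() * 1000)
    interval = epoch_ms // FIVE_MIN_MS
    buckets[interval].append((ts, pressure))

# Step 2 (Aggregate): group by interval, average the sensor value,
# and keep the minimum date as the interval's first timestamp.
averages = {}
for interval, group in sorted(buckets.items()):
    first_date = min(ts for ts, _ in group)
    averages[first_date] = sum(p for _, p in group) / len(group)
```

Here the 14:00:55 and 14:02:10 readings fall into the same 5-minute interval and get averaged, while 14:06:05 starts a new one.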
Happy Mining!
~Marius
It works!!
Best, Rosana