"Prediction of Time Series Data"

Judy · January 2019

Hi all,

I am trying to use Auto Model to predict my price over time. However, when the results come out, I realise that it does not take the time series element of the data into account and does not give me predictions based on the dates I provide.

Can I check whether RapidMiner is able to take the time series element into account and provide predictions based on the dates I have provided?

Please advise!
Thanks!

David_A · January 2019

Hi,

For time series forecasting you can use the ARIMA or Holt-Winters forecasting operators when you have a univariate time series.
If you have several features per time stamp, you can also use the Windowing operator to transform your series data into a set of windows, on which you can then apply Auto Model or your own prediction algorithm.

Take a look at the example process below, taken from the Samples folder in RapidMiner.

Best,
David

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process" origin="GENERATED_SAMPLE">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Prices of Gas Station" origin="GENERATED_SAMPLE" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/Time Series/data sets/Prices of Gas Station"/>
      </operator>
      <operator activated="true" class="filter_example_range" compatibility="9.1.000" expanded="true" height="82" name="Filter Example Range" origin="GENERATED_SAMPLE" width="90" x="179" y="34">
        <parameter key="first_example" value="1"/>
        <parameter key="last_example" value="16"/>
        <parameter key="invert_filter" value="true"/>
      </operator>
      <operator activated="true" breakpoints="after" class="time_series:windowing" compatibility="9.1.000" expanded="true" height="82" name="Windowing" origin="GENERATED_SAMPLE" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="gas price / euro (times 1000)"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="has_indices" value="true"/>
        <parameter key="indices_attribute" value="date"/>
        <parameter key="window_size" value="48"/>
        <parameter key="no_overlapping_windows" value="false"/>
        <parameter key="step_size" value="24"/>
        <parameter key="create_horizon_(labels)" value="true"/>
        <parameter key="horizon_attribute" value="gas price / euro (times 1000)"/>
        <parameter key="horizon_size" value="1"/>
        <parameter key="horizon_offset" value="23"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation" origin="GENERATED_SAMPLE" width="90" x="782" y="34">
        <parameter key="split_on_batch_attribute" value="false"/>
        <parameter key="leave_one_out" value="false"/>
        <parameter key="number_of_folds" value="10"/>
        <parameter key="sampling_type" value="automatic"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="9.0.000" expanded="true" height="103" name="Gradient Boosted Trees" origin="GENERATED_SAMPLE" width="90" x="179" y="34">
            <parameter key="number_of_trees" value="100"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="maximal_depth" value="5"/>
            <parameter key="min_rows" value="10.0"/>
            <parameter key="min_split_improvement" value="0.0"/>
            <parameter key="number_of_bins" value="20"/>
            <parameter key="learning_rate" value="0.1"/>
            <parameter key="sample_rate" value="1.0"/>
            <parameter key="distribution" value="AUTO"/>
            <parameter key="early_stopping" value="false"/>
            <parameter key="stopping_rounds" value="1"/>
            <parameter key="stopping_metric" value="AUTO"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
          <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model" origin="GENERATED_SAMPLE" width="90" x="45" y="34">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="9.1.000" expanded="true" height="82" name="Performance" origin="GENERATED_SAMPLE" width="90" x="246" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="root_mean_squared_error" value="true"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="true"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="prediction_average" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
          <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="246" y="124">Type your comment</description>
        </process>
      </operator>
      <connect from_op="Retrieve Prices of Gas Station" from_port="output" to_op="Filter Example Range" to_port="example set input"/>
      <connect from_op="Filter Example Range" from_port="example set output" to_op="Windowing" to_port="example set"/>
      <connect from_op="Windowing" from_port="windowed example set" to_op="Cross Validation" to_port="example set"/>
      <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
      <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
      <connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
      <connect from_op="Cross Validation" from_port="performance 1" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
      <description align="center" color="blue" colored="true" height="166" resized="true" width="259" x="27" y="130">Retrieve the German gas prices data set from the Samples/Time Series folder.&lt;br&gt;&lt;br&gt;Remove the first 16 Examples, so that the remaining Examples start at 9:00 AM.</description>
      <description align="center" color="green" colored="true" height="427" resized="true" width="366" x="313" y="130">Perform a Windowing on the data set.&lt;br&gt;&lt;br&gt;The window size is set to 48, to include the prices of the previous 48 hours for each window.&lt;br&gt;&lt;br&gt;The step size is set to 24, so that we only look at windows which end at 8:00 AM.&lt;br&gt;&lt;br&gt;The horizon size is set to 1, because we want to forecast 1 price.&lt;br&gt;&lt;br&gt;The horizon offset is set to 23, so that the horizon is 23+1 hours after the window, hence the gas price of the next day at the same time.&lt;br&gt;&lt;br&gt;The resulting ExampleSet contains all we need to train any machine learning model on it: a label (the price of the next day, gas price / euro cents (times 1000) + 24 (horizon)), 48 Attributes containing the prices of the last 48 hours (gas price / euro cents (times 1000) - i), and a special attribute holding the last date in window, which is not used in the training.</description>
      <description align="center" color="yellow" colored="false" height="91" resized="true" width="230" x="703" y="198">Train a Gradient Boosted Tree on the ExampleSet created by the Windowing operator.</description>
    </process>
  </operator>
</process>
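
The Windowing settings in this sample (window size 48, step size 24, horizon size 1, horizon offset 23) can be checked with a few lines of Python. This is only a sketch and not RapidMiner's implementation: it assumes hourly data that starts at 9:00 AM, as the process annotation describes, the timestamps are made up, and the function name `window_layout` is hypothetical.

```python
from datetime import datetime, timedelta

WINDOW_SIZE = 48      # values per window (previous 48 hours)
STEP_SIZE = 24        # distance between the starts of two windows
HORIZON_OFFSET = 23   # values skipped between window end and horizon start
HORIZON_SIZE = 1      # number of values used as label


def window_layout(timestamps, window_size, step_size, horizon_offset, horizon_size):
    """Return (window_start, window_end, horizon_start) for each complete window."""
    rows = []
    needed = window_size + horizon_offset + horizon_size
    start = 0
    while start + needed <= len(timestamps):
        window_end = start + window_size - 1
        horizon_start = window_end + 1 + horizon_offset
        rows.append((timestamps[start], timestamps[window_end], timestamps[horizon_start]))
        start += step_size
    return rows


# Assumption: one value per hour, first value at 9:00 AM (made-up dates).
first = datetime(2019, 1, 1, 9, 0)
timestamps = [first + timedelta(hours=i) for i in range(24 * 7)]

layout = window_layout(timestamps, WINDOW_SIZE, STEP_SIZE, HORIZON_OFFSET, HORIZON_SIZE)
for window_start, window_end, label_time in layout[:3]:
    print(window_start, "->", window_end, "| label at", label_time)
```

Under these assumptions every window ends at 8:00 AM and its label is the value exactly 24 hours later, which matches the annotation in the process.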

Judy · January 2019

Hi @David_A

Thank you for explaining! Can I also check, for the Forecast Validation operator, what is meant by window size, step size and horizon size? Do they differ? With the default inputs I get a very large root mean squared error of a few billion. I could not work it out from the video either. Can anyone help?

Please advise. Thank you very much in advance!

Regards
Judy

Maerkli · January 2019

Hello Judy,

Please have a look at:

https://www.youtube.com/watch?v=ZDBPjvAlLvs

It is a RapidMiner training on time series, horizon, and ARIMA.

Good luck,

Maerkli

MarcoBarradas · January 2019

window size: how many examples are used when performing the data transformation. For example, if you have daily data and you want to transform it into weekly data, you would enter 7.
step size: how far ahead the forecast should predict. Do you want to use the information to predict one week ahead? Two weeks ahead? The model will try to understand how the information you input affects the outcome.
horizon size: how many predictions you want, only the next week or the next n weeks. The bigger this number, the greater the loss of performance.

Telcontar120 · January 2019

Actually @MarcoBarradas there are just a couple of small corrections here.
I believe step size in the current operator determines how far to increment when constructing each window. So for example, if you have the series {1,2,3,4,5,6,7,8,9} and you set the window size to 3 and the step size to 1, you would get: {1,2,3} then {2,3,4} then {3,4,5} and so on. But if you set the step size to 3 then you would get {1,2,3} then {4,5,6} and so on. So this determines how much your windows will overlap.
The horizon size is actually the definition you have given for step size, the number of periods in the future that is taken as the label for each window.
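
The step size example above can be reproduced with a short Python sketch. It only illustrates the idea and is not the operator's code; the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(series, window_size, step_size):
    """Return every complete window, moving the window start by step_size each time."""
    last_start = len(series) - window_size
    return [series[i:i + window_size] for i in range(0, last_start + 1, step_size)]


series = [1, 2, 3, 4, 5, 6, 7, 8, 9]

overlapping = sliding_windows(series, window_size=3, step_size=1)
disjoint = sliding_windows(series, window_size=3, step_size=3)

print(overlapping[:3])  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(disjoint)         # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

A step size smaller than the window size gives overlapping windows; a step size equal to the window size gives windows that do not overlap.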

MarcoBarradas · January 2019

@Telcontar120 my bad LOL, sorry for that. I tried to define everything from memory, but checking my latest process I can see I was terribly wrong.
But now I have a question: so forecast horizon from Apply Forecast and horizon size from Process Window are not equivalent? I tried to define it by what I understood while applying a forecast.
I have another question related to Process Window.
If I have daily sales data and I want to predict next year's sales by month or by week, what would be the ideal window size, since months do not have a standard length? I remember from my business classes that we take months as 30 days and years as 360 days, but I'm not sure whether this translates the same way when trying to predict sales per month.
What I did to avoid the confusion was to apply Date to Nominal with yyyy-MM, then aggregate, grouping by my new Year-Month column, and then analyze the result. But when I use Holt-Winters I am no longer able to use the has indices parameter.
Any suggestions?

tftemme · January 2019

Hi @MarcoBarradas,

Your definition of horizon size is correct. It is the number of predictions you want to make (so the number of values in the horizon). The step size is, as @Telcontar120 correctly stated, the increment between two windows. What you are asking for is called horizon offset. This is the number of values between the end of the window and the beginning of the horizon (so for horizon offset = 0 the horizon starts at the next value). For example, if you have several weeks of daily data, the training window runs from Monday to Sunday, and you want to predict the values for next week's Sunday and the Monday after it, you would have horizon offset = 6 (6 days between the end of the training window and the beginning of the horizon) and horizon size = 2 (2 days to predict).
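
As a rough illustration of the Monday-to-Sunday example, here is a small Python sketch that cuts one window and its horizon out of a series. It is not how the operator is implemented; the function name `split_window_and_horizon` and the day labels are made up for the example.

```python
def split_window_and_horizon(series, window_start, window_size, horizon_offset, horizon_size):
    """Return (window, horizon) where the horizon starts horizon_offset values after the window."""
    window_end = window_start + window_size        # index just past the window
    horizon_start = window_end + horizon_offset    # skip horizon_offset values
    horizon_end = horizon_start + horizon_size
    if horizon_end > len(series):
        raise ValueError("series too short for this window and horizon")
    return series[window_start:window_end], series[horizon_start:horizon_end]


# Three weeks of daily data, labelled by week and weekday (made-up labels).
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
series = [f"week{w} {d}" for w in (1, 2, 3) for d in days]

window, horizon = split_window_and_horizon(
    series, window_start=0, window_size=7, horizon_offset=6, horizon_size=2
)
print(window)   # week1 Mon ... week1 Sun
print(horizon)  # ['week2 Sun', 'week3 Mon']
```

With horizon offset = 0 and horizon size = 1, the same function would return the value directly after the window.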

To keep your date information, just check 'keep old attribute' in the Date to Nominal operator and also aggregate the old date attribute, using for example max as the aggregation function.
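
Outside RapidMiner, the same idea (group by a yyyy-MM key, but keep the latest original date of each group as the index) might look like the Python sketch below. The sales values are made up, summing the sales is an assumption about the aggregation you want, and the function name `aggregate_monthly` is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_monthly(daily):
    """daily: list of (date, sales). Returns (year_month, last_date, total_sales) per month."""
    groups = defaultdict(list)
    for day, sales in daily:
        groups[day.strftime("%Y-%m")].append((day, sales))  # same key as Date to Nominal with yyyy-MM

    result = []
    for year_month in sorted(groups):
        rows = groups[year_month]
        last_date = max(d for d, _ in rows)   # "max" keeps a real date to use as index
        total = sum(s for _, s in rows)       # assumption: monthly sales = sum of daily sales
        result.append((year_month, last_date, total))
    return result


# Made-up daily sales from 2018-01-01 to 2018-02-28, 10.0 per day.
start = date(2018, 1, 1)
daily = [(start + timedelta(days=i), 10.0) for i in range(59)]

monthly = aggregate_monthly(daily)
for row in monthly:
    print(row)
```

The `last_date` column plays the role of the aggregated old date attribute, so an index is still available after the aggregation.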

You could also perform a forecast on the daily data, forecasting 30 days and aggregating the forecasted results to monthly data. This may improve the performance, because the forecast model can include the more granular daily data.

Hope this helps,
Best regards,
Fabian

StefanBu · January 2019

Hi Fabian,

this is Stefan from Athens.
I am playing around with the example that is mentioned in this thread. I wonder if there is any operator I can add to this example that gives me some statistics and measures about the quality of the forecast.

Best, Stefan

MartinLiebig · January 2019

Hi Stefan,

have a look at the Operator "Forecast Validation" and its tutorial process. That should do the trick.

BR,

Martin

tftemme · January 2019

Hi Stefan,

Nice to see that you found your way into the community. As Martin already stated, you can use the Forecast Validation operator.

Best regards,
Fabian
