How to model a large data set with different "components"?

dan_ferraro24 · June 2015

Greetings Community,

I was hesitant to post this because I thought it would surely have been covered already. But so far I haven't seen my specific question addressed in the existing videos and posts, even though some similar subject matter has been discussed.

I am modeling sports data: baseball and football, for daily fantasy sports purposes. The overall direction is this: I want to build a baseline projection model (time series/pattern recognition), and then add some form of regression to that baseline projection to account for game-specific matchup variables.

My question is less about the types of models or the theory and more basic: how can I create models for individual players from a dataset that includes multiple players, in the easiest way possible?

I want to build both models (projection and regression) based on player-specific data. I don't want to create models for "all third basemen" or "all running backs". I want to create them for individual players. However, I don't want to save individual data files for each player. That process, while likely not too hard with some engineering, seems like a waste of time. There has to be a better way.

I have large data sets in which all the variables and historical data are tied to individual players for individual games, connected and organized. A row would read something like: game-specific date; player ID; team ID; points scored; then all the stats and situational variables related to that game. Each player has their own line for a specific game/date.

How would someone with more experience suggest I set up my process, or leverage certain models, so that I get player-specific results from a single run through a larger data set?

From my research I have a hunch that a macro and loop setup could be used to limit my overall data to a player-specific set of examples based on the macro list. But is there a better, more streamlined way?

Last note: my question (again) is less about using specific operators. I have used single-player data sets with success, following the instructions for time series, regression, and SVMs (THANKS THOMAS OTT). Now I need the best way to move from single-player data sets to larger data sets. I will need to update these daily or weekly, hence my quest for simplicity if possible.

Thanks (and sorry if I am in the wrong place with a bad question)
-Dan
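
A minimal sketch of the "single run through a larger data set" idea, assuming the one-line-per-player-per-game layout described above. It groups rows by player ID in one pass and uses a trailing average as a stand-in for the baseline projection. The row values, the `baseline_projections` name, and the `window` parameter are all hypothetical, and a trailing average is only a placeholder for whatever time series model is actually used.

```python
from collections import defaultdict, deque

# Hypothetical rows: (game_date, player_id, team_id, points_scored).
# All values are made up for illustration.
rows = [
    ("2015-06-01", "p1", "T1", 10.0),
    ("2015-06-01", "p2", "T2", 4.0),
    ("2015-06-02", "p1", "T1", 14.0),
    ("2015-06-02", "p2", "T2", 6.0),
    ("2015-06-03", "p1", "T1", 12.0),
]

def baseline_projections(rows, window=3):
    """Return {player_id: mean of that player's last `window` point totals}."""
    # Each player gets a bounded queue that keeps only the most recent games.
    recent = defaultdict(lambda: deque(maxlen=window))
    # ISO-formatted dates sort correctly as strings, so this is time order.
    for _date, player_id, _team, points in sorted(rows):
        recent[player_id].append(points)
    return {pid: sum(pts) / len(pts) for pid, pts in recent.items()}

projections = baseline_projections(rows)
print(projections)
```

No per-player files are needed here: the full table is read once and the player ID acts as the grouping key.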

MartinLiebig · June 2015

Hi,

I am not completely sure I understood you, but I think you may have misunderstood the concept of predictive analytics. In predictive analytics you usually generate a model that represents the general underlying rules, such as "Old players with an injury in the last three months underperform."

To do this you need a dataset like this:

PlayerId | Age | TouchDownsLastThreePlayDays | Preferred System | ...

and, most importantly, a performance/cost/value you want to predict. Then you take this general rule (e.g. an SVM model) and apply it to your specific data: a person who is X years old, likes to play system B, and had 2 touchdowns in the last three play days. Then you get a prediction of 1 [a.u.] for it.

Can your idea fit into this schema?

Cheers
Martin
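
The pooled approach Martin describes (learn one general rule from all players, then apply it to a specific player's attributes) might be sketched as follows. This is an illustration only: ordinary least squares on a single feature is swapped in for the SVM he mentions, and the data values and the `fit_line` and `predict` names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x over all examples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Apply the fitted general rule to one player's feature value."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical pooled data from many players:
# touchdowns in the last three play days -> performance value to predict.
touchdowns = [0, 1, 2, 3]
performance = [1.0, 3.0, 5.0, 7.0]

general_model = fit_line(touchdowns, performance)

# Apply the general rule to one specific player with 2 recent touchdowns.
prediction = predict(general_model, 2)
print(prediction)
```

The model is trained once on everyone, and the player only enters through the feature values at prediction time.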

JEdward · June 2015

Actually, if I understand correctly, what Dan is trying to do is predict how an individual player will perform in their next game based on how that player has performed previously.

Therefore, taking a dataset like:
PlayerId|Age|TouchDownsLastThreePlayDays|Preferred System|...|GameID|GameDate|GameRuns|GamePass|GameWasMoM

You can use Loop Values to loop over the individual PlayerID values and generate a model from each individual player's data.
I would also suggest looking into ways of combining data from the other players to fill in attributes for players without a huge amount of data. (For example, Player X is currently performing well and has stats similar to Player Z in the 1983 season; if the same trend holds, Player X should burn out before the end of the season, so should be sold from the fantasy league within 2 weeks while the price is at its peak.)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="6.4.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="165">
        <parameter key="text" value="PlayerId|Age|TouchDownsLastThreePlayDays|Preferred System|...|GameID|GameDate|GameRuns|GamePass|GameWasMoM&#10;1|24|1|Running with the oval ball|Dancing|12|24-01-2015|9|12|TRUE&#10;1|24|1|Running with the oval ball|Dancing|14|26-03-2015|1|2|FALSE&#10;2|29|7|Keynesian|Meditation|12|24-01-2015|3|5|FALSE&#10;"/>
      </operator>
      <operator activated="true" class="text:write_document" compatibility="6.4.001" expanded="true" height="76" name="Write Document" width="90" x="179" y="165"/>
      <operator activated="true" class="read_csv" compatibility="6.4.000" expanded="true" height="60" name="Read CSV" width="90" x="246" y="255">
        <parameter key="column_separators" value="|"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="PlayerID.true.nominal.id"/>
          <parameter key="1" value="Age.true.nominal.attribute"/>
          <parameter key="2" value="TouchDownsLastThreePlayDays.true.nominal.attribute"/>
          <parameter key="3" value="Preferred System.true.nominal.attribute"/>
          <parameter key="4" value="\.\.\..true.nominal.attribute"/>
          <parameter key="5" value="GameID.true.nominal.attribute"/>
          <parameter key="6" value="GameDate.true.nominal.attribute"/>
          <parameter key="7" value="GameRuns.true.nominal.attribute"/>
          <parameter key="8" value="GamePass.true.nominal.attribute"/>
          <parameter key="9" value="GameWasMoM.true.nominal.label"/>
        </list>
      </operator>
      <operator activated="true" class="loop_values" compatibility="6.4.000" expanded="true" height="76" name="Loop Values" width="90" x="313" y="120">
        <parameter key="attribute" value="PlayerID"/>
        <process expanded="true">
          <operator activated="true" class="filter_examples" compatibility="6.4.000" expanded="true" height="94" name="Filter Examples" width="90" x="112" y="30">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="PlayerID.equals.%{loop_value}"/>
            </list>
          </operator>
          <operator activated="true" class="parallel_decision_tree" compatibility="6.4.000" expanded="true" height="76" name="Decision Tree" width="90" x="246" y="120"/>
          <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="out 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Write Document" to_port="document"/>
      <connect from_op="Write Document" from_port="file" to_op="Read CSV" to_port="file"/>
      <connect from_op="Read CSV" from_port="output" to_op="Loop Values" to_port="example set"/>
      <connect from_op="Loop Values" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
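
For readers outside RapidMiner, the Loop Values / Filter Examples pattern in the process above can be mirrored roughly as below, together with one simple reading of the "combine data from other players" suggestion. This is a sketch under stated assumptions: a per-player mean of GameRuns stands in for the Decision Tree, the pooled-mean fallback and the `min_games` threshold are hypothetical choices, and the three rows reuse the PlayerId and GameRuns values from the sample document in the process.

```python
from collections import defaultdict

# (PlayerId, GameRuns) pairs taken from the sample rows in the process above.
games = [("1", 9), ("1", 1), ("2", 3)]

def per_player_models(games, min_games=2):
    """Build one 'model' per player; here a model is just the mean of GameRuns."""
    # Equivalent of Loop Values + Filter Examples: split rows by player ID.
    by_player = defaultdict(list)
    for player_id, runs in games:
        by_player[player_id].append(runs)

    # Pooled estimate over all players, used when a player has too little data.
    pooled_mean = sum(runs for _pid, runs in games) / len(games)

    models = {}
    for player_id, runs in by_player.items():
        if len(runs) >= min_games:
            models[player_id] = sum(runs) / len(runs)
        else:
            # Hypothetical fallback: borrow information from the other players.
            models[player_id] = pooled_mean
    return models

models = per_player_models(games)
print(models)
```

Player 1 has two games and gets a player-specific value, while player 2 has only one game and falls back to the pooled value.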
