"Value Series"

labrat · August 2009

Hi all,

I am working on a project using RM, and I have a question about multiple value series. Let me explain.

Currently I am trying to predict whether a string of amino acids is an Epitope (E) or not (N). The strings are all fixed at length 20, and there are several prediction models with varying accuracies. Each model gives out a string of scores from a sliding window across the protein.

EG
Sequence ABCDEF
Model1 1,2,3,4,5,6,7
Model2 3,5,2,4,5,6,9
Model3 6,2,2,3,1,7,6
label E

and I was wondering how I would go about combining these scoring models to put into an SVM.

EG:

ID,[MODEL1],[MODEL2],[MODEL3],label.
ID,[1,2,3,4,5,6,7],[3,5,2,4,5,6,9],[6,2,2,3,1,7,6],E

I hope that makes sense.

Stuart

PS:

The SVM I will be using is the bog-standard Xval-SVM from the wizard.

land · August 2009

Hi Stuart,
if I understand you correctly, you want to learn on the predictions of three different learning algorithms? If so, you could use the MetaLearning operator Stacking, where you put the SVM as the first operator, followed by an OperatorChain containing the three learning schemes that produce your current models. This should already do the trick.

Greetings,
Sebastian

labrat · August 2009

Hi Sebastian,

and thanks for the reply,

It basically boils down to this....

Several physico-chemical properties have been correlated with a string of amino acids being an Epitope (E). To calculate this, a sliding window of size 7 is used to scan a protein and generate scores. If the score is above an arbitrary threshold, it is determined to be an Epitope.

Because I'm using several scoring indexes (what I have called models), I need to be able to load this data into an SVM and tell it where the scores have come from, be it model 1 or model 2.

What I currently have is this...
<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="all7.dat">
<id
name = "id"
sourcecol = "1"
valuetype = "integer"/>

<attribute
name = "ant1"
sourcecol = "2"
valuetype = "real"
blocktype = "value_series_start"/>

<attribute
name = "ant2"
sourcecol = "3"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant3"
sourcecol = "4"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant4"
sourcecol = "5"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant5"
sourcecol = "6"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant6"
sourcecol = "7"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant7"
sourcecol = "8"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant8"
sourcecol = "9"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant9"
sourcecol = "10"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant10"
sourcecol = "11"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant11"
sourcecol = "12"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant12"
sourcecol = "13"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant13"
sourcecol = "14"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "ant14"
sourcecol = "15"
valuetype = "real"
blocktype = "value_series_end"/>

<attribute
name = "asa1"
sourcecol = "16"
valuetype = "real"
blocktype = "value_series_start"/>

<attribute
name = "asa2"
sourcecol = "17"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa3"
sourcecol = "18"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa4"
sourcecol = "19"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa5"
sourcecol = "20"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa6"
sourcecol = "21"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa7"
sourcecol = "22"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa8"
sourcecol = "23"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa9"
sourcecol = "24"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa10"
sourcecol = "25"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa11"
sourcecol = "26"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa12"
sourcecol = "27"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa13"
sourcecol = "28"
valuetype = "real"
blocktype = "value_series"/>

<attribute
name = "asa14"
sourcecol = "29"
valuetype = "real"
blocktype = "value_series_end"/>

.....<SNIP>....

<attribute
name = "paris"
sourcecol = "72"
valuetype = "integer"
blocktype = "value_series"/>

<label
name = "class"
sourcecol = "73"
valuetype = "nominal">
<value>E</value>
<value>N</value>
</label>

</attributeset>
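
Since the attribute blocks above repeat the same pattern, a file like this could also be generated rather than typed by hand. The following is only a rough Python sketch under stated assumptions: `build_attributeset` is a made-up helper name, the series prefixes and the window count of 14 are taken from the listing above, and the snipped series and the extra "paris" column are not reproduced.

```python
import xml.etree.ElementTree as ET

def build_attributeset(prefixes, windows=14, source="all7.dat"):
    """Hypothetical helper: one value series of `windows` columns per prefix."""
    root = ET.Element("attributeset", default_source=source)
    ET.SubElement(root, "id", name="id", sourcecol="1", valuetype="integer")
    col = 2  # column 1 is the id, so the first series starts at column 2
    for prefix in prefixes:
        for w in range(1, windows + 1):
            if w == 1:
                block = "value_series_start"
            elif w == windows:
                block = "value_series_end"
            else:
                block = "value_series"
            ET.SubElement(root, "attribute", name=f"{prefix}{w}",
                          sourcecol=str(col), valuetype="real", blocktype=block)
            col += 1
    # The label takes the column right after the last series (an assumption)
    label = ET.SubElement(root, "label", name="class",
                          sourcecol=str(col), valuetype="nominal")
    for value in ("E", "N"):
        ET.SubElement(label, "value").text = value
    return root

root = build_attributeset(["ant", "asa"])
print(ET.tostring(root, encoding="unicode"))
```

With the two prefixes shown, the generated columns line up with the hand-written ones: ant1 to ant14 in columns 2 to 15 and asa1 to asa14 in columns 16 to 29.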

land · August 2009

Hi,
sorry, I don't see any problems there. Why can't you just load the data file (which you apparently already have in RapidMiner) and then apply an SVM? Am I missing anything?

Greetings,
Sebastian

labrat · August 2009

I thought that would do the trick, but I notice that marking something as a series within the data attributes screen still gives me the same results as telling the SVM to look at all the data points equally.

It is very strange.

land · August 2009

Hi Stuart,
I'm sorry, but I can't follow you. I just don't know what's going on and what's going wrong. Everything I have seen so far seems to fit exactly the setup you would need. Perhaps you could paste your process? (Please use a code environment, available through the icon with the sharp!) Of course, you could send us both the process and the data and we would take care of getting a proper result, but I think we would have to treat this as consulting.

Greetings,
Sebastian

labrat · August 2009

Hi Sebastian,

ok now I've had a little time to compose my thoughts (was on a bit of a deadline there),

This is what I wanted to do:

I was attempting to use an SVM to help classify whether a short peptide (in this case a string of 20 letters) could be an Epitope (E) or not (N).

Previously there have been several scoring methods that have about 54-55% prediction accuracy, and I was attempting to use an SVM to better these.

What these scoring methods do is assign a score over a window of 7 letters, and this 7-letter window slides along the short peptide (20 letters) to yield 14 single scores.

EG:
SEQ- APTQPPPAGTGDRLLNLVQG
label - E
scoring window index - scoring window sequence - score
1 - APTQPPP - 0.132
2 - PTQPPPA - 0.132
3 - TQPPPAG - -0.165
....
13 - RLLNLVQ - 1
14 - LLNLVQG - 1
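
To make the windowing concrete, here is a minimal Python sketch of how a 7-letter window sliding over a 20-letter peptide gives 14 windows (20 - 7 + 1 = 14). The function name `sliding_windows` is made up for illustration, and the scoring itself is left out because it depends on the method used.

```python
def sliding_windows(peptide, width=7):
    """Return 1-based (index, window) pairs for every full-width window."""
    return [(i + 1, peptide[i:i + width])
            for i in range(len(peptide) - width + 1)]

windows = sliding_windows("APTQPPPAGTGDRLLNLVQG")
print(len(windows))   # 14
print(windows[0])     # (1, 'APTQPPP')
print(windows[-1])    # (14, 'LLNLVQG')
```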

Putting a single scoring method into an SVM is no problem, as I just dump it in as usual. However, say I start combining these scores to give the SVM more vectors to classify with, i.e. more scoring methods.

So I have an Excel sheet set out like this:

ID,s1-1,s1-2,s1-3,......s1-13,s1-14,s2-1,s2-2,s2-3,....,s2-13,s2-14,sn-1,sn-2,........sn-13,sn-14,Label

where:

s denotes score
s1 denotes scoring method number 1
s1-12 denotes scoring method number 1 for window 12

n is currently about 7
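
As a rough sketch of that layout, the row for one peptide could be built in Python like this. The helper name `flatten_scores` and the score values are placeholders invented for the example, not real data.

```python
def flatten_scores(example_id, scores_by_method, label):
    """Build one row: ID, then s<method>-<window> columns, then the label."""
    header = ["ID"]
    row = [example_id]
    for m, scores in enumerate(scores_by_method, start=1):
        for w, score in enumerate(scores, start=1):
            header.append(f"s{m}-{w}")
            row.append(score)
    header.append("Label")
    row.append(label)
    return header, row

# Two made-up scoring methods with 14 window scores each (placeholder values)
method1 = [0.1 * w for w in range(14)]
method2 = [1.0 - 0.05 * w for w in range(14)]
header, row = flatten_scores(1, [method1, method2], "E")
print(header[:3], header[-2:])
```

Every method adds 14 columns, so with n methods each row has 14 * n score columns plus the ID and the label.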

So my original question was: how would I tell RapidMiner, and thus the SVM, that s1-1 and s2-1 are the same but slightly different? Or, to put it more clearly, how would I set out the data within RapidMiner to tell RM these are separate scoring methods on the same data?

Now I can get an accuracy of about 60% just using consensus scoring without an SVM, and the best I can get from the SVM is about 56%. This is why I think I am going wrong here. Thinking back now, I probably should have looked at a Neural Net, but oh well, that's what Masters projects are all about: the journey and not the result.

Thanks again.

Stuart

Altair RapidMiner Community