ignore '?' value
olandesino
Member Posts: 19 Maven
Hi,
I would like to ask whether there is a preprocessing filter that ignores individual fields (not an entire attribute) holding a particular value such as '?'.
That way RM would accept the file as input (without missing values), but I could still ignore this placeholder value during my process.
For example:
3 45 sss
? ? qqq
? ? rrr
In this case I want the tool to take all 3 attributes but without the '?' values (otherwise they will compromise my data!).
Thank you in advance for your feedback!
A.
Answers
I must admit that I did not get the point. Could you provide an example process where this would make sense, and describe in detail how the values should be ignored without being ignored? I think I missed something crucial here.
Cheers,
Ingo
From the file you can see that I have 3 attributes. The attributes "id" and "verdict" appear only every n rows (at the start of a sequence of length n, where n varies), while the minimal_event attribute appears in every row.
1. If I replace the blank fields (which are not allowed in RM) with '?', can I then tell RM to ignore this sign? Otherwise, will it compromise my data?
2. After this, I know how to extract sequences from this file, but how can I associate the verdict value with the related sequence?
Example of the desired structure:
<Sequence1>
id
minimal_event1
minimal_event2
|
|
minimal_eventn
verdict (related)
<end sequence1>
...and so on.
I know that RM doesn't like sequences of different lengths.
Is there a smart solution for this?
I hope my 2 problems are clearer now.
Thanks
A.
[attachment deleted by admin]
This format is, well, a real pain. Any chance that the data source is able to deliver a slightly improved data set? For example, it would be much easier if the sequence events were not divided by new lines. Alternatively (additionally?), it would also be much easier if you did not have plain whitespace as the separation character between the columns. As you can easily see, this will always lead to problems (how to identify the columns?). A semicolon, for example, would be much easier.
Having said this, I currently see only a single option (at least only a single easy one): write your own example source operator (it should not be too difficult).
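A minimal sketch of that kind of reformatting, written in Python rather than as a RapidMiner operator. It assumes the three whitespace-separated columns are id, verdict and minimal_event in that order (as in the small example in the first post) and that '?' marks a blank field. The function names are hypothetical, and the real file layout may differ.

```python
MISSING = "?"  # placeholder assumed to stand for a blank field


def group_sequences(lines):
    """Group rows into one record per sequence.

    Assumption: a row with a real id starts a new sequence; rows whose
    id is '?' belong to the sequence started most recently.
    """
    sequences = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        seq_id, verdict, event = parts
        if seq_id != MISSING:
            sequences.append({
                "id": seq_id,
                "verdict": None if verdict == MISSING else verdict,
                "events": [event],
            })
        elif sequences:
            sequences[-1]["events"].append(event)
            # keep a verdict that shows up on a later row of the sequence
            if verdict != MISSING:
                sequences[-1]["verdict"] = verdict
    return sequences


def to_semicolon_line(seq):
    """Write one sequence per line: id;verdict;event1,event2,..."""
    return ";".join([seq["id"], seq["verdict"] or MISSING, ",".join(seq["events"])])


rows = ["3 45 sss", "? ? qqq", "? ? rrr", "4 46 aaa"]
for seq in group_sequences(rows):
    print(to_semicolon_line(seq))
```

Each sequence ends up on one line with its id and verdict attached, so the '?' placeholders never reach the tool as data values.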
Cheers,
Ingo
Well, I can put all the operations related to one sequence on one line. But
I don't know how RM would interpret that kind of input format, since it treats
every column as an attribute... Could you explain this in more detail?
With a script I converted it into ARFF format (every row is a sequence), but the problem remains: sequences of different lengths are something the tool cannot handle.
Thanks for your time.
A.
you could use the Split operator for this purpose like in
The attached file has the format: As you can see, the value sequences do not have the same length. The resulting example set will have the format: This should be pretty much what you are looking for.
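To illustrate the general idea only (this is not the Split operator's implementation, and the original process and files are not available here), a small Python sketch: each delimited sequence is split into columns, and shorter sequences are padded with missing values so that every row has the same width. The function name and the comma separator are assumptions.

```python
from typing import List, Optional


def split_to_columns(values: List[str], sep: str = ",") -> List[List[Optional[str]]]:
    """Split each delimited string into columns of equal width.

    Shorter sequences are padded with None, which stands in for a
    missing value; no existing value is altered or dropped.
    """
    rows = [v.split(sep) if v else [] for v in values]
    width = max((len(r) for r in rows), default=0)
    return [r + [None] * (width - len(r)) for r in rows]


table = split_to_columns(["sss,qqq,rrr", "aaa"])
for row in table:
    print(row)
```

The padding cells are marked as missing rather than filled with invented values, so the original events keep their order and content.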
Hope that helps,
Ingo
P.S.: Please consider voting at KDnuggets. Read more at http://rapid-i.com/rapidforum/index.php/topic,884.msg3302.html
[attachment deleted by admin]
That will manipulate my data just so that the sequences "always have the same length".
Thanks anyway for your help, I know that somewhere there is a solution.
Regards,
A.
Sorry, but I did not notice that there was a bug in the ordered split mode of the Split operator that led to wrong values. This bug has been fixed in the latest developer branch. Since we are currently moving our CVS servers to Subversion, however, access is not as easy as usual. But of course this bug will also be fixed in the next update of the Enterprise Edition and later in the next community release.
Cheers,
Ingo