Reading values using XPath and extracting from metadata to an attribute
Hello,
This seems like it should be possible, but I've hit a few bumps and am hoping someone can offer suggestions. I am trying to mine data from a page that I access through Google, which requires logging into a Google account first. Here are the steps:
1) Access Google's login page, allowing Google to set a cookie for the session
2) Read hidden variables on the authentication form (the GALX token is what I'm interested in here)
3) Post values back to the form that include the tokens you picked up along with your username and password
4) Voila - you are authenticated
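To make the four steps above concrete outside of RapidMiner, here is a rough Python sketch using only the standard library. It is a sketch under assumptions, not a tested Google login: `post_url` and the form field names `Email` and `Passwd` are placeholders I made up for illustration, and I have not verified that the real form accepts a post like this.

```python
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements (step 2)."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def build_login_body(login_html, username, password):
    """Merge the hidden tokens with the credentials into a POST body (step 3)."""
    parser = HiddenInputParser()
    parser.feed(login_html)
    fields = dict(parser.hidden)
    fields["Email"] = username   # hypothetical field name
    fields["Passwd"] = password  # hypothetical field name
    return urlencode(fields).encode("ascii")


def login(login_url, post_url, username, password):
    """Run steps 1 to 4 with one cookie-aware opener; post_url is an assumption."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    # Step 1: fetch the login page so the server can set its session cookie.
    with opener.open(login_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Steps 2 and 3: read the hidden tokens and post them back with the credentials.
    body = build_login_body(html, username, password)
    # Step 4: the same opener keeps the cookies for any later requests.
    return opener.open(post_url, data=body)
```

The point of the single `opener` is that the cookie from step 1 and the token from step 2 have to travel together in the post back; the parsing part can be tried offline by passing a saved copy of the login page to `build_login_body`.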
My process for parsing the initial response doesn't seem to be working: RapidMiner is not picking up the GALX attribute. That's the first place I'm stuck. The second is that, once I have the value in the metadata, how do I get it out to use in the post back?
Thanks in advance for your help. Process XML is below.
-Eric
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
<parameter key="url" value="https://accounts.google.com/ServiceLogin?hl=en&amp;continue=https://www.google.com/"/>
<parameter key="user_agent" value="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1"/>
<parameter key="accept_cookies" value="all"/>
<list key="query_parameters"/>
<list key="request_properties"/>
</operator>
<operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="179" y="75">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="GALX" value="//input[@name='GALX']/@value"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_op="Get Page" from_port="output" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
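One way to check the XPath query itself, independent of RapidMiner, is to run it against a small piece of sample markup. The Python sketch below does that with `xml.etree.ElementTree`. The sample markup is invented, and the idea that an XHTML namespace might be the reason the query finds nothing is only an assumption on my part; I have not confirmed how RapidMiner parses the page.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Invented sample markup standing in for the login form.
PLAIN = (
    "<html><body><form>"
    '<input type="hidden" name="GALX" value="abc123"/>'
    "</form></body></html>"
)
# Same markup, but with every element placed in the XHTML namespace.
NAMESPACED = PLAIN.replace("<html>", '<html xmlns="%s">' % XHTML_NS)


def galx_values(markup, namespaces=None, prefix=""):
    """Return the value attribute of every input element named GALX.

    ElementTree supports only a subset of XPath and cannot select
    '/@value' directly, so the attribute is read from each matched element.
    """
    root = ET.fromstring(markup)
    path = ".//%sinput[@name='GALX']" % prefix
    return [element.get("value") for element in root.findall(path, namespaces)]


print(galx_values(PLAIN))                            # no namespace, no prefix
print(galx_values(NAMESPACED))                       # namespaced document, no prefix
print(galx_values(NAMESPACED, {"h": XHTML_NS}, "h:"))  # namespaced document, h: prefix
```

In this sketch the unprefixed query matches the plain document but returns an empty list for the namespaced one, and the prefixed query matches again once `h` is bound to the XHTML namespace. If something similar is going on in the process above, the empty `namespaces` list in the Extract Information operator would be the place to look, but that is a guess rather than a diagnosis.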
Answers
Has anyone else worked through the details of authenticating with Google using the operators in RapidMiner?
-Eric