Problems with processing the answer from a GET request
David_Bartholomew
Hi guys,
I want to mine performance data of footballers for an essay.
As a source I found Goaloo1 (I can't post links yet). The problem is that they don't provide the information in a file, so I want to use the Web Mining Extension instead.
I managed to identify the GET request URL that provides all the data for a given season of a given league (I can't post that either). The only problem is that the response is just one big string that, with some minor regex replacements, can be turned into multiple CSV files. I could do that manually in VS Code, but I would rather learn to do it all properly in RapidMiner.
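To make clear what I mean by "regex replacements into multiple CSVs", here is a minimal Python sketch of the manual route I'm trying to avoid. The payload in it is made up: I'm assuming the file is a series of JavaScript `var name = [[...], ...];` assignments, and the variable names, fields, and values are hypothetical placeholders, not the real Goaloo1 data. The function name `js_to_csv_tables` is my own as well.

```python
import csv
import io
import json
import re

# Hypothetical sample: assumes the .js payload is a series of
# `var <name> = <array of arrays>;` assignments. Not the real Goaloo1 data.
SAMPLE = (
    "var arrPlayers = [[1,'Player A',10],[2,'Player B',7]];"
    "var arrTeams = [[36,'Team X'],[37,'Team Y']];"
)

# One match per assignment: group 1 is the variable name,
# group 2 is the array literal up to the closing "];".
ASSIGNMENT = re.compile(r"var\s+(\w+)\s*=\s*(\[.*?\]);", re.DOTALL)


def js_to_csv_tables(text):
    """Return {variable name: CSV text} for each array assignment in `text`."""
    tables = {}
    for name, literal in ASSIGNMENT.findall(text):
        # Swap single quotes for double quotes so the literal parses as JSON.
        # Assumption: no value contains a quote character itself.
        rows = json.loads(literal.replace("'", '"'))
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        tables[name] = buffer.getvalue()
    return tables


if __name__ == "__main__":
    for name, table in js_to_csv_tables(SAMPLE).items():
        print(f"--- {name}.csv ---")
        print(table, end="")
```

Each variable ends up as its own CSV table, which is the result I would like to reproduce with RapidMiner operators instead of a script.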
First things first, I couldn't get the GET (REST) operator to work (I got an "Error accessing REST Service"):
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crud_get" compatibility="9.7.000" expanded="true" height="68" name="GET (REST)" width="90" x="112" y="85">
        <parameter key="request_url" value="https://info.goaloo1.com/jsdata/count/2020-2021/playertech_36.js"/>
        <list key="request_headers"/>
        <parameter key="response_body_type" value="json"/>
        <parameter key="fail_on_endpoint_error" value="true"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="9.4.000" expanded="true" height="82" name="Documents to Data" width="90" x="380" y="85">
        <parameter key="text_attribute" value="Test"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <connect from_op="GET (REST)" from_port="response" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
I did manage to get the document by using the "Get Page" operator, though. From what I gathered online, I now need to use the "Replace" operator on an ExampleSet, so I first need to transform the Document into an ExampleSet. I found two ways, but I couldn't get either of them to work.
The first way was to use the "Documents to Data" operator. Although it does give me an ExampleSet that I can use the "Replace" operator on, it cuts off about 99% of the content of the original document:
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="9.7.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="85">
        <parameter key="url" value="https://info.goaloo1.com/jsdata/count/2020-2021/playertech_36.js"/>
        <parameter key="random_user_agent" value="false"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
        <parameter key="override_encoding" value="false"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="keep_sensitive_headers" value="false"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="9.4.000" expanded="true" height="82" name="Documents to Data" width="90" x="380" y="85">
        <parameter key="text_attribute" value="Test"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
The second way I found was to use the "Process Documents" operator. Same problem:
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="9.4.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
        <parameter key="file" value="C:/Users/[Hidden]"/>
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="9.4.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="85">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Can anybody help me with my problem? Or should I maybe follow a different approach to mining the data altogether?
I'm very new to RapidMiner, so please excuse any newbie mistakes I make.
Best
David