Read HTML Table - Extension Operator
I was expecting the following URL to parse properly:
https://www.hockey-reference.com/leagues/NHL_2019_skaters.html
However, the operator did not find any tables on the page. The tutorial process does properly parse tables from Wikipedia, but fails on the page above.
That said, this is my go-to reference for my students as the tables are easily parsed in R and Python. For example:
import pandas as pd
tables = pd.read_html("https://www.hockey-reference.com/leagues/NHL_2019_skaters.html")
skaters = tables[0]
skaters.head()
Yes, there has to be some cleanup on the columns and data types, but that is part of the exercise and why I like using this reference. I figured it would be even more powerful as a training exercise in RM given the amount of data prep that is necessary.
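For reference, the kind of cleanup involved looks roughly like this. It's a minimal sketch on a toy frame shaped like the skaters table (hockey-reference repeats the header row inside the table body, and numeric columns come back as strings); the column names here are only a small assumed subset:

```python
import pandas as pd

# Toy frame mimicking the raw read_html output: a repeated header row
# appears in the body, and everything is string-typed.
raw = pd.DataFrame({
    "Rk": ["1", "2", "Rk", "3"],
    "Player": ["Justin Abdelkader", "Pontus Aberg", "Player", "Noel Acciari"],
    "Age": ["31", "25", "Age", "27"],
})

def clean_skaters(df: pd.DataFrame) -> pd.DataFrame:
    # drop the repeated header rows (they show up as Rk == "Rk")
    df = df[df["Rk"] != "Rk"].copy()
    # convert numeric-looking columns from strings to numbers
    for col in ("Rk", "Age"):
        df[col] = pd.to_numeric(df[col])
    return df.reset_index(drop=True)

skaters = clean_skaters(raw)
```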
Any help or tips on how to configure this operator would be much appreciated!
Best Answer
MarcoBarradas:
@btibert I found another solution: just change the path of the HTML file that is created.
@ey it would be great if we could connect the file connector of the Write Document operator to the Read HTML Table operator.
Sorry for not commenting more; I'm in a rush, but I wanted to help.

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="187"> <parameter key="url" value="https://www.hockey-reference.com/leagues/NHL_2019_skaters.html"/> <parameter key="random_user_agent" value="true"/> <parameter key="connection_timeout" value="10000"/> <parameter key="read_timeout" value="10000"/> <parameter key="follow_redirects" value="true"/> <parameter key="accept_cookies" value="all"/> <parameter key="cookie_scope" value="thread"/> <parameter key="request_method" value="GET"/> <list key="query_parameters"/> <list key="request_properties"/> <parameter key="override_encoding" value="false"/> <parameter key="encoding" value="SYSTEM"/> </operator> <operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="124" name="Multiply" width="90" x="179" y="136"/> <operator activated="true" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Table Headers" width="90" x="179" y="340"> <parameter key="query_type" value="Regular Region"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"> <parameter key="Header" value="<thead>.</thead>"/> </list> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA"
value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <process expanded="true"> <operator activated="true" breakpoints="before" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Cut Document (2)" width="90" x="380" y="34"> <parameter key="query_type" value="String Matching"/> <list key="string_machting_queries"> <parameter key="header" value="<tr>.</tr>"/> </list> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"/> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <process expanded="true"> <connect from_port="segment" to_port="document 1"/> <portSpacing port="source_segment" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <connect from_port="segment" to_op="Cut Document (2)" to_port="document"/> <connect from_op="Cut Document (2)" from_port="documents" to_port="document 1"/> <portSpacing port="source_segment" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Rows of the table" width="90" x="313" y="238"> <parameter key="query_type" value="Regular Region"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"> <parameter key="Row" value="<tr >.</tr>"/> </list> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> 
<process expanded="true"> <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="Get_Data (2)" width="90" x="380" y="34"> <parameter key="query_type" value="Regular Expression"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"> <parameter key="Rank" value="<.*data-stat="ranker".*>(\d+)+</th>"/> <parameter key="Player" value="<.*data-stat="player".*>([\w\s]+)</a></td><td.*data-stat="age" >"/> <parameter key="Age" value="<td.*data-stat="age".*>(\d+)</td><td.*"team_id".*>"/> </list> <list key="regular_region_queries"> <parameter key="Rank" value="<\.*data-stat="ranker"\.*>.</th>"/> <parameter key="Player" value="<\.*data-stat="player"\.*>.</th>"/> </list> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> </operator> <connect from_port="segment" to_op="Get_Data (2)" to_port="document"/> <connect from_op="Get_Data (2)" from_port="document" to_port="document 1"/> <portSpacing port="source_segment" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:process_documents" compatibility="8.2.000" expanded="true" height="103" name="With RegEx" width="90" x="313" y="34"> <parameter key="create_word_vector" value="false"/> <parameter key="vector_creation" value="TF-IDF"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="false"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> 
<parameter key="data_management" value="auto"/> <process expanded="true"> <operator activated="true" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Cut Rows" width="90" x="313" y="34"> <parameter key="query_type" value="Regular Region"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"> <parameter key="Row" value="<tr >.</tr>"/> </list> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <process expanded="true"> <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="Get_Data" width="90" x="246" y="34"> <parameter key="query_type" value="Regular Expression"/> <list key="string_machting_queries"/> <parameter key="attribute_type" value="Nominal"/> <list key="regular_expression_queries"> <parameter key="Rank" value="<.*data-stat="ranker".*>(\d+)+</th>"/> <parameter key="Player" value="<.*data-stat="player".*>([\w\s]+)</a></td><td.*data-stat="age" >"/> <parameter key="Age" value="<td.*data-stat="age".*>(\d+)</td><td.*"team_id".*>"/> </list> <list key="regular_region_queries"> <parameter key="Rank" value="<\.*data-stat="ranker"\.*>.</th>"/> <parameter key="Player" value="<\.*data-stat="player"\.*>.</th>"/> </list> <list key="xpath_queries"/> <list key="namespaces"/> <parameter key="ignore_CDATA" value="true"/> <parameter key="assume_html" value="true"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <description align="center" color="transparent" colored="false" width="126">With regular expressions and capture groups ( ) we get the data we want</description> </operator> <connect from_port="segment" to_op="Get_Data" to_port="document"/> <connect from_op="Get_Data" from_port="document" to_port="document 
1"/> <portSpacing port="source_segment" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> <description align="center" color="yellow" colored="false" height="94" resized="true" width="362" x="140" y="233">(\d+) gets all the digits that are on the row<br>([\w\s]+) gets all the text and spaces that <br/>we need from the row</description> </process> <description align="center" color="orange" colored="true" width="126">With the Regular Region we are getting everything between &lt;tr&gt;&lt;/tr&gt; tags</description> </operator> <connect from_port="document" to_op="Cut Rows" to_port="document"/> <connect from_op="Cut Rows" from_port="documents" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:create_document" compatibility="8.2.000" expanded="true" height="68" name="table openning tag" width="90" x="313" y="136"> <parameter key="text" value="<table>"/> <parameter key="add label" value="false"/> <parameter key="label_type" value="nominal"/> </operator> <operator activated="true" class="text:create_document" compatibility="8.2.000" expanded="true" height="68" name="table clossing tag" width="90" x="313" y="493"> <parameter key="text" value="</table>"/> <parameter key="add label" value="false"/> <parameter key="label_type" value="nominal"/> </operator> <operator activated="true" class="text:combine_documents" compatibility="8.2.000" expanded="true" height="145" name="Combine Documents" width="90" x="514" y="289"/> <operator activated="true" class="text:write_document" compatibility="8.2.000" expanded="true" height="82" name="Creates a file with the HTML table" width="90" x="648" y="289"> <parameter key="overwrite" value="true"/> <parameter key="encoding" value="SYSTEM"/> </operator> <operator activated="true" class="write_file" 
compatibility="9.3.001" expanded="true" height="68" name="Write File to your computer" width="90" x="782" y="289"> <parameter key="resource_type" value="file"/> <parameter key="filename" value="C:\Users\mbarradas\Desktop\table.html"/> <parameter key="mime_type" value="application/octet-stream"/> </operator> <operator activated="true" class="web_table_extraction:html2exampleset_operator" compatibility="1.0.000" expanded="true" height="68" name="Extracts the table from the file created" width="90" x="648" y="136"> <parameter key="resource_type" value="file"/> <parameter key="file name" value="C:\Users\mbarradas\Desktop\table.html"/> </operator> <connect from_op="Get Page" from_port="output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="With RegEx" to_port="documents 1"/> <connect from_op="Multiply" from_port="output 2" to_op="Rows of the table" to_port="document"/> <connect from_op="Multiply" from_port="output 3" to_op="Table Headers" to_port="document"/> <connect from_op="Table Headers" from_port="documents" to_op="Combine Documents" to_port="documents 2"/> <connect from_op="Rows of the table" from_port="documents" to_op="Combine Documents" to_port="documents 3"/> <connect from_op="With RegEx" from_port="example set" to_port="result 1"/> <connect from_op="table openning tag" from_port="output" to_op="Combine Documents" to_port="documents 1"/> <connect from_op="table clossing tag" from_port="output" to_op="Combine Documents" to_port="documents 4"/> <connect from_op="Combine Documents" from_port="document" to_op="Creates a file with the HTML table" to_port="document"/> <connect from_op="Creates a file with the HTML table" from_port="file" to_op="Write File to your computer" to_port="file"/> <connect from_op="Extracts the table from the file created" from_port="collection of html data tables as example sets" to_port="result 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> 
<portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> </process> </operator> </process>
Answers
The "Read HTML Table/Tables" operator is built on a set of rules that allow it to extract data from HTML tables. Unfortunately, as seen in some other questions, the visual HTML interface often looks simpler than its HTML source code. The tables on the page you are referring to are structured in a rather complex manner and are not standard tables (as considered by the operator). The operator uses a classifier that was trained on data (a subset of the Common Crawl corpus) and works best with more straightforward tables (non-nested headers, no column spanning, a clear definition of rows and data items in the table body without further nesting, etc.). Hence, in this case the operator is not successful. The "Extract Structured Data" operator, which we wrote to extract data tables from sites that use the schema.org standard to encode microdata (completely or incompletely), is also of little use here. The reasons are 1) there is no use of schema.org tags, and 2) the DOM-based parsing used by this operator requires a hook to an encapsulating node in the DOM tree, which is not used in any standard manner, i.e. the classes or itemtype for the table elements (header and rows) are missing.
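As a rough illustration of why the header is "non-standard": the skaters table has a two-row header, with group labels spanning several columns, which a parser like pandas returns as a MultiIndex. The labels below are assumptions sketching the shape, not the page's exact output:

```python
import pandas as pd

# Hypothetical two-level header: a group label ("Scoring") spans several
# lower-level column names, which is what trips up simple table rules.
cols = pd.MultiIndex.from_tuples([
    ("Unnamed: 0", "Rk"), ("Unnamed: 1", "Player"),
    ("Scoring", "G"), ("Scoring", "A"),
])
df = pd.DataFrame([[1, "Justin Abdelkader", 6, 13]], columns=cols)

# flatten the nested header to a single level for downstream use
df.columns = [lower for _, lower in df.columns]
```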
This leaves us with the last option, which is not ideal given the processing time involved, but I tried it anyway. Save the page as a PDF, then try the "Read PDF Tables" operator on the PDF file. Try to tune the extraction criteria so that empty and no-row tables are ignored. This operator uses image detection techniques from Tabula Java (and we plan to update it further) to extract data from detected tables. I am attaching the zip file containing IO objects (representing ExampleSets), which you can unzip and paste into your local repository. This operator was able to extract all data from the tables, but certain tables are appended into one ExampleSet, which you can of course split. This may not be an ideal solution, but depending on your circumstances and context, it still provides one.
Thanks for your feedback. We have been thinking about incorporating some recent libraries and will continue to evaluate how to improve web table extraction in the near future.
Best Regards,
Edwin Yaqub
I know it is not as simple as in Python or R, but this is how you can extract the information when the other operators fail.
I'm not the best with it, but I hope it works.
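For anyone who prefers to stay in Python, the regex idea from the process above can be sketched like this. The snippet runs on a simplified, hand-written excerpt of the page's row markup (the real rows carry many more attributes), so treat the patterns as a starting point rather than a tested scraper:

```python
import re

# Simplified excerpt of hockey-reference row markup (hand-written sample).
html = (
    '<tr ><th data-stat="ranker">1</th>'
    '<td data-stat="player"><a href="/players/a/abdelju01.html">Justin Abdelkader</a></td>'
    '<td data-stat="age">31</td></tr>'
    '<tr ><th data-stat="ranker">2</th>'
    '<td data-stat="player"><a href="/players/a/abergpo01.html">Pontus Aberg</a></td>'
    '<td data-stat="age">25</td></tr>'
)

# everything between <tr ></tr>, like the Regular Region query in the process
row_pat = re.compile(r"<tr >(.*?)</tr>", re.S)

# capture groups pull Rank, Player, and Age out of each row
field_pat = re.compile(
    r'data-stat="ranker">(\d+)</th>.*?'
    r'data-stat="player"><a[^>]*>([\w\s]+)</a>.*?'
    r'data-stat="age">(\d+)</td>',
    re.S,
)

rows = [field_pat.search(r).groups() for r in row_pat.findall(html)]
```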
https://community.rapidminer.com/discussion/44904/using-the-nhl-api-to-analyze-pro-ice-hockey-data-part-1
Scott
That's a great link! I have been playing around with the API for a while (in R and Python), so it's absolutely fantastic to see it here in this tool too!