
read data from html tables

Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
edited December 2018 in Product Feedback - Resolved

There are many pages on the web that contain useful data in the form of simple html tables.  Here's an example:

https://en.wikipedia.org/wiki/List_of_metropolitan_areas_of_the_United_States

 

RapidMiner can retrieve data automatically in HTML form using "get page" and store it as a document, and can even do this iteratively if a set of related pages is required.  But what users often want is to extract the information in the HTML table into a usable example set in RapidMiner.  So an operator should be created that does the following:

  1. collect the table column headers and use them as attribute names
  2. collect each data row from the table and store it as an example
  3. identify and set the appropriate data type for each resulting attribute
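To make the three steps concrete, here is a minimal sketch in plain Python (standard library only). It is not RapidMiner code; the names `TableParser`, `guess_type`, `html_table_to_records` and the `SAMPLE` data are made up for illustration, and it assumes a single, non-nested table whose first row is the header and whose cells have explicit closing tags.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of the first <table> in a document, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # row currently being built (None outside <tr>)
        self._cell = None    # text fragments of the current cell (None outside a cell)
        self._in_table = False
        self._done = False   # set once the first table has been closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("th", "td") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._in_table:
            self._done = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def guess_type(values):
    """Return int, float or str: the first type that parses every value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value.replace(",", ""))  # tolerate thousands separators
            return caster
        except ValueError:
            continue
    return str


def html_table_to_records(html):
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows                        # step 1: attribute names
    body = [r for r in body if len(r) == len(header)]  # step 2: one example per row
    types = [guess_type(col) for col in zip(*body)]    # step 3: type per attribute
    return [
        {name: (v if t is str else t(v.replace(",", "")))
         for name, t, v in zip(header, types, row)}
        for row in body
    ]


# Made-up sample data, not taken from the Wikipedia page above.
SAMPLE = """
<table>
  <tr><th>Metro area</th><th>Population</th><th>Growth</th></tr>
  <tr><td>Alpha</td><td>1,200,000</td><td>2.5</td></tr>
  <tr><td>Beta</td><td>850,000</td><td>-0.4</td></tr>
</table>
"""

records = html_table_to_records(SAMPLE)
```

Each entry in `records` is a dict keyed by the header names, with numeric columns already converted. Real pages would need more care (merged cells, missing closing tags, footnote markers inside numbers).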

An operator that did all this automatically would be incredibly useful: "HTML table to data" or something similar.    In theory this could be similar to the read csv operator, with a small wizard to identify the table and its columns, set the data types, etc.

 

P.S.  I know it is technically feasible using a series of XPath expressions in the read xml operator, but the consensus among the RM product experts I consulted in the general studio forum is that it is still a multi-step process that requires good knowledge of XPath parsing under current options.  So adding a single operator that did all the parsing, renaming, etc., would be a significant improvement.
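For comparison, this is roughly what the XPath route looks like, sketched in Python with `xml.etree.ElementTree` rather than in the read xml operator itself. It assumes the page is well-formed XHTML (most real pages are not, so a cleanup step would come first), and the `XHTML` snippet and variable names are invented for the example. Note that it takes one expression for the table, one for the headers, one for the rows and one for the cells, and the values still come out as untyped strings.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed XHTML snippet standing in for a downloaded page.
XHTML = """<html><body>
<table>
  <tr><th>Name</th><th>Count</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>20</td></tr>
</table>
</body></html>"""

root = ET.fromstring(XHTML)

# Expression 1: locate the first table anywhere in the document.
table = root.find(".//table")

# Expression 2: header cells become the attribute names.
header_names = [th.text for th in table.findall("./tr/th")]

# Expressions 3 and 4: every row that has <td> cells, then the cells in it.
data_rows = [
    [td.text for td in tr.findall("./td")]
    for tr in table.findall("./tr")
    if tr.find("./td") is not None
]

# Renaming step: pair header names with cell values (all still strings).
xpath_records = [dict(zip(header_names, row)) for row in data_rows]
```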

Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts

Fixed and Released · Last Updated

Comments

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    This sounds a little like a simple version of how Diffbot works.  Have you tried out their service?  

    I agree having this as a simple operator would be pretty handy. 

     

     

  • Edin_Klapic Employee-RapidMiner, RMResearcher, Member Posts: 299 RM Data Scientist

    Hi all,

     

    My colleague Edwin Yaqub recently developed an extension for this use case. Could you try it out and give us some feedback?

     

    Here is the link to the RapidMiner blog post. The extension is available on the marketplace.

    http://community.rapidminer.com/t5/Community-Blog/The-Web-Table-Extraction-Operator/ba-p/37353

     

    Best regards,

    Edin

  • ey1 Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 21 RM Research

    Hi all,

     

    This seems to be a slightly older thread, but we just released an updated version of the Web Table Extraction extension.
    It has an operator 'Read HTML Table' which addresses the features requested above by Telcontar120. The latest release adds type guessing on the ExampleSet and also allows loading HTML documents from a local file path in addition to a web URL. Loading from local files is helpful when dealing with a large number of HTML files collected through crawls.

     

    HTML tables can have complicated header structures, and many tables are used for layout, i.e. they are not data tables. The operator uses a model to distinguish layout tables from data tables. The model is trained on many HTML documents that were crawled and painstakingly cleaned in public research projects such as the Common Crawl, the Dresden Web Table Corpus and the Web Data Commons project (Uni Mannheim). Since there are infinitely many ways to structure tables in HTML, some standardization rules apply when extracting data from HTML pages: tables are not nested, headers are single-layered, and the table has at least 2 attributes and 3 rows. The operator has been tested on Wiki-like pages as well as many other websites whose HTML tables follow similar structures.
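As a rough illustration of the standardization rules only (not the extension's actual code, and not the layout-vs-data model), the check could be sketched in Python as below. `ParsedTable` and `meets_extraction_rules` are hypothetical names, and the sketch assumes the "3 rows" minimum counts data rows, which the description above does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedTable:
    """Hypothetical, already-parsed view of one HTML table."""
    header_rows: list = field(default_factory=list)  # each entry: list of header cell texts
    data_rows: list = field(default_factory=list)    # each entry: list of data cell texts
    contains_nested_table: bool = False


def meets_extraction_rules(table, min_attributes=2, min_rows=3):
    """Apply the three rules: no nesting, single-layer header, minimum size."""
    if table.contains_nested_table:
        return False
    if len(table.header_rows) != 1:  # headers must be single-layered
        return False
    if len(table.header_rows[0]) < min_attributes:
        return False
    # Assumption: the row minimum refers to data rows, not the header row.
    return len(table.data_rows) >= min_rows


ok_table = ParsedTable(
    header_rows=[["Name", "Count"]],
    data_rows=[["a", "1"], ["b", "2"], ["c", "3"]],
)
two_layer_header = ParsedTable(
    header_rows=[["Group", ""], ["Name", "Count"]],
    data_rows=[["a", "1"], ["b", "2"], ["c", "3"]],
)
```

Here `meets_extraction_rules(ok_table)` is `True`, while `two_layer_header` is rejected because its header spans two rows.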

     

    You can download the latest release from this link:
    https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_web_table_extraction&platform=ANY&version=0.1.6

     

    Hope it helps.


    Kind Regards,
    Edwin Yaqub

  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager