Using DS4DM and Web Table Extraction extensions for Google Table and HTML table extraction
New versions of the Data Search for Data Mining extension (version 0.1.3) and the Web Table Extraction extension (version 0.1.7) are now available from the RapidMiner Marketplace. In this article I will show how the newly added Operators turn the Web into a new data source for RapidMiner.
Data Tables on the Web
Combining data from multiple sources is often necessary for accurate data analysis. A recent survey showed that HTML content makes up 98% of the documents on .com domains when compared to the total number of PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF, ODP, ODT, ODS and EPUB files. This explains the great interest in retrieving data from <table> tags in HTML files. However, Google researchers showed that of 14.1 billion raw HTML tables extracted from the English documents in Google's main index, only 154 million (about 1.1%) were high-quality data tables; the large majority were merely layout tables. So, on the one hand, the Web is the right place to look for public data tables, but on the other hand, the challenge is to narrow the search down to the small subset of webpages that contain not just data tables, but data tables that match your interest.
On a positive note, Google has made public its research work on keyword-based tabular search. This search is based on its “Web Tables” index, an index of over a hundred million public HTML data tables. In addition, Google has made its “Fusion Tables” index available for search. This index is made up of data tables that have been made public by users of the “Google Fusion Tables” service, which allows users to manage a collection of tables using SQL-like queries.
Google Table Search Operator
With this background, let me introduce the new Google Table Search Operator, which ships in the updated version of the Data Search for Data Mining extension. This Operator allows RapidMiner users to perform a keyword-based search on Google for webpages that contain data tables matching the keyword(s). The output is an ExampleSet carrying the URLs of the target webpages in an attribute. This is great, but falls one step short of instant gratification, i.e. making the data tables on these webpages available as ExampleSet(s) for immediate inspection and consumption.
Read HTML Tables Operator
This gap is filled by the new Read HTML Tables Operator, which ships in the updated version of the Web Table Extraction extension. It takes the output ExampleSet from the Google Table Search Operator, iterates over all webpage URLs, extracts the data tables from these webpages and outputs them as a collection. Together, these two Operators turn the Web into a new data source for RapidMiner!
The figure below shows a RapidMiner process in which the Google Table Search Operator searches the Google Web Tables corpus for the keywords "football clubs" and is configured to keep the top 10 results. Next, the Read HTML Tables Operator outputs a collection of the ExampleSet(s) found to match the keywords. The title of the page and the document from which each data table is derived can be seen in the Annotation view of the table, revealing its source. For local or offline use, the Read HTML Tables Operator can also extract data tables from an ExampleSet containing HTML file paths instead of URLs.
The Google Table Search Operator together with the Read HTML Tables Operator provides a keyword-based search for data tables on the Web and makes them available as a collection of ExampleSets
The extraction algorithm used by Read HTML Tables is scrape-agnostic, i.e. no regular expression, XQuery or XPath configuration is required, as the aim is broad applicability. Please note that extraction works best for <table> tags that have a clearly defined (non-nested) header and row structure. For more information, you may refer to the earlier blog article on Web Table Extraction.
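To illustrate why a clearly defined, non-nested header and row structure is the easy case, here is a minimal Python sketch that reads such tables using only the standard library. It is not the algorithm used by the Read HTML Tables Operator; the names SimpleTableParser and extract_tables are made up for this example, and the sketch simply assumes that the first row of each table is the header.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collects top-level <table> elements as lists of rows (lists of cell strings)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished top-level tables
        self._depth = 0    # current <table> nesting depth
        self._rows = None  # rows of the table being read
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._rows = []
        elif self._depth == 1:  # ignore everything inside nested tables
            if tag == "tr":
                self._row = []
            elif tag in ("th", "td") and self._row is not None:
                self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            if self._depth == 1:
                self.tables.append(self._rows)
                self._rows = None
            self._depth = max(0, self._depth - 1)
        elif self._depth == 1:
            if tag in ("th", "td") and self._cell is not None and self._row is not None:
                # Collapse runs of whitespace inside the cell text.
                self._row.append(" ".join("".join(self._cell).split()))
                self._cell = None
            elif tag == "tr" and self._row is not None:
                if self._row:
                    self._rows.append(self._row)
                self._row = None
                self._cell = None

    def handle_data(self, data):
        if self._depth == 1 and self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return each usable table as a list of dicts keyed by the first (header) row."""
    parser = SimpleTableParser()
    parser.feed(html)
    parser.close()
    result = []
    for rows in parser.tables:
        if len(rows) < 2:  # a header without data rows is not a data table
            continue
        header, body = rows[0], rows[1:]
        records = [dict(zip(header, row)) for row in body if len(row) == len(header)]
        if records:
            result.append(records)
    return result
```

A table used purely for layout, such as a single cell wrapping another table, yields no records in this sketch, which mirrors the distinction between layout tables and data tables made earlier.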
Added Value through Search-Join data enrichment
Searching for and retrieving data tables from webpages (as shown above) brings an untapped source of structured data to RapidMiner. The question is how to take deeper advantage of this mass of data. Added value comes from query-driven data enrichment, a method that finds tables in a corpus that are contextually relevant to your given table. If we can discover data in a smart and efficient way from a large and heterogeneous corpus of tables, and automatically integrate the new attributes into the existing dataset while handling the different schemas of the discovered tables, the output is an enriched table that can lead to more holistic and accurate analysis. However, query-driven data enrichment requires a managed corpus of tables as well as a search engine to process queries.
As shown in an earlier blog article, we have already implemented query-driven data enrichment for RapidMiner as part of the German research project DS4DM. The aptly named Search-Join Operators provide this together with the backend search engine, which is developed by our project partner (the University of Mannheim, Web and Data Science group). The search engine harnesses a managed corpus, currently composed of data tables from Wikipedia.
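The idea behind query-driven enrichment can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the DS4DM search engine or the Search-Join Operators: tables are plain lists of dicts that share the same keys, column names are matched case-insensitively, and the function names and the min_overlap threshold are invented for the example.

```python
def normalize(value):
    """Lower-case a value and collapse whitespace so that keys can be compared."""
    return " ".join(str(value).lower().split())


def find_column(table, name):
    """Return the column of `table` whose normalized name equals `name`, if any."""
    wanted = normalize(name)
    for column in table[0]:
        if normalize(column) == wanted:
            return column
    return None


def search_join(query_rows, subject, new_attribute, corpus, min_overlap=0.5):
    """Add `new_attribute` to `query_rows` from the best-matching corpus table.

    A corpus table qualifies if it has a column named like `new_attribute` and
    another column whose values cover at least `min_overlap` of the query's
    subject values. Rows without a match receive None.
    """
    query_keys = {normalize(row[subject]) for row in query_rows}
    best_lookup, best_overlap = {}, 0.0
    for table in corpus:
        if not table:
            continue
        value_col = find_column(table, new_attribute)
        if value_col is None:
            continue
        # Try every other column as the join key and keep the best coverage.
        for key_col in table[0]:
            if key_col == value_col:
                continue
            lookup = {normalize(row[key_col]): row[value_col] for row in table}
            overlap = len(query_keys & lookup.keys()) / len(query_keys) if query_keys else 0.0
            if overlap >= min_overlap and overlap > best_overlap:
                best_lookup, best_overlap = lookup, overlap
    return [
        {**row, new_attribute: best_lookup.get(normalize(row[subject]))}
        for row in query_rows
    ]
```

For example, a query table with a "country" column can be extended with a "capital" attribute from a corpus table whose columns are called "Name" and "Capital", even though neither the column names nor the spelling of the values match exactly. A real backend would additionally need indexing, smarter schema matching and conflict resolution across many candidate tables.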
Repository Management Operators
In this update, we also implemented two Operators that give users API-based control of their managed corpus for corporate/private use. The managed corpus may be very large, and tables should be ingested into it automatically. The Data Table Upload Operator uploads a RapidMiner ExampleSet to the backend. Naturally, this includes data from any connector already available in RapidMiner as well as the tables that can now be discovered directly from the Web by the Google Table Search Operator. The backend pre-processes, indexes, and stores the received data in its corpus so that it becomes readily available for future queries.
For better organization, data must be uploaded to a particular repository at the backend. A repository provides a logical grouping of similar data within the managed corpus; for example, data from different departments or products should be stored in separate repositories. A repository can be created using the Create Repository Operator. Please note that the data management Operators are found in the Repository Management operator group.
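The relationship between the managed corpus, its repositories and uploaded tables can be pictured with a small in-memory model in Python. This is purely a conceptual sketch: the class ManagedCorpus and its methods are hypothetical and do not reflect the DS4DM backend API, which is accessed through the Operators described here.

```python
class ManagedCorpus:
    """Toy in-memory model: a corpus holds named repositories, each holding tables."""

    def __init__(self):
        self._repositories = {}  # repository name -> {table name -> rows}

    def create_repository(self, name):
        """Create an empty repository; names must be unique."""
        if name in self._repositories:
            raise ValueError(f"repository already exists: {name!r}")
        self._repositories[name] = {}

    def repository_names(self):
        """Names that a client could offer in a selection list."""
        return sorted(self._repositories)

    def upload_table(self, repository, table_name, rows):
        """Store a copy of `rows` in an existing repository."""
        if repository not in self._repositories:
            raise KeyError(f"unknown repository: {repository!r}")
        self._repositories[repository][table_name] = [dict(row) for row in rows]

    def tables(self, repository):
        """Return all tables stored in the given repository."""
        return list(self._repositories[repository].values())
```

In this model an upload fails unless the target repository has been created first, which mirrors the requirement that data is always uploaded to a particular repository.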
The figure below shows a RapidMiner process that uploads the Products dataset to a repository at the backend, specified by the repository parameter. The list of values for this parameter is populated with the names of the repositories fetched from the backend server. The data search connection parameter lets you select or create a connection to the backend server (there may be private on-premise or public backends in the future). The optional subject id parameter names the primary identifier, which can also be determined heuristically at the backend.
The Data Table Upload Operator connects to the DS4DM server and uploads a data table to a repository at the backend
Try out the new features by updating your Data Search for Data Mining extension as well as the Web Table Extraction extension from the RapidMiner Marketplace, and run the tutorial processes. Please note that this is work in progress, and we are all ears for your valuable feedback.