Text Mining - Industry 4
charlesmrt
Member Posts: 4 Learner III
Hey,
I want to extract all the texts from this page: http://www.plattform-i40.de/I40/Navigation/Karte/SiteGlobals/Forms/Formulare/EN/map-use-cases-formular.html and create a table of factors extracted from these texts, where each row is a case and each column is a piece of data extracted from the text. I think I'll use 6 columns: Value Creation, Product Examples, Region....
Then I want to compare those data to find out which cases best fit an external given case. For instance: given case X fits at 80% with the company in line 35, at 60% with the company in line 118, etc...
Do you know how I can do all of that?
It's for my Master Thesis.
Thanks a lot,
Charles
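The matching step described above (scoring how well a given case fits each row) could be sketched outside RapidMiner with a plain bag-of-words cosine similarity. This is only a minimal illustration under assumptions: the two rows and the query text are made up, the function names are hypothetical, and a real thesis would likely want TF-IDF weighting or per-column comparison rather than raw word counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts (0.0 to 1.0)."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_cases(query, cases):
    """Return (case_id, score) pairs sorted from best to worst match."""
    scores = [(case_id, cosine_similarity(query, text)) for case_id, text in cases.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows standing in for the scraped table (not real data from the site).
cases = {
    "line 35": "software solution for smart engineering in the automotive industry",
    "line 118": "research and development center for industrial automation",
}
query = "smart engineering software for automotive production"
for case_id, score in rank_cases(query, cases):
    print(f"{case_id}: {score:.0%}")
```

The score is a percentage-like number, which maps directly onto the "fits at 80%" wording, but it measures shared vocabulary only, not meaning.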
Answers
To summarize the first part of your question: you want to scrape this webpage and obtain the information it contains. So how do you do this?
This web page is clearly the result of a combination of HTML, CSS and JavaScript. See the picture. All the information is included, but not all of it in plain HTML, so "traditional" web scraping doesn't bring the required results. The information/data is still available, but you have to do something smarter, like using XPath on the webpage document to find and retrieve every individual piece of (AJAX/JavaScript) data in the document. You can do that in RapidMiner: have a look at the tutorial by a guy called El Chief on YouTube. https://www.youtube.com/watch?v=vKW5yd1eUpA
Hey,
Thanks a lot for answering. I didn't manage to extract data from the HTML page. The link you sent me seems very useful, but the classes used are not exactly the same and I can't find the correct XPath to extract the data.
Could you help me if you know how to correctly extract data from HTML?
From this page: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html , I want to extract Manufacturing industry and automatically link it with Application example.
Thanks,
Charles
hello @charlesmrt - welcome to the community. It was my hope that @ey's nice "Read HTML Table" operator would do the trick here but alas it did not. However using "Get Page" and "Extract Content" gets you pretty far:
Scott
Hey,
Thanks for answering. In the attached file you can see the HTML. I just want to extract "software solution". I tried "//*[contains(.,'Product example')]/../span[last()]" and "//*[contains(.,'Product example')]/../span[1]", but neither works. How could I do this?
The link: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/150-smart-engineering-and-production-4-0-en/article-smart-engineering-and-production-4-0-en.html
Thanks,
Charles
Oh, that seems very complicated. I would use RegEx.
Scott
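As a rough illustration of the RegEx suggestion, here is a minimal Python sketch that pulls the value following a label out of an HTML fragment. The markup below is hypothetical (the attached HTML is not reproduced in this thread), and `extract_field` is an illustrative helper name, so the pattern would need adjusting to the real page structure.

```python
import re

# Hypothetical fragment; the real page markup may differ.
html = """
<p><span class="label">Product example</span> <span class="value">Software solution</span></p>
<p><span class="label">Region</span> <span class="value">Germany</span></p>
"""


def extract_field(html_text, label):
    """Return the text inside the first tag that follows the label's closing tag, or None."""
    pattern = re.compile(
        # label, its closing tag, the next opening tag, then the captured text
        re.escape(label) + r"\s*</[^>]+>\s*<[^>]+>\s*([^<]+?)\s*<",
        re.IGNORECASE,
    )
    match = pattern.search(html_text)
    return match.group(1) if match else None


print(extract_field(html, "Product example"))
print(extract_field(html, "Region"))
```

Regular expressions are brittle against markup changes, so this suits a one-off extraction over a fixed set of downloaded pages better than a long-lived scraper.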
Thanks,
I found another way to do it: I downloaded the HTML pages to my computer with "Download them all", then used text processing and Extract Information with a regular expression. I obtained a table containing all the information.
But I still have a question: with regular expressions, I can extract only one expression per column of my table. The query expression is unique, but sometimes there are several matches for one attribute name. How can I get multiple matches in one column? I used "|", but it produces a disjunction of elements, not an accumulation.
Thanks,
Charles
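One way to get the "accumulation" asked about above, sketched in Python rather than RapidMiner: instead of combining alternatives with "|" (which still yields a single match per search), collect every match of one pattern and join them into a single cell. The sample text, the separator and the `extract_all` helper are assumptions made for illustration only.

```python
import re

# Hypothetical text of one use case containing several product examples.
text = "Product example: Software solution. Product example: Sensor system. Region: Germany."


def extract_all(text, pattern, separator="; "):
    """Join every match of the pattern's single capture group into one string."""
    return separator.join(m.strip() for m in re.findall(pattern, text))


# One row of the table: each column holds all matches for its attribute.
row = {
    "Product example": extract_all(text, r"Product example:\s*([^.]+)\."),
    "Region": extract_all(text, r"Region:\s*([^.]+)\."),
}
print(row)
```

Keeping the matches in one delimited string preserves the one-row-per-case layout; splitting them into separate rows later is still possible if the analysis needs it.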