Simple Text Mining

JuWi Member Posts: 7 Contributor II
edited July 2019 in Help
Hello,

I have to do some research on text mining with RapidMiner.

I have RapidMiner 4.6 and the Text Plugin installed.

I successfully crawled some pages from the web and stored them as HTML files.
Now I want to visualize my results.

For example:
I crawled this forum and stored the pages wherever the keywords "text" and "mining" appear.
The Crawler stored about 80 HTML pages.

Is it now possible to extract and visualize, e.g., every sentence from every HTML page where these keywords appear?

Regards
JuWi
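
Outside RapidMiner, the sentence extraction asked about above can be sketched in a few lines of Python. This is only a rough illustration and not a RapidMiner feature: the names `TextExtractor` and `sentences_with_keywords` are made up for this example, the sentence splitting is deliberately naive (it breaks after `.`, `!` or `?`), and a sentence is kept only if it contains all given keywords as whole words.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth of script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def sentences_with_keywords(html, keywords):
    """Return the sentences that contain every keyword (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace so sentences are on one line.
    text = " ".join(" ".join(parser.parts).split())
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    patterns = [re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE)
                for k in keywords]
    return [s for s in sentences if all(p.search(s) for p in patterns)]


# Hypothetical page content, standing in for one stored HTML file.
sample = ("<html><body><p>Text mining is fun. I like cats.</p>"
          "<script>var text = 'mining';</script>"
          "<p>Mining of text is hard!</p></body></html>")
for sentence in sentences_with_keywords(sample, ["text", "mining"]):
    print(sentence)
```

For the roughly 80 stored pages, the same function could be called once per file after reading its content; the matching sentences could then be written to a single text file for review.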

Answers

  • fischer Member Posts: 439 Maven
    Hi,

    If you are asking about highlighting terms in a rendered HTML view: no.

    But you can, e.g.,
    - use an example filter to filter for texts containing a specified token,
    - use a scatter plot and put one token on one axis and, e.g., the label on the other one,
    - and more.

    Cheers,
    Simon

  • JuWi Member Posts: 7 Contributor II
    Thank you for your answer. I will try these things.

    At the moment I am trying to find out how to count words in a text, that is, how many times each word appears in the text.

    For example: I used the TextInput operator to read several .pdf files. I added the StringTokenizer and got one row for each word that appears in all the .pdf files.
    Now it would be interesting to know how many times each word appears in the files. Unfortunately, I couldn't find out (even with the help of the forum) how to interpret the columns Statistics and Range.

    Hope you can help me.

    Regards
    JuWi
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    these statistics give you information about one attribute. They show the mean and variance of numerical attributes, and the mode and counts of nominal attributes. Hence the meta data view doesn't give you any information about a single text, just about the average occurrences of the words. To see the count of a word in one single text, please switch to the data view. There, each text is represented by a row, while the words (as attributes) are shown in columns. If you have selected occurrences as the vector creation method in the TextInput, they will be shown there per text.

    Greetings,
      Sebastian
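
The data view layout described above (one row per text, one column per word, occurrence counts in the cells) can be illustrated with a small Python sketch. It is not RapidMiner code: `tokenize` and `term_occurrences` are hypothetical helper names, the file names are invented, and the tokenizer simply treats every run of letters as a word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return every run of letters as a token."""
    return re.findall(r"[^\W\d_]+", text.lower())


def term_occurrences(documents):
    """Build one row per document and one column per word.

    `documents` maps a document name to its raw text. Returns the sorted
    vocabulary (the column names) and a dict of name -> row of counts.
    """
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[word] for word in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


# Invented example documents standing in for the extracted PDF texts.
docs = {"a.txt": "Text mining mines text.", "b.txt": "Mining data."}
vocabulary, rows = term_occurrences(docs)
print("document", vocabulary)
for name, row in rows.items():
    print(name, row)
```

Reading a single row gives the per-text counts; statistics taken over a whole column correspond to what the meta data view summarizes.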
  • JuWi Member Posts: 7 Contributor II
    Hi,

    thanks for your answer. I tried what you told me and it worked.
    Now, my task is: I have several documents which I want to analyse.

    By analyse I mean finding out the occurrences or importance of every word that might be interesting.

    So far I have used the TextInput operator and the StringTokenizer to read the files with the TFIDF or TermOccurence algorithm.

    Now I get, as you told me, one row for each document, containing all the words that occur in each text or were rated by the TFIDF algorithm.

    I also used a StopwordFilter to keep only the words that are important.

    But so far I couldn't find a way to sort the words by their occurrence across all the texts, which is my main goal.

    Another question: once I've sorted all my words by their occurrence, how can I make something like a diagram to show, e.g., the ten most frequent words in the text?

    Thanks for your help.

    Regards
    JuWi
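
For readers unsure how TF-IDF weights differ from plain occurrence counts, here is a minimal Python sketch of one common textbook variant: term frequency is the count divided by the number of tokens in the document, and inverse document frequency is log(N / df). This is an assumption made for illustration; the exact formula and normalization used by RapidMiner may differ. The stopword list and the helper names are also invented.

```python
import math
import re
from collections import Counter

# Tiny, hypothetical stopword list; a real filter would be much longer.
STOPWORDS = {"the", "a", "is", "of", "and"}


def tokenize(text):
    """Lowercase, split into runs of letters, and drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    weight = (count / tokens in document) * log(N / documents containing word)
    """
    counts = [Counter(tokenize(text)) for text in documents]
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per word
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({word: (n / total) * math.log(n_docs / doc_freq[word])
                        for word, n in c.items()})
    return weights


example = ["the text mining", "the text data"]
for doc_weights in tf_idf(example):
    print(doc_weights)
```

In this variant a word that appears in every document gets weight zero, which is why TF-IDF highlights words that distinguish a text, while plain occurrences highlight words that are simply frequent.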

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    if you really only want to count the words of one text, you could use Process Document and then take a look at the resulting word list. All word counts are listed there. They might be plottable too, but I'm not sure about this.

    Greetings,
      Sebastian
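
As a point of comparison outside RapidMiner, sorting words by their total occurrence across all texts and showing the ten most frequent ones can be sketched with the Python standard library. The helper names `top_words` and `bar_chart` are hypothetical, and the "diagram" is just a text bar chart so that the example needs no plotting library.

```python
import re
from collections import Counter


def top_words(documents, n=10):
    """Sum word counts over all documents and return the n most frequent."""
    totals = Counter()
    for text in documents:
        totals.update(re.findall(r"[^\W\d_]+", text.lower()))
    return totals.most_common(n)


def bar_chart(pairs, width=40):
    """Render (word, count) pairs as text bars; the top count gets `width` marks."""
    if not pairs:
        return ""
    longest = max(len(word) for word, _ in pairs)
    peak = max(count for _, count in pairs)
    lines = []
    for word, count in pairs:
        bar = "#" * max(1, round(count * width / peak))
        lines.append(f"{word.ljust(longest)} {bar} {count}")
    return "\n".join(lines)


# Invented sample texts; in practice these would be the extracted documents.
texts = ["text mining text", "mining text data"]
print(bar_chart(top_words(texts, n=10)))
```

A stopword filter would normally be applied before counting, otherwise words like "the" tend to dominate the top ten.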
  • JuWi Member Posts: 7 Contributor II

    Hey,

    Is Process Document maybe in RM 5? I can't find it in RM 4.6.

    Well, actually I am trying to find out which features RapidMiner can offer me for my tasks.

    Those would be analysing documents by their content and crawling pages from the Internet.

    Both more or less work, but I think I have only understood a tiny part of what I can do with RapidMiner.

    Analysing text is working in so far as I get information out of the data view. My goal is to easily find out what the text is mainly about from the occurrence of the words.

    Crawling the web is working in so far as I can crawl pages by following the links and use the rules to choose whether I want to store the pages. Now I am wondering if it is possible to get information out of social media like Facebook or Twitter. The main problem I face here is that I can't, so to speak, "enter" the community because the crawler has no access to it.

    An example would be the one from the RapidMiner Text Plugin examples, where the API of maps.google.de is crawled in order to do geographical text mining.

    Do you know any ways I can accomplish my tasks?

    So far my tactic is learning by doing... not very successful.

    Regards
    JuWi
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    there are several ways:
    First of all, I would switch to RapidMiner 5 for analysing the text. Not only does RapidMiner 5 lower the learning curve a lot, but it also offers the new Text Processing Extension, which is much easier to use than the old Text Plugin. Unfortunately the Web Crawler isn't finished yet, so you can't crawl in RapidMiner 5 yet.

    For learning how to achieve your tasks, you might attend one of our courses or webinars. The RapidMiner book is still unfinished, so I can't offer any official guide. If you want a more focused discussion and help with your specific problem, you could book some telephone support as well. We have several customers who book telephone support by the hour and call us when they are confronted with problems they can't solve. This might be an option, too.

    Greetings,
      Sebastian
  • JuWi Member Posts: 7 Contributor II
    Hi,

    I switched to RM 5 and tried to install the Text Processing Extension. First I tried the Updater, as described on rapid-I.com.

    It didn't work; the error was: "can't find .lib\plugins\maneged\rmx_text-5.0.0.jar". So I downloaded the extension from SourceForge and put the .jar file into the plugins directory, in a new folder that I named "managed". Then I tried the Updater again, and it seemed to work.
    Now I can't find any Text Processing operators. Can you tell me where they are supposed to be?

    Regards JuWi

    Well, I just found out that although the Updater downloaded a file that was 3.7 MB in size, it didn't work. The Updater still says the extension is not installed.

    So I have no idea why the Updater isn't working.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    there might be a problem if you are working without an administrator account. To solve this issue, you could check the property rapidminer.update.to home in the preferences dialog. The extensions will then be installed into your user directory.
    Before you do this, please delete the file you copied into your RapidMiner directory. Managed extensions are only used if they have been correctly installed.

    If this does not work, you might download the plugin from SourceForge and copy it into the lib\plugins directory.

    Greetings,
      Sebastian
  • JuWi Member Posts: 7 Contributor II
    Thanks, that was the problem.

    Now I have another problem. I used the operator "Process Documents from File" to read some PDF files which I got from someone else. So far there had been no problem processing PDF files. But with those PDF files I got the error: "The Setup does not seem to have any obvious errors. But you should check the log messages..."

    I downloaded the exact same files on my own, and the error didn't occur.

    Do you have any idea what is wrong?

    Regards
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    as far as I remember, you are still using RapidMiner 4.6? I would be really interested to know whether this error still occurs in RapidMiner 5.0.

    Greetings,
      Sebastian
  • JuWi Member Posts: 7 Contributor II
    I switched to RM5, so that error occurred with RM5.

    If you need any more information, just tell me.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    please send me your process. I will take a look at it to see whether it's a design problem or a bug :)

    Greetings,
    Sebastian