Relation between text and customer id
Hi,
I have a problem. I have one database table with the following columns: text, customer id and customer name.
Now I want to get a wordlist out of the text. So that I can see customer 1 has written the word "RapidMiner" five times and customer 2 has written "RapidMiner" four times and "Mining" three times.
Does anybody have an idea? Sorry for my bad English :-[
Thank you very much!
Answers
I could read every single row, build a wordlist from the text column of that row, append the wordlists of all rows to the database, and add the customer ID column to each.
I have to convert the wordlist to data in order to write it to the database.
How can I read every single row? I don't understand the loops.
How can I add the customer ID column?
Is that possible?
I need help !
It does not write a wordlist for every row. It writes the whole wordlist for all rows many times. Is there no one who could help me?
You can do it without using Loop Examples.
Here's an example.
regards
Andrew
Any idea?
If you have multiple rows per customer, you can merge them and then proceed. This is a slightly adapted version of the process awchisholm posted earlier.
Regards,
Marco
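For readers without the attached process, the merge step can be sketched in plain Python. This is only an illustration of the idea, not the RapidMiner process itself; the sample rows and the function name `merge_texts_by_customer` are made up.

```python
from collections import defaultdict

# Hypothetical rows as (customer_id, text); the data is invented for illustration.
rows = [
    (1, "RapidMiner is great"),
    (2, "Mining with RapidMiner"),
    (1, "RapidMiner again"),
]

def merge_texts_by_customer(rows):
    """Concatenate all texts that belong to the same customer id."""
    merged = defaultdict(list)
    for customer_id, text in rows:
        merged[customer_id].append(text)
    # Join with a space so words from different rows do not run together.
    return {cid: " ".join(texts) for cid, texts in merged.items()}

merged = merge_texts_by_customer(rows)
print(merged)
```

After this step there is exactly one text per customer, so the word counts come out per customer rather than per row.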
What is the best way to process this result so that I can write it to a database?
Is it possible to get it in a database table like this:
word        customer
bad         customer 1
rapidminer  customer 2
mining      customer 3
great       customer 2
mining      customer 2
mining      customer 2
Is that possible or does anybody have another idea?
Best regards
I have modified the process a bit more so that it only returns 3 columns (customer/word/count), which is exactly what you wanted to achieve in your original post.
Regards,
Marco
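The target layout (customer/word/count) can also be sketched outside RapidMiner. The following is a rough Python illustration, assuming one merged text per customer and a very simple tokenizer (lowercase letters only); the sample data and the name `word_counts` are invented.

```python
import re
from collections import Counter

# Hypothetical input: one merged text per customer.
texts_by_customer = {
    "customer 1": "RapidMiner is bad",
    "customer 2": "RapidMiner mining is great, mining mining",
}

def word_counts(texts_by_customer):
    """Return (customer, word, count) rows; words that never occur produce no row."""
    rows = []
    for customer, text in texts_by_customer.items():
        # Simplified tokenizer: lowercase the text and keep runs of a-z only.
        tokens = re.findall(r"[a-z]+", text.lower())
        for word, count in sorted(Counter(tokens).items()):
            rows.append((customer, word, count))
    return rows

rows = word_counts(texts_by_customer)
for row in rows:
    print(row)
```

Because the rows are built from the words that actually occur, no zero counts are ever created, which keeps the result small.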
That is exactly what I'm looking for.
What is the function of the attribute "^(?!created).*$" in the De-Pivot Operator?
Is it possible to drop the words with a count of "0" in the De-Pivot operator? I am asking because I get a memory error.
I also tested the Stream Database Operator. Any Ideas?
Thank you so much !
The "^(?!created).*$" is a regular expression that selects any attribute whose name does not start with "created", so the attribute called "created" is left out.
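The behaviour of the negative lookahead is easy to check in Python, whose `re` module treats this pattern the same way. The attribute names below are made up for the test. Note that the pattern rejects every name that begins with "created", not only the exact name.

```python
import re

# The pattern from the De-Pivot operator: match names that do not start with "created".
pattern = re.compile(r"^(?!created).*$")

# Hypothetical attribute names, chosen to show the edge cases.
attribute_names = ["created", "created_at", "rapidminer", "mining", "recreated"]

selected = [name for name in attribute_names if pattern.match(name)]
print(selected)  # ['rapidminer', 'mining', 'recreated']
```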
You could try setting the 0 values to missing before the De-Pivot operator as in the following process. I don't have your data so I can't be sure it will help.
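As a rough picture of what this does, here is a small Python sketch. It assumes the de-pivot step skips missing values; the wide table and the helper names `zeros_to_missing` and `de_pivot` are invented for illustration and are not RapidMiner API.

```python
# Hypothetical wide table: one row per customer, one column per word.
wide = [
    {"customer": "customer 1", "rapidminer": 5, "mining": 0},
    {"customer": "customer 2", "rapidminer": 4, "mining": 3},
]

def zeros_to_missing(row, id_key="customer"):
    """Replace 0 counts with None (missing); leave the id column alone."""
    return {k: (None if k != id_key and v == 0 else v) for k, v in row.items()}

def de_pivot(rows, id_key="customer"):
    """Turn wide rows into (customer, word, count) rows, skipping missing values."""
    long_rows = []
    for row in rows:
        for word, count in row.items():
            if word == id_key or count is None:
                continue
            long_rows.append((row[id_key], word, count))
    return long_rows

long_rows = de_pivot([zeros_to_missing(r) for r in wide])
print(long_rows)
```

Only three rows remain here instead of four; on a sparse word matrix the saving is far larger.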
Andrew
It works, but I don't think it helps me, because the "Declare Missing Value" operator runs for about 75 minutes and the test database table has only about 1,000 rows. Other tables have about one million rows.
Is it possible to change the regular expression in the "De-Pivot" operator so that it keeps only the words with a count > 0? Do you have any other ideas?
Thank you very much for your help!
Best regards
There are ways out of this but they are beginning to get advanced. The general approach I would take is to split the data into batches. One very simple way is to use Loop Examples as in the attached process, which has three options: the original, and 2 alternatives that use Loop Examples (comment out the ones you don't want). One of the two alternatives doesn't rely on missing values, so the use of Loop Examples might help with the original memory problem.
I don't have your data so I have no idea whether it will help, and you might have duplicate customers that will need aggregating. I will leave that as an exercise.
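The batching idea in general terms, as a Python sketch rather than the attached process: handle a few rows at a time and write each batch's result to the database before moving on, so the full de-pivoted table never has to sit in memory. The table name `word_counts`, the column names and the sample data are assumptions, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3

def batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def de_pivot_nonzero(batch, id_key="customer"):
    """Turn wide rows into (customer, word, n) rows, dropping zero counts."""
    return [
        (row[id_key], word, n)
        for row in batch
        for word, n in row.items()
        if word != id_key and n != 0
    ]

# Hypothetical wide table: one row per customer, one column per word.
wide = [
    {"customer": "customer 1", "rapidminer": 5, "mining": 0},
    {"customer": "customer 2", "rapidminer": 4, "mining": 3},
    {"customer": "customer 3", "rapidminer": 0, "mining": 0},
]

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("CREATE TABLE word_counts (customer TEXT, word TEXT, n INTEGER)")

for batch in batches(wide, batch_size=2):
    # Only one batch worth of long rows exists in memory at any time.
    conn.executemany("INSERT INTO word_counts VALUES (?, ?, ?)", de_pivot_nonzero(batch))
    conn.commit()

stored = conn.execute(
    "SELECT customer, word, n FROM word_counts ORDER BY customer, word"
).fetchall()
print(stored)
```

The trade-off is the one you would expect: more passes and more writes, so it is usually slower, but the peak memory use is bounded by the batch size.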
Andrew
Both loops take much more time than the original approach. The original takes about 45 seconds and the loops take about 10 minutes.
Is that normal?
Is there a way to make it faster, or another idea for throwing out the words with count = 0 in the original approach, either before or in the De-Pivot operator?
Thank you so much for your help !
I would be very interested in an exact comparison between the 2 loops because one uses the Missing Data method and the other doesn't. The Missing Data method should be slower. Could you say how many examples and how many attributes there are in the input data and could you say whether this is a test set or the full data?
Generally looping is slower, but the gain is that less memory may be used and so the whole process may (eventually) complete. The original problem was that you were running out of memory; do you also have a time constraint?
It may be that the way you are starting RapidMiner limits the available memory. If you search this forum you will find ways to check this. Of course, you could buy more memory.
regards
Andrew
Just a quick note: the trick via "Declare Missing Value" by awchisholm should actually be the most efficient way to do this, as that operator should not be slow. All it does is iterate over all selected attributes and, for each of them, over all examples. I just did this for 100,000 numeric attributes and 1,000 examples (i.e. 100 million values) on my dev machine, and it took about 30 seconds.
If process execution becomes really slow usually the cause is that RM Studio does not have enough memory available (click "View" -> "System Monitor" in the top menu bar to check) and therefore Java desperately tries to free some memory. If you're using RM Studio 5, all you need to do is let Studio use more memory (or execute it on a machine with more memory).
Regards,
Marco
How can I let RM Studio use more memory?
I changed the rapidminer.bat in the scripts folder but nothing happened.
I only have 4 GB on the machine. Do you think it is enough?
The smallest table has 2,100 examples with about 17,000 attributes as input for the "Declare Missing Value" operator.
Best regards
Okay, I think I found it. But how much memory should RM use?
I gave RM 1,500 MB but it uses only a little of it during the "Declare Missing Value" operator. Is it possible to make RM use the whole memory so that the "Declare Missing Value" operator finishes faster? It has been running for about 10 minutes and has not finished :-/ .
I am slowly starting to despair.
Or is there a completely different solution to get the relation between the words and the customer ID?
My test on a Core i5 with RapidMiner Studio 6 and 100,000,000 values took 30 seconds and used just above 3 GB of memory. It will use the available memory automatically if needed; for your small dataset it won't need much at all.
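To put the two dataset sizes side by side, a quick back-of-the-envelope calculation in Python. The 8 bytes per value is purely an assumption (a dense array of doubles) and says nothing about how RapidMiner actually stores the data; the function name `dense_size_mb` is made up.

```python
BYTES_PER_VALUE = 8  # assumption: one 64-bit double per cell, no overhead

def dense_size_mb(examples, attributes, bytes_per_value=BYTES_PER_VALUE):
    """Raw size of a dense examples x attributes table in megabytes (10**6 bytes)."""
    return examples * attributes * bytes_per_value / 1_000_000

small_values = 2_100 * 17_000   # the smallest table from this thread
test_values = 1_000 * 100_000   # the 100 million values of the test

print(small_values)                 # 35700000
print(small_values / test_values)   # 0.357
print(dense_size_mb(2_100, 17_000))
```

So the smallest table holds roughly a third as many values as the test set.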
I'm sorry, but I'm afraid I have no idea why your Declare Missing Value operator is taking ages.
Regards,
Marco
I hope that Andrew can help and has another idea...
On an example set of 652 examples and 5,068 attributes the time dropped from 520 seconds to 2 seconds.
On a larger example set of 65,248 examples and 5,081 attributes the time with Materialize Data was 170 seconds.
regards
Andrew