Open an existing project

Decrypter · July 2021

Hello,

I am a newbie, and I am trying to open an existing project.
The project has a lot of resources, and I am unable to open them (you can see that in the picture).
The second problem is that I can't open the .md files, which contain the data for the project.

As for the files with the .properties extension, I don't know how or where to use them.

For example, clustering.properties contains the following:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "">
<properties>
  <comment>Properties of repository entry Clustering</comment>
  <entry key="owner">zhaohengrui</entry>
</properties>

I searched a lot for something similar but didn't find anything. Apologies if this has been asked before and I missed it.

Thank you

Image: https://us.v-cdn.net/6030995/uploads/editor/da/8w21tx5lzfel.jpg

Image: https://us.v-cdn.net/6030995/uploads/editor/oy/v874dlujvikc.jpg

MarcoBarradas · July 2021

@Decrypter it seems to me that you need to create a repository out of that folder.

Follow the steps shown in the images.

Image: https://us.v-cdn.net/6030995/uploads/editor/u5/zyddah7xhxvb.jpg

Image: https://us.v-cdn.net/6030995/uploads/editor/58/j1yhfhmvjjb6.jpg

Image: https://us.v-cdn.net/6030995/uploads/editor/ge/ijhg6ktp9je6.jpg

Image: https://us.v-cdn.net/6030995/uploads/editor/kn/2g8c3dv970g5.jpg

Image: https://us.v-cdn.net/6030995/uploads/editor/ym/mziaouofflyb.jpg

Give it a name, click on the folder icon, and browse to your file folder.

That will load your whole file structure into RM so you can work with it.

MarcoBarradas · July 2021

@Decrypter

It seems that the stored objects in the folder are broken.

The good news is that you can rebuild the data sets from the Excel files provided in the DataSet folder.

You just need a Read Excel operator followed by a Nominal to Text operator; everything else can then be executed.

I'm uploading an example of what you will need to do, but everything is described in the Report of Automated Job.pdf file.

MarcoBarradas · July 2021

Hi @Decrypter

The issue you are having is due to the number of attributes the Process Documents operator is producing (more than 1.5k). That many attributes take a lot of memory and time when creating the clusters.

You need to do 2 things to fix that issue:

1- The PDF mentions that they applied a dictionary filter with a provided list of words. You can find those filter words in JobStopwrods.txt.
2- I suggest you prune the output of Process Documents using the prune by ranking method.
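
To make the two steps above concrete, here is a minimal sketch in plain Python rather than RapidMiner. It is only an illustration of the idea, not what the Process Documents operator does internally: the function names, the whitespace tokenizer, and the `drop_top` / `drop_bottom` parameters are all assumptions made for this example, and "rank" is taken here to mean rank by document frequency.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()


def build_vocabulary(documents, stopwords, drop_top=1, drop_bottom=1):
    """Return the terms kept after stopword filtering and rank-based pruning.

    Terms are ranked by document frequency (the number of documents that
    contain them). The `drop_top` most frequent and the `drop_bottom` least
    frequent terms are removed. Both parameters are hypothetical names.
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, skipping stopwords.
        terms = {t for t in tokenize(doc) if t not in stopwords}
        doc_freq.update(terms)
    # Most frequent first; ties are broken alphabetically so the order is stable.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    end = len(ranked) - drop_bottom
    return ranked[drop_top:end]
```

Fewer surviving terms means fewer attributes handed to the clustering step, which is where the memory and run time go.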

There are a couple of steps mentioned in their document that are not done in their process, so you'll need to fix a couple of things.

You'll have a better understanding of what is happening if you take the Text Mining course on our academy. It's free!
https://academy.rapidminer.com/learn/course/text-and-web-mining-with-rapidminer/text-and-web-mining/lets-get-started

Attached you'll find the second version of the process to get you started.

Have a great weekend.

MarcoBarradas · July 2021

Hi @Decrypter,

Please check the changes I made to the process I shared before.

The error you are getting is related to the type of the columns you are passing to Process Documents. In my process I read the Excel file and then apply a Nominal to Text operator before the Process Documents operator.

That operator tells RapidMiner that the two columns should be treated as text, which removes the error you are seeing.

For the second comment in your post, about the fine-tuning (changing cluster_1 and the others to other text), you'll need to use a Map operator. In it you can provide a list of words that will replace the values of the clustering output with whatever text you like.
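
As a rough picture of what that value replacement amounts to, here is a small Python sketch. It is not the Map operator itself; `rename_clusters`, the `cluster` column name, and the mapping entries are hypothetical, except that the report quoted elsewhere in this thread gives 'Cluster_3' to 'developer' as its example.

```python
# Hypothetical mapping from cluster ids to readable names.
# Only the cluster_3 -> developer pair is borrowed from the quoted report.
CLUSTER_NAMES = {
    "cluster_0": "sales",
    "cluster_3": "developer",
}


def rename_clusters(rows, mapping, column="cluster"):
    """Return copies of `rows` with the cluster id replaced by its name.

    Ids that are missing from `mapping` are left unchanged.
    """
    return [
        {**row, column: mapping.get(row[column], row[column])}
        for row in rows
    ]
```

Leaving unmapped ids untouched makes it easy to spot clusters that have not been named yet.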

If you have doubts about how any operator works, check the help provided with each operator. You can even see some examples in the lower part of the help text.

MarcoBarradas · July 2021

Hi @Decrypter
The process files under the folder 1 Clustering are not consecutive steps.
They are multiple analyses run on the same data set, Unlabelled Job Posting Dataset. You can create the same DataSet (DS) by running a process with the two operators I used in the process I shared, plus a Store operator.

Image: https://us.v-cdn.net/6030995/uploads/editor/nk/03lq0um5dsbl.jpg

Point that Store operator to the folder 1 Clustering; by doing that you'll be able to run the other processes without any error.
Please check the process I shared before for other adjustments you'll need to make before you run the Process Documents from Data operator. If you don't adjust them, the process may take all your memory.

You are getting closer.

MarcoBarradas · July 2021

Hi @Decrypter

The files you downloaded do not seem to be final versions.

For the first error, the issue is with how the output of the Clustering operator is connected; it needs to be connected in another way.

Check the Help for that operator.

For the second error, you'll need to have a label (the column you want to predict). Again, the process shown here is wrong.
It needs to work with the output of the Clustering process,
/1 Clustering/Labelled Job Posting Dataset (K-Means). In the PDF they mention they are going to create a model to predict the type of job offer. That would be the label.

The text mentions that you need to convert each cluster to a word.
For that you can add a Map operator with each cluster value and the word that should replace it.

I will stop my help at this point, since with these examples you have enough answers to adapt all the other processes that you'll open throughout the folders.

I strongly recommend going to https://academy.rapidminer.com/
for more in-depth videos on how to achieve the multiple tasks your project needs.

Your process should look like the image below

Image: https://us.v-cdn.net/6030995/uploads/editor/q0/y4jvidneqt6c.png

Enjoy the weekend.

MarcoBarradas · July 2021

@Decrypter
You need to store the word list output from the Clustering process with a Store operator as a DS.
Then you'll need to connect that DS to the wor input port of Process Documents (it tells the operator which columns you want to keep). Remember that the data you score (ResumeData) must have the same number of attributes, with the same names and types, as the data the model was trained on.
You will also need to set the label attribute as a label with the Set Role operator.
Check that all the processes point to the folder in which the data is stored on your computer. The process I have shared should help you understand the changes you need to make to the processes stored in your RapidMiner repository.
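
The reason for reusing the stored word list can be shown with a tiny Python sketch. This is not RapidMiner code and only assumes simple term counts; `vectorize` and the example words are made up for illustration. The point is that a fixed word list always yields the same columns, in the same order, whatever text is being scored.

```python
def vectorize(text, wordlist):
    """Count occurrences of each word-list term in `text`.

    The result has exactly one entry per term in `wordlist`, in the same
    order. Words that are not in the list are ignored, and listed words
    that do not occur get a count of 0.
    """
    tokens = text.lower().split()
    return {term: tokens.count(term) for term in wordlist}
```

If the scoring data were vectorized without the word list, it would get its own set of columns and the model would report a mismatch.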

kayman · July 2021

You can ignore the properties files; these are recreated when loading the project. The md files should be opened from within RM Studio (load data operator).
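
For anyone curious what such a .properties file actually holds, here is a small Python sketch that reads the clustering.properties content posted at the top of this thread using only the standard library. The `read_properties` helper is a name made up for this example. It shows the file carries only repository metadata (here, the owner) and no project data.

```python
import xml.etree.ElementTree as ET

# Content of clustering.properties as posted in this thread.
PROPERTIES_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "">
<properties>
<comment>Properties of repository entry Clustering</comment>
<entry key="owner">zhaohengrui</entry>
</properties>
"""


def read_properties(xml_bytes):
    """Return the key/value pairs stored in the <entry> elements."""
    root = ET.fromstring(xml_bytes)
    return {entry.get("key"): entry.text for entry in root.iter("entry")}


print(read_properties(PROPERTIES_XML))
```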

Decrypter · July 2021

Thank you for your answer. Where can I find the load data operator?

Decrypter · July 2021

Thanks @MarcoBarradas. I already tried to create it, without any success. I think the problem is in reading the md files.
Here again is the project which I want to run:
https://github.com/superhen/Automated-Job-Resume-Matching-Solution

Decrypter · July 2021

@MarcoBarradas
Thank you for the help. I just tried this new file, but I got this error.
I have 32 GB of memory, and I increased the maximum memory in RapidMiner Studio to 999999999.
But I always get this error!

Image: https://us.v-cdn.net/6030995/uploads/editor/33/wbrqfq97gn9p.png

Decrypter · July 2021

Thank you @MarcoBarradas for your help and for the course; I will take it for sure.

The process generates a file called "Labelled Job Posting Dataset (K-Means)", and when I use it in "1.1.1 Project_Clustering_K-Means_Performance" I get this error again:

Image: https://us.v-cdn.net/6030995/uploads/editor/6z/f84wmepgd6bk.jpg

In the document, I don't know how they did the classification on page 8, figures 10 and 11. They talk about "looking into the top15 most frequent words of each cluster" and "apply fine tuning on them, transform cluster number (i.e. ‘Cluster_3’) to nominal name (i.e. ‘developer’) and conduct a modified labelled job posting dataset".

I think that's the problem, because the file generated by your process is very different from figures 10 and 11 in the document.

Just bear with me please; maybe this can help someone in the future too.

Decrypter · July 2021

Hey @MarcoBarradas

Thank you for your explanation.
Indeed, I used the last process you shared with me (Clustering_V2.rmp). If you run it, you get a generated file named "Labelled Job Posting Dataset (K-Means)"; up to that point there is no problem.

But in the next steps I need to work with this generated file to continue my project. If you take the process "1.1.1 Project_Clustering_K-Means_Performance.rmp" and try to run it using the generated file, you will get the error I showed you in the previous comment.

Decrypter · July 2021

Hey @MarcoBarradas

Indeed, I did what you mentioned, but I get that error.
Attached is the process which I modified according to your solution.

I did the same thing for the 2 Classification processes, and I get the same error.

Image: https://us.v-cdn.net/6030995/uploads/editor/3d/6i1el5uwylke.jpg

Decrypter · July 2021

@MarcoBarradas and here is the process where I applied your idea, but I got this error again:

Image: https://us.v-cdn.net/6030995/uploads/editor/25/myevsoh5kjel.jpg

Decrypter · July 2021

Dear @MarcoBarradas
Thank you very much, you have already helped me a lot. I really appreciate that!!

There are just 2 things I hope you can help me with.

1- The first: in your file Clustering_V3, I get this error:

Image: https://us.v-cdn.net/6030995/uploads/editor/xm/xf1cz7bhx5hd.jpg

2- The second one is in 2.1.2 ResumeDataSet_Processing:

Image: https://us.v-cdn.net/6030995/uploads/editor/dc/5lzn3q4i3gwk.jpg

I am sorry for asking so much. Just bear with me, it's my final graduation project.

Thanks a lot

Decrypter · July 2021

Thank you @MarcoBarradas

I just tried doing that now, but without any luck:

Image: https://us.v-cdn.net/6030995/uploads/editor/1n/czra1anrl7h1.jpg

I generated a new ResumeData using the clustering (k-means), and I uploaded this new ResumeData in 2.1.2 ResumeDataSet_Processing.
And again I got this new mismatch.

This is the final step, to match the ResumeDataSet. Please, I would really appreciate one final try!!

Decrypter · July 2021

Problem solved using @MarcoBarradas's solution.
