Please help with document classification (nearest neighbor)! Total RapidMiner novice here!
Hi everybody,
I just found out about RapidMiner a few days ago. I work for a nonprofit and have been interested in using data science to help sort grant applications. I think it could really help. I tried following a few tutorials and guides and eventually I got here (the attached process). But when I run the process, I get an accuracy of zero. I think there's something wrong with the categories. But yeah, I'm lost, because I don't really understand the application and have just been following instructions. But I'm hesitant to really get to know RapidMiner until I can see the results and know that investing time will get me where my organization needs to be.
I would really appreciate your help.
Thanks everyone
Answers
Can you provide your source Excel data file?
From inspecting the Read Excel operator, it looks like there is no metadata assigned. Did you go through the Import Wizard part of the operator? I would also select the attribute column that has the proposed text field and not 'all.' Also, I would try a Naive Bayes first. k-NN with k=1 is just asking for trouble.
Yes, what @M_Martin said, a sample data file would be helpful.
This is the Excel doc I've been using as my source data.
I don't know what an Import Wizard is. I'll try a Naive Bayes analysis, but I'm not very confident in my abilities at the moment.
Thanks for the responses!
What I want to understand is what is your ultimate goal with classifying this data?
I'm hoping to eventually be able to cluster / automatically classify documents so we can better identify patterns in grant applications. For example, if we find a large number of grant applications that mention "gardens", "mulch", "plants", etc. then we can identify which grantees are focused on urban farming and provide specialized assistance.
Without doing any classification work, I cleaned up the process a bit. The question remains: what are you trying to classify? Year? Organization?
Oh, I'm trying to classify the text. Text analysis is what I am attempting.
I get that, but what is the label? Meaning, you process all the text into TF-IDF vectors and then you want to use that information to later classify what? Organizations?
Your data set has three attribute columns: year, organization, and text. Do you want to learn a model from the text attribute to help you identify which organization it's coming from? Or which year?
Gotcha - I see what you're asking. I'm hoping to use the text to classify organizations.
Then you're going to have a problem. You have 60+ organizations and your data is too 'thin' to give you an accurate classification. You don't have enough example rows to train on. Hence the 0% accuracy. Try for a more reasonable number of classes, somewhere between 2 and 5 if possible, with more examples for each class.
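To make the "0% accuracy" point concrete, here is a minimal Python sketch (not the RapidMiner process from this thread). It assumes a toy 1-nearest-neighbor classifier that compares texts by shared words, and the rows, labels, and function names are all made up for illustration. When every label appears exactly once, the held-out row's label is never in the training rows, so leave-one-out accuracy cannot be anything but zero. Regrouping the same texts under a few topic labels gives the model something to match.

```python
from collections import Counter


def tokens(text):
    """Lowercased bag of words for one piece of text."""
    return Counter(text.lower().split())


def overlap(a, b):
    """Number of shared word occurrences between two bags of words."""
    return sum((a & b).values())


def leave_one_out_accuracy(rows):
    """rows is a list of (label, text). Each row is held out in turn and
    classified by its single nearest neighbor among the remaining rows."""
    correct = 0
    for i, (label, text) in enumerate(rows):
        rest = rows[:i] + rows[i + 1:]
        query = tokens(text)
        nearest_label, _ = max(rest, key=lambda r: overlap(query, tokens(r[1])))
        if nearest_label == label:
            correct += 1
    return correct / len(rows)


# Hypothetical rows: one text per organization (every label is unique).
one_per_org = [
    ("Org A", "community garden mulch plants"),
    ("Org B", "garden plants for youth"),
    ("Org C", "neighborhood cleanup litter"),
    ("Org D", "cleanup of neighborhood streets"),
]

# The same texts, relabeled with two hypothetical topic categories.
by_topic = [
    ("gardens", "community garden mulch plants"),
    ("gardens", "garden plants for youth"),
    ("cleanup", "neighborhood cleanup litter"),
    ("cleanup", "cleanup of neighborhood streets"),
]

print(leave_one_out_accuracy(one_per_org))
print(leave_one_out_accuracy(by_topic))
```

The toy data is far cleaner than real grant text, so the second number says nothing about what 86 real rows would achieve. The first number is the structural point: unique labels cannot be predicted from other rows.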
Would it help if each row had more text in it, rather than increasing the number of rows?
I'm afraid that won't help at all. In nearly all cases you have one piece of text for each organization. Even when you process out the stop words and prune things, the model can't properly classify it at all. Just guessing here, but you'll probably need 25 text entries per organization, which will increase your rows to over 1,700 examples. Even then you might get bad misclassification because a lot of what each organization writes appears to be similar to what the others write. The model gets confused.
Here's an idea. I'm guessing you're looking at which grants to go after for your organization. So look at your historical data and create two classes like "Go for this grant" or "Skip this grant." You can then heavily weight the organizations you got the grants from and then feed it into a classifier. Hopefully the model will then be able to learn the patterns for a grant that you want to go after.
What do you mean when you say I have one piece of text for each organization? I'm really confused how adding more text to each organization wouldn't provide more data for the model to find patterns.
I'm not looking to categorize by 'Go for this grant' or 'skip this grant'. I'm trying to categorize the grants we've already approved into groups like: community gardens, neighborhood cleanup, youth education, etc. Is this just outside of RapidMiner's capability with only 86 instances?
You don't have enough training data to do this. 86 training rows for 60+ different classes (organizations) will not work, regardless of whether you use RapidMiner or something else.
Why don't you add those categories like "community garden", "youth education", etc. to the data set and use that as your label instead? If that's fewer than 5 or so categories, then you might be able to get some better results. Still, 86 training rows is pretty low.
Hi Nathan:
Thanks for making your source data available. I have a few suggestions that I hope will be helpful.
Given that the dataset is fairly small, and that there are only three data fields, I think you could get to where you want to be if you were to further classify and segment your data with additional relevant "data adjectives". Essentially, this will greatly help you classify the requests you receive moving forward from organizations you are hearing from for the first time. I think the classifications you make yourself will in the long run be more relevant than what RapidMiner (or any other Data Mining Tool) would provide.
Yes, this will be time consuming and sometimes frustrating, but the segmentation choices you make yourself (and in collaboration with your fellow stakeholders) will, in my opinion, be very valuable to your organization in the long run, and will be very informative to you in the shorter term about the organizations you come in contact with.
There are many BI and data visualization tools (like Tableau, for example) that could produce great reporting and visually appealing analytical deliverables once you have a rich and descriptive data model in place.
As I said before, once you have defined a clear segmentation map of the requests you receive, and communicated this "segmentation map" to the people you work with, it will be much easier to classify organizations you come in contact with in the future, and these future classifications will be aligned with the segmentation policies you and your fellow stakeholders have developed.
This is akin to how manufacturers and retailers segment their products into categories and sub-categories as part of analyzing sales of products. Data Mining / Data Science can be very helpful, but it is not a replacement for a rich and descriptive data model, and you have an opportunity to do your organization a real service by taking on the challenge of building a "segmentation map" for classifying the organizations you work with. Once the data is rich and descriptive, Data Mining / Data Science can often help take your understanding of your data to the next level.
I spent a few minutes on a very rough draft attempt to classify some of the organizations in your data, which I think you (and your fellow stakeholders) could greatly improve upon. I've attached my attempt to this post as a .csv file, as the RapidMiner Studio forum does not allow posts in Excel format. You should be able to open this file in Excel and then save it as an Excel file.
Good luck in your worthwhile work, and best wishes.
Michael Martin
Hi Nathan:
One last thought related to my note of yesterday: in addition to adding "Project Classification" fields to your data model, you could also add one or more "Keyword Search" columns to your data model. I have attached a .csv file with a "Keyword Search" column to your original data.
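As a rough illustration of what a "Keyword Search" column could hold, here is a small Python sketch. It is not how the attached .csv was built. The category names come from earlier in this thread, "garden", "mulch", and "plants" are the words Nathan mentioned, and the remaining keywords plus the names `KEYWORDS` and `keyword_tags` are assumptions for the example.

```python
import re

# Hypothetical keyword lists; adjust them to your own segmentation map.
KEYWORDS = {
    "urban farming": {"garden", "gardens", "mulch", "plants"},
    "neighborhood cleanup": {"cleanup", "litter", "trash"},
    "youth education": {"youth", "students", "tutoring"},
}


def keyword_tags(text, keywords=KEYWORDS):
    """Return the sorted list of categories whose keywords appear in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, vocab in keywords.items() if words & vocab)


print(keyword_tags("We need mulch and plants for our youth garden."))
print(keyword_tags("Annual report"))
```

A row can match several categories or none, which is one reason to treat the result as a search aid and not as a final label.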
At the risk of being redundant, I believe that subject matter classification drives understanding and collaboration. Think of all the classification models related to biology, chemistry, psychology, and medicine - and they can be changed in light of new knowledge. A good classification model + good data science can sometimes greatly enhance the meaning and utility of the data.
Best wishes,
Michael Martin
Another approach here would be to ignore the organization entirely and just do unsupervised clustering. This can definitely be done using only the examples that you have, and you can specify the number of clusters you want. However, you will have little to no control over what clusters get generated as a result and they may not group things in the ways that you would like to think about them.
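For readers who want to see what "specify the number of clusters" means in practice, here is a minimal k-means sketch in plain Python. It is not a RapidMiner operator and makes several simplifying assumptions: texts become length-normalized word-count vectors, the first k rows seed the centroids (real tools usually pick starting points at random), and the four sample texts and all function names are invented for the example.

```python
import math
from collections import Counter


def vectorize(texts):
    """Turn each text into a unit-length word-count vector over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vec = [counts[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def kmeans(vectors, k, iterations=10):
    """Plain k-means. Assumption: the first k vectors are the starting centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its closest centroid (squared Euclidean distance).
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Move each centroid to the mean of the vectors assigned to it.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical grant snippets; no organization label is used anywhere.
texts = [
    "garden mulch plants",
    "cleanup litter streets",
    "plants garden beds",
    "litter cleanup crew",
]
labels = kmeans(vectorize(texts), k=2)
print(labels)
```

The cluster numbers carry no meaning of their own. Someone still has to read the members of each cluster and decide whether the grouping matches a category the organization cares about.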
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts