Text mining
Hello. I am pursuing a master's degree in Business Information Management at Mersin University, and I use RapidMiner in my thesis. I will share my problem in detail on the link you sent. I have encountered two problems. I have 4,500 Turkish theses (about 150 pages each, so 150 × 4,500 pages) and 1,500 articles (about 20 pages each, so 1,500 × 20 pages), and I want to classify them with RapidMiner. But since the document count is so high, RapidMiner cannot complete the classification and constantly gives errors. How can I solve this problem? My PC has an i5 processor and 5 GB of RAM.
My second problem: I want to work with the most frequently used words in my corpus, but when I apply Stem (Snowball) to Turkish text, the suffixes are not stripped correctly and unrelated words come out with the same stem. So I cannot use the stemmer, and I end up with many words that have the same meaning counted as different terms. In short, I cannot make progress on my thesis. Can you help me?
Answers
One tip I can give you: to share your configuration and process as images, try File > Print/Export Image..., then choose the Design view and export the image.
You can also use the Loop Batches operator to reduce the amount of memory used while processing all your files.
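A rough Python sketch of the same batching idea (the folder path, file pattern, and batch size are assumptions for illustration):

```python
# Minimal sketch of batch-wise processing, analogous to Loop Batches:
# handle documents in small chunks so they never all sit in memory at once.
from pathlib import Path

BATCH_SIZE = 100  # assumption: tune this to fit your available RAM

def iter_batches(folder, batch_size=BATCH_SIZE):
    """Yield lists of file paths in fixed-size chunks."""
    files = sorted(Path(folder).glob("*.txt"))
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]

for batch in iter_batches("theses/"):  # assumed folder of plain-text theses
    for path in batch:
        text = path.read_text(encoding="utf-8")
        # ... tokenize/count this one document here, then let `text` be freed
```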
I don't know enough about Turkish and stemming to have suggestions for the second issue, but a web search for questions related to stemming in Turkish may turn up some helpful resources.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
I am not very good at text mining with RM; I feel more comfortable in Python. Stem (Snowball) actually works properly for English words, but it may not be the best choice for Turkish. If your documents are Turkish, you may need to use Stem (Dictionary), which requires a file of Turkish stemming patterns. There are good dictionaries for Turkish words that can be used in R/Python; you can search for one.
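To illustrate what Stem (Dictionary) does, here is a minimal Python sketch; the Turkish surface-form-to-stem entries are made-up placeholders, and a real dictionary would contain thousands of entries:

```python
# Minimal sketch of dictionary-based stemming: look each token up in a
# mapping from Turkish surface forms to a canonical stem.
# The entries below are placeholders, not a real stemming resource.
stem_dict = {
    "kitaplar": "kitap",   # "books" -> "book"
    "kitabın": "kitap",    # "of the book" -> "book"
    "evlerde": "ev",       # "in the houses" -> "house"
}

def dictionary_stem(tokens):
    # Fall back to the token itself when it is not in the dictionary.
    return [stem_dict.get(t.lower(), t) for t in tokens]

print(dictionary_stem(["Kitaplar", "evlerde", "okundu"]))
# ['kitap', 'ev', 'okundu']
```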
In your post on 25 March, you are showing the exa (ExampleSet) port result from your Process Documents operator. If you also connect the "wor" (word list) port to a res (result) port, you can see the TF-IDF counts. This will give you an idea about your documents so you can further transform your dataset.
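If you want to reproduce that word/TF-IDF table outside RapidMiner, here is a minimal scikit-learn sketch; the documents are placeholders:

```python
# Minimal sketch: build a TF-IDF table for a few documents and print the
# highest-weighted terms, similar to inspecting the "wor" (word list) output.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "neural network flood forecasting model",
    "flood forecasting with rainfall data",
    "deep neural network training",
]  # placeholder documents

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Average TF-IDF weight of each term across the corpus.
means = np.asarray(tfidf.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
for term, weight in sorted(zip(terms, means), key=lambda x: -x[1])[:10]:
    print(f"{term:15s} {weight:.3f}")
```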
I would also recommend the Filter Tokens (by Length) operator so you can cut many words at once after you examine the word table.
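The same length filter is easy to emulate in Python; the 4 to 25 character bounds below are an assumption, not values from your process:

```python
# Minimal sketch of Filter Tokens (by Length): keep only tokens whose
# length falls inside a chosen range (the 4-25 bounds are an assumption).
def filter_by_length(tokens, min_len=4, max_len=25):
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(filter_by_length(["ve", "model", "ağ", "sınıflandırma"]))
# ['model', 'sınıflandırma']
```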
I just made an example set of 5 academic papers about neural networks in flood forecasting. Is this something that can help you? You can further filter and select data for modeling.
Second, regarding classifying the documents, I missed how you are planning to do it. Are you using metadata-like information to classify them, or just processing the documents and feeding them to a k-NN model? If you can tell me more, I may be able to help.
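If you go the k-NN route, here is a hedged end-to-end sketch in Python (the texts and theme labels are invented placeholders; in RapidMiner the equivalent would be Process Documents feeding the k-NN operator):

```python
# Minimal sketch: vectorize labeled documents with TF-IDF and classify a
# new document with k-NN. Texts and theme labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "flood forecasting with neural networks",
    "customer segmentation in retail marketing",
    "deep learning for river flow prediction",
    "brand loyalty and consumer behavior",
]
themes = ["engineering", "business", "engineering", "business"]

model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(texts, themes)

print(model.predict(["neural network model of flood levels"]))
# ['engineering']
```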
Best,
Deniz
I have divided the individual articles and theses into three main themes according to their topics, divided the main themes into sub-themes, and applied k-NN after that.
Sure, we can discuss this in Turkish; I will write you a message.
Best