Classification and clustering of clients of a bank
Greetings to all members!
I have never used RapidMiner, I have no IT background, and I really need your help. I have a database of about 300 clients of a bank. The database has: name, county, age, civil status, children, active loans, home, higher education, income, company where they work. I have to sort these clients into 4 categories: A, B, C, D. Category A: customers who have a high salary and do not pose a risk. Category B: clients who receive credit, have active credit, and do not represent a risk of default. Category C: clients who find it harder to borrow and need a co-payer or a derogation from the bank. Category D: customers who are unlikely to be given credit.
What features of the application should I use to accomplish this project? The application should work like credit scoring: classify these customers and divide them into the four categories, to show which type of customer gets credit and which does not.
I would like to receive an answer from you, it would help me a lot to know how to start.
Thanks!
Answers
Hi @catsta,
Deciding whether a customer should or should not receive credit can be framed as either a classification or a clustering problem. Let's say you treat it as a clustering problem because you want your algorithm to be unsupervised (supervised algorithms only work when at least a subset of your data carries the known answer, i.e. a label).
First, check whether your data is consistent in terms of distribution. Next, consider which columns matter for such a decision (e.g. age, savings, months of service, amount to be loaned, whether the customer is self-employed, etc.) and [Select] the attributes that are most significant to you. Clean up your data, and use [Set Role] to mark one attribute as an ID so you can trace each example back once you have scored it. Pick a segmentation algorithm ([k-means], [x-means], or others) and begin experimenting with the number of clusters you need, adjusting your parameters.
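Outside RapidMiner, the same workflow (keep an ID, select a few attributes, cluster, then map results back to the ID) can be sketched in plain Python. This is only an illustration of the idea, not what the [k-means] operator does internally: the client IDs and values are made up, the two attributes are assumed to be already rescaled to 0..1, and the centroids are initialized from evenly spaced rows purely to keep the sketch deterministic (real implementations usually initialize randomly).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, max_iter=100):
    """Cluster rows into k groups; returns (labels, centroids)."""
    n = len(rows)
    # Assumption for this sketch: deterministic start from k evenly spaced rows.
    centroids = [list(rows[i * n // k]) for i in range(k)]
    labels = [None] * n
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical clients: ID -> [age, income], both already rescaled to 0..1.
clients = {
    "c01": [0.10, 0.15],
    "c02": [0.20, 0.10],
    "c03": [0.15, 0.20],
    "c04": [0.80, 0.90],
    "c05": [0.90, 0.85],
    "c06": [0.85, 0.95],
}
ids = list(clients)
rows = [clients[i] for i in ids]

labels, centroids = k_means(rows, k=2)
cluster_by_id = dict(zip(ids, labels))  # the ID lets you trace each result back
print(cluster_by_id)
```

Re-running with different values of `k` is the code equivalent of experimenting with the number of clusters.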
One thing to consider is that your attributes are on very different scales: age (e.g. from 18 to 99), salary (e.g. from 0 to 100000), etc. A distance-based algorithm would be dominated by the attribute with the largest range, so you might want to [Normalize] your data or, better, [Discretize] it (by user specification), so that you control how the discretization is done.
You may also want to understand which clusters represent which kinds of data by using a [Decision Tree]. More often than not, I take the output of the clustering algorithm and train a [Decision Tree] on it, with N+1 levels where N is the number of labels I have, to understand it.
If, on the other hand, you want to try this as a classification problem (which I wouldn't recommend given that you only have 300 examples to start from), you might want to label a few examples yourself (or even better, use an algorithm you trained yourself on the border conditions). Keep in mind, however, that a Decision Tree is prone to overfitting, and you cannot predict how data it has not seen before will be labeled, as it will take the best-matching branch that meets its criteria, so keep an eye on confidence levels and readjust here and there.
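To make the difference between the two options concrete, here is a small Python sketch of both ideas. It is not the RapidMiner operators themselves; the function names, bin boundaries and labels are assumptions chosen only for illustration.

```python
from bisect import bisect_right


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, boundaries, labels):
    """Map a value to a label using user-specified, sorted bin boundaries.

    A value equal to a boundary falls into the bin above it, so
    len(labels) must be len(boundaries) + 1.
    """
    return labels[bisect_right(boundaries, value)]


incomes = [0, 25000, 100000]
print(min_max_normalize(incomes))  # -> [0.0, 0.25, 1.0]

# Hypothetical age bins: under 30, 30 to 49, 50 and over.
age_boundaries = [30, 50]
age_labels = ["young", "middle", "senior"]
print([discretize(a, age_boundaries, age_labels) for a in (18, 45, 99)])
# -> ['young', 'middle', 'senior']
```

Normalization keeps the numeric detail but puts every attribute on the same scale; discretization by user specification throws detail away but lets you decide where the meaningful cut-offs are.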
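The idea of explaining clusters with a tree can be sketched in a few lines of Python: treat the cluster number as the label and fit a shallow tree on the original attributes. This is a minimal Gini-based tree written for illustration, not RapidMiner's [Decision Tree] operator; the data, attribute names and cluster assignments are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (0.0 means all labels are equal)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature_index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth):
    """Return a nested tuple (feature, threshold, left, right) or a leaf label."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority cluster
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
            build_tree([r for r, _ in right], [l for _, l in right], max_depth - 1))


def describe(tree, names, indent=""):
    """Render the tree as readable if/else lines."""
    if not isinstance(tree, tuple):
        return [f"{indent}cluster {tree}"]
    f, t, left, right = tree
    return ([f"{indent}if {names[f]} <= {t:g}:"] + describe(left, names, indent + "  ")
            + [f"{indent}else:"] + describe(right, names, indent + "  "))


# Hypothetical clients [age, income] and the cluster each one was assigned to.
rows = [[25, 20000], [30, 25000], [28, 22000], [50, 80000], [55, 90000], [60, 85000]]
clusters = [0, 0, 0, 1, 1, 1]

# Depth limit follows the N+1 rule of thumb above, with N = number of clusters.
tree = build_tree(rows, clusters, max_depth=len(set(clusters)) + 1)
print("\n".join(describe(tree, ["age", "income"])))
```

The printed rules give a human-readable description of what separates one cluster from another, which is usually what you need in order to decide whether a cluster corresponds to something like category A or D.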
Hope this helps.
By the way, I just remembered something that might be useful for you to begin understanding your problem. With RapidMiner Studio, when you are presented with the first screen, if you create a new process, you have predefined templates for certain common cases.
The "Credit Risk Modeling" template (light blue) uses an algorithm named "Support Vector Machine" (a supervised classifier, so it learns from labeled examples rather than discovering clusters) to help separate your data into groups. It is also available under "Repository Tab > Samples > Templates > Credit Risk Modeling". You might want to begin building your solution based on this example.
Hi @catsta @rfuentealba
Just to add my 5 cents to the topic, as I have had experience with credit scoring.
Though there are numerous studies on using cluster analysis for credit scoring, I am rather sceptical that clustering algorithms can give meaningful results in this area, at least quickly and easily. Credit scoring is a classification problem by nature, so you'd need historic data on clients' performance in order to build a classification model. Using clustering algorithms, you may get a good separation of different customer segments, but it's pretty hard to make sure that those segments actually represent different levels of credit risk.
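One way to check this, assuming you do have some historic outcomes, is to compare the default rate across clusters. The Python sketch below uses made-up cluster assignments and default flags; if the rates barely differ between clusters, the segments say little about credit risk.

```python
from collections import defaultdict


def default_rate_by_cluster(clusters, defaulted):
    """Return {cluster: share of its clients that defaulted}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for cluster, did_default in zip(clusters, defaulted):
        totals[cluster] += 1
        defaults[cluster] += int(did_default)
    return {cluster: defaults[cluster] / totals[cluster] for cluster in totals}


# Hypothetical data: cluster per client and whether that client defaulted.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
defaulted = [False, False, True, False, True, True, False, True]

rates = default_rate_by_cluster(clusters, defaulted)
print(rates)  # -> {0: 0.25, 1: 0.75}
```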
Vladimir
http://whatthefraud.wtf
Hi @catsta,
Before developing a RapidMiner process you should have some idea of how to solve the problem "on paper". There are different options, depending on which data is available. From what I read there are 2 options:
1. Train a decision model based on historic data, for example a database of clients that have defaulted and clients that haven't.
2. Apply decision rules coming from best practices / specialists.
The second option doesn't involve machine learning, so in that case RapidMiner wouldn't be necessary (but can be used for data manipulation if you want).
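To illustrate what the second option could look like, here is a short Python sketch of hand-written decision rules that assign the four categories. Every field name and threshold is invented for the example; real rules would have to come from the bank's credit policy or its specialists.

```python
def categorize(client):
    """Assign category A-D from fixed rules (all thresholds are hypothetical)."""
    income = client["income"]
    loans = client["active_loans"]
    owns_home = client["owns_home"]

    if income >= 5000 and loans == 0:
        return "A"  # high income, no outstanding credit
    if income >= 3000 and loans <= 2:
        return "B"  # solid income, manageable active credit
    if income >= 1500 and (owns_home or loans == 0):
        return "C"  # borderline: would need a co-payer or a derogation
    return "D"      # unlikely to be given credit


# Hypothetical clients keyed by ID.
clients = {
    "c01": {"income": 6000, "active_loans": 0, "owns_home": True},
    "c02": {"income": 3500, "active_loans": 1, "owns_home": False},
    "c03": {"income": 1800, "active_loans": 0, "owns_home": False},
    "c04": {"income": 900, "active_loans": 3, "owns_home": False},
}
categories = {cid: categorize(c) for cid, c in clients.items()}
print(categories)  # -> {'c01': 'A', 'c02': 'B', 'c03': 'C', 'c04': 'D'}
```

The rules are checked from strictest to loosest, so each client lands in the best category whose conditions are met.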
If you can tell us a little more, we can help you further.
Regards,
Sebastian