Donor Analysis
Hi!
I'm doing a donor (customer) analysis for my master's thesis and I hope you can help me, as I'm not very experienced with RapidMiner. I have data from three different departments (dialogue marketing, campaign team, online marketing) of an NPO, as they don't have a central data warehouse yet. I have already matched the three data sheets and done some data preparation.
My problem now is that I don't know which operator should be the final one in my process, and therefore what my next steps are.
I have the following data about the donors: donorID, e-mail, zipcode, gender (man/woman/family), creation date, product status (we distinguish 9 products, e.g. "godfather", "member", "protector"), origin (e.g. "internet", "mailing"), total donation, number of donations and date of birth.
I want to find new insights in the data. The complete data has never been analysed. The three departments have different goals. The dialogue marketing team tries to get high amounts of donations. The campaign team wants a lot of signatures for petitions. The online marketing team wants people to subscribe to the newsletter. I want to find the donors who donated the biggest amount of money. Maybe donors who are also subscribed to our newsletter donate more money, or maybe not. Maybe donors who are above 40, signed a petition and are from a specific region donate a lot of money.
Is it better to have different data sheets (e.g. matched donors from the dialogue and online marketing teams) or to use only one big one (with columns: newsletter TRUE/FALSE, campaign TRUE/FALSE)? Which operators should I use to analyse the data?
I also have some questions about data preparation. I want to transform the date of birth into an age. Is there an operator that calculates the age using the current date? Is there an operator I can use to generate age groups (e.g. 18-25, 26-35, 36-45, ...)?
The zipcodes consist of five digits (Germany). To get a bigger region, I'd like to use only the first two digits. Which operator can I use to cut off the last three digits?
Thanks in advance for your help!
Tim
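The three preparation steps asked about above (age from date of birth, age groups, two-digit region) are simple enough to sketch outside RapidMiner. The Python snippet below only illustrates the logic and is not a RapidMiner operator. The function names, the age-group boundaries beyond 36-45 and the sample values are assumptions made for the example.

```python
from datetime import date

# Assumed group boundaries; only 18-25, 26-35 and 36-45 come from the question.
AGE_GROUPS = [(18, 25), (26, 35), (36, 45), (46, 55), (56, 65)]


def age_on(birth: date, today: date) -> int:
    """Completed years between the date of birth and the reference date."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def age_group(age: int) -> str:
    """Map an age to a labelled group such as '26-35'."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "under 18" if age < 18 else "66+"


def zip_region(zipcode: str) -> str:
    """First two digits of a five-digit German zip code, kept as text.

    zfill restores a leading zero that is lost when zip codes are
    stored as numbers (e.g. 1067 instead of 01067).
    """
    return zipcode.strip().zfill(5)[:2]


# Hypothetical donor, for illustration only.
donor_age = age_on(date(1980, 6, 15), date.today())
print(donor_age, age_group(donor_age), zip_region("44137"))
```

Keeping the region as text rather than as a number matters later on: "44" is a code for an area, not a quantity that can be averaged.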
Answers
Hi,
I'll simplify my question, as it was very long:
I have a list with the following data about my customers/donors: ID, e-mail, region (first two digits of the zipcode), gender (man/woman/family), product status (9 different products), origin (e.g. "internet", "mailing"), total donation, number of donations, age, newsletter subscriber ("TRUE"/"FALSE"), campaign subscriber ("TRUE"/"FALSE").
I want to find new target groups, e.g. it could be that men with product 1 and an age between 25 and 35 donate the highest amount of money. Is there an operator that checks all my attributes and finds new target groups for me? My biggest problem is that most of the operators can't handle attributes with different data types (I have integer, real, binominal, polynominal).
I attached my current process, which is only data preparation. I set the ID and label roles and replaced missing values.
Best regards
Tim
So you could look at doing a cluster analysis to see some potential "target groups". You could use k-means or x-means to group your data into statistical 'blobs', if you will. One group might show that gender or age is a heavy factor; from there you can subset the data and inspect what's going on.
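To make the k-means idea concrete, here is a minimal plain-Python sketch of the algorithm, not RapidMiner's implementation. It assumes every attribute has already been converted to numbers on a comparable scale. The sample points (scaled age, scaled total donation), the `initial` parameter and the fixed seed are assumptions for illustration.

```python
import random


def kmeans(points, k, initial=None, iterations=50, seed=0):
    """Plain k-means on equal-length numeric tuples.

    Returns (centroids, labels). If no initial centroids are given,
    k distinct points are drawn at random with a fixed seed.
    """
    rng = random.Random(seed)
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return centroids, labels


# Hypothetical donors as (scaled age, scaled total donation).
points = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The centroids are what you would inspect to describe each 'blob': a centroid with a high value in the donation column points at a group worth a closer look.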
Hi Tim,
I would first consider turning this whole problem into a supervised learning problem. One option might be: predict how much a donor is willing to give.
This information can be used if you recruit a new donor. It might also be used to target "under-performing" donors.
Best,
Martin
Dortmund, Germany
I think that Generalized Linear Models could be a good fit for your problem, provided you have a threshold or categories for the different donation amounts (you may have to categorize the label attribute). You get interpretable coefficients out of it, which is a big plus.
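Categorizing the label could look roughly like the following Python sketch. The thresholds and category names are made up for illustration; sensible cut-offs would have to come from the actual distribution of donations.

```python
from bisect import bisect_right

# Hypothetical thresholds (in euros) and category names.
THRESHOLDS = [50, 200, 1000]
CATEGORIES = ["low", "medium", "high", "top"]


def donation_category(total: float) -> str:
    """Map a total donation to a category.

    A value equal to a threshold falls into the higher category,
    so 'medium' covers 50 up to (but not including) 200.
    """
    return CATEGORIES[bisect_right(THRESHOLDS, total)]


for amount in (20, 50, 199.99, 200, 2500):
    print(amount, donation_category(amount))
```

With a binary version of this (one threshold, two categories) the label becomes a yes/no target such as "large donor", which is the form a classification model expects.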
I still have problems with my analysis.
1.) For clustering, I have to normalize my data. But how can I normalize text? I have different products, but saying product 1 is "0.1" and product 4 is "0.4" doesn't make sense. How do I handle that? Is it a correct normalization if I set a "0" for "men" and a "1" for "women"? My region attribute consists of the first two digits of the zip code (Germany). How can I normalize that?
2.) I want to find clusters of customers who spend the most money. Do I have to set my "money spent" attribute as the label? When I do that, I can't see how much money these clusters spend on average. How can I see that?
3.) I'd like to see a scatter plot to get a graphical overview of my clusters. In my process I can't see one. Why is that?
I attached my process XML. I only wanted to look at the attributes "money spent", "region" (first two digits of the zip code), "products" and "gender".
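On points 1.) and 2.), one common way to handle nominal attributes such as product, gender or region is dummy (one-hot) coding: each category gets its own 0/1 column, so no artificial order like "product 4 > product 1" is introduced. Numeric attributes can then be min-max scaled to the same 0-1 range. The average amount per cluster can be computed afterwards from the cluster assignment. The Python sketch below only illustrates that logic; the function names, sample donors and cluster ids are assumptions, and it does not reproduce any particular RapidMiner operator.

```python
def one_hot(values):
    """Dummy-code a nominal attribute: one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def min_max(values):
    """Scale numeric values linearly to the range 0..1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - low) / (high - low) for v in values]


def mean_by_cluster(cluster_ids, amounts):
    """Average amount per cluster id."""
    totals = {}
    for cid, amount in zip(cluster_ids, amounts):
        running_sum, count = totals.get(cid, (0.0, 0))
        totals[cid] = (running_sum + amount, count + 1)
    return {cid: s / n for cid, (s, n) in totals.items()}


# Hypothetical donors, for illustration only.
gender = ["man", "woman", "family", "man"]
money_spent = [10.0, 30.0, 100.0, 60.0]
cluster_ids = [0, 0, 1, 1]  # assumed output of a clustering step

print(one_hot(gender))
print(min_max(money_spent))
print(mean_by_cluster(cluster_ids, money_spent))
```

The same dummy coding applies to the two-digit region: it is a code for an area, so scaling it like a number would suggest that region 44 is "twice" region 22.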