Confused how to approach my data, to start by Clustering? or Prediction directly? or a better idea?

Gonfiaf_Zuraik · October 2018

Dear all,

I am working with a dataset that contains more than 8456 rows and 26 columns. The data is about projects that take place in Europe; each row is a project.

These are the columns:

Office

Office Country

Competence

Executive competence

Classification

Enquiry date

Creation date

Confirmation date

Proposal Date

Final invoice sent date

Intermediary

Customer ID

Customer

Event

Group name

Reference code

Start date

End date

Project manager

Main contact

Via sales contact

Project location

Project country

Heard About Us

Source Market

Client Kind

Client Sector

Region

Market

Lead Sent to

Event Frequency

Pipeline Future Projects

Initial Pax

Estimated turnover

Estimated costs

Estimated profit %

Status

Pax

Net turnover

Net costs

Gross profit

Gross profit %

Net profit

Net profit %

Agency commissions

Supplier commissions

Cancellation/Rejection reason

Cancellation date

Remarks

Controlled

Financial Regime

Currency

Exchange Rate

Payment status %

Required(Net)

Required

Invoiced

To invoice

Receipt

To pay

Custom invoices

Balance carried forward

Comments to low margin

Debits

Assets

Balance

TO Inv.

TO Acc.

TO Total

Cost Eff.

Cost Man.

Cost Acc.

Cost Total

For privacy reasons I cannot share the data itself, so I created some imaginary data just for illustration:

Office	Office Country	Competence	Executive competence	Classification	Enquiry date	Creation date	Confirmation date	Proposal Date	Final invoice sent date	Intermediary	Customer ID	Customer	Event	Reference code	Start date	End date	Project manager	Project location	Project country	Heard About Us	Source Market	Client Kind	Client Sector	Region	Initial Pax	Estimated turnover	Estimated costs	Estimated profit %	Status	Pax	Net turnover	Net costs	Gross profit	Gross profit %	Net profit	Net profit %	Agency commissions	Supplier commissions	Cancellation/Rejection reason	Cancellation date	Remarks	Controlled	Financial Regime	Currency	Exchange Rate	Payment status %	Required(Net)	Required	Invoiced	To invoice	Receipt	To pay	Custom invoices	Balance carried forward	Debits	Assets	Balance	TO Inv.	TO Acc.	TO Total	Cost Eff.	Cost Man.	Cost Acc.	Cost Total
Saint Louis	Senegal	BL	Saint Louis	Unknown	22.02.2016	08.04.2016	08.04.2016	23.02.2016	08.04.2016		11896	Customer2	zina 2016	code e1 2	15.04.2016	16.04.2016	Maya	Saint Louis 1 hall	Senegal		BL	Agency	Other		35	0	0	0	Completed	35	1.950	1.486	463	24	122	6	0	0					Input/Output	EUR	1	100	1.950	2.321	2.321	0	2.321	0	0	0	0	0	0	1.950	0	1.950	0	0	1.487	1.487
Saint Louis	Senegal	BL	Saint Louis	Other	08.06.2016	08.07.2016	08.07.2016	14.06.2016	25.07.2016		43	Customer3		code e1 3	07.07.2016	07.07.2016	Maya	Saint Louis	Senegal		BL	Agency	Other		0	200	0	100	Completed	0	297	9	288	97	236	79	0	0					Input/Output	EUR	1	100	297	354	354	0	354	0	0	0	0	0	0	297	0	297	0	0	9	9
Saint Louis	Senegal	BL	Saint Louis	Embassy	19.05.2016	20.05.2016	04.08.2016	04.08.2016	04.08.2016		1978	Customer4	leab 2016	code e1 4	11.09.2016	16.09.2016	Laura	Saint Louis	Senegal		BL	Agency			32	12.000	0	100	Completed	32	9.614	7.416	2.197	23	515	5	0	0					Input/Output	EUR	1	100	9.614	11.441	11.441	0	11.441	0	0	0	0	0	0	9.614	0	9.614	0	0	7.417	7.417
Saint Louis	Senegal	BL	Saint Louis	Embassy	20.05.2016	21.05.2016	28.06.2016	28.06.2016	04.08.2016		1978	Customer5	leab 2016	code e1 5	12.09.2016	16.09.2016	Laura	Saint Louis	Senegal		BL	Agency			12	4.500	0	100	Completed	12	4.550	3.526	1.024	22	227	5	0	0					Input/Output	EUR	1	100	4.550	5.415	5.415	0	5.415	0	0	0	0	0	0	4.550	0	4.550	0	0	3.526	3.526
Saint Louis	Senegal	BL	Saint Louis	Unknown	21.03.2016	01.04.2016	15.06.2016	01.04.2016	28.11.2016		807	Customer6	festival 2016	code e1 6	23.09.2016	25.09.2016	Martin	Saint Louis	Senegal		BL	Agency			20	18.000	0	100	Completed	20	11.276	9.676	2.104	19	130	1	0	503					Input/Output	EUR	1	100	11.277	12.815	12.815	0	12.815	0	0	0	0	0	0	11.277	0	11.277	0	0	9.676	9.676
Saint Louis	Senegal	BL	Saint Louis	Unknown	28.06.2016	29.06.2016	10.08.2016	10.08.2016	14.09.2016		43	Customer7		code e1 7	04.10.2016	05.10.2016	Laura	Saint Louis	Senegal		BL	Agency	Other		30	6.000	0	100	Completed	30	4.789	3.778	1.011	21	173	4	0	0					Input/Output	EUR	1	100	4.790	5.700	5.700	0	5.700	0	0	0	0	0	0	4.790	0	4.790	0	0	3.779	3.779
Saint Louis	Senegal	BL	Saint Louis	Unknown	05.08.2016	06.08.2016	10.08.2016	10.08.2016	10.08.2016		2374	Customer8		code e1 8	04.10.2016	06.10.2016	Laura	Saint Louis	Senegal		BL	Agency	Other		2	1.500	0	100	Completed	2	2.007	1.753	254	13	-97	-5	0	0					Input/Output	EUR	1	100	2.008	2.228	2.228	0	2.228	0	0	0	0	0	0	2.008	0	2.008	0	0	1.753	1.753
Saint Louis	Senegal	BL	Saint Louis	Incentive	01.09.2016	02.09.2016	29.11.2016	06.09.2016	02.11.2016		535	Customer9		code e1 9	19.10.2016	20.10.2016	Larissa	Saint Louis	Senegal		BL	Agency	Other		15	2.700	0	100	Completed	15	2.240	1.736	503	22	111	5	0	0					Input/Output	EUR	1	100	2.240	2.666	2.666	0	2.666	0	0	0	0	0	0	2.240	0	2.240	0	0	1.737	1.737
Saint Louis	Senegal	BL	Saint Louis	Incentive	22.09.2016	12.10.2016	23.11.2016	14.10.2016	07.11.2016		43	Customer10		code e1 10	19.10.2016	20.10.2016	Maya	Saint Louis	Senegal		BL	Agency	Other		25	1.000	0	100	Completed	25	2.360	1.433	926	39	513	22	0	0					Input/Output	EUR	1	100	2.360	2.808	2.808	0	2.808	0	0	0	0	0	0	2.360	0	2.360	0	0	1.434	1.434
Saint Louis	Senegal	BL	Saint Louis	Incentive	05.07.2016	06.07.2016	11.01.2017	12.07.2016	04.11.2016		535	Customer11		code e1 11	21.10.2016	22.10.2016	Larissa	Saint Louis	Senegal		BL	Agency	Other		24	4.500	3.500	22	Completed	24	7.513	6.404	1.109	15	-206	-3	0	0					Input/Output	EUR	1	100	7.514	8.791	8.791	0	8.791	0	0	0	0	0	0	7.514	0	7.514	0	0	6.405	6.405

With this data, I want to run analyses and predictions/classifications to gain new insights and contribute something useful. I am using this company data as the basis for my master's thesis.

I need to carry out a data mining process: for example, predicting next year's Net turnover, or clustering the projects to gain new insights.

I am fairly new to RapidMiner and I am struggling to choose the right path to start with.

I thought about generating two new columns at the beginning (inside Turbo Prep). One column called

"Year" = the year of each project

and another column

"Project length" = the number of days each project lasts

Could you please tell me whether I can reach a satisfying result with the attributes I have? Do you have any ideas? I am stuck in the middle, with too much data and too many dilemmas in my head, which keeps me from concentrating and choosing the right approach.

That's why I need some fresh ideas, motivation, and recommendations, please.

I thought about clustering first and getting insights from the resulting clusters, and then continuing with a decision tree model that predicts next year's net turnover, for example. (It can be something other than predicting the turnover if you have another idea; I'm open to everything.)
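The two derived columns mentioned above (year and project length) could be sketched outside RapidMiner roughly as follows. This is only an illustration in plain Python: it assumes the dates are stored as `dd.mm.yyyy` text, as in the imaginary rows, and that a project starting and ending on the same day counts as one day. The helper name `add_year_and_length` and the output column name `Project length (days)` are made up for this example.

```python
from datetime import datetime

# Assumed date layout, taken from the illustrative rows (e.g. "15.04.2016").
DATE_FORMAT = "%d.%m.%Y"


def add_year_and_length(project):
    """Return a copy of one project row with two derived fields added."""
    start = datetime.strptime(project["Start date"], DATE_FORMAT)
    end = datetime.strptime(project["End date"], DATE_FORMAT)
    enriched = dict(project)
    enriched["Year"] = start.year
    # Count both the first and the last day, so a one-day project has length 1.
    enriched["Project length (days)"] = (end - start).days + 1
    return enriched


# Three rows copied from the illustrative data (only the columns needed here).
projects = [
    {"Reference code": "code e1 2", "Start date": "15.04.2016", "End date": "16.04.2016"},
    {"Reference code": "code e1 3", "Start date": "07.07.2016", "End date": "07.07.2016"},
    {"Reference code": "code e1 4", "Start date": "11.09.2016", "End date": "16.09.2016"},
]

enriched_projects = [add_year_and_length(p) for p in projects]
for p in enriched_projects:
    print(p["Reference code"], p["Year"], p["Project length (days)"])
```

Whether the year should come from the start date, the enquiry date, or the confirmation date depends on what "next year's turnover" is supposed to mean, so that choice is worth settling before modelling.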

I tried Auto Model and clustering, but I am not getting any useful results. I guess there might be two reasons for this:

1. I do not know exactly how to approach this procedure, and I am missing something.

or

2. The data I have is not good enough for this type of approach.

Any help, please?

@sgenzer @jczogalla @David_A @mschmitz @stevefarr @Pavithra_Rao

Tons of thanks and gratitude.

Kind regards,
Jana

Telcontar120 · November 2018

You could start with some simple exploratory data analysis to see the relationships between your attributes. How about some simple weighting by correlation or by information gain?
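As a rough idea of what weighting by correlation does, here is a minimal plain-Python sketch (not RapidMiner's own operator). It ranks two numeric attributes by the absolute Pearson correlation with Net turnover, using the ten illustrative rows from the question. It assumes that values such as `1.950` use `.` as a thousands separator, and ten imaginary rows are of course far too few to conclude anything.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values copied from the illustrative rows; "1.950" is read as 1950 (assumption).
attributes = {
    "Initial Pax": [35, 0, 32, 12, 20, 30, 2, 15, 25, 24],
    "Estimated turnover": [0, 200, 12000, 4500, 18000, 6000, 1500, 2700, 1000, 4500],
}
net_turnover = [1950, 297, 9614, 4550, 11276, 4789, 2007, 2240, 2360, 7513]

# Weight = absolute correlation with the label; a larger weight means a stronger linear relationship.
weights = {name: abs(pearson(values, net_turnover)) for name, values in attributes.items()}
ranking = sorted(weights.items(), key=lambda item: item[1], reverse=True)

for name, weight in ranking:
    print(f"{name}: {weight:.2f}")
```

Correlation only captures linear relationships between numeric attributes; for nominal attributes such as Classification or Client Kind, information gain is the more natural weighting.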
You could also use clustering to see what kind of patterns are in the data. You should also look for outliers.
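One common way to look for outliers in a single numeric attribute is the interquartile-range (IQR) rule. The sketch below is only one of several possible approaches, written in plain Python. The 1.5 multiplier is the usual convention rather than anything specific to this data, and the value 60000 in the second call is a made-up number added purely to show a flagged case.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return the values lying more than `factor` * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]


# Net turnover from the ten illustrative rows ("1.950" read as 1950, an assumption).
net_turnover = [1950, 297, 9614, 4550, 11276, 4789, 2007, 2240, 2360, 7513]

outliers_sample = iqr_outliers(net_turnover)
# Hypothetical extra project with an invented turnover, for demonstration only.
outliers_with_extreme = iqr_outliers(net_turnover + [60000])

print(outliers_sample)
print(outliers_with_extreme)
```

A flagged value is not necessarily an error; a very large project may be genuine, so it is worth checking such rows with the data owners before dropping them.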
Another option would be to reformulate your target label; predicting a continuous numerical value (like net turnover) is sometimes more difficult. Could you redefine it as a classification problem, by setting a threshold level of net turnover and then assigning a class (either above that level or below it)?
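A minimal sketch of that relabelling in plain Python is shown below. The threshold here is simply the median of the illustrative Net turnover values, which is an assumption made for the example; in practice a business-driven cut-off (for instance a minimum turnover that makes a project worthwhile) may be more meaningful. The class names `high` and `low` are also just placeholders.

```python
from statistics import median


def to_classes(values, threshold):
    """Turn a numeric label into two classes: above the threshold or not."""
    return ["high" if v > threshold else "low" for v in values]


# Net turnover from the ten illustrative rows ("1.950" read as 1950, an assumption).
net_turnover = [1950, 297, 9614, 4550, 11276, 4789, 2007, 2240, 2360, 7513]

# Using the median as threshold gives two classes of roughly equal size.
threshold = median(net_turnover)
labels = to_classes(net_turnover, threshold)

for value, label in zip(net_turnover, labels):
    print(value, label)
```

Splitting at the median keeps the classes balanced, which makes accuracy easier to interpret; a different threshold may produce imbalanced classes that need extra care during evaluation.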
Without seeing your actual data, it is almost impossible to say whether there is enough predictive power in your attributes to do a good job predicting your outcome. But these are a few other things you should try.

M_Martin · November 2018

Hi: In addition to the great advice from Telcontar120, perhaps it would also be a good idea to ask the people who gave you the data (if you haven't already) how they collected the data, the meanings of all of the data fields, and what they are hoping you might find and why, and how whatever you find out will actually be used. This might help you formulate and set goals as to what exactly you would like to learn or need to learn from exploring the data. If there's anyone you could talk to who has experience managing or has worked with people involved in some of the projects, this might give you some ideas.
If they just gave you the data and said "Find something interesting", you would certainly want to try and discover some interesting relationships between the various data fields which you could then talk about with the people who gave you the data, which might lead to you learning more about the meanings of all of the data fields or what your colleagues would like you to concentrate on.
You may also want to check for missing and NULL values in the various data fields, and look for any inconsistencies in the values, because if the data is not entered in a consistent manner, it could be more difficult for RapidMiner to find interesting relationships between the data fields. It's usually helpful to get a sense of the minimum, average, median, and maximum values for the numeric data fields and how evenly (or unevenly) the data for each field is distributed.
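These checks can be sketched in a few lines of plain Python. The example below uses two columns copied from the illustrative rows in the question and assumes that blank cells mean "missing" and that `.` in numbers such as `1.950` is a thousands separator. The helper names are invented for this sketch.

```python
from statistics import mean, median


def to_number(text):
    """Parse a cell such as '1.950' into 1950.0; blank cells become None.

    Assumption: '.' is a thousands separator, as the illustrative rows suggest.
    """
    text = text.strip()
    if not text:
        return None
    return float(text.replace(".", ""))


def missing_share(cells):
    """Fraction of cells that are blank after stripping whitespace."""
    return sum(1 for c in cells if not c.strip()) / len(cells)


def summarize(cells):
    """Minimum, mean, median and maximum of the non-blank numeric cells."""
    numbers = [n for n in map(to_number, cells) if n is not None]
    return {
        "min": min(numbers),
        "mean": mean(numbers),
        "median": median(numbers),
        "max": max(numbers),
    }


# Two columns copied from the ten illustrative rows.
client_sector = ["Other", "Other", "", "", "", "Other", "Other", "Other", "Other", "Other"]
net_turnover = ["1.950", "297", "9.614", "4.550", "11.276", "4.789", "2.007", "2.240", "2.360", "7.513"]

sector_missing = missing_share(client_sector)
turnover_summary = summarize(net_turnover)

print(f"Client Sector missing: {sector_missing:.0%}")
print(turnover_summary)
```

A mean well above the median, as in this small sample, hints at a skewed distribution with a few large projects, which is one more reason to look at outliers before modelling.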
Hope this helps, good luck, and best wishes, Michael Martin
