Large data set with Time Series
Dear all -
I am fairly new to RapidMiner.
I am working with a large time series data set (covering 2000 to 2019).
There are about 200,000 rows and 4 attributes (variable, region, time series, and values).
Decision Tree and Forecasting with Windowing are among the approaches on my radar.
Anyway, I am kind of lost here... what type of analysis could I do with this kind of data set?
Thanks in advance for your help!
Alexsandro Toaldo
Best Answers
Toaldo Member Posts: 3 Learner I
Hi Martin -
Thanks for your prompt response.
That is a great question, and I am not sure of the answer yet.
As background, I am working with public information about our city (Sao Paulo), which contains about 200,000 records with 4 attributes. As this is a time series data set, I am not sure where to start or what type of analysis I can do. The attached file is a sample of the data set.
MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
Hi,
first you likely want to Pivot this whole table to get something like:
Date, Region, Value Of Taxa de Universalizacão, Value Of ... , Value of ...
This is more the data set of interest.
In German we got the saying: To saddle the horse from the wrong side. That's somewhat what you do here. Usually you have a problem and formulate a question to the data you want to answer. You are doing it more the other way around, which is tough.
Besides forecasting a general thing to do with this data may be outlier detection. Are there values which are unexpected? And why? Maybe this helps.
Cheers,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
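For readers working outside RapidMiner, the pivot suggested above can be sketched in plain Python. This is only an illustration under stated assumptions: the rows, the variable names (`indicator_a`, `indicator_b`), the region names, and the helper `pivot_long_to_wide` are all hypothetical and are not taken from the attached file or from the RapidMiner Pivot operator.

```python
from collections import defaultdict

# Hypothetical long-format rows: (variable, region, year, value).
# Names and numbers are invented purely for illustration.
long_rows = [
    ("indicator_a", "Region 1", 2000, 10.0),
    ("indicator_b", "Region 1", 2000, 3.5),
    ("indicator_a", "Region 1", 2001, 11.0),
    ("indicator_b", "Region 1", 2001, 3.7),
    ("indicator_a", "Region 2", 2000, 20.0),
]


def pivot_long_to_wide(rows):
    """Group rows by (year, region) and give each variable its own column."""
    wide = defaultdict(dict)
    variables = set()
    for variable, region, year, value in rows:
        wide[(year, region)][variable] = value
        variables.add(variable)

    columns = sorted(variables)
    table = []
    for year, region in sorted(wide):
        record = {"date": year, "region": region}
        for name in columns:
            # Combinations that never occurred stay None (missing),
            # they are not silently replaced by zero.
            record[name] = wide[(year, region)].get(name)
        table.append(record)
    return table


wide_table = pivot_long_to_wide(long_rows)
for record in wide_table:
    print(record)
```

Each output record then has one column per variable, which is the "Date, Region, Value of ..." layout described above. Missing combinations are kept as `None` so that later steps can decide how to treat them.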
Answers
The attached file is a template containing public data from our city.
The first column, "district", has approximately 2,243 records.
The time series covers 1996 to 2019 (~23 years).
Columns C to WE (approximately 600 different attributes) contain a wide range of information about our city (indexes, GDP, number of males, females, etc.). This is a very large amount of high-quality data.
My intended research approach is initially the following:
2) Select independent variables (10 to 20) explaining the selected dependent variable (companies);
3) Decision tree on 10 selected neighborhoods explaining the increase in companies;
4) Cluster neighborhoods according to their potential to increase the number of companies.
So, I have a couple of questions:
1) There are many attributes with no values. As this is a large set of data, should I leave them empty or replace them with zero?
2) Which operator/analysis should I start with, always considering "District" as the label? (Every possible answer should come from District and the size type of organization: large, medium, small.)
Thanks for your attention!
Best,
A.Toaldo
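On the first question above (leave missing values empty or replace them with zero), a small sketch may help show why the choice matters. It is not a recommendation from the thread: the values and the helper names (`fill_with_zero`, `fill_with_mean`, `missing_share`) are hypothetical, and mean imputation is just one common alternative, used here for contrast.

```python
from statistics import mean

# Hypothetical yearly company counts for one district.
# None marks a year with no recorded value.
companies = [120.0, None, 130.0, 128.0, None, 140.0]


def missing_share(values):
    """Fraction of entries that are missing."""
    return sum(v is None for v in values) / len(values)


def fill_with_zero(values):
    """Replace missing entries with 0.0 (this claims the count was zero)."""
    return [0.0 if v is None else v for v in values]


def fill_with_mean(values):
    """Replace missing entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from, leave untouched
    average = mean(known)
    return [average if v is None else v for v in values]


print("share missing:", missing_share(companies))
print("mean after zero fill:", mean(fill_with_zero(companies)))
print("mean after mean fill:", mean(fill_with_mean(companies)))
```

In this toy series, filling with zero pulls the average well below every observed value, because a zero is treated as a real measurement. Whether zero is appropriate depends on whether "no value" really means "none" for that attribute. Attributes with a very high missing share may be better dropped than filled.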