RAPIDMINER DATA SCIENCE COMPETITION: FARMING ON "MARS" – SEPTEMBER 12 TO OCTOBER 13, 2017
Hello all community members -
Welcome to the 2nd RapidMiner Data Science Competition: Farming on "Mars"!
We and our sponsor are super excited to bring this open competition to our 270,000+ users, and we hope that you have a great time exploring this unique use case. Below is a brief summary and rules of the competition; complete documentation can be found in the attachments below. PLEASE READ all the attached documentation before beginning the competition and let the best model win!
Summary
One of the major challenges of the human colonization of “Mars” is the introduction of Earth-independent food production facilities, i.e. farming. A key element to farming on “Mars” will be the fertilization of available soil, which in its current state is not farmable due to a lack of nutrients. In order to address this, an experimental setup has been created under “Martian” environmental conditions to produce bio-fertilizer made from algae and measure the usable yield after each production run. This yield varies based on the exact quantities of certain base nutrients and the optional addition of one of two possible additional nutrients, α or β, inserted into the bio-fertilizer at some time t during the production run. The research facility has already done 1653 production runs, each one lasting 36 hours with 41 sensors recording data every hour, and recorded the potential yield of each one. These are your data to work with during this challenge.
Challenge
The goal of the challenge is to build a model that will classify which additional nutrient, α or β, and at what time t, will be most likely to boost yield during a production run. The metric to be optimized is the cumulative score value of the same 178 production runs in the test set; the baseline example above has a cumulative score value of 1000.
Submission and Evaluation
All submissions in this competition need to be posted in this thread with the entire XML of the process and the score. This includes the finished models, as well as the entire training process and all pre-processing steps. The deadline for submissions is October 13, 2017 at 23:59:59 UTC.
RapidMiner Server Instance
In order to increase the efficiency of model training and to demonstrate RapidMiner’s powerful parallel processing capabilities with its new SaaS on Amazon AWS EC2, RapidMiner has agreed to provide a free Server EC2 instance for all participants for the duration of this competition. This server instance can be used by any participant free of charge, as often as desired, as long as all use is restricted to this competition only. Participants wishing to use this server must send @sgenzer a private message to register and obtain the relevant connection details. The instance URL is https://competitions.rapidminer.com and will be online only for the duration of the competition.
Winner and Prizes
The winner of the competition will be selected based on the highest aggregate score value of the 178 testing production runs ≥ 1000, after applying the test dataset to the submitted models. All submissions will be validated by RapidMiner and the competition’s sponsor within 72 hours after their submission. The winners of this RapidMiner Data Science Challenge will be announced by October 17, 2017 in the competition’s thread.
RapidMiner and the competition sponsor will award the following prizes to the winners:
1st place: US$1000
2nd place: US$250
3rd place: US$100
PLUS all participants who submit a valid entry in the thread prior to the deadline will be eligible to win one or more amazing RapidMiner “swag” items. Supplies are limited and will be awarded on a first-come, first-served basis.
Restrictions
All participants of the RapidMiner Data Science Competitions must be registered users in good standing of the RapidMiner User Community and age 18 or older at the time of entry. Employees, directors, consultants, and any other persons affiliated with RapidMiner, Inc. are not eligible to participate in this competition.
Good luck everyone and reply to this thread with questions and your models!
Scott
Links: Training Data Set
Answers
Hi Scott,
Looks like an interesting problem. It appears that 'run 1341' in the test dataset may be corrupted.
Cheers
Dan
Hi Dan -
Hmm. I just downloaded the zip from the link above and I see no problems with the files.
Download again?
https://rapidminer-my.sharepoint.com/personal/sgenzer_rapidminer_com/_layouts/15/guestaccess.aspx?docid=1c7686d0d5c0241e9b293c07bb98beeec&authkey=AfDPdBh_3zuwnerIo59cyA8
Scott
I guess it does have run 1341 but it also has a corrupted fragment at the bottom of the list. I'll just delete it.
D.
Hi,
I would like to clarify a few points on the explanation given.
"
These are the production yield increases for the production run at each hour of production. For this example, all yield increases for nutrient A (column AS) will be scored as invalid (-100) because it was shown later that nutrient B was needed (see cell C10). For column AT, the score is determined by which hour nutrient B was inserted: if nutrient B was inserted at t=0, score = 62.5 If nutrient B was inserted at t=5 hours, score = 59.5. If nutrient A was inserted at t = 24 hours, score = 54.3"
1. If nutrient B was inserted at t=5 hours, score = 59.5. It should be 59.9.
2. If nutrient A was inserted at t = 24 hours, score = 54.3. This statement is true only when the Label is equal to "A".
Pls clarify. Thank you.
hello @16B543J - thanks for your questions. I am assuming you are referring to the annotated training set 1? Here are my answers.
These are the production yield increases for the production run at each hour of production. For this example, all yield increases for nutrient A (column AS) will be scored as invalid (-100) because it was shown later that nutrient B was needed (see cell C10).
1. Yes that is correct.
For column AT, the score is determined by which hour nutrient B was inserted: if nutrient B was inserted at t=0, score = 62.5 If nutrient B was inserted at t=5 hours, score = 59.5. If nutrient A was inserted at t = 24 hours, score = 54.3".
1. If nutrient B was inserted at t=5 hours, score = 59.5. It should be 59.9.
2. If nutrient A was inserted at t = 24 hours, score = 54.3. This statement is true only when the Label is equal to "A".
2. I'm not really sure what your question is. For the annotated training set 1, if nutrient B was inserted at t=5, the score would be 59.9. And if nutrient B was inserted at t=24 hrs, the score would be 54.3. If nutrient A is inserted at any time, score = -100.
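The scoring rule described in this reply can be sketched in Python as follows. This is only a minimal illustration, not the official scoring process: the function name `score_run` and the dictionary layout are assumptions, and only the three yield values (62.5, 59.9, 54.3) come from the annotated example discussed above.

```python
INVALID_SCORE = -100.0  # score when the wrong nutrient is inserted


def score_run(label, nutrient, hour, yield_by_hour):
    """Score one production run.

    label         -- nutrient the run actually needed ("A" or "B")
    nutrient      -- nutrient the model chose to insert
    hour          -- hour t at which the model inserts it
    yield_by_hour -- hypothetical dict {hour: yield increase} for the
                     needed nutrient
    """
    if nutrient != label:
        # Wrong nutrient: invalid regardless of the insertion hour.
        return INVALID_SCORE
    return yield_by_hour[hour]


# Values quoted for annotated training set 1 (label B); other hours omitted.
example_b = {0: 62.5, 5: 59.9, 24: 54.3}

print(score_run("B", "B", 5, example_b))   # 59.9
print(score_run("B", "A", 24, example_b))  # -100.0
```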
Thanks and good luck!
Scott
Thanks Scott for the clarification.
Hello Scott
I noticed that around 7% of the rows contain missing values for the attributes sensor41, yieldIncreaseA and yieldIncreaseB. For example, trainingset 1001 shows this. Is this intentional?
Andrew
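A check like the one behind this question could look roughly like the sketch below. It assumes the data has been exported to CSV and that a missing value shows up as an empty cell; the function name `missing_fraction` and the toy rows are made up for illustration and are not taken from the competition files.

```python
import csv
import io

COLUMNS = ("sensor41", "yieldIncreaseA", "yieldIncreaseB")


def missing_fraction(rows, columns=COLUMNS):
    """Return the fraction of rows with an empty value in any of `columns`."""
    rows = list(rows)
    if not rows:
        return 0.0
    missing = sum(
        1
        for row in rows
        if any((row.get(col) or "").strip() == "" for col in columns)
    )
    return missing / len(rows)


# Toy data for illustration only.
toy_csv = io.StringIO(
    "hour,sensor41,yieldIncreaseA,yieldIncreaseB\n"
    "0,1.2,10.0,-100\n"
    "1,,9.5,-100\n"
    "2,1.4,9.1,-100\n"
    "3,1.5,,\n"
)
print(missing_fraction(csv.DictReader(toy_csv)))  # 0.5
```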
Hello Scott
Could you change the annotation in cell AS:5 in the worked example to match your reply, to avoid confusing later readers?
regards
Andrew
Hello @Andrew - thank you for the feedback. I finally got the aha moment about what @16B543J was referring to yesterday, i.e. the text explanation in the pink boxes. I think I have looked at that so many times that I glossed over it completely. My apologies. I will update the file in a few minutes.
As for your question about missing values, yes, there are many. These are actually real data from our sponsor and hence there are all sorts of wonky things in it.
Scott
Hi Scott, I am joining this discussion a bit late...
I need some clarification on how the data was collected. According to the spec, in any run nutrient A or B can be added to the bio-fertiliser once at some time t. What is not clear: Is nutrient added before or after the reading of the sensors at time t? For example, at time t=0 was nutrient added before the very first reading of the sensors or after the first reading? It is crucial as it seems in a number of cases t=0 was the best option to add the nutrient, however, it would not make any sense to do so without taking the very first reading and it seems only sensor 41 was kind enough to give any data at that time.
Thanks a lot -- Jacob
hello @jacobcybulski - thanks for your question. Here's the answer I have received from the sponsor (who created the data set):
"The answer to Jacob's question would be that the nutrient is added right after the reading of all sensors is available for the specific point in time. The situation at t = 0 is a bit special and he makes a fair point. I have personally completely disregarded the option of making predictions at t=0 in my models, as there is only one sensor that provides data at this point in time. However, this does not mean that making a prediction at t = 0 is entirely implausable."
Scott
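One way to read the sponsor's answer above: a decision at hour t may use every reading up to and including hour t, because the nutrient is added right after those readings are available. The sketch below illustrates that reading only. The function name `readings_available_at`, the row layout, and the toy values are assumptions, not part of the competition data format.

```python
def readings_available_at(run_rows, t):
    """Return the hourly rows a model may use when deciding at hour t.

    Per the sponsor's reply, the nutrient is added right after the readings
    for hour t are available, so rows for hours 0..t inclusive are usable.
    `run_rows` is a hypothetical list of dicts with an "hour" key.
    """
    return [row for row in run_rows if row["hour"] <= t]


# Toy run for illustration: at t=0 only sensor41 has a value.
toy_run = [
    {"hour": 0, "sensor1": None, "sensor41": 0.8},
    {"hour": 1, "sensor1": 3.1, "sensor41": 0.9},
    {"hour": 2, "sensor1": 3.4, "sensor41": 1.1},
]
print(len(readings_available_at(toy_run, 1)))  # 2
```

Filtering this way keeps readings from later hours out of the features used for an earlier insertion time.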
Thanks a lot Scott -- Jacob
Hi Scott,
In a data row where the label is B and the value for "yieldIncreaseA" is "19900", can I assume the value is "-100" and ignore the "19900" value?
Thanks
Hello @16B543J - that is correct. If the label is B and nutrient A is added at any point in the production run, the score is -100 irrespective of what is in column "yieldIncreaseA".
Scott
Good morning competitors,
Just wanted to remind everyone that there are less than 2 weeks left for this competition. In this vein, I would like to share again how submissions must be made in order to be valid:
As usual, please do not hesitate to ask questions as they arise on this thread. Good luck to everyone!
Scott
To check my understanding, I've implemented a process that uses the label of the test data as the correct class and I've assumed that t=0 is the time when the sample is introduced. I then calculate a score based on the sum of the t=0 values of yieldIncreaseA or yieldIncreaseB to get a result of 14354.5. Obviously, I'm cheating but my questions are
Andrew
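The "cheating" calculation described in this post can be sketched as below: take each run's label as the chosen nutrient, always insert at t=0, and sum the matching yield increase. This is a rough Python illustration, not the RapidMiner process Andrew built. The data layout and the function name `oracle_score_at_t0` are assumptions, and the toy numbers are invented apart from 62.5, which is the t=0 value quoted earlier for the annotated example.

```python
def oracle_score_at_t0(runs):
    """Sum, over all runs, the t=0 yield increase of each run's labelled nutrient."""
    total = 0.0
    for run in runs:
        # Pick the column that matches the known label (this is the "cheat").
        column = "yieldIncreaseA" if run["label"] == "A" else "yieldIncreaseB"
        total += run["t0"][column]
    return total


# Toy runs for illustration only.
toy_runs = [
    {"label": "A", "t0": {"yieldIncreaseA": 40.0, "yieldIncreaseB": -100.0}},
    {"label": "B", "t0": {"yieldIncreaseA": -100.0, "yieldIncreaseB": 62.5}},
]
print(oracle_score_at_t0(toy_runs))  # 102.5
```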
Hello @Andrew and all -
Yes, a sample scoring process would be useful. Here is one that can be used if you like. Note that this "model" does nothing but always pick nutrient A at hour 13 - not a good idea.
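For readers without the process XML at hand, the same idea can be sketched in Python: a constant "model" that always picks nutrient A at hour 13, scored cumulatively over a set of runs. This is not the process shared in the thread. The function name `constant_policy_score`, the data layout, and the toy values are assumptions made for illustration.

```python
INVALID_SCORE = -100.0  # score when the wrong nutrient is inserted


def constant_policy_score(runs, nutrient="A", hour=13):
    """Cumulative score of always inserting `nutrient` at `hour`."""
    total = 0.0
    for run in runs:
        if run["label"] != nutrient:
            # Wrong nutrient for this run: invalid at any hour.
            total += INVALID_SCORE
        else:
            total += run["yield_by_hour"][hour]
    return total


# Toy runs for illustration only; "yield_by_hour" holds the yield increase
# of the labelled nutrient at each hour.
toy_runs = [
    {"label": "A", "yield_by_hour": {13: 30.0}},
    {"label": "B", "yield_by_hour": {13: 55.0}},
    {"label": "B", "yield_by_hour": {13: 48.0}},
]
print(constant_policy_score(toy_runs))  # -170.0
```

Every run labelled B costs 100 points under this policy, which is why a constant choice scores poorly.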
Scott
Hello all
I've managed to score 1134 but I'll post my process at the last possible moment.
regards
Andrew
Boom! Well done, @Andrew! Anyone else coming in? There are prizes for 2nd and 3rd place.
Scott
Hi Scott,
Just a question on the submission method. I am sure it has been addressed in the original competition document and your post above, I just want to ensure I am following your instructions to the letter. Here are some of my assumptions:
Jacob
P.S. Lots of questions and I am yet to get some good results to submit
Hello @jacobcybulski - all good questions. Let me answer below.
Hello all competitors - FYI the Competition Server is currently locked up so any jobs sent to the server will not be queued. I will ask my colleagues to do a hard reboot tomorrow morning first thing.
UPDATED - COMPETITION SERVER IS BACK UP AND RUNNING (9:30AM EST).
Thanks for your understanding. Lots of lessons learned here for me too. Three more days to go!
Scott