Hi everybody, and happy new year to all of you.
I am asking for your help because I am really not getting anywhere with my project, so you guys are my last hope.
I am a newbie to data mining and I am doing my best to learn the basics so that I can get started with my project.
I guess my problem is that I am new to data mining and have no clue how to get started. The approach I have been following is to read books about data mining and machine learning so that I can understand and compare all the "numerous" algorithms out there, and then try to find the one(s) that would apply to my case.
The first problem with this approach is that I did not find any good resources (either the subjects are treated superficially or they are made overly complex and hard to follow). The second problem is that it is incredibly time consuming. So I am wondering if I should continue down this path or try to proceed differently.
I am sure that a lot of you guys have been in the same position and that some of you have struggled with this problem just like me. So I am hoping that you guys can suggest a method that would help me get started.
The project I am working on is in the field of agriculture, and its objective is to find the best values of all the parameters that affect the outcome (the amount of meat produced) of an animal production (could be dairy, poultry, pork, etc.).
So, as I said, the approach is to run one or more algorithms on historical data for a certain type of production (poultry, for example) and try to find the operating conditions that maximize the growth of the animals (weight) while minimizing production costs. A few examples of the questions this project is trying to answer: When, and for how long, should the barns be lit? When and how much food should we give the animals? What is the best operating temperature set point? When and how much cooling/heating should be done? And so on.
As you may have noticed, all these questions concern the optimization of the operating conditions and, most importantly, the reduction of operating costs. Huge amounts (tens of GB) of historical data on these operating conditions are to be used for this purpose.
I hope you guys will be kind enough to help me work my way through this. I would appreciate your help and advice, and I thank in advance all of you who took the time to read this lengthy post.
Cheers.
This is too complex a problem to answer simply in a forum. First of all, you won't be giving tens of gigabytes of data to the algorithms. It's not that they can't take it, but it is unlikely to be a valid approach, and they don't need that much.
What you need to do is clarify the problem you are trying to solve (e.g. maximize animal weight), determine the variables you want to check (e.g. the operating conditions), and run models on those. You should be able to generate a representative sample of the data that is small enough for the models to run in a reasonable time.
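One way to draw such a sample without loading the whole history into memory is reservoir sampling, which keeps a fixed-size uniform random sample while streaming through the rows once. The following is only a minimal sketch: the column names and the in-memory stand-in for the history file are made up for illustration, and a real export would be read from disk instead.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random from an iterable
    whose length is not known in advance (single pass, O(k) memory)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i (0-based) replaces a kept row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a large history file; columns and values are hypothetical.
fake_csv = io.StringIO(
    "barn_temp_c,feed_kg,weight_kg\n"
    + "".join(
        f"{20 + i % 10},{100 + i},{1.5 + (i % 7) * 0.1:.1f}\n"
        for i in range(10_000)
    )
)

sample = reservoir_sample(csv.DictReader(fake_csv), k=500)
print(len(sample), "rows kept, e.g.", sample[0])
```

A uniform sample is only representative if the rows themselves are comparable; if the history mixes very different flocks or seasons, sampling within each group separately may be the safer choice.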
There are many approaches you could take. For instance, you could run a clustering model against all variables and see whether weight is evenly distributed across the clusters or is separated by other variables; it would be good to limit the number of variables for such an approach. You can use trees, Naive Bayes, neural nets, and even association rules to see the patterns of how variables are related. Most importantly, you need to be specific about your problem and your inputs.
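To give a feel for what a tree model does with such data, here is a minimal sketch of the first split a regression tree would make: for one input variable, it finds the threshold that best separates low weights from high weights. The variable names and numbers are invented for illustration, and a real tree algorithm repeats this search recursively over all variables.

```python
def _sse(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on xs that minimises the squared error of ys
    around the mean on each side. Returns (threshold, sse), or None if
    xs has no two distinct values to split between."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = _sse(left) + _sse(right)
        if best is None or sse < best[1]:
            best = ((lo + hi) / 2, sse)
    return best


# Hypothetical toy data: one row per flock.
weights = [2.0, 2.1, 2.0, 2.1, 2.6, 2.7, 2.6, 2.7]
inputs = {
    "barn_temp_c": [18, 19, 20, 21, 22, 23, 24, 25],
    "light_hours": [14, 16, 14, 16, 14, 16, 14, 16],
}

results = {name: best_split(xs, weights) for name, xs in inputs.items()}
for name, (threshold, sse) in results.items():
    print(f"{name}: split at {threshold}, remaining squared error {sse:.3f}")
```

The variable whose best split leaves the least remaining error is the one that, on its own, discriminates weight most strongly in this sample.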
If it makes a difference, you may want to create new variables based on existing inputs. For example, maybe the difference between the daily high and low temps is more predictive than the actual temperatures themselves? There seem to be many hypotheses to test and it sounds like fun!
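As a rough illustration of testing that kind of hypothesis, the sketch below derives a daily temperature range from the high and low readings and compares how strongly each variable correlates with weight gain. The field names are hypothetical and the numbers are invented so that the range drives the gain by construction; real data would of course be noisier.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical daily records (made-up values, for illustration only).
days = [
    {"temp_high_c": 24, "temp_low_c": 20, "weight_gain_g": 62},
    {"temp_high_c": 30, "temp_low_c": 18, "weight_gain_g": 46},
    {"temp_high_c": 26, "temp_low_c": 21, "weight_gain_g": 60},
    {"temp_high_c": 32, "temp_low_c": 19, "weight_gain_g": 44},
    {"temp_high_c": 28, "temp_low_c": 22, "weight_gain_g": 58},
    {"temp_high_c": 25, "temp_low_c": 15, "weight_gain_g": 50},
]

# Derived variable: difference between the daily high and low temperature.
for day in days:
    day["temp_range_c"] = day["temp_high_c"] - day["temp_low_c"]

gain = [d["weight_gain_g"] for d in days]
corr_high = pearson([d["temp_high_c"] for d in days], gain)
corr_range = pearson([d["temp_range_c"] for d in days], gain)

print(f"high vs gain:  {corr_high:+.2f}")
print(f"range vs gain: {corr_range:+.2f}")
```

If the derived variable correlates more strongly with the outcome than the raw inputs do, it is a good candidate to feed into the models alongside them or in their place.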
Good luck, and feel free to ask any additional questions in this forum or the newsgroup. Tips and tricks and other articles can be found at www.sqlserverdatamining.com.