This assessed assignment must be submitted as a PDF file. The task is to apply machine learning to the so-called 'primate-factors' dataset. The background is as follows.
Sometime in the early 1980s, an animal behaviour researcher studying the social behaviour of primates was killed in a car accident while returning from a field trip. Among his belongings, police discovered several listings of data, but the notebooks, assumed to contain details about the origins of the data, were all burned. The longest of the listings is reproduced in ex1-data.txt. The listing is in tabular form, and it is generally believed that each line records the values of several variables, followed by a value which is either 1 or 0. This final value is believed to be a classification value of some sort, but the nature of the class has never been determined. The final 100 lines in the listing are missing this classification value.
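As a starting point, the following is a minimal sketch of how the listing might be read and split into labelled and unlabelled rows. It rests on assumptions that are not stated in this brief and should be checked against the file: that fields are separated by whitespace, that all values are numeric, and that a row missing its classification value simply has one field fewer. The class name PrimateData and the parameter numFeatures are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits the listing into labelled rows and the unlabelled rows to be predicted. */
class PrimateData {
    final List<double[]> features = new ArrayList<>();   // variables of labelled rows
    final List<Integer> labels = new ArrayList<>();      // class value (0 or 1) of labelled rows
    final List<double[]> unlabelled = new ArrayList<>(); // variables of rows with no class value

    /**
     * Parses lines of whitespace-separated numbers (an assumed format).
     * Assumption: a row with numFeatures + 1 fields is labelled,
     * and a row with exactly numFeatures fields is unlabelled.
     */
    static PrimateData parse(List<String> lines, int numFeatures) {
        PrimateData data = new PrimateData();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length != numFeatures && fields.length != numFeatures + 1) {
                throw new IllegalArgumentException("Unexpected field count: " + line);
            }
            double[] x = new double[numFeatures];
            for (int i = 0; i < numFeatures; i++) {
                x[i] = Double.parseDouble(fields[i]);
            }
            if (fields.length == numFeatures + 1) {
                // Accepts both "1" and "1.0" as the class value.
                int label = (int) Math.round(Double.parseDouble(fields[numFeatures]));
                data.features.add(x);
                data.labels.add(label);
            } else {
                data.unlabelled.add(x);
            }
        }
        return data;
    }
}
```

The lines could be obtained with Files.readAllLines(Paths.get("ex1-data.txt")), with numFeatures set after inspecting the file. If the sketch's assumptions hold, data.unlabelled should contain exactly 100 rows, which is a useful sanity check.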
The exercise involves (1) analysing the properties of the data, (2) implementing and then using one or more machine learning methods to derive predicted classification values for the final 100 cases, (3) submitting the predictions online, and (4) presenting the results of the experiment in a report.
Programs can be implemented in Java or any similar language, but not MATLAB.
Marks for the assessment will be awarded using the following scheme.
5% for the introduction to the topic of machine learning. (Approx half a page.)
5% for the introduction to the method(s) used in the experiment. (Approx half a page.)
10% for informative analysis of the training data. This can be based on either formal methods (such as analysis of correlation, density, variance, etc.) or informal methods (such as reasoned deduction about the probable nature of the researcher's work). Ideally it should be based on both. The aim should be to show what assumptions can be made about the data, and to explain how those assumptions might inform the choice or application of machine learning procedures. (Approx 2 pages.)
10% for presentation of results. This section should show what output your program produced and how it is to be interpreted. It may make use of tables and graphs to show how well the program performs. (Approx 3 pages.)
10% for level of functionality in the program.
10% for program design.
10% for the concluding section and appendices. The conclusion should discuss any limitations of your approach. There should be a self-assessment (in terms of the specified credit factors) and a full bibliography. You must directly acknowledge any quoted material. The code and other relevant data (e.g., graphs) should be included as appendices (but Javadoc listings are not required).
30% for the accuracy of the predicted classification values. Marks for this component will be normalised appropriately. The most accurate predictions in the group will score the full 30%. Other marks will be scaled accordingly.
10% for material that goes beyond the spec, e.g., evaluation of performance of multiple algorithms.
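For the formal analysis of the training data mentioned in the scheme above, the following is a minimal sketch of two of the named measures: the variance of a variable and its Pearson correlation with the classification value. It is an illustration only, not a required method. The class name FeatureStats is hypothetical, and the sketch assumes each variable has already been extracted into an array of doubles, with the class values also stored as doubles (0.0 or 1.0).

```java
/** Simple per-variable statistics for the labelled rows. */
class FeatureStats {
    /** Arithmetic mean of the values. */
    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /** Population variance (divides by n rather than n - 1). */
    static double variance(double[] v) {
        double m = mean(v);
        double sum = 0;
        for (double x : v) {
            sum += (x - m) * (x - m);
        }
        return sum / v.length;
    }

    /**
     * Pearson correlation between two equal-length arrays.
     * Returns NaN if either array is constant (zero variance).
     */
    static double correlation(double[] a, double[] b) {
        double ma = mean(a);
        double mb = mean(b);
        double cov = 0;
        for (int i = 0; i < a.length; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
        }
        cov /= a.length;
        return cov / Math.sqrt(variance(a) * variance(b));
    }
}
```

A variable whose correlation with the class is close to zero is not necessarily uninformative, since Pearson correlation only captures linear relationships. This is one reason the scheme suggests combining formal analysis with reasoned deduction.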
To recap, your report should have a general introduction, an introduction to the method(s) used, an analysis of the training data, and a section on results, plus supporting materials (e.g., the code). There should also be a section on limitations and a self-assessment (in terms of the given credit factors). You will get zero marks for any component missing from your report, so it's a good idea to have one section of your report per credit factor. The page requirements are a guide only and are based on use of a normal-sized print font (e.g., 11 point).
The full report should be submitted as a PDF document. As part of that submission, the predicted output values should be saved as a .txt file.
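The following is a minimal sketch of how the predicted values might be saved as a .txt file. This brief does not specify the layout of that file, so the format used here (one value per line, in the same order as the 100 unlabelled rows) is an assumption, as is the class name PredictionWriter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Saves predicted class values to a plain text file. */
class PredictionWriter {
    /**
     * Writes one predicted class value (0 or 1) per line.
     * Assumption: predictions are in the same order as the unlabelled rows.
     */
    static void write(Path out, int[] predictions) throws IOException {
        List<String> lines = new ArrayList<>();
        for (int p : predictions) {
            if (p != 0 && p != 1) {
                throw new IllegalArgumentException("Class value must be 0 or 1: " + p);
            }
            lines.add(Integer.toString(p));
        }
        // Creates the file, or overwrites it if it already exists.
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}
```

For example, PredictionWriter.write(Paths.get("predictions.txt"), predictions) would produce a file with one line per prediction. The file name predictions.txt is only a placeholder.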