Machine Learning for Large-Scale Data Analysis and Decision Making (MATH80629A): Fall 2021

This project will be worth 40% of your final grade. You must work in teams of two or three.

Grading Scheme

Project Report (30%)

Clarity/Relevance of problem statement and description of approach: 10%
Discussion of relationship to previous work and references: 4%
Design and execution of experiments: 10%
Figures/Tables/Writing: easily readable, properly labeled, informative: 5%
Individual report: 1%

Project Presentation (10%)

Clarity of presentation: 3%
Slide or Poster quality: 2%
Correctness: 2%
Answers to questions: 3%

Timeline

Team Registration, due: September 20, 2021. Fill this form.
Study plan, due: October 18, 2021 (by the end of the day EDT). The head of the team uploads the PDF of the proposal to Gradescope.
Project meeting, October 25, 2021
Project Presentation, due: December 5, 2021. The head of the team uploads the PDF of your poster/slides to Gradescope.
In-class Presentation, on December 6, 2021.
Final group report, due: December 20, 2021 (by the end of the day EDT). The head of the team uploads the PDF of the final group report to Gradescope.
Final individual report, due: December 20, 2021 (by the end of the day EDT). Upload the PDF of the final individual report to Gradescope (one upload per team member).

Goals

The aim of this project is to allow you to learn about machine learning by trying to solve a task with it.

First, select a question that can be answered using machine learning. I expect that your question will be about a model/algorithm or about an application. Then design a study that will try to answer your question. Your study must have an element of novelty. For example, the novelty could be an extension or a variation of an existing algorithm, or the results of an existing method on a new dataset.

Your study should involve reading and understanding some background material. Your study must involve running some experiments. You are free to use (or not) any of the tools or models we have seen in class.

Alternatively: You could decide to participate in this open challenge: ML Reproducibility Challenge 2020. Let me know as soon as possible if you are interested in this.

Study plan: (1 upload per team) Please submit a one-page summary of your proposed research question and study to Gradescope. I will meet with each group to discuss study plans during the lecture of October 25. I will send you a schedule the day before. We will probably only have about 15 minutes, so please make sure that your study plan is clear and precise. At the end of the document, you may also include questions that you would like us to discuss.

The group report: (1 upload per team) Your report must contain a description of the question you are trying to answer, a clear description of the model/algorithm you are studying, a survey of related work with proper references, an empirical section that reports your results, and a conclusion that summarizes your findings and (if pertinent) highlights possible future directions of investigation. Your report should be no longer than 10 pages (plus references) for pairs or 13 pages (plus references) for teams of three.

The individual report: (1 upload per student) You will also submit a brief individual report (at most one page), which will: (1) Describe the parts of the project you worked on (which machine learning methods you applied, which preprocessing steps you performed on the data, which parts of the term paper you wrote, who you worked with on what parts, etc.) and what parts of the project your teammates worked on. (2) Describe what you learned from the project. The purpose of the individual report is to facilitate fair grading and to help the instructor understand what you learned from the project.

Some advice (mostly taken from CSC2515 at UofT):

Be selective! Don’t choose a project that has nothing to do with machine learning. Don’t investigate an algorithm that has a high chance of failing or being un-implementable. Don’t attack a problem that is irrelevant, ill-defined or unsolvable. Spend most of your time doing machine learning and not related things such as pre-processing your data.
Be honest! You are not being marked on how good the results are. It doesn’t matter if your method is worse than the ones you compare to provided you implemented it properly. What matters is that you try something sensible and clearly describe the problem, your method, what you did, and what the results were.
Be modest! Don’t pick a project that is way too hard. Usually, if you select the simplest thing you can think of to try, and do it carefully, it will take much longer than you think.
Be careful! Don’t do foolish things like test on your training data, set parameters by cheating, compare unfairly against other methods, include plots with unlabeled axes, use undefined symbols in equations, etc. Do sensible cross-checks like running your algorithms several times, leaving out small parts of your data, adding a few noisy points, etc. to make sure everything still works reasonably well. Make lots of pictures along the way.
Learn! The point of the project is to give you a chance to “test drive” the process of doing machine learning. Consider this an opportunity to learn how to write code to run large experiments, make nice figures, lay out readable equations, describe your work concisely to a smart but uninitiated reader, etc.
Have fun! If you pick something you think is cool, that will make getting it to work less painful and writing up your results less boring.
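
The "Be careful!" cross-checks above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the synthetic one-dimensional dataset, the toy nearest-centroid classifier, and all function names (make_data, fit_centroids, run_once) are hypothetical and not part of the course material. It shows two of the habits mentioned: evaluating only on held-out data, and repeating the experiment over several random seeds so that you can report a spread rather than a single number.

```python
import random
import statistics


def make_data(n, rng):
    """Synthetic 1-D, two-class data (hypothetical stand-in for a real dataset)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label, 1.0)  # class 0 centred at 0, class 1 at 2
        data.append((x, label))
    return data


def fit_centroids(train):
    """Toy nearest-centroid 'model': the mean of x for each class."""
    centroids = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        centroids[label] = statistics.mean(xs)
    return centroids


def accuracy(centroids, examples):
    """Fraction of examples whose nearest centroid matches the true label."""
    correct = 0
    for x, y in examples:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += (pred == y)
    return correct / len(examples)


def run_once(seed, n=400, test_fraction=0.25):
    """One full experiment: generate, split, fit on train, score on test."""
    rng = random.Random(seed)
    data = make_data(n, rng)
    rng.shuffle(data)
    n_test = int(n * test_fraction)
    test, train = data[:n_test], data[n_test:]
    model = fit_centroids(train)   # fit on the training split only
    return accuracy(model, test)   # evaluate on the held-out split only


# Repeat the experiment with several seeds and report mean and spread.
seeds = [0, 1, 2, 3, 4]
scores = [run_once(s) for s in seeds]
mean_acc = statistics.mean(scores)
std_acc = statistics.stdev(scores)
print(f"held-out accuracy over {len(seeds)} seeds: {mean_acc:.3f} +/- {std_acc:.3f}")
```

In a real project you would replace the synthetic data and toy model with your own, and keep a separate validation split for setting hyperparameters so that the test split is touched only once at the end.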

Interesting datasets

To find interesting datasets for your project, you can check: