Course Syllabus
Syllabus website - https://docs.google.com/document/d/1TwmRIpK_VE_YqtMXN2Q1g7qChdMqwCywsbgG2tYXFV4/edit
San José State University
Computer Engineering Department
CMPE 255, Data Mining, Section 47/49, Fall 2023
Course and Contact Information
Instructor: |
Vijay Eranti |
Office Location: |
TBA |
Telephone: |
(408) 394-1019 |
Email: |
|
Office Hours: |
(Virtual) Office Hours - By Appointment Always check with the CMPE website for the most up to date office hours, at https://cmpe.sjsu.edu/content/office-hours |
Student Assistant: |
TBA |
Class Days/Time: |
Mondays 6.00 pm to 8.45 pm |
Classroom: |
TBA |
Prerequisites: |
Basic understanding of Linear Algebra.
|
Course Format
This course requires the student to have a personal computer that is installed with a modern operating system. The lectures will be delivered in the classroom, however the students might be asked to use their laptops or smart devices during the class, or offline in order to participate in the class assignments.
Faculty Web Page and MYSJSU Messaging
Course materials such as syllabus, handouts, notes, assignment instructions, etc. can be found on Canvas at http://sjsu.instructure.com. You are responsible for regularly checking with the messaging system through MySJSU at http://my.sjsu.edu and Canvas to learn of any updates.
Course Description
Data mining process, problem definition, data representation and preprocessing, proximity, finding nearest neighbors, dimensionality reduction, exploratory analysis, association analysis and sequential patterns, supervised inference and prediction, classification, regression, model selection and evaluation, overfitting, clustering, auto data mining, advanced topics, recent trends and best practices.
Learning Outcomes
Upon successful completion of this course, students will:
- Be able to demonstrate an understanding of advanced knowledge of the practice of data mining in computer/software engineering, from vision to analysis, design, validation and deployment.
- Be able to tackle complex engineering problems and tasks, using contemporary engineering principles, methodologies and tools.
- Be able to demonstrate leadership and the ability to participate in teamwork in an environment with different disciplines of engineering, science and business.
- Be aware of ethical, economic and environmental implications of their work, as appropriate.
- Be able to advance successfully in the engineering profession, and sustain a process of life-long learning in engineering or other professional areas.
- Be able to communicate effectively, in both oral and written forms.
Course Learning Outcomes (CLO)
The main focus of this course is on data mining and its applications. More specifically, we will cover a broad range of data mining algorithms and techniques, focusing on how they are used in diverse applications such as forecasting, predicting future income, recommending product placements on shelves, finding genes with similar functions, etc. The lectures will revolve around the fundamental concepts of the areas covered as well as case studies, providing the students the opportunity to gain a deep understanding and be able to apply them in real-life scenarios. Multiple in-class and homework assignments required throughout the class will test the students’ ability to effectively harness the power of data mining in different scenarios. Although some data analytics and mining tools will be used for demonstration purposes, mastering specific tools is not the primary objective of the lectures. Instead, the students will be able to gain hands-on experience on such technologies via individual and group-based projects, where they will be expected to perform in-depth analysis using real-world data, and to enhance their professional engineering skills, including teamwork, technical leadership, and effective communication skills (both written and verbal).
Upon successful completion of this course, students will be able to:
- Understanding the project objectives and requirements from a domain perspective and then converting this knowledge into a data mining problem definition with a preliminary plan designed to achieve the objectives
- Identify the proper techniques and algorithms needed to prepare, preprocess, analyze and mine highly unstructured datasets, such as large text collections.
- Understand advantages and disadvantages of dimensionality reduction as a preprocessing tool for data analysis. Techniques of feature engineering and feature selection based on problem domain.
- Identify the correct class of methods that can be used to efficiently build a nearest neighbor graph, given properties of the input data.
- Discuss and apply fundamental data mining concepts and techniques, such as classification, clustering, or frequent pattern mining.
- Quickly get accustomed to any data mining/data analysis software application and be able to use it.
- Gain hands-on experience by performing an extensive analysis using data mining techniques, in individual and group projects.
- Effectively present and communicate the knowledge they have acquired in the course.
Required Texts/Readings
This class does not have a required textbook. Lecture slides and selected readings from online materials will cover the majority of the topics. A list of reference textbooks is also provided for those who would like to get some background knowledge or seek more details on the topics covered in class.
It is each student’s responsibility to consult with the updated syllabus on Canvas in order to identify which readings cover the concepts that are taught each week.
Online Materials
[0] Learning Data Mining In Python, Second Edition Robert Layton Packt Publisher
[1] Python Data Mining - Quick Start Guide - packt publisher Nathan
[2] Mastering Data Mining With Python - Megan Squire - Packt Publisher
[3] Machine Learning for Data Mining
[4] Mastering Social Media Mining with Python packt publisher
[5] Associations and Correlations Packt Publisher
[6] Basic and Advanced statistics and data mining packt publisher
[7] An Introduction to Data Mining And Predictive Analysis, Oreilly
[8] The Kaggle Book Packt pub
[A] Data Mining: The Textbook, by Charu C. Aggarwal, Springer, May 2015 (Springer: http://www.springer.com/us/book/9783319141411, Amazon: https://www.amazon.com/dp/3319141414/, Amazon other: https://www.amazon.com/gp/offer-listing/3319141414/condition=all)
[TSK] Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, Vipin Kumar, ISBN-13: 9780321335661 (download Ch. 4, 6, and 8 from http://www-users.cs.umn.edu/~kumar/dmbook/index.php)
[JWHT] An Introduction to Statistical Learning with Applications In R, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer Texts in Statistics, 2013 (download from http://www-bcf.usc.edu/~gareth/ISL/)
Other Readings
Papers, tutorial slides, articles and other reading materials will be made available via Canvas.
Reference textbooks (not required)
[TSK] Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, Vipin Kumar, ISBN-13: 9780321335661 (remaining chapters)
[LRU] Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman and Jeffrey Ullman, 2nd edition, Cambridge University Press, December 2014 (download from http://infolab.stanford.edu/~ullman/mmds/book.pdf)
[ZM] Data Mining and Analysis: Fundamental Concepts and Algorithms, by Mohammed J. Zaki, Wagner Meira, Jr., Cambridge University Press, May 2014 (download from http://www.dataminingbook.info/pmwiki.php/Main/BookDownload)
[HKPK] Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber and Jian Pei, Morgan Kaufmann, Elsevier Inc. (2011) (2nd edition also acceptable)
[DBR] Open Intro Statistics, by David M. Diez, Christopher D. Barr, and Mine Cetinkaya-Rundel, 3rd edition, 2015 (https://www.openintro.org/stat/textbook.php)
[DSB] Data Science for Business: What you need to know about data mining and data-analytic thinking by Foster Provost & Tom Fawcett
[PGDM] A Programmer’s Guide to Data Mining by Ron Zacharski
Other technology requirements / equipment / material
Programming languages, platforms, as well as software applications and tools, such as Spark, Mahout, R/RStudio, Python, WEKA, Tableau, etc. that will be required for this class are either free to download, or the instructor will provide the students with academic licenses. Students will be informed in class and via Canvas ahead of time in order to install all required software.
Course Requirements and Assignments
SJSU classes are designed such that in order to be successful, it is expected that students will spend a minimum of forty-five hours for each unit of credit (normally three hours per unit per week), including preparing for class, participating in course activities, completing assignments, and so on. More details about student workload can be found in University Policy S12-3 at http://www.sjsu.edu/senate/docs/S12-3.pdf.
Descriptions of Assignments/Exams
In-class & social network assignments: Students will be evaluated based on their participation in in-class and social network assignments. All students are required to write their names on the submitted work and/or submit their answers online using their unique IDs, shared with the instructors. Failing to do so, even if the student was indeed present in the class, will result in zero credit as the instructor is unable to verify the student’s claims. Moreover, students whose name appears on submitted work, but were not in class, as well as the students who submitted their name on their behalf are violating the academic integrity policy and will be reported immediately to the office of Student Conduct and Ethical Development.
Short story assignment: Each student will be assigned a week during which they have to find, and present in class, a news story (e.g. about a new algorithm/technology etc.) that is related to the topics covered in class. This story has to have some technical background and the student will be asked to elaborate and analyze the elements pertaining to the class as well as carry a discussion with the rest of the group based on it. The peers will be able to provide feedback, taken into account for the grade.
Individual written/programming assignments and pop quizzes: Students will be provided with handouts describing the assignments and how they will be graded every week. These assignments will be in-class or take-home written assignments, in-class or take-home lab assignments, and presentation assignments for research papers or articles. Students will also have to answer to pop quizzes that will be based on the homework assignment that is due that day. The worst pop quiz grade will not be counted towards the final pop quiz grade of each student (“worst-one out policy”).
Term project: Groups of 3 students will be formed to work on a term-long group project related to data or web mining. The project has deliverables throughout the semester. The quality and completeness of all the deliverables will be considered in grading the projects. All projects will be demonstrated in class. The project details will be announced by the instructor and posted on the course’s website well before the deadlines.
Each group member is expected to participate in every phase of the project. The final grade of each member will be proportional to his/her participation in the group, as assessed by the instructor and the student’s peers. Each member should be able to answer questions regarding the project, present some part of the project demo, and participate in the system implementation and the writing of the technical reports. The term project will be graded on the basis of the following three components: a) project implementation, b) project report, c) project demonstration.
Extra-credit project: Students may choose to work on an optional individual extra-credit project designed to compare the efficiency of data mining algorithms. The project will require programming in at least one low-level programming language (C, C++, or Java) and one scripting/high-level programming language (Python, Pearl, R, Matlab, etc). Students will be required to work on their own and will not be allowed to share their solutions with others. The extra-credit project will be graded on the basis of the following three components: a) project implementation, b) solution efficiency, c) project report.
Exams: Exams will be a combination of multiple choice and short answer questions and will be based on the individual assignments and course material covered in class.
NOTE that University policy F69-24 at http://www.sjsu.edu/senate/docs/F69-24.pdf states that “Students should attend all meetings of their classes, not only because they are responsible for material discussed therein, but because active participation is frequently essential to insure maximum benefit for all members of the class. Attendance per se shall not be used as a criterion for grading.”
Students cannot take this class without fulfilling its prerequisite or obtaining instructor approval. Please note that, according to department policy, "students who do not provide documentation of having satisfied the class prerequisite and co-requisite requirements (if any) by the second class meeting will be dropped from the class."
Final Examination or Evaluation
This course has a comprehensive final exam.
Grading Information
The final grades will be calculated based on the following:
(A+) >= 98,
- >= 94 and <98
(A-) >= 90 and <94
(B+) >= 85 and <90
- >= 75 and <85
(B-) >= 70 and <75
(C+)>= 68 and <70,
- >= 64 and <78
(C-) >= 60 and <64,
(D) >= 50 and <60,
(F) < 50
- Students must obtain 60% or more in all required components of the course in order to get a passing grade (B or better) in this class
- No late assignments will be accepted. An extension will be granted only if a student has serious and compelling reasons that can be proven by an independent authority (e.g. doctor’s note if the student has been sick).
- The exam dates are final.
All students have the right, within a reasonable time, to know their academic scores, to review their grade-dependent work, and to be provided with explanations for the determination of their course grades.
Determination of Grades
The percentage weight assigned to class assignments is listed below. Detailed grading rubrics for the short-story assignment, team and extra-credit projects, and exact due dates for each assignment will be posted on Canvas. Students will have at one week to complete each homework assignment.
In-class & online activities |
5% |
Short story assignment |
10% |
Individual homework assignments |
10% |
Pop quizzes |
5% |
Term project |
20% |
Extra-credit project (optional) |
10% |
Midterm exam |
20% |
Final exam (comprehensive) |
30% |
Classroom Protocol
- Students are expected to arrive in time for class. While in class, they need to turn off cellphones unless directed otherwise by the instructor. Laptop/tablet/smart phone use is allowed only for activities related to the class.
University Policies
Per University Policy S16-9, university-wide policy information relevant to all courses, such as academic integrity, accommodations, etc. will be available on Office of Graduate and Undergraduate Programs’ Syllabus Information web page at http://www.sjsu.edu/gup/syllabusinfo/”
CMPE 255-01 / Data Mining, Fall 2023, Course Schedule
The schedule (and related dates/readings/assignments) is tentative and subject to change with fair notice. In case of guest lectures, the syllabus will be updated accordingly. Any changes will be announced in due time in class and on the course’s web site (Canvas). The students are obliged to consult the most updated and detailed version of the reading material and syllabus, which will be posted on Canvas.
Course Schedule
Wk |
Date |
Topics |
Readings |
Deadlines |
1 |
8/21 |
Introduction to Data Mining/Data Science : AI & Data Science, overall Landscape, Auto Data Science, process(CRISP-DM), tools and techniques, open source/commercial packages (scikit, Spark, MLflow), disciplines and fundamentals, kaggle competitions, Big Data and tools. |
A:1 and various papers/products |
|
2 |
8/28 |
Data Preparation & Data Proximity. Auto Data Cleaning, Data Loading, feature selection/cleansing, feature engineering, feature similarity, Data Engineering pipelines with big data, design patterns |
A:2-3, various papers/kaggle competition learnings |
|
3 |
9/11 |
Finding Nearest Neighbors, vector databases, Dimensionality reduction, state of art advances, libraries and tools and real world applications |
LRU:3, various papers |
HW 1 |
5 |
9/18 |
Association Analysis : Mining Frequent Patterns, Associations, and Correlations, Alternative models, concepts and algorithms and applications in real world |
A:4-5, TSK:6, various papers |
|
6 |
9/25 |
Classification : feature selection, model eval, various techniques and state of art. |
A:10-11, TSK:4, skim JWHT:3-4&9, various papers |
Project proposal |
7 |
9/25 |
Midterm exam |
||
8 |
10/2 |
Predictive modeling with Regression, other techniques |
-ditto- |
|
9 |
10/16 |
Clustering - 1 - basics |
A:6-9, TSK:8, various papers |
HW 4 |
10 |
10/23 |
Clustering - 2 with real world applications and tools |
skim MRS:1-2,6&19 |
HW 5 |
11 |
10/30 |
Text Mining and Graph Mining Overview |
MRS:21, skim A:17-18 |
|
12 |
11/6 |
Advanced topics, trends in data mining, spatial and temporal data |
Various papers |
Project report, code & slides |
13 |
11/13 |
Outlier detection, Mining data streams and time series |
Various papers |
HW 6 |
14 |
11/20 |
Data Mining as a service, Auto Data Science, industry tools and applications, data governance and introduction to privacy preserving data mining. |
Various papers |
|
15 |
11/27 |
Group Project Presentations |
||
16 |
12/2 |
Group Project Presentations |
||
17 |
12/8 |
Final Exam, TBA, 2:45 pm -- 5:00 pm |
Course Summary:
Date | Details | Due |
---|---|---|