It is one of the most popular and frequently cited data sets in data science. It offers the exciting opportunity of building one's own movie recommendation engine and is available in several sizes.
The smallest set, meant for education and development, contains 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. The largest set meant for the same purpose contains 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users.
There is also a stable benchmark data set of 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.
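As a rough illustration of what a small recommendation engine over user ratings could look like, here is a minimal sketch of item-based collaborative filtering with cosine similarity. The toy ratings, the movie labels and the function names are all hypothetical and are not taken from the data set; a real engine would read the ratings from the data set's files and would need to handle far more users and movies.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {movie: rating}); not real data.
ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0},
    "u3": {"A": 1.0, "C": 5.0, "D": 4.0},
    "u4": {"B": 4.0, "C": 2.0, "D": 1.0},
}


def item_vectors(ratings):
    """Invert user -> movie ratings into movie -> {user: rating}."""
    items = {}
    for user, movies in ratings.items():
        for movie, score in movies.items():
            items.setdefault(movie, {})[user] = score
    return items


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=2):
    """Rank unseen movies by a similarity-weighted average of the user's ratings."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        num = den = 0.0
        for movie, score in seen.items():
            sim = cosine(items[candidate], items[movie])
            num += sim * score
            den += sim
        if den > 0:
            scores[candidate] = num / den
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


recommendations = recommend("u2", ratings)
```

Here `recommendations` holds the movies that user `u2` has not rated yet, ordered by predicted score. Item-based similarity is only one possible design; user-based or matrix-factorisation approaches are common alternatives.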
Project 2: Explore the data, and provide insights and forecasts about crimes in Chicago.
With the increasing demand to analyse large amounts of data within short time frames, organisations prefer working with the full data rather than with samples. This is a herculean task for a data scientist working under time constraints.
Extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system, this data set contains information on reported incidents of crime in the city of Chicago from 2001 to the present, excluding the most recent seven days. Murders, for which data is recorded for each victim, are not included in the data set.
It contains 6.51 million rows and 22 columns, and is a multi-class classification problem. For anyone who wants to master working with abundant data, this data set is an ideal stepping stone towards tackling truly mountainous data.
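With millions of rows, it can help to stream the file instead of loading it all at once. The following is a minimal sketch, assuming a CSV export with a header row; the column name `Primary Type` and the three sample rows are assumptions made for illustration, not a guaranteed description of the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row extract; column names are assumed for illustration.
SAMPLE = """ID,Date,Primary Type,Arrest
1,01/01/2020,THEFT,false
2,01/02/2020,BATTERY,true
3,01/03/2020,THEFT,false
"""


def count_by_category(lines, column="Primary Type"):
    """Stream rows one at a time and tally the values of one column."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


counts = count_by_category(io.StringIO(SAMPLE))
```

For the real data, the same function could be given an open file handle in place of `io.StringIO(SAMPLE)`, so that only one row is held in memory at a time.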
Project 3: Predict whether income exceeds $50,000 per year.
It contains extracted, weighted census data with 41 employment-related and demographic variables.
While the original table contained 199,523 rows and 42 columns, the newer, refined versions of the data set contain between 14 and 16 columns and more than 30,000 rows. It is a commonly cited data set for KNN (k-nearest neighbors) and is a classification problem.
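To make the KNN idea concrete, here is a minimal sketch of a k-nearest-neighbors classifier in plain Python. The two features (age and weekly working hours, already scaled to the 0 to 1 range), the six training points and the function name `knn_predict` are hypothetical; the real data set has more columns and would need encoding and scaling first.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled features: (age, hours worked per week).
train = [
    ((0.20, 0.30), "<=50K"),
    ((0.30, 0.40), "<=50K"),
    ((0.25, 0.35), "<=50K"),
    ((0.70, 0.80), ">50K"),
    ((0.80, 0.90), ">50K"),
    ((0.75, 0.70), ">50K"),
]

prediction = knn_predict(train, (0.8, 0.8))
```

The classifier simply takes a vote among the `k` closest training points by Euclidean distance, which is why scaling the features beforehand matters.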
Project 4: Customer Predictive Lifetime Value Modelling
The aim is to model customers' purchasing behavior in order to predict their future activity. This can be done with models such as the beta-geometric binomial model, which estimates the probability that a customer is still alive, or the Gamma-Gamma model, which models the value of a customer's transactions. The data on customers' needs, expenses, recent purchases, etc. is first collected, classified and cleaned. Once the data has been processed, the models are used to spot the interdependencies between customers' choices and behaviors, which gives a better understanding of the customers.
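As a small worked example of the monetary side, the sketch below computes the conditional expected average transaction value as the Gamma-Gamma model is commonly formulated: a weighted average of the population mean and the customer's own observed mean. The parameter values `p`, `q` and `gamma` used here are hypothetical; in practice they would be estimated from transaction data, typically by maximum likelihood.

```python
def expected_avg_spend(p, q, gamma, x, m_x):
    """Conditional expected average transaction value (Gamma-Gamma model).

    x is the number of observed transactions and m_x their observed mean value.
    """
    if q <= 1:
        raise ValueError("the population mean is only defined for q > 1")
    population_mean = gamma * p / (q - 1)
    # Weight on the population mean shrinks as more transactions are observed.
    weight = (q - 1) / (p * x + q - 1)
    return weight * population_mean + (1 - weight) * m_x


# Hypothetical parameters and a customer with 2 purchases averaging 50.
estimate = expected_avg_spend(p=6.0, q=4.0, gamma=15.0, x=2, m_x=50.0)
```

A customer with no observed transactions falls back to the population mean, while a customer with many transactions is estimated almost entirely from their own history.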