
IMDB Reviews Dataset - Large Movie Review Dataset

ybouane about a year ago 1.0.4 FREE


Files: test.csv (32MB), unrated.csv (66MB), train.csv (33MB)


Public Domain
# Overview

This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered and how to use the files provided.

# Dataset

The core dataset contains 50,000 reviews split evenly into a 25k train set and a 25k test set. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance gain is obtained by memorizing movie-unique terms and their association with observed labels.

In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there are equal numbers of reviews with scores > 5 and <= 5.

# Source

Publications using the dataset must cite:

*Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*
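
The labelling thresholds given under Dataset can be expressed as a small function. This is only a sketch of the stated rule (score <= 4 is negative, score >= 7 is positive, anything in between is left out of the labeled sets). The function name and the 0/1/None return convention are assumptions made for illustration, not part of the dataset.

```python
from typing import Optional


def polarity_from_rating(rating: int) -> Optional[int]:
    """Map a star rating out of 10 to a binary sentiment label.

    Returns 0 for negative (rating <= 4), 1 for positive (rating >= 7),
    and None for the neutral ratings that the labeled sets exclude.
    """
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None  # ratings 5 and 6 do not appear in train/test


# Ratings on either side of each threshold
labels = [polarity_from_rating(r) for r in (4, 5, 6, 7)]
```

Applied to the unrated set, the same function would return None for any review whose rating falls in the neutral band, which is one way to see why those reviews have no polarity label.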


Test set (imdb/test)

Columns: ID, Positive Review, Review 10-star rating, Review Text, Movie Url

Train set (imdb/train)

Columns: ID, Positive Review, Review 10-star rating, Review Text, Movie Url
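
A labeled file with the five columns listed above could be read with Python's standard `csv` module along the following lines. This is a minimal sketch under several assumptions: the header names (`id`, `positive`, `rating`, `text`, `movie_url`) are hypothetical stand-ins for whatever the first line of the CSV actually contains, the encoding of the Positive Review column as `1`/`0` or `true`/`false` is a guess, and the two sample rows are invented for the demonstration rather than taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO

# Hypothetical header names: inspect the real file and adjust these.
COL_ID = "id"
COL_LABEL = "positive"
COL_RATING = "rating"
COL_TEXT = "text"
COL_URL = "movie_url"


@dataclass
class Review:
    review_id: str
    positive: bool
    rating: int
    text: str
    movie_url: str


def read_labeled(handle: TextIO) -> Iterator[Review]:
    """Yield one Review per CSV row from an open text handle."""
    for row in csv.DictReader(handle):
        yield Review(
            review_id=row[COL_ID],
            # Assumed encoding of the label column: "1"/"true" means positive.
            positive=row[COL_LABEL].strip().lower() in {"1", "true"},
            rating=int(row[COL_RATING]),
            text=row[COL_TEXT],
            movie_url=row[COL_URL],
        )


# In-memory stand-in for a labeled file (invented rows, not dataset content).
sample = io.StringIO(
    "id,positive,rating,text,movie_url\n"
    '0,1,9,"Great, moving film.",http://example.com/m1\n'
    '1,0,2,"Dull and overlong.",http://example.com/m2\n'
)
reviews = list(read_labeled(sample))
```

With the real file, the handle would come from something like `open("train.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is likewise an assumption. Quoted fields matter here because review text routinely contains commas.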

Unrated reviews set (imdb/unrated)

Columns: ID, Review Text, Movie Url
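
Since every file carries a Movie Url column, the two structural properties stated in the dataset description (train and test share no movies, and no movie has more than 30 reviews across the collection) can be checked directly. The sketch below assumes the Movie Url values have already been pulled out of each file into plain lists; the helper names and the toy URLs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def shared_movies(train_urls: Iterable[str], test_urls: Iterable[str]) -> Set[str]:
    """Return movie URLs that occur in both splits (expected to be empty)."""
    return set(train_urls) & set(test_urls)


def movies_over_cap(urls: Iterable[str], cap: int = 30) -> Dict[str, int]:
    """Return {movie_url: review_count} for movies with more than `cap` reviews."""
    counts = Counter(urls)
    return {url: n for url, n in counts.items() if n > cap}


# Toy data standing in for the Movie Url columns (not real dataset values).
train_urls = ["m1", "m1", "m2"]
test_urls = ["m3", "m4"]
unrated_urls = ["m1", "m5"]

overlap = shared_movies(train_urls, test_urls)
# The 30-review cap is stated for the entire collection, so count all files together.
violations = movies_over_cap(train_urls + test_urls + unrated_urls)
```

Both results should be empty on the real files if the description holds; a non-empty `overlap` would mean a model could score well on the test set by memorizing movie-specific terms.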

