Product Reviews for Ordinal Quantification

This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", published at ECML-PKDD 2022. The data is extracted from the McAuley data set of Amazon product reviews, where the goal is to predict the 5-star rating of each textual review.

We have sampled this data according to two protocols that are suited for quantification research. The goal of quantification is not to predict the star rating of each individual instance, but the distribution of ratings in sets of textual reviews. More generally speaking, quantification aims at estimating the distribution of labels in unlabeled samples of data.

The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task.

This data set comprises two representations of the McAuley data. The first representation consists of TF-IDF features and is sparse. The second representation is a RoBERTa embedding and is dense. In our experience, logistic regression classifiers work well with both representations, and RoBERTa embeddings yield more accurate predictors than the TF-IDF features.

You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is already public and we provide all of our extraction scripts.
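
The two sampling protocols described above can be illustrated with a minimal sketch. It is not the extraction code that ships with this data set. It assumes that "all distributions with equal probability" means a uniform draw from the probability simplex (a Dirichlet distribution with all parameters equal to 1) and that smoothness is scored by the sum of squared second-order differences between neighboring classes; the exact smoothness measure used for APP-OQ may differ. All function names (sample_app_prevalences, jaggedness, select_smoothest, draw_sample_indices) are hypothetical, and class labels are assumed to be the integers 0 to 4.

```python
import numpy as np


def sample_app_prevalences(n_samples, n_classes=5, seed=0):
    """APP sketch: draw label distributions uniformly from the simplex."""
    rng = np.random.default_rng(seed)
    # A Dirichlet with all-ones parameters is uniform over the simplex.
    return rng.dirichlet(np.ones(n_classes), size=n_samples)


def jaggedness(p):
    """Assumed smoothness score: sum of squared second differences.

    Lower values mean that neighboring classes have similar prevalences.
    """
    p = np.asarray(p, dtype=float)
    second_diff = -p[..., :-2] + 2 * p[..., 1:-1] - p[..., 2:]
    return np.sum(second_diff ** 2, axis=-1)


def select_smoothest(prevalences, keep_fraction=0.2):
    """APP-OQ sketch: keep only the smoothest fraction of APP samples."""
    scores = jaggedness(prevalences)
    n_keep = int(round(keep_fraction * len(prevalences)))
    order = np.argsort(scores)  # smoothest first
    return prevalences[order[:n_keep]]


def draw_sample_indices(labels, prevalence, sample_size, rng):
    """Draw item indices from a labeled pool to match a target prevalence."""
    labels = np.asarray(labels)
    # Number of items to draw from each class.
    counts = rng.multinomial(sample_size, prevalence)
    chosen = [
        rng.choice(np.flatnonzero(labels == c), size=k, replace=True)
        for c, k in enumerate(counts)
    ]
    return np.concatenate(chosen)


# Usage: 1000 APP distributions, of which the smoothest 20% form APP-OQ.
app = sample_app_prevalences(1000)
app_oq = select_smoothest(app, keep_fraction=0.2)

# Toy labeled pool (hypothetical): 50 reviews for each of the 5 ratings.
pool_labels = np.repeat(np.arange(5), 50)
indices = draw_sample_indices(
    pool_labels, app_oq[0], sample_size=100, rng=np.random.default_rng(1)
)
```

A quantifier would then be evaluated by comparing its estimated rating distribution on the items at those indices with the prevalence vector that generated the sample.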

Data and Resources
  • Zenodo link

    Site containing the files to download
Personal Data Attributes

Description: Personal Data related Information

Field Value
Anonymised Pseudo Anonymized
ChildrenData No
Cross Border Authorised Yes
Data Protection Impact Assessment Yes
Ethics Committee Approval Yes
General Data Yes
Informed Consent Template Yes
Personal Data No
Personal data was manifestly made public by the data subject No
Sensitive Data No
Additional Info
Field Value
Accessibility Both
Accessibility Mode Download
Availability On-Line
Basic rights Download
Creation Date 2022-09-16
Creator Bunse, Mirko, orcid.org/0000-0002-5515-6278
Dataset Citation Bunse, Mirko, Moreo, Alejandro, Sebastiani, Fabrizio, & Senz, Martin. (2022). Product Reviews for Ordinal Quantification (v0.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7081208
Dataset Re-Use Safeguards None
DiskSize 23.4 GB
External Identifier 10.5281/zenodo.7081208
Field/Scope of use Non-commercial research only
Format zip
Group Others
Language eng, English
License term 2022-09-16 / 2032-09-16
Manifestation Type Virtual
Processing Degree Secondary
Retention Period 2022-09-16 / 2032-09-16
Size 23.4 GB
Sublicense rights No
Territory of use World Wide
Thematic Cluster Text and Social Media Mining [TSMM]
system:type Dataset
Management Info
Field Value
Author Moreo Alejandro
Maintainer Moreo Alejandro
Version 1
Last Updated 24 June 2023, 01:09 (CEST)
Created 3 March 2023, 18:03 (CET)