This poster is part of the Open Repositories 2021 Poster Session, which takes place during the week of June 7-10. We encourage you to ask questions and join the discussion on this poster using the comments feature. Authors will respond to comments during this week.
Pablo Panero; A Ioannidis; JB Gonzalez
Nobody wants to receive unwanted content such as spam. Spam has become a growing problem in the digital era, and digital repositories are not exempt. Hosting spam affects a service in several ways: it incurs hardware costs for storage, skews usage statistics, may distribute material that violates copyright, and, most importantly, serves undesired content to users. Zenodo is a generalist research repository that fosters open science practices. Because its barrier to submission is low, it is an easy target for spam. The repository’s staff has spent many hours detecting spam manually, a process now assisted by an automated spam classification system whose results are not yet satisfactory. Improvements to this classifier were based on an in-depth study of Zenodo’s data, a descriptive analysis, and feature extraction to corroborate the expert knowledge gathered over the years by Zenodo’s staff, as well as on a literature review of related topics such as spam classification in email. Several types of neural network models were tested and showed promising results for future integration. However, because their false positive rate is still unacceptable, Random Forest classifiers still prevail over the neural network models.
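The false positive rate mentioned above is the share of legitimate records that a classifier wrongly flags as spam. The following is a minimal sketch of how that metric is computed, not the evaluation code used for Zenodo; the function name, the label encoding (True = spam) and the example labels are all assumptions made for illustration.

```python
def false_positive_rate(y_true, y_pred):
    """Return FP / (FP + TN), where 'positive' means 'flagged as spam'."""
    # False positive: a legitimate record (label False) flagged as spam.
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    # True negative: a legitimate record correctly left unflagged.
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    if fp + tn == 0:
        raise ValueError("y_true contains no legitimate records")
    return fp / (fp + tn)


# Hypothetical labels: True = spam, False = legitimate record.
y_true = [False, False, False, False, True, True]

# Classifier A catches all spam but flags one legitimate record.
y_pred_a = [False, False, False, True, True, True]
# Classifier B misses one spam record but flags no legitimate record.
y_pred_b = [False, False, False, False, True, False]

fpr_a = false_positive_rate(y_true, y_pred_a)
fpr_b = false_positive_rate(y_true, y_pred_b)
print(f"FPR A: {fpr_a:.2f}, FPR B: {fpr_b:.2f}")
```

In a repository setting a false positive means a genuine submission is treated as spam, which is why a model with otherwise promising results can still be passed over in favour of one with a lower false positive rate.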
About the authors: