Zenodo spam detection based on neural networks - OPEN REPOSITORIES 2021, June 7-10.

This poster is part of the Open Repositories 2021 Poster Session which takes place in the week of June 7-10. We encourage you to ask questions and engage in discussion on this poster by using the comments feature. Authors will respond to comments during this week.

Authors:

Pablo Panero; A Ioannidis; JB Gonzalez

Poster description:

Nobody wants to receive unwanted content such as spam. The growth of spam has become a problem in our digital era, and digital repositories are no exception. Hosting spam affects a service in several ways: the hardware cost of storing it, skewed usage statistics, the distribution of material that violates copyright, and, most importantly, serving undesired content to users. Zenodo is a generalist research repository fostering open science practices. As the barrier for submissions is low, it is an easy target for spam. The repository’s staff has spent many hours manually detecting spam content, a process now assisted by an automated spam classification system whose results are not yet satisfactory. Work to improve this classifier drew on an in-depth study of Zenodo’s data, a descriptive analysis, and feature extraction to corroborate expert knowledge gathered over the years by Zenodo’s staff, as well as on a literature review of related topics such as spam classification in emails. Several types of neural network models were tested, and they show promising results for future integration. However, as their false positive rate is still unacceptable, Random Forest classifiers still prevail over neural network models.
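The false positive rate mentioned above is the share of legitimate records that a classifier wrongly flags as spam. The following is a minimal sketch of how two models could be compared on that metric. The labels, the names `model_a` and `model_b`, and the helper `false_positive_rate` are hypothetical and for illustration only; they are not Zenodo's data, models, or code.

```python
def false_positive_rate(y_true, y_pred):
    """Fraction of legitimate records (label 0) wrongly flagged as spam (label 1)."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    negatives = false_pos + true_neg
    if negatives == 0:
        raise ValueError("y_true contains no legitimate records")
    return false_pos / negatives


def recall(y_true, y_pred):
    """Fraction of spam records (label 1) that the model catches."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(1 for t in y_true if t == 1)
    if positives == 0:
        raise ValueError("y_true contains no spam records")
    return true_pos / positives


# Made-up labels for illustration: 1 = spam, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
model_a = [0, 0, 0, 0, 1, 0, 1, 0]  # hypothetical cautious model
model_b = [0, 1, 0, 1, 1, 1, 1, 0]  # hypothetical aggressive model

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    fpr = false_positive_rate(y_true, preds)
    rec = recall(y_true, preds)
    print(f"{name}: false positive rate={fpr:.2f}, recall={rec:.2f}")
```

In this toy data the aggressive model catches every spam record but flags two of the five legitimate ones. For a repository, that trade-off can be unacceptable, because each false positive is a genuine submission treated as spam.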

About the authors:

Pablo Panero (Presenting author) works at CERN as a software engineer for Zenodo, an open data service for the long tail of science. Pablo has been working on the Invenio framework for several years, most recently as a core developer of the new InvenioRDM. Previously, Pablo was the lead developer of CERN Search, the Invenio-based enterprise search solution.

Alex Ioannidis works at CERN’s IT department and is currently responsible for Zenodo, an open data service for the long tail of science. Alex has many years of experience as a software engineer, having led the design and operation of many production-level commercial and open-source software systems. He is also a core maintainer of the open-source Invenio framework for digital repositories, and he has worked on promoting software as a first-class citizen in the scientific research world.

José Benito González López leads the CERN section that is in charge of Invenio, the Digital Repository Framework, and several services that are running on top of it: CERN Document Server – CERN’s institutional repository, Zenodo – open data service for the long-tail of science, CERN Open Data Repositories, Digital Memory projects and also backend development of B2Share (EUDAT service). José is also a very experienced open source software developer and project manager with more than 15 years of experience, many of them devoted to the Open Source Project Indico which is the result of a European Project with the same name.

2 thoughts on “Zenodo spam detection based on neural networks”

Eric Olson says:

June 9, 2021 at 8:57 AM

Hey, Pablo,

Thank you for sharing your spam experiences. Your talk is both scary and a bit reassuring; I’m sure that OSF is one that popped up in your searches. We have been in a tug of war with this for a while now. I would love to chat with you about this sometime.
1. Pablo Panero says:
  
  June 9, 2021 at 11:05 AM
  
  Hi Eric!
  
  I did anonymize the results ;P… I’m more than happy to chat about it. Drop me an e-mail (pablo.panero@cern.ch) and we can set up a time and have an (in)formal talk.
  
  Looking forward,
  Pablo
