Propose new functionalities for Decidim software

#DecidimRoadmap Designing Decidim together

Phase 1 of 1
Open 2019-01-01 - 2030-12-31

Use content classification systems for better SPAM detection

Antti Hukkanen
03/03/2021 18:09

**Is your feature request related to a problem? Please describe.**

SPAM users are becoming a bigger and bigger problem for all Decidim instances. They register profiles to put advertisements in their profile bio or a SPAM link in their personal URL, and they flood the comments section with SPAM.

This is a real issue that causes a lot of extra work for the moderators of the platform. We should apply some automation to ease their work.

**Describe the solution you'd like**

There is a gem named Classifier Reborn which provides two alternative content classification algorithms (a minimal usage sketch follows the links below):

  • Bayes - The classifier is trained with a predefined set of sentences labelled as good or bad. When classifying new content, it compares the word frequencies of that content against the trained database and returns the probability that the content is good or bad.
  • Latent Semantic Indexer (LSI) - Works with similar logic as above but adds semantic indexing to the equation. Slower but more flexible.

More information is available at:
https://jekyll.github.io/classifier-reborn/
https://github.com/jekyll/classifier-reborn
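
For illustration, here is a minimal sketch of how the gem's classifiers are used, following the Classifier Reborn documentation; the category names and training sentences are made up for the example:

```ruby
require 'classifier-reborn'

# Train a Bayes classifier on two categories. The training sentences
# below are placeholders; a real deployment would feed in a curated
# ham/spam corpus.
classifier = ClassifierReborn::Bayes.new('Ham', 'Spam')
classifier.train('Ham', 'I would like to propose a new participatory budgeting phase for our district.')
classifier.train('Spam', 'Cheap backlinks and followers, click here for the best SEO deals!!!')

# classify returns the winning category; classify_with_score also
# returns the classifier's internal score for that category.
puts classifier.classify('Visit my site for cheap SEO backlinks')
# Most likely "Spam" here, given the overlapping vocabulary.

# The LSI classifier exposes a similar interface:
# lsi = ClassifierReborn::LSI.new
# lsi.add_item('Legitimate proposal text', :ham)
# lsi.add_item('Spammy advertisement text', :spam)
# lsi.classify('New content to score')
```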

Based on one of these algorithms, we could calculate a SPAM probability score for any content the user enters, as well as for the user profile itself whenever it is updated, because in recent years we have seen many users create SPAM profiles just to get a backlink to their site for improved SEO scores.
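
As a rough sketch of that idea, a trained classifier could be wrapped in a small scoring object and called for comments, proposal bodies, or the bio and personal URL whenever a profile is updated. The class below is hypothetical and not an existing Decidim API; it only assumes a classifier like the one above:

```ruby
# Hypothetical scoring wrapper; not an existing Decidim API. It only
# assumes a classifier object that responds to #classify as above.
class SpamScorer
  def initialize(classifier)
    @classifier = classifier
  end

  # True when the classifier assigns the text to the 'Spam' category.
  # classify_with_score could be used instead to expose a raw score,
  # but its scale depends on the chosen backend.
  def spam?(text)
    @classifier.classify(text.to_s) == 'Spam'
  end

  # Check every user-submitted text: a comment body, a proposal body,
  # or the profile bio and personal URL when the profile is updated.
  def flag_for_moderation?(*texts)
    texts.compact.any? { |text| spam?(text) }
  end
end

# scorer = SpamScorer.new(classifier)
# scorer.flag_for_moderation?(user_bio, user_personal_url)
```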

**Describe alternatives you've considered**

  • Manually moderating all users/content that are considered SPAM - very labour-intensive
  • Using third-party APIs to detect SPAM, but they are unlikely to be any better than what is suggested above, and they come with a cost (or, alternatively, a privacy impact)

**Additional context**

The suggested content classification systems with their predefined databases are likely to work only for English. I haven't dug deeper into whether such databases are available for other languages.

But, in our experience, most of the SPAM users spam in English, so I think such classification systems could solve the problem at least for English SPAM.

If the classification needs to be applied to other languages as well, there could be a way to train the system further with additional datasets (see the sketch below). By default, it could be trained only in English, which would already get rid of most of the SPAM users.
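
A minimal sketch of what per-language training could look like, assuming one curated CSV of labelled sentences per locale; the file names, CSV layout and locale list are invented for the example:

```ruby
require 'classifier-reborn'
require 'csv'

# One classifier per locale, each trained from its own labelled dataset.
# The file names, the CSV layout (category,text) and the locale list are
# assumptions for this sketch.
classifiers = {}
%w[en ca es].each do |locale|
  path = "spam_training/#{locale}.csv"
  next unless File.exist?(path)

  classifier = ClassifierReborn::Bayes.new('Ham', 'Spam')
  CSV.foreach(path, headers: true) do |row|
    classifier.train(row['category'], row['text'])
  end
  classifiers[locale] = classifier
end

# Content would then be scored with the classifier for its locale,
# falling back to English when no dataset exists for that language:
# (classifiers[locale] || classifiers['en']).classify(text)
```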

**Could this issue impact users' private data?**

No.

**Funded by**

No funding available.

But once the spammers cause a big enough amount of needless work, I'll refer to this issue to see whether we could find someone to fund the development.


List of Endorsements

  • Ivan Vergés
  • Virgile Deville
  • Romy Grasgruber-Kerl

Endorsements count: 3

Reference: MDC-PROP-2021-03-16256
Version 3 (of 3)


1 comment

Virgile Deville
03/03/2021 19:32

Yes! This is definitely a growing issue.
With @moustachu we found a way to identify them quite easily (sketched as a query below); they usually:
- Have not confirmed their email
- Have filled in the profile description in English
- Have filled in the profile link field
- Have no participation
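
For illustration, those heuristics could be expressed roughly as the query below; the column names (confirmed_at, about, personal_url) follow common Decidim/Devise conventions but are assumptions here, and the language and participation checks are only placeholders:

```ruby
# Rough sketch of these heuristics as an ActiveRecord query. The column
# names are assumptions; the "written in English" and "no participation"
# checks are left as placeholders because they need a language-detection
# pass and joins over several tables.
suspicious_users = Decidim::User
                   .where(confirmed_at: nil)            # email never confirmed
                   .where.not(about: [nil, ''])         # profile description filled in
                   .where.not(personal_url: [nil, ''])  # profile link field filled in
# suspicious_users = suspicious_users.where.not(id: users_with_any_participation)
```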

Linking these proposals as they seem related:
Detect the use of spam-bots and ban non compliant users
Registro de bots (spam)

Comment moderated on 25/01/2022 09:48

