Propose new functionalities for Decidim software
#DecidimRoadmap Designing Decidim together
Intelligent recommendations
When someone publishes a new proposal, a list of similar entries is displayed to avoid duplicates. The current recommendation algorithm computes the similarity of each pair of proposals by comparing their trigrams (sets of 3 characters). This method, however, does not take the semantics of the text into account and can easily be improved using simple Machine Learning techniques.
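To make the limitation concrete, here is a rough sketch of trigram similarity in Python, along the lines of what PostgreSQL's trigram extension computes (the example texts are made up):

```python
# A minimal sketch of trigram-based similarity; the example texts are made up.

def trigrams(text):
    """Return the set of 3-character substrings of a lowercased text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_similarity(a, b):
    """Share of common trigrams over all distinct trigrams of both texts."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(trigram_similarity("build more bike lanes", "build more bike paths"))      # high
print(trigram_similarity("build more bike lanes", "expand the cycling network"))  # low,
# even though the two proposals are close in meaning: the limitation described above.
```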
We suggest using a technique called word embeddings, which consists of assigning a multi-dimensional vector to each proposal in such a way that semantically similar proposals end up with close vectors. The recommendations for a given proposal would then be the proposals whose vectors lie at the smallest distances from it.
To calculate the vector associated with each proposal, we suggest using pre-calculated vector embeddings for each word (restricted to the most frequent words in the Decidim vocabulary) and then averaging the vectors of all the words appearing in the proposal. The pre-calculation of the word vectors could be done offline by anyone with an intermediate knowledge of NLP (DataForGoodBCN, the community that has created this proposal, could provide these calculations).
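As a rough illustration of this averaging scheme (the tiny 3-dimensional vectors below are placeholders standing in for real pre-computed embeddings, which would have hundreds of dimensions):

```python
import numpy as np

# Placeholder word vectors standing in for the pre-computed embeddings.
word_vectors = {
    "bike":    np.array([0.9, 0.1, 0.0]),
    "cycling": np.array([0.8, 0.2, 0.1]),
    "lanes":   np.array([0.1, 0.9, 0.0]),
    "park":    np.array([0.0, 0.2, 0.9]),
}

def proposal_vector(text):
    """Average of the vectors of the known words appearing in the proposal."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

new_proposal = "more bike lanes"
existing = ["cycling lanes downtown", "renovate the park"]

# Recommend the existing proposals whose vectors are closest to the new one.
ranked = sorted(existing,
                key=lambda p: cosine_distance(proposal_vector(new_proposal),
                                              proposal_vector(p)))
print(ranked)  # ['cycling lanes downtown', 'renovate the park']
```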
This proposal has been accepted and is under development
6 comments
Hi, I absolutely love this :)
We at Open Source Politics also had the idea of using NLP to enhance Decidim's proposal comparison tool. I wonder if using word embeddings is the right way to do this though, and I'd love to hear your point of view.
Indeed, and I may be wrong about this, but I think word embeddings preserve the semantic structure yet do not capture the subjects covered in the proposal. I guess using classification tools could help us analyse what people say and not just how they say it. Classifying proposals would (again, not an NLP expert) link proposals according to the vocabulary used.
Maybe a mix of these two methods could be relevant? Would it be too heavy on Decidim, just for the sake of a comparator?
Hey there! Loving it too!
I'm working with proposal recommendations in the AhoraNosTocaParticipar version of Decidim.
I was thinking that there are a couple of sets of word embeddings already computed in Spanish (I know of this one in Chile). As I understand it, there are a few techniques for searching for phrase similarities (https://medium.com/@adriensieg/text-similarities-da019229c894).
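If it helps, a quick sketch of loading such pre-computed vectors with gensim (the file name is an assumption; word2vec/fastText text-format files load this way):

```python
# Sketch of loading pre-computed Spanish word vectors with gensim; the file
# name is an assumption.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings-es.vec", binary=False)

# Nearest words by cosine similarity, useful as a sanity check of the vectors.
print(vectors.most_similar("bicicleta", topn=5))
```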
What do you think about creating a separate engine for this? Currently it is Postgres that acts as the text-similarity engine, so I think that if we create a separate engine and simply replace the calls in Decidim, it could work.
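To picture what that separate engine could look like, here is a very rough sketch of a small HTTP service that Decidim could call in place of the Postgres trigram query (the route, payload shape and in-memory storage are invented for illustration):

```python
# Very rough sketch of a standalone similarity service; the route, payload
# shape and in-memory storage are all invented for illustration.
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# In a real service these would be loaded at startup: pre-computed word
# vectors plus one averaged vector per existing proposal.
word_vectors = {}        # {word: np.ndarray}
proposal_vectors = {}    # {proposal_id: np.ndarray}

def embed(text):
    """Average of the known word vectors, as in the proposal above."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

@app.route("/similar", methods=["POST"])
def similar():
    query = embed(request.get_json()["text"])
    if query is None:
        return jsonify({"similar_proposal_ids": []})
    scores = {
        pid: float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for pid, vec in proposal_vectors.items()
    }
    top = sorted(scores, key=scores.get, reverse=True)[:10]
    return jsonify({"similar_proposal_ids": top})
```

Decidim's existing similar-proposals lookup would then simply POST the draft text to this service and show the proposals whose IDs come back.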
Hi. First of all, thank you Antoine and Felipe for your valuable comments.
About the last comment, we would need to assess the languages of the proposals in order to use the appropriate word embeddings. For the languages for which they already exist, we could certainly evaluate whether the existing ones are good enough, and using them would be an option. As Felipe says, our idea was to use these embeddings to create a new method for calculating the distances, replacing the current Postgres methodology.
Hi! Very fond of it too!
I've already run some experiments (at OSP) and I chose to work with CamemBERT alongside SBERT (a French version of BERT and a sentence encoder, respectively) to spare huge training costs and capture very precise semantic information at the sentence level (not only the word level).
For the time being I have not used any fine-tuning and have simply relied on the knowledge from CamemBERT's pre-training.
The final list of semantically related sentences is often linked by a common theme with the initial proposal. Yet this system disregards argumentative structures, and some of the closest pairings rely on a theme that we would not necessarily have chosen as the focus of the comparison. I think th
I'd be glad to discuss it in more detail if anyone is working through similar issues.
I just have one question left: any particular reason to rely on extracted word embeddings rather than a Transformer architecture?
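For anyone curious what the sentence-level route looks like in practice, here is a minimal sketch using the sentence-transformers library (the checkpoint name is just one multilingual SBERT-style model; a CamemBERT-based one would play the same role for French):

```python
# Minimal sketch of sentence-level similarity with sentence-transformers;
# the checkpoint name is an assumption, any SBERT-style model would do.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

proposals = [
    "Construire davantage de pistes cyclables",
    "Développer le réseau vélo de la ville",
    "Rénover le parc du quartier",
]
new_proposal = "Plus de pistes cyclables dans le centre"

proposal_embeddings = model.encode(proposals, convert_to_tensor=True)
query_embedding = model.encode(new_proposal, convert_to_tensor=True)

# Cosine similarity between the new proposal and every existing one.
scores = util.cos_sim(query_embedding, proposal_embeddings)[0]
for proposal, score in sorted(zip(proposals, scores), key=lambda x: -float(x[1])):
    print(f"{float(score):.2f}  {proposal}")
```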
Hello @DataForGoodBCN, as my colleagues (Quentin and Antoine) mentioned, we are also working on this.
How about we sync up in a call and see if we can join forces on this?
I'm sending you my contact details via DM.
Related to: Improve automatic comparison algorithm when submitting a proposal