When someone publishes a new proposal, a list of similar entries is displayed to help avoid duplicates. The current recommendation algorithm computes the similarity of each pair of proposals by comparing their trigrams (sequences of three characters). This method, however, does not take the semantics of the text into account, and could be improved with simple Machine Learning techniques.
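To illustrate the limitation, here is a minimal sketch of a trigram comparison. It is not the code Decidim actually runs: the function names `trigrams` and `trigram_similarity`, the whitespace normalization, and the use of Jaccard overlap as the score are all assumptions made for the example.

```python
from __future__ import annotations


def trigrams(text: str) -> set[str]:
    """Return the set of 3-character sequences in the normalized text."""
    # Assumption: lowercase and collapse whitespace before extracting trigrams.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets (assumed scoring rule)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Two proposals that share wording score high...
shared_wording = trigram_similarity("more bike lanes", "more bike lanes downtown")
# ...while a paraphrase with the same meaning shares no trigrams at all.
paraphrase = trigram_similarity("more bike lanes", "expand the cycling network")
print(shared_wording, paraphrase)
```

The second pair describes the same idea in different words, yet a purely character-based score cannot detect it.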
We suggest using a technique based on word embeddings, which lets us assign each proposal a multi-dimensional vector in such a way that semantically similar proposals end up with vectors that are close together. The recommendations for a given proposal would then be the proposals whose vectors lie at the smallest distance from its own.
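As a rough sketch of the recommendation step, the snippet below ranks proposals by distance to a query proposal. The proposal does not name a distance metric, so cosine distance is an assumption here, as are the function names `cosine_distance` and `recommend`, the proposal ids, and the toy 3-dimensional vectors.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def recommend(query_id, vectors, k=3):
    """Return the ids of the k proposals closest to the query proposal."""
    query = vectors[query_id]
    ranked = sorted(
        (cosine_distance(query, vec), pid)
        for pid, vec in vectors.items()
        if pid != query_id
    )
    return [pid for _, pid in ranked[:k]]


# Hypothetical proposal vectors, invented purely for illustration.
vectors = {
    "bike-lanes": [0.9, 0.1, 0.0],
    "cycling-network": [0.8, 0.2, 0.1],
    "library-hours": [0.0, 0.1, 0.9],
}
print(recommend("bike-lanes", vectors, k=2))
```

With real data the vectors would have many more dimensions, and for a large number of proposals an approximate nearest-neighbour index might be preferable to sorting every candidate.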
To calculate the vector for each proposal, we suggest using pre-calculated embeddings for each word (restricted to the most frequent words in the Decidim vocabulary) and then averaging the vectors of all the words that appear in the proposal. The word vectors could be pre-calculated offline by anyone with intermediate knowledge of NLP (DataForGoodBCN, the community that created this proposal, could provide these calculations).
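The averaging step could look roughly like the following. This is only a sketch under stated assumptions: the function name `proposal_vector`, the simple regex tokenizer, the choice to skip out-of-vocabulary words, the zero-vector fallback, and the toy 2-dimensional word vectors are all hypothetical.

```python
import re


def proposal_vector(text, word_vectors, dim):
    """Average the embeddings of the known words in a proposal's text."""
    # Assumption: lowercase the text and split on runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Words without a pre-calculated vector are skipped.
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # Assumption: fall back to a zero vector when no word is known.
        return [0.0] * dim
    # Component-wise mean across all known word vectors.
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical pre-calculated word vectors, invented for illustration.
word_vectors = {
    "bike": [1.0, 0.0],
    "lanes": [0.0, 1.0],
}
print(proposal_vector("More bike lanes", word_vectors, dim=2))
```

A plain average treats every word equally; weighting words by how informative they are (for example with TF-IDF) is a common refinement, but it goes beyond what this proposal describes.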
This proposal has been accepted and is under development.