An Approach for Evaluating Semantic Similarity in Research Papers via Siamese BERT Architecture

EasyChair Preprint 15448

12 pages. Date: November 20, 2024

Abstract

Document similarity analysis is critical for NLP tasks such as information retrieval and plagiarism detection. Traditional methods based on word-to-word mapping struggle to capture contextual nuances, and existing solutions lack domain-specific accuracy and do not offer an enriched search experience. One such application is finding similar research papers. Researchers often struggle to find papers similar to a given paper and have to rely on basic keyword-based search, which fails to surface the best match based on the overall context. In this work, we propose a novel methodology that integrates BERT with a Siamese Neural Network to capture the semantic textual similarity of research papers. Our approach goes beyond simple similarity evaluation: it conducts a nuanced semantic search over the overall context and provides a representative similarity score, offering a more accurate and refined search experience. Furthermore, we curate a dataset of over 10,000 NLP research paper abstracts to train our model. The model excels at identifying contextual relationships between documents, making it highly effective for domain-specific applications. It can significantly improve the user experience in document retrieval systems, particularly for academic research and recommendation.
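
The Siamese scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the pooling, the loss, or how the score is computed, so the sketch assumes cosine similarity over pooled embeddings. To stay self-contained it replaces the shared BERT tower with a hypothetical stand-in encoder (a hashed bag-of-words vector); the names `encode`, `cosine`, `siamese_similarity`, and `DIM` are illustrative assumptions. The point it shows is only the structure: both documents pass through the same encoder, and the two resulting vectors are compared to produce one similarity score.

```python
import hashlib
import math
import re

DIM = 64  # hypothetical embedding size for the stand-in encoder


def encode(text, dim=DIM):
    """Stand-in for the shared encoder tower (assumption, not BERT).

    Hashes each token into one of `dim` buckets and mean-pools the counts.
    In a real Siamese BERT setup this function would return a pooled
    BERT embedding; the same function is applied to both inputs.
    """
    vec = [0.0] * dim
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [v / n for v in vec]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def siamese_similarity(doc_a, doc_b):
    """Encode both documents with the SAME encoder, then compare them."""
    return cosine(encode(doc_a), encode(doc_b))


if __name__ == "__main__":
    query = "Siamese BERT for semantic similarity of research papers"
    candidates = [
        "Semantic similarity of research papers with a Siamese BERT model",
        "A survey of graph colouring heuristics",
    ]
    # Rank candidate abstracts by their similarity score to the query.
    for text in sorted(candidates, key=lambda c: -siamese_similarity(query, c)):
        print(round(siamese_similarity(query, text), 3), text)
```

Because the stand-in encoder only counts hashed tokens, it reflects lexical overlap rather than context; swapping `encode` for a pooled BERT embedding is what would let the same comparison step capture the contextual similarity the abstract describes.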

Keyphrases: BERT, Data Science, NLP, Siamese Neural Network, semantic similarity

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:15448,
  author    = {Pritam Sarkar and Soumyaneel Sarkar and Wazib Ansar and Amlan Chakrabarti},
  title     = {An Approach for Evaluating Semantic Similarity in Research Papers via Siamese BERT Architecture},
  howpublished = {EasyChair Preprint 15448},
  year      = {EasyChair, 2024}}