An Enhanced LSI based Search Engine for Arabic Medical Documents

EasyChair Preprint 80, 6 pages • Date: April 21, 2018

Abstract

The vector space model (VSM) is widely used for representing text documents in data mining and information retrieval (IR) systems. However, this technique poses some challenges, such as a high-dimensional feature space and a loss of semantic information in the representation. Latent semantic indexing (LSI) was therefore proposed to reduce the feature dimensions and to generate semantically rich features that represent conceptual term-document associations. In particular, LSI has been successfully implemented in search engines and text classification tasks. In this paper, we propose a novel approach that enhances the standard LSI method by using cosine measures instead of word occurrences to form the LSI term-by-document matrix. We empirically evaluated the performance using an Arabic medical data collection that contains 800 documents with 47,222 unique words. A test set of five medical keywords was used to evaluate the quality of the top-20 retrieved documents using different numbers of singular values (i.e., different numbers of dimensions). The results show that the proposed method outperforms the standard LSI.

Keyphrases: Arabic text, Latent Semantic Indexing, dimensionality reduction, search engine
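
For readers unfamiliar with the baseline, the following is a minimal sketch of standard LSI retrieval as the abstract describes it: a term-by-document matrix of word occurrences, a truncated singular value decomposition, and a ranking of documents against a keyword query. It is not the enhanced cosine-based method proposed in the paper, whose construction the abstract does not detail. The toy English corpus, the function name lsi_rank, and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

# Toy corpus: hypothetical English stand-ins, not the paper's
# collection of 800 Arabic medical documents.
docs = [
    "diabetes insulin glucose",
    "insulin glucose blood",
    "heart blood pressure",
    "heart pressure artery",
]

# Term-by-document matrix of raw word occurrences (standard LSI input).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1.0


def lsi_rank(A, query_terms, k):
    """Rank documents against a keyword query in a k-dimensional LSI space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Each document projected onto the top-k left singular vectors:
    # U_k^T a_j equals column j of diag(s_k) @ Vt_k.
    doc_vecs = (s_k[:, None] * Vt_k).T

    # Project the query (a bag of keywords) into the same space.
    q = np.zeros(A.shape[0])
    for w in query_terms:
        if w in index:
            q[index[w]] += 1.0
    q_vec = U_k.T @ q

    # Cosine similarity between the query and every document.
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    sims = doc_vecs @ q_vec / np.where(denom == 0, 1.0, denom)

    # Document indices, most similar first.
    order = np.argsort(-sims)
    return order, sims


order, sims = lsi_rank(A, ["insulin"], k=2)
print(order[:2])  # indices of the two best-matching documents
```

Varying k here corresponds to evaluating retrieval quality with different numbers of singular values. A top-20 evaluation such as the one in the abstract would inspect order[:20] for each test keyword.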