Keywords Extraction from Persian Thesis Using Statistical Features and Bayesian Classification

Document Type: Research Article

Authors
1 M.Sc Student of Artificial Intelligence, Islamic Azad University North Tehran Branch, Tehran, Iran
2 Department of Computational Linguistics, Information Science Research Department, Iranian Research Institute for Information Science and Technology (IRANDOC), Tehran, Iran
Abstract
Keyword extraction aims to identify the words that best represent the meaning of a text. It plays a crucial role in information retrieval, recommendation systems, and corpus classification. In Persian, keyword extraction is considered a hard task because of the inherent complexity of the language. In this work, we address keyword extraction with a novel combination of statistical methods and machine learning. First, the required preprocessing is applied to the corpus. Then three statistical methods and a Bayesian classifier are applied to the corpus to learn the pattern of the keywords. A post-processing step is also used to reduce the number of false positive outputs. The resulting model can extract up to 20 keywords per document, which are compared with the keywords listed in the corresponding document. The evaluation results indicate that the proposed method can extract keywords from scientific texts (specifically theses and dissertations) with good accuracy.



1. Introduction

Automated keyword extraction is the process of identifying the terms and phrases in a document that appropriately represent its subject. With the proliferation of digital documents, extracting keywords manually is often impractical. Many applications, such as automatic indexing, summarization, automatic classification, and text filtering, can benefit from this process because keywords provide a compact representation of the text. Automated keyword generation can be broadly divided into two categories: keyword allocation and keyword extraction.

In keyword allocation, a set of potential keywords is selected from a controlled vocabulary, while keyword extraction selects keywords from the words that appear in the text itself. Keyword extraction methods can be broadly classified into four groups: statistical approaches, linguistic approaches, machine learning approaches, and hybrid approaches.



2. Literature Review

Working with Persian text is a major challenge because of the paucity of prior research. The inadequacy of text preprocessing tools makes the task more complex than it is for Latin-script languages. In addition, the high dimensionality of the input data is a recurring challenge in such research, and the problem becomes more pronounced because of the variety of Persian written forms (Gandomkar, 2017, pp. 233-256). Maadi and Fouladi (2015, pp. 34-42) present a method for extracting keywords in Persian that processes each text separately, without using any other text as training data.

Razzaghnoori et al. (2017, pp. 16-27) used the Word2Vec method together with TF-IDF weighting to build a question-answering system for Persian, which is novel because of its use of Word2Vec for Persian. However, the reported 72% success rate could be improved in the future with dimensionality reduction techniques and Word2Vec.

3. Methodology

Accordingly, this paper examines the integration of statistical keyword extraction methods with a Naive Bayes classifier. First, the input texts, which are Persian dissertations, were made uniform through preprocessing (stop-word removal, stemming, etc.). Then each word was assigned a weight based on the available statistical features. Next, the most valuable words of each text were selected and the proposed model was trained using the selected classifier. The selected words were then processed by the trained model, and finally the words extracted by the final model were evaluated against the keywords suggested by the authors themselves. Figure 1 depicts all of these steps.
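The preprocessing and weighting steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the toy English stop-word list, and the function names are all assumptions, stemming is omitted, and the TF-IDF and TF-ISF formulas are the common textbook definitions rather than formulas taken from the paper.

```python
import math
from collections import Counter

# Toy English stop-word list for illustration only (assumption);
# a real system would use a Persian stop-word list and stemmer.
STOP_WORDS = {"the", "of", "and", "a", "in", "is", "for"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    stripped = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return [tok for tok in stripped if tok and tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidates(text):
    """Unigram and bigram candidates for one document."""
    tokens = tokenize(text)
    return tokens + ngrams(tokens, 2)


def tf_idf(docs):
    """Weight each candidate by term frequency times log inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def tf_isf(sentences):
    """Same idea inside one document: sentences play the role of documents."""
    n_sentences = len(sentences)
    sent_freq = Counter(term for sent in sentences for term in set(sent))
    counts = Counter(term for sent in sentences for term in sent)
    total = sum(counts.values())
    return {
        term: (count / total) * math.log(n_sentences / sent_freq[term])
        for term, count in counts.items()
    }


docs = [candidates("Keyword extraction for Persian text."),
        candidates("Persian text classification")]
weights = tf_idf(docs)
```

A candidate that occurs in every document, such as "persian" in this toy input, receives a TF-IDF weight of zero, which is why such a weight tends to favor terms that are specific to one text.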



4. Results

A review of the literature suggests that this is the first time this combination has been used to extract Persian keywords. In other studies, each text is treated as one sample for the classifier and its words serve as the features. In this paper, by contrast, the words of each input text are the items being classified, and the values computed by the statistical methods serve as their features. The choice of keywords by authors is always a personal decision, and different people may not choose the same set of words for the same text.
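The word-level setup described above, where each word is a sample and its statistical weights are the features, can be sketched with a small Gaussian Naive Bayes classifier. The Gaussian variant, the class name `GaussianNaiveBayes`, the feature order, and all numbers in the toy data are assumptions made for illustration; the paper does not specify which Naive Bayes variant was used.

```python
import math


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes over numeric feature vectors."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            prior = len(rows) / len(X)
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            # A small constant keeps the variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                for col, m in zip(columns, means)
            ]
            self.stats[label] = (prior, means, variances)
        return self

    def _log_posterior(self, x, label):
        prior, means, variances = self.stats[label]
        logp = math.log(prior)
        for v, m, var in zip(x, means, variances):
            logp += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return logp

    def predict(self, X):
        return [
            max(self.stats, key=lambda lab: self._log_posterior(x, lab))
            for x in X
        ]


# Toy data (made up): one row per word, features are
# [tf_idf, raw frequency, tf_isf]; label 1 = keyword, 0 = not a keyword.
X_train = [
    [0.30, 12, 0.25], [0.28, 10, 0.22], [0.35, 15, 0.30],
    [0.02, 1, 0.01], [0.05, 2, 0.03], [0.01, 1, 0.02],
]
y_train = [1, 1, 1, 0, 0, 0]

model = GaussianNaiveBayes().fit(X_train, y_train)
predictions = model.predict([[0.32, 13, 0.27], [0.03, 1, 0.02]])
```

Because every word contributes one row, even a small number of dissertations yields many training samples, which fits the stated motivation of working with few input documents.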

Figure 1

Proposed research framework for keyword extraction

[Figure not reproduced. Recoverable step labels: rooting and normalization; create unigrams and bigrams.]

Because of the small number of input documents, this paper attempts to build a model and program with a new approach that extracts keywords independently of the subject area of the dissertations and the meaning of their words, using only the statistical features of the words in each text. According to Tables 1 and 2, the developed model can extract a maximum of 20 keywords from each dissertation with an overall accuracy of 98.1% in the best condition, which uses the most-frequent feature. The extracted keywords match the keywords written in each dissertation with a recall of 84% for one-word expressions and 98% for two-word expressions (Table 2).
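One simple way to enforce the limit of 20 keywords per dissertation is to rank the candidates accepted by the classifier by their frequency and keep the top entries. The sketch below is an assumption about how such a cap could be applied; the function name `top_keywords` and its parameters are hypothetical, and the paper does not describe this step in detail.

```python
from collections import Counter

MAX_KEYWORDS = 20  # upper bound stated in the text


def top_keywords(candidates, accepted, limit=MAX_KEYWORDS):
    """Keep at most `limit` accepted candidates, most frequent first.

    candidates: all candidate terms of one document, with repetitions.
    accepted:   set of terms the classifier labeled as keywords.
    """
    counts = Counter(term for term in candidates if term in accepted)
    return [term for term, _ in counts.most_common(limit)]


# Toy input (made up) to show the call shape.
toy_candidates = ["model", "text", "model", "corpus", "text", "model"]
toy_accepted = {"model", "text"}
selected = top_keywords(toy_candidates, toy_accepted)
```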

Table 1

Evaluation metrics for the Bayesian classifier outputs under different combinations of statistical features

Statistical Features            Accuracy   Recall   F1-Score   Precision
Tf_Idf, Most Frequent, Tf_Isf   97.2%      0.98     0.98       0.98
Most Frequent                   98.1%      0.982    0.99       0.99
Tf_Idf, Tf_Isf                  99.8%      0.91     0.94       0.99





Table 2

Evaluation of post-processing on test data for outputs classified as keywords

Step        Statistical Features   Recall   F1-Score   Precision   Number of words   Number of keywords selected by authors
Unigrams    Most Frequent          0.84     0.323      0.2         210               42
Bigrams     Most Frequent          0.98     0.888      0.8         158               34
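As a reading aid for Table 2, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall values; the function name `f1_score` is chosen here for illustration. With these inputs the unigram row reproduces the reported 0.323, while the bigram row comes out at about 0.881, slightly below the reported 0.888, possibly because the reported precision and recall are rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 2.
unigram_f1 = round(f1_score(0.2, 0.84), 3)
bigram_f1 = round(f1_score(0.8, 0.98), 3)
print(unigram_f1, bigram_f1)
```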







1. Aytug Onan, Serdar Korukoglu, Hasan Bulut. 2016. “Ensemble of keyword extraction methods and classifiers in text classification”. Expert Systems with Applications. Pp. 232-247.
2. Atanu Dey, Mamata Jenamani, Jitesh J. Thakkar. 2018. “Senti-N-Gram: An n-gram lexicon for sentiment analysis”. Expert Systems with Applications. Pp. 92-105.
3. S. El-Beltagy and A. Rafea. 2010. “KP-Miner: Participation in SemEval-2”. Proceedings of the 5th international workshop. ACL 2010. Pp. 190-193.
4. K. Eichler and G. Neumann. 2010. “DFKI KeyWE: Ranking keyphrases extracted from scientific articles”. In Proceedings of the 5th International Workshop on Semantic Evaluation. ACL 2010. Pp. 150-153.
5. T. Nguyen and M. Luong. 2010. “WINGNUS: Keyphrase extraction utilizing document logical structure”. Proceedings of the 5th international workshop. Pp. 21-26.
6. Turney, P. D. 2002. “Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data”. ArXiv.
7. Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C. 1999. “KEA: Practical automatic keyphrase extraction”. In Proceedings of the fourth ACM conference on digital libraries.
8. HaCohen-Kerner. 2005. “Automatic extraction and learning of keyphrases from scientific articles”. Lecture Notes in Computer Science. Pp. 657-669.
9. Moien Maadi, Kazim Fouladi. 2015. “Providing a method for extracting keywords in the Persian language”. International Academic Journal of Innovative Research. Pp. 34-42.
10. Mohammad Razzaghnoori, Hedieh Sajedi, Iman Khani Jazani. 2017. “Question classification in Persian using word vectors and frequencies”. Cognitive Systems Research. Pp. 16-27.
11. Behnam Sabeti, Hossein Abedi Firouzjaee, Ali Janalizadeh Choobbasti. 2018. “MirasText: An Automatically Generated Text Corpus for Persian”. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
12. Morteza Okhovvat, Behrouz Minaei Bidgoli. 2010. “A Hidden Markov Model for Persian Part-of-Speech Tagging”. Pp. 94-101.
13. Sayyid Mohammad Hoseini Khozani, Hosein Bayat. 2011. “Specialization of keyword extraction approach to Persian texts”. International Conference of Soft Computing and Pattern Recognition (SoCPaR). Pp. 112-116.
14. Beliga, S., Mestrovic, A., & Martincic-Ipsic, S. 2015. “An overview of graph-based keyword extraction methods and approaches”. Journal of Information and Organizational Sciences. No. 39 (1), Pp. 1-20.
15. Lott, B. 2012. “Survey of keyword extraction techniques”. UNM Education.
16. Rossi, R. G., Maracini, R. M., & Rezende, S. O. 2014. “Analysis of domain independent statistical keyword extraction methods for incremental clustering”. Learning and Nonlinear Models. Pp. 17-37.
17. Neto, L. J., Santos, A. D., Kaestner, C. A. 2000. “Document clustering and text summarization”. In Proceedings of the 4th international conference on practical applications of knowledge discovery and data mining. Pp. 41-55.
18. Fiori, A. 2014. “Innovative document summarization techniques: Revolutionizing knowledge understanding”. Advances in Data Mining and Database Management.
19. Litvak, M., & Last, M. 2008. “Graph-based keyword extraction for single-document summarization”. In Proceedings of the workshop on multi-source multilingual information extraction and summarization. Pp. 17-24.
20. Huan, C., Tian, Y. 2006. “Keyphrase extraction using semantic network structure analysis”. In Proceedings of the sixth international conference on data mining. Pp. 275-284.
21. Raheleh Gandomkar. 2017 (1396 SH). “Semantic extension in Persian: A case study of verbs”. Vol. 10, No. 5 (Serial No. 53). Pp. 233-256.