Evaluating the Effectiveness of Artificial Intelligence Techniques and Computational Theories in the Authorship Attribution of Arabic Texts

Authors
1 PhD in Arabic Language and Literature, Kharazmi University, Tehran, Iran
2 Professor, Department of Arabic Language and Literature, Tarbiat Modares University, Tehran, Iran
Abstract
Authorship attribution is an application of stylometry that assigns authors to anonymous texts on the basis of writing features. Models have been developed for several languages, including English, Chinese, and Dutch. Although many studies address authorship attribution in Arabic, most do not critically assess whether the applied theories and techniques fit Arabic’s distinctive linguistic features. This study aims to identify the most suitable techniques for Arabic authorship attribution by evaluating computational theories and artificial intelligence (AI) methods. Using a descriptive-analytical methodology, it reviews and compares empirical studies on Arabic texts, including literary and online materials. Data were collected from published studies, experiments, and corpora that apply authorship attribution to Arabic, with a focus on how effective each method is given Arabic’s structural features. The results show that, among computational theories, only the K equation reliably determines the authorship of Arabic texts. Among machine learning methods, the support vector machine (SVM) outperforms k-nearest neighbors (KNN), AdaBoost, and Naïve Bayes, but the master-slave technique performs significantly better still. Among natural language processing (NLP) approaches, ARBERT and AraELECTRA achieve up to 96% accuracy, and part-of-speech (POS) tagging outperforms latent semantic analysis (LSA). Further research is needed to determine the most accurate technique for Arabic authorship attribution.
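
As an illustration of the kind of computational measure the abstract refers to, the following is a minimal sketch that assumes the "K equation" denotes Yule's characteristic K, K = 10^4 (Σ i²·V_i − N) / N², where N is the number of tokens and V_i is the number of word types occurring exactly i times. The function name `yule_k` and the whitespace tokenization are assumptions made for this example and are not taken from the study.

```python
from collections import Counter


def yule_k(tokens):
    """Return Yule's characteristic K for a list of word tokens.

    K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where N is the token count
    and V_i is the number of word types that occur exactly i times.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    type_freqs = Counter(tokens)              # word type -> frequency
    spectrum = Counter(type_freqs.values())   # frequency i -> V_i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 10_000 * (s2 - n) / (n * n)


# Hypothetical usage: naive whitespace tokenization (an assumption; real
# Arabic text would normally need normalization and proper tokenization).
sample = "a a b c".split()
k_value = yule_k(sample)
```

Lower values of K indicate a richer vocabulary; a text in which every token is distinct gives K = 0. Comparing K across texts of known and unknown authorship is one simple way such a measure could be applied, though the study's own procedure may differ.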


Articles in Press, Accepted Manuscript
Available Online from 16 September 2025