Function Words as Idiolect Markers: A Corpus-based Approach to Authorship Attribution in Farsi

Author
Assistant Professor of Linguistics, Al-Zahra University, Tehran, Iran
Abstract
The problem of identifying anonymous authors has engaged human attention throughout the ages. Today, with the revolution brought about by digital computing and electronic corpora, and with the tools that stylometry research has made available to forensic linguistics, the systematic analysis of texts in different languages has expanded researchers' understanding of many aspects of linguistic style.

The present study investigates the possibility of idiolect-based authorship attribution in Farsi. Function words are one of the linguistic elements claimed to be the seat of idiolect. They have been a focus of authorship attribution research because they have been shown to be processed unconsciously, to occur with high frequency in texts, and to remain independent of text topic. This paper therefore examines whether Farsi function words can differentiate texts written by different authors. The research questions were: 1) Are Farsi function words capable of differentiating authors in Farsi prose? 2) Of monograms, bigrams, and trigrams, which is the most efficient in differentiating author styles? 3) What is the minimum cut-off point, in words per sample, for successful differentiation of author styles in Farsi?

First, a corpus of writings by five Iranian scholars was compiled, normalized, and divided into sample texts. Then the 20 most frequent words were extracted from the author samples, and n-gram sequences (up to trigrams) were analyzed using principal component analysis and cluster analysis in the Stylo package for R.
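The feature-extraction step described above can be illustrated with a minimal Python sketch. It is not the Stylo implementation used in the study: the function names (`ngrams`, `top_items`, `profile`), the toy transliterated samples, and the choice to pool all samples before ranking are assumptions made for illustration only, and no function-word filter is applied here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_items(samples, n=1, k=20):
    """Pick the k most frequent n-grams over all samples pooled together."""
    pooled = Counter()
    for tokens in samples.values():
        pooled.update(ngrams(tokens, n))
    return [item for item, _ in pooled.most_common(k)]

def profile(tokens, features, n=1):
    """Relative frequency of each feature n-gram within one sample."""
    grams = ngrams(tokens, n)
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero on very short samples
    return [counts[f] / total for f in features]

# Hypothetical toy samples; real input would be normalized Farsi text.
samples = {
    "author_A_1": "va dar in ke az in va be dar".split(),
    "author_B_1": "be ra az ke be ra va be ra".split(),
}

features = top_items(samples, n=1, k=3)  # the study uses k=20
table = {name: profile(toks, features) for name, toks in samples.items()}
```

Each row of `table` is a frequency vector for one sample; vectors of this kind are the usual input to principal component analysis or cluster analysis. Setting `n=2` or `n=3` yields the bigram and trigram variants.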

The findings showed that Farsi function words were capable of differentiating authors' writings, with monograms performing better than bigrams and trigrams in small samples. They also indicated that, under the experimental conditions of this study, a text needs about 4000 words to be successfully attributed to an author. This cut-off point is reached using the 20 most frequent function words. It is concluded that different authors do not use function words in the same manner: while some high-frequency function words appear in the writings of all authors, different authors give them different priorities.
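A cut-off point of this kind is typically probed by re-running the analysis on samples of different lengths. The abstract does not say how the samples were cut, so the sketch below assumes consecutive, non-overlapping windows of a fixed number of words; `split_into_samples` is a hypothetical helper, not part of Stylo.

```python
def split_into_samples(tokens, size):
    """Cut a token list into consecutive, non-overlapping samples of `size` tokens.

    A trailing remainder shorter than `size` is dropped, so every sample
    has exactly the same length.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

# Toy example: 10 placeholder tokens cut into samples of 4 tokens each.
tokens = ["w%d" % i for i in range(10)]
samples = split_into_samples(tokens, 4)
```

Repeating the attribution with decreasing values of `size` (for example 8000, 4000, 2000) would show where the authors stop separating cleanly.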
