An Overview of Text Mining in Language Studies: The Computational Approach to Text Analytics

Document Type: Analytical Review Article

Authors
1 PhD in TEFL, Hakim Sabzevari University, Sabzevar, Iran
2 Associate Professor of TEFL, Hakim Sabzevari University, Sabzevar, Iran
3 Associate Professor of TEFL, Hakim Sabzevari University, Sabzevar, Iran
Abstract
‘Text mining’ refers to the computational analysis of unstructured text to extract latent linguistic layers and themes. It is especially significant as a form of content or thematic analysis in descriptive and interpretive studies. The process begins with structuring plain texts and proceeds with summarizing, classifying, modelling, evaluating and interpreting the inherent textual concepts and patterns. Given that this method counts as an interdisciplinary innovation, especially in discourse studies, it deserves to be pursued more intensively in academic research. Despite the multitude of studies in English in this area, there has been little interest to date in text mining among Iranian researchers, as evidenced by the critically limited number of local Persian and English studies. Thus, by looking into the theory and practice of text mining and its major analytic tools and methods in Persian and English, this paper aims to prepare the ground for utilizing this methodology in language studies.






The last two decades have witnessed a major increase in the rate and accuracy of knowledge generation in language studies, owing to advances at the intersection of applied linguistics and computer science. At the heart of these methodological innovations, especially in discourse studies, lies ‘text mining’, whose merits have only recently been appreciated by researchers. ‘Text mining’, also called ‘text data mining’ or ‘text analysis’, is the use of data mining algorithms and methods, such as natural language processing and linguistic and statistical techniques, to derive linguistic features, significant patterns and valuable themes from unstructured texts. It involves collecting unstructured data, pre-processing and cleansing them to detect and remove anomalies, and then processing and controlling operations (Zhou et al., 2012). These processes are further broken down into feature extraction, structural analysis, text summarization, text classification, text clustering, and association analysis. Text mining is thus a complex procedure for extracting valuable, significant patterns and trends from large volumes of textual data, and it is used for such functions as product suggestion analysis, social media opinion mining, and sentiment or trend analysis (He, 2013).
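To make the stages named above more concrete, the following is a minimal sketch in Python of cleansing, feature extraction and a similarity measure of the kind that text clustering can build on. It is an illustration under stated assumptions, not a method taken from the cited studies: the stop-word list, the sample sentences and all function names are hypothetical, and raw term counts stand in for whatever features a real study would choose.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}

def preprocess(text):
    """Cleansing: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_frequencies(tokens):
    """Feature extraction: map each term to its raw count."""
    return Counter(tokens)

def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two term-frequency vectors."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Assumed sample documents: two on the same topic, one unrelated.
docs = [
    "Text mining extracts patterns from unstructured text.",
    "Mining text reveals patterns in unstructured documents.",
    "The weather is sunny today.",
]
vectors = [term_frequencies(preprocess(d)) for d in docs]
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

In this toy example the two sentences about text mining share most of their vocabulary and score well above the unrelated sentence, which shares none. A clustering or classification step would group documents on the basis of such scores.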

Dating back to Feldman and Dagan (1995), text mining is an innovative methodology with a relatively short history. It is often integrated with corpus analysis to computationally analyze a large body of unstructured texts as a potential source of insight. As a subfield of data mining in computer science and an interdisciplinary method, text mining borrows from corpus and computational linguistics, and its main purpose is to extract the meta-characters representing textual features (Pons-Porrata et al., 2007). Zhou et al. (2017) believe that, despite its short history, text mining has evolved remarkably into a mainstream research methodology in many interdisciplinary areas in the wake of increasingly rapid developments in data mining.

Hashimi et al. (2015) describe text mining as a semi-automated process of collecting, structuring and then analyzing textual data, with the following steps: (a) collecting unstructured data from a variety of sources such as textual documents, social media, web pages, emails and blogs, using specialized corpora for organization; (b) pre-processing and cleansing the data with text mining tools to remove anomalies and unveil latent valuable information; (c) converting the unstructured data into relevant structured formats; (d) discovering the underlying data patterns using word structures, sequences and frequencies; and (e) extracting useful knowledge and storing it in a secure database for evaluation, later retrieval, trend analysis and possible decision-making. Text mining also makes use of lexicometrics, which deals with frequency and co-occurrence analysis of vocabulary to derive structures from texts. Sentiment analysis is an application of lexicometrics that looks for positive or negative emotions in documents and has been used in social media analysis to evaluate public opinion (Shangzhen & Lemen, 2016).
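The lexicometric operations mentioned above (word frequency, co-occurrence and lexicon-based sentiment scoring) can be sketched in a few lines of Python. This is only a rough illustration: the positive and negative word lists are a hypothetical mini-lexicon, the sample comments are invented, and co-occurrence is assumed here to mean two words appearing in the same document. Real studies would rely on validated lexicons and dedicated tools.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon; a real study would use a validated resource.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "poor", "confusing"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(documents):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def cooccurrences(documents):
    """Count unordered word pairs that appear in the same document."""
    pairs = Counter()
    for doc in documents:
        vocab = sorted(set(tokenize(doc)))
        pairs.update(combinations(vocab, 2))
    return pairs

def sentiment_score(text):
    """Positive-word count minus negative-word count."""
    tokens = tokenize(text)
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives

# Invented sample comments for illustration.
comments = [
    "The course was great and the teacher was helpful",
    "The course was confusing and the slides were poor",
    "Great teacher, good course",
]
freq = word_frequencies(comments)
pairs = cooccurrences(comments)
scores = [sentiment_score(c) for c in comments]
print(freq.most_common(3))
print(pairs[("course", "teacher")])
print(scores)
```

A score above zero marks a comment as predominantly positive and a score below zero as predominantly negative. Counting pairs per document rather than per sentence or per sliding window is a simplifying design choice that a real analysis might revisit.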

Text mining is an area of inquiry that deserves to be pursued more intensively in future studies. This paper is therefore an attempt to review its basic principles, procedures and leading analytic tools, and to raise researchers’ awareness of the virtues of text mining.


8. References (Persian sources)
• Emami, Ahmad, & Reza Ghaemi (1394 SH). “Topical classification of Persian texts using transfer learning techniques” [in Persian]. International Conference on Applied Research in Information Technology, Computer and Telecommunications, Torbat-e Heydarieh, Khorasan Razavi Telecommunication Company. https://www.civilica.com/Paper-ITCC01-ITCC01_445.html.
• Banaei, Mojtaba (1394 SH). “An introduction to Persian text processing with Python” [in Persian]. http://www.bigdata.ir/1394/08/.
• Bahrampour, Akbar, et al. (1394 SH). “Challenges of using natural language processing (NLP) in the Persian language” [in Persian]. 2nd National Conference on the Development of Engineering Sciences.
• . Social Sciences Book Review Journal [in Persian]. No. 5.6, pp. 103-113.
• Teimourpour, Babak, et al. (1388 SH). “A novel method for intelligent classification of scientific texts: A case study” [in Persian]. Nanotechnology Articles of Iranian Specialists. No. 6, pp. 1-14.
• Khasseh, Ali Akbar, & Nourollah Karami (1387 SH). “Weeding in electronic environments” [in Persian]. National Studies on Librarianship and Information Organization (quarterly). No. 74, pp. 229-238.
• Mazinani, Abolfazl, et al. (1397 SH). “Thematic discourse analysis: An integration of the discourse-historical approach and thematic interpretation of the Holy Quran” [in Persian]. Language Related Research (Jostarha-ye Zabani). Vol. 9, No. 3, pp. 1-33.
• Hashemi, Seyed Mohsen, et al. (1393 SH). “Using text mining techniques to classify Persian texts with the Hamshahri dataset” [in Persian]. International Conference on Engineering, Art and Environment, Poland. https://www.civilica.com/Paper-CEAE01-CEAE01_091.html
References
• Achtert, E., Böhm, C., Kriegel, H. P., Kröger, P., Müller-Gorman, I., & Zimek, A. (2006). “Finding hierarchies of subspace clusters”. Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science (Vol. 4213, pp. 446–453).
• Agrawal, R., & Batra, M. (2013). “A Detailed Study on Text Mining Techniques”. International Journal of Soft Computing and Engineering (IJSCE). ISSN: 2231-2307, Vol. 2, Issue-6.
• Alfiani, P.A., & Wulandari, A. F. (2015). “Mapping student’s performance based on data mining approach (a case study)”. Ital. Oral Surg. 3, 173–177.
• Arslan, A.A. (2011). Türkçe Metinlerden Anlamsal Bilgi Çikarimi Için Bir Veri Madenciligi Uygulamasi. Baskent Üniversitesi Fen Bilimleri Enstitüsü, Yüksek Lisans Tezi.
• Aydin, C.R., Erkan, A., Güngör, T., & Takçi, H. (2013). “Sözlük Tabanli Kavram Madenciligi”. Türkçe için bir Uygulama, 30. Ulusal Bilisim Kurultayi, November 2013, Ankara.
• Baker, R.S., & Inventado, P.S. (2014). “Educational data mining and learning analytics”. Learning Analytics: From Research to Practice, pp. 61–75. Springer, New York.
• Bhushan, J., Pushkar, W., Shivaji, K., & Nikhil, K. (2014). “Searching research papers using clustering and text mining”. International Journal of Emerging Technology and Advanced Engineering, 4(4).
• Bilgin, T.T., & Çamurcu, A.Y. (2008). Çok Boyutlu Veri Görsellestirme Teknikleri, Akademik.
• Blei, D., Ng, A., & Jordan, M. (2003). “Latent Dirichlet allocation”. Journal of Machine Learning Research, 3:993–1022.
• Burley, D. (2010). “Information visualization as a knowledge integration tool”. International Journal of Knowledge Management Practice, 11(4).
• Cobo, M.J., López-Herrera, A.G., Herrera-Viedma, E. & Herrera, F. (2011). “Science Mapping Software Tools: Review, Analysis, and Cooperative Study among Tools”. Journal of the American Society for Information Science and Technology, 62(7). 1382–1402.
• Cohen, W.W. (1999). “What can we learn from the web?” In proceedings of the Sixteenth International Conference on Machine Learning (ICML’99), 515-521.
• Cohen, K.B. & Hunter, L. (2008). “Getting Started in Text Mining”. PLoS Comput Biol. 4(1): e20.
• Çalis, K., Gazdagi, O., & Yildiz, O. (2013). “Reklam Içerikli Epostalarin Metin Madenciligi Yöntemleri ile Otomatik Tespiti”. Bilisim Teknolojileri Dergisi, 6(1), 1-7.
• Cortez, P., & Silva, A. (2008). “Using data mining to predict secondary school student performance”. 5th Annual Future Business Technology Conference, vol. 2003, No. 2000, pp. 5–12.
• De Bellis, N. (2009). Bibliometrics and citation analysis: from the Science citation index to cybermetrics. Scarecrow Press.
• Dolgun, M.Ö., Özdemir, T.G., & Oguz, D. (2009). “Veri madenciliginde yapisal olmayan verinin analizi: Metin ve web madenciligi”. Istatistikçiler Dergisi, 2, 48-58.
• Ergün, M. (2016). “Using the Techniques of Data Mining and Text Mining in Educational Research”. Participatory Educational Research (PER) Special Issue 2016-III, pp. 140-151. http://www.partedres.com ISSN: 2148-6123.
• Feldman, R., & Dagan, I. (1995). “KDT knowledge discovery in texts”. In Proceedings of the First International Conference on Knowledge Discovery (KDD95).
• Feng, Q. (2008). “A Comprehensive Review of Data Excavation Tools in China” [J]. (10): 11-13.
• Gemert, J.V. (2000). “Text Mining Tools on the Internet: an overview”. ISIS technical report series, Vol. 25. Department of Computer Science, University of Amsterdam, The Netherlands. http://www.science.uva.nl/research/isis.
• Gharehchopogh, F. S., & Abbasi Khalifehlou, Z. (2011). “Study on information extraction methods from text mining and natural language processing perspectives”. AWER Procedia information technology & computer science, 2nd world conference on information technology.
• Gupta, V. (2009). “A survey of text mining techniques and applications”. Journal of Emerging Technologies in Web Intelligence, 1(1).
• Gupta, V. & Lehal, S. (2009). “A survey of text mining techniques and applications”. Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1. doi: 10.4304/jetwi.1.1.60-76
• Hashimi, H., Hafez A., & Mathkour, H. (2015). “Selection criteria for text mining approaches.” Computers in Human Behavior 51 (B): 729–733.
• He, Q. (1999). “Knowledge Discovery through Co-Word Analysis”. Library Trends, 48(1). 133-159.
• Hearst, M. (2009). “What is text mining?” http://www.sims.berkeley.edu/~hearst/.
• Hearst, M.A. & Rosner, D. (2008). “Tag Clouds: Data Analysis Tool or Social Signaller?”. Hawaii International Conference on System Sciences, Proceedings of the 41st Annual, 7-10 January, 160-160. DOI: 10.1109/HICSS.2008.422.
• Hsieh, H. F., & Shannon, S.E. (2005). “Three Approaches to Qualitative Content Analysis”. Qualitative Health Research, 15(9), 1277-1288. doi: 10.1177/1049732305276687.
• Kaur, A. & Chopra, D. (2016). “Comparison of text mining tools.” 5th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO). pp. 186-192. doi:10.1109/icrito.2016.7784950.
• Kriegel, H. P., Kröger, P., & Zimek, A. (2009). “Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering”. Transactions on Knowledge Discovery from Data (New York, NY: ACM), 3(1), 1–58.
• Kumar, S.A., & Vijayalakshmi, M. (2011). “A novel approach in data mining techniques for educational data”. In: 3rd International Conference on Machine Learning Computing (ICMLC 2011) A, no. Icmlc, pp. 152–154.
• Lee, S.J., Liu, Y., & Popović, Z. (2014). “Learning individual behavior in an educational game: a data-driven approach”. In: Proceedings of 7th International Conference on Educational Data Mining, pp. 114–121.
• Leong, C. K., Lee, Y. H., & Mak, W. K. (2012). “Mining Sentiments in SMS Texts for Teaching Evaluation”. Expert Systems with Applications, 39(3): 2584-2589.
• Liptha, L. R., Raja, K., & Tholkappia, A. G. (2010). “Text clustering using concept-based mining model”. International Journal of Electronics and Computer Science Engineering (ISSN: 2277-1956).
• Marquez, B. & Moya, L. (2011). “An Automatic Text Comprehension Classifier Based on Mental Models and Latent Semantic Features”. Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies. New York: ACM Press, pp. 158-162.
• Patel, R., & Sharma, G. (2014). “A survey on text mining techniques”. International Journal of Engineering and Computer Science, 3(5).
• Pena-Ayala, A. (Ed.) (2014). Educational Data Mining: Applications and Trends (Vol. 524). Springer.
• Pons-Porrata, A., Berlanga-Llavori, R., & Ruiz-Shulcloper, J. (2007). “Topic discovery based on text mining techniques”. Information Processing & Management, 43(3): 752-768.
• Shangzhen, L., & Lemen, C. (2016). “Research on the Application of Text Mining in Chinese Information Analysis Comment”. Information Science, (08): 153-159.
• Shi, C., Verhagen, M. & Pustejovsky, J. (2014). “A Conceptual Framework of Online Natural Language Processing Pipeline Application”. Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, Dublin, Ireland. 53–59.
• Sin, K., & Muthu, L. (2015). “Application of big data in education data mining and learning analytics—a literature review”. ICTACT J. Soft Comput. : Special Issue Soft Comput. Models Big Data 5(4), 1035–1049.
• Swamy, M., & Hanumanthappa, M. (2012). “Predicting academic success from student enrolment data using decision tree technique”. Int. J. Appl. Inf. Syst. 4(3), 1–6.
• Vashishta, S., & Jain, Y. K. (2011). “Efficient retrieval of text for biomedical domain using data mining algorithm”. International Journal of Advanced Computer Science and Applications, 2(4).
• Wang, L., Wang, G. & Alexander, C.A. (2015). “Big Data and Visualization: Methods, Challenges and Technology Progress”. Digital Technologies, 1(1), 33-38.
• Weiss, S.M., Indurkhya, N., & Zhang, T. (2010). “Information Retrieval and Text Mining”. Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer, London.
• Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. San Francisco: Morgan Kaufmann.
• Wu, S. T., Li, Y., & Xu, Y. (2006). “Deploying approaches for pattern refinement in text mining”. In Proceedings of the sixth international conference on data mining.
• Xu, Y., & Zhao, R. (2017). “The Literature Review of Text Data Mining”. Science Discovery. Vol. 5, No. 6, 2017, pp. 438-443. doi: 10.11648/j.sd.20170506.18
• Yassine, M., & Hajj, H. (2010). “A framework for emotion mining from text in online social networks”. In IEEE international conference on data mining workshops (pp. 1136–1142). Sydney, NSW: IEEE publications.
• Zhou, Y., Zhang, Y., Vonortas, N., & Williams, J. (2012). “A text mining model for strategic alliance discovery”. 45th Hawaii International Conference on System Sciences.