Current Implementation and Future Prospects of SANTI-Morf V.1.0

Prihantoro Prihantoro

Abstract


SANTI-Morf (Prihantoro, 2021) is a new morphological analyser for Indonesian. In the SANTI-Morf annotation scheme (Prihantoro, 2019), morpheme tokens are linked to their annotations. The tokens are presented in both their orthographic and citation forms, which allows (allo)morph-based or morpheme-based searches. Users can also retrieve items by formal and functional morphological criteria, because the SANTI-Morf tagset encodes analyses of the morphemes' forms (e.g. roots, clitics, affix types) and functions (e.g. passive voice, active voice, adjective degrees). Currently, the scheme is implemented in NooJ (Silberztein, 2003), a linguistic development environment. This implementation enables users to index and annotate Indonesian texts on their local PC, and later to run searches based on the morphological criteria and/or tokens defined by the SANTI-Morf scheme.
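The idea of linking each morpheme token to its orthographic form, citation form, and form/function tags can be illustrated with a minimal Python sketch. This is not the SANTI-Morf tagset or the NooJ implementation: the class `MorphemeToken`, the function `search`, and the tag labels (`prefix`, `root`, `active`, `passive`, `none`) are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphemeToken:
    """One morpheme token linked to its annotation (hypothetical layout)."""
    word: str          # the word the morpheme occurs in
    orthographic: str  # surface (allo)morph as written in the text
    citation: str      # citation form of the morpheme
    form: str          # formal tag (hypothetical label, not the real tagset)
    function: str      # functional tag (hypothetical label, not the real tagset)


def search(tokens, **criteria):
    """Return the tokens whose fields equal every given criterion."""
    return [
        t for t in tokens
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# A tiny hand-annotated sample: membaca 'read', mendengar 'hear', dibaca 'be read'.
tokens = [
    MorphemeToken("membaca", "mem", "meN-", "prefix", "active"),
    MorphemeToken("membaca", "baca", "baca", "root", "none"),
    MorphemeToken("mendengar", "men", "meN-", "prefix", "active"),
    MorphemeToken("mendengar", "dengar", "dengar", "root", "none"),
    MorphemeToken("dibaca", "di", "di-", "prefix", "passive"),
    MorphemeToken("dibaca", "baca", "baca", "root", "none"),
]

# Morpheme-based search: all allomorphs of meN-, whatever their spelling.
men_hits = search(tokens, citation="meN-")

# Allomorph-based search: only the surface form "mem".
mem_hits = search(tokens, orthographic="mem")

# Function-based search: morphemes tagged as marking passive voice.
passive_hits = search(tokens, function="passive")
```

Searching on the citation form groups the allomorphs `mem` and `men` under one morpheme, while searching on the orthographic form keeps them apart; this is the distinction the two token representations are meant to support.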

 

Abstract (translated from the Indonesian)

SANTI-Morf (Prihantoro, 2021) is the latest morphological analysis program for Indonesian. In the SANTI-Morf annotation scheme (Prihantoro, 2019), every morpheme token is linked to its annotation. These tokens are represented in their orthographic and citation forms, which allows users to run (allo)morph-based or morpheme-based searches. In addition, users can search on the basis of morpheme form or function, because the analytic tagset used in SANTI-Morf covers forms (among others: roots, clitics, affixation types) and functions (among others: active, passive, adjective degrees). At present, SANTI-Morf is implemented in NooJ (Silberztein, 2003), a program for developing linguistic applications. Users can index and annotate Indonesian-language texts on their own computers, and then run searches using the morphological criteria and the tokenisation scheme of the SANTI-Morf annotation scheme.


Keywords


annotation; retrieval; morphology; scheme; SANTI-Morf; NooJ

References


Adriani, M., & Hamam, R. (2009). Research Report Phase 3.2: Final Report on Statistical Machine Translation for Bahasa Indonesia - English and English to Bahasa Indonesia. Jakarta: BPPT.

Alwi, H., Dardjowidjojo, S., Lapoliwa, H., & Moeliono, M. (1998). Tata Bahasa Baku Bahasa Indonesia (3rd Edition). Jakarta: Balai Pustaka.

Anthony, L. (2006). Concordancing with AntConc: An introduction to tools and techniques in corpus linguistics. JACET Newsletter, 155-185.

Beesley, K. R., & Karttunen, L. (2003). Finite State Morphology. Stanford: CSLI.

Brezina, V., Timperley, M., & McEnery, T. (2018). #LancsBox v. 4.x [software]. Available at: http://corpora.lancs.ac.uk/lancsbox.

Denistia, K., & Baayen, H. (2019). The Indonesian prefixes PE- and PEN-: A study in Productivity and Allomorphy. Morphology, 29(3), 385-407.

https://doi.org/10.1007/s11525-019-09340-7

Gallop, A. T. (2013). The language of Malay manuscript art: a tribute to Ian Proudfoot and the Malay Concordance Project. International Journal of the Malay World and Civilisation, 1(3), 11-27.

Garside, R. (1987). The CLAWS Word-tagging System. In R. Garside, G. Leech, & G. Sampson (eds.), The Computational Analysis of English: A Corpus-based Approach (pp. 31-41). London: Longman.

Gerstenberger, C., Partanen, N., & Rießler, M. (2017). Instant annotations in ELAN corpora of spoken and written Komi, an endangered language of the Barents Sea region. Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages (pp. 57-66).

https://doi.org/10.18653/v1/W17-0109

Goldhahn, D., Eckart, T., & Quasthoff, U. (2012). Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In N. Calzolari, K. Choukri, T. Declerck, M. Doğan, B. Maegaard, J. Mariani, & S. Piperidis (Eds.), Proceedings of LREC Vol. 29 (pp. 31-43). Istanbul: European Language Resources Association (ELRA).

Hardie, A. (2012). CQPweb-combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380-409.

https://doi.org/10.1075/ijcl.17.3.04har

Hu, C., & Tan, J. (2017). Using UAM Corpus tool to Explore the Language of Evaluation in Interview Program. English Language Teaching, 10(7), 8-20.

https://doi.org/10.5539/elt.v10n7p8

Hulden, M. (2009). Foma: a Finite-State Compiler and Library. In A. Lascarides (Ed.), Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL '09) (pp. 29-32). Stroudsburg, PA: EACL.

https://doi.org/10.3115/1609049.1609057

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1), 7-36.

https://doi.org/10.1007/s40607-014-0009-9

Larasati, S.-D., Kuboň, V., & Zeman, D. (2011). Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus. Systems and Frameworks for Computational Morphology (pp. 119-129). Zurich: Springer.

https://doi.org/10.1007/978-3-642-23138-4_8

Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319-344.

https://doi.org/10.1075/ijcl.22.3.02lov

Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

https://doi.org/10.21236/ADA273556

McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. Milton Park: Taylor & Francis.

Nomoto, H., Akasegawa, S., & Shiohara, A. (2018). Building an open online concordancer for Malay/Indonesian. Paper presented at the 22nd International Symposium on Malay/Indonesian Linguistics (ISMIL). University of California, Los Angeles.

Pisceldo, F., Mahendra, R., Manurung, R., & Arka, I. W. (2008). A Two Level Morphological Analyser for the Indonesian Language. In N. Stokes, & D. Powers (Eds.), Proceedings of the Australasian Language Technology Association Workshop (pp. 142-150). Hobart: ACL.

Prentice, S., Taylor, P. J., Rayson, P., Hoskins, A., & O'Loughlin, B. (2011). Analyzing the semantic content and persuasive composition of extremist media: A case study of texts produced during the Gaza conflict. Information Systems Frontiers, 13(1), 61-73.

https://doi.org/10.1007/s10796-010-9272-y

Prihantoro. (2019). A new tagset for morphological analysis of Indonesian. International corpus linguistics conference. Cardiff.

Prihantoro. (2021). A new morphological annotation system for Indonesian (PhD Thesis). Lancaster: Lancaster University Press.

Prihantoro. (2021). An Evaluation of the Morphological Annotation Scheme for Indonesian Used in MorphInd Program. Corpora, 16(3), in press.

https://doi.org/10.3366/cor.2021.0221

Scott, M. (1996). WordSmith manual. Gloucestershire: Lexical Analysis Software Ltd.

Silberztein, M. (2003). NooJ Manual. Available for download at: www.nooj4nlp.net.

Sneddon, J. N., Adelaar, A., Djenar, D.-N., & Ewing, M.-C. (2010). Indonesian Reference Grammar: 2nd Edition. New South Wales: Allen & Unwin.

Ting, K. M. (2011). Precision and Recall. In C. Sammut, & G. I. Webb (Eds.), Encyclopedia of Machine Learning (p. 781). Boston: Springer.




DOI: https://doi.org/10.26499/rnh.v10i2.4189
