References

Note

Please edit this document if anything is incorrect or missing. Last updated on January 18, 2015.

Korean morpheme analyzer tools

When analyzing Korean text, the most basic task is morphological analysis. Several libraries, written in various programming languages, are available for this:

C/C++

  • KTS (1995) GPL v2
    • By 이상호, 서정연, 오영환 (KAIST & Sogang University)
    • code
  • MACH (2002) custom
    • By Prof. Kwangseob Shim (Sungshin Women's University)
  • MeCab-ko (2013) GPL, LGPL, BSD
    • By Yong-woon Lee and Youngho Yoo

Java

  • Arirang (2009) Apache v2
    • By SooMyung Lee
    • code
  • Hannanum (1999) GPL v3
    • By Prof. Key-Sun Choi’s research team (KAIST)
    • code, docs
  • KKMA (2010) GPL v2
    • By Prof. Sang-goo Lee’s research team (Seoul National University)
    • Generates morpheme candidates using dynamic programming
    • Tags morphemes by checking neighboring morphemes and applying heuristics and hidden Markov models (HMMs)
    • Developer blog: Dongjoo Lee
  • KOMORAN (2013) Apache v2
    • By shineware
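
The two steps listed for KKMA above (candidate generation with dynamic programming, then HMM-based tagging) can be illustrated with a small sketch. This is not KKMA's code or API: the lexicon, the probabilities, and the function names (candidates, viterbi, analyze) are all invented for illustration, and real analyzers also handle contractions and irregular conjugation, which this toy ignores. The tags follow the Sejong-style labels NNG (common noun), JKS (subject particle), JKB (adverbial particle), and VV (verb).

```python
import math
from functools import lru_cache

# Hypothetical toy lexicon: surface form -> {tag: P(form | tag)}.
EMIT = {
    "아버지": {"NNG": 0.2},             # father
    "가방": {"NNG": 0.1},               # bag
    "방": {"NNG": 0.1},                 # room
    "가": {"JKS": 0.5, "VV": 0.1},      # subject particle, or verb stem "go"
    "에": {"JKB": 0.4},                 # locative particle
}

# Hypothetical tag transition probabilities P(tag | previous tag).
TRANS = {
    ("<s>", "NNG"): 0.5,
    ("NNG", "NNG"): 0.05,
    ("NNG", "JKS"): 0.3,
    ("NNG", "JKB"): 0.3,
    ("NNG", "VV"): 0.05,
    ("JKS", "NNG"): 0.4,
    ("VV", "NNG"): 0.01,
}
UNSEEN = 1e-6  # fallback probability for transitions not listed above


def candidates(text):
    """Return every split of `text` into lexicon entries (memoized DP over suffixes)."""
    @lru_cache(maxsize=None)
    def split_from(i):
        if i == len(text):
            return ((),)
        results = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in EMIT:
                for rest in split_from(j):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(c) for c in split_from(0)]


def viterbi(morphs):
    """Return (log probability, tags) of the most probable tag sequence for `morphs`."""
    # best[tag] = (log probability, tag path) of the best path ending in `tag`
    best = {"<s>": (0.0, [])}
    for m in morphs:
        step = {}
        for tag, p_emit in EMIT[m].items():
            score, path = max(
                (prev_score + math.log(TRANS.get((prev, tag), UNSEEN)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[tag] = (score + math.log(p_emit), path + [tag])
        best = step
    return max(best.values())


def analyze(text):
    """Pick the candidate segmentation whose best tagging has the highest score."""
    scored = [(viterbi(c), c) for c in candidates(text)]
    if not scored:
        return []  # no segmentation is covered by the toy lexicon
    (_, tags), morphs = max(scored)
    return list(zip(morphs, tags))


if __name__ == "__main__":
    print(candidates("아버지가방에"))
    print(analyze("아버지가방에"))
```

With these made-up numbers, the string 아버지가방에 has two candidate segmentations (아버지/가/방/에 and 아버지/가방/에), and the scorer prefers the first. Changing the toy probabilities can flip the choice, which is why real analyzers estimate them from tagged corpora.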

Python

  • KoNLPy (2014) GPL v3
    • By Lucy Park (Seoul National University)
  • UMorpheme (2014) MIT
    • By Kyunghoon Kim (UNIST)

R

  • KoNLP (2011) GPL v3
    • By Heewon Jeon

Others

Other NLP tools

Language parser

  • KoreanParser
    • By DongHyun Choi, Jungyeul Park, Key-Sun Choi (KAIST)

Corpora

  • Yonsei Corpus, Yonsei University, 1987.
    • 42M tokens of Korean text dating from the 1960s onward
  • Korea University Korean Corpus, 1995.
    • 10M tokens of Korean text from the 1970s to the 1990s
  • HANTEC 2.0, KISTI & Chungnam National University, 1998-2003.
    • 120,000 test documents (237MB)
    • 50 TREC-type questions for QA (48KB)
  • HKIB-40075, KISTI & Hankook Ilbo, 2002.
    • 40,075 test documents for text categorization (88MB)
  • KAIST Corpus, KAIST, 1997-2005.

  • Sejong Corpus, National Institute of the Korean Language, 1998-2007.
