References

Note

Please modify this document if anything is incorrect or missing. Last updated on March 10, 2015.

Korean morpheme analyzer tools

When you’re analyzing Korean text, the most basic task you need to perform is morphological analysis. There are several libraries in various programming languages to achieve this:

C/C++

  • KTS (1995) GPL v2
    • By 이상호, 서정연, 오영환 (KAIST & Sogang University)
    • code
  • MACH (2002) custom
    • By Prof. Kwangseob Shim (Sungshin Women's University)
  • MeCab-ko (2013) GPL / LGPL / BSD
    • By Yong-woon Lee and Youngho Yoo

Java

  • Arirang (2009) Apache v2
    • By SooMyung Lee
    • code
  • Hannanum (1999) GPL v3
    • By Prof. Key-Sun Choi’s research team (KAIST)
    • code, docs
  • KKMA (2010) GPL v2
    • By Prof. Sang-goo Lee’s research team (Seoul National University)
    • Generates morpheme candidates using dynamic programming
    • Tags morphemes by checking neighboring morphemes and applying heuristics and hidden Markov models (HMMs)
    • Developer blog: Dongjoo Lee
  • KOMORAN (2013) Apache v2
    • By shineware
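
The candidate-generation step described for KKMA above can be illustrated with a toy example. This is a minimal sketch, not KKMA's implementation: the `segmentations` function and the tiny `LEXICON` are hypothetical, and the code only splits surface strings, whereas a real analyzer also restores base forms and then ranks the candidates (for example with an HMM).

```python
# Hypothetical toy lexicon; a real analyzer uses a large morpheme dictionary.
LEXICON = {"나", "는", "나는", "학생", "이", "다", "이다"}


def segmentations(text, lexicon):
    """Enumerate every way to split `text` into lexicon entries.

    Dynamic programming: table[i] stores all segmentations of text[:i],
    so each prefix is solved once and reused by longer prefixes.
    """
    table = [[] for _ in range(len(text) + 1)]
    table[0] = [[]]  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in lexicon:
                # Extend every segmentation of text[:start] with this piece.
                for prefix in table[start]:
                    table[end].append(prefix + [piece])
    return table[len(text)]


candidates = segmentations("나는학생이다", LEXICON)
for candidate in candidates:
    print(" + ".join(candidate))
```

With this lexicon the input has four candidate segmentations; choosing among them is the job of the tagging step, which this sketch does not attempt.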

Python

  • KoNLPy (2014) GPL v3
    • By Lucy Park (Seoul National University)
  • UMorpheme (2014) MIT
    • By Kyunghoon Kim (UNIST)

R

  • KoNLP (2011) GPL v3
    • By Heewon Jeon

Others

Other NLP tools

  • Hangulize - By Heungsub Lee Python
    • Transcribes words from 38+ languages into Hangul
  • Hanja - By Sumin Byeon Python
    • Hanja-to-Hangul transcriber
  • Jamo - By Joshua Dong Python
    • Hangul syllable decomposition and synthesis
  • KoreanParser - By DongHyun Choi, Jungyeul Park, Key-Sun Choi (KAIST) Java
    • Language parser
  • Korean - By Heungsub Lee Python
    • Package for attaching particles (josa) in sentences
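
Two of the tasks listed above, syllable decomposition and particle (josa) attachment, both rest on the arithmetic layout of precomposed Hangul syllables in Unicode. The following is a minimal sketch of that arithmetic, not the API of the Jamo or Korean packages: all function names are hypothetical, and only the topic particle 은/는 is handled.

```python
HANGUL_FIRST = 0xAC00  # "가", the first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # "힣", the last precomposed Hangul syllable
VOWEL_COUNT = 21
TAIL_COUNT = 28        # 27 final consonants plus "no final consonant"


def decompose(syllable):
    """Return (lead, vowel, tail) indices for one precomposed syllable."""
    code = ord(syllable)
    if not HANGUL_FIRST <= code <= HANGUL_LAST:
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    lead, rest = divmod(code - HANGUL_FIRST, VOWEL_COUNT * TAIL_COUNT)
    vowel, tail = divmod(rest, TAIL_COUNT)
    return lead, vowel, tail


def compose(lead, vowel, tail=0):
    """Inverse of decompose: build a syllable from its three indices."""
    return chr(HANGUL_FIRST + (lead * VOWEL_COUNT + vowel) * TAIL_COUNT + tail)


def to_jamo(syllable):
    """Spell a syllable as Unicode conjoining jamo (U+1100 block)."""
    lead, vowel, tail = decompose(syllable)
    jamo = chr(0x1100 + lead) + chr(0x1161 + vowel)
    if tail:  # tail index 0 means the syllable has no final consonant
        jamo += chr(0x11A7 + tail)
    return jamo


def attach_topic_particle(word):
    """Append 은 after a final consonant, 는 otherwise."""
    _, _, tail = decompose(word[-1])
    return word + ("은" if tail else "는")


print(decompose("한"))                 # index triple for 한
print(attach_topic_particle("학생"))   # last syllable ends in a consonant
print(attach_topic_particle("학교"))   # last syllable ends in a vowel
```

The sketch assumes the word ends in a Hangul syllable; words ending in digits or Latin letters need pronunciation rules that are out of scope here.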

Corpora

  • Yonsei Corpus, Yonsei University, 1987.
    • 42M tokens of Korean text from the 1960s onward
  • Korea University Korean Corpus, 1995.
    • 10M tokens of Korean text from the 1970s to the 1990s
  • HANTEC 2.0, KISTI & Chungnam National University, 1998-2003.
    • 120,000 test documents (237MB)
    • 50 TREC-type questions for QA (48KB)
  • HKIB-40075, KISTI & Hankook Ilbo, 2002.
    • 40,075 test documents for text categorization (88MB)
  • KAIST Corpus, KAIST, 1997-2005.

  • Sejong Corpus, National Institute of the Korean Language, 1998-2007.
