References

Note

Please edit this document if anything is incorrect or missing. Last updated on Sep 25, 2017.

Korean morpheme analyzer tools

When analyzing Korean text, the most basic task is morphological analysis. Libraries for this are available in several programming languages:

C/C++

MeCab-ko (2013) - By Yong-woon Lee and Youngho Yoo GPL LGPL BSD
UTagger (2012) - By Joon-Choul Shin, Cheol-Young Ock* (University of Ulsan) GPL custom
- Joon-Choul Shin, Cheol-Young Ock, A Korean Morphological Analyzer using a Pre-analyzed Partial Word-phrase Dictionary (기분석 부분 어절 사전을 활용한 한국어 형태소 분석기), Journal of KIISE: Software and Applications, Vol. 39, No. 5, 2012.
- Joon-Choul Shin, Cheol-Young Ock, A Stage Transition Model for Korean Part-of-Speech and Homograph Tagging (한국어 품사 및 동형이의어 태깅을 위한 단계별 전이모델), Journal of KIISE: Software and Applications, Vol. 39, No. 11, 2012.
- slides
MACH (2002) - By Kwangseob Shim (Sungshin Women's University) custom
- Kwangseob Shim, Jaehyung Yang, MACH: A Supersonic Korean Morphological Analyzer, ACL, 2002.
KTS (1995) - By 이상호, 서정연, 오영환 (KAIST) GPL v2
- 이상호, KTS: Korean Tagging System Manual (Version 0.9)
- 김재훈, 서정연, 자연언어 처리를 위한 한국어 품사 태그 (A Korean part-of-speech tag set for natural language processing), 1993.
- Created in 1995, released in 2002. [1]

Java/Scala

twitter-korean-text (2014) - By Will Hohyon Ryu (Twitter) Apache v2
KOMORAN (2013) - By 신준수 (shineware) Apache v2
KKMA (2010) - By Sang-goo Lee*, Dongjoo Lee, et al. (Seoul National University) GPL v2
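- 이동주, 연종흠, 황인범, 이상구, 꼬꼬마: 관계형 데이터베이스를 활용한 세종 말뭉치 활용 도구 (KKMA: A Tool for Utilizing the Sejong Corpus with a Relational Database), Journal of KIISE: Computing Practices and Letters, Vol. 16, No. 11, 2010.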
Arirang (2009) - By SooMyung Lee Apache v2
- code
HanNanum (1999) - By Key-Sun Choi* et al. (KAIST) GPL v3
- code, docs

Python

KoNLPy (2014) GPL v3+
- By Lucy Park (Seoul National University)
- Wrapper for HanNanum, KKMA, KOMORAN, twitter-korean-text, MeCab-ko
- Tools for Hangul/Korean manipulation
UMorpheme (2014) MIT
- By Kyunghoon Kim (UNIST)
- Wrapper for MeCab-ko for online usage
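
Because KoNLPy wraps several analyzers behind a similar interface, code can be written against the interface rather than one engine. The following is a minimal sketch, assuming only that the tagger object exposes a nouns(text) method, as the KoNLPy tagger classes do. The helper names top_nouns and load_tagger, and the WhitespaceTagger stand-in, are hypothetical and exist only for illustration.

```python
from collections import Counter


def top_nouns(tagger, text, n=3):
    """Return the n most frequent nouns in text as (noun, count) pairs.

    tagger is any object with a nouns(text) method that returns a list
    of strings, for example a konlpy.tag.Kkma instance.
    """
    return Counter(tagger.nouns(text)).most_common(n)


def load_tagger():
    """Create a KoNLPy tagger (hypothetical helper).

    The import is done lazily because KoNLPy and a Java runtime must be
    installed for it to succeed.
    """
    from konlpy.tag import Kkma
    return Kkma()


class WhitespaceTagger:
    """Hypothetical stand-in for trying the sketch without KoNLPy.

    It splits on whitespace and is NOT a morphological analyzer.
    """

    def nouns(self, text):
        return text.split()
```

With KoNLPy installed, top_nouns(load_tagger(), text) would run the same counting logic on real analyzer output. Other KoNLPy tagger classes could be swapped in inside load_tagger without touching top_nouns.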

R

KoNLP (2011) GPL v3
- By Heewon Jeon
- Wrapper for HanNanum

Others

K-LIWC (Ajou University)
KRISTAL-IRMS (KISTI)
- Development history
Korean XTAG (UPenn)
HAM (Kookmin University)
POSTAG/K (POSTECH)

Corpora

Korea University Korean Corpus, 1995.
- 10M tokens of Korean text from the 1970s to the 1990s
HANTEC 2.0, KISTI & Chungnam National University, 1998-2003.
- 120,000 test documents (237MB)
- 50 TREC-type questions for QA (48KB)
HKIB-40075, KISTI & Hankook Ilbo, 2002.
- 40,075 test documents for text categorization (88MB)
KAIST Corpus, KAIST, 1997-2005.
Sejong Corpus, National Institute of the Korean Language, 1998-2007.
Yonsei Corpus, Yonsei University, 1987.
- 42M tokens of Korean text since the 1960s
BoRA Language Resource Bank (언어자원은행), KAIST

Other NLP tools

Hangulize - By Heungsub Lee Python
- Tool for transcribing words from 38+ languages into Hangul
Hanja - By Sumin Byeon Python
- Hanja-to-Hangul transcription tool
Jamo - By Joshua Dong Python
- Hangul syllable decomposition and synthesis
KoreanParser - By DongHyun Choi, Jungyeul Park, Key-Sun Choi (KAIST) Java
- Korean language parser
Korean - By Heungsub Lee Python
- Package for attaching particles (josa) in sentences
go_hangul (2012) - By Homin Lee Go BSD
- Tools for Hangul manipulation [docs]
Speller (Pusan National University)
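
Two of the tasks listed above, Hangul syllable decomposition/synthesis and particle (josa) attachment, both rest on the arithmetic layout of the Unicode Hangul Syllables block (U+AC00 to U+D7A3). The following is a minimal sketch using only the Python standard library. It is not the implementation of the Jamo or Korean packages, and the function names decompose, compose, and attach_josa are hypothetical.

```python
# Jamo tables in Unicode order: 19 initial consonants, 21 vowels, and
# 28 final slots (index 0 means "no final consonant").
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

BASE = 0xAC00   # code point of the first precomposed syllable
LAST = 0xD7A3   # code point of the last precomposed syllable


def decompose(syllable):
    """Split one precomposed Hangul syllable into (lead, vowel, tail)."""
    code = ord(syllable)
    if not BASE <= code <= LAST:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    index = code - BASE
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


def compose(lead, vowel, tail=""):
    """Build a precomposed syllable from its jamo (inverse of decompose)."""
    index = (LEADS.index(lead) * len(VOWELS) + VOWELS.index(vowel)) * len(TAILS)
    return chr(BASE + index + TAILS.index(tail))


def attach_josa(word, after_consonant, after_vowel):
    """Append the particle form that matches the word's last syllable.

    Example pairs: ("은", "는"), ("이", "가"), ("을", "를").
    Simplification: particles with extra rules, such as (으)로 after ㄹ,
    are not handled here.
    """
    _, _, tail = decompose(word[-1])
    return word + (after_consonant if tail else after_vowel)
```

For instance, decompose("한") yields the jamo ㅎ, ㅏ, ㄴ, and attach_josa("책", "을", "를") picks 을 because 책 ends in a final consonant. Words whose last character is not a Hangul syllable raise ValueError in this sketch; a fuller tool would need rules for digits and Latin text.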

[1]	https://wiki.kldp.org/wiki.php/KTS