KoNLPy (pronounced “ko en el PIE”) is a Python package for natural language processing (NLP) of the Korean language. For installation directions, see here.
>>> from konlpy.tag import Kkma
>>> from konlpy.utils import pprint
>>> kkma = Kkma()
>>> pprint(kkma.sentences(u'네, 안녕하세요. 반갑습니다.'))
[네, 안녕하세요..,
반갑습니다.]
>>> pprint(kkma.nouns(u'질문이나 건의사항은 깃헙 이슈 트래커에 남겨주세요.'))
[질문,
건의,
건의사항,
사항,
깃헙,
이슈,
트래커]
>>> pprint(kkma.pos(u'오류보고는 실행환경, 에러메세지와함께 설명을 최대한상세히!^^'))
[(오류, NNG),
(보고, NNG),
(는, JX),
(실행, NNG),
(환경, NNG),
(,, SP),
(에러, NNG),
(메세지, NNG),
(와, JKM),
(함께, MAG),
(설명, NNG),
(을, JKO),
(최대한, NNG),
(상세히, MAG),
(!, SF),
(^^, EMO)]
For more on how to use KoNLPy, go see the API.
Korean, the 13th most widely spoken language in the world, is a beautiful, yet complex language. Myriad Korean morpheme analyzer tools were built by numerous researchers, to computationally extract meaningful features from the labyrinthine text.
KoNLPy does not aim to create yet another analyzer, but to unify these tools, build upon their shoulders, and see one step further. It is built in the Python programming language, not only because of the language’s simplicity and elegance, but also because of its powerful string processing modules and its applicability to various tasks, including crawling, Web programming, and data analysis.
The three main philosophies of this project are:
Please report when you think any have gone stale.
KoNLPy isn’t perfect, but it will continuously evolve and you are invited to participate!
Found a bug? Have a good idea for improving KoNLPy? Visit the KoNLPy GitHub page and suggest an idea or make a pull request.
You are also welcome to join the #koreannlp channel at the Ozinger IRC Network, and the mailing list. The IRC channel is more focused on development discussions and the mailing list is a better place to ask questions, but nobody stops you from going the other way around.
Please note that asking questions through these channels is also a great contribution, because it gives the community feedback as well as ideas. Don’t hesitate to ask.
$ pip install JPype1
$ pip install konlpy
$ bash <(curl -s https://raw.githubusercontent.com/e9t/konlpy/master/scripts/mecab.sh) # (Optional) Install MeCab
C:\> pip install konlpy
(Optional) Download, extract [2], and install the most recent version of MeCab from the following links:
[1] | win-amd64 for 64-bit Windows, win32 for 32-bit Windows. |
[2] | Having MinGW/MSYS or Cygwin installed may be more convenient. Otherwise, you can use 7zip for the extraction of tar files. |
Morphological analysis is the identification of the structure of morphemes and other linguistic units, such as root words, affixes, or parts of speech.
POS (part-of-speech) tagging is the process of marking up morphemes in a phrase, based on their definitions and contexts. For example:
가방에 들어가신다 -> 가방/NNG + 에/JKM + 들어가/VV + 시/EPH + ㄴ다/EFN
In KoNLPy, there are several different options you can choose for POS tagging. All have the same input-output structure; the input is a phrase, and the output is a list of tagged morphemes.
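To make this input-output structure concrete, the sketch below converts the arrow notation used in the example above into a list of (morpheme, tag) pairs. It does not call KoNLPy: `parse_tagged` is a hypothetical helper written for Python 3, and the ` + ` and `/` separators are taken from the example line only.

```python
def parse_tagged(analysis):
    """Split 'morph/TAG + morph/TAG' into a list of (morph, tag) tuples."""
    pairs = []
    for chunk in analysis.split(" + "):
        # rpartition splits on the last slash, so a morpheme that is
        # itself "/" would still be separated from its tag correctly.
        morph, _, tag = chunk.strip().rpartition("/")
        pairs.append((morph, tag))
    return pairs


tagged = parse_tagged("가방/NNG + 에/JKM + 들어가/VV + 시/EPH + ㄴ다/EFN")
print(tagged)
# Joining the morphemes does not restore the original phrase exactly:
# spacing is dropped and contracted forms (시 + ㄴ다 for 신다) stay split.
print("".join(morph for morph, _ in tagged))
```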
For detailed usage instructions see the tag Package.
Next, we analyze the time and performance of the pos method for each of the classes in the tag Package.
The performance evaluation is replaced with result comparisons for several sample sentences.
“저는 대학생이구요. 소프트웨어 관련학과 입니다.”
“아버지가방에들어가신다”
“140823 Tofu Music Festival 존잘러에서 귀요미들로 변신ㅋㅋ #GOT7”
Kkma | Hannanum | Mecab |
---|---|---|
140823 / NR | 140823 / N | 140823 / SN |
Tofu / OL | Tofu / F | Tofu / SL |
Music / OL | Music / F | Music / SL |
Festival / OL | Festival / F | Festival / SL |
존 / NNP | 존잘러 / N | 존 / VA+JX |
잘 / MAG | | 잘 / VA |
러 / NNP | | 러 / EC |
에서 / JKM | 에서 / J | 에서 / JKB |
귀요 / NNG | 귀요미들 / N | 귀요미 / NNG |
미들 / NNG | | 들 / XSN |
로 / JKM | 로 / J | 로 / JKB |
변신 / NNG | 변신ㅋㅋ / N | 변신 / NNG |
ㅋㅋ / EMO | | ㅋㅋ / UNKNOWN |
# / SW | #GOT7 / N | # / SY |
GOT / OL | | GOT / SL |
7 / NR | | 7 / SN |
[1] | All time analyses in this document were performed with time on a Thinkpad X1 Carbon (2013) and KoNLPy v0.3. |
[2] | Average of five consecutive runs. |
[3] | Average of ten consecutive runs. |
[4] | The current Hannanum class raises a java.lang.ArrayIndexOutOfBoundsException: 10000 exception if the number of characters is too large. |
See also
Korean POS tags comparison chart
Compare POS tags between several Korean analytic projects. (In Korean)
Dictionaries are used for morphological analysis and POS tagging, and are built from corpora.
A dictionary created with the KAIST corpus. (4.7MB)
Located at ./konlpy/java/data/kE/dic_system.txt. Part of this file is shown below:
...
나라경제 ncn
나라기획 nqq
나라기획회장 ncn
나라꽃 ncn
나라님 ncn
나라도둑 ncn
나라따르 pvg
나라링링프로덕션 ncn
나라말 ncn
나라망신 ncn
나라박물관 ncn
나라발전 ncpa
나라별 ncn
나라부동산 nqq
나라사랑 ncn
나라살림 ncpa
나라시 nqq
나라시마 ncn
...
To add your own terms, modify ./konlpy/java/data/kE/dic_user.txt.
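Entries in this shape are easy to inspect programmatically. The sketch below groups a few of the excerpted terms by tag. It assumes that each line holds one term and one tag separated by whitespace, as the excerpt suggests; the real file may use a specific delimiter or contain other kinds of lines, and `load_entries` is a hypothetical helper, not part of KoNLPy.

```python
from collections import defaultdict

SAMPLE = """\
나라경제 ncn
나라기획 nqq
나라따르 pvg
나라발전 ncpa
"""


def load_entries(lines):
    """Map each tag to the list of terms that carry it."""
    by_tag = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        term, tag = parts
        by_tag[tag].append(term)
    return dict(by_tag)


entries = load_entries(SAMPLE.splitlines())
print(entries)
```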
A dictionary created with the Sejong corpus. (32MB)
It is included within the Kkma .jar file, so to see the dictionary files, check out the Kkma mirror. Part of kcc.dic is shown below:
아니/IC
후우/IC
그래서/MAC
그러나/MAC
그러니까/MAC
그러면/MAC
그러므로/MAC
그런데/MAC
그리고/MAC
따라서/MAC
하지만/MAC
...
A CSV formatted dictionary created with the Sejong corpus. (346MB)
The compiled version is located at /usr/local/lib/mecab/dic/mecab-ko-dic (or the path you assigned during installation), and you can see the original files in the source code. Part of CoinedWord.csv is shown below:
가오티,0,0,0,NNG,*,F,가오티,*,*,*,*,*
갑툭튀,0,0,0,NNG,*,F,갑툭튀,*,*,*,*,*
강퇴,0,0,0,NNG,*,F,강퇴,*,*,*,*,*
개드립,0,0,0,NNG,*,T,개드립,*,*,*,*,*
갠소,0,0,0,NNG,*,F,갠소,*,*,*,*,*
고퀄,0,0,0,NNG,*,T,고퀄,*,*,*,*,*
광삭,0,0,0,NNG,*,T,광삭,*,*,*,*,*
광탈,0,0,0,NNG,*,T,광탈,*,*,*,*,*
굉천,0,0,0,NNG,*,T,굉천,*,*,*,*,*
국을,0,0,0,NNG,*,T,국을,*,*,*,*,*
귀요미,0,0,0,NNG,*,F,귀요미,*,*,*,*,*
...
To add your own terms, see here.
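One detail of the CSV excerpt above can be checked mechanically. The seventh column appears to record whether a term ends in a final consonant (T) or not (F); this reading is an assumption drawn from the rows shown, not from the MeCab-ko documentation. The sketch below tests it in Python 3 with the standard arithmetic for precomposed Hangul syllables in Unicode. `has_final_consonant` is a hypothetical helper.

```python
import csv
import io

ROWS = """\
가오티,0,0,0,NNG,*,F,가오티,*,*,*,*,*
개드립,0,0,0,NNG,*,T,개드립,*,*,*,*,*
귀요미,0,0,0,NNG,*,F,귀요미,*,*,*,*,*
광탈,0,0,0,NNG,*,T,광탈,*,*,*,*,*
"""


def has_final_consonant(word):
    """True if the last Hangul syllable of word has a final consonant."""
    offset = ord(word[-1]) - 0xAC00  # precomposed syllables start at U+AC00
    if not 0 <= offset < 11172:
        raise ValueError("last character is not a Hangul syllable")
    return offset % 28 != 0  # 28 final slots per syllable; slot 0 means none


checks = []
for row in csv.reader(io.StringIO(ROWS)):
    term, flag = row[0], row[6]
    checks.append((term, flag, has_final_consonant(term)))
    print(term, flag, has_final_consonant(term))
```

For the rows sampled here, the computed value agrees with the T/F flag, which supports the assumed meaning of that column.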
Note
You can add new words either to the system dictionaries or to the user dictionaries. However, there is a slight difference between the two choices:
Exploring a document can consist of various components:
#! /usr/bin/python2.7
# -*- coding: utf-8 -*-
from collections import Counter
from konlpy.corpus import kolaw
from konlpy.tag import Hannanum
from konlpy.utils import concordance, pprint
from matplotlib import pyplot
def draw_zipf(count_list, filename, color='blue', marker='o'):
    sorted_list = sorted(count_list, reverse=True)
    pyplot.plot(sorted_list, color=color, marker=marker)
    pyplot.xscale('log')
    pyplot.yscale('log')
    pyplot.savefig(filename)
doc = kolaw.open('constitution.txt').read()
pos = Hannanum().pos(doc)
cnt = Counter(pos)
print('nchars :', len(doc))
print('ntokens :', len(doc.split()))
print('nmorphs :', len(set(pos)))
print('\nTop 20 frequent morphemes:')
pprint(cnt.most_common(20))
print('\nLocations of "대한민국" in the document:')
concordance(u'대한민국', doc, show=True)
draw_zipf(cnt.values(), 'zipf.png')
Console:
nchars : 19240
ntokens : 4178
nmorphs : 1501
Top 20 frequent morphemes:
[((의, J), 398),
((., S), 340),
((하, X), 297),
((에, J), 283),
((ㄴ다, E), 242),
((ㄴ, E), 226),
((이, J), 218),
((을, J), 211),
((은, J), 184),
((어, E), 177),
((를, J), 148),
((ㄹ, E), 135),
((/, S), 131),
((하, P), 124),
((는, J), 117),
((법률, N), 115),
((,, S), 100),
((는, E), 97),
((있, P), 96),
((되, X), 95)]
Locations of "대한민국" in the document:
0 대한민국헌법 유구한 역사와
9 대한국민은 3·1운동으로 건립된 대한민국임시정부의 법통과 불의에
98 총강 제1조 ① 대한민국은 민주공화국이다. ②대한민국의
100 ① 대한민국은 민주공화국이다. ②대한민국의 주권은 국민에게
110 나온다. 제2조 ① 대한민국의 국민이 되는
126 의무를 진다. 제3조 대한민국의 영토는 한반도와
133 부속도서로 한다. 제4조 대한민국은 통일을 지향하며,
147 추진한다. 제5조 ① 대한민국은 국제평화의 유지에
787 군무원이 아닌 국민은 대한민국의 영역안에서는 중대한
1836 파견 또는 외국군대의 대한민국 영역안에서의 주류에
3620 경제 제119조 ① 대한민국의 경제질서는 개인과
The code example below crawls a National Assembly bill from the web, extracts nouns, and draws a word cloud, all in Python.
You can change the bill number (i.e., bill_num), and see how the word clouds differ per bill. (e.g., ‘1904882’, ‘1904883’, ‘ZZ19098’, etc.)
#! /usr/bin/python2.7
# -*- coding: utf-8 -*-
from collections import Counter
import random
import sys
import webbrowser

from konlpy.tag import Hannanum
from lxml import html
import pytagcloud # requires Korean font support

# urlopen lives in different modules in Python 2 and Python 3
if sys.version_info[0] >= 3:
    from urllib.request import urlopen
else:
    from urllib import urlopen

# random RGB color for each tag
r = lambda: random.randint(0,255)
color = lambda: (r(), r(), r())

def get_bill_text(billnum):
    url = 'http://pokr.kr/bill/%s/text' % billnum
    response = urlopen(url).read().decode('utf-8')
    page = html.fromstring(response)
    text = page.xpath(".//div[@id='bill-sections']/pre/text()")[0]
    return text

def get_tags(text, ntags=50, multiplier=10):
    h = Hannanum()
    nouns = h.nouns(text)
    count = Counter(nouns)
    return [{ 'color': color(), 'tag': n, 'size': c*multiplier }
            for n, c in count.most_common(ntags)]

def draw_cloud(tags, filename, fontname='Noto Sans CJK', size=(800, 600)):
    pytagcloud.create_tag_image(tags, filename, fontname=fontname, size=size)
    webbrowser.open(filename)
bill_num = '1904882'
text = get_bill_text(bill_num)
tags = get_tags(text)
print(tags)
draw_cloud(tags, 'wordcloud.png')
Note
The PyTagCloud version on PyPI may not be sufficient for drawing word clouds in Korean. You may manually add fonts that support the Korean language, or install the Korean-supported version here.
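The tag list passed to the word cloud can be previewed without crawling or running a tagger. The sketch below, written for Python 3, mirrors the counting step of get_tags on a made-up noun list. `build_tags` and its `seed` parameter are hypothetical additions for illustration, and the colors are seeded only so that repeated runs give the same result.

```python
import random
from collections import Counter


def build_tags(nouns, ntags=50, multiplier=10, seed=0):
    """Turn a noun list into tag dicts, most frequent noun first."""
    rng = random.Random(seed)  # seeded so that colors are reproducible
    counts = Counter(nouns)
    return [
        {"color": tuple(rng.randint(0, 255) for _ in range(3)),
         "tag": noun,
         "size": count * multiplier}
        for noun, count in counts.most_common(ntags)
    ]


# Made-up input; a real run would use the nouns extracted from a bill.
sample_tags = build_tags(["법률", "국민", "법률", "권리", "법률", "국민"], ntags=2)
print(sample_tags)
```

Each tag's size grows linearly with its frequency, so `multiplier` controls how strongly frequent nouns dominate the image.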
We can find collocations with the help of NLTK.
#! /usr/bin/python2.7
# -*- coding: utf-8 -*-
from konlpy.tag import Kkma
from konlpy.corpus import kolaw
from konlpy.utils import pprint
from nltk import collocations
bigram_measures = collocations.BigramAssocMeasures()
trigram_measures = collocations.TrigramAssocMeasures()
doc = kolaw.open('constitution.txt').read()
pos = Kkma().pos(doc)
words = [s for s, t in pos]
tags = [t for s, t in pos]
print('\nCollocations among tagged words:')
finder = collocations.BigramCollocationFinder.from_words(pos)
pprint(finder.nbest(bigram_measures.pmi, 10)) # top 10 n-grams with highest PMI
print('\nCollocations among words:')
ignored_words = [u'안녕']
finder = collocations.BigramCollocationFinder.from_words(words)
finder.apply_word_filter(lambda w: len(w) < 2 or w in ignored_words)
finder.apply_freq_filter(3) # only bigrams that appear 3+ times
pprint(finder.nbest(bigram_measures.pmi, 10))
print('\nCollocations among tags:')
finder = collocations.BigramCollocationFinder.from_words(tags)
pprint(finder.nbest(bigram_measures.pmi, 5))
Console:
Collocations among tagged words:
[((가부, NNG), (동수, NNG)),
((강제, NNG), (노역, NNG)),
((경자, NNG), (유전, NNG)),
((고, ECS), (채취, NNG)),
((공무, NNG), (담임, NNG)),
((공중, NNG), (도덕, NNG)),
((과반, NNG), (수가, NNG)),
((교전, NNG), (상태, NNG)),
((그러, VV), (나, ECE)),
((기본적, NNG), (인권, NNG))]
Collocations among words:
[(현행, 범인),
(형의, 선고),
(내부, 규율),
(정치적, 중립성),
(누구, 든지),
(회계, 연도),
(지체, 없이),
(평화적, 통일),
(형사, 피고인),
(지방, 자치)]
Collocations among tags:
[(XR, XSA),
(JKC, VCN),
(VCN, ECD),
(ECD, VX),
(ECD, VXV)]
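For reference, the pointwise mutual information (PMI) used for ranking above compares how often two items occur together with how often they would co-occur by chance. The sketch below computes it by hand on a toy token list in Python 3. It is a simplified illustration (probabilities estimated from raw counts, logarithm in base 2), not NLTK's implementation, and `bigram_pmi` is a hypothetical helper.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "지방 자치 단체 는 지방 자치 를 보장 한다".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 2))
```

In this toy list the pair that occurs once scores higher than the pair that occurs twice. PMI rewards rare pairs in this way, which is why a frequency filter such as apply_freq_filter(3) in the example above is useful.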
KoNLPy has tests to evaluate its quality. To perform a test, use the code below.
$ pip install pytest
$ cd konlpy
$ py.test
KoNLPy was tested on the below environments:
Note
To see known bugs/issues, see here.
Note
Please modify this document if anything is erroneous or missing. Last updated on September 14, 2014.
K-LIWC (Ajou University)
Korean XTAG (UPenn)
HAM (Kookmin University)
POSTAG/K (POSTECH)
Speller (Pusan National University)
UTagger (University of Ulsan)
(No name) (Korea University)
KAIST corpus, KAIST, 1997-2005.
Sejong corpus, National Institute of the Korean Language, 1998-2007.
Initializes the Java virtual machine (JVM).
Parameters: | jvmpath – The path of the JVM. If left empty, inferred by jpype.getDefaultJVMPath(). |
---|
Bases: pprint.PrettyPrinter
Overridden method to enable Unicode pretty print.
Converts a unicode character to hex.
>>> char2hex(u'음')
'0xc74c'
Concatenates lines into a unified string.
Find concordances of a phrase in a text.
The leftmost numbers are indices that indicate the location of the phrase in the text (counted in tokens). The string that follows is the part of the text surrounding the phrase at the given index.
>>> from konlpy.corpus import kolaw
>>> from konlpy.tag import Mecab
>>> from konlpy import utils
>>> constitution = kolaw.open('constitution.txt').read()
>>> idx = utils.concordance(u'대한민국', constitution, show=True)
0 대한민국헌법 유구한 역사와
9 대한국민은 3·1운동으로 건립된 대한민국임시정부의 법통과 불의에
98 총강 제1조 ① 대한민국은 민주공화국이다. ②대한민국의
100 ① 대한민국은 민주공화국이다. ②대한민국의 주권은 국민에게
110 나온다. 제2조 ① 대한민국의 국민이 되는
126 의무를 진다. 제3조 대한민국의 영토는 한반도와
133 부속도서로 한다. 제4조 대한민국은 통일을 지향하며,
147 추진한다. 제5조 ① 대한민국은 국제평화의 유지에
787 군무원이 아닌 국민은 대한민국의 영역안에서는 중대한
1836 파견 또는 외국군대의 대한민국 영역안에서의 주류에
3620 경제 제119조 ① 대한민국의 경제질서는 개인과
>>> idx
[0, 9, 98, 100, 110, 126, 133, 147, 787, 1836, 3620]
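The idea behind this lookup can be sketched in a few lines of Python 3: split the text on whitespace, then report every token that contains the phrase. This is an illustration only, not the KoNLPy implementation. `find_concordance` and its `width` parameter are hypothetical, and the amount of context printed may differ from the output above.

```python
def find_concordance(phrase, text, width=2, show=False):
    """Return the indices of whitespace tokens that contain phrase."""
    tokens = text.split()
    hits = [i for i, tok in enumerate(tokens) if phrase in tok]
    if show:
        for i in hits:
            start = max(i - width, 0)
            context = " ".join(tokens[start:i + width + 1])
            print("%d\t%s" % (i, context))
    return hits


sample = ("대한민국헌법 유구한 역사와 전통에 빛나는 우리 대한국민은 "
          "3·1운동으로 건립된 대한민국임시정부의 법통과")
idx = find_concordance("대한민국", sample, show=True)
print(idx)
```

Because matching is done per token, 대한민국임시정부의 counts as a hit while 대한국민은 does not.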
Converts a hex character to unicode.
>>> print(hex2char('c74c'))
음
>>> print(hex2char('0xc74c'))
음
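Both conversions map directly onto Python 3 built-ins. The sketch below shows the round trip; `to_hex` and `from_hex` are hypothetical stand-ins written for illustration, not the KoNLPy functions themselves.

```python
def to_hex(char):
    """Code point of a single character as a hex string such as '0xc74c'."""
    return hex(ord(char))


def from_hex(hex_string):
    """Character for a hex code point; an optional '0x' prefix is accepted."""
    return chr(int(hex_string, 16))


print(to_hex("음"))
print(from_hex("c74c"), from_hex("0xc74c"))
```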
Text file loader.
Partitions a list to several parts using indices.
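A minimal sketch of index-based partitioning in Python 3 follows. It assumes that each index marks the position where a new part begins; the exact semantics of the KoNLPy function are not stated here, and `split_at` is a hypothetical name.

```python
def split_at(items, indices):
    """Cut items before each index in indices and return the pieces."""
    bounds = [0] + sorted(indices) + [len(items)]
    return [items[start:end] for start, end in zip(bounds, bounds[1:])]


parts = split_at(["a", "b", "c", "d", "e"], [2, 4])
print(parts)  # [['a', 'b'], ['c', 'd'], ['e']]
```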
Unicode pretty printer.
>>> import pprint, konlpy
>>> pprint.pprint([u"Print", u"유니코드", u"easily"])
[u'Print', u'\uc720\ub2c8\ucf54\ub4dc', u'easily']
>>> konlpy.utils.pprint([u"Print", u"유니코드", u"easily"])
['Print', '유니코드', 'easily']
Replaces some ambiguous punctuation marks to simpler ones.
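This kind of normalization can be sketched with a translation table in Python 3. The mapping below is illustrative only: the set of marks that KoNLPy actually replaces may differ, and `simplify_punctuation` is a hypothetical helper.

```python
# Illustrative mapping only; the marks KoNLPy actually replaces may differ.
REPLACEMENTS = {
    "\u00b7": "/",    # middle dot
    "\u2026": "...",  # horizontal ellipsis
    "\uff5e": "~",    # fullwidth tilde
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
}
TABLE = str.maketrans(REPLACEMENTS)


def simplify_punctuation(text):
    """Replace each mapped mark with its plain ASCII counterpart."""
    return text.translate(TABLE)


print(simplify_punctuation("3\u00b71운동 \u201c안녕\u201d\u2026"))
```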
Note
Initial runs of each class method may require some time to load dictionaries (< 1 min). Second runs should be faster.
Wrapper for JHannanum.
JHannanum is a morphological analyzer and POS tagger written in Java, and developed by the Semantic Web Research Center (SWRC) at KAIST since 1999.
from konlpy.tag import Hannanum
hannanum = Hannanum()
print(hannanum.analyze(u'롯데마트의 흑마늘 양념 치킨이 논란이 되고 있다.'))
print(hannanum.nouns(u'다람쥐 헌 쳇바퀴에 타고파'))
print(hannanum.pos(u'웃으면 더 행복합니다!'))
print(hannanum.morphs(u'웃으면 더 행복합니다!'))
Parameters: | jvmpath – The path of the JVM passed to init_jvm(). |
---|
Phrase analyzer.
This analyzer returns various morphological candidates for each token. It consists of two parts: 1) Dictionary search (chart), 2) Unclassified term segmentation.
Parse phrase to morphemes.
Noun extractor.
POS tagger.
This tagger is HMM based, and calculates the probability of tags.
Parameters: | ntags – The number of tags. It can be either 9 or 22. |
---|
Wrapper for Kkma.
Kkma is a morphological analyzer and natural language processing system written in Java, developed by the Intelligent Data Systems (IDS) Laboratory at SNU.
from konlpy.tag import Kkma
kkma = Kkma()
print(kkma.sentences(u'저는 대학생이구요. 소프트웨어 관련학과 입니다.'))
print(kkma.nouns(u'대학에서 DB, 통계학, 이산수학 등을 배웠지만...'))
print(kkma.morphs(u'자주 사용을 안하다보니 모두 까먹은 상태입니다.'))
print(kkma.pos(u'어쩌면 좋죠?'))
Parameters: | jvmpath – The path of the JVM passed to init_jvm(). |
---|
Parse phrase to morphemes.
Noun extractor.
POS tagger.
Sentence detection.
Warning
Mecab is not supported on Python 3 or on Windows 7.
Wrapper for MeCab-ko morphological analyzer.
MeCab, originally a Japanese morphological analyzer and a POS tagger developed by the Graduate School of Informatics in Kyoto University, was modified to MeCab-ko by the Eunjeon Project to adapt to the Korean language.
In order to use MeCab-ko within KoNLPy, follow the directions in optional-installations.
from konlpy.tag import Mecab
# MeCab installation needed
mecab = Mecab()
print(mecab.pos(u'자연주의 쇼핑몰은 어떤 곳인가?'))
print(mecab.morphs(u'영등포구청역에 있는 맛집 좀 알려주세요.'))
print(mecab.nouns(u'우리나라에는 무릎 치료를 잘하는 정형외과가 없는가!'))
Parameters: | dicpath – The path of the MeCab-ko dictionary. |
---|
Parse phrase to morphemes.
Noun extractor.
POS tagger.
See also
Korean POS tags comparison chart
Compare POS tags between several Korean analytic projects. (In Korean)
Loader for corpora. The following corpora are currently available:
>>> from konlpy.corpus import kolaw
>>> fids = kolaw.fileids()
>>> fobj = kolaw.open(fids[0])
>>> print(fobj.read(140))
대한민국헌법
유구한 역사와 전통에 빛나는 우리 대한국민은 3·1운동으로 건립된 대한민국임시정부의 법통과 불의에 항거한 4·19민주이념을 계승하고, 조국의 민주개혁과 평화적 통일의 사명에 입각하여 정의·인도와 동포애로써 민족의 단결을 공고히 하고, 모든 사회적 폐습과 불의를 타파하며, 자율과 조화를 바 바
Absolute path of corpus file. If filename is None, returns absolute path of corpus.
Parameters: | filename – Name of a particular file in the corpus. |
---|
List of file IDs in the corpus.
Method to open a file in the corpus. Returns a file object.
Parameters: | filename – Name of a particular file in the corpus. |
---|
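The three operations described above (absolute path, file IDs, open) can be pictured with a small stand-in class. The Python 3 sketch below is not the KoNLPy loader: `SimpleCorpus` is hypothetical, and it reads from a temporary directory created on the spot so that the example runs without any corpus installed.

```python
import os
import tempfile


class SimpleCorpus:
    """Hypothetical loader that serves text files from one directory."""

    def __init__(self, root):
        self.root = os.path.abspath(root)

    def abspath(self, filename=None):
        """Path of a file in the corpus, or of the corpus itself."""
        if filename is None:
            return self.root
        return os.path.join(self.root, filename)

    def fileids(self):
        """Sorted names of the files in the corpus directory."""
        return sorted(os.listdir(self.root))

    def open(self, filename):
        """Open a corpus file for reading as UTF-8 text."""
        return open(self.abspath(filename), encoding="utf-8")


# Build a throwaway corpus so that the example runs anywhere.
root = tempfile.mkdtemp()
with open(os.path.join(root, "sample.txt"), "w", encoding="utf-8") as f:
    f.write("대한민국헌법\n유구한 역사와 전통")

corpus = SimpleCorpus(root)
fids = corpus.fileids()
with corpus.open(fids[0]) as fobj:
    head = fobj.read(6)
print(fids, head)
```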
[1] | With clear and brief documents. |
[2] | No, I’m not extremely fond of this either. However, some important dependencies, such as Hannanum, Kkma, and MeCab-ko, are GPL licensed, and we want to honor their licenses. (It is also an inevitable choice. We hope things may change in the future.) |