tag Package
Note
The first call to each class method may take some time (< 1 min) while dictionaries are loaded. Subsequent calls should be faster.
Hannanum Class
class konlpy.tag._hannanum.Hannanum(jvmpath=None)
Wrapper for JHannanum.
JHannanum is a morphological analyzer and POS tagger written in Java. It has been developed by the Semantic Web Research Center (SWRC) at KAIST since 1999.
>>> from konlpy.tag import Hannanum
>>> hannanum = Hannanum()
>>> print(hannanum.analyze(u'롯데마트의 흑마늘 양념 치킨이 논란이 되고 있다.'))
[[[('롯데마트', 'ncn'), ('의', 'jcm')], [('롯데마트의', 'ncn')], [('롯데마트', 'nqq'), ('의', 'jcm')], [('롯데마트의', 'nqq')]], [[('흑마늘', 'ncn')], [('흑마늘', 'nqq')]], [[('양념', 'ncn')]], [[('치킨', 'ncn'), ('이', 'jcc')], [('치킨', 'ncn'), ('이', 'jcs')], [('치킨', 'ncn'), ('이', 'ncn')]], [[('논란', 'ncpa'), ('이', 'jcc')], [('논란', 'ncpa'), ('이', 'jcs')], [('논란', 'ncpa'), ('이', 'ncn')]], [[('되', 'nbu'), ('고', 'jcj')], [('되', 'nbu'), ('이', 'jp'), ('고', 'ecc')], [('되', 'nbu'), ('이', 'jp'), ('고', 'ecs')], [('되', 'nbu'), ('이', 'jp'), ('고', 'ecx')], [('되', 'paa'), ('고', 'ecc')], [('되', 'paa'), ('고', 'ecs')], [('되', 'paa'), ('고', 'ecx')], [('되', 'pvg'), ('고', 'ecc')], [('되', 'pvg'), ('고', 'ecs')], [('되', 'pvg'), ('고', 'ecx')], [('되', 'px'), ('고', 'ecc')], [('되', 'px'), ('고', 'ecs')], [('되', 'px'), ('고', 'ecx')]], [[('있', 'paa'), ('다', 'ef')], [('있', 'px'), ('다', 'ef')]], [[('.', 'sf')], [('.', 'sy')]]]
>>> print(hannanum.morphs(u'롯데마트의 흑마늘 양념 치킨이 논란이 되고 있다.'))
['롯데마트', '의', '흑마늘', '양념', '치킨', '이', '논란', '이', '되', '고', '있', '다', '.']
>>> print(hannanum.nouns(u'다람쥐 헌 쳇바퀴에 타고파'))
['다람쥐', '쳇바퀴', '타고파']
>>> print(hannanum.pos(u'웃으면 더 행복합니다!'))
[('웃', 'P'), ('으면', 'E'), ('더', 'M'), ('행복', 'N'), ('하', 'X'), ('ㅂ니다', 'E'), ('!', 'S')]
Parameters: jvmpath – The path of the JVM passed to init_jvm().
analyze(phrase)
Phrase analyzer.
This analyzer returns various morphological candidates for each token. It consists of two parts: 1) Dictionary search (chart), 2) Unclassified term segmentation.
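The nested result of analyze() can be hard to read at first. As a minimal sketch, assuming the structure seen in the example output above (one list per token, each holding candidate analyses, each candidate being a list of (morpheme, tag) pairs), the plain-Python snippet below counts the candidates per token and picks the first one. The `result` literal is an excerpt copied from that output; taking the first candidate is only an illustration and says nothing about how Hannanum ranks analyses.

```python
# Excerpt of the documented analyze() output (first three tokens only).
result = [
    [[('롯데마트', 'ncn'), ('의', 'jcm')],
     [('롯데마트의', 'ncn')],
     [('롯데마트', 'nqq'), ('의', 'jcm')],
     [('롯데마트의', 'nqq')]],
    [[('흑마늘', 'ncn')], [('흑마늘', 'nqq')]],
    [[('양념', 'ncn')]],
]

# Number of candidate analyses for each token.
candidate_counts = [len(candidates) for candidates in result]

# Illustration only: take the first listed candidate for each token.
first_candidates = [candidates[0] for candidates in result]

print(candidate_counts)
for analysis in first_candidates:
    print(analysis)
```

Tokens with more than one candidate, such as the first one here, are the ambiguous ones that a tagger has to resolve.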
morphs(phrase)
Parse phrase to morphemes.
nouns(phrase)
Noun extractor.
pos(phrase, ntags=9, flatten=True)
POS tagger.
This tagger is HMM-based and calculates the probability of tags.
Parameters:
- ntags – The number of tags. It can be either 9 or 22.
- flatten – If False, preserves eojeols.
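To make the flatten option more concrete, here is a minimal sketch in plain Python. It assumes that an eojeol-preserving result is a list with one inner list of (morpheme, tag) pairs per space-delimited word. The `by_eojeol` grouping below is a hypothetical illustration built from the Hannanum pos() example, not captured output.

```python
from itertools import chain

# Hypothetical eojeol-preserving result: one inner list per space-delimited word.
# The grouping is an assumption made for illustration.
by_eojeol = [
    [('웃', 'P'), ('으면', 'E')],
    [('더', 'M')],
    [('행복', 'N'), ('하', 'X'), ('ㅂ니다', 'E'), ('!', 'S')],
]

# Flattening concatenates the per-eojeol lists into one list of pairs.
flat = list(chain.from_iterable(by_eojeol))
print(flat)
```

Keeping the eojeol grouping is useful when word boundaries matter downstream, for example when morphemes need to be mapped back to the original spacing.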
Kkma Class
class konlpy.tag._kkma.Kkma(jvmpath=None)
Wrapper for Kkma.
Kkma is a morphological analyzer and natural language processing system written in Java, developed by the Intelligent Data Systems (IDS) Laboratory at SNU.
>>> from konlpy.tag import Kkma
>>> kkma = Kkma()
>>> print(kkma.morphs(u'공부를 하면할수록 모르는게 많다는 것을 알게 됩니다.'))
['공부', '를', '하', '면', '하', 'ㄹ수록', '모르', '는', '것', '이', '많', '다는', '것', '을', '알', '게', '되', 'ㅂ니다', '.']
>>> print(kkma.nouns(u'대학에서 DB, 통계학, 이산수학 등을 배웠지만...'))
['대학', '통계학', '이산', '이산수학', '수학', '등']
>>> print(kkma.pos(u'다 까먹어버렸네요?ㅋㅋ'))
[('다', 'MAG'), ('까먹', 'VV'), ('어', 'ECD'), ('버리', 'VXV'), ('었', 'EPT'), ('네요', 'EFN'), ('?', 'SF'), ('ㅋㅋ', 'EMO')]
>>> print(kkma.sentences(u'그래도 계속 공부합니다. 재밌으니까!'))
['그래도 계속 공부합니다.', '재밌으니까!']
Warning
There are reports that Kkma() is weak for long strings with no spaces between words. See issue #73 for details.
Parameters: jvmpath – The path of the JVM passed to init_jvm().
morphs(phrase)
Parse phrase to morphemes.
nouns(phrase)
Noun extractor.
pos(phrase, flatten=True)
POS tagger.
Parameters: flatten – If False, preserves eojeols.
sentences(phrase)
Sentence detection.
Komoran Class
class konlpy.tag._komoran.Komoran(jvmpath=None, dicpath=None)
Wrapper for KOMORAN.
KOMORAN is a relatively new open-source Korean morphological analyzer written in Java, developed by Shineware since 2013.
>>> from konlpy.tag import Komoran
>>> komoran = Komoran()
>>> print(komoran.morphs(u'우왕 코모란도 오픈소스가 되었어요'))
['우왕', '코', '모란', '도', '오픈소스', '가', '되', '었', '어요']
>>> print(komoran.nouns(u'오픈소스에 관심 많은 멋진 개발자님들!'))
['오픈소스', '관심', '개발자']
>>> print(komoran.pos(u'원칙이나 기체 설계와 엔진·레이더·항법장비 등'))
[('원칙', 'NNG'), ('이나', 'JC'), ('기체', 'NNG'), ('설계', 'NNG'), ('와', 'JC'), ('엔진', 'NNG'), ('·', 'SP'), ('레이더', 'NNG'), ('·', 'SP'), ('항법', 'NNP'), ('장비', 'NNG'), ('등', 'NNB')]
Parameters:
- jvmpath – The path of the JVM passed to init_jvm().
- dicpath – The path of dictionary files. The KOMORAN system dictionary is loaded by default.
morphs(phrase)
Parse phrase to morphemes.
nouns(phrase)
Noun extractor.
pos(phrase, flatten=True)
POS tagger.
Parameters: flatten – If False, preserves eojeols.
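A flattened pos() result is an ordinary list of (morpheme, tag) pairs, so it can be post-processed with plain Python. The sketch below filters the documented Komoran pos() output by tag prefix. The helper `select_by_tag_prefix` is a hypothetical name introduced here, and selecting tags that start with 'NN' is only an illustration; it is not necessarily how nouns() is implemented.

```python
# pos() output copied from the documented Komoran example.
tagged = [
    ('원칙', 'NNG'), ('이나', 'JC'), ('기체', 'NNG'), ('설계', 'NNG'),
    ('와', 'JC'), ('엔진', 'NNG'), ('·', 'SP'), ('레이더', 'NNG'),
    ('·', 'SP'), ('항법', 'NNP'), ('장비', 'NNG'), ('등', 'NNB'),
]


def select_by_tag_prefix(pairs, prefix):
    """Return the morphemes whose tag starts with the given prefix."""
    return [morph for morph, tag in pairs if tag.startswith(prefix)]


# Illustration only: keep morphemes tagged NNG, NNP, or NNB.
noun_like = select_by_tag_prefix(tagged, 'NN')
print(noun_like)
```

The same helper works for any tag set, but prefixes are tagger-specific, so a filter written for Komoran tags should not be assumed to carry over to other classes in this package.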
Mecab Class
Warning
Mecab() is not supported on Windows 7.
class konlpy.tag._mecab.Mecab(dicpath='/usr/local/lib/mecab/dic/mecab-ko-dic')
Wrapper for MeCab-ko morphological analyzer.
MeCab, originally a Japanese morphological analyzer and POS tagger developed by the Graduate School of Informatics at Kyoto University, was adapted to the Korean language as MeCab-ko by the Eunjeon Project.
In order to use MeCab-ko within KoNLPy, follow the directions in optional-installations.
>>> # MeCab installation needed
>>> from konlpy.tag import Mecab
>>> mecab = Mecab()
>>> print(mecab.morphs(u'영등포구청역에 있는 맛집 좀 알려주세요.'))
['영등포구', '청역', '에', '있', '는', '맛집', '좀', '알려', '주', '세요', '.']
>>> print(mecab.nouns(u'우리나라에는 무릎 치료를 잘하는 정형외과가 없는가!'))
['우리', '나라', '무릎', '치료', '정형외과']
>>> print(mecab.pos(u'자연주의 쇼핑몰은 어떤 곳인가?'))
[('자연', 'NNG'), ('주', 'NNG'), ('의', 'JKG'), ('쇼핑몰', 'NNG'), ('은', 'JX'), ('어떤', 'MM'), ('곳', 'NNG'), ('인가', 'VCP+EF'), ('?', 'SF')]
Parameters: dicpath – The path of the MeCab-ko dictionary.
morphs(phrase)
Parse phrase to morphemes.
nouns(phrase)
Noun extractor.
pos(phrase, flatten=True)
POS tagger.
Parameters: flatten – If False, preserves eojeols.
Twitter Class
class konlpy.tag._twitter.Twitter(jvmpath=None)
Wrapper for Twitter Korean Text.
Twitter Korean Text is an open source Korean tokenizer written in Scala, developed by Will Hohyon Ryu.
>>> from konlpy.tag import Twitter
>>> twitter = Twitter()
>>> print(twitter.morphs(u'단독입찰보다 복수입찰의 경우'))
['단독', '입찰', '보다', '복수', '입찰', '의', '경우', '가']
>>> print(twitter.nouns(u'유일하게 항공기 체계 종합개발 경험을 갖고 있는 KAI는'))
['유일하', '항공기', '체계', '종합', '개발', '경험']
>>> print(twitter.phrases(u'날카로운 분석과 신뢰감 있는 진행으로'))
['분석', '분석과 신뢰감', '신뢰감', '분석과 신뢰감 있는 진행', '신뢰감 있는 진행', '진행', '신뢰']
>>> print(twitter.pos(u'이것도 되나욬ㅋㅋ'))
[('이', 'Determiner'), ('것', 'Noun'), ('도', 'Josa'), ('되나욬', 'Noun'), ('ㅋㅋ', 'KoreanParticle')]
>>> print(twitter.pos(u'이것도 되나욬ㅋㅋ', norm=True))
[('이', 'Determiner'), ('것', 'Noun'), ('도', 'Josa'), ('되', 'Verb'), ('나요', 'Eomi'), ('ㅋㅋ', 'KoreanParticle')]
>>> print(twitter.pos(u'이것도 되나욬ㅋㅋ', norm=True, stem=True))
[('이', 'Determiner'), ('것', 'Noun'), ('도', 'Josa'), ('되다', 'Verb'), ('ㅋㅋ', 'KoreanParticle')]
Parameters: jvmpath – The path of the JVM passed to init_jvm().
morphs(phrase, norm=False, stem=False)
Parse phrase to morphemes.
nouns(phrase)
Noun extractor.
phrases(phrase)
Phrase extractor.
pos(phrase, norm=False, stem=False)
POS tagger. In contrast to other classes in this subpackage, this POS tagger doesn’t have a flatten option, but has norm and stem options. Check the parameter list below.
Parameters:
- norm – If True, normalize tokens.
- stem – If True, stem tokens.
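The effect of norm and stem is easiest to see by comparing results side by side. The sketch below uses the three documented Twitter pos() outputs for the same input as literals and lists the pairs that each option introduces. The helper `only_in` is a hypothetical name introduced for this illustration, and the comparison is a simple membership check, not a proper alignment.

```python
# Documented pos() outputs for u'이것도 되나욬ㅋㅋ' under three settings.
plain = [('이', 'Determiner'), ('것', 'Noun'), ('도', 'Josa'),
         ('되나욬', 'Noun'), ('ㅋㅋ', 'KoreanParticle')]
normalized = [('이', 'Determiner'), ('것', 'Noun'), ('도', 'Josa'),
              ('되', 'Verb'), ('나요', 'Eomi'), ('ㅋㅋ', 'KoreanParticle')]
stemmed = [('이', 'Determiner'), ('것', 'Noun'), ('도', 'Josa'),
           ('되다', 'Verb'), ('ㅋㅋ', 'KoreanParticle')]


def only_in(a, b):
    """Return the pairs of a that do not appear in b, preserving order."""
    return [pair for pair in a if pair not in b]


# Pairs introduced by norm=True relative to the default call.
added_by_norm = only_in(normalized, plain)

# Pairs introduced by additionally setting stem=True.
added_by_stem = only_in(stemmed, normalized)

print(added_by_norm)
print(added_by_stem)
```

In this example, normalization splits the noisy token into a verb and an ending, and stemming then replaces both with the dictionary form of the verb.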
See also
Korean POS tags comparison chart
Compare POS tags between several Korean analytic projects. (In Korean)