gensim Study Notes

LDA Topic Model

Preparing the Data

1. Chinese Wikipedia dump
2. Process the dump (*.xml.bz2) with gensim's WikiCorpus class

>>> from gensim.corpora import WikiCorpus, MmCorpus
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')
>>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki)

# -*- coding: utf-8 -*-
import logging
import sys
from gensim.corpora import WikiCorpus

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

def help():
    print "Usage: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt"

if __name__ == '__main__':
    if len(sys.argv) < 3:
        help()
        sys.exit(1)
    logging.info("running %s" % ' '.join(sys.argv))
    inp, outp = sys.argv[1:3]
    i = 0
    output = open(outp, 'w')
    # extract plain text from the bz2-compressed wiki dump
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(" ".join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logging.info("Saved " + str(i) + " articles")
    output.close()
    logging.info("Finished, saved " + str(i) + " articles")
process_wiki_1.py

3. Data preprocessing

#!/bin/bash
# Traditional Chinese to Simplified Chinese
echo "opencc: Traditional Chinese to Simplified Chinese..."
#time opencc -i wiki.zh.txt -o wiki.zh.chs.txt -c zht2zhs.ini
time opencc -i wiki.zh.txt -o wiki.zh.chs.txt -c t2s.json
# Cut words
echo "jieba: Cut words..."
time python -m jieba -d ' ' wiki.zh.chs.txt > wiki.zh.seg.txt
# Change encoding
echo "iconv: ascii to utf-8..."
time iconv -c -t UTF-8 < wiki.zh.seg.txt > wiki.zh.seg.utf.txt
process_wiki_2.sh

The processed, already word-segmented data can be downloaded here:
http://pan.baidu.com/s/1gfMhkcV password: gdua
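
As a quick sanity check, here is a minimal sketch (assuming the segmented file is named wiki.zh.word.txt, as in the training snippet below; adjust the name to whichever file you downloaded or produced) that prints the first twenty tokens of the first article to confirm the words are separated by spaces:

import codecs

# peek at the first segmented article; the file name wiki.zh.word.txt is an
# assumption taken from the LDA training snippet below, adjust as needed
with codecs.open('wiki.zh.word.txt', 'r', encoding='utf8') as fp:
    first_line = fp.readline()
print " ".join(first_line.split()[:20])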

LDA Experiments

1. After removing stopwords, we can train the LDA model

import codecs
from gensim.models import LdaModel
from gensim.corpora import Dictionary

# load the stopword list
stopwords = codecs.open('stopwords.txt', 'r', encoding='utf8').readlines()
stopwords = set(w.strip() for w in stopwords)

# read the segmented wiki corpus, dropping stopwords
train = []
fp = codecs.open('wiki.zh.word.txt', 'r', encoding='utf8')
for line in fp:
    line = line.split()
    train.append([w for w in line if w not in stopwords])

dictionary = Dictionary(train)
corpus = [dictionary.doc2bow(text) for text in train]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)

Stopword list download: http://pan.baidu.com/s/1qYnsSLe password: s0hc

In addition, gensim ships a make_wiki script that extracts the wiki dump directly and saves it as sparse matrices; run the command below in bash to see its usage.

python -m gensim.scripts.make_wiki

USAGE: make_wiki.py WIKI_XML_DUMP OUTPUT_PREFIX [VOCABULARY_SIZE]
Convert articles from a Wikipedia dump to (sparse) vectors. The input is a
bz2-compressed dump of Wikipedia articles, in XML format.
This actually creates three files:
* `OUTPUT_PREFIX_wordids.txt`: mapping between words and their integer ids
* `OUTPUT_PREFIX_bow.mm`: bag-of-words (word counts) representation, in
  Matrix Market format
* `OUTPUT_PREFIX_tfidf.mm`: TF-IDF representation
* `OUTPUT_PREFIX.tfidf_model`: TF-IDF model dump

python -m gensim.scripts.make_wiki zhwiki-latest-pages-articles.xml.bz2 zhwiki

This converts the articles into plain text and stores them as sparse TF-IDF vectors. For details, see the gensim documentation; the .mm extension indicates a sparse matrix saved in Matrix Market format.
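
To get a feel for this format, here is a minimal sketch (assuming the output prefix zhwiki from the command above) that loads the resulting corpus and prints the first document as a list of (word id, TF-IDF weight) pairs:

from gensim import corpora

# load the Matrix Market corpus produced by make_wiki
# (file name assumes the output prefix "zhwiki" used above)
mm = corpora.MmCorpus('zhwiki_tfidf.mm')
print mm  # reports the number of documents, terms and non-zero entries
# each document is a sparse vector: a list of (word id, TF-IDF weight) pairs
first_doc = next(iter(mm))
print first_doc[:10]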

2. Experiment
Train the LDA model with tfidf.mm and wordids.txt

# -*- coding: utf-8 -*-
from gensim import corpora, models

# load the corpus
id2word = corpora.Dictionary.load_from_text('zhwiki_wordids.txt')
mm = corpora.MmCorpus('zhwiki_tfidf.mm')

# train the model
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100)

3. Model results
During training we set num_topics=100, i.e. 100 topics are learned. The word distribution of each topic can be inspected with print_topics() and print_topic(), and the model can be persisted and reloaded with save()/load().

# print the word distributions of the first 20 topics
lda.print_topics(20)
# print the word distribution of the topic with id 20
lda.print_topic(20)
# save / load the model
lda.save('zhwiki_lda.model')
lda = models.ldamodel.LdaModel.load('zhwiki_lda.model')

4. Topic prediction
For a new document, convert it to a bag-of-words vector and the model can then infer its topic distribution. The quality of the result depends mainly on the chosen number of topics and on the corpus itself: the wiki corpus covers all domains, so its topic distributions are not very distinct, and since stopwords were not removed here, the results are only passable.

import jieba

# test_doc holds the raw text of the new document
test_doc = list(jieba.cut(test_doc))   # segment the new document
doc_bow = id2word.doc2bow(test_doc)    # convert the document to bag-of-words
doc_lda = lda[doc_bow]                 # infer the topic distribution of the new document
# print the topic distribution of the new document
print doc_lda
for topic in doc_lda:
    print "%s\t%f\n" % (lda.print_topic(topic[0]), topic[1])
