Example of using the Google News word2vec embeddings with gensim
A pre-trained set of word embeddings built from Google News is available as GoogleNews-vectors-negative300.bin; the file can be downloaded from the link below. It is about 1.5 GB compressed and about 3.4 GB once unpacked.
GoogleNews-vectors-negative300.bin download location: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
This dataset can be read and used with the gensim package.
- gensim package
With pip: pip install --upgrade gensim
With Anaconda: conda install -c conda-forge gensim
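After installing, a quick sanity check confirms that gensim is importable (a minimal check; any recent version should work):
import gensim
print(gensim.__version__)  # prints the installed gensim version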
gensim's core concepts are described as follows; see the gensim documentation for more detail (a small sketch follows this list).
- Document: a piece of text.
- Corpus: a collection of documents.
- Vector: a mathematically convenient representation of a document.
- Model: an algorithm that transforms vectors from one representation to another.
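As a minimal sketch of how these four concepts fit together (the toy documents below are made up purely for illustration):
from gensim import corpora, models
documents = [['summer', 'is', 'hot'], ['winter', 'is', 'cold']]  # Documents: tokenized texts
dictionary = corpora.Dictionary(documents)                       # maps each token to an integer id
corpus = [dictionary.doc2bow(doc) for doc in documents]          # Corpus: bag-of-words Vectors
tfidf = models.TfidfModel(corpus)                                # Model: maps one vector representation to another
print(tfidf[corpus[0]])                                         # the first document in TF-IDF space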
- Usage examples
1. Load the Google News embedding data
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
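Loading the full file takes several gigabytes of RAM. If memory is tight, the limit parameter of load_word2vec_format caps how many vectors are read; a sketch, where 500000 is an arbitrary cutoff:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=500000)  # keep only the 500k most frequent words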
2. To find the 10 words most related to 'summer', you can do the following.
model.most_similar(positive=['summer'],topn=10)
[('spring', 0.7650764584541321),
('winter', 0.7155519127845764),
('summertime', 0.691234290599823),
('summers', 0.6734901070594788),
('autumn', 0.6497201919555664),
('weekend', 0.6279284954071045),
('week', 0.6263920068740845),
('Summer', 0.6091794967651367),
('springtime', 0.5996415019035339),
('month', 0.5987645387649536)]
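most_similar ranks candidates by cosine similarity; the score for a single pair can be read directly with similarity, which matches the values above:
model.similarity('summer', 'spring')  # ~0.765, the same score most_similar reported for 'spring'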
3. When you want to find words that stand in a similar relationship, you can do the following. Superman has Superman_Returns and IronMan has Iron_Man_Triathlon; what, then, is the Batman counterpart?
model.most_similar(positive=['Batman', 'Superman_Returns', 'Iron_Man_Triathlon'], negative=['Superman', 'IronMan'], topn=10)
[('Dark_Knight', 0.5379644632339478),
('Batman_Begins', 0.5177570581436157),
('Shutter_Island', 0.4835358262062073),
('Christopher_Nolan', 0.47584158182144165),
('Christopher_Nolan_Inception', 0.46696215867996216),
('Tim_Burton', 0.45563334226608276),
('Transformers_sequel', 0.4532979130744934),
('Dead_Man_Chest', 0.44873523712158203),
('threequel', 0.4482341408729553),
('Little_Fockers', 0.4472888112068176)]
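The same positive/negative mechanism covers the classic word2vec analogy; for example (the 'queen' result is a well-known property of this dataset, shown as a sketch rather than verified output):
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)  # expected to rank 'queen' first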
4. A word's vector is obtained simply by indexing the model with the word, as in model['carrot'] below.
model['carrot']
array([ 7.37304688e-02, 2.91015625e-01, 1.17675781e-01, 2.85156250e-01,
5.63964844e-02, -1.16699219e-01, 1.66015625e-01, -2.37304688e-01,
-4.45312500e-01, 1.96289062e-01, -1.79687500e-01, -1.00097656e-01,
7.81250000e-02, 1.52343750e-01, -4.60937500e-01, 3.57421875e-01,
-4.39453125e-02, -5.95703125e-02, -2.15820312e-01, -3.57421875e-01,
4.54101562e-02, 3.26171875e-01, 2.05078125e-01, 3.69140625e-01,
7.61718750e-02, -1.57226562e-01, -3.08593750e-01, 1.51367188e-01,
-1.01074219e-01, 1.29882812e-01, 3.93066406e-02, 1.83105469e-02,
6.54296875e-02, -2.05078125e-01, 1.46484375e-01, 5.34667969e-02,
1.10351562e-01, 2.44140625e-01, 5.22460938e-02, 1.27929688e-01,
2.85156250e-01, -5.63964844e-02, -1.92382812e-01, -5.83496094e-02,
-8.60595703e-03, -2.08007812e-01, 1.72119141e-02, -1.18164062e-01,
2.87109375e-01, 1.67968750e-01, -2.06054688e-01, 2.63671875e-01,
2.16064453e-02, 2.28271484e-02, 1.47460938e-01, 7.03125000e-02,
-2.23632812e-01, -2.17773438e-01, -7.76367188e-02, -1.30859375e-01,
-1.57226562e-01, 3.24707031e-02, 1.68945312e-01, 2.71484375e-01,
-9.91210938e-02, -2.17773438e-01, -5.24902344e-02, 7.42187500e-02,
9.47265625e-02, -4.76074219e-02, 4.76074219e-02, -1.41601562e-01,
-1.11328125e-01, 4.32128906e-02, 4.66308594e-02, -9.96093750e-02,
-1.66015625e-02, 2.69531250e-01, 1.55273438e-01, -1.97265625e-01,
7.17773438e-02, 1.26953125e-01, -3.85742188e-02, 8.69140625e-02,
5.43212891e-03, -3.58886719e-02, -8.30078125e-02, -1.08398438e-01,
1.83593750e-01, -5.29785156e-02, -9.81445312e-02, -2.11914062e-01,
-1.10839844e-01, -1.33789062e-01, 4.68750000e-02, -2.79541016e-02,
2.18750000e-01, 1.75781250e-02, 3.44238281e-02, -1.38671875e-01,
-2.43164062e-01, -2.47070312e-01, -1.69921875e-01, 1.42578125e-01,
-2.28271484e-02, -8.20312500e-02, 2.63671875e-01, -1.83593750e-01,
2.09960938e-02, -3.61328125e-01, -5.56640625e-02, -1.56250000e-01,
1.53320312e-01, 3.82812500e-01, -1.92382812e-01, 1.24023438e-01,
2.75390625e-01, -9.17968750e-02, -1.04980469e-01, -3.58886719e-02,
1.38671875e-01, 5.00488281e-02, -2.09960938e-01, -1.44531250e-01,
-2.44140625e-01, -5.20019531e-02, 3.67187500e-01, 1.60156250e-01,
1.61132812e-01, 8.15429688e-02, -2.14843750e-01, 5.68847656e-02,
-1.69677734e-02, -2.55859375e-01, -2.61718750e-01, 3.75000000e-01,
-2.02148438e-01, 3.14941406e-02, -1.84326172e-02, 3.59375000e-01,
1.34765625e-01, -1.16699219e-01, -1.16577148e-02, -2.50000000e-01,
2.53906250e-01, 2.91015625e-01, -3.30078125e-01, -4.22363281e-02,
-8.10546875e-02, -1.69921875e-01, 7.22656250e-02, -7.17773438e-02,
-1.12792969e-01, 1.23901367e-02, 8.10546875e-02, -2.32421875e-01,
-2.86865234e-02, -2.20703125e-01, -8.05664062e-02, 1.76757812e-01,
-1.53320312e-01, -1.68945312e-01, -1.04980469e-01, -1.12304688e-01,
-5.71289062e-02, 6.29882812e-02, 3.24707031e-02, -2.55859375e-01,
-9.32617188e-02, 3.36914062e-02, -1.40625000e-01, 2.46093750e-01,
-3.32031250e-01, 1.01562500e-01, -1.84570312e-01, -1.08886719e-01,
3.16406250e-01, -4.71191406e-02, -3.22265625e-02, 4.37500000e-01,
-1.72851562e-01, -1.51367188e-01, -1.70898438e-02, 1.44531250e-01,
-2.67578125e-01, 6.15234375e-02, -6.00585938e-02, 2.85156250e-01,
-1.33789062e-01, -9.96093750e-02, -3.18359375e-01, 3.20312500e-01,
-9.76562500e-02, -2.75390625e-01, 3.28125000e-01, 1.39648438e-01,
-3.04687500e-01, 1.48437500e-01, -2.65625000e-01, 1.55273438e-01,
5.78613281e-02, -9.27734375e-02, 4.76074219e-02, 3.03955078e-02,
-7.56835938e-02, -1.66015625e-01, -4.55856323e-04, 1.81640625e-01,
-7.42187500e-02, 1.26953125e-01, -2.79296875e-01, 2.80761719e-02,
3.35693359e-04, 1.58203125e-01, -7.51953125e-02, 9.57031250e-02,
5.15136719e-02, 1.37695312e-01, 2.57812500e-01, 4.86328125e-01,
7.42187500e-02, -1.25000000e-01, 7.37304688e-02, 6.78710938e-02,
-1.65039062e-01, -5.93261719e-02, 1.10839844e-01, -1.12304688e-01,
3.43750000e-01, -1.22558594e-01, -1.31835938e-01, -2.71484375e-01,
4.08203125e-01, 1.20117188e-01, -4.61425781e-02, -1.51367188e-01,
-6.79016113e-04, -1.20605469e-01, 4.21875000e-01, 1.74804688e-01,
9.08203125e-02, 2.92968750e-02, 5.03540039e-03, 6.62231445e-03,
9.08203125e-02, -2.94921875e-01, -1.51367188e-01, 1.30859375e-01,
-8.97216797e-03, 1.68945312e-01, -1.56250000e-01, -2.05078125e-01,
1.53320312e-01, -9.86328125e-02, -7.95898438e-02, 1.40380859e-02,
-1.31835938e-01, -3.53515625e-01, -1.15722656e-01, -8.30078125e-02,
2.91015625e-01, 3.08593750e-01, -4.08203125e-01, 1.37695312e-01,
-7.86132812e-02, -8.05664062e-02, -3.96484375e-01, -2.67333984e-02,
-4.62890625e-01, -2.59765625e-01, 1.43554688e-01, -4.29687500e-02,
4.17968750e-01, -1.51367188e-01, -8.10546875e-02, 1.11816406e-01,
4.12597656e-02, -8.60595703e-03, -9.42382812e-02, 8.15429688e-02,
-3.00781250e-01, 3.59375000e-01, -1.72119141e-02, -1.47460938e-01,
-3.75000000e-01, -8.05664062e-02, -2.41210938e-01, 8.98437500e-02,
1.81640625e-01, 4.66308594e-02, -3.98437500e-01, 3.36914062e-02,
1.35742188e-01, 6.44531250e-02, 3.32031250e-02, -6.16455078e-03,
-3.58886719e-02, 8.72802734e-03, -5.15136719e-02, -1.05285645e-03],
dtype=float32)
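Each vector is a 300-dimensional float32 array, so plain numpy arithmetic applies; a minimal sketch of computing by hand the cosine similarity that model.similarity returns:
import numpy as np
v1, v2 = model['summer'], model['spring']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity, ~0.765 as seen earlier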
5. Using t-SNE, the embedding positions of the words can be visualized as follows.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
items = ['apple','pear','orange','peach','hockey','soccer','handball','baseball','spring','summer','autumn','winter']
item_vectors = [(item, model[item]) for item in items if item in model]  # keep only in-vocabulary words
vectors = np.asarray([x[1] for x in item_vectors])  # (n_items, 300) matrix
tsne = TSNE(n_components=2, perplexity=10, verbose=2).fit_transform(vectors)  # project to 2D
'''
array([[-6.13828850e+01, -9.42033539e+01],
[-7.22851791e+01, -1.65445816e+02],
[ 6.05180450e-02, -1.38084763e+02],
[-1.24179794e+02, -1.21860931e+02],
[-1.59633362e+02, 3.07134132e+01],
[-1.13330589e+02, -1.97091560e+01],
[-1.80118347e+02, -5.03099327e+01],
[-8.97173386e+01, 4.90914650e+01],
[-5.21244586e-01, 4.67340393e+01],
[ 3.19449387e+01, -5.50076752e+01],
[ 5.97267227e+01, 7.03188896e+00],
[-2.42969208e+01, -1.73803673e+01]], dtype=float32)
'''
x = tsne[:, 0]
y = tsne[:, 1]
fig, ax = plt.subplots()
ax.scatter(x, y)
for item, x1, y1 in zip(item_vectors, x, y):
    ax.annotate(item[0], (x1, y1), size=14)  # label each point with its word
plt.show()
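t-SNE is stochastic, so the coordinates differ between runs; passing scikit-learn's random_state parameter makes the layout reproducible:
tsne = TSNE(n_components=2, perplexity=10, random_state=42, verbose=2).fit_transform(vectors)  # fixed seed for a repeatable layout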