gensim

An introduction to gensim, a natural language processing (NLP) library.

What is gensim?

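gensim is a Python library for natural language processing (NLP). It provides topic modeling and similarity search, along with word and document embeddings that can serve as features for tasks such as text classification.

It is well suited to large text collections because a corpus can be streamed document by document instead of being loaded into memory all at once, and it can handle many kinds of text data.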

Installing gensim

Run the following command to install gensim.

pip install gensim

Using gensim

Before using gensim, split your text data into documents and tokenize each document into a list of words.

gensim provides the following features, among others.

Word2Vec
Doc2Vec
LDA (Latent Dirichlet Allocation)
TF-IDF (Term Frequency - Inverse Document Frequency)

gensim sample code

The following example uses gensim to compute the similarity between two words.

from gensim.models import Word2Vec

# Prepare the text data: one list of tokens per sentence
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
            ['this', 'is', 'the', 'second', 'sentence'],
            ['yet', 'another', 'sentence'],
            ['one', 'more', 'sentence'],
            ['and', 'the', 'final', 'sentence']]

# Train a Word2Vec model (min_count=1 keeps words that appear only once)
model = Word2Vec(sentences, min_count=1)

# Compute the cosine similarity between two words
similarity = model.wv.similarity('first', 'second')
print("Similarity between 'first' and 'second':", similarity)

# Look up the vector for a single word
vector = model.wv['sentence']
print("Vector for 'sentence':", vector)

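Sample code output

Running the sample code above produces output like the following. The exact numbers vary from run to run, and with a corpus this small they carry no real meaning.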

Similarity between 'first' and 'second': 0.089306
Vector for 'sentence': [ 3.6481785e-03 -1.0167462e-03 -2.5234916e-03 -1.6847862e-03
4.9742264e-03 -4.2414437e-03 -1.8586559e-03 -1.1807683e-03
…
…
3.0481311e-03 2.2055961e-03 -4.9911914e-03 4.3099104e-03
-1.7455659e-03 -3.5057513e-03 -3.0229077e-03 2.5836352e-03]

First print statement
- The similarity between the words 'first' and 'second'
  - Computed as 0.089306 in this run
Second print statement
- The vector representation of the word 'sentence'