Text Classification 1

Train a model that predicts whether a given article is about China or not.

In [1]:
import pandas as pd

Data

The list corpus holds a few artificially constructed texts. The list classes records, for each text in the same order, whether it is about China ("yes") or not ("no").

In [2]:
corpus = ["Chinese Beijing Chinese",
          "Chinese Chinese Changhai",
          "Chinese Macao",
          "Tokyo Japan Chinese"]

classes = ["yes", "yes", "yes", "no"]

Preprocessing

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer transforms a collection of texts into a matrix of word counts: one row per text, one column per vocabulary word.

In [4]:
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
Out[4]:
CountVectorizer()
In [5]:
X_train = vectorizer.transform(corpus)
X_train
Out[5]:
<4x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>
In [6]:
# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# get_feature_names_out() returns a NumPy array, so convert it to a plain list.
words = vectorizer.get_feature_names_out().tolist()
words
Out[6]:
['beijing', 'changhai', 'chinese', 'japan', 'macao', 'tokyo']
In [7]:
X_train = pd.DataFrame(X_train.toarray(), columns=words)
X_train
Out[7]:
   beijing  changhai  chinese  japan  macao  tokyo
0        1         0        2      0      0      0
1        0         1        2      0      0      0
2        0         0        1      0      1      0
3        0         0        1      1      0      1
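
To make the count matrix less of a black box, the same table can be rebuilt by hand. This is only a minimal sketch: it lowercases and splits on whitespace, which agrees with CountVectorizer's default tokenization for simple texts like these (no punctuation, no one-letter words) but not in general. The names tokenized, vocabulary and counts are illustrative and not part of the original notebook.

```python
from collections import Counter

corpus = ["Chinese Beijing Chinese",
          "Chinese Chinese Changhai",
          "Chinese Macao",
          "Tokyo Japan Chinese"]

# Lowercase each text and split it on whitespace (a simplification of
# CountVectorizer's default tokenizer, adequate for this toy corpus).
tokenized = [doc.lower().split() for doc in corpus]

# The vocabulary is the sorted set of all tokens: one column per word.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Each row counts how often every vocabulary word occurs in one text.
counts = []
for doc in tokenized:
    frequencies = Counter(doc)
    counts.append([frequencies[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

The printed rows should match the DataFrame above column for column.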

Another way to represent text is the TF-IDF representation, which down-weights words that occur in many texts:

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [9]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(corpus)
Out[9]:
TfidfVectorizer()
In [10]:
X_train2 = tfidf_vectorizer.transform(corpus)
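# Take the column names from the TF-IDF vectorizer itself rather than relying on
# the count vectorizer having learned the same vocabulary.
X_train2 = pd.DataFrame(X_train2.toarray(), columns=tfidf_vectorizer.get_feature_names_out())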
X_train2
Out[10]:
    beijing  changhai   chinese     japan     macao     tokyo
0  0.691835  0.000000  0.722056  0.000000  0.000000  0.000000
1  0.000000  0.691835  0.722056  0.000000  0.000000  0.000000
2  0.000000  0.000000  0.462637  0.000000  0.886548  0.000000
3  0.000000  0.000000  0.346182  0.663385  0.000000  0.663385

Training the model

Multinomial naive Bayes is a simple algorithm that often performs well as a text classifier on word-count features, so we use it here.

In [11]:
from sklearn.naive_bayes import MultinomialNB
In [12]:
model = MultinomialNB()
model.fit(X_train, classes)
Out[12]:
MultinomialNB()

Classification example

In [13]:
test_doc = ["Chinese Chinese Chinese Tokyo Japan"]
In [14]:
X_test = vectorizer.transform(test_doc)
X_test = pd.DataFrame(X_test.toarray(), columns=words)
In [15]:
model.predict(X_test)
Out[15]:
array(['yes'], dtype='<U3')
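
The prediction can be checked by hand. The following sketch recomputes the multinomial naive Bayes scores from the count matrix, assuming Laplace smoothing with alpha = 1.0 (the MultinomialNB default) and class priors estimated from the training labels. It illustrates the calculation and is not scikit-learn's actual implementation; all names are illustrative.

```python
import numpy as np

# Training counts; columns: beijing, changhai, chinese, japan, macao, tokyo
X = np.array([[1, 0, 2, 0, 0, 0],
              [0, 1, 2, 0, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 1, 1, 0, 1]])
labels = ["yes", "yes", "yes", "no"]
y = np.array(labels)

# Counts for "Chinese Chinese Chinese Tokyo Japan"
x_test = np.array([0, 0, 3, 1, 0, 1])

alpha = 1.0  # Laplace smoothing (assumed default)
log_scores = {}
for label in sorted(set(labels)):
    rows = X[y == label]
    # Prior: fraction of training texts that carry this label.
    log_prior = np.log(len(rows) / len(X))
    # Smoothed per-word likelihoods estimated from the class's total counts.
    word_counts = rows.sum(axis=0)
    log_likelihood = np.log((word_counts + alpha)
                            / (word_counts.sum() + alpha * X.shape[1]))
    # Unnormalized log posterior: log prior plus count-weighted log likelihoods.
    log_scores[label] = log_prior + x_test @ log_likelihood

prediction = max(log_scores, key=log_scores.get)
print(log_scores)
print(prediction)
```

Under these assumptions the "yes" score is 3/4 · (3/7)³ · (1/14)² ≈ 0.0003 and the "no" score is 1/4 · (2/9)⁵ ≈ 0.0001, so the three occurrences of "Chinese" outweigh "Tokyo" and "Japan" and the text is labeled "yes".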