Train a model that predicts whether a given article is about China or not.
import pandas as pd
We build a small artificial set of texts in the list corpus. The list classes records, for each text, whether it is about China ("yes") or not ("no").
corpus = ["Chinese Beijing Chinese",
"Chinese Chinese Changhai",
"Chinese Macao",
"Tokyo Japan Chinese"]
classes = ["yes", "yes", "yes", "no"]
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer transforms a collection of texts into a matrix of word counts: one row per text, one column per vocabulary word.
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
CountVectorizer()
X_train = vectorizer.transform(corpus)
X_train
<4x6 sparse matrix of type '<class 'numpy.int64'>' with 9 stored elements in Compressed Sparse Row format>
words = vectorizer.get_feature_names_out().tolist()
words
['beijing', 'changhai', 'chinese', 'japan', 'macao', 'tokyo']
X_train = pd.DataFrame(X_train.toarray(), columns=words)
X_train
| | beijing | changhai | chinese | japan | macao | tokyo |
---|---|---|---|---|---|---|
0 | 1 | 0 | 2 | 0 | 0 | 0 |
1 | 0 | 1 | 2 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 0 | 0 | 1 | 1 | 0 | 1 |
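To make the count matrix above less of a black box, here is a minimal sketch that rebuilds it with the standard library only. It assumes CountVectorizer's default settings (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary); the helper `tokenize` is a hypothetical name introduced for illustration.

```python
import re
from collections import Counter

corpus = ["Chinese Beijing Chinese",
          "Chinese Chinese Changhai",
          "Chinese Macao",
          "Tokyo Japan Chinese"]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens of two or
    # more word characters, mirroring CountVectorizer's default pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary word.
counts = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)
for row in counts:
    print(row)
```

Each printed row should match the corresponding row of the table above.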
Another way to represent text is the TF-IDF representation, which down-weights words that occur in many documents:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(corpus)
TfidfVectorizer()
X_train2 = tfidf_vectorizer.transform(corpus)
X_train2 = pd.DataFrame(X_train2.toarray(), columns=words)
X_train2
| | beijing | changhai | chinese | japan | macao | tokyo |
---|---|---|---|---|---|---|
0 | 0.691835 | 0.000000 | 0.722056 | 0.000000 | 0.000000 | 0.000000 |
1 | 0.000000 | 0.691835 | 0.722056 | 0.000000 | 0.000000 | 0.000000 |
2 | 0.000000 | 0.000000 | 0.462637 | 0.000000 | 0.886548 | 0.000000 |
3 | 0.000000 | 0.000000 | 0.346182 | 0.663385 | 0.000000 | 0.663385 |
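The TF-IDF values above can be checked by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. It starts from the count matrix shown earlier and uses only the standard library.

```python
import math

# Count matrix (columns: beijing, changhai, chinese, japan, macao, tokyo).
counts = [[1, 0, 2, 0, 0, 0],
          [0, 1, 2, 0, 0, 0],
          [0, 0, 1, 0, 1, 0],
          [0, 0, 1, 1, 0, 1]]

n_docs = len(counts)
n_words = len(counts[0])

# Document frequency: in how many documents each word occurs.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]

# Smoothed IDF (assumed default: smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 6) for v in row])
```

The word "chinese" occurs in every document, so its IDF is exactly 1, the lowest possible value; that is why rarer words such as "beijing" or "macao" receive relatively higher weights.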
Multinomial naive Bayes is a standard choice for training text classifiers on word-count features and often performs well on such data.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, classes)
MultinomialNB()
test_doc = ["Chinese Chinese Chinese Tokyo Japan"]
X_test = vectorizer.transform(test_doc)
X_test = pd.DataFrame(X_test.toarray(), columns=words)
model.predict(X_test)
array(['yes'], dtype='<U3')
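To see why the model answers "yes" even though the test text mentions Tokyo and Japan, the prediction can be reproduced by hand. This is a minimal sketch, assuming MultinomialNB's defaults (Laplace smoothing with alpha = 1 and class priors estimated from the training labels); all variable names are chosen for illustration.

```python
import math

vocabulary = ['beijing', 'changhai', 'chinese', 'japan', 'macao', 'tokyo']
counts = [[1, 0, 2, 0, 0, 0],
          [0, 1, 2, 0, 0, 0],
          [0, 0, 1, 0, 1, 0],
          [0, 0, 1, 1, 0, 1]]
classes = ["yes", "yes", "yes", "no"]

# "Chinese Chinese Chinese Tokyo Japan" as word counts.
test_counts = [0, 0, 3, 1, 0, 1]

alpha = 1.0  # assumed default smoothing parameter

log_scores = {}
for label in sorted(set(classes)):
    rows = [row for row, c in zip(counts, classes) if c == label]
    # Class prior: fraction of training documents with this label.
    log_prior = math.log(len(rows) / len(counts))
    # Total occurrences of each word within the class.
    word_totals = [sum(col) for col in zip(*rows)]
    denominator = sum(word_totals) + alpha * len(vocabulary)
    # Sum of log P(word | class), weighted by the word's count in the test text.
    log_likelihood = sum(n * math.log((total + alpha) / denominator)
                         for n, total in zip(test_counts, word_totals))
    log_scores[label] = log_prior + log_likelihood

prediction = max(log_scores, key=log_scores.get)
print(log_scores)
print(prediction)
```

The three occurrences of "chinese", which is far more likely under the "yes" class (6/14 versus 2/9), together with the higher prior for "yes" (3/4), outweigh the evidence from "tokyo" and "japan".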