Text Classification 2

Train a model that classifies news articles.

In [1]:
import pandas as pd
In [2]:
import os

Data

The dataset consists of files containing information about news articles. Each news article corresponds to one file in the dataset, and each file lists the words of that article together with the number of times each one occurs. Every word and its count appear on a separate line. Each file is stored in a directory named after the category of the article it corresponds to.
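
To make the file format concrete, here is a minimal sketch of parsing the contents of one such file. The helper name `parse_word_counts` and the sample words and counts are hypothetical. The sketch assumes that a word and its count are separated by whitespace, as the `read_date` function in this notebook does. Unlike that function, it skips blank or malformed lines instead of raising an error.

```python
import io


def parse_word_counts(lines):
    """Parse 'word count' lines into a {word: count} dictionary."""
    word_counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            # Skip blank or malformed lines instead of raising an error.
            continue
        word, count = parts
        word_counts[word] = int(count)
    return word_counts


# Hypothetical file contents; the words and counts are made up for illustration.
sample_file = io.StringIO("vlada 3\nbudzxet 2\n\nporez 1\n")
sample_counts = parse_word_counts(sample_file)
print(sample_counts)
```

Any iterable of lines works as input, so an open file handle can be passed in place of the `io.StringIO` object.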

We write a function that walks through every directory and every file and, for each article, stores a dictionary of word counts and the class the article belongs to.

In [3]:
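def read_date(root_dir):
    """Read every article under root_dir.

    Returns a list of {word: count} dictionaries and a parallel list of class names.
    """
    corpus = []
    classes = []

    # Each subdirectory of root_dir is named after one article category.
    for class_name in os.listdir(root_dir):
        class_dir = os.path.join(root_dir, class_name)
        if not os.path.isdir(class_dir):
            # Skip stray files that are not category directories.
            continue

        # Each file in a category directory describes one article.
        for file_name in os.listdir(class_dir):
            file_path = os.path.join(class_dir, file_name)

            with open(file_path) as f:
                word_counts = {}

                # Every line holds one word followed by its number of occurrences.
                for line in f:
                    word, count = line.split()
                    word_counts[word] = int(count)

                corpus.append(word_counts)
                classes.append(class_name)

    return corpus, classes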
In [4]:
X_train, y_train = read_date("../data/tekstovi/Trening/")
print(len(X_train))
print(len(y_train))
3492
3492

Preprocessing

The word-count dictionaries need to be converted into a word-frequency matrix.
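
As a small illustration of that conversion, the sketch below runs `DictVectorizer` on two hypothetical word-count dictionaries. The variable names and the words are made up for this example. Each distinct word becomes one column, and a word that was not seen during fitting is dropped when new data is transformed.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical articles given as word-count dictionaries.
toy_corpus = [
    {"gol": 2, "utakmica": 1},
    {"porez": 3, "gol": 1},
]

toy_dv = DictVectorizer()
toy_matrix = toy_dv.fit_transform(toy_corpus)

print(toy_dv.feature_names_)  # one column per distinct word, sorted by name
print(toy_matrix.toarray())   # one row per article, one count per column

# A word that was not present during fitting ("nepoznato") gets no column.
unseen_row = toy_dv.transform([{"nepoznato": 5, "gol": 1}]).toarray()
print(unseen_row)
```

This is why the vectorizer is fitted once on the training data and only `transform` is called on the test data: both matrices then share the same columns.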

In [5]:
from sklearn.feature_extraction import DictVectorizer
In [6]:
dv = DictVectorizer()
dv.fit(X_train)
Out[6]:
DictVectorizer()
In [7]:
dv.feature_names_[:5]
Out[7]:
['ab', 'abasu', 'abati', 'abc', 'abdul']
In [8]:
len(dv.feature_names_)
Out[8]:
36830
In [9]:
X_train = dv.transform(X_train)
In [10]:
X_train.toarray()
Out[10]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [11]:
X_train = pd.DataFrame(X_train.toarray(), columns=dv.feature_names_)
In [12]:
X_train.head()
Out[12]:
ab abasu abati abc abdul abdulah abe aberdin abhaziji abida ... zxurno zxustel zxustrine zxustro zxuticx zxutih zxutilovine zxuto zxutra zxuzxa
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 36830 columns
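
The dense conversion is convenient for inspection, but `toarray()` materializes all 3492 × 36830 cells, which is roughly 1 GB if each cell is an 8-byte float, even though almost all of them are zero. `MultinomialNB` also accepts SciPy sparse matrices, so the output of `DictVectorizer` could be passed to the model directly. The sketch below shows that variant on hypothetical toy data; it is an alternative, not what this notebook does.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy articles and labels, made up for illustration.
toy_docs = [
    {"gol": 2, "utakmica": 1},
    {"porez": 3, "budzxet": 1},
    {"gol": 1, "tim": 2},
    {"porez": 1, "banka": 2},
]
toy_labels = ["Sport", "Ekonomija", "Sport", "Ekonomija"]

sparse_dv = DictVectorizer()
X_sparse = sparse_dv.fit_transform(toy_docs)  # SciPy sparse matrix, never densified

sparse_model = MultinomialNB().fit(X_sparse, toy_labels)

dense_cells = X_sparse.shape[0] * X_sparse.shape[1]
print(X_sparse.nnz, "stored values out of", dense_cells, "cells")

toy_prediction = sparse_model.predict(sparse_dv.transform([{"gol": 1}]))
print(toy_prediction)
```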

Model Training

In [13]:
from sklearn.naive_bayes import MultinomialNB
In [14]:
model = MultinomialNB()
In [15]:
model.fit(X_train, y_train)
Out[15]:
MultinomialNB()
In [16]:
class_names = model.classes_
class_names
Out[16]:
array(['Ekonomija', 'HronikaKriminal', 'KulturaZabava', 'Politika',
       'Sport'], dtype='<U15')

Model Evaluation

In [17]:
from sklearn.metrics import accuracy_score
In [18]:
from sklearn.metrics import confusion_matrix
In [19]:
y_train_pred = model.predict(X_train)
In [20]:
accuracy_score(y_train, y_train_pred)
Out[20]:
0.9401489117983963
In [21]:
X_test, y_test = read_date("../data/tekstovi/Testing/")
X_test = dv.transform(X_test)
X_test = pd.DataFrame(X_test.toarray(), columns=dv.feature_names_)
In [22]:
X_test.head()
Out[22]:
ab abasu abati abc abdul abdulah abe aberdin abhaziji abida ... zxurno zxustel zxustrine zxustro zxuticx zxutih zxutilovine zxuto zxutra zxuzxa
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 36830 columns

In [23]:
y_test_pred = model.predict(X_test)
In [24]:
accuracy_score(y_test, y_test_pred)
Out[24]:
0.8995983935742972
In [25]:
pd.DataFrame(confusion_matrix(y_test, y_test_pred), index=class_names, columns=class_names)
Out[25]:
                 Ekonomija  HronikaKriminal  KulturaZabava  Politika  Sport
Ekonomija              152                0              1        13      0
HronikaKriminal         10              226              6        66      1
KulturaZabava            2                0            301         6      4
Politika                 8               36              9       411      3
Sport                    2                1              1         6    478
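
The confusion matrix also yields per-class scores. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes, so recall is the diagonal divided by the row sums and precision is the diagonal divided by the column sums. The sketch below copies the counts from the output above; the variable names are chosen for this example only.

```python
import numpy as np

cm_labels = ["Ekonomija", "HronikaKriminal", "KulturaZabava", "Politika", "Sport"]

# Counts copied from the confusion matrix above (rows: true, columns: predicted).
cm = np.array([
    [152,   0,   1,  13,   0],
    [ 10, 226,   6,  66,   1],
    [  2,   0, 301,   6,   4],
    [  8,  36,   9, 411,   3],
    [  2,   1,   1,   6, 478],
])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / articles truly in the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / articles predicted as the class
overall_accuracy = cm.diagonal().sum() / cm.sum()

for name, r, p in zip(cm_labels, recall, precision):
    print(f"{name:16s} recall={r:.3f} precision={p:.3f}")
print(f"accuracy={overall_accuracy:.4f}")
```

The weakest class is HronikaKriminal: only 226 of its 309 test articles are recovered, and 66 of the misses are predicted as Politika. The overall accuracy computed this way agrees with the `accuracy_score` result reported earlier.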

Comparison with Other Models

In [26]:
from sklearn.neighbors import KNeighborsClassifier
In [27]:
model = KNeighborsClassifier()
model.fit(X_train,y_train)
Out[27]:
KNeighborsClassifier()
In [28]:
y_test_pred = model.predict(X_test)
In [29]:
accuracy_score(y_test, y_test_pred)
Out[29]:
0.5783132530120482
In [30]:
from sklearn.tree import DecisionTreeClassifier
In [31]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
Out[31]:
DecisionTreeClassifier()
In [32]:
y_test_pred = model.predict(X_test)
In [33]:
accuracy_score(y_test, y_test_pred)
Out[33]:
0.7504302925989673
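
On the test set the three models score about 0.90 (`MultinomialNB`), 0.58 (`KNeighborsClassifier`) and 0.75 (`DecisionTreeClassifier`). The decision tree is built without a fixed `random_state`, so its score may differ slightly between runs. The repeated fit, predict and score cells could be folded into one loop, as in the sketch below. It runs on hypothetical toy data, and `n_neighbors=3` and `random_state=0` are settings chosen for this example, not the ones used in this notebook.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the article word counts.
train_docs = [
    {"gol": 2, "utakmica": 1},
    {"tim": 2, "gol": 1},
    {"utakmica": 2, "tim": 1},
    {"porez": 3, "budzxet": 1},
    {"banka": 2, "porez": 1},
    {"budzxet": 2, "banka": 1},
]
train_labels = ["Sport"] * 3 + ["Ekonomija"] * 3
test_docs = [{"gol": 1, "tim": 1}, {"porez": 1, "banka": 1}]
test_labels = ["Sport", "Ekonomija"]

cmp_dv = DictVectorizer()
X_tr = cmp_dv.fit_transform(train_docs)  # fit the vocabulary on training data only
X_te = cmp_dv.transform(test_docs)

candidates = {
    "MultinomialNB": MultinomialNB(),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),  # small k for the toy set
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),  # fixed seed
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, train_labels)
    scores[name] = accuracy_score(test_labels, clf.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping every candidate in one dictionary ensures that all models are trained and scored on exactly the same matrices.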