Naive Bayes Algorithm

In [1]:
import pandas as pd

Data

In [2]:
df = pd.read_csv("../data/balloons.csv")
In [3]:
df.head()
Out[3]:
color size act age inflated
0 YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH ADULT T
2 YELLOW SMALL STRETCH CHILD F
3 YELLOW SMALL DIP ADULT F
4 YELLOW SMALL DIP CHILD F
In [4]:
df.describe()
Out[4]:
color size act age inflated
count 76 76 76 76 76
unique 2 2 2 2 2
top YELLOW SMALL STRETCH ADULT F
freq 40 40 38 38 41
In [5]:
features = list(df.columns[:-1])
print(features)
['color', 'size', 'act', 'age']

Preprocessing

In [6]:
from sklearn.model_selection import train_test_split
In [7]:
from sklearn.preprocessing import OrdinalEncoder

OrdinalEncoder - converts categorical attributes given as symbols (strings) into numeric values.
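To make the encoding concrete, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and only resembles the balloons columns). With default settings, the encoder sorts the categories of each column and replaces every value with its index in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame, not the balloons dataset itself
toy = pd.DataFrame({
    "color": ["YELLOW", "PURPLE", "YELLOW"],
    "size": ["SMALL", "LARGE", "LARGE"],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ holds the sorted category list learned for each column
print(encoder.categories_)
# Each value is replaced by its index in that sorted list
print(encoded)
```

In this sketch PURPLE and LARGE map to 0.0, while YELLOW and SMALL map to 1.0, because the categories are ordered alphabetically.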

In [8]:
X = df[features]
y = df["inflated"]
In [9]:
print(X.shape)
print(y.shape)
(76, 4)
(76,)

It is always useful to check the dimensions of the features and the class labels to confirm that they match.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=13, stratify=y)
In [11]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(50, 4)
(26, 4)
(50,)
(26,)
In [12]:
oe = OrdinalEncoder()
oe.fit(X_train)
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)
In [13]:
pd.DataFrame(X_train, columns=features).head()
Out[13]:
color size act age
0 1.0 0.0 0.0 0.0
1 1.0 1.0 1.0 1.0
2 0.0 1.0 1.0 1.0
3 1.0 0.0 1.0 0.0
4 1.0 0.0 0.0 0.0

Training

In [14]:
from sklearn.naive_bayes import CategoricalNB

CategoricalNB - the naive Bayes algorithm for categorical attributes.

In [15]:
model = CategoricalNB()
model.fit(X_train, y_train)
Out[15]:
CategoricalNB()
In [16]:
classes = model.classes_
classes
Out[16]:
array(['F', 'T'], dtype='<U1')
In [17]:
model.class_count_
Out[17]:
array([27., 23.])
In [18]:
model.category_count_
Out[18]:
[array([[13., 14.],
        [ 7., 16.]]),
 array([[17., 10.],
        [ 7., 16.]]),
 array([[19.,  8.],
        [ 6., 17.]]),
 array([[ 9., 18.],
        [17.,  6.]])]
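The counts above are what the model turns into probabilities. The sketch below recomputes the class priors and the smoothed conditional probabilities by hand from the numbers in Out[17] and Out[18], assuming the default Laplace smoothing `alpha=1.0` (the model was created as `CategoricalNB()` with no arguments). The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Counts copied from Out[17] and Out[18].
# Rows are classes (F, T); columns are the encoded categories (0, 1).
class_counts = np.array([27., 23.])
category_counts = [
    np.array([[13., 14.], [7., 16.]]),   # color
    np.array([[17., 10.], [7., 16.]]),   # size
    np.array([[19., 8.], [6., 17.]]),    # act
    np.array([[9., 18.], [17., 6.]]),    # age
]
alpha = 1.0  # assumed: the default smoothing parameter of CategoricalNB

# Prior P(class) = class count / number of training rows
priors = class_counts / class_counts.sum()

# Smoothed P(category | class) = (count + alpha) / (class total + alpha * n_categories)
cond_probs = [
    (c + alpha) / (c.sum(axis=1, keepdims=True) + alpha * c.shape[1])
    for c in category_counts
]

print(priors)
print(cond_probs[0])
```

For example, the probability of the first color category given class F is (13 + 1) / (27 + 2) = 14/29. Under the same assumption, `np.exp(model.feature_log_prob_[0])` in the notebook should give the same matrix.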

Model performance on the training set

In [19]:
from sklearn.metrics import confusion_matrix
In [20]:
y_train_pred = model.predict(X_train)
pd.DataFrame(confusion_matrix(y_train, y_train_pred), columns=classes, index=classes)
Out[20]:
F T
F 21 6
T 6 17

Model performance on the test set

In [21]:
y_test_pred = model.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, y_test_pred), columns=classes, index=classes)
Out[21]:
F T
F 12 2
T 2 10
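The two confusion matrices can be summarized with a few numbers. The sketch below copies the matrices from Out[20] and Out[21] and derives accuracy and per-class recall from them. It relies on the scikit-learn convention that rows are true labels and columns are predicted labels; the helper names are illustrative.

```python
import numpy as np

# Confusion matrices copied from Out[20] (training) and Out[21] (test);
# rows are true classes (F, T), columns are predicted classes (F, T)
train_cm = np.array([[21, 6], [6, 17]])
test_cm = np.array([[12, 2], [2, 10]])

def accuracy(cm):
    # Correct predictions lie on the diagonal
    return np.trace(cm) / cm.sum()

def recall_per_class(cm):
    # Fraction of each true class (row) that was predicted correctly
    return np.diag(cm) / cm.sum(axis=1)

print(accuracy(train_cm), recall_per_class(train_cm))
print(accuracy(test_cm), recall_per_class(test_cm))
```

Training accuracy is 38/50 = 0.76 and test accuracy is 22/26, roughly 0.85. With only 26 test rows, the gap between the two is within what a different random split could plausibly produce.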

Pipeline

A pipeline can be used to automate the preprocessing of the input data and the training of the model. When creating a pipeline, we list a sequence of transformations, and the last element of the sequence must be the final estimator, here the classification model.
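As a minimal sketch of the idea, the example below builds a pipeline on a made-up dataset and feeds it raw string values directly. The data and the names `toy_X`, `toy_y`, and `toy_pipe` are hypothetical. The encoder step is fitted and applied inside the pipeline, so no manual `transform` call is needed before `fit` or `predict`.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: the label depends only on "act"
toy_X = pd.DataFrame({
    "color": ["YELLOW", "YELLOW", "PURPLE", "PURPLE"],
    "act": ["STRETCH", "DIP", "STRETCH", "DIP"],
})
toy_y = ["T", "F", "T", "F"]

toy_pipe = Pipeline([
    ("encoder", OrdinalEncoder()),     # transformation step
    ("classifier", CategoricalNB()),   # final estimator
])

# fit() encodes the raw strings and then trains the classifier
toy_pipe.fit(toy_X, toy_y)

# predict() applies the fitted encoder before calling the classifier
new_row = pd.DataFrame({"color": ["PURPLE"], "act": ["STRETCH"]})
pred = toy_pipe.predict(new_row)
print(pred)
```

Keeping the encoder inside the pipeline means it is always fitted on exactly the data passed to `fit`, which avoids accidentally fitting it on test rows.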

In [22]:
from sklearn.pipeline import Pipeline
In [23]:
pipe = Pipeline([("ordinal encoder", OrdinalEncoder()), ("classifier", CategoricalNB())])
In [24]:
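X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=13, stratify=y)  # raw split: X_train/X_test were overwritten with encoded arrays in In [12]
pipe.fit(X_train, y_train)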
Out[24]:
Pipeline(steps=[('ordinal encoder', OrdinalEncoder()),
                ('classifier', CategoricalNB())])
In [25]:
pipe["ordinal encoder"]
Out[25]:
OrdinalEncoder()
In [26]:
y_test_pred = pipe.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, y_test_pred), columns=classes, index=classes)
Out[26]:
F T
F 12 2
T 2 10