"
],
"text/plain": [
" lemma id_text ... 2 archive\n",
"pn id_line ... \n",
"100041 3 6(diš)[]NU P100041 ... NU \n",
" 3 udu[sheep]N P100041 ... N domesticated_animal\n",
" 4 kišib[seal]N P100041 ... N \n",
" 4 Lusuen[0]PN P100041 ... PN \n",
" 5 ki[place]N P100041 ... N \n",
"\n",
"[5 rows x 32 columns]"
]
},
"execution_count": 43,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"for archive in labels.keys():\n",
" data.loc[data.loc[:, 1].str.contains('|'.join([re.escape(x) for x in labels[archive]])), 'archive'] = archive\n",
"\n",
"data.loc[:, 'archive'] = data.loc[:, 'archive'].fillna('')\n",
"\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"id": "bd6d513c",
"metadata": {
"id": "-5so2KJ5bLin"
},
"source": [
"The function `get_set` takes the dataframe rows of a single text and returns a dictionary mapping each word type, such as NU or PN, to the set of its corresponding lemmas."
]
},
{
"cell_type": "markdown",
"id": "5e180b98",
"metadata": {
"id": "IFeAuQmDInDY"
},
"source": [
"### 1.2 Data Structuring"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63afc925",
"metadata": {
"id": "5kTJp996bNaz"
},
"outputs": [],
"source": [
"def get_set(df):\n",
"    d = {}\n",
"\n",
"    # separate seal lines from the main text\n",
"    seals = df[df['label'].str.contains('seal')]\n",
"    df = df[~df['label'].str.contains('seal')]\n",
"\n",
"    # map each word type (column 2) to the set of its lemmas (column 0)\n",
"    for x in df[2].unique():\n",
"        d[x] = set(df.loc[df[2] == x, 0])\n",
"\n",
"    # do the same for the seal lines, nested under 'SEALS'\n",
"    d['SEALS'] = {}\n",
"    for x in seals[2].unique():\n",
"        d['SEALS'][x] = set(seals.loc[seals[2] == x, 0])\n",
"\n",
"    return d"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bff4408f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "dgLTwrRebO-7",
"outputId": "dd3479a1-5d36-4235-8864-e6c92daa070e"
},
"outputs": [
{
"data": {
"text/plain": [
"{'': {''},\n",
" 'MN': {'Šueša'},\n",
" 'N': {'itud', 'maš', 'mu', 'mu.DU', 'udu'},\n",
" 'NU': {'1(diš)', '2(diš)'},\n",
" 'PN': {'Apilatum', 'Ku.ru.ub.er₃', 'Šulgisimti'},\n",
" 'SEALS': {},\n",
" 'SN': {'Šašrum'},\n",
" 'V/i': {'hulu'},\n",
" 'V/t': {'dab'}}"
]
},
"execution_count": 45,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"get_set(data.loc[100271])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb30aa35",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 230
},
"id": "rNidxpgcbQVq",
"outputId": "0abc700c-248e-42b6-99c6-e989458c3dbe"
},
"outputs": [
{
"data": {
"text/plain": [
" archive set\n",
"pn \n",
"100041 {domesticated_animal} {'NU': {'6(diš)'}, 'N': {'ki', 'kišib', 'udu'}...\n",
"100189 {dead_animal} {'NU': {'1(diš)', '5(diš)-kam', '2(diš)'}, 'N'...\n",
"100190 {dead_animal} {'NU': {'3(u)', '1(diš)', '5(diš)', '1(diš)-ka...\n",
"100191 {dead_animal} {'NU': {'1(diš)', '4(diš)', '4(diš)-kam', '2(u...\n",
"100211 {dead_animal} {'NU': {'1(diš)', '1(u)', '1(diš)-kam', '2(diš..."
]
},
"execution_count": 46,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"archives = pd.DataFrame(data.groupby('pn').apply(lambda x: set(x['archive'].unique()) - set(['']))).rename(columns={0: 'archive'})\n",
"archives.loc[:, 'set'] = data.reset_index().groupby('pn').apply(get_set)\n",
"archives.loc[:, 'archive'] = archives.loc[:, 'archive'].apply(lambda x: {'dead_animal'} if 'dead_animal' in x else x)\n",
"archives.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08da24cb",
"metadata": {
"id": "rOKf3s6qbSrG"
},
"outputs": [],
"source": [
"def get_line(row, pos_lst=['N']):\n",
" words = {'pn' : [row.name]} #set p_number\n",
" for pos in pos_lst:\n",
" if pos in row['set']:\n",
" #add word entries for all words of the selected part of speech\n",
" words.update({word: [1] for word in row['set'][pos]})\n",
" return pd.DataFrame(words)"
]
},
{
"cell_type": "markdown",
"id": "c844a607",
"metadata": {
"id": "D-fon0TCLhxA"
},
"source": [
"Each row represents a unique P-number, so the matrix indicates which words are present in each text."
]
},
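{
"cell_type": "code",
"execution_count": null,
"id": "4f21ab90",
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration of the word-presence idea (assumption: this small\n",
"# dataframe is not the project's data; the real matrix is built from\n",
"# `archives` below). Each row is a text, each column a lemma, and a 1\n",
"# marks that the lemma occurs in that text.\n",
"toy = pd.DataFrame({'pn': [1, 1, 2], 'lemma': ['udu', 'ki', 'udu']})\n",
"pd.crosstab(toy['pn'], toy['lemma']).clip(upper=1)"
]
},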
{
"cell_type": "code",
"execution_count": null,
"id": "27c47c98",
"metadata": {
"id": "wgZkM0MCQzu3"
},
"outputs": [],
"source": [
"# Count lemma occurrences per text, pivot into a text-by-lemma matrix,\n",
"# and fill missing combinations with zero\n",
"sparse = words_df.groupby(by=['id_text', 'lemma']).count()\n",
"sparse = sparse['id_word'].unstack('lemma')\n",
"sparse = sparse.fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "978fdb70",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 461
},
"id": "yHn9mTTVMOsy",
"outputId": "6ba8c00f-d3ef-4073-f98f-f26a176aba22"
},
"outputs": [
{
"data": {
"text/plain": [
" ki kišib udu itud mu ... balla šembulug li niŋsaha ensi\n",
"pn ... \n",
"100041 1.0 1.0 1.0 NaN NaN ... NaN NaN NaN NaN NaN\n",
"100189 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"100190 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"100191 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"100211 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"... ... ... ... ... ... ... ... ... .. ... ...\n",
"519650 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN\n",
"519658 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN\n",
"519792 NaN NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"519957 NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN\n",
"519959 1.0 1.0 NaN 1.0 NaN ... NaN NaN NaN NaN NaN\n",
"\n",
"[15139 rows x 1076 columns]"
]
},
"execution_count": 49,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"sparse = pd.concat(archives.apply(get_line, axis=1).values).set_index('pn')\n",
"\n",
"sparse"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aeb77970",
"metadata": {
"id": "Cps3vRT8Xc7f"
},
"outputs": [],
"source": [
"sparse = sparse.fillna(0)\n",
"sparse = sparse.join(archives.loc[:, 'archive'])"
]
},
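{
"cell_type": "code",
"execution_count": null,
"id": "7c3d9e12",
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the PCA step assumed by the plots below (names such as\n",
"# `pca_archive` and `principal_archive_Df` are defined elsewhere in the\n",
"# notebook; this cell is illustrative only and left commented out).\n",
"# from sklearn.decomposition import PCA\n",
"# pca_archive = PCA()\n",
"# components = pca_archive.fit_transform(sparse.drop(columns=['archive']))\n",
"# principal_archive_Df = pd.DataFrame(\n",
"#     components[:, :2],\n",
"#     columns=['principal component 1', 'principal component 2'])"
]
},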
{
"cell_type": "code",
"execution_count": null,
"id": "95e8af2d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 279
},
"id": "j43addrEfqrB",
"outputId": "d73bcf04-e30d-4fb0-dd19-c8f83425eed3"
},
"outputs": [
{
"data": {
"text/html": [
"<!-- figure: plot of pca_archive.explained_variance_ratio_ (image output lost) -->"
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(pca_archive.explained_variance_ratio_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "107b64d4",
"metadata": {
"id": "5Ey9jWsOd3LL"
},
"outputs": [],
"source": [
"known_reindexed = known.reset_index()\n",
"known_reindexed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "632abd1e",
"metadata": {
"id": "vojBl0tgd53L"
},
"outputs": [],
"source": [
"plt.figure(figsize=(10, 10))\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=14)\n",
"plt.xlabel('Principal Component 1',fontsize=20)\n",
"plt.ylabel('Principal Component 2',fontsize=20)\n",
"plt.title(\"Principal Component Analysis of Archives\",fontsize=20)\n",
"targets = ['domesticated_animal', 'wild_animal', 'dead_animal', 'leather_object', 'precious_object', 'wool']\n",
"colors = ['red', 'orange', 'yellow', 'green', 'blue', 'violet']\n",
"for target, color in zip(targets,colors):\n",
" indicesToKeep = known_reindexed.index[known_reindexed['archive_class'] == target].tolist()\n",
" plt.scatter(principal_archive_Df.loc[indicesToKeep, 'principal component 1']\n",
" , principal_archive_Df.loc[indicesToKeep, 'principal component 2'], c = color, s = 50)\n",
"\n",
"plt.legend(targets,prop={'size': 15})\n",
"\n",
"# import seaborn as sns\n",
"# plt.figure(figsize=(16,10))\n",
"# sns.scatterplot(\n",
"# x=\"principal component 1\", y=\"principal component 2\",\n",
"# hue=\"y\",\n",
"# palette=sns.color_palette(\"hls\", 10),\n",
"# data=principal_cifar_Df,\n",
"# legend=\"full\",\n",
"# alpha=0.3\n",
"# )"
]
},
{
"cell_type": "markdown",
"id": "176c95eb",
"metadata": {
"id": "pXxYqfyRLilb"
},
"source": [
"## 2 Simple Modeling Methods"
]
},
{
"cell_type": "markdown",
"id": "36cb6ba4",
"metadata": {
"id": "0NfR51-5M29w"
},
"source": [
"### 2.1 Logistic Regression\n",
"\n",
"Here we will train a logistic regression model to predict archives from the features constructed in the previous subsection."
]
},
{
"cell_type": "markdown",
"id": "1614eadb",
"metadata": {
"id": "pwuS7jLwcReG"
},
"source": [
"#### 2.1.1 Logistic Regression by Archive\n",
"\n",
"Here we will train and test a set of one-vs-all logistic regression classifiers, each of which attempts to classify tablets as belonging to a particular archive or not."
]
},
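{
"cell_type": "code",
"execution_count": null,
"id": "9b5e01ac",
"metadata": {},
"outputs": [],
"source": [
"# The six single-archive fits below can also be written as one loop\n",
"# (a sketch producing the same scores as the individual cells; left\n",
"# commented out to avoid duplicating their output).\n",
"# for target in ['domesticated_animal', 'wild_animal', 'dead_animal',\n",
"#                'leather_object', 'precious_object', 'wool']:\n",
"#     clf = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"#     clf.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, target])\n",
"#     print(target, clf.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, target]))"
]
},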
{
"cell_type": "code",
"execution_count": null,
"id": "c516db73",
"metadata": {
"id": "rGmHTOlvd-Nl"
},
"outputs": [],
"source": [
"clf_da = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_da.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'domesticated_animal'])\n",
"clf_da.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'domesticated_animal'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9536cdff",
"metadata": {
"id": "T1tHDbt8eBFZ"
},
"outputs": [],
"source": [
"clf_wa = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_wa.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wild_animal'])\n",
"clf_wa.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wild_animal'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66f815ff",
"metadata": {
"id": "eyr-ZmqReDZ_"
},
"outputs": [],
"source": [
"clf_dea = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_dea.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'dead_animal'])\n",
"clf_dea.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'dead_animal'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "362a3834",
"metadata": {
"id": "ibPqJEyjeEdJ"
},
"outputs": [],
"source": [
"clf_lo = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_lo.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'leather_object'])\n",
"clf_lo.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'leather_object'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59486eec",
"metadata": {
"id": "P4VI2NUXeFnL"
},
"outputs": [],
"source": [
"clf_po = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_po.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'precious_object'])\n",
"clf_po.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'precious_object'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a87a9db",
"metadata": {
"id": "RVStvp5aeGtD"
},
"outputs": [],
"source": [
"clf_w = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_w.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wool'])\n",
"clf_w.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wool'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0cc28ad5",
"metadata": {
"id": "dL56mGZUTCTx"
},
"outputs": [],
"source": [
"known.loc[:, 'AN.bu.um':'šuʾura']"
]
},
{
"cell_type": "markdown",
"id": "b0929509",
"metadata": {
"id": "bFvtNRjlGtsS"
},
"source": [
"As we can see, the domesticated animal model has the lowest accuracy, while the leather object, precious object, and wool classifiers perform fairly well."
]
},
{
"cell_type": "markdown",
"id": "5616d480",
"metadata": {
"id": "YopZHGeAb9LB"
},
"source": [
"#### 2.1.2 Multinomial Logistic Regression\n",
"\n",
"Here we use multinomial logistic regression, since there are multiple archives into which each text could be classified. We fit the model on the tablets with known archives and then check its score to see how accurate it is.\n",
"\n",
"Finally, we append the Logistic Regression prediction as an archive prediction for the tablets without known archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d55e23b",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "G_Fx9IEBcqof",
"outputId": "56380118-5d77-4c2a-946d-d66d5e152273"
},
"outputs": [
{
"data": {
"text/plain": [
"0.6918291862811029"
]
},
"execution_count": 131,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"clf_archive = LogisticRegression(random_state=42, solver='lbfgs', max_iter=300)\n",
"clf_archive.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"log_reg_score = clf_archive.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"model_weights['LogReg'] = log_reg_score\n",
"log_reg_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6416122f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 529
},
"id": "BQtotAI5csuL",
"outputId": "64c9ab17-0f5b-4aa2-b599-f0a3a601f716"
},
"outputs": [
{
"data": {
"text/html": [
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
{}
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
domesticated_animal
\n",
"
\n",
"
\n",
"
519658
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
...
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
{}
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
domesticated_animal
\n",
"
\n",
"
\n",
"
519957
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
...
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
{}
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
domesticated_animal
\n",
"
\n",
"
\n",
"
519959
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
...
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
{}
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
domesticated_animal
\n",
"
\n",
" \n",
"
\n",
"
3243 rows × 1084 columns
\n",
"
"
],
"text/plain": [
" ki kišib udu ... precious_object wool LogReg Predicted Archive\n",
"pn ... \n",
"100217 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"100229 1.0 0.0 1.0 ... 0.0 0.0 domesticated_animal\n",
"100284 0.0 0.0 1.0 ... 0.0 0.0 domesticated_animal\n",
"100292 1.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"100301 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"... ... ... ... ... ... ... ...\n",
"519647 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519650 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519658 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519957 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519959 1.0 1.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"\n",
"[3243 rows x 1084 columns]"
]
},
"execution_count": 129,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"#Predictions for Unknown\n",
"unknown[\"LogReg Predicted Archive\"] = clf_archive.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe64bfed",
"metadata": {
"id": "D2fSDs6AcvWs"
},
"outputs": [],
"source": [
"known['archive_class'].unique()"
]
},
{
"cell_type": "markdown",
"id": "1d3769c3",
"metadata": {
"id": "gNWP2GzecAim"
},
"source": [
"### 2.2 K Nearest Neighbors\n",
"\n",
"Here we will train our model using k nearest neighbors to predict archives based on the features made in the previous subsection. We are fitting our data onto the tablets with known archives and then checking the score to see how accurate the model is.\n",
"\n",
"We then append the KNN prediction as an archive prediction for the tablets without known archives.\n",
"\n",
"Then, we use different values for K (the number of neighbors we take into consideration when predicting for a tablet) to see how the accuracy changes for different values of K. This can be seen as a form of hyperparameter tuning because we are trying to see which K we should choose to get the highest training accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8de92c39",
"metadata": {
"id": "5iHBQzoQc3_i"
},
"outputs": [],
"source": [
"# takes a long time to run, so avoid re-running; note this scores on the training set itself\n",
"list_k = [3, 5, 7, 9, 11, 13]\n",
"max_k, max_score = 0, 0\n",
"for k in list_k:\n",
" knn = KNeighborsClassifier(n_neighbors=k)\n",
" knn.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
" knn_score = knn.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
" print(\"Accuracy for k = %s: \" %(k), knn_score)\n",
" if max_score <= knn_score:\n",
" max_score = knn_score\n",
" max_k = k\n",
" "
]
},
{
"cell_type": "markdown",
"id": "a7c1116b",
"metadata": {
"id": "FeBx_TVR-Aww"
},
"source": [
"As we can see here, k = 5 and k = 9 give the best training accuracy, which falls roughly in line with the logistic regression training accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3478dc3b",
"metadata": {
"id": "D5m3ZZIrcxun"
},
"outputs": [],
"source": [
"knn = KNeighborsClassifier(n_neighbors=max_k)\n",
"knn.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"knn_score = knn.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"model_weights['KNN'] = knn_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "239e3978",
"metadata": {
"id": "DXC8SgIyc09u"
},
"outputs": [],
"source": [
"#Predictions for Unknown\n",
"unknown[\"KNN Predicted Archive\"] = knn.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "markdown",
"id": "33e11888",
"metadata": {
"id": "Pmiq9-8N9Zys"
},
"source": [
"As we can see in the output from the previous cell, we can get different predictions depending on the classifier we choose."
]
},
{
"cell_type": "markdown",
"id": "4c58e4cb",
"metadata": {
"id": "JTPzHB1U-cXx"
},
"source": [
"Next we will split the data we have on tablets with known archives into a training set and a test set to further understand the training accuracy. For the next two sections, we will use `X_train` and `y_train` to train the models and `X_test` and `y_test` to test them. As the known set was split randomly, we presume that both the training and test set are representative of the whole known set, so the two sets are reasonably comparable."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db472f06",
"metadata": {
"id": "-iAwXCVkc548"
},
"outputs": [],
"source": [
"#Split known into train and test, eventually predict with unknown \n",
"X_train, X_test, y_train, y_test = train_test_split(known.loc[:, 'AN.bu.um':'šuʾura'], \n",
" known.loc[:, 'archive_class'], \n",
" test_size=0.2,random_state=0) "
]
},
{
"cell_type": "markdown",
"id": "af0f7284",
"metadata": {
"id": "kWzkxLtOcFNv"
},
"source": [
"### 2.3 Naive Bayes\n",
"\n",
"Here we will train our model using a Naive Bayes model to predict archives based on the features made in the previous subsection. We make the assumption that the features are independent of each other given the archive, which is where the descriptor _naive_ comes from. So:\n",
"\n",
"$$P(x_i|y; x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i| y)$$\n",
"\n",
"and:\n",
"\n",
"$$P(x_1, x_2, ..., x_n | y) = \\prod_{i=1}^{n} P(x_i | y)$$\n",
"\n",
"Moreover, we will be using Bayes' Law, which in this case states:\n",
"\n",
"$$P(y|x_1, x_2, ..., x_n) = \\frac{P(y)P(x_1, x_2, ..., x_n | y)}{P(x_1, x_2, ..., x_n)}$$\n",
"\n",
"i.e. the probability that a particular tablet (defined by features $x_1, x_2, ..., x_n$) is in archive $y$ is equal to the probability of getting a tablet from archive $y$, times the probability of getting the particular set of features $x_1, x_2, ..., x_n$ given archive $y$, divided by the probability of getting that set of features $x_1, x_2, ..., x_n$.\n",
"\n",
"Applying our assumption of independence from before, we can simplify this to:\n",
"\n",
"$$P(y|x_1, x_2, ..., x_n) = \\frac{P(y)\\prod_{i=1}^{n} P(x_i | y)}{P(x_1, x_2, ..., x_n)}$$\n",
"\n",
"This means the probability that a particular tablet (defined by features $x_1, x_2, ..., x_n$) is in archive $y$ is _proportional_ to the probability of getting a tablet from archive $y$ times the product of the probabilities of getting each feature $x_i$ given archive $y$:\n",
"\n",
"$$P(y|x_1, x_2, ..., x_n) \\propto P(y)\\prod_{i=1}^{n} P(x_i | y)$$\n",
"\n",
"We can then use this to find the archive that maximizes this probability:\n",
"\n",
"$$\\hat{y} = \\underset{y}{argmax} \\; P(y)\\prod_{i=1}^{n} P(x_i | y)$$\n",
"\n",
"We are training two models where the first assumes the features are Gaussian random variables and the second assumes the features are Bernoulli random variables.\n",
"\n",
"We are fitting our data onto the tablets with known archives and then checking the score to see how accurate the model is.\n",
"\n",
"Finally, we append the two Naive Bayes predictions as archive predictions for the tablets without known archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cea5277",
"metadata": {
"id": "tCIlLq7Jc7Hs"
},
"outputs": [],
"source": [
"#Gaussian\n",
"gauss = GaussianNB()\n",
"gauss.fit(X_train, y_train)\n",
"gauss_nb_score = gauss.score(X_test, y_test)\n",
"model_weights['GaussNB'] = gauss_nb_score\n",
"gauss_nb_score"
]
},
{
"cell_type": "markdown",
"id": "a4204922",
"metadata": {
"id": "BkK71TVzErST"
},
"source": [
"We can see that the Gaussian assumption does quite poorly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbf74cda",
"metadata": {
"id": "Ei_I_lWMc9Ar"
},
"outputs": [],
"source": [
"#Predictions for Unknown\n",
"unknown[\"GaussNB Predicted Archive\"] = gauss.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00ac7360",
"metadata": {
"id": "DtJQjvWSc_Ym"
},
"outputs": [],
"source": [
"#Bernoulli\n",
"bern = BernoulliNB()\n",
"bern.fit(X_train, y_train)\n",
"bern_nb_score = bern.score(X_test, y_test)\n",
"model_weights['BernoulliNB'] = bern_nb_score\n",
"bern_nb_score"
]
},
{
"cell_type": "markdown",
"id": "8851e94c",
"metadata": {
"id": "k0ZneRNOEwcg"
},
"source": [
"However, the Bernoulli assumption does quite well."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8fae852",
"metadata": {
"id": "91Gk1c5EdAka"
},
"outputs": [],
"source": [
"#Predictions for Unknown\n",
"unknown[\"BernoulliNB Predicted Archive\"] = bern.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "markdown",
"id": "e5062b38",
"metadata": {
"id": "rlwzgIeicIB3"
},
"source": [
"### 2.4 SVM\n",
"\n",
"Here we will train our model using Support Vector Machines to predict archives based on the features made earlier in this section. We are fitting our data onto the tablets with known archives and then checking the score to see how accurate the model is.\n",
"\n",
"Finally, we append the SVM prediction as an archive prediction for the tablets without known archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95e9a98d",
"metadata": {
"id": "3EZL1pkKdCMq"
},
"outputs": [],
"source": [
"svm_archive = svm.SVC(kernel='linear')\n",
"svm_archive.fit(X_train, y_train)\n",
"y_pred = svm_archive.predict(X_test)\n",
"svm_score = metrics.accuracy_score(y_test, y_pred)\n",
"model_weights['SVM'] = svm_score\n",
"print(\"Accuracy:\", svm_score)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbb8faf0",
"metadata": {
"id": "OBXM8WfldEMn"
},
"outputs": [],
"source": [
"unknown[\"SVM Predicted Archive\"] = svm_archive.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "markdown",
"id": "e806b8b7",
"metadata": {
"id": "ssrCDxoMOhZm"
},
"source": [
"## 3 Complex Modeling Methods"
]
},
{
"cell_type": "markdown",
"id": "f632803b",
"metadata": {
"id": "PbG6QBz36R5R"
},
"source": [
"## 4 Voting Mechanism Between Models\n",
"\n",
"Here we will use the models to determine which archive to assign to each tablet with an unknown archive.\n",
"\n",
"We will then augment `words_df` with these archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06649d31",
"metadata": {
"id": "A_Z48QeD93Jd"
},
"outputs": [],
"source": [
"model_weights"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b64cfde5",
"metadata": {
"id": "RMhkJmIJS1CL"
},
"outputs": [],
"source": [
"def visualize_archives(data, prediction_name):\n",
" archive_counts = data.value_counts()\n",
"\n",
"\n",
" plt.xlabel('Archive Class')\n",
" plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
" plt.title('Frequencies of ' + prediction_name + ' Predicted Archives')\n",
" plt.xticks(rotation=45)\n",
" plt.bar(archive_counts.index, archive_counts);\n",
"\n",
" percent_domesticated_animal = archive_counts['domesticated_animal'] / sum(archive_counts)\n",
"\n",
" print('Percent of texts in Domesticated Animal Archive:', percent_domesticated_animal)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2052a8cf",
"metadata": {
"id": "7-0f5WJTSJqO"
},
"outputs": [],
"source": [
"#Log Reg Predictions\n",
"visualize_archives(unknown['LogReg Predicted Archive'], 'Logistic Regression')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf1b0746",
"metadata": {
"id": "Qgsc2nTVSf4O"
},
"outputs": [],
"source": [
"#KNN Predictions\n",
"visualize_archives(unknown['KNN Predicted Archive'], 'K Nearest Neighbors')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0e301dd",
"metadata": {
"id": "CECF4iExSmjP"
},
"outputs": [],
"source": [
"#Gaussian Naive Bayes Predictions\n",
"visualize_archives(unknown['GaussNB Predicted Archive'], 'Gaussian Naive Bayes')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "50ed6f66",
"metadata": {
"id": "pNSBTgPBSwhj"
},
"outputs": [],
"source": [
"#Bernoulli Naive Bayes Predictions\n",
"visualize_archives(unknown['BernoulliNB Predicted Archive'], 'Bernoulli Naive Bayes')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f9a8b7a",
"metadata": {
"id": "27WesTVUUfDG"
},
"outputs": [],
"source": [
"#SVM Predictions\n",
"visualize_archives(unknown['SVM Predicted Archive'], 'Support Vector Machine')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3783714",
"metadata": {
"id": "0Xz6gARR-wq_"
},
"outputs": [],
"source": [
"def weighted_voting(row):\n",
" votes = {} # create empty voting dictionary\n",
" # tally votes\n",
" for model in row.index:\n",
" model_name = model[:-18] # remove ' Predicted Archive' from column name\n",
" prediction = row[model]\n",
" if prediction not in votes.keys():\n",
" votes[prediction] = model_weights[model_name] # if the prediction isn't in the list of voting categories, add it with a weight equal to the current model weight \n",
" else:\n",
" votes[prediction] += model_weights[model_name] # else, add model weight to the prediction\n",
" return max(votes, key=votes.get) # use the values to get the prediction with the greatest weight"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc0655b8",
"metadata": {
"id": "m6MQH0da-Aq-"
},
"outputs": [],
"source": [
"predicted_archives = unknown.loc[:, 'LogReg Predicted Archive':\n",
" 'SVM Predicted Archive'].copy() # get predictions\n",
"weighted_prediction = predicted_archives.apply(weighted_voting, axis=1) #apply voting mechanism on each row and return 'winning' prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7eece68",
"metadata": {
"id": "Q38uV36MRHmJ"
},
"outputs": [],
"source": [
"weighted_prediction[weighted_prediction != 'domesticated_animal']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "872b24af",
"metadata": {
"id": "PYuISOe9Gstd"
},
"outputs": [],
"source": [
"words_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "318fdedb",
"metadata": {
"id": "Pi4VugZGEsyZ"
},
"outputs": [],
"source": [
"archive_class = pd.concat([known['archive_class'].copy(), weighted_prediction])\n",
"words_df['archive_class'] = words_df.apply(lambda row: archive_class[int(row['id_text'][1:])], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9e491b4",
"metadata": {
"id": "z22PcVk-Mw2o"
},
"outputs": [],
"source": [
"words_df"
]
},
{
"cell_type": "markdown",
"id": "9db15866",
"metadata": {
"id": "lPjcplQAQ8LX"
},
"source": [
"## 5 Sophisticated Naive Bayes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d6f9714",
"metadata": {
"id": "K60UB8wYGM6h"
},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"id": "0f844643",
"metadata": {
"id": "3RvDKrY0UmLb"
},
"source": [
"### 5.1 Feature and Model Creation"
]
},
{
"cell_type": "markdown",
"id": "6fae866b",
"metadata": {
"id": "Q-V9P376rbgB"
},
"source": [
"There are some nouns that are so closely associated with a specific archive that their presence in a text virtually guarantees that the text belongs to that archive. We will use this fact to create a training set for our classification model.\n",
"\n",
"The `labels` dictionary below contains the different archives along with their possible associated nouns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc95e73d",
"metadata": {
"id": "MdQv15BnSJVy"
},
"outputs": [],
"source": [
"labels = dict()\n",
"labels['domesticated_animal'] = ['ox', 'cow', 'sheep', 'goat', 'lamb', '~sheep', 'equid']\n",
"dom = '(' + '|'.join(labels['domesticated_animal']) + ')'\n",
"#split domesticated into large and small - sheep, goat, lamb, ~sheep would be small domesticated animals\n",
"labels['wild_animal'] = ['bear', 'gazelle', 'mountain', 'lion'] # account for 'mountain animal' and plural\n",
"wild = '(' + '|'.join(labels['wild_animal']) + ')'\n",
"labels['dead_animal'] = ['die'] # find 'die' before finding domesticated or wild\n",
"dead = '(' + '|'.join(labels['dead_animal']) + ')'\n",
"labels['leather_object'] = ['boots', 'sandals']\n",
"leath = '(' + '|'.join(labels['leather_object']) + ')'\n",
"labels['precious_object'] = ['copper', 'bronze', 'silver', 'gold']\n",
"prec = '(' + '|'.join(labels['precious_object']) + ')'\n",
"labels['wool'] = ['wool', '~wool', 'hair']\n",
"wool = '(' + '|'.join(labels['wool']) + ')'\n",
"complete = []\n",
"for lemma_list in labels.values():\n",
" complete = complete + lemma_list\n",
"tot = '(' + '|'.join(complete) + ')'\n",
"# labels['queens_archive'] = []"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df320e78",
"metadata": {
"id": "kNHCtQr8XY9v"
},
"outputs": [],
"source": [
"dom_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + dom + '.*\\]')]['id_text'])\n",
"wild_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + wild + '.*\\]')]['id_text'])\n",
"dead_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + dead + '.*\\]')]['id_text'])\n",
"leath_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + leath + '.*\\]')]['id_text'])\n",
"prec_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + prec + '.*\\]')]['id_text'])\n",
"wool_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + wool + '.*\\]')]['id_text'])"
]
},
{
"cell_type": "markdown",
"id": "113a4957",
"metadata": {
"id": "eQRfFFosCTQr"
},
"source": [
"Each row of the `sparse` table below corresponds to one text, and the columns of the table correspond to the words that appear in the texts. Every cell contains the number of times a specific word appears in a certain text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c50679ba",
"metadata": {
"id": "GHyY6IbaRHYX"
},
"outputs": [],
"source": [
"# remove lemmas that are a part of a seal as well as words that are being used to determine training classes\n",
"filter = (~words_df['label'].str.contains('s')) | words_df['lemma'].str.match('.*\\[.*' + tot + '.*\\]')\n",
"sparse = words_df[filter].groupby(by=['id_text', 'lemma']).count()\n",
"sparse = sparse['id_word'].unstack('lemma')\n",
"sparse = sparse.fillna(0)\n",
"\n",
"#cleaning\n",
"del filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa52d78e",
"metadata": {
"id": "4etDIDVgS93i"
},
"outputs": [],
"source": [
"text_length = sparse.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"id": "919684c2",
"metadata": {
"id": "B7_coDiFCiHu"
},
"source": [
"If a text contains a word that is one of the designated nouns in `labels`, it is added to the set to be used for our ML model. Texts that do not contain any of these words or that contain words corresponding to more than one archive are ignored."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf620de3",
"metadata": {
"id": "e85qDgdYbejR"
},
"outputs": [],
"source": [
"class_array = []\n",
"\n",
"for id_text in sparse.index:\n",
" cat = None\n",
" number = 0\n",
" if id_text in dom_tabs:\n",
" number += 1\n",
" cat = 'dom'\n",
" if id_text in wild_tabs:\n",
" number += 1\n",
" cat = 'wild'\n",
" if id_text in dead_tabs:\n",
" number += 1\n",
" cat = 'dead'\n",
" if id_text in prec_tabs:\n",
" number += 1\n",
" cat = 'prec'\n",
" if id_text in wool_tabs:\n",
" number += 1\n",
" cat = 'wool'\n",
" if number == 1:\n",
" class_array.append(cat)\n",
" else:\n",
" class_array.append(None)\n",
"\n",
"class_series = pd.Series(class_array, sparse.index)"
]
},
{
"cell_type": "markdown",
"id": "1e443f84",
"metadata": {
"id": "ExKiUVAmDhB0"
},
"source": [
"Next we remove the texts from `sparse` that we used in the previous cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d196841f",
"metadata": {
"id": "Hj8bgltlEdzM"
},
"outputs": [],
"source": [
"used_cols = []\n",
"\n",
"for col in sparse.columns:\n",
" if re.match('.*\\[.*' + tot + '.*\\]', col):\n",
" used_cols.append(col)\n",
" #elif re.match('.*PN$', col) is None:\n",
" # used_cols.append(col)\n",
"\n",
"sparse = sparse.drop(used_cols, axis=1)"
]
},
{
"cell_type": "markdown",
"id": "53e5f548",
"metadata": {
"id": "ZFhS09xHCskW"
},
"source": [
"Now the `sparse` table will be updated to contain percentages of the frequency that a word appears in the text rather than the raw number of occurrences. This will allow us to better compare frequencies across texts of different lengths."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7745e6fb",
"metadata": {
"id": "ArDjsWRSYqX1"
},
"outputs": [],
"source": [
"for col in sparse.columns:\n",
" if col != 'text_length':\n",
" sparse[col] = sparse[col]/text_length*1000"
]
},
{
"cell_type": "markdown",
"id": "3960c041",
"metadata": {
"id": "Vp0R9mq1DN5F"
},
"source": [
"We must convert percentages from the previous cell into integers for the ML model to work properly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2d7b552",
"metadata": {
"id": "JnzN2SXG5QXJ"
},
"outputs": [],
"source": [
"sparse = sparse.round()\n",
"sparse = sparse.astype(int)"
]
},
{
"cell_type": "markdown",
"id": "2c41d431",
"metadata": {
"id": "doDVY-klDS4f"
},
"source": [
"To form X, we reduce the `sparse` table to only contain texts that were designated for use above in `class_series`. Y consists of the names of the different archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f34c355",
"metadata": {
"id": "l8TINhIO9ZuT"
},
"outputs": [],
"source": [
"X = sparse.loc[class_series.dropna().index]\n",
"X = X.drop(X.loc[X.sum(axis=1) == 0, :].index, axis=0)\n",
"y = class_series[X.index]"
]
},
{
"cell_type": "markdown",
"id": "140979dc",
"metadata": {
"id": "LKvePiz_Dus9"
},
"source": [
"Our data is split into a training set and a test set. The ML model first uses the training set to learn how to predict the archives for the texts. Afterwards, the test set is used to verify how well our ML model works."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2211eb3",
"metadata": {
"id": "o2VLiYkC1JoH"
},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, \n",
" random_state = 9)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6732b73",
"metadata": {
"id": "NrEWHgdxfyP_"
},
"outputs": [],
"source": [
"pipe = Pipeline([\n",
" ('feature_reduction', SelectPercentile(score_func = f_classif)), \n",
" ('weighted_multi_nb', MultinomialNB())\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed7a3756",
"metadata": {
"id": "TwK16rD1obnS"
},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"f = GridSearchCV(pipe, {\n",
" 'feature_reduction__percentile' : [i*10 for i in range(1, 10)],\n",
" 'weighted_multi_nb__alpha' : [i/10 for i in range(1, 10)]\n",
" }, verbose = 0, n_jobs = -1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5bfbfc18",
"metadata": {
"id": "AxpnC26SiNS9"
},
"outputs": [],
"source": [
"f.fit(X_train, y_train);"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3cfe2cfa",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "uShD8bbwkezW",
"outputId": "e47c8212-be72-45f4-ebe0-66a217597586"
},
"outputs": [
{
"data": {
"text/plain": [
"{'feature_reduction__percentile': 70, 'weighted_multi_nb__alpha': 0.1}"
]
},
"execution_count": 117,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.best_params_"
]
},
{
"cell_type": "markdown",
"id": "bebb464e",
"metadata": {
"id": "-fjlG0-vD5nC"
},
"source": [
"Our best score when run on the training set is about 93.6% accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7cc8bf2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "VDDiXkBinXcM",
"outputId": "98476187-bed2-42a6-9782-1948d59fbfd4"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9359404096834265"
]
},
"execution_count": 118,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.best_score_"
]
},
{
"cell_type": "markdown",
"id": "881e2374",
"metadata": {
"id": "OyCwMd9SD_I6"
},
"source": [
"Our best score when run on the test set is very similar to above at 93.2% accuracy, which is good because it suggests that our model isn't overfitted to only work on the training set."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11588239",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "N3hcN7yvSIo-",
"outputId": "9dac856d-e097-455b-b020-90aec50ede0e"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9321229050279329"
]
},
"execution_count": 119,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67a80b76",
"metadata": {
"id": "1AYleoeySIR2"
},
"outputs": [],
"source": [
"predicted = f.predict(sparse)"
]
},
{
"cell_type": "markdown",
"id": "0f2501f9",
"metadata": {
"id": "HdQk5tjvEJJh"
},
"source": [
"The `predicted_df` table is the same as the `sparse` table from above, except that we have added an extra column at the end named `prediction`. `prediction` contains our ML model's classification of which archive the text belongs to based on the frequency of the words that appear."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d89cb636",
"metadata": {
"id": "TKu-QXULG6Gm"
},
"outputs": [],
"source": [
"predicted_df = sparse.copy()\n",
"predicted_df['prediction'] = predicted"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fc2c716",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 504
},
"id": "4hNchgn0HFrJ",
"outputId": "d2cd5ef7-7d84-49a2-bef1-efefd812dc9d"
},
"outputs": [
{
"data": {
"text/plain": [
"lemma $AN[NA]NA $GIR[NA]NA $HI[NA]NA ... Ṭabši[0]SN Ṭahili[0]PN prediction\n",
"id_text ... \n",
"P100041 0 0 0 ... 0 0 dom\n",
"P100189 0 0 0 ... 0 0 dead\n",
"P100190 0 0 0 ... 0 0 dom\n",
"P100191 0 0 0 ... 0 0 dead\n",
"P100211 0 0 0 ... 0 0 dead\n",
"... ... ... ... ... ... ... ...\n",
"P519650 0 0 0 ... 0 0 dom\n",
"P519658 0 0 0 ... 0 0 wild\n",
"P519792 0 0 0 ... 0 0 dom\n",
"P519957 0 0 0 ... 0 0 dom\n",
"P519959 0 0 0 ... 0 0 dead\n",
"\n",
"[15132 rows x 9174 columns]"
]
},
"execution_count": 124,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"predicted_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a31a9350",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "i7ImpZnL3U5C",
"outputId": "6da08322-0707-491a-c4b4-dd067b56a4fc"
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['P100041', 'P100189', 'P100190', 'P100191', 'P100211', 'P100214',\n",
" 'P100215', 'P100217', 'P100218', 'P100219',\n",
" ...\n",
" 'P519534', 'P519613', 'P519623', 'P519624', 'P519647', 'P519650',\n",
" 'P519658', 'P519792', 'P519957', 'P519959'],\n",
" dtype='object', name='id_text', length=15132)"
]
},
"execution_count": 125,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"predicted_df.index"
]
},
{
"cell_type": "markdown",
"id": "6e77a951",
"metadata": {
"id": "p5U8vZ2iOTlZ"
},
"source": [
"### 5.4 Testing the Model on Hand-Classified Data\n",
"\n",
"Here we first test our ML model from before on Niek's hand-classified texts from the wool archive. Testing the model on these tablets gives 82.5% accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d20c460",
"metadata": {
"id": "iCJi2jZZzZeS"
},
"outputs": [],
"source": [
"wool_hand_tabs = set(pd.read_csv('drive/MyDrive/SumerianNetworks/JupyterBook/Outputs/wool_pid.txt',header=None)[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32aebe60",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5YioGFV12lao",
"outputId": "f9ea6d73-5aa9-4f35-ccba-5cfe411b0d48"
},
"outputs": [
{
"data": {
"text/plain": [
"0.8253968253968254"
]
},
"execution_count": 126,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"hand_wool_frame = sparse.loc[wool_hand_tabs].loc[class_series.isna()]\n",
"\n",
"f.score(X = hand_wool_frame, \n",
" y = pd.Series(\n",
" index = hand_wool_frame.index, \n",
" data = ['wool' for i in range(0, hand_wool_frame.shape[0])] ))"
]
},
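{
 "cell_type": "markdown",
 "id": "a1f3c9d0",
 "metadata": {},
 "source": [
  "The score above is plain accuracy (correct predictions over total). As a quick sanity check — a sketch, not part of the original pipeline — an accuracy of 0.8254 is consistent with 52 of the 63 wool tablets being predicted as `wool`."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "a1f3c9d1",
 "metadata": {},
 "outputs": [],
 "source": [
  "# Accuracy is correct predictions over total; 52/63 reproduces the score above.\n",
  "correct, total = 52, 63\n",
  "correct / total"
 ]
},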
{
"cell_type": "markdown",
"id": "fc053bed",
"metadata": {
"id": "1lgQ_iMjG4Jt"
},
"source": [
"Testing our ML model on Niek's 100 random hand-classified tablets, selected from among all the texts (86 remain after dropping tablets without a classification), gives us 87.2% accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4088464f",
"metadata": {
"id": "I4xccyU3VDuQ"
},
"outputs": [],
"source": [
"niek_100_random_tabs = pd.read_pickle('/content/drive/MyDrive/niek_cats').dropna()\n",
"niek_100_random_tabs = niek_100_random_tabs.set_index('pnum')['category_text']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eba4db0a",
"metadata": {
"id": "5skhfKRDkrCB"
},
"outputs": [],
"source": [
"random_frame = sparse.loc[set(niek_100_random_tabs.index)]\n",
"random_frame['result'] = niek_100_random_tabs[random_frame.index]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64e86851",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TRl1Pq1BlAhY",
"outputId": "a18af6f2-6853-4024-a9e9-095ed998cf5e"
},
"outputs": [
{
"data": {
"text/plain": [
"0.872093023255814"
]
},
"execution_count": 188,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X=random_frame.drop(labels='result', axis=1), y = random_frame['result'])"
]
},
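{
 "cell_type": "markdown",
 "id": "b2e4d8c0",
 "metadata": {},
 "source": [
  "Accuracy alone can mask per-class behavior when one class (`dom`) dominates. A confusion matrix shows where the model errs. Below is a self-contained sketch on toy labels, assuming scikit-learn (consistent with the `f.score(X=..., y=...)` API used above); in the notebook you would pass `random_frame['result']` and the model's predictions instead."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "b2e4d8c1",
 "metadata": {},
 "outputs": [],
 "source": [
  "from sklearn.metrics import confusion_matrix\n",
  "\n",
  "# Toy labels standing in for random_frame['result'] (actual) and f.predict(...) (predicted).\n",
  "actual = ['dom', 'dom', 'dead', 'wild', 'dom']\n",
  "predicted = ['dom', 'dead', 'dead', 'wild', 'dom']\n",
  "# Rows are actual classes, columns are predicted classes, in the given label order.\n",
  "confusion_matrix(actual, predicted, labels=['dead', 'dom', 'wild'])"
 ]
},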
{
"cell_type": "markdown",
"id": "f932b71f",
"metadata": {
"id": "viioWwWRG9Wq"
},
"source": [
"A large majority of the tablets are part of the domestic archive and have been classified as such."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73d4b784",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tzDgn3nbl-VH",
"outputId": "b86adeed-15ed-4555-c3ee-919c6726acaa"
},
"outputs": [
{
"data": {
"text/plain": [
"<PandasArray>\n",
"['dead', 'dom', 'dead', 'dom', 'dom', 'dom', 'dead', 'dom', 'dead',\n",
" 'dom', 'dead', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'wild', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'dom', 'dead', 'dom', 'dom', 'dom', 'dom', 'wild',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'wild', 'dom',\n",
" 'dom', 'dom', 'dom', 'wild', 'dom', 'dead', 'dom', 'dead', 'dead',\n",
" 'dom', 'dead', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom']\n",
"Length: 86, dtype: object"
]
},
"execution_count": 190,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"random_frame['result'].array"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40dbac07",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_p29Uif4lpja",
"outputId": "195b73b0-ce3d-4ca0-81ac-9102fdb75830"
},
"outputs": [
{
"data": {
"text/plain": [
"array(['dead', 'dom', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dead', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'wild', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom', 'dom', 'dom',\n",
" 'wild', 'dom', 'dom', 'dom', 'wild', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
"       'dom', 'dom', 'dom', 'dom', 'dom'], dtype='<U4')"
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
"plt.title('Frequencies of Predicted Archive Classes in All Tablets')\n",
"plt.xticks(rotation=45)\n",
"labels = list(set(predicted_df['prediction']))\n",
"counts = [predicted_df.loc[predicted_df['prediction'] == label].shape[0] for label in labels]\n",
"plt.bar(labels, counts);"
]
},
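{
 "cell_type": "markdown",
 "id": "c3f5e9a0",
 "metadata": {},
 "source": [
  "The label/count loop above can also be written with pandas' built-in `value_counts`, which tallies the same frequencies in one call. A small self-contained sketch on a toy column:"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "c3f5e9a1",
 "metadata": {},
 "outputs": [],
 "source": [
  "import pandas as pd\n",
  "\n",
  "# value_counts tallies each distinct value, equivalent to the loop over labels above.\n",
  "# In the notebook: predicted_df['prediction'].value_counts().plot(kind='bar')\n",
  "demo = pd.DataFrame({'prediction': ['dom', 'dead', 'dom', 'wild', 'dom']})\n",
  "demo['prediction'].value_counts().to_dict()"
 ]
},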
{
"cell_type": "markdown",
"id": "5f7604c7",
"metadata": {
"id": "rG726vEEHQ3U"
},
"source": [
"The chart below displays the actual frequencies of the different archives in the test set. As noted previously, the domestic archive clearly dominates, with comparatively few texts in each of the other archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85909ed0",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 310
},
"id": "-tiqpepZG7Og",
"outputId": "13c2afdc-8886-4e38-949e-8124a355c137"
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAbUAAAElCAYAAABjzHyeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3de7xc873/8ddbxKURt5PUIVJRQpsoQVxaTpuiBCWpu1JRKpwfLefQ1qUtWlpK61RdispBqy4tjhStBlVVJReNS1wqJZpESEjEnSY+vz++32HZnb2zUzsz2d95Px+Peew137Vmre+amT3v+X7Xd9ZSRGBmZlaCZZpdATMzs67iUDMzs2I41MzMrBgONTMzK4ZDzczMiuFQMzOzYjjUzLqIpA9JekVSjwZuc0NJkyW9LOkrjdpuI0g6RdLPO5g/RdKwBlaptt07JX2p0du1znGoWcNJmibp9RwAtdtaza7X+xURf4+IlSJiYQM3+zXg9xHROyLOrc7IH/q153ehpDcq909c3A1JukzSaZ1YTpKelPTI4m5jcUTE4Ii4s6vXK2m5HKhPSHo1v1/HSBrQ1duyrudQs2bZLQdA7fZMdaakZZtVsW5mHWBKvRn5Q3+liFgJ+CNwVOX5/u4SrNMngQ8CH5a0RXsL5fBbGj+DfgXsDnweWAXYBJgEbN/MSlnnLI1vKGtRkkLSkZKeAJ7IZZ/N3WsvSrpH0saV5TeVdH/uertG0tW1loSkgyXdXWf96+fp5SWdLenvkp6T9BNJK+Z5wyTNkHSspNmSZkn6YmU9K0r6gaSnJc2XdHcuG5C3sWxebhVJl+bHz5R0Wq1rUtL6kv6QH/+8pGs6eF52z62uF3PX10dz+R3Ap4Hzcutrg8V4rg+R9KikeZJulbROLpekc/J+vyTpIUkbSRoNHAB8LW/r1x2sfhRwI3BLnq5u905Jp0v6E/AaKfgGSxonaW5+LaqtyOUkXZFf4ymShlbWNU3SDpLWyi3/1SvzNs3Pa8+O9rfO87ID8BlgRERMiIgFETE/Is6PiEvrLL+epDskvZC3d6WkVSvzv55f+5clPS5p+1y+paSJ+Tl+TtIPK4/ZOr/XX5T0gCpdrPl9/WRe31OSDujgdWhNEeGbbw29AdOAHeqUBzAOWB1YEdgUmA1sBfQgfUBOA5YHlgOeBv4L6AnsBfwDOC2v62Dg7jrrXz9PnwOMzdvqDfwa+F6eNwxYAHw7r3sX0gfwann++cCdQL9cr0/kOg3I21g2L3cDcBHQi9RyGQ8cnuddBZxE+mK5ArBtO8/VBsCrpA/anqTuxqnAcnn+ncCXOvGcv7McMCKv46PAssA3gHvyvJ1IrZJVAeVl1szzLqs9vx1s5wPAS/k52xN4vlbXSj3+DgzO2+4NzAKOzc9Db2CrvOwpwBt5XT2A7wH31nsfAXcAh1XmnQX8ZFH7W6f+ZwB/WIzncv382iwP9AXuAv4nz9sQmA6sle8PANbL038GvpCnVwK2ztP9gBfyPi+T1/1CXnev/NxumJddExjc7P/npe3W9Ar41nq3/GH0CvBivv1fLg9gu8pyFwLfafPYx4FPkbq4ngFUmXcPnQi1/GH9au0DJs/7OPBUnh4GvE4Op1w2G9g6f9C8DmxSZ78G5G0sC6wBvAmsWJm/P+n4F8AVwMXA2ot4rr4JXFu5vwwwExiW77/zAbuI9VQ/iH8DHNpmna+RujK3A/5a29c267iMRYfagcCc/BysAMwHPtemHt9u85z8pZ11nQLcVrk/CHi9zfuoFmpfAu7I0yKFyScXtb91tnkJcHVnn8s680bW9ie/12YDOwA92yx3F3Aq0KdN+deBn7Upu5X0ha4X6f9lz+r7yrf33tz9aM0yMiJWzbeRlfLplel1gGNzN8yLkl4E+gNr5dvMyP/12dOd3HZfUotiUmW9v83lNS9ExILK/ddI36j7kD6s/7aIbaxDalnNqm
zjIlKLDVKLS8D43K12SDvrWau6XxHxNuk56rfo3eywbj+q1Gturku/iLgDOI/UGp0t6WJJKy/GukeRQnhBRLwBXEebLkje+xr3p+Pn8tnK9GvACqp/vPU64OOS1iR94XmbdBwROtjfOut5gdQC6hRJayh1e8+U9BLwc9J7hIiYChxDCufZebnagKhDSa3wxyRNkPTZSl33bvOe35bUWn4V2Bc4gvS+ulnSRzpb11bhULOlTTWkpgOnV8Jv1Yj4QERcReqy6idJleU/VJl+lRRcAEj698q850mtrcGV9a4SaUDFojxP6hJbbxHLTSe11PpUtrFyRAwGiIhnI+KwiFgLOBy4QPl4XxvPkD7oavshUhDM7ERdO6rb4W2e1xUj4p5ct3MjYnNSy2gD4Kv5cR1e0kPS2qSW3oGSnpX0LKlbeBdJfSqLtn2NP/w+9oVc53nA70gf+p8ntbZq2+lwf9u4Ddgy70tnfJe0Px+LiJVJLdV33pMR8YuI2Jb0GgZwZi5/IiL2J33JORP4laReua4/a1PXXhFxRn7crRHxGVLwPkZqWVqFQ82WZpcAR0jaKg9g6CVpV0m9ScckFgBfkdRT0h7AlpXHPgAMljRE0gqkb8vAO62dS4BzJH0QQFI/STstqkL5sWOAH+YBCj0kfVzS8m2Wm0X6kP2BpJUlLZMHFXwqb2/vygfnPNIH3tt1NnktsKuk7fOgh2NJYVnvA7mzfgKcIGlwrssqkvbO01vk57sn6YvBG5V6PUfHAfQFUtflhsCQfNsAmEHqZqznJmBNSccoDd7pLWmrf3G/fgEcRArSX1TK293ftiLiNtJx3RskbS5p2VynI9ppTfcmdaXPl9SPd78A1H5DuF1+b7xB+iL1dp53oKS++f30Yn7I26SW3m6SdsrvrRWUBi6tnVuFI3L4vZm3W+8909IcarbUioiJwGGk7rB5pIP9B+d5bwF75PtzSd/Qr6889q+kgR63kUZSvmckJOnYxVTg3txtdBvpw7gzjgMeAibkbZ9J/f+lg0gDWh7J9f8V73ZtbQHcJ+kV0oCVoyPiyTrPweOkb/8/JrUSdyP9HOKtTtb1n0TEDbnOV+d9fxjYOc9emRT480jdni+QBl0AXAoMyt1i/1dn1aOAC3Ir9J0bKVTadkHW6vIyaTDEbqSuxidIIzr/FWOBgcCzEfFAJ/e3nr1IIzevIR0TfBgYSnqPtHUqsFle7mYq70HS4JEzSK/bs6RW2Ql53nBgSn79fwTsFxGvR8R00sCWE0nHJqeTgnKZfPtvUut9LunY8n8u6klpNXrvIQmz7kvSZcCMiPhGs+tiZs3hlpqZmRXDoWZmZsVw96OZmRXDLTUzMyuGQ83MzIrhM6E3UZ8+fWLAgAHNroaZWbcyadKk5yOib715DrUmGjBgABMnTmx2NczMuhVJ7Z4Sz92PZmZWDIeamZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWZmZsVwqJmZWTH842sz61YGHH9zs6vQJaadsWuzq1Akt9TMzKwYDjUzMyuGQ83MzIrhUDMzs2I41MzMrBgONTMzK4ZDzczMiuFQMzOzYjjUzMysGA41MzMrhkPNzMyK4VAzM7NiONTMzKwYDjUzMytGy4eapBUkjZf0gKQpkk7N5etKuk/SVEnXSFouly+f70/N8wdU1nVCLn9c0k7N2SMzs9bV8qEGvAlsFxGbAEOA4ZK2Bs4EzomI9YF5wKF5+UOBebn8nLwckgYB+wGDgeHABZJ6NHRPzMxaXMuHWiSv5Ls98y2A7YBf5fLLgZF5ekS+T56/vSTl8qsj4s2IeAqYCmzZgF0wM7Os5UMNQFIPSZOB2cA44G/AixGxIC8yA+iXp/sB0wHy/PnAv1XL6zymuq3RkiZKmjhnzpwlsTtmZi3LoQZExMKIGAKsTWpdfWQJbuviiBgaEUP79u27pDZjZtaSHGoVEfEi8Hvg48CqkpbNs9YGZubpmUB/gDx/FeCFanmdx5
iZWQO0fKhJ6itp1Ty9IvAZ4FFSuO2VFxsF3Jinx+b75Pl3RETk8v3y6Mh1gYHA+MbshZmZASy76EWKtyZweR6puAxwbUTcJOkR4GpJpwF/AS7Ny18K/EzSVGAuacQjETFF0rXAI8AC4MiIWNjgfTEza2ktH2oR8SCwaZ3yJ6kzejEi3gD2bmddpwOnd3Udzcysc1q++9HMzMrhUDMzs2I41MzMrBgONTMzK4ZDzczMiuFQMzOzYjjUzMysGA41MzMrhkPNzMyK4VAzM7NiONTMzKwYDjUzMyuGQ83MzIrhUDMzs2I41MzMrBgONTMzK4ZDzczMiuFQMzOzYjjUzMysGA41MzMrhkPNzMyK4VAzM7NitHyoSeov6feSHpE0RdLRufwUSTMlTc63XSqPOUHSVEmPS9qpUj48l02VdHwz9sfMrJUt2+wKLAUWAMdGxP2SegOTJI3L886JiLOrC0saBOwHDAbWAm6TtEGefT7wGWAGMEHS2Ih4pCF7YWZmDrWImAXMytMvS3oU6NfBQ0YAV0fEm8BTkqYCW+Z5UyPiSQBJV+dlHWpmZg3S8t2PVZIGAJsC9+WioyQ9KGmMpNVyWT9geuVhM3JZe+VttzFa0kRJE+fMmdPFe2Bm1tocapmklYDrgGMi4iXgQmA9YAipJfeDrthORFwcEUMjYmjfvn27YpVmZpa1fPcjgKSepEC7MiKuB4iI5yrzLwFuyndnAv0rD187l9FBuZmZNUDLt9QkCbgUeDQiflgpX7Oy2OeAh/P0WGA/SctLWhcYCIwHJgADJa0raTnSYJKxjdgHMzNL3FKDbYAvAA9JmpzLTgT2lzQECGAacDhAREyRdC1pAMgC4MiIWAgg6SjgVqAHMCYipjRyR8zMWl3Lh1pE3A2ozqxbOnjM6cDpdcpv6ehxZma2ZLV896OZmZXDoWZmZsVwqJmZWTEcamZmVgyHmpmZFcOhZmZmxXComZlZMRxqZmZWDIeamZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWZmZsVwqJmZWTEcamZmVgyHmpmZFcOhZmZmxXComZlZMRxqZmZWDIeamZkVw6FmZmbFcKiZmVkxWj7UJPWX9HtJj0iaIunoXL66pHGSnsh/V8vlknSupKmSHpS0WWVdo/LyT0ga1ax9MjNrVS0fasAC4NiIGARsDRwpaRBwPHB7RAwEbs/3AXYGBubbaOBCSCEInAxsBWwJnFwLQjMza4yWD7WImBUR9+fpl4FHgX7ACODyvNjlwMg8PQK4IpJ7gVUlrQnsBIyLiLkRMQ8YBwxv4K6YmbW8lg+1KkkDgE2B+4A1ImJWnvUssEae7gdMrzxsRi5rr7ztNkZLmihp4pw5c7q0/mZmrc6hlklaCbgOOCYiXqrOi4gAoiu2ExEXR8TQiBjat2/frlilmZllDjVAUk9SoF0ZEdfn4udytyL57+xcPhPoX3n42rmsvXIzM2uQlg81SQIuBR6NiB9WZo0FaiMYRwE3VsoPyqMgtwbm527KW4EdJa2WB4jsmMvMzKxBlm12BZYC2wBfAB6SNDmXnQicAVwr6VDgaWCfPO8WYBdgKvAa8EWAiJgr6TvAhLzctyNibmN2wczMwKFGRNwNqJ3Z29dZPoAj21nXGGBM19XOzMwWR8t3P5qZWTkcamZmVoxFdj9KWgg8VCkaGRHTlliNzMzM/kWdOab2ekQMqTcjjxxURLzdtdUyMzNbfIvd/ShpgKTHJV0BPAz0l/RVSRPyCX5PrSx7kqS/Srpb0lWSjsvld0oamqf7SJqWp3tIOquyrsNz+bD8mF9JekzSlTlQkbSFpHskPSBpvKTeku6SNKRSj7slbfI+niczM+sGOtNSW7Ey1P0p4L9IJ/MdFRH3Stox39+SNIpwrKRPAq8C+wFD8nbuByYtYluHkn73tYWk5YE/SfpdnrcpMBh4BvgTsI2k8cA1wL4RMUHSysDrpN+dHQwcI2kDYIWIeKAT+2pmZt3YYn
c/5vMjPp1P5gvpR8Y7An/J91cihVxv4IaIeC0/bmwntrUjsLGkvfL9VfK63gLGR8SMvK7JwABgPjArIiYA1E5vJemXwDclfRU4BLisE9s2M7Nu7l/9ndqrlWkB34uIi6oLSDqmg8cv4N2uzxXarOvLEfGeM3FIGga8WSlaSAd1j4jXJI0jnVF/H2DzDupiZmaF6Ioh/bcCh+QTAiOpn6QPAncBIyWtKKk3sFvlMdN4N2j2arOu/8znYkTSBpJ6dbDtx4E1JW2Rl+8tqRZ2PwXOBSbkS8GYmVnh3vcZRSLid5I+Cvw5j914BTgwIu6XdA3wAOlkwBMqDzubdAqq0cDNlfKfkroV788DQebw7nXM6m37LUn7Aj+WtCLpeNoOwCsRMUnSS8D/vt99NDOz7kHprE8N2JB0Cilszm7Q9tYC7gQ+srT+5GDo0KExceLEZlfDrFsZcPzNi16oG5h2xq7NrkK3JWlSRAytN6/IM4pIOoh0oc+TltZAMzOzrtewExpHxCkN3NYVwBWN2p6ZmS0dimypmZlZa3KomZlZMRxqZmZWDIeamZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWZmZsVo+VCTNEbSbEkPV8pOkTRT0uR826Uy7wRJU/PVv3eqlA/PZVMlHd/o/TAzM4capAuIDq9Tfk5EDMm3WwAkDSJdzXtwfswFknpI6gGcD+wMDAL2z8uamVkDNezcj0uriLgrX827M0YAV0fEm8BTkqYCW+Z5UyPiSQBJV+dlH+ni6pqZWQfcUmvfUZIezN2Tq+WyfsD0yjIzcll75f9E0mhJEyVNnDNnzpKot5lZy3Ko1XchsB4wBJgF/KCrVhwRF0fE0IgY2rdv365arZmZ4e7HuiLiudq0pEuAm/LdmUD/yqJr5zI6KDczswZxS60OSWtW7n4OqI2MHAvsJ2l5SesCA4HxwARgoKR1JS1HGkwytpF1NjMzt9SQdBUwDOgjaQZwMjBM0hAggGnA4QARMUXStaQBIAuAIyNiYV7PUcCtQA9gTERMafCumJm1vJYPtYjYv07xpR0sfzpwep3yW4BburBqZma2mNz9aGZmxXComZlZMRxqZmZWDIeamZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWZmZsVwqJmZWTEcamZmVgyHmpmZFcOhZmZmxXComZlZMRxqZmZWDIeamZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWZmZsVwqJmZWTEcamZmVoyWDzVJYyTNlvRwpWx1SeMkPZH/rpbLJelcSVMlPShps8pjRuXln5A0qhn7YmbW6lo+1IDLgOFtyo4Hbo+IgcDt+T7AzsDAfBsNXAgpBIGTga2ALYGTa0FoZmaN0/KhFhF3AXPbFI8ALs/TlwMjK+VXRHIvsKqkNYGdgHERMTci5gHj+OegNDOzJazlQ60da0TErDz9LLBGnu4HTK8sNyOXtVf+TySNljRR0sQ5c+Z0ba3NzFqcQ20RIiKA6ML1XRwRQyNiaN++fbtqtWZmhkOtPc/lbkXy39m5fCbQv7Lc2rmsvXIzM2sgh1p9Y4HaCMZRwI2V8oPyKMitgfm5m/JWYEdJq+UBIjvmMjMza6Blm12BZpN0FTAM6CNpBmkU4xnAtZIOBZ4G9smL3wLsAkwFXgO+CBARcyV9B5iQl/t2RLQdfGJmZktYy4daROzfzqzt6ywbwJHtrGcMMKYLq2ZmZovJ3Y9mZlYMh5qZmRXDoWZmZsVwqJmZWTEcamZmVgyHmpmZFcOhZmZmxXComZlZMRxqZmZWDIeamZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWZmZsVwqJmZWTEcamZmVgyHmpmZFcOhZmZmxXComZlZMRxqZmZWDIeamZkVw6HWAUnTJD0kabKkiblsdUnjJD2R/66WyyXpXElTJT0oabPm1t7MrPU41Bbt0xExJCKG5vvHA7dHxEDg9nwfYGdgYL6NBi5seE
3NzFqcQ23xjQAuz9OXAyMr5VdEci+wqqQ1m1FBM7NW5VDrWAC/kzRJ0uhctkZEzMrTzwJr5Ol+wPTKY2fkMjMza5Blm12Bpdy2ETFT0geBcZIeq86MiJAUi7PCHI6jAT70oQ91XU3NzMwttY5ExMz8dzZwA7Al8FytWzH/nZ0Xnwn0rzx87VzWdp0XR8TQiBjat2/fJVl9M7OW41Brh6ReknrXpoEdgYeBscCovNgo4MY8PRY4KI+C3BqYX+mmNDOzBnD3Y/vWAG6QBOl5+kVE/FbSBOBaSYcCTwP75OVvAXYBpgKvAV9sfJXNzFqbQ60dEfEksEmd8heA7euUB3BkA6pmZmbtcPejmZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWZmZsVwqJmZWTEcamZmVgyHmpmZFcOhZmZmxXComZlZMRxqZmZWDIeamZkVw6FmZmbFcKiZmVkxHGpmZlYMh5qZmRXDV74262YGHH9zs6vQZaadsWuzq2CFcUvNzMyK4VAzM7NiONTMzKwYDjUzMyuGB4p0U60+WKDV99/M6nNLrYtJGi7pcUlTJR3f7PqYmbUSh1oXktQDOB/YGRgE7C9pUHNrZWbWOtz92LW2BKZGxJMAkq4GRgCPNLVWZlYEd7svmiJiiay4FUnaCxgeEV/K978AbBURR1WWGQ2Mznc3BB5veEUXTx/g+WZXoklaed+htfff+750Wyci+tab4ZZag0XExcDFza5HZ0maGBFDm12PZmjlfYfW3n/ve/fddx9T61ozgf6V+2vnMjMzawCHWteaAAyUtK6k5YD9gLFNrpOZWctw92MXiogFko4CbgV6AGMiYkqTq/V+dZuu0iWglfcdWnv/ve/dlAeKmJlZMdz9aGZmxXComZlZMRxqZotBkppdBzNrn0PN/iWt9uEuaVtJW0aLHYSWtJakDza7HtYcktZodh0Wl0PNFoukzSR9oNU+3IHNgBslbQqtEer5A+0s4HPd8cPt/ar3GrfC6w5pPyX1AcZL+nyz67M4HGrWaZKGA78EPtbsujSKpGUAIuJc4GrgZ7UWW8kfcJIUEc8BPwOGArtIWqXJ1Wqo2hc3SZ+VdLiklVrly1wkzwNHAadI2rvZdeosh5p1iqQPAWcAB0fEfc2uT6NExNsAko4EVgaeBX4raatSgy0HWu3De3VgfeB7wH6t0BVZfU0lHQJ8F9gJuEnSpiW+5lW1/ZO0TET8GjgG+L6kfZtbs85xqFmHKv/AIl2B4I+5fLn8t2ez6tYokjYifWM9NSJ2AL4B3CBp6xK/uVdaKHsAxwF7AacBnwR2K7nFVg30vJ8CdoyIPYA/At8CNik12Np8oVlfUp+IuAX4AnBGdwg2h5otSq/89xlgLUnHAkTEW5K2B35Y66IrRZ0PrFnAROAtST0j4gLg18CdkjZueAWXEEmfknRWpWhtYEJEvBAR55H2+URgVIkttjaBdizwJ+BrwJcBIuKbwEPADym0C76y//8NXAT8UtL3SOew3R/4Tr76yFKrqA8j61qSdgKulPQN4ADgK8B2ks7Pl9n5AXBbrYuuBG2/qecLv74ErAQcBNT29Q/AbXleKf4KfEnSGfn+fcAHaoNjIuJq0rUBPwa80ZwqLjmV1/0TwBbAvqQQ/0g+/R0R8S3gDmBus+q5pEnaHDgQ2JW0/38nhftE4L+BYyX1bl4NO+ZzP1pdkrYlfSM9hNT1tjFpkMiRpDf6xsAJEfGbNl0W3Vrlg+1I0hXMHyGF1+HAdcB6uWU6FBgREX9vVl27iqRPA70jYqykjwATJS2MiJMk7QPsI2kr4GXgA8BpEVFSmAPvtNA/Rmqh3BcRUyQ9CbwCHCZpxYg4KyJOa2pFu1jt/7fyf7wK8EJEvAb8WdI80jHFT0TETZLujIhXmlrpDjjU7J/kb2F9SS2zZYCPAntGxKt5OP/oyrLFBFqN0oVc9wEOA84Edsx/dwI+DawHnF1CoGWvAE9KWjsiZkjakhRsrw
BfJR1P+TSwKnB0RDzdxLp2qer7N/99MHfBHiZpm4j4k6TfA8uTwn014MVS3vN5MEit96E3qefhXmCepKMi4ryIeEzSs8BA4C7gtSZVt1N8QmN7D0k7Ap8AppJGOz4PbBcRc3N35DbAmRHxahOrucTkQD8AuIZ3u2B+DJwMXBoRFzWxel2u8i19JHA5cGRE/FzSmqTupp9ExHfysr0Kft0PIH1ozwZ+TnrdDyUNDvpjHhC1XMH7Pxr4DGn/7yV1L38C+HdSd+txwPCIeKpplewkH1Ozd0jaDNgduD0ifg78HzAZCEn/QTqGNr6kf+y2g0Ii4uWI+Alp+P7OwH4RcTMwB9hd0uoljHyr7UMOtJWBIaTu5W9J+mJEzCJ1sZ4g6Tv5YUv1N/TFoXSmlBXz9JdJg0HmARuSLh11K3AZcLakj0fEP0p631dJ2o10rOwM0vGzDUm9M+fl+2sAe3SHQAN3P1qWP+R+CvyDNKJRwKXAnsBvgReBE3OfejFdjpVjaEcB6wKrkf65nwOWA/5d0mdJ+390RBQxQKCy34Mi4hFJM4BpwFXAJZKIiP+VtC4p4CnlNZfUDzgeeFjSFcAA0mt7X55/IvD9iPhSHtZf7NXrJW1Det//KCImSXoM2I70f/9sRHy9u/2/u6VmtUEhw0ndbKsDu+UzCvwlIr5BOqa0Zx5I0K3e4O3J39Q/kKePBEYC5wObAF+OiPnAeFLAfZX0Ifd8s+q7JEj6OPAbSYeTfoP1n6Qg3x04R9IBEfFcRDzRzHouAc8Ak0jdjQcAg4FPVebfRP5sjIjzCzp22vaH5SuT9n8r4GBJG0fEq/kH12sBG0D3+zLjllqLy8OXLwHuB2aQutlOkvR2RPwYIH/Ak6e71Ru8njbf1McAK5J+g3MQ6Ywhx+UD6CfmLqrlqs9BCZR+PD+d1Ao5nPRbvHtIXcyfIh1fmde0Ci4hlWOIywCD8u1+4MuS5kbET0kjIAdIWhWYX8J7vqZOC30SqaU2HvgvSVeRRriuQnpPdDseKNLC8ii3M0lD8++VtD556C5ptNslEXFyM+u4JORvq6OAjUhD9rcD+pP+iQ+MiAX5OMs/gItK+lCDd1pow4FrgdeB/wGuB3qShrOfGhGnNq+GS/KMDW0AAAV2SURBVFYeFHIc8EXSYJDnSSM79yS10j4J7BsRU5pWySUov/5Xk07/dQfpMMOfSAODjiCNgPx2RDzQtEq+D+5+bG2rkP6Bt8v3nya11v5GGuU4rkn1WmIq3ae1b+p7k/Z3MHBXDrSDSV1xt5cWaNn0fLscGAbcDLwUEZeQfsbw8+ZVrSE2BH4REZOBY4H5pO6284DvA8MKDrS2LfRBpB/Z7wo8QDrU8PnuGmjgUGtpETEO2AM4RNL+EfEP0oCIzwJzI+LuEkb6VeWupwNIo91OJAXaQuAK4BhJF5I+2Pcq8FgSABExI3ezHUo6r+OBpG/tRMSlEfG3ZtavAe4HtpE0OCLeioj/AT4M/BvpfV/UsdOa3EI7ifRl9kDSF9jVgCdIvRafB5aJiDebVsku4GNqLS4ibpT0Nul0WHuSTgN1Su0YUqEtlXe+qSud4+7/kf6pLyK1XhZExIvNrGAjRMQDuVW6PXC0pAERMa25tWqIO0mnwfq8pDtIx1TnA+dGgWdKqai20C/g3Rb69ZIWAndGxMJmVrAr+JiaASBpd+DbwJURcVb1d0zNrVnXyz80Phg4qdbNJGkC8HsKPQXUoiidqPkfza5Ho0hai9RLsQewADguIh5sbq0aQ9ImpEsJ9Qb6RsRHmlylLuVQs3fks4mMAb4SEdc3uz5LSh7V9tV8t/ZN/RjgoIh4pmkVs4aT1Iv0ObjUnstwSVC6ysL2wNGkEwxMa26Nuo5Dzd5D0meAv0XEk82uy5LUyt/UzWpKbKE71Kylteo3dbNSOdTMzKwYHtJvZmbFcKiZmVkxHGpmZlYMh5qZmRXDoWa2lJI0UlJIWuwfx0qaJqlPnfLdJR3fRfXbWd
JESY9I+oukH+TyUyQd1xXbMFtcDjWzpdf+wN357z+RtNinuYuIsRFxxvutmKSNSCcAPjAiBpGukj31/a7X7P1yqJkthSStBGxLOunwfpXyYZL+KGks8IikHpLOlvSwpAfzJXNqvizpfkkP1Vp7kg6WdJ6kVSQ9na8rhqRekqZL6ilpPUm/lTQpb6teS/FrwOkR8RhARCyMiAvr7MdhkiZIekDSdZULs+6d6/yApLty2WBJ4yVNzvsysEueTGspDjWzpdMI4LcR8VfgBUmbV+ZtBhwdERsAo4EBwJCI2Bi4srLc8xGxGXAh6fph78gnrJ7Mu1d8/ixwaz67xMWkq39vnh93QZ36bUS6evSiXB8RW0TEJsCjpJAG+BawUy7fPZcdAfwoIoaQWn4zOrF+s/dwqJktnfYnXciR/LfaBTk+Ip7K0zuQLmS6ACAi5laWq52/cxIp+Nq6Btg3T+8HXJNbiJ8AfilpMunKBWu+j/3YKLf2HgIOIF23DtJFKS+TdBjQI5f9GThR0teBdSLi9fexXWtRvvSM2VJG0uqkC7d+TFKQPvRDUu0kzK92clW162ItpP7/+ljgu3l7m5NO7twLeDG3ljoyJT9mUReTvAwYWbnMzTCAiDhC0laki1NOkrR5RPxCUu2ClbdIOjwi7ljE+s3ewy01s6XPXsDPImKdiBgQEf2Bp4D/qLPsOODw2qCRHFCdks93OQH4EXBTPi72EvCUpL3z+pQvVdLWWaRW1QZ5uWUkHVFnud7ALEk9SS018vLrRcR9EfEtYA7QX9KHgScj4lzgRmDjzu6LWY1DzWzpsz9wQ5uy66g/CvKnwN+BByU9QLp68eK4hnQV5GsqZQcAh+b1TSEd33uPfEWDY4CrJD0KPEy6enRb3wTuI3U3PlYpPysPYHkYuIfU4tsHeDh3e25Euhq52WLxCY3NzKwYbqmZmVkxHGpmZlYMh5qZmRXDoWZmZsVwqJmZWTEcamZmVgyHmpmZFcOhZmZmxfj/g3lPj2JMgb8AAAAASUVORK5CYII=\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
"plt.title('Frequencies of Test Archive Classes')\n",
"plt.xticks(rotation=45)\n",
"test_counts = [(class_series[X_test.index])[class_series == label].count() for label in labels]\n",
"plt.bar(labels, np.asarray(test_counts));"
]
},
{
"cell_type": "markdown",
"id": "8d0f0054",
"metadata": {
"id": "kYdnsxgYHW8a"
},
"source": [
"Below is a chart of the archive frequencies our ML model predicts for the test set. The predicted distribution closely matches the actual distribution above, which is reassuring."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56ad4019",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 310
},
"id": "v6b1AStxRIW7",
"outputId": "59bc33f3-94cd-47b7-8f83-ef8a76fb7d03"
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAbUAAAElCAYAAABjzHyeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3debxd093H8c9XxBRBVKoyVDwERQliaGmbUgQlqVmDGCr0iaKlrdKWtjzV0nqqhqJS1Kyl8qA0piotGTRBDHURkggSScxF4vf8sdZhu869ube595zcfb7v1+u8zj5rT2vts8/+7bX2OnsrIjAzMyuDpeqdATMzs47ioGZmZqXhoGZmZqXhoGZmZqXhoGZmZqXhoGZmZqXhoGalI+mTkl6X1K2G61xP0mRJr0k6ulbrzes+RdLlebhmZZc0TdKXOns9tSbpYEn3tjL+z5JG1jJPeb2XSDq11uvtahzUGlg+KL2VD4KVV59652txRcRzEbFiRCys4Wq/A9wVET0j4uzmIyXdLenfeRvPkXS9pDU6OhNtLbukIZJmdPT687L/XNif3pX0TuHzb/6D5b0ftNsw7d2S5klatv05b5uI2DkiLu3o5So5WtIjkt6QNEPSdZI+3dHrKjMHNdstHwQrr+eLIyUtXa+MdTFrAlMXMc1REbEisC6wCnBW8wnKsL3zQX/FXNYrgJ8X9q8jO2u9kgYAnwMC2H0R09asFt8OvwKOAY4GViXtJ38Cdq1nproaBzX7CEkhabSkJ4Enc9qXc/PafEl/l7RxYfpNJT2Ym96ukXR1pZmkWlNOXv46eXhZSWdKek7Si5J+I2n5PG5IPls9TtJLkmZJOqSwnOUl/ULSs5JekXRvThuQ17F0nm5lSRfn+WdKOrVyUJO0jqS/5vnnSLqmle2yu6SpeRvcLelTOf1O4IvAObk2sm5r2zci5gJ/BDbK80+T9F1JDwFvSFpa0tZ5O8+XNEXSkEI+1sp5fk3SOGC1wrjmZV9V0u8kPZ9rMH+S1AP4M9CnWEOXtJSkEyQ9JellSddKWrWw7APztn5Z0kmtlbGVbdjafvTd/P28JukJSdtLGgqcCOyb8zmllcUfBNwPXAJ8qHlQqenufEm3SHoD+KKk/ko15tm5TOc0m+fMvM2ekbRzIf1uSV/L++58SRsVxvVWav34+KLK22xdA4HRwP4RcWdEvB0Rb0bEFRFxepXpe0m6Ked9Xh7uVxh/sKSn87Z8RtKInN7i/i5pfUnjJM3N23+fwrhdJD2alzdT0vGtfA/1FRF+NegLmAZ8qUp6AONIZ4vLA5sCLwFbAd1IB4xpwLLAMsCzwDeB7sBewLvAqXlZBwP3Vln+Onn4LGBsXldP4P+An+ZxQ4AFwI/zsncB3gR65fHnAncDfXO+PpvzNCCvY+k83Q3ABUAP4OPAeOCIPO4q4CTSCd5ywLYtbKt1gTeAHXJevgM0Acvk8XcDX2tlW78/nhSE7gR+X/geJgP98/buC7ycy7tUXufLQO88/T+AX+ayfh54Dbg8j2te9puBa4BeOd9fKGzbGc3yeAwpKPTLy74AuCqP2wB4Pa9v2bz+BVTZf5ot85LCvtDafrQeMB3oUyjH2nn4lEr5FrGuJuC/gc1J++DqzfLxCrBN3qY9gCmk/a9H8bsn7bPvAofnfH4deB5Qle9yDHBaYT2jgVsXVd4qeT8SeLYd2/JjwJ7ACqTfzXXAn/K4HsCrwHr58xrAhq3t73me6cAhwNI573OADfL4WcDn8nAvYLN6H79a3E71zoBfdfzy0w/sdWB+flV+FAFsV5jufOAnzeZ9AvgC6SD3/g8+j/s7bQhqgEiBYu3CuM8Az+ThIcBb5AN0TnsJ2Dr/KN8CNqlSrgF5HUsDqwNvA8sXxu9Puv4FcBlwIdBvEdvqB8C1hc9LATOBIfnz+we6Fua/mxSQ5+f5ruCDIDUNOLQw7XfJAa+QdhvpoP
hJUjDpURh3JVWCWj6YvUc+CWi2vCF8NKg9Bmxf+LwG6eC+NPBD4OrCuB7AO7QvqLW2H62Tv9svAd2bTXMKiwhqwLY5r6vlz48D32yWj8ua7Wezi/tWYdzBQFPh8wp5m36i+Xed8/tUYdr7gIMWVd4q6zwJuL+t27LKuEHAvMJ3M58U9JZvNl3V/R3YF/hbs7QLgJPz8HPAEcBKreVxSXi5+dGGR8Qq+TW8kD69MLwmcFxuQpkvaT6pVtEnv2ZG3vOzZ9u47t6kA8akwnJvzekVL0fEgsLnN4EVSbWd5YCnFrGONUk1lFmFdVxAqrFBqnEJGJ+bFg9tYTl9iuWKiPdI26jvoov5vqPzdu4bESMiYnZhXPPtvXez7b0tKcj0IR283ihM39L27g/MjYh5bczfmsANhXU+BiwknRj0KeYxr//lNi63uPyq+1FENAHHkgLYS0pN2O3ptDQS+EtEzMmfr6RZEyQf3sb9STWjBVT3QmUgIt7MgytWme4uYAVJWyld0xtEahmA1n83zb1M+n7bRNIKki7IzcGvAvcAq0jqlr+bfUm1v1mSbpa0fp61pf19TWCrZnkdAXwij9+T1HLwbG6+/Exb81prXf6itHWaYpCaTmpiOa35RJK+APSVpEJg+yQfBJs3SIGrMv0nCrPPIdW2NoyIme3M3xzg38DapGaklkwn1dRWq3YAi4gXSM1MSNoWuF3SPfkgW/Q88H4vNEkiHaDam++WNN/ev4+Iw5tPJGlNoJekHoXA9slm8xeXs6qkVSJifivrK05/aETcV2W9s4BPFT6vQGoCa48W9yOAiLgSuFLSSqQTj58BB7aQ12Lelgf2AbpJqgSjZUkH+U0iorJ/NN/Gn5S0dCuBbZEiYqGka0m1/xeBmyLitcI6WixvM3cA50oaHBET2zD9caQm260i4gVJg4B/kgIWEXEbcFveNqcCF5GaD6vu7zmvf42IHVoo5wRgmKTuwFHAtaT9f4njmpq1xUXAkflsVJJ6SNpVUk/S9Z0FwNGSukvaA9iyMO8UYENJgyQtRzoTB96v7VwEnFW4sN5X0k6LylCedwzwS6VODt0kfUbNunJHxCzgL8AvJK2k1Bli7RyMkbR34QL7PNKB770qq7wW2FWp80J30kHlbVJTa0e7HNhN0k65XMspdZrpFxHPAhOBH0laJh+Ydqu2kFz2PwPn5Y4F3SV9Po9+EfiYpJULs/wGOC0Hzkqnh2F53B+AL0vaVtIypOuc7T1+tLgfKf3Pb7v8/f2bdLJT+R5eBAZIaml9w0k1yg1INaVBpAD8N1LnkWrGk64TnZ7zsZykbdpZnoorSTWjEXl4keVtvoCIeBI4D7gqf9fL5DztJ+mEKuvsSdpG85U685xcGSFpdUnDlDoEvU26xPBeHtfS/n4TsK5SZ6Du+bWFpE/lvIyQtHJEvEu6XlftN7JEcFCzRcpnjocD55B+CE2k6w5ExDvAHvnzXNKP+/rCvP8iHQBvJ/WkbP6n1u/m5d2fm1FuJ52BtsXxwMPAhLzun1F9nz6I1KHl0Zz/P/BBU88WwAOSXid1WDkmIp6usg2eAA4Afk2qJe5G+jvEO23Ma5tFxHRgGKnX32zSWfS3+aBsXyV1PphLOphd1sriDiRda3qcdM3q2LyOx0mdBp7OzU19SF3KxwJ/kfQaqdPIVnn6qaROEFeSgsE8oF3/c2ttPyLVrE4nbdsXSM3D38vjrsvvL0t6sMqiRwK/i/QfvRcqr7yeEaryN4lI/+PbjXQt77lcln3bU57Csh4gtUj0IZ1EtKW81Rydpz2XdE3sKeArpM5Tzf0vqVPRHNL3dGth3FLAt0itC3NJ1yy/nsdV3d9z7XJHYL883wuk31PlJPFAYFr+jR5JCuBLpEpvHrMOI+kSUieE79c7L2bWWFxTMzOz0nBQMzOz0nDzo5mZlYZramZmVhoOamZmVhr+83UdrbbaajFgwIB6Z8PMrEuZNGnSnIjoXW2cg1odDRgwgIkT23
LzADMzq5DU4q343PxoZmal4aBmZmal4aBmZmal4aBmZmal4aBmZmal4aBmZmal4aBmZmal4aBmZmal4T9fm1mXMuCEm+udhQ4x7fRd652FUnJNzczMSsNBzczMSsNBzczMSsNBzczMSsNBzczMSqPhg5qk5SSNlzRF0lRJP8rpa0l6QFKTpGskLZPTl82fm/L4AYVlfS+nPyFpp/qUyMyscTV8UAPeBraLiE2AQcBQSVsDPwPOioh1gHnAYXn6w4B5Of2sPB2SNgD2AzYEhgLnSepW05KYmTW4hg9qkbyeP3bPrwC2A/6Q0y8FhufhYfkzefz2kpTTr46ItyPiGaAJ2LIGRTAzs6zhgxqApG6SJgMvAeOAp4D5EbEgTzID6JuH+wLTAfL4V4CPFdOrzGNmZjXgoAZExMKIGAT0I9Wu1u+sdUkaJWmipImzZ8/urNWYmTUkB7WCiJgP3AV8BlhFUuU2Yv2AmXl4JtAfII9fGXi5mF5lnuI6LoyIwRExuHfv3p1SDjOzRtXwQU1Sb0mr5OHlgR2Ax0jBba882Ujgxjw8Nn8mj78zIiKn75d7R64FDATG16YUZmYGvqExwBrApbmn4lLAtRFxk6RHgaslnQr8E7g4T38x8HtJTcBcUo9HImKqpGuBR4EFwOiIWFjjspiZNbSGD2oR8RCwaZX0p6nSezEi/g3s3cKyTgNO6+g8mplZ2zR886OZmZWHg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZWGg5qZmZVGwwc1Sf0l3SXpUUlTJR2T00+RNFPS5PzapTDP9yQ1SXpC0k6F9KE5rUnSCfUoj5lZI1u63hlYAiwAjouIByX1BCZJGpfHnRURZxYnlrQBsB+wIdAHuF3Sunn0ucAOwAxggqSxEfFoTUphZmYOahExC5iVh1+T9BjQt5VZhgFXR8TbwDOSmoAt87imiHgaQNLVeVoHNTOzGmn45sciSQOATYEHctJRkh6SNEZSr5zWF5hemG1GTmsp3czMasRBLZO0IvBH4NiIeBU4H1gbGESqyf2ig9YzStJESRNnz57dEYs0M7PMQQ2Q1J0U0K6IiOsBIuLFiFgYEe8BF/FBE+NMoH9h9n45raX0D4mICyNicEQM7t27d8cXxsysgTV8UJMk4GLgsYj4ZSF9jcJkXwEeycNjgf0kLStpLWAgMB6YAAyUtJakZUidScbWogxmZpY0fEcRYBvgQOBhSZNz2onA/pIGAQFMA44AiIipkq4ldQBZAIyOiIUAko4CbgO6AWMiYmotC2Jm1ugaPqhFxL2Aqoy6pZV5TgNOq5J+S2vzmZlZ52r45kczMysPBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMysNBzUzMyuNhg9qkvpLukvSo5KmSjomp68qaZykJ/N7r5wuSWdLapL0kKTNCssamad/UtLIepXJzKxRNXxQAxYAx0XEBsDWwGhJGwAnAHdExEDgjvwZYGdgYH6NAs6HFASBk4GtgC2BkyuB0MzMaqPhg1pEzIqIB/Pwa8BjQF9gGHBpnuxSYHgeHgZcFsn9wCqS1gB2AsZFxNyImAeMA4bWsChmZg2v4YNakaQBwKbAA8DqETErj3oBWD0P9wWmF2abkdNaSjczsxpxUMskrQj8ETg2Il4tjouIAKKD1jNK0kRJE2fPnt0RizQzs8xBDZDUnRTQroiI63Pyi7lZkfz+Uk6fCfQvzN
4vp7WU/iERcWFEDI6Iwb179+7YgpiZNbiGD2qSBFwMPBYRvyyMGgtUejCOBG4spB+Ue0FuDbySmylvA3aU1Ct3ENkxp5mZWY0sXe8MLAG2AQ4EHpY0OaedCJwOXCvpMOBZYJ887hZgF6AJeBM4BCAi5kr6CTAhT/fjiJhbmyKYmRk4qBER9wJqYfT2VaYPYHQLyxoDjOm43JmZWXs0fPOjmZmVh4OamZmVhoOamZmVhoOamZmVhoOamZmVhoOamZmVhoOamZmVhoOamZmVxiL/fC1pIfBwIWl4REzrtByZmZn9h9pyR5G3ImJQtRH5vomKiPc6NltmZmbt1+7mR0kDJD0h6TLgEaC/pG9LmiDpIUk/Kkx7kqR/SbpX0lWSjs/pd0sanIdXkzQtD3eTdEZhWUfk9CF5nj9IelzSFTmgImkLSX+XNEXSeEk9Jd0jaVAhH/dK2mQxtpOZmXUBbampLV+40e8zwDeBgcDIiLhf0o7585akeyiOlfR54A1gP2BQXs+DwKRFrOsw0l3vt5C0LHCfpL/kcZsCGwLPA/cB20gaD1wD7BsREyStBLxFuuv+wcCxktYFlouIKW0oq5mZdWHtbn7MT4d+NiLuz0k75tc/8+cVSUGuJ3BDRLyZ5xvbhnXtCGwsaa/8eeW8rHeA8RExIy9rMjAAeAWYFRETACoP95R0HfADSd8GDgUuacO6zcysi/tP79L/RmFYwE8j4oLiBJKObWX+BXzQ9Llcs2V9IyI+9BwySUOAtwtJC2kl7xHxpqRxwDDSI2M2byUvZmZWEh3Rpf824FBJKwJI6ivp48A9wHBJy0vqCexWmGcaHwSavZot6+v5SdRIWldSj1bW/QSwhqQt8vQ9JVWC3W+Bs4EJETFvsUpoZmZdwmI/Ty0i/iLpU8A/ct+N14EDIuJBSdcAU4CX+ODhmQBnkh7AOQq4uZD+W1Kz4oO5I8hsYHgr635H0r7AryUtT7qe9iXg9YiYJOlV4HeLW0YzM+salJ55WYMVSaeQgs2ZNVpfH+BuYP0l9S8HgwcPjokTJ9Y7G2ZdyoATbl70RF3AtNN3rXcWuixJkyJicLVxpbyjiKSDgAeAk5bUgGZmZh1vsZsf2yoiTqnhui4DLqvV+szMbMlQypqamZk1Jgc1MzMrDQc1MzMrDQc1MzMrDQc1MzMrjYYPapLGSHpJ0iOFtFMkzZQ0Ob92KYz7nqSm/KSCnQrpQ3Nak6QTal0OMzNzUIN0s+OhVdLPiohB+XULgKQNSE8e2DDPc15+XE434FxgZ2ADYP88rZmZ1VDN/qe2pIqIe/KTB9piGHB1RLwNPCOpifTIHYCmiHgaQNLVedpHOzi7ZmbWCtfUWnZUflDpGEm9clpfYHphmhk5raX0j5A0StJESRNnz57dGfk2M2tYDmrVnQ+sTXrA6SzgFx214Ii4MCIGR8Tg3r17d9RizcwMNz9WFREvVoYlXQTclD/OBPoXJu2X02gl3czMasQ1tSokrVH4+BWg0jNyLLCfpGUlrUV6Kvd40mN1BkpaS9IypM4kbXnSt5mZdaCGr6lJugoYAqwmaQZwMjBE0iAgSA80PQIgIqZKupbUAWQBMDoiFublHEV6yGk3YExETK1xUczMGl7DB7WI2L9K8sWtTH8acFqV9FuAWzowa2Zm1k5ufjQzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9JwUDMzs9Jo+KAmaYyklyQ9UkhbVdI4SU/m9145XZLOltQk6SFJmxXmGZmnf1LSyHqUxcys0TV8UAMuAYY2SzsBuCMiBgJ35M8AOwMD82sUcD6kIAicDGwFbAmcXAmEZmZWOw0f1CLiHmBus+RhwKV5+F
JgeCH9skjuB1aRtAawEzAuIuZGxDxgHB8NlGZm1skaPqi1YPWImJWHXwBWz8N9gemF6WbktJbSP0LSKEkTJU2cPXt2x+bazKzBOagtQkQEEB24vAsjYnBEDO7du3dHLdbMzHBQa8mLuVmR/P5STp8J9C9M1y+ntZRuZmY15KBW3Vig0oNxJHBjIf2g3Atya+CV3Ex5G7CjpF65g8iOOc3MzGpo6XpnoN4kXQUMAVaTNIPUi/F04FpJhwHPAvvkyW8BdgGagDeBQwAiYq6knwAT8nQ/jojmnU/MzKyTNXxQi4j9Wxi1fZVpAxjdwnLGAGM6MGtmZtZObn40M7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFAzM7PScFBrhaRpkh6WNFnSxJy2qqRxkp7M771yuiSdLalJ0kOSNqtv7s3MGo+D2qJ9MSIGRcTg/PkE4I6IGAjckT8D7AwMzK9RwPk1z6mZWYNzUGu/YcClefhSYHgh/bJI7gdWkbRGPTJoZtaoHNRaF8BfJE2SNCqnrR4Rs/LwC8DqebgvML0w74yc9iGSRkmaKGni7NmzOyvfZmYNael6Z2AJt21EzJT0cWCcpMeLIyMiJEV7FhgRFwIXAgwePLhd85qZWetcU2tFRMzM7y8BNwBbAi9WmhXz+0t58plA/8Ls/XKamZnViINaCyT1kNSzMgzsCDwCjAVG5slGAjfm4bHAQbkX5NbAK4VmSjMzqwE3P7ZsdeAGSZC205URcaukCcC1kg4DngX2ydPfAuwCNAFvAofUPstmZo3NQa0FEfE0sEmV9JeB7aukBzC6BlkzM7MWuPnRzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKw0HNzMxKwzc0NutiBpxwc72z0GGmnb5rvbNgJeOampmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlYb/fN1FNfofcBu9/GZWnWtqZmZWGg5qZmZWGg5qHUzSUElPSGqSdEK982Nm1kh8Ta0DSeoGnAvsAMwAJkgaGxGP1jdnZlYGvpa8aK6pdawtgaaIeDoi3gGuBobVOU9mZg1DEVHvPJSGpL2AoRHxtfz5QGCriDiqMM0oYFT+uB7wRM0z2j6rAXPqnYk6aeSyQ2OX32Vfsq0ZEb2rjXDzY41FxIXAhfXOR1tJmhgRg+udj3po5LJDY5ffZe+6ZXfzY8eaCfQvfO6X08zMrAYc1DrWBGCgpLUkLQPsB4ytc57MzBqGmx87UEQskHQUcBvQDRgTEVPrnK3F1WWaSjtBI5cdGrv8LnsX5Y4iZmZWGm5+NDOz0nBQMzOz0nBQM2sHSap3HsysZQ5q9h9ptIO7pG0lbRkNdhFaUh9JH693Pqw+JK1e7zy0l4OatYukzSSt0GgHd2Az4EZJm0JjBPV8QDsD+EpXPLgtrmrfcSN875DKKWk1YLykr9Y7P+3hoGZtJmkocB3w6XrnpVYkLQUQEWeT7uX5+0qNrcwHOEmKiBeB3wODgV0krVznbNVU5cRN0pclHSFpxUY5mYtkDnAUcIqkveudp7ZyULM2kfRJ4HTg4Ih4oN75qZWIeA9A0mhgJeAF4FZJW5U1sOWAVjl4rwqsA/wU2K8RmiKL36mkQ4H/AXYCbpK0aRm/86JK+SQtFRH/BxwL/FzSvvXNWds4qFmrCj9gkZ5A8Lecvkx+716vvNWKpI1IZ6w/iogvAd8HbpC0dRnP3As1lD2A44G9gFOBzwO7lbnGVgzouZwCdoyIPYC/AT8ENilrYGt2QrOOpNUi4hbgQO
D0rhDYHNRsUXrk9+eBPpKOA4iIdyRtD/yy0kRXFlUOWLOAicA7krpHxHnA/wF3S9q45hnsJJK+IOmMQlI/YEJEvBwR55DKfCIwsow1tmYB7TjgPuA7wDcAIuIHwMPALylpE3yh/N8CLgCuk/RT0j1s9wd+kp8+ssQq1cHIOpaknYArJH0fGAEcDWwn6dz8mJ1fALdXmujKoPmZen7w66vAisBBQKWsfwVuz+PK4l/A1ySdnj8/AKxQ6RwTEVcDj5IO6P+uTxY7T+F7/yywBbAvKYivn29/R0T8ELgTmFuvfHY2SZsDBwC7ksr/HCm4TwS+BRwnqWf9ctg63/vRqpK0LemM9FBS09vGpE4io0k7+sbA9yLiz82aLLq0woFtNLAz6SB+O3AE8Edg7VwzHQwMi4jn6pXXjiLpi0DPiBgraX1goqSFEXGSpH2AfSRtBbwGrACcGhFlCubA+zX0T5NqKA9ExFRJTwOvA4dLWj4izoiIU+ua0Q5W+f0WfscrAy9HxJvAPyTNI11T/GxE3CTp7oh4va6ZboWDmn1EPgvrTaqZLQV8CtgzIt7I3flHFaYtTUCrUHqQ6z7A4cDPgB3z+07AF4G1gTPLENCy14GnJfWLiBmStiQFtteBb5Oup3wRWAU4JiKerWNeO1Rx/83vD+Um2MMlbRMR90m6C1iWFNx7AfPLss/nziCV1oeepJaH+4F5ko6KiHMi4nFJLwADgXuAN+uU3TbxDY3tQyTtCHwWaCL1dpwDbBcRc3Nz5DbAzyLijTpms9PkgD4CuIYPmmB+DZwMXBwRF9Qxex2ucJY+HLgUGB0Rl0tag9Tc9JuI+EmetkeJv/cRpIP2S8DlpO/9MFLnoL/lDlHLlLj8o4AdSOW/n9S8/FngE6Tm1uOBoRHxTN0y2Ua+pmbvk7QZsDtwR0RcDvwJmAyEpM+RrqGNL9MPu3mnkIh4LSJ+Q+q+vzOwX0TcDMwGdpe0ahl6vlXKkAPaSsAgUvPyDyUdEhGzSE2s35P0kzzbEn2G3h5Kd0pZPg9/g9QZZB6wHunRUbcBlwBnSvpMRLxbpv2+SNJupGtlp5Oun61Hap05J39eHdijKwQ0cPOjZfkg91vgXVKPRgEXA3sCtwLzgRNzm5N9Qz0AAAdLSURBVHppmhwL19COAtYCepF+3C8CywCfkPRlUvmPiYhSdBAolHuDiHhU0gxgGnAVcJEkIuJ3ktYiBXjK8p1L6gucADwi6TJgAOm7fSCPPxH4eUR8LXfrL+3T6yVtQ9rvfxURkyQ9DmxH+t2/EBHf7Wq/d9fUrNIpZCipmW1VYLd8R4F/RsT3SdeU9swdCbrUDt6SfKa+Qh4eDQwHzgU2Ab4REa8A40kB7tukg9yceuW3M0j6DPBnSUeQ/oP1dVIg3x04S9KIiHgxIp6sZz47wfPAJFJz4whgQ+ALhfE3kY+NEXFuia6dNv9j+Uqk8m8FHCxp44h4I//hug+wLnS9kxnX1Bpc7r58EfAgMIPUzHaSpPci4tcA+QBPHu5SO3g1zc7UxwDLk/6DcxDpjiHH5wvoJ+YmqmWK26AMlP48P51UCzmC9F+8v5OamL9Aur4yr24Z7CSFa4hLARvk14PANyTNjYjfknpADpC0CvBKGfb5iio19Emkmtp44JuSriL1cF2ZtE90Oe4o0sByL7efkbrm3y9pHXLXXVJvt4si4uR65rEz5LPVkcBGpC772wH9ST/iAyJiQb7O8i5wQZkOavB+DW0ocC3wFvC/wPVAd1J39h9FxI/ql8POlTuFHA8cQuoMMofUs3NPUi3t88C+ETG1bpnsRPn7v5p0+687SZcZ7iN1DDqS1APyxxExpW6ZXAxufmxsK5N+wNvlz8+SamtPkXo5jqtTvjpNofm0cqa+N6m8GwL35IB2MKkp7o6yBbRsen5dCgwBbgZejYiLSH9juLx+WauJ9YArI2IycBzwCqm57Rzg58CQEge05jX0DUh/st8VmEK61PDVrhrQwEGtoU
XEOGAP4FBJ+0fEu6QOEV8G5kbEvWXo6VeUm55GkHq7nUgKaAuBy4BjJZ1POrDvVcJrSQBExIzczHYY6b6OB5DO2omIiyPiqXrmrwYeBLaRtGFEvBMR/wv8F/Ax0n5fqmunFbmGdhLpZPYA0glsL+BJUqvFV4GlIuLtumWyA/iaWoOLiBslvUe6HdaepNtAnVK5hlTSmsr7Z+pK97j7b9KP+gJS7WVBRMyvZwZrISKm5Frp9sAxkgZExLT65qom7ibdBuurku4kXVN9BTg7SninlIJiDf08PqihXy9pIXB3RCysZwY7gq+pGQCSdgd+DFwREWcU/8dU35x1vPxH44OBkyrNTJImAHdR0ltALYrSjZrfrXc+akVSH1IrxR7AAuD4iHiovrmqDUmbkB4l1BPoHRHr1zlLHcpBzd6X7yYyBjg6Iq6vd346S+7V9u38sXKmfixwUEQ8X7eMWc1J6kE6Di6x9zLsDEpPWdgeOIZ0g4Fp9c1Rx3FQsw+RtAPwVEQ8Xe+8dKZGPlM3qyhjDd1BzRpao56pm5WVg5qZmZWGu/SbmVlpOKiZmVlpOKiZmVlpOKiZmVlpOKiZLaEkDZcUktr951hJ0yStViV9d0kndFD+dpY0UdKjkv4p6Rc5/RRJx3fEOszay0HNbMm1P3Bvfv8ISe2+zV1EjI2I0xc3Y5I2It0A+ICI2ID0lOymxV2u2eJyUDNbAklaEdiWdNPh/QrpQyT9TdJY4FFJ3SSdKekRSQ/lR+ZUfEPSg5IertT2JB0s6RxJK0t6Nj9XDEk9JE2X1F3S2pJulTQpr6taTfE7wGkR8ThARCyMiPOrlONwSRMkTZH0x8KDWffOeZ4i6Z6ctqGk8ZIm57IM7JCNaQ3FQc1syTQMuDUi/gW8LGnzwrjNgGMiYl1gFDAAGBQRGwNXFKabExGbAeeTnh/2vnzD6sl88MTnLwO35btLXEh6+vfmeb7zquRvI9LToxfl+ojYIiI2AR4jBWmAHwI75fTdc9qRwK8iYhCp5jejDcs3+xAHNbMl0/6kBzmS34tNkOMj4pk8/CXSg0wXAETE3MJ0lft3TiIFvuauAfbNw/sB1+Qa4meB6yRNJj25YI3FKMdGubb3MDCC9Nw6SA+lvETS4UC3nPYP4ERJ3wXWjIi3FmO91qD86BmzJYykVUkPbv20pCAd9ENS5SbMb7RxUZXnYi2k+m99LPA/eX2bk27u3AOYn2tLrZma51nUwyQvAYYXHnMzBCAijpS0FenhlJMkbR4RV0qqPLDyFklHRMSdi1i+2Ye4pma25NkL+H1ErBkRAyKiP/AM8Lkq044Djqh0GskBqk3y/S4nAL8CbsrXxV4FnpG0d16e8qNKmjuDVKtaN0+3lKQjq0zXE5glqTuppkaefu2IeCAifgjMBvpL+i/g6Yg4G7gR2LitZTGrcFAzW/LsD9zQLO2PVO8F+VvgOeAhSVNITy9uj2tIT0G+ppA2AjgsL28q6freh+QnGhwLXCXpMeAR0tOjm/sB8ACpufHxQvoZuQPLI8DfSTW+fYBHcrPnRqSnkZu1i29obGZmpeGampmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlYaDmpmZlcb/A62wAZjmPDyCAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
"plt.title('Frequencies of Predicted Test Archive Classes')\n",
"plt.xticks(rotation=45)\n",
"# Number of test texts predicted to belong to each archive\n",
"test_pred_counts = [predicted_df.loc[X_test.index].loc[predicted_df['prediction'] == label].shape[0] for label in labels]\n",
"plt.bar(labels, np.asarray(test_pred_counts));"
]
},
{
"cell_type": "markdown",
"id": "7b51ad25",
"metadata": {
"id": "GeVOwslAH1pP"
},
"source": [
"Unfortunately, because our texts skew so heavily toward the domestic archive, most of the other archives end up being overpredicted (i.e. our model assigns texts to an archive they do not actually belong to). Below we can see that the domestic archive is the only one whose texts are not overpredicted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "739ae7ef",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 310
},
"id": "RRCs5wEESDVZ",
"outputId": "8db2a17f-b490-4af8-d257-8944be46d747"
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZ0AAAElCAYAAAA/Rj+6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deZgcVdn+8e9NCLJFEDP6CkkIsqhhhxFQVCKLJi4BQYUAIgoEfxLEV0AQERBUNjdQkEV5EZBVUYMgqCyC7GGLhIiGNWGRsMqmEHh+f5zTUGlnhZnTmer7c119Tdcy3U91z9Td59TpKkUEZmZmJSzS6gLMzKx9OHTMzKwYh46ZmRXj0DEzs2IcOmZmVoxDx8zMinHo2JAn6ROS5kh6RtK6ra5nsEkaKykkLZqnfy/ps6/hccbk12zYINQYklYZ6Md9vSTdK2nzbpa9X9KdpWtqNw4d65P8z/p83kk9LOlUSUv38Xd3lvSXQSzvu8DUiFg6Im7p4vklaV9J/8jbcL+kwyW9YRBrKiYiJkbEz3tbr3mHGxH359fspcGt8PWTND4H2X6D9RwRcVVEvGOwHt8Sh471x8cjYmlgHWBd4GstrqdhRWBmD8uPBaYAOwEjgInAZsC5A11Io/Ux2L/Thj4LPE56D7vl13IIiAjffOv1BtwLbF6ZPgq4sDK9P3AX8DRwB/CJPP9dwL+Bl4BngCfz/DeQWij3A/8ETgCW6Oa5FwEOBO4DHgFOA5bJj/EMEMCzwF1d/O6q+bk3aJo/GvgPsCmwIfAwMKyy/BPAjMrzN7bvMVJYLZeXjc3Pv0velisr86YADwIPAftUHvsQ4JfAGcC/gF3z9vwsr/sA8K1GPcCw/Fo9CtwN7JEff9G8/Apg18rj7wbMqrwX6wGnAy8Dz+fX7KuVOhuPszwwjbRznw3s1lTzufm1f5oU8p09/L0E8KVc76PA0fl1XCw//pqVdd8CPAd0dPNYS+Xn3A54ofq8Xb3+3b0Glb/jfYAZwFPAOcDiedl4YG6+vx/wy6Y6jgGOzfe7fb9862Vf0uoCfBsaNyqhA4wC/gocU1n+qbzTWgTYlhQCb8vLdgb+0vR4P8g7uOVIrY8LgMO7ee7P553g24GlgfOB0yvLA1ilm9/9AnBfN8v+3HhOUqBsUVl2HrB/vr8XcF3e7jcAJwJn5WWNnd5peee4RGXeWXnemsC8yut3CPAisFV+vZYAfp0fd6m8E74B2L2yDX8jBeVywOV0Ezr5fXgAeDcgYBVgxeb3sKn2xuNcCRwPLE5qzc4DNq3U/G/gI6QQPBy4roe/l8h1LgeMAf5eqfF44MjKunsBF/TwWJ8h7dyH5b+TH3WxDdXXv7fX4AbS3+pypGD6Ql42nldDZ0VSEI7I08NyDRvl6W7fL9962Ze0ugDfhsYt/7M+Q/rkGMClwLI9rH8rsGW+vzOV0Mk7gmeBlSvz3gPc081jXQp8sTL9DtJOu7Gz7Cl0Duxu5wicDZyc738LOCXfH5HrWzFPzwI2q/ze2xrPX9npvb2yvDHvnZV5RwE/y/cPIX8iz9NvJbW6lqjMmwxcnu9f1tgx5ukP0X3oXALs1cN72GXokALtpcZONi8/HDi1UvOfKsvGAc/38P4HMKEy/UXg0nx/Q1KrRHl6OvDpHh7rT8APK6/LPGB40zZUX//eXoMdm96XE/L98eTQydN/AXbK97cgt6R7e7986/nmYzrWH1tFxAjSP+c7gZGNBZJ2knSrpCclPQmsUV3epANYEripsv7FeX5Xlid1rTXcR9pRvrUPNT9KComuvC0vBzgT2DoPLtgauDkiGs+5IvDrSq2zSDvo6vPP6eLxq/Puy9vR1bIVgeHAQ5XnOJH0CZr8e82P1Z3RpFZbfy0PPB4RTzc9zwqV6Ycr958DFu/lGEqX2x8R1+ffHy/pnaSWyLSuHkDSaOCDwC/yrN+SWmIf7eG5ensNmrejuw
ExZ5LCBGD7PA29v1/WA4eO9VtE/Bk4lXScAUkrAicDU4E3R8SywO2kFg2kT6JVj5KOLaweEcvm2zKRBil05UHSP3rDGGA+6VhQby4DRkvaoDoz78w2IrWiiIg7SDvGiSy4g4G0Q5tYqXXZiFg8Ih6orNO8jZB2ftWaH+xm/TmkT84jK4//xohYPS9/qIvH6s4cYOVulnVVY8ODwHKSRjQ9zwPdrN8XPW3/z4EdSV1nv4yIf3fzGJ8h7acukPQw6RjR4qSBBVXNr2d3r0F/nEcKxlGkY3yNv4ne3i/rgUPHXqsfAltIWpvUrx2kbg8kfY7U0mn4JzBK0mIAEfEyKaR+IOkt+XdWkPThbp7rLOB/Ja2Uh2l/BzgnIub3VmRE/J00SOEXkjaSNEzS6sCvSN1Ff6qsfibp+MIHSDuchhOAb+dwRVKHpC17e27gG5KWzM/3OdJB665qfAj4A/A9SW+UtIiklSVtklc5F/iSpFGS3kQa1NCdnwL7SFo/DxVfpVE36X14ezc1zAGuAQ6XtLiktUgH58/ow3Z2Z19Jb8oBvxcLbv8ZpB35jqTjMd35LPBN0jGmxm0b4COS3tzN7/T0GvRZRMwjdV3+H6nrd1ae39v7ZT1w6Nhrkv8hTwMOyq2E7wHXknZsawJXV1a/jDTa6WFJje6s/UiDA66T9C9Sv31335E4hTT66krgHtIB7T37Ue5U0o7oDNJxqYtJO5NtmtY7C9gEuCwiHq3MP4bU/fMHSU+TBhVs2Ifn/TNpGy8FvhsRf+hh3Z1II7vuAJ4gjW5rdAueTDpOcRtwM2kgRZci4jzg26QAfRr4DemAOaRjNAfmLqF9uvj1yaRjJA+SDpQf3BTK/fVb4CbS8b0LSaO9GnXOydsSwFVd/bKkjUgt3OMi4uHKbRrpdZ3c1e/18hr015nA5izY8oWe3y/rQeNAnpkNEEljSeE4vC+tsXYl6RTgwYg4sNW1WDn+IpWZFZeDeWvSl4ytjbh7zcyKknQYaaDJ0RFxT6vrsbLcvWZmZsW4pWNmZsU4dMzMrBgPJOjFyJEjY+zYsa0uw8xsyLjpppsejYguzzDi0OnF2LFjmT59eqvLMDMbMiR1e6omd6+ZmVkxDh0zMyvGoWNmZsU4dMzMrBiHjpmZFVOb0JF0iqRHJN3ewzrj84XGZkr6c8n6zMysRqFDuqjYhO4WSlqWdG32SfliS58qVJeZmWW1CZ2IuBJ4vIdVtgfOj4j78/qPFCnMzMxe0U5fDl0NGC7pCmAEcExE9HTFQrPXbOz+F7a6hAFx7xEfbXUJVjPtFDqLAusDmwFLANdKui5fzngBkqYAUwDGjOnpcvRmZtYftele64O5wCUR8Wy+FPGVwNpdrRgRJ0VEZ0R0dnR0efogMzN7DdopdH4LvE/SopKWJF3jflaLazIzayu16V6TdBYwHhgpaS5wMDAcICJOiIhZki4GZgAvAz+NiG6HV5uZ2cCrTehExOQ+rHM0cHSBcszMrAvt1L1mZmYt5tAxM7NiHDpmZlaMQ8fMzIpx6JiZWTEOHTMzK8ahY2ZmxTh0zMysGIeOmZkV49AxM7NiHDpmZlaMQ8fMzIpx6JiZWTG1Ocu0LVzqcrlm8CWbzQaSWzpmZlaMQ8fMzIpx6JiZWTG1CR1Jp0h6RFKPl6CW9G5J8yV9slRtZmaW1CZ0gFOBCT2tIGkYcCTwhxIFmZnZgmoTOhFxJfB4L6vtCfwKeGTwKzIzs2a1CZ3eSFoB+ATwkz6sO0XSdEnT582bN/jFmZm1ibYJHeCHwH4R8XJvK0bESRHRGRGdHR0dBUozM2sP7fTl0E7gbEkAI4GPSJofEb9pbVlmZu2jbUInIlZq3Jd0KvA7B46ZWVm1CR1JZwHjgZGS5gIHA8MBIuKEFpZmZmZZbUInIib3Y92dB7EUMzPrRjsNJDAzsxZz6JiZWTEOHTMzK8ahY2ZmxTh0zMysGIeOmZkV49AxM7NiHDpmZlaMQ8fMzIpx6JiZWT
EOHTMzK8ahY2ZmxTh0zMysGIeOmZkVU5tLGyyMxu5/YatLGDD3HvHRVpdgZjXglo6ZmRXj0DEzs2JqEzqSTpH0iKTbu1m+g6QZkv4q6RpJa5eu0cys3dUmdIBTgQk9LL8H2CQi1gQOA04qUZSZmb2qNgMJIuJKSWN7WH5NZfI6YNRg12RmZguqU0unP3YBft/dQklTJE2XNH3evHkFyzIzq7e2Cx1JHySFzn7drRMRJ0VEZ0R0dnR0lCvOzKzmatO91heS1gJ+CkyMiMdaXY+ZWbtpm5aOpDHA+cBnIuLvra7HzKwd1aalI+ksYDwwUtJc4GBgOEBEnAAcBLwZOF4SwPyI6GxNtWZWRz4LSe9qEzoRMbmX5bsCuxYqx8zMutA23WtmZtZ6Dh0zMyvGoWNmZsU4dMzMrBiHjpmZFePQMTOzYhw6ZmZWjEPHzMyKceiYmVkxDh0zMyvGoWNmZsU4dMzMrBiHjpmZFePQMTOzYhw6ZmZWjEPHzMyKceiYmVkxtQkdSadIekTS7d0sl6RjJc2WNEPSeqVrNDNrd7UJHeBUYEIPyycCq+bbFOAnBWoyM7OK2oRORFwJPN7DKlsCp0VyHbCspLeVqc7MzKBGodMHKwBzKtNz87z/ImmKpOmSps+bN69IcWZm7aCdQqfPIuKkiOiMiM6Ojo5Wl2NmVhvtFDoPAKMr06PyPDMzK6SdQmcasFMexbYR8FREPNTqoszM2smirS5goEg6CxgPjJQ0FzgYGA4QEScAFwEfAWYDzwGfa02lZmbtqzahExGTe1kewB6FyjFrW2P3v7DVJQyYe4/4aKtLqJ126l4zM7MWc+iYmVkxDh0zMyvGoWNmZsU4dMzMrJheQ0fSS5JulXS7pAskLdvL+utI+sjAlWhmZnXRl5bO8xGxTkSsQTqhZm/DjtchfR/GzMxsAf3tXruWfJJMSRtIulbSLZKukfQOSYsBhwLb5tbRtpKWyte6uSGvu+VAb4SZmQ0Nff5yqKRhwGbAz/KsvwHvj4j5kjYHvhMR20g6COiMiKn5974DXBYRn89dczdI+lNEPDuwm2JmZgu7voTOEpJuJbVwZgF/zPOXAX4uaVUgyKec6cKHgEmS9snTiwNj8mOZmVkb6fMxHWBFQLx6TOcw4PJ8rOfjpDDpioBt8nGhdSJiTEQ4cMzM2lCfj+lExHPAl4C9JS1Kauk0Lg2wc2XVp4ERlelLgD0lCUDSuq+nYDMzG7r6NZAgIm4BZgCTgaOAwyXdwoLddJcD4xoDCUgtouHADEkz87SZmbWhXo/pRMTSTdMfr0yuVrl/YF7+OPDupofZ/bUWaGZm9eEzEpiZWTEOHTMzK6Y2oSNpgqQ7Jc2WtH8Xy8dIujx/QXWGT9VjZlZeLUInf3H1OGAiMA6YLGlc02oHAudGxLrAdsDxZas0M7NahA6wATA7Iu6OiBeAs4Hm0+0E8MZ8fxngwYL1mZkZ/TgNzkJuBWBOZXousGHTOocAf5C0J7AUsHmZ0szMrKEuLZ2+mAycGhGjSGfBPl1Sl9svaYqk6ZKmz5s3r2iRZmZ1VpfQeQAYXZkexatnS2jYBTgXICKuJZ22Z2RXDxYRJ0VEZ0R0dnR0DEK5ZmbtqS6hcyOwqqSV8uUVtgOmNa1zP+ks2Uh6Fyl03IwxMyuoFqETEfOBqaTzvM0ijVKbKelQSZPyansDu0m6DTgL2DkiojUVm5m1p7oMJCAiLgIuapp3UOX+HcDGpesyM7NX1aKlY2ZmQ4NDx8zMinHomJlZMQ4dMzMrxqFjZmbFOHTMzKwYh46ZmRXj0DEzs2IcOmZmVoxDx8zMinHomJlZMQ4dMzMrxqFjZmbFOHTMzKwYh46ZmRXj0DEzs2IcOmZmVkxtQkfSBEl3Spotaf9u1vm0pDskzZR0ZukazczaXS0uVy1pGHAcsAUwF7hR0rR8ierGOqsCXwM2jognJL2lNdWambWvurR0NgBmR8TdEfECcDawZdM6uwHHRcQTAB
HxSOEazczaXl1CZwVgTmV6bp5XtRqwmqSrJV0naUKx6szMDKhJ91ofLQqsCowHRgFXSlozIp5sXlHSFGAKwJgxY0rWaGZWa3Vp6TwAjK5Mj8rzquYC0yLixYi4B/g7KYT+S0ScFBGdEdHZ0dExKAWbmbWjuoTOjcCqklaStBiwHTCtaZ3fkFo5SBpJ6m67u2SRZmbtrhahExHzganAJcAs4NyImCnpUEmT8mqXAI9JugO4HNg3Ih5rTcVmZu2pNsd0IuIi4KKmeQdV7gfwlXwzM7MWqEVLx8zMhgaHjpmZFePQMTOzYhw6ZmZWjEPHzMyKceiYmVkxDh0zMyvGoWNmZsU4dMzMrBiHjpmZFePQMTOzYhw6ZmZWjEPHzMyKceiYmVkxDh0zMyvGoWNmZsU4dMzMrJjahI6kCZLulDRb0v49rLeNpJDUWbI+MzOrSehIGgYcB0wExgGTJY3rYr0RwF7A9WUrNDMzqEnoABsAsyPi7oh4ATgb2LKL9Q4DjgT+XbI4MzNL6hI6KwBzKtNz87xXSFoPGB0RF/b2YJKmSJouafq8efMGtlIzszZWl9DpkaRFgO8De/dl/Yg4KSI6I6Kzo6NjcIszM2sjdQmdB4DRlelReV7DCGAN4ApJ9wIbAdM8mMDMrKy6hM6NwKqSVpK0GLAdMK2xMCKeioiRETE2IsYC1wGTImJ6a8o1M2tPtQidiJgPTAUuAWYB50bETEmHSprU2urMzKxh0VYXMFAi4iLgoqZ5B3Wz7vgSNZmZ2YJq0dIxM7OhwaFjZmbFOHTMzKwYh46ZmRXj0DEzs2IcOmZmVoxDx8zMinHomJlZMQ4dMzMrxqFjZmbFOHTMzKwYh46ZmRXj0DEzs2IcOmZmVoxDx8zMinHomJlZMQ4dMzMrpjahI2mCpDslzZa0fxfLvyLpDkkzJF0qacVW1Glm1s5qETqShgHHAROBccBkSeOaVrsF6IyItYBfAkeVrdLMzGoROsAGwOyIuDsiXgDOBrasrhARl0fEc3nyOmBU4RrNzNpeXUJnBWBOZXpuntedXYDfd7dQ0hRJ0yVNnzdv3gCVaGZmdQmdPpO0I9AJHN3dOhFxUkR0RkRnR0dHueLMzGpu0VYXMEAeAEZXpkfleQuQtDnwdWCTiPhPodrMzCyrS0vnRmBVSStJWgzYDphWXUHSusCJwKSIeKQFNZqZtb1ahE5EzAemApcAs4BzI2KmpEMlTcqrHQ0sDZwn6VZJ07p5ODMzGyR16V4jIi4CLmqad1Dl/ubFizIzswXUoqVjZmZDg0PHzMyKceiYmVkxDh0zMyvGoWNmZsU4dMzMrBiHjpmZFePQMTOzYhw6ZmZWjEPHzMyKceiYmVkxDh0zMyvGoWNmZsU4dMzMrBiHjpmZFePQMTOzYhw6ZmZWTG1CR9IESXdKmi1p/y6Wv0HSOXn59ZLGlq/SzKy91SJ0JA0DjgMmAuOAyZLGNa22C/BERKwC/AA4smyVZmZWi9ABNgBmR8TdEfECcDawZdM6WwI/z/d/CWwmSQVrNDNre4qIVtfwukn6JDAhInbN058BNoyIqZV1bs/rzM3Td+V1Hu3i8aYAU/LkO4A7B3kTXo+RwH9tQxtp5+33trevhX37V4yIjq4WLFq6kqEgIk4CTmp1HX0haXpEdLa6jlZp5+33trfntsPQ3v66dK89AIyuTI/K87pcR9KiwDLAY0WqMzMzoD6hcyOwqqSVJC0GbAdMa1pnGvDZfP+TwGVRh75FM7MhpBbdaxExX9JU4BJgGHBKRMyUdCgwPSKmAT8DTpc0G3icFEx1MCS6AQdRO2+/t719Ddntr8VAAjMzGxrq0r1mZmZDgEPHzMyKcehYrfgLv2YLN4dOTbXbzlfS+yRt0G4jEiUtL+ktra7DWkPSW1tdQ385dGpG0nqSlmy3nS+wHvBbSetCe4Ru3uEcDXxiKO58Xq+u3uN2eN8hbaekkcANkrZvdT394dCpEUkTgPOANVtdSymSFg
GIiGNJ59w7vdHiqfMOSJIi4p/A6UAn8BFJy7S4rKIaH6wkfUzS7pKWbpcPW5E8CkwFDpH0qVbX1FcOnZqQNAY4Atg5Iq5vdT2lRMTLAJL2AN4IPAxcLGnDugZPDpzGznU5YBXgcGC7duhqq76nkj4PfAf4MPA7SevW8T2vamyfpEUi4gLgy8BRkrZtbWV949AZ4ir/YCKdafuqPH+x/HN4q2orRdIapE9834yIzYEDgV9L2qiOn3wrn/C3BvYhnWHjW8AHgI/XucVTDdy8nQI+FBFbA1cBBwFr1zV4mj5wrCJpZERcBHwGOGIoBI9DZ+hbKv98EFhe0t4AEfGCpM2A7ze6oOqiix3KQ8B04AVJwyPieOAC4ApJaxUvcJBI2kTS0ZVZo4AbI+KxiPgxaZsPAD5bxxZPU+DsDVwNfBXYEyAivgH8Ffg+Ne1irmz/V4ATgfMkHU46t+Rk4LB8lv2FVq12Ru1G0oeBX0g6ENgB+BKwqaTj8uUevgf8qdEFVQfNn3TzBfz+BSwN7AQ0tvXPwJ/ysrr4O7CrpCPy9PXAko3BExFxNnAHaYf779aUOHgq7/t7gXcD25JC9p35NFhExEHAZaRTXdWSpPWBHYGPkrb/flL4Tge+AuwtaUTrKuxZLc691o4kvY/0ie7zpK6ltUiDCPYg/SGuBXwtIn7f1CQf0io7nj1IV4q9gxQuuwO/AlbOLbtOYMuIuL9VtQ4USR8ERkTENEnvBKZLeikivi7p08CnJW0IPA0sCXwrIuoUtsArLdw1SZ/wr8/nV7wbeAbYTdISEXF0RHyrpYUOsMb/b+X/eBngsYh4DrhW0hOkY1rvjYjfSboiIp5padE9cOgMQflTTAepZbMI8C5gm4h4Ng+XnlJZtzaB06B0kb1PA7uRLjv+ofzzw8AHgZWB79YhcLJngLsljYqIuZI2IAXPM8C+pP78DwLLAntFxH0trHVAVf9+888ZuYtxN0kbR8TVki4H3kAK3zcBT9blbz4PFmi03keQWu7XAU9ImhoRP46Iv0l6GFgVuBJ4rkXl9olP+DnESPoQ8F5gNmm02qPAphHxeO5u2xg4MiKebWGZgyYH7g7AObzaxfAj4GDgZxFxYgvLG3CVT7lbkS63vkdEnCHpbaTulBMi4rC87lI1ft93IO1UHwHOIL3vu5AGj1yVB8wsVuPtnwJsQdr+60jdp+8F/ofUnbgP6crI97SsyD7yMZ0hRNJ6wCTg0og4A/gNcCsQkt5POoZzQ53+8ZoHDUTE0xFxAml49ERgu4i4EJgHTJK0XB1GLjW2IQfOG4F1SN2nB0n6XEQ8ROpC/Jqkw/KvLdSfcPtD6UwLS+T7e5IGCzxBunz8Jfl2KvBdSe+JiBfr9HdfJenjpGM1R5CO37yD1Lvx4zz9VmDroRA44O61ISPvhH4KvEgakSbSNYK2AS4GngQOyH26telSqxzDmQqsBLyJ9M/3T2Ax4H8kfYy0/XtFRC0OIFe2e1xE3CFpLnAvcBZwsiQi4v8krUQKYOrynktaAdgfuF3SacBY0nt7fV5+AHBUROyah003XyW4NiRtTPq7PyYibpL0N2BT0v/9wxGx31D7f3dLZwjIgwYmkLqRlgM+nr+RfEtEHEg6prFNPtA8pP4Au5M/6S6Z7+8BbAUcB6wN7BkRTwE3kAJoX9JO6NFW1TsYJL0H+L2k3UnfQfl/pKCdBPxA0g4R8c+I+Ecr6xwEDwI3kbrTdgBWBzapLP8ded8VEcfV6Nhd8xdf30ja/g2BnSWtFRHP5i+ELg+sBkPvw4ZbOgu5PDz0ZOBmYC6pG+nrkl6OiB8B5B0w+f6Q+gPsStMn3VOAJUjfQdiJdMaBffIB1gNyF8xi1degDpS+3DuH9Cl+d9J3ka4hdaFuQurff6JlBQ6SyjGsRYBx+XYzsKekxyPip6QRbGMlLQs8VYe/+YYuWrg3kVo6NwD/K+ks0gjFZUh/E0OOBxIsxPIopSNJQ5+vk7QKeWgkabTSyR
FxcCtrHAz5095ngTVIQ6I3BUaT/sl2jHR58j1JXY0n1mmnA6+0cCYA5wLPAz8EzgeGk4YLfzMivtm6CgdXHjSwD/A50mCBR0kj87YhtXI+AGwbETNbVuQgyu//2aTT+1xG6ka/mjRw5AukEWyHRsRtLSvydXD32sJtGdI/2KZ5+j5Sa+cu0ii1P7aorkFT6R5sfNL9FGl7VweuzIGzM6mr6dK6BU42J99+DowHLgT+FREnk4aJn9G60op4B3BmRNwK7A08RepO+jFwFDC+xoHT3MIdR/oS8EeB20hd6dsP1cABh85CLSL+CGwNfF7S5Ih4kXTA/GPA4xHxlzqM1KrKXSs7kEYrHUAKnJeA04AvS/oJacf7yRoeywAgIubmbqRdSOdV25H0qZeI+FlE3NXK+gq4GdhY0uoR8UJE/BB4O/Bm0t99rY7dNeQWztdJHzZ3JH3AfBPwD1Krf3tgkYj4T8uKHAA+prOQi4jfSnqZdLqbbUineTmkcQyjpp/0X/mkq3SOqS+S/ulOJH36nx8RT7aywBIi4rbcqtsM2EvS2Ii4t7VVFXEF6TQ320u6jHRM7yng2KjhmRYqqi3c43m1hXu+pJeAKyLipVYWOBB8TGeIkDQJOBT4RUQcXf0eR2srG3j5i5A7A19vdKNIuhG4nJqe4qU3SicyfbHVdZQiaXlSK39rYD6wT0TMaG1VZUham3SpihFAR0S8s8UlDSiHzhCSz0ZwCvCliDi/1fUMljwqad882fik+2Vgp4h4sGWFWXGSliLtpxbac4kNBqWzhG8G7EX6AvS9ra1o4Dh0hhhJWwB3RcTdra5lMLXzJ12zhjq2cB06tlBr10+6ZnXl0DEzs2I8ZNrMzIpx6JiZWTEOHTMzK8ahY2ZmxTh0zF4jSVtJCkn9/vKepHsljexi/iRJ+w9QfRMlTZd0h6RbJH0vzz9E0j4D8Rxm/eXQMXvtJgN/yT//i6R+n2YqIqZFxBGvtzBJa5BOkLljRIwjXWV09ut9XLPXy6Fj9hpIWhp4H+mknNtV5o+XdJWkacAdkoZJ+q6k2yXNyJdkaNhT0s2S/tpoLUnaWdKPJS0j6b58XRkkLSVpjqThklaWdLGkm/JzddXS+irw7Yj4G0BEvE2RRZYAAAJhSURBVBQRP+liO3aTdKOk2yT9qnLhvE/lmm+TdGWet7qkGyTdmrdl1QF5Ma2tOHTMXpstgYsj4u/AY5LWryxbj3R55dWAKaTLLa8TEWsBv6is92hErAf8hHT9mFfkE7reyqtXzPwYcEn+dvpJpKunrp9/7/gu6luDdPXN3pwfEe+OiLWBWaQQBTgI+HCePynP+wLpssnrkFpOc/vw+GYLcOiYvTaTSRfaIv+sdrHdEBH35Pubky40Nx8gIh6vrNc4f95NpGBqdg6wbb6/HXBObmG9FzhP0q2kM2+/7XVsxxq5tfRXXr00NKSLhp0qaTdgWJ53LXCApP2AFSPi+dfxvNamfGkDs36StBzpwnprSgrSTjkkNU5S+mwfH6pxXZSX6Pp/cRrwnfx865NOfroU8GRubfRkZv6d3i72dSqwVeUyCuMBIuILkjYkXTzsJknrR8SZkhoXFLtI0u4RcVkvj2+2ALd0zPrvk8DpEbFiRIyNiNHAPcD7u1j3j8DujUEFOUD6JJ9v7kbgGOB3+bjMv4B7JH0qP57yqfCbHU1qlayW11tE0he6WG8E8JCk4aSWDnn9lSPi+og4CJgHjJb0duDuiDgW+C2wVl+3xazBoWPWf5OBXzfN+xVdj2L7KXA/MEPSbaSrP/bHOaSrSJ5TmbcDsEt+vJmk40sLyGfk/jJwlqRZwO2kq282+wbpcshXA3+rzD86D3C4HbiG1GL6NHB77tZbg3Q1V7N+8Qk/zcysGLd0zMysGIeOmZkV49AxM7NiHDpmZlaMQ8fMzIpx6JiZWTEOHTMzK8ahY2Zmxfx/2a+bi/2hMLMAAAAASUVORK5CYII=\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Rate', rotation=0, labelpad=30)\n",
"plt.title('Rate of Overprediction by Archive')\n",
"plt.xticks(rotation=45)\n",
"# Overprediction rate: predicted texts per archive divided by actual texts, normalized by the totals\n",
"rate = np.asarray(test_pred_counts)/np.asarray(test_counts)*sum(test_counts)/sum(test_pred_counts)\n",
"plt.bar(labels, rate);"
]
},
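{
"cell_type": "markdown",
"id": "a7c3e9d1",
"metadata": {},
"source": [
"The rate plotted above is, for each archive, the number of predicted texts divided by the number of actual texts: when every test text has exactly one true and one predicted archive among these classes, the two normalizing sums are equal and cancel. A rate above 1 therefore means the archive is overpredicted, and a rate below 1 that it is underpredicted. A minimal sketch with made-up counts (not our data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8d2f6c0",
"metadata": {},
"outputs": [],
"source": [
"toy_actual = np.array([10, 80, 10])     # true class counts\n",
"toy_predicted = np.array([15, 70, 15])  # predicted class counts\n",
"# Both totals are 100 here, so the normalization factor is 1\n",
"toy_predicted / toy_actual * toy_actual.sum() / toy_predicted.sum()  # 1.5, 0.875, 1.5"
]
},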
{
"cell_type": "markdown",
"id": "c083e610",
"metadata": {
"id": "VCkyV5guUAVw"
},
"source": [
"### 5.3 Accuracy By Archive\n",
"\n",
"The accuracies for the dead and wild archives are relatively low. This is likely because those texts are being misclassified into the domestic archive, our largest archive, since all three of these archives deal with animals. The wool and precious archives have decent accuracies."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19a813e6",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "OZmI6aR3G58T",
"outputId": "b25b85b6-b6a3-47a0-d741-a98c128283a8"
},
"outputs": [
{
"data": {
"text/plain": [
"0.734375"
]
},
"execution_count": 169,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'dead'], y_test[class_series == 'dead'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "266a4a78",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HbRZc7FSTt8B",
"outputId": "20d98cb3-8354-4c33-eda9-c66f6d487af4"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9449010654490106"
]
},
"execution_count": 170,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'dom'], y_test[class_series == 'dom'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dc2a622",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "23aREE8sTxKt",
"outputId": "3eee05bc-0955-483f-ca5b-ca2094f3720c"
},
"outputs": [
{
"data": {
"text/plain": [
"0.7410071942446043"
]
},
"execution_count": 171,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'wild'], y_test[class_series == 'wild'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9378b0e5",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3pE1BBM5T2n_",
"outputId": "04fb818b-8814-4fd1-9584-2579f183e476"
},
"outputs": [
{
"data": {
"text/plain": [
"0.8333333333333334"
]
},
"execution_count": 172,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'wool'], y_test[class_series == 'wool'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59399459",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LiVNsk8lT5qw",
"outputId": "6a8e0e52-f860-4a35-90e3-820d71927a9b"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9264705882352942"
]
},
"execution_count": 173,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'prec'], y_test[class_series == 'prec'])"
]
},
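{
"cell_type": "markdown",
"id": "c5a9e3f7",
"metadata": {},
"source": [
"The five per-archive scores above can also be collected with a single loop over the archive labels (a sketch using the same f, X_test, y_test, and class_series defined earlier):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0e4b8a2",
"metadata": {},
"outputs": [],
"source": [
"# Accuracy of the classifier restricted to the texts of each archive\n",
"for archive in ['dead', 'dom', 'prec', 'wild', 'wool']:\n",
"    mask = class_series == archive\n",
"    print(archive, f.score(X_test[mask], y_test[mask]))"
]
},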
{
"cell_type": "markdown",
"id": "6f0be9f0",
"metadata": {
"id": "NCBvhrQkBawe"
},
"source": [
"We can also look at the confusion matrix. A confusion matrix is used to evaluate the accuracy of a classification. The rows denote the actual archive, while the columns denote the predicted archive; with normalize='true', each row is divided by that archive's total number of texts, so it sums to 1. The rows and columns follow sorted label order: dead, dom, prec, wild, wool.\n",
"\n",
"Looking at the first column: \n",
"- 73.44% of the dead archive texts are predicted correctly\n",
"- 1.31% of the domestic archive texts are predicted to be part of the dead archive\n",
"- 1.47% of the precious archive texts are predicted to be part of the dead archive\n",
"- 1.44% of the wild archive texts are predicted to be part of the dead archive\n",
"- none of the wool archive texts are predicted to be part of the dead archive"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4241129f",
"metadata": {
"id": "FMszAvyUBaR5"
},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"# normalize='true' divides each row by that archive's total, giving per-archive rates\n",
"archive_confusion = confusion_matrix(y_test, f.predict(X_test), normalize='true')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f03d0b21",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "50GlA0txCLgq",
"outputId": "f055525f-a3ec-41e5-c650-dabc6061b778"
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.734375 , 0.171875 , 0.015625 , 0.0625 , 0.015625 ],\n",
" [0.0130898 , 0.94490107, 0.00487062, 0.03531202, 0.00182648],\n",
" [0.01470588, 0.05882353, 0.92647059, 0. , 0. ],\n",
" [0.01438849, 0.21582734, 0.02158273, 0.74100719, 0.00719424],\n",
" [0. , 0.08333333, 0.08333333, 0. , 0.83333333]])"
]
},
"execution_count": 175,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"archive_confusion"
]
},
{
"cell_type": "markdown",
"id": "b21d3325",
"metadata": {
"id": "Vm1vDZPGIr1f"
},
"source": [
"This is the same confusion matrix expressed in raw counts of texts. Since the number of domestic archive texts is so high, even a small rate of misclassification of domestic archive texts can overwhelm the other archives.\n",
"\n",
"For example, even though only 1.3% of the domestic archive texts are predicted to be part of the dead archive, that corresponds to 43 texts, while the 73% of the dead archive texts that were predicted correctly correspond to just 47 texts. As a result, about half of the texts that were predicted to be part of the dead archive are incorrectly classified."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54b5662d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HSDHsdWRDy_e",
"outputId": "cb72fa2a-8899-4c4e-95ac-32df77871d06"
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 47, 11, 1, 4, 1],\n",
" [ 43, 3104, 16, 116, 6],\n",
" [ 1, 4, 63, 0, 0],\n",
" [ 2, 30, 3, 103, 1],\n",
" [ 0, 2, 2, 0, 20]])"
]
},
"execution_count": 176,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"confusion_matrix(y_test, f.predict(X_test), normalize=None)"
]
},
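{
"cell_type": "markdown",
"id": "e7c1a5b9",
"metadata": {},
"source": [
"The \"about half\" figure can be read off the raw-count matrix directly: the dead column contains 47 correct predictions out of 47 + 43 + 1 + 2 + 0 = 93 total dead predictions, i.e. a precision of roughly 0.505 for the dead archive (a sketch; assumes the sorted label order dead, dom, prec, wild, wool, with dead first either way):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2a6c0d4",
"metadata": {},
"outputs": [],
"source": [
"cm = confusion_matrix(y_test, f.predict(X_test), normalize=None)\n",
"# Precision for the dead archive: correct dead predictions / all dead predictions\n",
"cm[0, 0] / cm[:, 0].sum()"
]
},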
{
"cell_type": "markdown",
"id": "84759158",
"metadata": {
"id": "G5YG_CKScVNB"
},
"source": [
"## 6 Save Results in CSV file & Pickle"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}