"
],
"text/plain": [
" lemma id_text ... 2 archive\n",
"pn id_line ... \n",
"100041 3 6(diš)[]NU P100041 ... NU \n",
" 3 udu[sheep]N P100041 ... N domesticated_animal\n",
" 4 kišib[seal]N P100041 ... N \n",
" 4 Lusuen[0]PN P100041 ... PN \n",
" 5 ki[place]N P100041 ... N \n",
"\n",
"[5 rows x 32 columns]"
]
},
"execution_count": 43,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
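],
"source": [
"# Assign each archive name to every row whose column 1 contains one of that archive's label strings\n",
"for archive in labels.keys():\n",
"    data.loc[data.loc[:, 1].str.contains('|'.join([re.escape(x) for x in labels[archive]])), 'archive'] = archive\n",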
"\n",
"data.loc[:, 'archive'] = data.loc[:, 'archive'].fillna('')\n",
"\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"id": "bd6d513c",
"metadata": {
"id": "-5so2KJ5bLin"
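},
"source": [
"The function `get_set` takes a dataframe holding the rows of one text and returns a dictionary in which each key is a word type such as NU or PN and each value is the set of lemmas of that type. Lemmas from rows whose label contains 'seal' are collected separately under the key `SEALS`."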
]
},
{
"cell_type": "markdown",
"id": "5e180b98",
"metadata": {
"id": "IFeAuQmDInDY"
},
"source": [
"### 1.2 Data Structuring"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63afc925",
"metadata": {
"id": "5kTJp996bNaz"
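},
"outputs": [],
"source": [
"def get_set(df):\n",
"    # Column 2 holds the word type (e.g. NU, PN); column 0 holds the lemma.\n",
"    d = {}\n",
"\n",
"    # Split off the rows that belong to seal impressions\n",
"    seals = df[df['label'].str.contains('seal')]\n",
"    df = df[~df['label'].str.contains('seal')]\n",
"\n",
"    # One set of lemmas per word type for the main text\n",
"    for x in df[2].unique():\n",
"        d[x] = set(df.loc[df[2] == x, 0])\n",
"\n",
"    # Same structure for the seal rows, nested under 'SEALS'\n",
"    d['SEALS'] = {}\n",
"    for x in seals[2].unique():\n",
"        d['SEALS'][x] = set(seals.loc[seals[2] == x, 0])\n",
"\n",
"    return d"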
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bff4408f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "dgLTwrRebO-7",
"outputId": "dd3479a1-5d36-4235-8864-e6c92daa070e"
},
"outputs": [
{
"data": {
"text/plain": [
"{'': {''},\n",
" 'MN': {'Šueša'},\n",
" 'N': {'itud', 'maš', 'mu', 'mu.DU', 'udu'},\n",
" 'NU': {'1(diš)', '2(diš)'},\n",
" 'PN': {'Apilatum', 'Ku.ru.ub.er₃', 'Šulgisimti'},\n",
" 'SEALS': {},\n",
" 'SN': {'Šašrum'},\n",
" 'V/i': {'hulu'},\n",
" 'V/t': {'dab'}}"
]
},
"execution_count": 45,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"get_set(data.loc[100271])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb30aa35",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 230
},
"id": "rNidxpgcbQVq",
"outputId": "0abc700c-248e-42b6-99c6-e989458c3dbe"
},
"outputs": [
{
"data": {
"text/plain": [
" archive set\n",
"pn \n",
"100041 {domesticated_animal} {'NU': {'6(diš)'}, 'N': {'ki', 'kišib', 'udu'}...\n",
"100189 {dead_animal} {'NU': {'1(diš)', '5(diš)-kam', '2(diš)'}, 'N'...\n",
"100190 {dead_animal} {'NU': {'3(u)', '1(diš)', '5(diš)', '1(diš)-ka...\n",
"100191 {dead_animal} {'NU': {'1(diš)', '4(diš)', '4(diš)-kam', '2(u...\n",
"100211 {dead_animal} {'NU': {'1(diš)', '1(u)', '1(diš)-kam', '2(diš..."
]
},
"execution_count": 46,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"archives = pd.DataFrame(data.groupby('pn').apply(lambda x: set(x['archive'].unique()) - set(['']))).rename(columns={0: 'archive'})\n",
"archives.loc[:, 'set'] = data.reset_index().groupby('pn').apply(get_set)\n",
"archives.loc[:, 'archive'] = archives.loc[:, 'archive'].apply(lambda x: {'dead_animal'} if 'dead_animal' in x else x)\n",
"archives.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08da24cb",
"metadata": {
"id": "rOKf3s6qbSrG"
},
"outputs": [],
"source": [
"def get_line(row, pos_lst=('N',)):\n",
"    words = {'pn' : [row.name]} # set p_number\n",
"    for pos in pos_lst:\n",
"        if pos in row['set']:\n",
"            # add word entries for all words of the selected part of speech\n",
"            words.update({word: [1] for word in row['set'][pos]})\n",
"    return pd.DataFrame(words)"
]
},
{
"cell_type": "markdown",
"id": "c844a607",
"metadata": {
"id": "D-fon0TCLhxA"
},
"source": [
"Each row represents a unique P-number, so the matrix indicates which words are present in each text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27c47c98",
"metadata": {
"id": "wgZkM0MCQzu3"
},
"outputs": [],
"source": [
"sparse = words_df.groupby(by=['id_text', 'lemma']).count()\n",
"sparse = sparse['id_word'].unstack('lemma')\n",
"sparse = sparse.fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "978fdb70",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 461
},
"id": "yHn9mTTVMOsy",
"outputId": "6ba8c00f-d3ef-4073-f98f-f26a176aba22"
},
"outputs": [
{
"data": {
"text/plain": [
" ki kišib udu itud mu ... balla šembulug li niŋsaha ensi\n",
"pn ... \n",
"100041 1.0 1.0 1.0 NaN NaN ... NaN NaN NaN NaN NaN\n",
"100189 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"100190 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"100191 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"100211 1.0 NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"... ... ... ... ... ... ... ... ... .. ... ...\n",
"519650 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN\n",
"519658 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN\n",
"519792 NaN NaN 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN\n",
"519957 NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN\n",
"519959 1.0 1.0 NaN 1.0 NaN ... NaN NaN NaN NaN NaN\n",
"\n",
"[15139 rows x 1076 columns]"
]
},
"execution_count": 49,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"sparse = pd.concat(archives.apply(get_line, axis=1).values).set_index('pn')\n",
"\n",
"sparse"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aeb77970",
"metadata": {
"id": "Cps3vRT8Xc7f"
},
"outputs": [],
"source": [
"sparse = sparse.fillna(0)\n",
"sparse = sparse.join(archives.loc[:, 'archive'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95e8af2d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 279
},
"id": "j43addrEfqrB",
"outputId": "d73bcf04-e30d-4fb0-dd19-c8f83425eed3"
},
"outputs": [
{
"data": {
"text/plain": [
"[Figure: line plot of pca_archive.explained_variance_ratio_]"
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(pca_archive.explained_variance_ratio_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "107b64d4",
"metadata": {
"id": "5Ey9jWsOd3LL"
},
"outputs": [],
"source": [
"known_reindexed = known.reset_index()\n",
"known_reindexed"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "632abd1e",
"metadata": {
"id": "vojBl0tgd53L"
},
"outputs": [],
"source": [
"plt.figure(figsize=(10,10))\n",
"plt.xticks(fontsize=12)\n",
"plt.yticks(fontsize=14)\n",
"plt.xlabel('Principal Component 1',fontsize=20)\n",
"plt.ylabel('Principal Component 2',fontsize=20)\n",
"plt.title(\"Principal Component Analysis of Archives\",fontsize=20)\n",
"targets = ['domesticated_animal', 'wild_animal', 'dead_animal', 'leather_object', 'precious_object', 'wool']\n",
"colors = ['red', 'orange', 'yellow', 'green', 'blue', 'violet']\n",
"# Plot one colored point cloud per archive class\n",
"for target, color in zip(targets,colors):\n",
"    indicesToKeep = known_reindexed.index[known_reindexed['archive_class'] == target].tolist()\n",
"    plt.scatter(principal_archive_Df.loc[indicesToKeep, 'principal component 1']\n",
"               , principal_archive_Df.loc[indicesToKeep, 'principal component 2'], c = color, s = 50)\n",
"\n",
"plt.legend(targets,prop={'size': 15})\n",
"\n",
"# import seaborn as sns\n",
"# plt.figure(figsize=(16,10))\n",
"# sns.scatterplot(\n",
"# x=\"principal component 1\", y=\"principal component 2\",\n",
"# hue=\"y\",\n",
"# palette=sns.color_palette(\"hls\", 10),\n",
"# data=principal_cifar_Df,\n",
"# legend=\"full\",\n",
"# alpha=0.3\n",
"# )"
]
},
{
"cell_type": "markdown",
"id": "176c95eb",
"metadata": {
"id": "pXxYqfyRLilb"
},
"source": [
"## 2 Simple Modeling Methods"
]
},
{
"cell_type": "markdown",
"id": "36cb6ba4",
"metadata": {
"id": "0NfR51-5M29w"
},
"source": [
"### 2.1 Logistic Regression\n",
"\n",
"Here we train logistic regression models to predict a tablet's archive from the features built in the previous subsection."
]
},
{
"cell_type": "markdown",
"id": "1614eadb",
"metadata": {
"id": "pwuS7jLwcReG"
},
"source": [
"#### 2.1.1 Logistic Regression by Archive\n",
"\n",
"Here we train and test a set of one-vs-all logistic regression classifiers, each of which attempts to classify a tablet as either belonging to a given archive or not."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c516db73",
"metadata": {
"id": "rGmHTOlvd-Nl"
},
"outputs": [],
"source": [
"clf_da = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_da.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'domesticated_animal'])\n",
"clf_da.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'domesticated_animal'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9536cdff",
"metadata": {
"id": "T1tHDbt8eBFZ"
},
"outputs": [],
"source": [
"clf_wa = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_wa.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wild_animal'])\n",
"clf_wa.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wild_animal'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66f815ff",
"metadata": {
"id": "eyr-ZmqReDZ_"
},
"outputs": [],
"source": [
"clf_dea = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_dea.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'dead_animal'])\n",
"clf_dea.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'dead_animal'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "362a3834",
"metadata": {
"id": "ibPqJEyjeEdJ"
},
"outputs": [],
"source": [
"clf_lo = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_lo.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'leather_object'])\n",
"clf_lo.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'leather_object'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59486eec",
"metadata": {
"id": "P4VI2NUXeFnL"
},
"outputs": [],
"source": [
"clf_po = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_po.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'precious_object'])\n",
"clf_po.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'precious_object'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a87a9db",
"metadata": {
"id": "RVStvp5aeGtD"
},
"outputs": [],
"source": [
"clf_w = LogisticRegression(random_state=42, solver='lbfgs', max_iter=200)\n",
"clf_w.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wool'])\n",
"clf_w.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'wool'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0cc28ad5",
"metadata": {
"id": "dL56mGZUTCTx"
},
"outputs": [],
"source": [
"known.loc[:, 'AN.bu.um':'šuʾura']"
]
},
{
"cell_type": "markdown",
"id": "b0929509",
"metadata": {
"id": "bFvtNRjlGtsS"
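},
"source": [
"As we can see, the domesticated_animal model has the lowest accuracy, while the leather_object, precious_object, and wool classifiers work fairly well. Each score is computed on the same tablets the classifier was fitted on, so it measures training accuracy."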
]
},
{
"cell_type": "markdown",
"id": "5616d480",
"metadata": {
"id": "YopZHGeAb9LB"
},
"source": [
"#### 2.1.2 Multinomial Logistic Regression\n",
"\n",
"Here we use multinomial logistic regression, since there are multiple archives into which each text could be classified. We fit the model on the tablets with known archives and then check its score on those same tablets to see how accurately it reproduces their labels.\n",
"\n",
"Finally, we append the logistic regression prediction as an archive prediction for the tablets without known archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d55e23b",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "G_Fx9IEBcqof",
"outputId": "56380118-5d77-4c2a-946d-d66d5e152273"
},
"outputs": [
{
"data": {
"text/plain": [
"0.6918291862811029"
]
},
"execution_count": 131,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"clf_archive = LogisticRegression(random_state=42, solver='lbfgs', max_iter=300)\n",
"clf_archive.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"log_reg_score = clf_archive.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"model_weights['LogReg'] = log_reg_score\n",
"log_reg_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6416122f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 529
},
"id": "BQtotAI5csuL",
"outputId": "64c9ab17-0f5b-4aa2-b599-f0a3a601f716"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
ki
\n",
"
kišib
\n",
"
udu
\n",
"
itud
\n",
"
mu
\n",
"
ud
\n",
"
ga
\n",
"
sila
\n",
"
šu
\n",
"
mu.DU
\n",
"
maškim
\n",
"
ekišibak
\n",
"
zabardab
\n",
"
u
\n",
"
maš
\n",
"
šag
\n",
"
lugal
\n",
"
mašgal
\n",
"
kir
\n",
"
a
\n",
"
en
\n",
"
ensik
\n",
"
egia
\n",
"
igikar
\n",
"
ŋiri
\n",
"
ragaba
\n",
"
dubsar
\n",
"
mašda
\n",
"
saŋŋa
\n",
"
amar
\n",
"
mada
\n",
"
akiti
\n",
"
lu
\n",
"
ab
\n",
"
gud
\n",
"
ziga
\n",
"
uzud
\n",
"
ašgar
\n",
"
gukkal
\n",
"
šugid
\n",
"
...
\n",
"
sikiduʾa
\n",
"
gudumdum
\n",
"
šuhugari
\n",
"
šutur
\n",
"
gaguru
\n",
"
nindašura
\n",
"
ekaskalak
\n",
"
usaŋ
\n",
"
nammah
\n",
"
egizid
\n",
"
nisku
\n",
"
gara
\n",
"
saŋ.DUN₃
\n",
"
muhaldimgal
\n",
"
šagiagal
\n",
"
šagiamah
\n",
"
kurunakgal
\n",
"
ugulaʾek
\n",
"
šidimgal
\n",
"
kalam
\n",
"
enkud
\n",
"
in
\n",
"
kiʾana
\n",
"
bahar
\n",
"
hurizum
\n",
"
lagab
\n",
"
ibadu
\n",
"
balla
\n",
"
šembulug
\n",
"
li
\n",
"
niŋsaha
\n",
"
ensi
\n",
"
archive
\n",
"
domesticated_animal
\n",
"
wild_animal
\n",
"
dead_animal
\n",
"
leather_object
\n",
"
precious_object
\n",
"
wool
\n",
"
LogReg Predicted Archive
\n",
"
\n",
"
\n",
"
pn
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
100217
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
...
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
{domesticated_animal, wild_animal}
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
domesticated_animal
\n",
"
\n",
"
\n",
"
100229
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
...
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
{domesticated_animal, wild_animal}
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
domesticated_animal
\n",
"
\n",
"
\n",
"
100284
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
...
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
{domesticated_animal, wild_animal}
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
domesticated_animal
\n",
"
\n",
"
\n",
"
100292
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0
\n",
"
"
],
"text/plain": [
" ki kišib udu ... precious_object wool LogReg Predicted Archive\n",
"pn ... \n",
"100217 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"100229 1.0 0.0 1.0 ... 0.0 0.0 domesticated_animal\n",
"100284 0.0 0.0 1.0 ... 0.0 0.0 domesticated_animal\n",
"100292 1.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"100301 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"... ... ... ... ... ... ... ...\n",
"519647 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519650 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519658 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519957 0.0 0.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"519959 1.0 1.0 0.0 ... 0.0 0.0 domesticated_animal\n",
"\n",
"[3243 rows x 1084 columns]"
]
},
"execution_count": 129,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"#Predictions for Unknown\n",
"unknown[\"LogReg Predicted Archive\"] = clf_archive.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe64bfed",
"metadata": {
"id": "D2fSDs6AcvWs"
},
"outputs": [],
"source": [
"known['archive_class'].unique()"
]
},
{
"cell_type": "markdown",
"id": "1d3769c3",
"metadata": {
"id": "gNWP2GzecAim"
},
"source": [
"### 2.2 K Nearest Neighbors\n",
"\n",
"Here we train a k-nearest-neighbors (KNN) classifier to predict archives from the features built in the previous subsection. We fit the model on the tablets with known archives and then check its score to see how accurate it is.\n",
"\n",
"We then append the KNN prediction as an archive prediction for the tablets without known archives.\n",
"\n",
"We also try different values of K (the number of neighbors taken into consideration when predicting for a tablet) to see how the accuracy changes. This is a simple form of hyperparameter tuning: we are looking for the K that gives the highest training accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8de92c39",
"metadata": {
"id": "5iHBQzoQc3_i"
},
"outputs": [],
"source": [
"#takes long time to run, so don't run again\n",
"list_k = [3, 5, 7, 9, 11, 13]\n",
"max_k, max_score = 0, 0\n",
"for k in list_k:\n",
" knn = KNeighborsClassifier(n_neighbors=k)\n",
" knn.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
" knn_score = knn.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
" print(\"Accuracy for k = %s: \" %(k), knn_score)\n",
" if max_score <= knn_score:\n",
" max_score = knn_score\n",
" max_k = k\n",
" "
]
},
{
"cell_type": "markdown",
"id": "a7c1116b",
"metadata": {
"id": "FeBx_TVR-Aww"
},
"source": [
"As we can see here, k = 5 and k = 9 have the best training accuracy, which is roughly in line with the training accuracy of the Logistic Regression classifier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3478dc3b",
"metadata": {
"id": "D5m3ZZIrcxun"
},
"outputs": [],
"source": [
"knn = KNeighborsClassifier(n_neighbors=max_k)\n",
"knn.fit(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"knn_score = knn.score(known.loc[:, 'AN.bu.um':'šuʾura'], known.loc[:, 'archive_class'])\n",
"model_weights['KNN'] = knn_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "239e3978",
"metadata": {
"id": "DXC8SgIyc09u"
},
"outputs": [],
"source": [
"#Predictions for Unknown\n",
"unknown[\"KNN Predicted Archive\"] = knn.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "markdown",
"id": "33e11888",
"metadata": {
"id": "Pmiq9-8N9Zys"
},
"source": [
"As we can see in the output from the previous cell, we can get different predictions depending on the classifier we choose."
]
},
{
"cell_type": "markdown",
"id": "4c58e4cb",
"metadata": {
"id": "JTPzHB1U-cXx"
},
"source": [
"Next we will split the data we have on tablets with known archives into a training set and a test set to further understand the accuracy of our models. For the next two sections, we will use `X_train` and `y_train` to train the models and `X_test` and `y_test` to test them. As the known set was split randomly, we presume that both the training and test sets are representative of the whole known set, so the two sets are reasonably comparable."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db472f06",
"metadata": {
"id": "-iAwXCVkc548"
},
"outputs": [],
"source": [
"#Split known into train and test, eventually predict with unknown \n",
"X_train, X_test, y_train, y_test = train_test_split(known.loc[:, 'AN.bu.um':'šuʾura'], \n",
" known.loc[:, 'archive_class'], \n",
" test_size=0.2,random_state=0) "
]
},
{
"cell_type": "markdown",
"id": "af0f7284",
"metadata": {
"id": "kWzkxLtOcFNv"
},
"source": [
"### 2.3 Naive Bayes\n",
"\n",
"Here we will train a Naive Bayes model to predict archives based on the features made in the previous subsection. We make the assumption that the features are independent of each other given the archive, which is where the descriptor _naive_ comes from. So:\n",
"\n",
"$$P(x_i|y; x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i| y)$$\n",
"\n",
"and:\n",
"\n",
"$$P(x_1, x_2, ..., x_n | y) = \\prod_{i=1}^{n} P(x_i | y)$$\n",
"\n",
"Moreover, we will be using Bayes' Law, which in this case states:\n",
"\n",
"$$P(y|x_1, x_2, ..., x_n) = \\frac{P(y)P(x_1, x_2, ..., x_n | y)}{P(x_1, x_2, ..., x_n)}$$\n",
"\n",
"i.e. the probability that a particular tablet (defined by features $x_1, x_2, ..., x_n$) is in archive $y$ is equal to the probability of getting a tablet from archive $y$, times the probability of getting that particular set of features given archive $y$, divided by the probability of getting that particular set of features $x_1, x_2, ..., x_n$.\n",
"\n",
"Applying our assumption of independence from before, we can simplify this to:\n",
"\n",
"$$P(y|x_1, x_2, ..., x_n) = \\frac{P(y)\\prod_{i=1}^{n} P(x_i | y)}{P(x_1, x_2, ..., x_n)}$$\n",
"\n",
"Since the denominator does not depend on $y$, the probability that a particular tablet (defined by features $x_1, x_2, ..., x_n$) is in archive $y$ is _proportional_ to the probability of getting a tablet from archive $y$ times the product of the probabilities of getting each feature $x_i$ given archive $y$:\n",
"\n",
"$$P(y|x_1, x_2, ..., x_n) \\propto P(y)\\prod_{i=1}^{n} P(x_i | y)$$\n",
"\n",
"We can then use this to calculate the maximizing archive.\n",
"\n",
"$$\\hat{y} = \\underset{y}{argmax} \\; P(y)\\prod_{i=1}^{n} P(x_i | y)$$\n",
"\n",
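"As a toy illustration with made-up numbers (two archives, $\\text{dom}$ and $\\text{wool}$, and two binary features; none of these values come from the data), suppose $P(\\text{dom}) = 0.8$, $P(\\text{wool}) = 0.2$, $P(x_1 = 1 | \\text{dom}) = 0.7$, $P(x_2 = 1 | \\text{dom}) = 0.5$, $P(x_1 = 1 | \\text{wool}) = 0.1$ and $P(x_2 = 1 | \\text{wool}) = 0.6$. For a tablet with $x_1 = 1$ and $x_2 = 1$:\n",
"\n",
"$$P(\\text{dom})\\prod_{i=1}^{2} P(x_i | \\text{dom}) = 0.8 \\times 0.7 \\times 0.5 = 0.28$$\n",
"\n",
"$$P(\\text{wool})\\prod_{i=1}^{2} P(x_i | \\text{wool}) = 0.2 \\times 0.1 \\times 0.6 = 0.012$$\n",
"\n",
"so $\\hat{y} = \\text{dom}$; dividing by the sum $0.28 + 0.012 = 0.292$ gives a posterior of about $0.96$ for $\\text{dom}$.\n",
"\n",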
"We are training two models where the first assumes the features are Gaussian random variables and the second assumes the features are Bernoulli random variables.\n",
"\n",
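"For reference, the standard per-feature likelihoods under these two assumptions are sketched below. The symbols $\\mu_{y,i}$, $\\sigma_{y,i}^2$ and $p_{y,i}$ are notation introduced here (not used elsewhere in this notebook) for the per-archive mean, variance, and presence probability of feature $x_i$, each estimated from the training tablets:\n",
"\n",
"$$\\text{Gaussian: } P(x_i | y) = \\frac{1}{\\sqrt{2\\pi\\sigma_{y,i}^2}} \\exp\\left(-\\frac{(x_i - \\mu_{y,i})^2}{2\\sigma_{y,i}^2}\\right)$$\n",
"\n",
"$$\\text{Bernoulli: } P(x_i | y) = p_{y,i}^{x_i}(1 - p_{y,i})^{1 - x_i}, \\quad x_i \\in \\{0, 1\\}$$\n",
"\n",
"If the features are 0/1 indicators of whether a word occurs on a tablet, the Bernoulli form matches them directly, whereas the Gaussian form treats them as continuous values.\n",
"\n",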
"We fit each model on the training split of the tablets with known archives and then check its score on the test split to see how accurate the model is.\n",
"\n",
"Finally, we append the two Naive Bayes predictions as archive predictions for the tablets without known archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9cea5277",
"metadata": {
"id": "tCIlLq7Jc7Hs"
},
"outputs": [],
"source": [
"#Gaussian\n",
"gauss = GaussianNB()\n",
"gauss.fit(X_train, y_train)\n",
"gauss_nb_score = gauss.score(X_test, y_test)\n",
"model_weights['GaussNB'] = gauss_nb_score\n",
"gauss_nb_score"
]
},
{
"cell_type": "markdown",
"id": "a4204922",
"metadata": {
"id": "BkK71TVzErST"
},
"source": [
"We can see that the Gaussian assumption does quite poorly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cbf74cda",
"metadata": {
"id": "Ei_I_lWMc9Ar"
},
"outputs": [],
"source": [
"#Predictions for Unknown\n",
"unknown[\"GaussNB Predicted Archive\"] = gauss.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00ac7360",
"metadata": {
"id": "DtJQjvWSc_Ym"
},
"outputs": [],
"source": [
"#Bernoulli\n",
"bern = BernoulliNB()\n",
"bern.fit(X_train, y_train)\n",
"bern_nb_score = bern.score(X_test, y_test)\n",
"model_weights['BernoulliNB'] = bern_nb_score\n",
"bern_nb_score"
]
},
{
"cell_type": "markdown",
"id": "8851e94c",
"metadata": {
"id": "k0ZneRNOEwcg"
},
"source": [
"However, the Bernoulli assumption does quite well."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8fae852",
"metadata": {
"id": "91Gk1c5EdAka"
},
"outputs": [],
"source": [
"#Predictions for Unknown\n",
"unknown[\"BernoulliNB Predicted Archive\"] = bern.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "markdown",
"id": "e5062b38",
"metadata": {
"id": "rlwzgIeicIB3"
},
"source": [
"### 2.4 SVM\n",
"\n",
"Here we will train a Support Vector Machine (SVM) with a linear kernel to predict archives based on the features made earlier in this section. We fit the model on the training split of the tablets with known archives and then check its accuracy on the test split.\n",
"\n",
"Finally, we append the SVM prediction as an archive prediction for the tablets without known archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95e9a98d",
"metadata": {
"id": "3EZL1pkKdCMq"
},
"outputs": [],
"source": [
"svm_archive = svm.SVC(kernel='linear')\n",
"svm_archive.fit(X_train, y_train)\n",
"y_pred = svm_archive.predict(X_test)\n",
"svm_score = metrics.accuracy_score(y_test, y_pred)\n",
"model_weights['SVM'] = svm_score\n",
"print(\"Accuracy:\", svm_score)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbb8faf0",
"metadata": {
"id": "OBXM8WfldEMn"
},
"outputs": [],
"source": [
"unknown[\"SVM Predicted Archive\"] = svm_archive.predict(unknown.loc[:, 'AN.bu.um':'šuʾura'])\n",
"unknown"
]
},
{
"cell_type": "markdown",
"id": "e806b8b7",
"metadata": {
"id": "ssrCDxoMOhZm"
},
"source": [
"## 3 Complex Modeling Methods"
]
},
{
"cell_type": "markdown",
"id": "f632803b",
"metadata": {
"id": "PbG6QBz36R5R"
},
"source": [
"## 4 Voting Mechanism Between Models\n",
"\n",
"Here we will use the models to determine which archive to assign to each tablet with an unknown archive.\n",
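"\n",
"The voting is weighted: each model votes for its predicted archive with a weight equal to the accuracy score stored for it in `model_weights`, and the archive with the largest total weight wins. As a worked illustration with made-up weights (not the scores obtained in this notebook), suppose LogReg and SVM each have weight $0.90$ and predict `domesticated_animal`, while KNN ($0.60$), GaussNB ($0.40$) and BernoulliNB ($0.70$) predict `wool`:\n",
"\n",
"$$\\text{domesticated animal}: 0.90 + 0.90 = 1.80, \\qquad \\text{wool}: 0.60 + 0.40 + 0.70 = 1.70$$\n",
"\n",
"In this illustration `domesticated_animal` wins even though only two of the five models voted for it, because their combined weight is larger.\n",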
"\n",
"We will then augment the words_df with these archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06649d31",
"metadata": {
"id": "A_Z48QeD93Jd"
},
"outputs": [],
"source": [
"model_weights"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b64cfde5",
"metadata": {
"id": "RMhkJmIJS1CL"
},
"outputs": [],
"source": [
"def visualize_archives(data, prediction_name):\n",
" archive_counts = data.value_counts()\n",
"\n",
"\n",
" plt.xlabel('Archive Class')\n",
" plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
" plt.title('Frequencies of ' + prediction_name + ' Predicted Archives')\n",
" plt.xticks(rotation=45)\n",
" plt.bar(archive_counts.index, archive_counts);\n",
"\n",
" percent_domesticated_animal = archive_counts['domesticated_animal'] / sum(archive_counts)\n",
"\n",
" print('Percent of texts in Domesticated Animal Archive:', percent_domesticated_animal)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2052a8cf",
"metadata": {
"id": "7-0f5WJTSJqO"
},
"outputs": [],
"source": [
"#Log Reg Predictions\n",
"visualize_archives(unknown['LogReg Predicted Archive'], 'Logistic Regression')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf1b0746",
"metadata": {
"id": "Qgsc2nTVSf4O"
},
"outputs": [],
"source": [
"#KNN Predictions\n",
"visualize_archives(unknown['KNN Predicted Archive'], 'K Nearest Neighbors')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0e301dd",
"metadata": {
"id": "CECF4iExSmjP"
},
"outputs": [],
"source": [
"#Gaussian Naive Bayes Predictions\n",
"visualize_archives(unknown['GaussNB Predicted Archive'], 'Gaussian Naive Bayes')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "50ed6f66",
"metadata": {
"id": "pNSBTgPBSwhj"
},
"outputs": [],
"source": [
"#Bernoulli Naive Bayes Predictions\n",
"visualize_archives(unknown['BernoulliNB Predicted Archive'], 'Bernoulli Naive Bayes')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f9a8b7a",
"metadata": {
"id": "27WesTVUUfDG"
},
"outputs": [],
"source": [
"#SVM Predictions\n",
"visualize_archives(unknown['SVM Predicted Archive'], 'Support Vector Machines')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3783714",
"metadata": {
"id": "0Xz6gARR-wq_"
},
"outputs": [],
"source": [
"def weighted_voting(row):\n",
" votes = {} # create empty voting dictionary\n",
" # tally votes\n",
" for model in row.index:\n",
" model_name = model[:-18] # remove ' Predicted Archive' from column name\n",
" prediction = row[model]\n",
" if prediction not in votes.keys():\n",
" votes[prediction] = model_weights[model_name] # if the prediction isn't in the list of voting categories, add it with a weight equal to the current model weight \n",
" else:\n",
" votes[prediction] += model_weights[model_name] # else, add model weight to the prediction\n",
" return max(votes, key=votes.get) # use the values to get the prediction with the greatest weight"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc0655b8",
"metadata": {
"id": "m6MQH0da-Aq-"
},
"outputs": [],
"source": [
"predicted_archives = unknown.loc[:, 'LogReg Predicted Archive':\n",
" 'SVM Predicted Archive'].copy() # get predictions\n",
"weighted_prediction = predicted_archives.apply(weighted_voting, axis=1) #apply voting mechanism on each row and return 'winning' prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7eece68",
"metadata": {
"id": "Q38uV36MRHmJ"
},
"outputs": [],
"source": [
"weighted_prediction[weighted_prediction != 'domesticated_animal']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "872b24af",
"metadata": {
"id": "PYuISOe9Gstd"
},
"outputs": [],
"source": [
"words_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "318fdedb",
"metadata": {
"id": "Pi4VugZGEsyZ"
},
"outputs": [],
"source": [
"# Series.append was removed in pandas 2.0; pd.concat builds the same combined Series\n",
"archive_class = pd.concat([known['archive_class'], weighted_prediction])\n",
"words_df['archive_class'] = words_df.apply(lambda row: archive_class[int(row['id_text'][1:])], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9e491b4",
"metadata": {
"id": "z22PcVk-Mw2o"
},
"outputs": [],
"source": [
"words_df"
]
},
{
"cell_type": "markdown",
"id": "9db15866",
"metadata": {
"id": "lPjcplQAQ8LX"
},
"source": [
"## 5 Sophisticated Naive Bayes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d6f9714",
"metadata": {
"id": "K60UB8wYGM6h"
},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"id": "0f844643",
"metadata": {
"id": "3RvDKrY0UmLb"
},
"source": [
"### 5.1 Feature and Model Creation"
]
},
{
"cell_type": "markdown",
"id": "6fae866b",
"metadata": {
"id": "Q-V9P376rbgB"
},
"source": [
"There are some nouns that are so closely associated with a specific archive that their presence in a text virtually guarantees that the text belongs to that archive. We will use this fact to create a training set for our classification model.\n",
"\n",
"The `labels` dictionary below contains the different archives along with their possible associated nouns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc95e73d",
"metadata": {
"id": "MdQv15BnSJVy"
},
"outputs": [],
"source": [
"labels = dict()\n",
"labels['domesticated_animal'] = ['ox', 'cow', 'sheep', 'goat', 'lamb', '~sheep', 'equid']\n",
"dom = '(' + '|'.join(labels['domesticated_animal']) + ')'\n",
"#split domesticated into large and small - sheep, goat, lamb, ~sheep would be small domesticated animals\n",
"labels['wild_animal'] = ['bear', 'gazelle', 'mountain', 'lion'] # account for 'mountain animal' and plural\n",
"wild = '(' + '|'.join(labels['wild_animal']) + ')'\n",
"labels['dead_animal'] = ['die'] # find 'die' before finding domesticated or wild\n",
"dead = '(' + '|'.join(labels['dead_animal']) + ')'\n",
"labels['leather_object'] = ['boots', 'sandals']\n",
"leath = '(' + '|'.join(labels['leather_object']) + ')'\n",
"labels['precious_object'] = ['copper', 'bronze', 'silver', 'gold']\n",
"prec = '(' + '|'.join(labels['precious_object']) + ')'\n",
"labels['wool'] = ['wool', '~wool', 'hair']\n",
"wool = '(' + '|'.join(labels['wool']) + ')'\n",
"complete = []\n",
"for lemma_list in labels.values():\n",
" complete = complete + lemma_list\n",
"tot = '(' + '|'.join(complete) + ')'\n",
"# labels['queens_archive'] = []"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df320e78",
"metadata": {
"id": "kNHCtQr8XY9v"
},
"outputs": [],
"source": [
"dom_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + dom + '.*\\]')]['id_text'])\n",
"wild_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + wild + '.*\\]')]['id_text'])\n",
"dead_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + dead + '.*\\]')]['id_text'])\n",
"leath_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + leath + '.*\\]')]['id_text'])\n",
"prec_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + prec + '.*\\]')]['id_text'])\n",
"wool_tabs = set(words_df.loc[words_df['lemma'].str.match('.*\\[.*' + wool + '.*\\]')]['id_text'])"
]
},
{
"cell_type": "markdown",
"id": "113a4957",
"metadata": {
"id": "eQRfFFosCTQr"
},
"source": [
"Each row of the `sparse` table below corresponds to one text, and the columns of the table correspond to the words that appear in the texts. Every cell contains the number of times a specific word appears in a certain text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c50679ba",
"metadata": {
"id": "GHyY6IbaRHYX"
},
"outputs": [],
"source": [
"# remove lemmas that are a part of a seal as well as words that are being used to determine training classes\n",
"filter = (~words_df['label'].str.contains('s')) | words_df['lemma'].str.match('.*\\[.*' + tot + '.*\\]')\n",
"sparse = words_df[filter].groupby(by=['id_text', 'lemma']).count()\n",
"sparse = sparse['id_word'].unstack('lemma')\n",
"sparse = sparse.fillna(0)\n",
"\n",
"#cleaning\n",
"del filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa52d78e",
"metadata": {
"id": "4etDIDVgS93i"
},
"outputs": [],
"source": [
"text_length = sparse.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"id": "919684c2",
"metadata": {
"id": "B7_coDiFCiHu"
},
"source": [
"If a text contains a word that is one of the designated nouns in `labels`, it is added to the set to be used for our ML model. Texts that do not contain any of these words or that contain words corresponding to more than one archive are ignored."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf620de3",
"metadata": {
"id": "e85qDgdYbejR"
},
"outputs": [],
"source": [
"class_array = []\n",
"\n",
"for id_text in sparse.index:\n",
" cat = None\n",
" number = 0\n",
" if id_text in dom_tabs:\n",
" number += 1\n",
" cat = 'dom'\n",
" if id_text in wild_tabs:\n",
" number += 1\n",
" cat = 'wild'\n",
" if id_text in dead_tabs:\n",
" number += 1\n",
" cat = 'dead'\n",
" if id_text in prec_tabs:\n",
" number += 1\n",
" cat = 'prec'\n",
" if id_text in wool_tabs:\n",
" number += 1\n",
" cat = 'wool'\n",
" if number == 1:\n",
" class_array.append(cat)\n",
" else:\n",
" class_array.append(None)\n",
"\n",
"class_series = pd.Series(class_array, sparse.index)"
]
},
{
"cell_type": "markdown",
"id": "1e443f84",
"metadata": {
"id": "ExKiUVAmDhB0"
},
"source": [
"Next we remove from `sparse` the columns for the label words that were used to assign the training classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d196841f",
"metadata": {
"id": "Hj8bgltlEdzM"
},
"outputs": [],
"source": [
"used_cols = []\n",
"\n",
"for col in sparse.columns:\n",
" if re.match('.*\\[.*' + tot + '.*\\]', col):\n",
" used_cols.append(col)\n",
" #elif re.match('.*PN$', col) is None:\n",
" # used_cols.append(col)\n",
"\n",
"sparse = sparse.drop(used_cols, axis=1)"
]
},
{
"cell_type": "markdown",
"id": "53e5f548",
"metadata": {
"id": "ZFhS09xHCskW"
},
"source": [
"Now the `sparse` table will be updated to contain the relative frequency of each word in its text (scaled to occurrences per 1,000 words) rather than the raw number of occurrences. This will allow us to better compare frequencies across texts of different lengths."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7745e6fb",
"metadata": {
"id": "ArDjsWRSYqX1"
},
"outputs": [],
"source": [
"for col in sparse.columns:\n",
" if col != 'text_length':\n",
" sparse[col] = sparse[col]/text_length*1000"
]
},
{
"cell_type": "markdown",
"id": "3960c041",
"metadata": {
"id": "Vp0R9mq1DN5F"
},
"source": [
"We must round the scaled frequencies from the previous cell to integers for the ML model to work properly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2d7b552",
"metadata": {
"id": "JnzN2SXG5QXJ"
},
"outputs": [],
"source": [
"sparse = sparse.round()\n",
"sparse = sparse.astype(int)"
]
},
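{
"cell_type": "markdown",
"id": "e4b7a1d0",
"metadata": {},
"source": [
"As a small worked illustration of the scaling and rounding steps above, using made-up counts (not taken from the corpus): suppose one lemma occurs 3 times in a text whose `text_length` is 40, and another occurs once in a text whose `text_length` is 7. Then:\n",
"\n",
"$$\\frac{3}{40} \\times 1000 = 75, \\qquad \\frac{1}{7} \\times 1000 \\approx 142.86 \\rightarrow 143$$\n",
"\n",
"Both values are now on the same per-1,000-words scale, so they can be compared regardless of how long each text is."
]
},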
{
"cell_type": "markdown",
"id": "2c41d431",
"metadata": {
"id": "doDVY-klDS4f"
},
"source": [
"To form X, we reduce the `sparse` table to only contain texts that were designated for use above in `class_series`. Y consists of the names of the different archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f34c355",
"metadata": {
"id": "l8TINhIO9ZuT"
},
"outputs": [],
"source": [
"X = sparse.loc[class_series.dropna().index]\n",
"X = X.drop(X.loc[X.sum(axis=1) == 0, :].index, axis=0)\n",
"y = class_series[X.index]"
]
},
{
"cell_type": "markdown",
"id": "140979dc",
"metadata": {
"id": "LKvePiz_Dus9"
},
"source": [
"Our data is split into a training set and a test set. The ML model first uses the training set to learn how to predict the archives for the texts. Afterwards, the test set is used to verify how well our ML model works."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2211eb3",
"metadata": {
"id": "o2VLiYkC1JoH"
},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, \n",
" random_state = 9)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6732b73",
"metadata": {
"id": "NrEWHgdxfyP_"
},
"outputs": [],
"source": [
"pipe = Pipeline([\n",
" ('feature_reduction', SelectPercentile(score_func = f_classif)), \n",
" ('weighted_multi_nb', MultinomialNB())\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed7a3756",
"metadata": {
"id": "TwK16rD1obnS"
},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"f = GridSearchCV(pipe, {\n",
" 'feature_reduction__percentile' : [i*10 for i in range(1, 10)],\n",
" 'weighted_multi_nb__alpha' : [i/10 for i in range(1, 10)]\n",
" }, verbose = 0, n_jobs = -1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5bfbfc18",
"metadata": {
"id": "AxpnC26SiNS9"
},
"outputs": [],
"source": [
"f.fit(X_train, y_train);"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3cfe2cfa",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "uShD8bbwkezW",
"outputId": "e47c8212-be72-45f4-ebe0-66a217597586"
},
"outputs": [
{
"data": {
"text/plain": [
"{'feature_reduction__percentile': 70, 'weighted_multi_nb__alpha': 0.1}"
]
},
"execution_count": 117,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.best_params_"
]
},
{
"cell_type": "markdown",
"id": "bebb464e",
"metadata": {
"id": "-fjlG0-vD5nC"
},
"source": [
"The best mean cross-validated accuracy found by the grid search on the training set is about 93.6%."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7cc8bf2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "VDDiXkBinXcM",
"outputId": "98476187-bed2-42a6-9782-1948d59fbfd4"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9359404096834265"
]
},
"execution_count": 118,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.best_score_"
]
},
{
"cell_type": "markdown",
"id": "881e2374",
"metadata": {
"id": "OyCwMd9SD_I6"
},
"source": [
"The accuracy on the test set is very similar at about 93.2%, which is good because it suggests that our model isn't overfitted to the training set."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11588239",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "N3hcN7yvSIo-",
"outputId": "9dac856d-e097-455b-b020-90aec50ede0e"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9321229050279329"
]
},
"execution_count": 119,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67a80b76",
"metadata": {
"id": "1AYleoeySIR2"
},
"outputs": [],
"source": [
"predicted = f.predict(sparse)"
]
},
{
"cell_type": "markdown",
"id": "0f2501f9",
"metadata": {
"id": "HdQk5tjvEJJh"
},
"source": [
"The `predicted_df` table is the same as the `sparse` table from above, except that we have added an extra column at the end named `prediction`. `prediction` contains our ML model's classification of which archive the text belongs to based on the frequency of the words that appear."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d89cb636",
"metadata": {
"id": "TKu-QXULG6Gm"
},
"outputs": [],
"source": [
"predicted_df = sparse.copy()\n",
"predicted_df['prediction'] = predicted"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fc2c716",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 504
},
"id": "4hNchgn0HFrJ",
"outputId": "d2cd5ef7-7d84-49a2-bef1-efefd812dc9d"
},
"outputs": [
{
"data": {
"text/html": [
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
48
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dead
\n",
"
\n",
"
\n",
"
P100190
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
33
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
100
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dom
\n",
"
\n",
"
\n",
"
P100191
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
48
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dead
\n",
"
\n",
"
\n",
"
P100211
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
30
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
152
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dead
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
P519650
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dom
\n",
"
\n",
"
\n",
"
P519658
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
wild
\n",
"
\n",
"
\n",
"
P519792
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
155
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dom
\n",
"
\n",
"
\n",
"
P519957
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
71
\n",
"
0
\n",
"
71
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dom
\n",
"
\n",
"
\n",
"
P519959
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
dead
\n",
"
\n",
" \n",
"
\n",
"
15132 rows × 9174 columns
\n",
"
"
],
"text/plain": [
"lemma $AN[NA]NA $GIR[NA]NA $HI[NA]NA ... Ṭabši[0]SN Ṭahili[0]PN prediction\n",
"id_text ... \n",
"P100041 0 0 0 ... 0 0 dom\n",
"P100189 0 0 0 ... 0 0 dead\n",
"P100190 0 0 0 ... 0 0 dom\n",
"P100191 0 0 0 ... 0 0 dead\n",
"P100211 0 0 0 ... 0 0 dead\n",
"... ... ... ... ... ... ... ...\n",
"P519650 0 0 0 ... 0 0 dom\n",
"P519658 0 0 0 ... 0 0 wild\n",
"P519792 0 0 0 ... 0 0 dom\n",
"P519957 0 0 0 ... 0 0 dom\n",
"P519959 0 0 0 ... 0 0 dead\n",
"\n",
"[15132 rows x 9174 columns]"
]
},
"execution_count": 124,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"predicted_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a31a9350",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "i7ImpZnL3U5C",
"outputId": "6da08322-0707-491a-c4b4-dd067b56a4fc"
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['P100041', 'P100189', 'P100190', 'P100191', 'P100211', 'P100214',\n",
" 'P100215', 'P100217', 'P100218', 'P100219',\n",
" ...\n",
" 'P519534', 'P519613', 'P519623', 'P519624', 'P519647', 'P519650',\n",
" 'P519658', 'P519792', 'P519957', 'P519959'],\n",
" dtype='object', name='id_text', length=15132)"
]
},
"execution_count": 125,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"predicted_df.index"
]
},
{
"cell_type": "markdown",
"id": "6e77a951",
"metadata": {
"id": "p5U8vZ2iOTlZ"
},
"source": [
"### 5.4 Testing the Model on Hand-Classified Data\n",
"\n",
"Here we first use our same ML model from before on Niek's hand-classified texts from the wool archive. Testing our ML model on these tablets gives us 82.5% accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d20c460",
"metadata": {
"id": "iCJi2jZZzZeS"
},
"outputs": [],
"source": [
"wool_hand_tabs = set(pd.read_csv('drive/MyDrive/SumerianNetworks/JupyterBook/Outputs/wool_pid.txt',header=None)[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32aebe60",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5YioGFV12lao",
"outputId": "f9ea6d73-5aa9-4f35-ccba-5cfe411b0d48"
},
"outputs": [
{
"data": {
"text/plain": [
"0.8253968253968254"
]
},
"execution_count": 126,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"hand_wool_frame = sparse.loc[wool_hand_tabs].loc[class_series.isna() == True]\n",
"\n",
"f.score(X = hand_wool_frame, \n",
" y = pd.Series(\n",
" index = hand_wool_frame.index, \n",
" data = ['wool' for i in range(0, hand_wool_frame.shape[0])] ))"
]
},
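{
"cell_type": "markdown",
"id": "e4a1c7b2",
"metadata": {},
"source": [
"Because every tablet in `hand_wool_frame` is labeled `wool`, the score above is simply the share of those tablets that the model assigns to the wool archive. The following is a minimal sketch of that relationship; the `predictions` list is made up for illustration and is not output of the model `f`.\n",
"\n",
"```python\n",
"# Hypothetical predictions for ten hand-classified wool tablets.\n",
"predictions = ['wool', 'wool', 'dom', 'wool', 'wool',\n",
"               'wool', 'prec', 'wool', 'wool', 'wool']\n",
"true_labels = ['wool'] * len(predictions)\n",
"\n",
"# Accuracy is the fraction of positions where prediction and label agree.\n",
"matches = sum(p == t for p, t in zip(predictions, true_labels))\n",
"accuracy = matches / len(predictions)\n",
"\n",
"# With a single true class, accuracy equals the recall for that class.\n",
"wool_share = predictions.count('wool') / len(predictions)\n",
"\n",
"print(accuracy, wool_share)\n",
"```"
]
},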
{
"cell_type": "markdown",
"id": "fc053bed",
"metadata": {
"id": "1lgQ_iMjG4Jt"
},
"source": [
"Testing our ML model on 100 random hand-classified tablets selected from among all the texts gives us 87.2% accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4088464f",
"metadata": {
"id": "I4xccyU3VDuQ"
},
"outputs": [],
"source": [
"niek_100_random_tabs = pd.read_pickle('/content/drive/MyDrive/niek_cats').dropna()\n",
"niek_100_random_tabs = niek_100_random_tabs.set_index('pnum')['category_text']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eba4db0a",
"metadata": {
"id": "5skhfKRDkrCB"
},
"outputs": [],
"source": [
"random_frame = sparse.loc[set(niek_100_random_tabs.index)]\n",
"random_frame['result'] = niek_100_random_tabs[random_frame.index]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64e86851",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TRl1Pq1BlAhY",
"outputId": "a18af6f2-6853-4024-a9e9-095ed998cf5e"
},
"outputs": [
{
"data": {
"text/plain": [
"0.872093023255814"
]
},
"execution_count": 188,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X=random_frame.drop(labels='result', axis=1), y = random_frame['result'])"
]
},
{
"cell_type": "markdown",
"id": "f932b71f",
"metadata": {
"id": "viioWwWRG9Wq"
},
"source": [
"A large majority of the tablets are part of the domestic archive and have been classified as such."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73d4b784",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tzDgn3nbl-VH",
"outputId": "b86adeed-15ed-4555-c3ee-919c6726acaa"
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"['dead', 'dom', 'dead', 'dom', 'dom', 'dom', 'dead', 'dom', 'dead',\n",
" 'dom', 'dead', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'wild', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'dom', 'dead', 'dom', 'dom', 'dom', 'dom', 'wild',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'wild', 'dom',\n",
" 'dom', 'dom', 'dom', 'wild', 'dom', 'dead', 'dom', 'dead', 'dead',\n",
" 'dom', 'dead', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom']\n",
"Length: 86, dtype: object"
]
},
"execution_count": 190,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"random_frame['result'].array"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40dbac07",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_p29Uif4lpja",
"outputId": "195b73b0-ce3d-4ca0-81ac-9102fdb75830"
},
"outputs": [
{
"data": {
"text/plain": [
"array(['dead', 'dom', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dead', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'wild', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom', 'dom', 'dom',\n",
" 'wild', 'dom', 'dom', 'dom', 'wild', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'dom', 'dead', 'dom',\n",
" 'dom', 'dom', 'dead', 'dom', 'dom', 'dom', 'dom', 'dom', 'dom',\n",
" 'dom', 'dom', 'dom', 'dom', 'dom'], dtype='"
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
"plt.title('Frequencies of Predicted Archive Classes in All Tablets')\n",
"plt.xticks(rotation=45)\n",
"labels = list(set(predicted_df['prediction']))\n",
"counts = [predicted_df.loc[predicted_df['prediction'] == label].shape[0] for label in labels]\n",
"plt.bar(labels, counts);"
]
},
{
"cell_type": "markdown",
"id": "5f7604c7",
"metadata": {
"id": "rG726vEEHQ3U"
},
"source": [
"The below chart displays the actual frequencies of the different archives in the test set. As mentioned previously, it is visually obvious that there are many texts in the domestic archive, with comparatively very few texts in all of the other archives."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85909ed0",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 310
},
"id": "-tiqpepZG7Og",
"outputId": "13c2afdc-8886-4e38-949e-8124a355c137"
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
"plt.title('Frequencies of Test Archive Classes')\n",
"plt.xticks(rotation=45)\n",
"test_counts = [(class_series[X_test.index])[class_series == label].count() for label in labels]\n",
"plt.bar(labels, np.asarray(test_counts));"
]
},
{
"cell_type": "markdown",
"id": "8d0f0054",
"metadata": {
"id": "kYdnsxgYHW8a"
},
"source": [
"Below is a chart of the predicted frequency of the different archives by our ML model in the test set. Our predicted frequency looks very similar to the actual frequency above, which is good."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56ad4019",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 310
},
"id": "v6b1AStxRIW7",
"outputId": "59bc33f3-94cd-47b7-8f83-ef8a76fb7d03"
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Frequency', rotation=0, labelpad=30)\n",
"plt.title('Frequencies of Predicted Test Archive Classes')\n",
"plt.xticks(rotation=45)\n",
"test_pred_counts = [predicted_df.loc[X_test.index].loc[predicted_df['prediction'] == label].shape[0] for label in labels]\n",
"plt.bar(labels, np.asarray(test_pred_counts));"
]
},
{
"cell_type": "markdown",
"id": "7b51ad25",
"metadata": {
"id": "GeVOwslAH1pP"
},
"source": [
"Unfortunately, since our texts skew so heavily towards being part of the domestic archive, most of the other archives end up being overpredicted (i.e. our model says a text is part of that archive when it is actually not). Below we can see that the domestic archive is the only archive whose texts are not overpredicted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "739ae7ef",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 310
},
"id": "RRCs5wEESDVZ",
"outputId": "8db2a17f-b490-4af8-d257-8944be46d747"
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"plt.xlabel('Archive Class')\n",
"plt.ylabel('Rate', rotation=0, labelpad=30)\n",
"plt.title('Rate of Overprediction by Archive')\n",
"plt.xticks(rotation=45)\n",
"rate = np.asarray(test_pred_counts)/np.asarray(test_counts)*sum(test_counts)/sum(test_pred_counts)\n",
"plt.bar(labels, rate);"
]
},
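{
"cell_type": "markdown",
"id": "b7d2f9a4",
"metadata": {},
"source": [
"The rate plotted above divides each archive's predicted count by its actual count and rescales by the totals; values above 1 mean the archive is predicted more often than it occurs. The following is a small sketch of the same calculation, assuming two invented label lists (`actual` and `predicted` are hypothetical and not taken from the tablets).\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Hypothetical true and predicted archive labels for eight tablets.\n",
"actual    = ['dom', 'dom', 'dom', 'dom', 'dom', 'dom', 'dead', 'wild']\n",
"predicted = ['dom', 'dom', 'dom', 'dom', 'dead', 'wild', 'dead', 'wild']\n",
"\n",
"actual_counts = Counter(actual)\n",
"predicted_counts = Counter(predicted)\n",
"\n",
"# Ratio of predicted to actual frequency per archive.\n",
"# Both lists have the same length, so no rescaling by totals is needed here.\n",
"rate = {label: predicted_counts[label] / actual_counts[label]\n",
"        for label in sorted(actual_counts)}\n",
"\n",
"print(rate)\n",
"```\n",
"\n",
"In this toy sample the large class (`dom`) falls below 1 while the small classes rise above 1, which is the same pattern the chart shows."
]
},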
{
"cell_type": "markdown",
"id": "c083e610",
"metadata": {
"id": "VCkyV5guUAVw"
},
"source": [
"### 5.3 Accuracy By Archive\n",
"\n",
"The accuracies for the dead and wild archives are relatively low. This is likely because those texts are being misclassified into the domestic archive, our largest archive, since all three of these archives deal with animals. The wool and precious archives have decent accuracies."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19a813e6",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "OZmI6aR3G58T",
"outputId": "b25b85b6-b6a3-47a0-d741-a98c128283a8"
},
"outputs": [
{
"data": {
"text/plain": [
"0.734375"
]
},
"execution_count": 169,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'dead'], y_test[class_series == 'dead'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "266a4a78",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HbRZc7FSTt8B",
"outputId": "20d98cb3-8354-4c33-eda9-c66f6d487af4"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9449010654490106"
]
},
"execution_count": 170,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'dom'], y_test[class_series == 'dom'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dc2a622",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "23aREE8sTxKt",
"outputId": "3eee05bc-0955-483f-ca5b-ca2094f3720c"
},
"outputs": [
{
"data": {
"text/plain": [
"0.7410071942446043"
]
},
"execution_count": 171,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'wild'], y_test[class_series == 'wild'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9378b0e5",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3pE1BBM5T2n_",
"outputId": "04fb818b-8814-4fd1-9584-2579f183e476"
},
"outputs": [
{
"data": {
"text/plain": [
"0.8333333333333334"
]
},
"execution_count": 172,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'wool'], y_test[class_series == 'wool'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59399459",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "LiVNsk8lT5qw",
"outputId": "6a8e0e52-f860-4a35-90e3-820d71927a9b"
},
"outputs": [
{
"data": {
"text/plain": [
"0.9264705882352942"
]
},
"execution_count": 173,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"f.score(X_test[class_series == 'prec'], y_test[class_series == 'prec'])"
]
},
{
"cell_type": "markdown",
"id": "6f0be9f0",
"metadata": {
"id": "NCBvhrQkBawe"
},
"source": [
"We can also look at the confusion matrix. A confusion matrix is used to evaluate the accuracy of a classification. The rows denote the actual archive, while the columns denote the predicted archive. \n",
"\n",
"Looking at the first column: \n",
"- 73.44% of the dead archive texts are predicted correctly\n",
"- 1.31% of the domestic archive texts are predicted to be part of the dead archive\n",
"- 1.47% of the wild archive texts are predicted to be part of the dead archive\n",
"- 1.43% of the wool archive texts are predicted to be part of the dead archive\n",
"- none of the precious archive texts are predicted to be part of the dead archive"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4241129f",
"metadata": {
"id": "FMszAvyUBaR5"
},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"archive_confusion = confusion_matrix(y_test, f.predict(X_test), normalize='true')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f03d0b21",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "50GlA0txCLgq",
"outputId": "f055525f-a3ec-41e5-c650-dabc6061b778"
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.734375 , 0.171875 , 0.015625 , 0.0625 , 0.015625 ],\n",
" [0.0130898 , 0.94490107, 0.00487062, 0.03531202, 0.00182648],\n",
" [0.01470588, 0.05882353, 0.92647059, 0. , 0. ],\n",
" [0.01438849, 0.21582734, 0.02158273, 0.74100719, 0.00719424],\n",
" [0. , 0.08333333, 0.08333333, 0. , 0.83333333]])"
]
},
"execution_count": 175,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"archive_confusion"
]
},
{
"cell_type": "markdown",
"id": "b21d3325",
"metadata": {
"id": "Vm1vDZPGIr1f"
},
"source": [
"This is the same confusion matrix converted into real numbers of texts. Since the number of domestic archive texts is so high, even a small bit of misclassification of the domestic archive texts can overwhelm the other archives.\n",
"\n",
"For example, even though only 1.3% of the domestic archive texts are predicted to be part of the dead archive, that corresponds to 43 texts, while the 73% of the dead archive texts that were predicted correctly correspond to just 47 texts. As a result, about half of the texts that were predicted to be part of the dead archive are incorrectly classified."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54b5662d",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HSDHsdWRDy_e",
"outputId": "cb72fa2a-8899-4c4e-95ac-32df77871d06"
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 47, 11, 1, 4, 1],\n",
" [ 43, 3104, 16, 116, 6],\n",
" [ 1, 4, 63, 0, 0],\n",
" [ 2, 30, 3, 103, 1],\n",
" [ 0, 2, 2, 0, 20]])"
]
},
"execution_count": 176,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"confusion_matrix(y_test, f.predict(X_test), normalize=None)"
]
},
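{
"cell_type": "markdown",
"id": "c9e5a3d1",
"metadata": {},
"source": [
"The observation that about half of the texts predicted as dead are misclassified is a statement about the precision of the dead class: the diagonal entry divided by its column sum. The sketch below recomputes recall and precision from the counts printed above. It assumes that rows and columns follow scikit-learn's default sorted label order (dead, dom, prec, wild, wool), which agrees with the per-archive accuracies reported earlier; the name `archive_names` is introduced here only for illustration.\n",
"\n",
"```python\n",
"# Counts copied from the confusion matrix above (rows: actual, columns: predicted).\n",
"archive_names = ['dead', 'dom', 'prec', 'wild', 'wool']  # assumed sorted label order\n",
"counts = [\n",
"    [47,   11,  1,   4,  1],\n",
"    [43, 3104, 16, 116,  6],\n",
"    [ 1,    4, 63,   0,  0],\n",
"    [ 2,   30,  3, 103,  1],\n",
"    [ 0,    2,  2,   0, 20],\n",
"]\n",
"\n",
"row_sums = [sum(row) for row in counts]        # actual texts per archive\n",
"col_sums = [sum(col) for col in zip(*counts)]  # predicted texts per archive\n",
"\n",
"for i, name in enumerate(archive_names):\n",
"    recall = counts[i][i] / row_sums[i]     # share of actual texts that are found\n",
"    precision = counts[i][i] / col_sums[i]  # share of predictions that are correct\n",
"    print(f'{name}: recall={recall:.3f}, precision={precision:.3f}')\n",
"```\n",
"\n",
"Recall reproduces the per-archive accuracies, while precision shows how far a prediction for a small archive can be trusted."
]
},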
{
"cell_type": "markdown",
"id": "84759158",
"metadata": {
"id": "G5YG_CKScVNB"
},
"source": [
"## 6 Save Results in CSV file & Pickle"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}