II. N-gram Neighbors of the Proper Names (PN)
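Section II adds context to each proper name (PN) by collecting the lemmas that appear near it on the same tablet.
It also flags lemmas that denote professions, roles, and family relationships, using the lists defined below.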
# import necessary libraries
import pandas as pd
from tqdm.auto import tqdm
# import libraries for this section
import re
1 Find Neighbors
Below we load the filtered dataframe from Part I so that we can manipulate it and add the neighbors column.
By default it is read from the online copy. If you have a local copy of the words_df dataframe from the previous section, you can uncomment the first line to load that instead of running Part I.
#words_df = pd.read_pickle('output/part_1_output.p') # uncomment to read from local file
words_df = pd.read_pickle('https://gitlab.com/yashila.bordag/sumnet-data/-/raw/main/part_1_output.p') # read from online file (default)
# Lists of professions, roles, and family relationships
professions = [ "aʾigidu[worker]",
"abala[water-drawer]",
"abrig[functionary]",
"ad.KID[weaver]",
"agaʾus[soldier]",
"arad[slave]",
"ašgab[leatherworker]",
"aʾua[musician]",
"azlag[fuller]",
"bahar[potter]",
"bisaŋdubak[archivist]",
"damgar[merchant]",
"dikud[judge]",
"dubsar[scribe]",
"en[priest]",
"erešdiŋir[priestess]",
"ensik[ruler]",
"engar[farmer]",
"enkud[tax-collector]",
"gabaʾaš[courier]",
"galamah[singer]",
"gala[singer]",
"geme[worker]",
"gudug[priest]",
"guzala[official]",
"idu[doorkeeper]",
"išib[priest]",
"kaguruk[supervisor]",
"kaš[runner]",
"kiŋgia[messenger]",
"kinda[barber]",
"kinkin[miller]",
"kiridab[driver]",
"kurušda[fattener]",
"kuš[official]",
"lu[person]",
"lugal[king]",
"lukur[priestess]",
"lungak[brewer]",
"malah[sailor]",
"muhaldim[cook]",
"mušendu[bird-catcher]",
"nagada[herdsman]",
"nagar[carpenter]",
"nar[musician]",
"nargal[musician]",
"narsa[musician]",
"nin[lady]",
"nubanda[overseer]",
"nukirik[horticulturalist]",
"saŋ.DUN₃[recorder]",
"saŋŋa[official]",
"simug[smith]",
"sipad[shepherd]",
"sukkal[secretary]",
"šabra[administrator]",
"šagia[cup-bearer]",
"šakkanak[general]",
# "szej[cook]", this is a verb
"šidim[builder]",
"šuʾi[barber]",
"šukud[fisherman]",
"tibira[sculptor]",
"ugula[overseer]",
"unud[cowherd]",
# "urin[guard]",
"UN.IL₂[menial]",
"ušbar[weaver]",
"zabardab[official]",
"zadim[stone-cutter]"]
roles = ['ki[source]', 'maškim[administrator]',
'maškim[authorized]', 'i3-dab5[recipient]', 'giri3[intermediary]']
family = ['šeš[brother]', 'szesz[brother]', 'dumu[son]', 'dumu-munus[daughter]',
'dumumunus[daughter]' , 'dam[spouse]']
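Lemmas in words_df carry a part-of-speech suffix after the closing bracket (for example `udu[sheep]N` or `Lusuen[0]PN`), while the entries in the lists above stop at the bracket. The following is a minimal sketch of how a lemma can be reduced to that shorter form before it is compared with a list. The helper name `lemma_key` and the three-entry `sample_professions` list are illustrative only and are not part of the notebook.

```python
import re

# Hypothetical helper: keep everything before the first ']' and put the bracket back.
def lemma_key(lemma):
    return re.match(r'^[^\]]*', lemma)[0] + ']'

# Small illustrative subset of the professions list.
sample_professions = ["dubsar[scribe]", "lugal[king]", "sipad[shepherd]"]

for lemma in ["dubsar[scribe]N", "Lusuen[0]PN", "break"]:
    key = lemma_key(lemma)
    print(lemma, '->', key, '| in list:', key in sample_professions)
```

A lemma without brackets, such as `break`, simply gets a `]` appended and therefore never matches a list entry.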
def n_neighbors(data, n):
    # create list to return; non-proper names keep their empty lists
    # (assumes data has the default 0..len-1 index, so index labels double as positions)
    n_neighbors_list = [[] for _ in range(len(data))]
    # find the index of every PN lemma
    PN_index = data[data['lemma'].str.contains("PN")].index
    # go through each PN, find its neighbors on the same tablet and add them to the list
    for i in tqdm(PN_index, desc='N Neighbors'):
        # find all lemma rows from the same tablet
        group_of_same_pnumber = data[data['id_text'] == data.loc[i, 'id_text']]
        # keep the rows within n lines before or after the PN's line (its own line included)
        line = data.loc[i, 'id_line']
        group_of_n_lines_befaf = group_of_same_pnumber[
            group_of_same_pnumber['id_line'].between(line - n, line + n)]
        # create list of neighboring lemmas and remove every 'break'
        lemma_neighbors = [lemma for lemma in group_of_n_lines_befaf['lemma'] if lemma != 'break']
        # add to final list
        n_neighbors_list[i] = lemma_neighbors
    return n_neighbors_list
# Flag each lemma that appears in one of the lists; the lemma is first cut back to the form "word[gloss]"
words_df['prof?'] = words_df['lemma'].apply(lambda word: 'Yes' if (re.match(r'^[^\]]*', word)[0] + ']') in professions else 'No')
words_df['role?'] = words_df['lemma'].apply(lambda word: 'Yes' if (re.match(r'^[^\]]*', word)[0] + ']') in roles else 'No')
words_df['family?'] = words_df['lemma'].apply(lambda word: 'Yes' if (re.match(r'^[^\]]*', word)[0] + ']') in family else 'No')
# Create "number?" to mark rows whose lemma contains 'NU' (a number); a number could imply that the next row is a commodity
words_df['number?'] = words_df['lemma'].str.contains('NU')
words_df['number?'] = ['Yes' if words_df['number?'][i] == True else 'No' for i in words_df.index]
# "commodity?" is 'No' for the first row; every other row takes its own "number?" value (not the previous row's)
words_df['commodity?'] = ['No'] + ['Yes' if words_df['number?'][i] == 'Yes' else 'No' for i in words_df.index[1:]]
words_df
| | lemma | id_text | id_line | id_word | label | date | dates_references | publication | collection | museum_no | ftype | metadata_source | prof? | role? | family? | number? | commodity? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6(diš)[]NU | P100041 | 3 | P100041.3.1 | o 1 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | Yes | No |
| 1 | udu[sheep]N | P100041 | 3 | P100041.3.2 | o 1 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No |
| 2 | kišib[seal]N | P100041 | 4 | P100041.4.1 | o 2 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No |
| 3 | Lusuen[0]PN | P100041 | 4 | P100041.4.2 | o 2 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No |
| 4 | ki[place]N | P100041 | 5 | P100041.5.1 | o 3 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 594695 | gud[ox]N | P481395 | 31 | P481395.31.2 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | No | No |
| 594696 | 1(diš)[]NU | P481395 | 31 | P481395.31.3 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | Yes | Yes |
| 594697 | anše[equid]N | P481395 | 31 | P481395.31.4 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | No | No |
| 594698 | Šusuen[1]RN | P517012 | 3 | P517012.3.1 | a 1 | | | CDLI Seals 013964 (composite) | | | | ORACC | No | No | No | No | No |
| 594699 | Abumilum[0]PN | P517012 | 4 | P517012.4.1 | a 2 | | | CDLI Seals 013964 (composite) | | | | ORACC | No | No | No | No | No |
594700 rows × 17 columns
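The comment in the cell above suggests that a number implies the next row is a commodity. A minimal sketch of that idea follows, using a toy list of lemmas taken from the first rows of the table. The names `lemmas`, `is_number`, and `follows_number` are illustrative, and testing for a trailing `NU` tag is an assumption that is stricter than the notebook's `str.contains('NU')`.

```python
# Toy lemma sequence copied from the first rows of the table above.
lemmas = ["6(diš)[]NU", "udu[sheep]N", "kišib[seal]N", "Lusuen[0]PN"]

# Assumption: a row counts as a number when its lemma ends with the NU tag.
is_number = ["Yes" if lemma.endswith("NU") else "No" for lemma in lemmas]

# Shift the number flags down by one row; the first row has no predecessor.
follows_number = ["No"] + is_number[:-1]

for lemma, num, nxt in zip(lemmas, is_number, follows_number):
    print(f"{lemma:15} number?={num:3} follows a number?={nxt}")
```

On a dataframe column the same shift can be written with pandas' `Series.shift(1, fill_value='No')`. This simple version does not reset at tablet boundaries, so the first row of a tablet can inherit a flag from the last row of the previous one.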
The next code block takes a very long time to run.
#call n_neighbor function to get neighbors from two lines above and below
words_df['neighbors'] = n_neighbors(words_df, 2)
words_df
| | lemma | id_text | id_line | id_word | label | date | dates_references | publication | collection | museum_no | ftype | metadata_source | prof? | role? | family? | number? | commodity? | neighbors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6(diš)[]NU | P100041 | 3 | P100041.3.1 | o 1 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | Yes | No | [] |
| 1 | udu[sheep]N | P100041 | 3 | P100041.3.2 | o 1 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [] |
| 2 | kišib[seal]N | P100041 | 4 | P100041.4.1 | o 2 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [] |
| 3 | Lusuen[0]PN | P100041 | 4 | P100041.4.2 | o 2 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [6(diš)[]NU, udu[sheep]N, kišib[seal]N, Lusuen... |
| 4 | ki[place]N | P100041 | 5 | P100041.5.1 | o 3 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 579680 | gud[ox]N | P481395 | 31 | P481395.31.2 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | No | No | [] |
| 579681 | 1(diš)[]NU | P481395 | 31 | P481395.31.3 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | Yes | Yes | [] |
| 579682 | anše[equid]N | P481395 | 31 | P481395.31.4 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | No | No | [] |
| 579683 | Šusuen[1]RN | P517012 | 3 | P517012.3.1 | a 1 | | | CDLI Seals 013964 (composite) | | | | ORACC | No | No | No | No | No | [] |
| 579684 | Abumilum[0]PN | P517012 | 4 | P517012.4.1 | a 2 | | | CDLI Seals 013964 (composite) | | | | ORACC | No | No | No | No | No | [Šusuen[1]RN, Abumilum[0]PN] |
579685 rows × 18 columns
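The neighbor search above filters the whole dataframe once for every proper name, which is a likely reason for the long run time. One possible alternative, sketched here, is to group the row positions by tablet once and then check only the rows of the matching tablet. The function name `fast_neighbors` and its plain-list arguments are hypothetical and not part of the notebook. Like the output shown above, it keeps the proper name itself in its own neighbor list and drops `break` lemmas.

```python
from collections import defaultdict

def fast_neighbors(id_texts, id_lines, lemmas, n):
    """Return, for each row, the lemmas within n lines on the same tablet.

    Rows whose lemma does not contain 'PN' get an empty list.
    """
    # Group row positions by tablet in a single pass over the data.
    rows_by_tablet = defaultdict(list)
    for pos, text in enumerate(id_texts):
        rows_by_tablet[text].append(pos)

    result = [[] for _ in lemmas]
    for pos, lemma in enumerate(lemmas):
        if 'PN' not in lemma:
            continue
        line = id_lines[pos]
        # Only rows of the same tablet are inspected here.
        result[pos] = [
            lemmas[other]
            for other in rows_by_tablet[id_texts[pos]]
            if line - n <= id_lines[other] <= line + n and lemmas[other] != 'break'
        ]
    return result

# Toy data modelled on the first and last rows of the table above.
id_texts = ['P100041'] * 5 + ['P517012'] * 2
id_lines = [3, 3, 4, 4, 5, 3, 4]
lemmas = ['6(diš)[]NU', 'udu[sheep]N', 'kišib[seal]N', 'Lusuen[0]PN',
          'ki[place]N', 'Šusuen[1]RN', 'Abumilum[0]PN']

neighbors = fast_neighbors(id_texts, id_lines, lemmas, 2)
print(neighbors[3])
print(neighbors[6])
```

With the notebook's dataframe this could be called as `fast_neighbors(words_df['id_text'].tolist(), words_df['id_line'].tolist(), words_df['lemma'].tolist(), 2)`. The speed-up has not been measured here, so treat it as a starting point rather than a drop-in replacement.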
Check that only proper names were given neighbors. First, the rows whose lemma is a proper name:
words_df[words_df['lemma'].str.contains("PN")]
| | lemma | id_text | id_line | id_word | label | date | dates_references | publication | collection | museum_no | ftype | metadata_source | prof? | role? | family? | number? | commodity? | neighbors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Lusuen[0]PN | P100041 | 4 | P100041.4.2 | o 2 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [6(diš)[]NU, udu[sheep]N, kišib[seal]N, Lusuen... |
| 5 | Abbakala[0]PN | P100041 | 5 | P100041.5.2 | o 3 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [6(diš)[]NU, udu[sheep]N, kišib[seal]N, Lusuen... |
| 18 | UrKugnunak[0]PN | P100041 | 17 | P100041.17.1 | seal 1 ii 1 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [lugal[king]N, an[sky]N, anubda[quarter]N, lim... |
| 33 | Ludiŋirak[0]PN | P100189 | 7 | P100189.7.2 | o 5 | SH46 - 08 - 05 | SH46 - 08 - 05 | AAS 211 | Louvre Museum, Paris, France | AO 20039 | | BDTNS | No | No | No | No | No | [uš[die]V/i, ud[sun]N, 5(diš)-kam[]NU, ki[plac... |
| 34 | Urniŋarak[0]PN | P100189 | 9 | P100189.9.1 | r 1 | SH46 - 08 - 05 | SH46 - 08 - 05 | AAS 211 | Louvre Museum, Paris, France | AO 20039 | | BDTNS | No | No | No | No | No | [ki[place]N, Ludiŋirak[0]PN, Urniŋarak[0]PN, š... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 579544 | Rabiʾili[0]PN | P459158 | 10 | P459158.10.1 | a ii 2 | Ibbi-Suen.00.00.00 | Ibbi-Suen.00.00.00 | CDLI Seals 006338 (physical) | private: anonymous, unlocated | Anonymous 459158 | | ORACC | No | No | No | No | No | [lugal[king]N, an[sky]N, anubda[quarter]N, lim... |
| 579571 | Dugazida[0]PN | P481391 | 10 | P481391.10.2 | r 1 | SH46 - 02 - 24 | SH46 - 02 - 24 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1950 | | BDTNS | No | No | No | No | No | [uš[die]V/i, ud[sun]N, 2(u)[]NU, 4(diš)-kam[]N... |
| 579573 | Urniŋarak[0]PN | P481391 | 11 | P481391.11.1 | r 2 | SH46 - 02 - 24 | SH46 - 02 - 24 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1950 | | BDTNS | No | No | No | No | No | [ki[place]N, Dugazida[0]PN, kurušda[fattener]N... |
| 579671 | Enlila[0]PN | P481395 | 27 | P481395.27.2 | r 8 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | No | No | [šuniŋin[total]N, 1(diš)[]NU, dusu[equid]N, ni... |
| 579684 | Abumilum[0]PN | P517012 | 4 | P517012.4.1 | a 2 | | | CDLI Seals 013964 (composite) | | | | ORACC | No | No | No | No | No | [Šusuen[1]RN, Abumilum[0]PN] |
53093 rows × 18 columns
words_df[~words_df['lemma'].str.contains("PN")]
| | lemma | id_text | id_line | id_word | label | date | dates_references | publication | collection | museum_no | ftype | metadata_source | prof? | role? | family? | number? | commodity? | neighbors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6(diš)[]NU | P100041 | 3 | P100041.3.1 | o 1 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | Yes | No | [] |
| 1 | udu[sheep]N | P100041 | 3 | P100041.3.2 | o 1 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [] |
| 2 | kišib[seal]N | P100041 | 4 | P100041.4.1 | o 2 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [] |
| 4 | ki[place]N | P100041 | 5 | P100041.5.1 | o 3 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [] |
| 6 | zig[rise]V/i | P100041 | 6 | P100041.6.1 | o 4 | SSXX - 00 - 00 | SSXX - 00 - 00 | AAS 053 | Louvre Museum, Paris, France | AO 20313 | | BDTNS | No | No | No | No | No | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 579679 | 2(u)[]NU | P481395 | 31 | P481395.31.1 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | Yes | Yes | [] |
| 579680 | gud[ox]N | P481395 | 31 | P481395.31.2 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | No | No | [] |
| 579681 | 1(diš)[]NU | P481395 | 31 | P481395.31.3 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | Yes | Yes | [] |
| 579682 | anše[equid]N | P481395 | 31 | P481395.31.4 | l.e. 1 | SS02 - 02 - 00 | SS02 - 02 - 00 | unpublished unassigned ? | Department of Classics, University of Cincinna... | UC CSC 1954 | | BDTNS | No | No | No | No | No | [] |
| 579683 | Šusuen[1]RN | P517012 | 3 | P517012.3.1 | a 1 | | | CDLI Seals 013964 (composite) | | | | ORACC | No | No | No | No | No | [] |
526592 rows × 18 columns
The following line counts the rows whose lemma is not a proper name but that were nevertheless given neighbors. The expected result is 0.
sum([lst != [] for lst in words_df[~words_df['lemma'].str.contains("PN")]['neighbors']])
0
2 Save Results in CSV file & Pickle
Here we save words_df, which now holds the output of Parts 1 and 2, as both a CSV file and a pickle.
words_df.to_csv('output/part_2_output.csv')
words_df.to_pickle('output/part_2_output.p')
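One point to keep in mind when reloading these files: a CSV file stores each `neighbors` list as its string representation, so it comes back as text, whereas the pickle preserves the Python lists. The following is a minimal sketch of turning such a cell back into a list with the standard library. The in-memory buffer and the sample values are illustrative only.

```python
import ast
import csv
import io

neighbors = ['Šusuen[1]RN', 'Abumilum[0]PN']

# Write one row the way a CSV export would: the list is stored as a string.
buffer = io.StringIO()
csv.writer(buffer).writerow(['Abumilum[0]PN', str(neighbors)])

# Read the row back: the second field is now text, not a list.
buffer.seek(0)
lemma, cell = next(csv.reader(buffer))
print(type(cell).__name__)

# ast.literal_eval parses the text back into a Python list without executing code.
restored = ast.literal_eval(cell)
print(restored == neighbors)
```

If later sections only need the data inside Python, loading the pickle avoids this conversion step altogether.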