I. Importing the Data and Extracting Lemmatization from JSON

The code in this section will parse the ORACC JSON data of the Ur III corpus to extract lemmatization data.

The output contains text IDs, line IDs, lemmas, and other relevant data.

# import necessary libraries
import pandas as pd

# import libraries for this section
import requests
import zipfile
import json

# import library for progress bars (used in the download and parsing loops below)
from tqdm.auto import tqdm

1 Import ORACC Data

We will begin by downloading the lemmatization data from ORACC. It arrives as a zip file of JSON documents, which we will parse in section 3 and store in a pandas dataframe.

The ORACC database can be accessed via http://oracc.org/. The Ur III data can be viewed and interacted with at http://oracc.org/epsd2/admin/ur3.

To manipulate the data in this workbook, however, we will need to download it into our workspace and then parse and store it in a pandas dataframe. We will download the data through a REST API using the requests library and save it locally as a zip file, which will be parsed in the next section.

Because we are accessing the JSON through the ORACC API, the request URL is not the same as the viewing URL listed above. We will instead use: http://build-oracc.museum.upenn.edu/json/epsd2-admin-ur3.

If you are unfamiliar with using APIs and making HTTP requests and would like to learn how to do so in Python, this section of the Data 100 textbook might be helpful: http://www.textbook.ds100.org/ch/13/web_http.html

NOTE: If you already have this data, you can skip this step, as the download can take quite a long time.

# define the url to download from and name the zipfile to be created
url = 'http://build-oracc.museum.upenn.edu/json/epsd2-admin-ur3.zip'
file_name = 'jsonzip/epsd2-admin-ur3.zip'
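
If you already have the zip (per the note above), a small guard can create the jsonzip folder and check for an existing copy before downloading. A minimal sketch, assuming the url and file_name defined above:

import os

os.makedirs('jsonzip', exist_ok=True)   # make sure the target folder exists
if os.path.exists(file_name):
    print(file_name, 'is already present; you can skip the download below')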

# Workbook Code
'''
CHUNK = 1024 # Define chunk size
with requests.get(url, stream=True) as request:
    if request.status_code == 200:
        #if file is accessible, download
        tqdm.write('Saving ' + url + ' as ' + file_name)
        total_size = int(request.headers.get('content-length', 0))
        tqdm_handler = tqdm(total=total_size, unit='B', unit_scale=True, 
                            desc = 'epsd2/admin/ur3')
        with open(file_name, 'wb') as zip_file:
            # use tqdm to show download progress
            for chunk in request.iter_content(chunk_size=CHUNK):
                tqdm_handler.update(len(chunk))
                zip_file.write(chunk)
        tqdm_handler.close()
    else:
        tqdm.write("WARNING: %s does not exist." % url)
'''

# Pedagogical Code
request = requests.get(url)              # use a GET request to fetch the data from the url
with open(file_name, 'wb') as zip_file:  # write the response body into the local zip file
    zip_file.write(request.content)

2 Download BDTNS Data

You can use the following instructions to obtain the data we need from the BDTNS database.

  1. Go to http://bdtns.filol.csic.es/index.php?p=formulario_urIII.

  2. In the Provenience drop-down menu choose “Puzriš-Dagan” and hit “Search”. This will give a list of about 15,000 tablets from Drehem.

  3. Now, hit “Export” on the left and in the pop-up window select every box except “Transliteration of the whole text(s)”. Hit “export”. This will produce a .txt file and you’ll need to put it into the “bdtns_metadata” folder.

  4. The name of the .txt file changes every time it is downloaded, so you may need to alter the code slightly to import the data properly in this notebook.

NOTE: The database’s export function is currently broken, however, so we will be using a previously obtained copy of the data hosted on our Github repository.

2.1 Extract Data from the .txt File

We have an existing copy of the data stored in a text file on our Github repository (https://github.com/niekveldhuis/sumnet/blob/master/QUERY_catalogue.txt).

As with the last subsection, we will retrieve the data over HTTP, so the URL is slightly different (https://raw.githubusercontent.com/niekveldhuis/sumnet/master/QUERY_catalogue.txt). Using pandas, we can download the data without the requests library and import it into a pandas dataframe to work with immediately.

The data is indexed both by BDTNS number ids and by CDLI P-number ids. The ORACC data is indexed by P-numbers, so in the next subsection we will merge the two datasets on these ids.

bdtns_catalogue_data = pd.read_csv('https://raw.githubusercontent.com/niekveldhuis/sumnet/master/QUERY_catalogue.txt', delimiter='\t')

# reset the index to the p numbers
bdtns_catalogue_data = bdtns_catalogue_data.set_index('CDLI_no')  
bdtns_catalogue_data
BDTNS_no Publications Museum_no Date Provenience
CDLI_no
NaN 197624.0 À l'école des scribes, p. 104 NaN AS09 - 00 - 00 Umma
NaN 203551.0 A New Cuneiform Text..., IM 141866 IM 141866 // 0000 - 00 - 00 Umma
P114469 21035.0 AAA 1 78 24 = MVN 05 249 = CDLI P114469 WML 49.47.46 0000 - 07 - 12 Ĝirsu
P142651 38568.0 AAICAB 1/1, Ashm. 1909-951 = CDLI P142651 Ashm. 1909-951 AS05 - 11 - 00 Ĝirsu
P142652 38569.0 AAICAB 1/1, Ashm. 1911-139 = CDLI P142652 Ashm. 1911-139 SS09 - 08 - 23 Umma
... ... ... ... ... ...
P142619 6022.0 ZVO 25 134 2 = Hermitage 3 195 (unpubl.) = CDL... Erm 14814 AS01 - 03 - 28 Puzriš-Dagān
P142620 6023.0 ZVO 25 136 3 = RA 73 190 = RIM E3/2.1.2.74 MA 9612 AS06 - 00 - 00 Umma
NaN 191925.0 Zwischen Orient und Okzident 262 1 NaN AS03 - 00 - 00 Umma
NaN 191926.0 Zwischen Orient und Okzident 264 2 NaN IS01 - 00 - 00 Umma
P253713 170094.0 NaN MS 4703 NaN Unkn. Prov.

101390 rows × 5 columns

2.2 Filtering Data

Here we filter the BDTNS data for Puzriš-Dagān tablets and drop tablets with null P-numbers or missing dates. This ensures we keep only the metadata relevant for our purposes.

# select for tablets from Puzriš-Dagān/Drehem
bdtns_catalogue_data = bdtns_catalogue_data[(bdtns_catalogue_data['Provenience']
    =='Puzriš-Dagān') | (bdtns_catalogue_data['Provenience']
    =='Puzriš-Dagān (?)')] 

#select for non-null p_numbers and non-null dates
bdtns_catalogue_data = bdtns_catalogue_data.loc[~bdtns_catalogue_data.index.isnull(), :]
bdtns_catalogue_data = bdtns_catalogue_data[~bdtns_catalogue_data['Date'].isnull()]

bdtns_catalogue_data
BDTNS_no Publications Museum_no Date Provenience
CDLI_no
P142785 38699.0 AAICAB 1/1, Ashm. 1919-11a-b = CDLI P142785 = ... Ashm. 1919-11a-b IS01 - 08 - 00 Puzriš-Dagān
P142787 38700.0 AAICAB 1/1, Ashm. 1923-412 = CDLI P142787 Ashm. 1923-412 SS05 - 07 - 12 Puzriš-Dagān
P142788 38701.0 AAICAB 1/1, Ashm. 1923-415 = CDLI P142788 Ashm. 1923-415 SH48 - 12 - 20 Puzriš-Dagān
P142789 38702.0 AAICAB 1/1, Ashm. 1923-419 = CDLI P142789 Ashm. 1923-419 SS04 - XX - 15 Puzriš-Dagān
P142790 38703.0 AAICAB 1/1, Ashm. 1923-420 = CDLI P142790 Ashm. 1923-420 XXXX - 01//02 - 01 Puzriš-Dagān
... ... ... ... ... ...
P141938 38553.0 ZA 93 61 13 GIS 13 AS01 - 09 - 24 Puzriš-Dagān
P274585 167818.0 ZA 95 66 = CDLI P274585 A 05505 AS05 - XX - XX Puzriš-Dagān
P381748 170855.0 ZA 96 179 = HSAO 16, no. 083 IM 044366 XXXX - XX - XX Puzriš-Dagān
P142618 6021.0 ZVO 25 134 1 = Hermitage 3 174 (unpubl.) = CDL... Erm 14811 SH47 - 02 - 00 Puzriš-Dagān
P142619 6022.0 ZVO 25 134 2 = Hermitage 3 195 (unpubl.) = CDL... Erm 14814 AS01 - 03 - 28 Puzriš-Dagān

15279 rows × 5 columns

3 Parsing the Data

In this subsection we will merge the metadata we have on the Drehem tablets and parse the imported JSON data into a list from which we can later construct a pandas dataframe.

3.1 Getting the Metadata for the Tablets

Before constructing our dataframe of lemmatized data, we will construct a dataframe of Drehem tablet metadata to filter and augment the data with.

We do this by loading the data from catalogue.json into a dataframe and selecting the P-numbers whose provenience is given as Puzriš-Dagan (modern Drehem) in ORACC or which appear in the BDTNS Drehem catalogue. We then replace the ORACC metadata with the BDTNS metadata wherever both databases have an entry and record the metadata source. Finally, we select the desired columns.

# open the zipfile to read
zip_file = zipfile.ZipFile('jsonzip/epsd2-admin-ur3.zip')

#read and decode the project catalog json and store data in oracc_catalogue_json
oracc_json_str = zip_file.read('epsd2/admin/ur3/catalogue.json').decode('utf-8') 
oracc_catalogue_json = json.loads(oracc_json_str)['members']
zip_file.close()

#create combined catalogue_data
oracc_catalogue_data = pd.DataFrame.from_dict(oracc_catalogue_json, orient='index')
catalogue_data = oracc_catalogue_data[(oracc_catalogue_data['provenience'] 
    == 'Puzriš-Dagan (mod. Drehem)') | 
    oracc_catalogue_data.index.isin(bdtns_catalogue_data.index)]
    #choose catalogue entries marked as Puzriš-Dagan in either ORACC or BDTNS

#add bdtns dates to oracc dataset and name metadata source
#added line to suppress setting with copy warning
pd.options.mode.chained_assignment = None  # default='warn'
in_bdtns_index = catalogue_data.index.isin(bdtns_catalogue_data.index)
catalogue_data.loc[in_bdtns_index, 'date_of_origin'] = bdtns_catalogue_data['Date']
catalogue_data.loc[in_bdtns_index, 'dates_referenced'] = bdtns_catalogue_data['Date']
catalogue_data.loc[in_bdtns_index, 'provenience'] = bdtns_catalogue_data['Provenience']
catalogue_data['metadata_source'] = ['BDTNS' if in_bdtns else 'ORACC' for in_bdtns in in_bdtns_index]

#select necessary columns
catalogue_data = catalogue_data[['date_of_origin', 'dates_referenced', 'collection',
                                 'primary_publication', 'museum_no', 'provenience',
                                 'metadata_source']]
#display metadata
catalogue_data
date_of_origin dates_referenced collection primary_publication museum_no provenience metadata_source
P100041 SSXX - 00 - 00 SSXX - 00 - 00 Louvre Museum, Paris, France AAS 053 AO 20313 Puzriš-Dagān BDTNS
P100189 SH46 - 08 - 05 SH46 - 08 - 05 Louvre Museum, Paris, France AAS 211 AO 20039 Puzriš-Dagān BDTNS
P100190 SH47 - 07 - 29 SH47 - 07 - 29 Louvre Museum, Paris, France AAS 212 AO 20051 Puzriš-Dagān BDTNS
P100191 AS01 - 03 - 24 AS01 - 03 - 24 Louvre Museum, Paris, France AAS 213 AO 20074 Puzriš-Dagān BDTNS
P100211 AS01 - 12 - 11 AS01 - 12 - 11 Museum of Fine Arts, Budapest, Hungary ActSocHun Or 5-12, 156 2 MHBA 51.2400 Puzriš-Dagān BDTNS
... ... ... ... ... ... ... ...
P456164 NaN NaN NaN CDLI Seals 003454 (composite) NaN Puzriš-Dagan (mod. Drehem) ORACC
P459158 Ibbi-Suen.00.00.00 Ibbi-Suen.00.00.00 private: anonymous, unlocated CDLI Seals 006338 (physical) Anonymous 459158 Puzriš-Dagan (mod. Drehem) ORACC
P481391 SH46 - 02 - 24 SH46 - 02 - 24 Department of Classics, University of Cincinna... unpublished unassigned ? UC CSC 1950 Puzriš-Dagān BDTNS
P481395 SS02 - 02 - 00 SS02 - 02 - 00 Department of Classics, University of Cincinna... unpublished unassigned ? UC CSC 1954 Puzriš-Dagān BDTNS
P517012 NaN NaN NaN CDLI Seals 013964 (composite) NaN Puzriš-Dagan (mod. Drehem) ORACC

15671 rows × 7 columns

3.2 The parse_ORACC_json() function

Here we define the parse_ORACC_json() function, which will recursively parse the JSON files provided by ORACC until it finds the relevant lemma data. It takes an ORACC JSON data structure, the associated meta_data, and some necessary keys, and returns a list of lemma dictionaries which we can use to construct a pandas dataframe.

The json files consist of a hierarchy of cdl nodes; only the nodes which have ‘f’ or ‘strict’ keys contain lemmatization data. The function goes down this hierarchy by recursively calling itself when a new cdl node is encountered. For more information about the data hierarchy in the ORACC json files, see ORACC Open Data.
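
Schematically, the nesting looks something like this (an illustrative, heavily abbreviated sketch; real files contain many more fields, and the values here are loosely modeled on the output shown later in this notebook):

example_node = {'cdl': [                  # outer node: a list of child nodes
    {'cdl': [                             # nested 'cdl' node: the function recurses into it
        {'label': 'o 1'},                 # updates the current line label in meta_data
        {'ref': 'P100041.3.2',            # word node: the 'f' key holds the lemmatization
         'f': {'form': 'udu', 'cf': 'udu', 'gw': 'sheep', 'pos': 'N'}},
        {'strict': '1',                   # '$'-line: break/ruling data in the dollar_keys fields
         'ref': 'P100041.8',
         'extent': 'n', 'scope': 'space', 'state': 'blank'}
    ]}
]}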

We are left with lemma data dictionaries with a series of attributes:

  • id_text: consists of a project abbreviation plus a text ID; in this corpus the format is epsd2/admin/ur3/P100041.

  • ftype: denotes if a node is a year name (yn) or not.

  • word_id: consists of text ID, line ID, and word ID, in the format Q000039.76.2, i.e. the second word in line 76 of text object Q000039.

  • label: a human-legible label that refers to a line or another part of the text, e.g. o i 23 (obverse column 1 line 23) or r v 23' (reverse column 5 line 23 prime). The label field is used in online ORACC editions to indicate line numbers.

  • extent, scope, and state: gives metatextual data about the condition of the object; captures the number of broken lines or columns and similar information.

  • date, dates_references, publication, collection, and museum_no: metadata from the ORACC and BDTNS catalogues.

Note: 76 in the word_id is not a line number strictly speaking but an object reference within the text object. Things like horizontal rulings, columns, and breaks also get object references. The word_id field allows us to put lines, breaks, and horizontal drawings together in the proper order.
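
For example, splitting a word_id on its dots separates these components (a quick illustration; we will use the same trick in section 4.3):

text_id, line_ref, word_no = 'Q000039.76.2'.split('.')  # -> 'Q000039', '76', '2'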

def parse_ORACC_json(lemmatized_data_json, meta_data, dollar_keys):
    '''
    A function to parse the ORACC json lemmatized data which has a specific
    format outlined here: http://oracc.museum.upenn.edu/doc/opendata/json/index.html
    This function recursively parses objects associated with the 'cdl' key to 
    extend the lemma_list. Each identified lemma's data is added as a new entry
    in the lemma list, which is returned at the end of the function.
    
    Inputs:
    lemmatized_data_json: JSON object containing lemmatized data for a single 
                          tablet from ORACC. Must be recursively called to 
                          access lemma data. Lemma data objects can be 
                          identified with the presence of an 'f' key or 'strict'
                          key
    meta_data: a dictionary with various metadata fields pertaining to the given
               tablet
    dollar_keys: a list of keys used to add data to the 'strict' lemmas

    Outputs:
    lemma_list: a list of dictionaries where each entry corresponds to a single 
                lemma. This list is extended at the end of each recursion.
    '''
    lemma_list = []
    for JSONobject in lemmatized_data_json['cdl']:
        if 'cdl' in JSONobject: 
            lemma_list.extend(parse_ORACC_json(JSONobject, meta_data, dollar_keys))
        if 'label' in JSONobject:
            meta_data['label'] = JSONobject['label']
        if 'f' in JSONobject:
            lemma = JSONobject['f']
            lemma['ftype'] = JSONobject.get('ftype')
            lemma['id_word'] = JSONobject['ref']
            lemma['label'] = meta_data['label']
            lemma['id_text'] = meta_data['id_text']
            lemma['date'] = meta_data['date']
            lemma['dates_references'] = meta_data['dates_references']
            lemma['publication'] = meta_data['publication']
            lemma['collection'] = meta_data['collection']
            lemma['museum_no'] = meta_data['museum_no']
            lemma['metadata_source'] = meta_data['metadata_source']
            lemma_list.append(lemma)
        if 'strict' in JSONobject and JSONobject['strict'] == '1':
            lemma = {key: JSONobject[key] for key in dollar_keys}
            lemma['id_word'] = JSONobject['ref']
            lemma['id_text'] = meta_data['id_text']
            lemma['date'] = meta_data['date']
            lemma['dates_references'] = meta_data['dates_references']
            lemma['publication'] = meta_data['publication']
            lemma['collection'] = meta_data['collection']
            lemma['museum_no'] = meta_data['museum_no']
            lemma['metadata_source'] = meta_data['metadata_source']
            lemma_list.append(lemma)
    return lemma_list

3.3 Call the parse_ORACC_json() function for every JSON file

Here we will use the parse_ORACC_json() function and the catalogue dataframe to construct the words_df dataframe, which we will use throughout the rest of the code.

The project zip file downloaded earlier contains a directory called corpusjson, which holds a JSON file for every text available in that corpus. The files are named using their text IDs in the pattern P######.json (or Q######.json or X######.json).

The namelist() function of the zipfile package could be used to list the names of all the files in the ZIP and to select the file names in the corpusjson directory with the extension .json. Since we only want the Drehem texts, we instead construct the expected file names directly from the P-numbers in our catalogue.

Each of these files is read from the zip file and loaded with the command json.loads(), which transforms the string into a proper JSON object.

This JSON object (essentially a Python dictionary), called lemmatized_data_json below, is then sent to the parse_ORACC_json() function. The function adds lemmata to the lemma_list. In the end, lemma_list will contain as many list elements as there are words in all the texts requested.

The dictionary meta_data is created to hold temporary information. The value of the key id_text is updated in the main process every time a new JSON file is opened and sent to the parse_ORACC_json() function. The function itself will change values or add new keys, depending on the information found while iterating through the JSON file. When a new lemma row is created, parse_ORACC_json() supplies data such as id_text, label, and (potentially) other information from meta_data.

lemma_list = []
meta_data = {'label': None}
dollar_keys = ['extent', 'scope', 'state']

try:
  zip_file = zipfile.ZipFile(file_name)       # create a ZipFile object
except (OSError, zipfile.BadZipFile):
  print(file_name, 'does not exist or is not a proper ZIP file')

project = 'epsd2/admin/ur3'                          # the ORACC project all our texts belong to
drehem_list_oracc = catalogue_data.index.to_list()   # all the Drehem P numbers
files = [project + '/corpusjson/' + p_number + '.json' for p_number in drehem_list_oracc]
for lemma_file in tqdm(files, desc = project):       # iterate over the file names
  id_text = project + lemma_file[-13:-5]             # id_text is, for instance, epsd2/admin/ur3/P414332

  p_number = lemma_file[-12:-5]
  meta_data['id_text'] = id_text
  meta_data['date'] = catalogue_data.loc[p_number]['date_of_origin']
  meta_data['dates_references'] = catalogue_data.loc[p_number]['dates_referenced']
  meta_data['publication'] = catalogue_data.loc[p_number]['primary_publication']
  meta_data['collection'] = catalogue_data.loc[p_number]['collection']
  meta_data['museum_no'] = catalogue_data.loc[p_number]['museum_no']
  meta_data['metadata_source'] = catalogue_data.loc[p_number]['metadata_source']

  try:
    json_str = zip_file.read(lemma_file).decode('utf-8')         #read and decode the json file of one particular text
    lemmatized_data_json = json.loads(json_str)                # make it into a json object (essentially a dictionary)
    lemma_list.extend(parse_ORACC_json(lemmatized_data_json, meta_data, dollar_keys))               # and send to the parse_ORACC_json() function
  except (KeyError, json.JSONDecodeError):
    print(id_text, 'is not available or not complete')

zip_file.close()

4 Data Structuring

Here we construct our pandas dataframe.

4.1 Transform the Data into a DataFrame

Here we use lemma_list to make our dataframe and we view it.

words_df = pd.DataFrame(lemma_list)
words_df = words_df.fillna('')      # fill empty values with empty strings
words_df[words_df['lang'] == 'sux'] # display lemmas where language is Sumerian
lang form delim gdl pos ftype id_word label id_text date dates_references publication collection museum_no metadata_source cf gw sense norm0 epos base morph extent scope state cont aform
0 sux 6(diš) [{'n': 'n', 'form': '6(diš)', 'id': 'P100041.3... n P100041.3.1 o 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
1 sux udu [{'v': 'udu', 'id': 'P100041.3.2.0'}] N P100041.3.2 o 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS udu sheep sheep udu N udu ~
2 sux kišib₃ [{'v': 'kišib₃', 'id': 'P100041.4.1.0'}] N P100041.4.1 o 2 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS kišib seal cylinder seal kišib N kišib₃ ~
3 sux lu₂-{d}suen [{'v': 'lu₂', 'id': 'P100041.4.2.0', 'delim': ... PN P100041.4.2 o 2 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS Lusuen 0 0 Lusuen PN lu₂-{d}suen ~
4 sux ki [{'v': 'ki', 'id': 'P100041.5.1.0'}] N P100041.5.1 o 3 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS ki place place ki N ki ~
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
594695 sux gu₄ [{'v': 'gu₄', 'id': 'P481395.31.2.0'}] N P481395.31.2 l.e. 1 epsd2/admin/ur3/P481395 SS02 - 02 - 00 SS02 - 02 - 00 unpublished unassigned ? Department of Classics, University of Cincinna... UC CSC 1954 BDTNS gud ox bull, ox gud N gud ~
594696 sux 1(diš) [{'n': 'n', 'form': '1(diš)', 'id': 'P481395.3... n P481395.31.3 l.e. 1 epsd2/admin/ur3/P481395 SS02 - 02 - 00 SS02 - 02 - 00 unpublished unassigned ? Department of Classics, University of Cincinna... UC CSC 1954 BDTNS
594697 sux anše [{'v': 'anše', 'id': 'P481395.31.4.0'}] N P481395.31.4 l.e. 1 epsd2/admin/ur3/P481395 SS02 - 02 - 00 SS02 - 02 - 00 unpublished unassigned ? Department of Classics, University of Cincinna... UC CSC 1954 BDTNS anše equid equid, donkey anše N anše ~
594698 sux {d}šu-{d}suen [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... RN P517012.3.1 a 1 epsd2/admin/ur3/P517012 CDLI Seals 013964 (composite) ORACC Šusuen 1 1 Šusuen RN {d}šu-{d}suen ~
594699 sux a-bu-um-dingir [{'v': 'a', 'id': 'P517012.4.1.0', 'delim': '-... PN P517012.4.1 a 2 epsd2/admin/ur3/P517012 CDLI Seals 013964 (composite) ORACC Abumilum 0 0 Abumilum PN a-bu-um-dingir ~

588203 rows × 27 columns

words_df.shape   # see shape
(594700, 27)
words_df.columns # see column names
Index(['lang', 'form', 'delim', 'gdl', 'pos', 'ftype', 'id_word', 'label',
       'id_text', 'date', 'dates_references', 'publication', 'collection',
       'museum_no', 'metadata_source', 'cf', 'gw', 'sense', 'norm0', 'epos',
       'base', 'morph', 'extent', 'scope', 'state', 'cont', 'aform'],
      dtype='object')

4.2 Remove Spaces and Commas from Guide Word and Sense

Spaces and commas in Guide Word and Sense may cause trouble in computational methods such as tokenization, or when the data is saved in Comma Separated Values format. All spaces are therefore replaced by hyphens and all commas are removed.

By default the replace() function in pandas will match the entire string (that is, “lugal” matches “lugal” but there is no match between “l” and “lugal”). In order to match partial strings the parameter regex must be set to True.

The replace() function takes a nested dictionary as argument. The top-level keys identify the columns on which the replace() function should operate (in this case ‘gw’ and ‘sense’). The value of each key is another dictionary with the search string as key and the replace string as value.

findreplace = {' ' : '-', ',' : ''}
words_df = words_df.replace({'gw' : findreplace, 'sense' : findreplace}, regex=True)

The columns in the resulting DataFrame correspond to the elements of a full ORACC signature, plus information about text, line, and word ids:

  • base (Sumerian only)

  • cf (Citation Form)

  • cont (continuation of the base; Sumerian only)

  • epos (Effective Part of Speech)

  • form (transliteration, omitting all flags such as indication of breakage)

  • frag (transliteration; including flags)

  • gdl_utf8 (cuneiform)

  • gw (Guide Word: main or first translation in standard dictionary)

  • id_text (six-digit P, Q, or X number)

  • id_word (word ID in the format Text_ID.Line_ID.Word_ID)

  • label (traditional line number in the form o ii 2’ (obverse column 2 line 2’), etc.)

  • lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc.)

  • morph (Morphology; Sumerian only)

  • norm (Normalization: Akkadian)

  • norm0 (Normalization: Sumerian)

  • pos (Part of Speech)

  • sense (contextual meaning)

  • sig (full ORACC signature)

  • (newly added) ftype (yn if a node represents a year name)

Not all data elements (columns) are available for all words. Sumerian words never have a norm, Akkadian words do not have norm0, base, cont, or morph. Most data elements are only present when the word is lemmatized; only lang, form, id_word, and id_text should always be there.
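
To see which data elements are sparsely populated, we can count the empty cells per column (a quick sketch; it relies on the fillna('') step in section 4.1):

(words_df == '').sum().sort_values()  # number of empty cells in each column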

We also included some metadata:

  • date

  • dates_references

  • publication (the primary publication)

  • collection

  • museum_no

The first two refer to the dates on the tablets; the last three identify where the tablet was published and which collection currently holds it.

4.3 Create Line ID

The DataFrame currently has a word-by-word data representation. We will add to each word a field id_line that will make it possible to reconstruct lines. This newly created field id_line is different from a traditional line number (found in the field “label”) in two ways. First, id_line is an integer, so that lines are sorted correctly. Second, id_line is assigned to words, but also to gaps and horizontal drawings on the tablet. The field id_line will allow us to keep all these elements in their proper order.

The field “id_line” is created by splitting the field “id_word” into (two or) three elements. The format of “id_word” is IDtext.line.word. The middle part, id_line, is selected and its data type is changed from string to integer. Rows that represent gaps in the text or horizontal drawings have an “id_word” in the format IDtext.line (consisting of only two elements), but are treated in exactly the same way.

words_df['id_line'] = [int(wordid.split('.')[1]) for wordid in words_df['id_word']]
words_df.head(10)
lang form delim gdl pos ftype id_word label id_text date dates_references publication collection museum_no metadata_source cf gw sense norm0 epos base morph extent scope state cont aform id_line
0 sux 6(diš) [{'n': 'n', 'form': '6(diš)', 'id': 'P100041.3... n P100041.3.1 o 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS 3
1 sux udu [{'v': 'udu', 'id': 'P100041.3.2.0'}] N P100041.3.2 o 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS udu sheep sheep udu N udu ~ 3
2 sux kišib₃ [{'v': 'kišib₃', 'id': 'P100041.4.1.0'}] N P100041.4.1 o 2 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS kišib seal cylinder-seal kišib N kišib₃ ~ 4
3 sux lu₂-{d}suen [{'v': 'lu₂', 'id': 'P100041.4.2.0', 'delim': ... PN P100041.4.2 o 2 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS Lusuen 0 0 Lusuen PN lu₂-{d}suen ~ 4
4 sux ki [{'v': 'ki', 'id': 'P100041.5.1.0'}] N P100041.5.1 o 3 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS ki place place ki N ki ~ 5
5 sux ab-ba-kal-la-ta [{'v': 'ab', 'id': 'P100041.5.2.0', 'delim': '... PN P100041.5.2 o 3 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS Abbakala 0 0 Abbakala,ta PN ab-ba-kal-la ~,ta 5
6 sux ba-zi [{'v': 'ba', 'id': 'P100041.6.1.0', 'delim': '... V/i P100041.6.1 o 4 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS zig rise to-rise ba:zig V/i zig₃ ba:~ 6
7 P100041.8 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS n space blank 8
8 P100041.9 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS 1 impression other 9
9 sux {d}šu-{d}suen [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... RN P100041.12.1 seal 1 i 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS Šusuen 1 1 Šusuen RN {d}šu-{d}suen ~ 12

5 Create Lemma Column

A lemma, ORACC style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in lugal[king]N (Sumerian) or šarru[king]N.

Proper Nouns (proper names, geographical names, etc.) are a special case, because they currently receive a Part of Speech, but not a Citation Form or Guide Word.

Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with NA as GuideWord and NA as POS, as in i-bu-x[NA]NA. Note that NA is a string.

Finally, rows representing horizontal rulings, blank lines, or broken lines have data in the fields ‘state’, ‘scope’, and ‘extent’ (for instance ‘state’ = broken, ‘scope’ = line, and ‘extent’ = 3, to indicate three broken lines). We can use this to prevent scripts from ‘jumping over’ such breaks when looking for key words after (or before) a proper noun. We distinguish between physical breaks and logical breaks (such as horizontal rulings or seal impressions).

proper_nouns = ['FN', 'PN', 'DN', 'AN', 'WN', 'ON', 'TN', 'MN', 'CN', 'GN']
physical_break = ['illegible', 'traces', 'missing', 'effaced']
logical_break = ['other', 'blank', 'ruling']
words_df['lemma'] = words_df["cf"] + '[' + words_df["gw"] + ']' + words_df["pos"]
words_df.loc[words_df["cf"] == "" , 'lemma'] = words_df['form'] + '[NA]NA'
words_df.loc[words_df["pos"] == "n" , 'lemma'] = words_df['form'] + '[]NU' 
#words_df.loc[words_df["pos"].isin(proper_nouns) , 'lemma'] = words_df['form'] + '[]' + words_df['pos']
words_df.loc[words_df["state"].isin(logical_break), 'lemma'] = "break_logical"
words_df.loc[words_df["state"].isin(physical_break), 'lemma'] = "break_physical"
words_df.head(10)
lang form delim gdl pos ftype id_word label id_text date dates_references publication collection museum_no metadata_source cf gw sense norm0 epos base morph extent scope state cont aform id_line lemma
0 sux 6(diš) [{'n': 'n', 'form': '6(diš)', 'id': 'P100041.3... n P100041.3.1 o 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS 3 6(diš)[]NU
1 sux udu [{'v': 'udu', 'id': 'P100041.3.2.0'}] N P100041.3.2 o 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS udu sheep sheep udu N udu ~ 3 udu[sheep]N
2 sux kišib₃ [{'v': 'kišib₃', 'id': 'P100041.4.1.0'}] N P100041.4.1 o 2 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS kišib seal cylinder-seal kišib N kišib₃ ~ 4 kišib[seal]N
3 sux lu₂-{d}suen [{'v': 'lu₂', 'id': 'P100041.4.2.0', 'delim': ... PN P100041.4.2 o 2 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS Lusuen 0 0 Lusuen PN lu₂-{d}suen ~ 4 Lusuen[0]PN
4 sux ki [{'v': 'ki', 'id': 'P100041.5.1.0'}] N P100041.5.1 o 3 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS ki place place ki N ki ~ 5 ki[place]N
5 sux ab-ba-kal-la-ta [{'v': 'ab', 'id': 'P100041.5.2.0', 'delim': '... PN P100041.5.2 o 3 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS Abbakala 0 0 Abbakala,ta PN ab-ba-kal-la ~,ta 5 Abbakala[0]PN
6 sux ba-zi [{'v': 'ba', 'id': 'P100041.6.1.0', 'delim': '... V/i P100041.6.1 o 4 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS zig rise to-rise ba:zig V/i zig₃ ba:~ 6 zig[rise]V/i
7 P100041.8 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS n space blank 8 break_logical
8 P100041.9 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS 1 impression other 9 break_logical
9 sux {d}šu-{d}suen [{'det': 'semantic', 'pos': 'pre', 'seq': [{'v... RN P100041.12.1 seal 1 i 1 epsd2/admin/ur3/P100041 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS Šusuen 1 1 Šusuen RN {d}šu-{d}suen ~ 12 Šusuen[1]RN

6 Select Relevant Columns

Now we can select only the field ‘lemma’ and the fields that indicate the ID of the word, the line, or the document, plus the metadata, and the field ‘label’, which indicates the physical position of the line on the document.

cols = ['lemma', 'id_text', 'id_line', 'id_word', 'label', 'date', 'dates_references', 'publication', 'collection', 'museum_no', 'ftype', 'metadata_source']
words_df = words_df[cols].copy()
words_df.head(100)
lemma id_text id_line id_word label date dates_references publication collection museum_no ftype metadata_source
0 6(diš)[]NU epsd2/admin/ur3/P100041 3 P100041.3.1 o 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
1 udu[sheep]N epsd2/admin/ur3/P100041 3 P100041.3.2 o 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
2 kišib[seal]N epsd2/admin/ur3/P100041 4 P100041.4.1 o 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
3 Lusuen[0]PN epsd2/admin/ur3/P100041 4 P100041.4.2 o 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
4 ki[place]N epsd2/admin/ur3/P100041 5 P100041.5.1 o 3 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
... ... ... ... ... ... ... ... ... ... ... ... ...
95 1(diš)[]NU epsd2/admin/ur3/P100211 3 P100211.3.1 o 1 AS01 - 12 - 11 AS01 - 12 - 11 ActSocHun Or 5-12, 156 2 Museum of Fine Arts, Budapest, Hungary MHBA 51.2400 BDTNS
96 udu[sheep]N epsd2/admin/ur3/P100211 3 P100211.3.2 o 1 AS01 - 12 - 11 AS01 - 12 - 11 ActSocHun Or 5-12, 156 2 Museum of Fine Arts, Budapest, Hungary MHBA 51.2400 BDTNS
97 a[water]N epsd2/admin/ur3/P100211 3 P100211.3.3 o 1 AS01 - 12 - 11 AS01 - 12 - 11 ActSocHun Or 5-12, 156 2 Museum of Fine Arts, Budapest, Hungary MHBA 51.2400 BDTNS
98 x[NA]NA epsd2/admin/ur3/P100211 3 P100211.3.4 o 1 AS01 - 12 - 11 AS01 - 12 - 11 ActSocHun Or 5-12, 156 2 Museum of Fine Arts, Budapest, Hungary MHBA 51.2400 BDTNS
99 sag[rare]V/i epsd2/admin/ur3/P100211 3 P100211.3.5 o 1 AS01 - 12 - 11 AS01 - 12 - 11 ActSocHun Or 5-12, 156 2 Museum of Fine Arts, Budapest, Hungary MHBA 51.2400 BDTNS

100 rows × 12 columns

We can simplify the ‘id_text’ column, because all documents derive from the same project (epsd2/admin/ur3).

words_df['id_text'] = [tid[-7:] for tid in words_df['id_text']]
words_df.head(20)
lemma id_text id_line id_word label date dates_references publication collection museum_no ftype metadata_source
0 6(diš)[]NU P100041 3 P100041.3.1 o 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
1 udu[sheep]N P100041 3 P100041.3.2 o 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
2 kišib[seal]N P100041 4 P100041.4.1 o 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
3 Lusuen[0]PN P100041 4 P100041.4.2 o 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
4 ki[place]N P100041 5 P100041.5.1 o 3 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
5 Abbakala[0]PN P100041 5 P100041.5.2 o 3 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
6 zig[rise]V/i P100041 6 P100041.6.1 o 4 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
7 break_logical P100041 8 P100041.8 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
8 break_logical P100041 9 P100041.9 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
9 Šusuen[1]RN P100041 12 P100041.12.1 seal 1 i 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
10 lugal[king]N P100041 13 P100041.13.1 seal 1 i 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
11 kalag[strong]V/i P100041 13 P100041.13.2 seal 1 i 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
12 lugal[king]N P100041 14 P100041.14.1 seal 1 i 3 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
13 Urim[1]SN P100041 14 P100041.14.2 seal 1 i 3 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
14 lugal[king]N P100041 15 P100041.15.1 seal 1 i 4 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
15 an[sky]N P100041 15 P100041.15.2 seal 1 i 4 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
16 anubda[quarter]N P100041 15 P100041.15.3 seal 1 i 4 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
17 limmu[four]NU P100041 15 P100041.15.4 seal 1 i 4 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
18 UrKugnunak[0]PN P100041 17 P100041.17.1 seal 1 ii 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
19 dubsar[scribe]N P100041 18 P100041.18.1 seal 1 ii 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
As a quick sanity check, we can filter the first 4,000 rows for lemmas containing ‘N’ (which catches nouns, names, and numbers, among others):

test = words_df.head(4000)
test[test['lemma'].str.contains('N')]
lemma id_text id_line id_word label date dates_references publication collection museum_no ftype metadata_source
0 6(diš)[]NU P100041 3 P100041.3.1 o 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
1 udu[sheep]N P100041 3 P100041.3.2 o 1 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
2 kišib[seal]N P100041 4 P100041.4.1 o 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
3 Lusuen[0]PN P100041 4 P100041.4.2 o 2 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
4 ki[place]N P100041 5 P100041.5.1 o 3 SSXX - 00 - 00 SSXX - 00 - 00 AAS 053 Louvre Museum, Paris, France AO 20313 BDTNS
... ... ... ... ... ... ... ... ... ... ... ... ...
3995 Inabanum[0]PN P100981 4 P100981.4.1 o 2 AS05 - 04 - 20 AS05 - 04 - 20 Amorites 16 Oriental Institute, University of Chicago, Chi... OIM A02868 BDTNS
3996 MAR.TU[westerner]N P100981 4 P100981.4.2 o 2 AS05 - 04 - 20 AS05 - 04 - 20 Amorites 16 Oriental Institute, University of Chicago, Chi... OIM A02868 BDTNS
3997 1(diš)[]NU P100981 5 P100981.5.1 o 3 AS05 - 04 - 20 AS05 - 04 - 20 Amorites 16 Oriental Institute, University of Chicago, Chi... OIM A02868 BDTNS
3998 sila[lamb]N P100981 5 P100981.5.2 o 3 AS05 - 04 - 20 AS05 - 04 - 20 Amorites 16 Oriental Institute, University of Chicago, Chi... OIM A02868 BDTNS
3999 Urištaran[0]PN P100981 6 P100981.6.1 o 4 AS05 - 04 - 20 AS05 - 04 - 20 Amorites 16 Oriental Institute, University of Chicago, Chi... OIM A02868 BDTNS

3465 rows × 12 columns

7 Save Results in CSV file & Pickle

The output file is called part_1_output.csv and is placed in the directory output. On most computers, csv files open automatically in Excel. That program does not deal well with utf-8 encoding; files in utf-8 need to be imported into Excel rather than opened directly. If you intend to use the file in Excel, change encoding='utf-8' to encoding='utf-16'. For usage in computational text analysis applications utf-8 is usually preferred.

The Pandas function to_pickle() writes a binary file that can be opened in a later phase of the project with the read_pickle() command and will reproduce exactly the same DataFrame with the same data structure. The resulting file can only be used by pandas.

You can skip Part I by importing the Part I output as a pandas dataframe at the beginning of Part II. Similarly, in the next parts, after you save the output, you will be able to import that output later on instead of rerunning the sections before.
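
For example, at the start of Part II you could reload the saved output instead of rerunning this notebook (a sketch, assuming the file paths used below):

import pandas as pd

words_df = pd.read_pickle('output/part_1_output.p')           # restore the words dataframe
catalogue_data = pd.read_pickle('output/part_1_catalogue.p')  # restore the catalogue metadata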

catalogue_data.reset_index() # take a final look at the catalogue before saving
index date_of_origin dates_referenced collection primary_publication museum_no provenience metadata_source
0 P100041 SSXX - 00 - 00 SSXX - 00 - 00 Louvre Museum, Paris, France AAS 053 AO 20313 Puzriš-Dagān BDTNS
1 P100189 SH46 - 08 - 05 SH46 - 08 - 05 Louvre Museum, Paris, France AAS 211 AO 20039 Puzriš-Dagān BDTNS
2 P100190 SH47 - 07 - 29 SH47 - 07 - 29 Louvre Museum, Paris, France AAS 212 AO 20051 Puzriš-Dagān BDTNS
3 P100191 AS01 - 03 - 24 AS01 - 03 - 24 Louvre Museum, Paris, France AAS 213 AO 20074 Puzriš-Dagān BDTNS
4 P100211 AS01 - 12 - 11 AS01 - 12 - 11 Museum of Fine Arts, Budapest, Hungary ActSocHun Or 5-12, 156 2 MHBA 51.2400 Puzriš-Dagān BDTNS
... ... ... ... ... ... ... ... ...
15666 P456164 NaN NaN NaN CDLI Seals 003454 (composite) NaN Puzriš-Dagan (mod. Drehem) ORACC
15667 P459158 Ibbi-Suen.00.00.00 Ibbi-Suen.00.00.00 private: anonymous, unlocated CDLI Seals 006338 (physical) Anonymous 459158 Puzriš-Dagan (mod. Drehem) ORACC
15668 P481391 SH46 - 02 - 24 SH46 - 02 - 24 Department of Classics, University of Cincinna... unpublished unassigned ? UC CSC 1950 Puzriš-Dagān BDTNS
15669 P481395 SS02 - 02 - 00 SS02 - 02 - 00 Department of Classics, University of Cincinna... unpublished unassigned ? UC CSC 1954 Puzriš-Dagān BDTNS
15670 P517012 NaN NaN NaN CDLI Seals 013964 (composite) NaN Puzriš-Dagan (mod. Drehem) ORACC

15671 rows × 8 columns
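
The to_csv() and to_pickle() calls below assume that the output directory exists; if it does not, a one-line guard (assuming the folder name used below) creates it:

import os
os.makedirs('output', exist_ok=True)  # create the output folder if needed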

words_df.to_csv('output/part_1_output.csv', encoding='utf-8')
words_df.to_pickle('output/part_1_output.p')

catalogue_data.to_csv('output/part_1_catalogue.csv')
catalogue_data.to_pickle('output/part_1_catalogue.p')