{ "cells": [ { "cell_type": "markdown", "id": "8d207bc3", "metadata": { "id": "21P-6caix_ma" }, "source": [ "# III. Harmonizing Year Name Formulae and Processing Dates\n", "\n", "We will now convert the date field into a series of numerical entries representing the year, month, and day recorded on the tablet, along with several flags for special cases. This will allow us to analyze the data over time." ] }, { "cell_type": "code", "execution_count": null, "id": "2a604c9e", "metadata": { "id": "sH8jC6z00vB9" }, "outputs": [], "source": [ "# import necessary libraries\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "id": "b3fd08d8", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 946 }, "id": "78n2_xd41bv1", "outputId": "064e5962-eb58-4143-fff8-8bc68bd94fa0" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_of_origindates_referencedcollectionprimary_publicationmuseum_noproveniencemetadata_source
P100041SSXX - 00 - 00SSXX - 00 - 00Louvre Museum, Paris, FranceAAS 053AO 20313Puzriš-DagānBDTNS
P100189SH46 - 08 - 05SH46 - 08 - 05Louvre Museum, Paris, FranceAAS 211AO 20039Puzriš-DagānBDTNS
P100190SH47 - 07 - 29SH47 - 07 - 29Louvre Museum, Paris, FranceAAS 212AO 20051Puzriš-DagānBDTNS
P100191AS01 - 03 - 24AS01 - 03 - 24Louvre Museum, Paris, FranceAAS 213AO 20074Puzriš-DagānBDTNS
P100211AS01 - 12 - 11AS01 - 12 - 11Museum of Fine Arts, Budapest, HungaryActSocHun Or 5-12, 156 2MHBA 51.2400Puzriš-DagānBDTNS
........................
P456164NaNNaNNaNCDLI Seals 003454 (composite)NaNPuzriš-Dagan (mod. Drehem)ORACC
P459158Ibbi-Suen.00.00.00Ibbi-Suen.00.00.00private: anonymous, unlocatedCDLI Seals 006338 (physical)Anonymous 459158Puzriš-Dagan (mod. Drehem)ORACC
P481391SH46 - 02 - 24SH46 - 02 - 24Department of Classics, University of Cincinna...unpublished unassigned ?UC CSC 1950Puzriš-DagānBDTNS
P481395SS02 - 02 - 00SS02 - 02 - 00Department of Classics, University of Cincinna...unpublished unassigned ?UC CSC 1954Puzriš-DagānBDTNS
P517012NaNNaNNaNCDLI Seals 013964 (composite)NaNPuzriš-Dagan (mod. Drehem)ORACC
\n", "

15671 rows × 7 columns

\n", "
" ], "text/plain": [ " date_of_origin ... metadata_source\n", "P100041 SSXX - 00 - 00 ... BDTNS\n", "P100189 SH46 - 08 - 05 ... BDTNS\n", "P100190 SH47 - 07 - 29 ... BDTNS\n", "P100191 AS01 - 03 - 24 ... BDTNS\n", "P100211 AS01 - 12 - 11 ... BDTNS\n", "... ... ... ...\n", "P456164 NaN ... ORACC\n", "P459158 Ibbi-Suen.00.00.00 ... ORACC\n", "P481391 SH46 - 02 - 24 ... BDTNS\n", "P481395 SS02 - 02 - 00 ... BDTNS\n", "P517012 NaN ... ORACC\n", "\n", "[15671 rows x 7 columns]" ] }, "execution_count": 9, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "words_df = pd.read_pickle('https://gitlab.com/yashila.bordag/sumnet-data/-/raw/main/part_2_output.p') # uncomment to read from online file\n", "#words_df = pd.read_pickle('output/part_2_output.p') #uncomment to read from local file\n", "\n", "catalogue_data = pd.read_pickle('https://gitlab.com/yashila.bordag/sumnet-data/-/raw/main/part_1_catalogue.p')\n", "#catalogue_data = pd.read_pickle('output/part_1_catalogue.p')" ] }, { "cell_type": "markdown", "id": "ee7430ef", "metadata": { "id": "ZeF5OQ06lE4G" }, "source": [ "## 1 Normalizing BDTNS Dates\n", "\n", "In the ORACC metadata the field `date_of_origin` is formatted as `Amar-Suen.05.10.03`, meaning \"5th regnal year of Amar-Suen; month 10, day 3\". The ORACC metadata are directly imported from CDLI. In general, BDTNS metadata tend to be more reliable than CDLI metadata. As such, wherever possible we will use the BDTNS date rather than the CDLI date. \n", "\n", "However, the strings for the dates are formatted differently in BDTNS. If the date were the same as the example given above, in the BDTNS format it would be represented by the string `AS05 - 10 - 03`. Moreover, BDTNS dates can sometimes contain additional information, like if the month was a diri month, or multiple dates on the same tablet. When present, this information could be of value later on, so we attempt to include as much of it as possible. 
In the following, we write two functions: the first converts the BDTNS format into several numerical and boolean fields, and the second does the same for the CDLI format." ] }, { "cell_type": "code", "execution_count": null, "id": "9f5b2564", "metadata": { "id": "N5f9Jx0-eKdP" }, "outputs": [], "source": [ "def inner_normalize_bdtns_date(bdtns_date):\n", "    # Offset added to each king's regnal year to obtain the normalized year,\n", "    # i.e. the number of years elapsed before his reign began. Ur-Namma's\n", "    # first regnal year is therefore normalized year 1.\n", "    reigns = {\n", "        'UN': 0,\n", "        'SH': 18,\n", "        'AS': 66,\n", "        'SS': 75,\n", "        'IS': 84\n", "    }\n", "\n", "    # Sometimes a date string contains multiple dates separated by '//'.\n", "    # In such a case we set the range of dates to go from the earliest\n", "    # date to the latest date.\n", "    if '//' in bdtns_date:\n", "        dates = [normalize_bdtns_date(date) for date in bdtns_date.split('//')]\n", "        return {\n", "            'min_year': min([date['min_year'] for date in dates]),\n", "            'max_year': max([date['max_year'] for date in dates]),\n", "            'min_month': min([date['min_month'] for date in dates]),\n", "            'max_month': max([date['max_month'] for date in dates]),\n", "            'diri_month': any([date['diri_month'] for date in dates]),\n", "            'min_day': min([date['min_day'] for date in dates]),\n", "            'max_day': max([date['max_day'] for date in dates]),\n", "            'questionable': any([date['questionable'] for date in dates]),\n", "            'other_meta_chars': any([date['other_meta_chars'] for date in dates]),\n", "            'multiple_dates': True\n", "        }\n", "\n", "    # Several characters are used as markers for metatextual information;\n", "    # we remove them from our string and handle them separately.\n", "    chars = ' d?]m+()l'\n", "    reduced_string = bdtns_date.upper()\n", "    for c in chars.upper():\n", "        reduced_string = reduced_string.replace(c, '')\n", "    date_list = reduced_string.split('-')\n", "    date_list = ['nan' if 'XX' in s else s for s in date_list]\n", "    try:\n", "        year = reigns[date_list[0][:2]] + float(date_list[0][-2:])\n",
"    except KeyError:\n", "        year = float('nan')\n", "    month = float(date_list[1])\n", "    try:\n", "        day = float(date_list[2])\n", "    except IndexError:\n", "        day = float('nan')\n", "    return {\n", "        'min_year': year,\n", "        'max_year': year + 1,\n", "        'min_month': month,\n", "        'max_month': month + 1,\n", "        'diri_month': 'd' in bdtns_date,\n", "        'min_day': day,\n", "        'max_day': day + 1,\n", "        'questionable': '?' in bdtns_date,\n", "        'other_meta_chars': any([c in bdtns_date for c in chars[3:]]),\n", "        'multiple_dates': False\n", "    }\n", "\n", "# Lastly, we define a wrapper function to catch any errors raised by our main function.\n", "# This is useful because the strings were entered manually, so there are many edge cases,\n", "# each with only a few instances.\n", "def normalize_bdtns_date(bdtns_date):\n", "    try:\n", "        return inner_normalize_bdtns_date(bdtns_date)\n", "    except Exception:\n", "        return {\n", "            'min_year': None,\n", "            'max_year': None,\n", "            'min_month': None,\n", "            'max_month': None,\n", "            'min_day': None,\n", "            'max_day': None,\n", "            'diri_month': None,\n", "            'questionable': None,\n", "            'other_meta_chars': None,\n", "            'multiple_dates': None\n", "        }" ] }, { "cell_type": "markdown", "id": "2ffdb970", "metadata": { "id": "ZQge9XB4XU08" }, "source": [ "## 2 Normalizing CDLI Date Strings" ] }, { "cell_type": "markdown", "id": "f7bd2d5f", "metadata": { "id": "7VGCnQY7NqpF" }, "source": [ "The next step is to convert the CDLI date string into a numerical format. For months and days, this is relatively straightforward: as long as the entry is legible, we convert the substring into a number; if it is illegible, we set it to `NaN`. If the date is missing altogether, every field is set to `None`.\n", "\n", "The year is a bit more involved. The year number in the date string counts years within the current king's reign, so by itself it does not give us the absolute year. We account for this by adding the number of years that had elapsed before the king began his reign to the year given in the date. 
The table below contains the values used to achieve this.\n", "\n", "| king | normalized years | length of reign (years) |\n", "| ----- | ---------------- | ---------- |\n", "| Ur-Namma | 1-18 | 18 |\n", "| Šulgi | 19-66 | 48 |\n", "| Amar-Suen | 67-75 | 9 |\n", "| Šū-Suen | 76-84 | 9 |\n", "| Ibbi-Suen | 85-108 | 24 |" ] }, { "cell_type": "code", "execution_count": null, "id": "436cd8a8", "metadata": { "id": "Pus9Kx-5Uq2O" }, "outputs": [], "source": [ "def normalize_cdli_date(cdli_date):\n", "    # Offset added to each king's regnal year to obtain the normalized year,\n", "    # i.e. the number of years elapsed before his reign began.\n", "    reigns = {\n", "        'Ur-Namma': 0,\n", "        'Šulgi': 18,\n", "        'Amar-Suen': 66,\n", "        'Šū-Suen': 75,\n", "        'Ibbi-Suen': 84\n", "    }\n", "\n", "    # return early if NaN or None (this means the date is missing or illegible)\n", "    if not isinstance(cdli_date, str):\n", "        out = {\n", "            'min_year': None,\n", "            'max_year': None,\n", "            'min_month': None,\n", "            'max_month': None,\n", "            'min_day': None,\n", "            'max_day': None,\n", "            'diri_month': None,\n", "            'questionable': None,\n", "            'other_meta_chars': None,\n", "            'multiple_dates': None\n", "        }\n", "        return out\n", "\n", "    decomposed_date = cdli_date.split('.')\n", "    decomposed_date[0] = reigns.get(decomposed_date[0])\n", "    if decomposed_date[0] is None:\n", "        decomposed_date[0] = float('nan')\n", "\n", "    # if a section of the year/month/day is illegible we replace that entry with NaN,\n", "    # otherwise we convert it to a float.\n", "    decomposed_date[1:] = list(map(lambda x: float(x) if x.isdigit() else float('nan'), decomposed_date[1:]))\n", "    try:\n", "        out = {\n", "            'min_year': decomposed_date[0] + decomposed_date[1],\n", "            'max_year': decomposed_date[0] + decomposed_date[1] + 1,\n", "            'min_month': decomposed_date[2],\n", "            'max_month': decomposed_date[2] + 1,\n", "            'min_day': decomposed_date[3],\n", "            'max_day': decomposed_date[3] + 1,\n", "            'diri_month': False,\n", "            'questionable': False,\n", "            'other_meta_chars': False,\n",
"            'multiple_dates': False\n", "        }\n", "        return out\n", "\n", "    except IndexError:\n", "        # Due to some edge cases in the formatting, the list can occasionally\n", "        # have fewer than 4 entries. If so, the date is partially illegible\n", "        # and can be ignored.\n", "        out = {\n", "            'min_year': None,\n", "            'max_year': None,\n", "            'min_month': None,\n", "            'max_month': None,\n", "            'min_day': None,\n", "            'max_day': None,\n", "            'diri_month': None,\n", "            'questionable': None,\n", "            'other_meta_chars': None,\n", "            'multiple_dates': None\n", "        }\n", "        return out" ] }, { "cell_type": "markdown", "id": "8b27cf16", "metadata": { "id": "ReAKcEhfd41X" }, "source": [ "## 3 Putting It All Together" ] }, { "cell_type": "markdown", "id": "6c3071db", "metadata": { "id": "k_v_rBKiRZiX" }, "source": [ "Now that we can convert date strings in either format, we are ready to apply the conversion to the data. To do this we will\n", "\n", "1. loop through each row of the input DataFrame\n", "2. use the `metadata_source` field to determine the format of the date string\n", "3. apply the appropriate function to the `date_of_origin` field to get a dictionary of our new fields\n", "4. 
compile the results for each row into a list\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b9a21463", "metadata": { "id": "veFm7uHmd3NY" }, "outputs": [], "source": [ "def add_normalized_columns(df):\n", "    row_list = []\n", "    # go through each row of the DataFrame\n", "    for index in df.index:\n", "        # get the date_of_origin string\n", "        date = df['date_of_origin'][index]\n", "        # select the appropriate function to convert using metadata_source\n", "        if df['metadata_source'][index] == 'BDTNS':\n", "            row = normalize_bdtns_date(date)\n", "        else:\n", "            row = normalize_cdli_date(date)\n", "        row['metadata_source'] = df['metadata_source'][index]\n", "        # add the result to a list\n", "        row_list.append(row)\n", "    return row_list" ] }, { "cell_type": "markdown", "id": "2e3c7fe9", "metadata": { "id": "JTVMOoafSDCA" }, "source": [ "Next, we process the dates given in `catalogue_data`. We add the resulting fields to `words_df` and also keep them in a standalone DataFrame, `time_data`." ] }, { "cell_type": "code", "execution_count": null, "id": "13d83780", "metadata": { "id": "uppJLRuSshRL" }, "outputs": [], "source": [ "result = add_normalized_columns(catalogue_data)\n", "time_data = pd.DataFrame(result, index=catalogue_data.index)\n", "words_df = words_df.merge(time_data, left_on='id_text', right_index=True)" ] }, { "cell_type": "markdown", "id": "febaff24", "metadata": { "id": "KHtnB1DQSkAz" }, "source": [ "At this stage, we can already gain some insight into our data using the newly added fields. For example, we can plot the number of tablets per year for each metadata source, after removing the tablets whose dates are marked as questionable." ] }, { "cell_type": "code", "execution_count": null, "id": "138b0514", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "id": "-V8HH0jlWc47", "outputId": "4f5a31a4-101d-4fd0-c8e0-d08bfc88eac8" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" } ], "source": [ "ax = time_data[(time_data['questionable'] == False)].hist(column='min_year', by='metadata_source', sharex=True, bins= range(30, 90))" ] }, { "cell_type": "markdown", "id": "494b2c5a", "metadata": { "id": "-53Lkz2mzFe7" }, "source": [ "## 4 Save Results in CSV file and Pickle\n", "Here we will save the `words_df` and `time_data` outputs from parts 1, 2, and 3." ] }, { "cell_type": "code", "execution_count": null, "id": "2242a10b", "metadata": { "id": "SDvcQVykzSh8" }, "outputs": [], "source": [ "words_df.to_csv('output/part_3_words_output.csv')\n", "words_df.to_pickle('output/part_3_words_output.p')\n", "\n", "time_data.to_csv('output/part_3_time_output.csv')\n", "time_data.to_pickle('output/part_3_time_output.p')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 5 }