Skip to content

Data retrieval from several CSV/EXCEL tables

Simple analysis

Simple analysis of table columns with identical labels from several EXCEL files: XLSX_analysing-multiple-files.py

Use cases

  • identifying chonologies in data sets
  • data comparison
  • data cleanign and data normalisation

Sample output for the above-mentioned script:

Beispielhaftes Ergebnis der (vergleichenden) Analyse mehrerer EXCEL-Dateien mit PANDAS in Python 3Sample console output of the (comparative) analysis of multiple EXCEL files with PANDAS in Python 3



Die Factoid-Liste enthält derzeit 11 Einträge.


Die Namen der erfassten Personen sind:
                 pers_name
0      Johann Caspar Abel
1      Johann Caspar Abel
2           Andreas Abley
3           Andreas Abley
4   Joannes Michael Abley
5      Joannes Adamus Abt
6           Friedrich Abt
7      Joannes Adamus Abt
8      Joannes Adamus Abt
9                     NaN
10                    NaN


Diese erfassten Ereignisse haben bislang kein Startdatum:
   event_type       start         end           pers_name  pers_title  \
0      birth  0000-00-00  1765-00-00  Johann Caspar Abel         NaN   

  per_function place_name inst_name rel_pers    source comment  
0        child  Frankfurt       NaN      NaN  RPh 267r     NaN  


Dies sind die ersten 9 Personen in der Personen-Datei:
                                            pers_name
0                       Philipp Carl von und zu Eltz
1         Carl Emerich Frantz Freiherr von Breidbach
2        Carl Emmerich Frantz Freiherr von Breidbach
3    Johann Frantz Jacob Anton Freiherr von Hoheneck
4  Frantz Godfried Carl Johann Friederich Anton G...
5    Carl Philipp Heinrich Freiherr von Greiffenclau
6     Carl Philipp Henrich Freiherr von Greiffenclau
7     Carl Philipp Henrich Freiherr von Greiffenclau
8              Hugo Frantz Carl Graf von und zu Eltz
9                     Hugo Frantz Carl Graf von Eltz


Diese Personen sind in der Factoid-Tabelle und in der Personen-Tabelle erfasst:
 ['Johann Caspar Abel']


Dies ist der am häufigsten genannte Ortsname der Factoid-Datei:
 0    Mainz
dtype: object


Zum gesuchten Ereignis-Ort gibt es folgende Informationen aus der Orts-Datei:
     place_name Weitere Angaben laut WHG Linked Data!
33       Mainz                                   NaN
151      Mainz                                   NaN
153      Mainz                                   NaN
165      Mainz                                   NaN
Done.

Script for identifying biographic events

Script for identifying biographic events per person and sorting them by event-value and date XLSX_select-and-sort-events.py (1st version)

py
# Script to select event per persons and to sort them by event values and dates

# written for the DigiKAR geohumanities project in April 2022 by Monika Barget

import csv
import pandas as pd
import numpy as np

# Paths of locally synchronised EXCEL files for DigiKAR project
# all project members can adjust paths according to their own file structure

factoids='C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\FactoidList_Erfassung_Jahns_TEST.xlsx'

# classify events

#f=pd.read_excel(factoids)
#events_f=(f[['event_type']])
#search_event=events_f.drop_duplicates()
#print(search_event)

event_value_dict={"Sonstiges":0, 
                  "Geburt":1, 
                  "Taufe":2, 
                  "Primäre Bildungsstation":3, 
                  "Privatunterricht":3,
                  "Rezeption":4, 
                  "Zulassung":9, 
                  "Immatrikulation":10,
                  "Studium":11,
                  "Prüfungsverfahren":11,
                  "Graduation":12,
                  "Praktikum":13,
                  "Promotion":14,
                  "Wohnsitznahme": 20,
                  "Reise":20, 
                  "Nobilitierung":20,
                  "Aufnahme":20,
                  "Aufschwörung":20,
                  "Eheschließung":20,
                  "Funktionsausübung":20,
                  "erfolglose Bewerbung":20,
                  "Rejektion":20,
                  "Aufenthalt":20,
                  "mittelbare Nobilitierung":20,
                  "Privilegierung":20,
                  "Wappenbesserung":20,
                  "Introduktion":30, 
                  "Mitgliedschaft":30,
                  "Gesandtschaft":30, 
                  "Präsentation":30, 
                  "Vokation":39, 
                  "Ernennung":40,
                  "Amtseinführung":41,
                  "Vereidigung":41,
                  "Amtsantritt":42,
                  "Beförderung":44, 
                  "Ehrung":45, 
                  "Entlassung":50,
                  "Suspendierung":50,
                  "Absetzung":50,
                  "Resignation":50,
                  "Rücktritt":50,
                  "Pensionierung":90,
                  "Pension":91,
                  "Tod":100}

# read person list

f=pd.read_excel(factoids)
pers_name_f=(f[['pers_name']]) 
search_df=pers_name_f.drop_duplicates() # remove duplicates
search_list=search_df['pers_name'].tolist()

# count no. of entries in flattened person list

no_person=len(search_list)
print("There are", no_person, "unique person names in this data set.")

# add event values from dict to data frame

f['event_value'] = f['event_type'].map(event_value_dict)

# iterate through unique persons to get their events

for name in search_list:
    print(name)
    res_df=(f.loc[f['pers_name'] == name])
    res_sorted=res_df.sort_values(by =['event_value','event_after-date','event_start','event_before-date',])
   
 # write results to new sheets in EXCEL
    
    with pd.ExcelWriter('C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\JahnsEvents.xlsx', engine='openpyxl', mode='a') as writer:  
        res_sorted.to_excel(writer, sheet_name=name[-14:], index="True")
     
print("Done.")

Use cases

  • experimental reconstruction of biographic chronologies in cases where a lot of events have no or vague dating
  • calculation of possible event time frames based on previous or following events
  • merging identical events / adjusting event time frames based on information from different sources

The current sorting is four-fold and starts with the event classification. This can be adjusted in the code.

res_sorted = res_df.sort_values(
    by =[
        'event_value',
        'event_after_date',
        'event_start',
        'event_before_date',
    ])

Relationship tracer

py
# Script to trace relationships between people coded in factoid lists

# written for the DigiKAR geohumanities project by Monika Barget,

# USE CASES: analysing explicit relationships mentioned in rel_pers to reconstruct implicit / hidden connections

import csv
import pandas as pd
import numpy as np
import os
from collections import Counter
from itertools import combinations
from itertools import product
 
### STEP 1: READ DATA FROM SEVERAL EXCEL FILES IN FACTOID FORMAT

# Paths of locally synchronised EXCEL files for DigiKAR project
# all project members can adjust paths according to their own file structure

filenames="C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\InputLists"

# obligatory columns in valid factoid list

column_names = ["factoid_ID",
                "pers_ID",
                "pers_name",
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source",
                "source_site"]

# read all excel files in directory as one data frame

frame_list=[]
for item in os.listdir(filenames):
    file = os.path.join(filenames, item)
    df = pd.read_excel(file, sheet_name='FactoidList', header=0)
    frame_list.append(df)

f = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)
    
# read factoids from data frame

pers_f=(f[['pers_name']]) # retrieve data from selected column
pers_list=pers_f.values.tolist() # convert data frame to sorted list
pers_list_flat=[item for sublist in pers_list for item in sublist] # flatten list
pers_unique=pers_f.drop_duplicates() # remove duplicates
pers_unique_list=pers_unique.values.tolist() # write unique values to list

### STEP 2: DEFINE RELATIONSHIP MARKERS TO BE QUERIED

name_list=[]

qr=["Ehefrau: ", 
    "Ehemann: ", 
    "Vater: ", 
    "Mutter: ", 
    "Tochter: ",
    "Sohn: ",
    "Bruder: ", 
    "Schwester: ", 
    "Schwiegermutter: ", 
    "Schwiegervater: ", 
    "Schwiegertochter: ", 
    "Schwiegersohn: ",
    "Schwager: ", 
    "Schwägerin: ",
    "Großmutter: ", 
    "Großvater: ",
    "GVm: ",
    "GVv: ",
    "GMm: ",
    "GMv: ",
    "Cousin: ",
    "Cousine: ",
    "Adoptivvater: ",
    "Tante: ",
    "Onkel: ",
    "Patin: ",
    "Pate: "]

### STEP 2: GET UNIQUE NAMES AND THEIR FREQUENCIES
    
print("\n\nYour factoid list contains", len(pers_f), "entries.") # count data in selected column

for i in [item for sublist in pers_unique_list for item in sublist]: # count person occurrences
    #print("\n", i, " / ", "Häufigkeit:", pers_list_flat.count(i), "\n") # print name and occurrences

### STEP 3: ITERATE THROUGH UNIQUE PERSONS TO FIND RELATIONSHIPS

    df_new=f.loc[f['pers_name'] == i]
    #print("There are", len(df_new), " entries associated with this name.")
    try:
        condlist = [df_new['rel_pers'].notnull()]
        choicelist = [df_new['rel_pers']]

        output_list = np.select(condlist, choicelist).tolist()
        unique_out_set = set(output_list) # convert to set to remove duplicates
        unique_out_list = list(unique_out_set)
        #print(unique_out_list)
        #print(len(unique_out_list))
        for u in unique_out_list:
            if u == 0:
                continue
            else:
                try:
                    results=u.split("/")
                except:
                    results=u
                for r in results:
                    #print(r)
                    #print(type(r))
                    if "$" in r:
                        pers_inf=r.split("$", 1)
                        #print("COMPLETE PERSON:", pers_inf)
                        person=pers_inf[0]
                        info=pers_inf[1]
                    else:
                        person=r
                        info=("none")
                    #print("PERSON:", person, "INFO:", info)    
                    try:
                        pers_res=person.split(":", 1)
                        function=pers_res[0]
                        name_ident=pers_res[1]
                    except:
                        pers_res=person
                        function="unknown"
                        name_ident=pers_res[0]
                    if "#" in name_ident:
                        name=name_ident.split("#")[0]
                        ident=name_ident.split("#")[1]
                    else:
                        name=name_ident
                        ident=("none")

                    name_list.append([i, function, name, ident, info])

    except AttributeError:
        pass
    
### STEP 4: CONVERT NAME LIST TO DF:

f2 = pd.DataFrame(name_list)
f2.columns=["i", "function", "name", "ident", "info"]
print(f2)
    
### STEP 4: RECONSTRUCT SIBLINGS:

siblings1=['Bruder', 'Schwester']
siblings2=['Sohn', 'Tochter']

df_sibling_brother=f2.loc[f2['function'] == siblings1[0]]
for brother in df_sibling_brother.iterrows():
    brother_current=brother[1].to_frame()
    brother_reversed=[brother_current.loc['name'].values, "brother", brother_current.loc['i'].values, "unknown", "no info"]
    name_list.append(brother_reversed)

df_sibling_sister=f2.loc[f2['function'] == siblings1[1]]
for sister in df_sibling_sister.iterrows():
    sister_current=sister[1].to_frame()
    sister_reversed=[sister_current.loc['name'].values, "brother", sister_current.loc['i'].values, "unknown", "no info"]
    name_list.append(sister_reversed)

for s2 in siblings2:
    df_sibling2=f2.loc[f2['function'] == s2]
    for child in df_sibling2.iterrows():
        child_current=child[1].to_frame()
        child_reversed=[child_current.loc['name'].values, "parent", child_current.loc['i'].values, "unknown", "no info"]
        name_list.append(child_reversed)
        
    parents=df_sibling2['i'].unique() # unique names of parents in dataframe
    for parent in parents: # find children per parent
        df_family = df_sibling2[df_sibling2['i'] == parent]
        siblings_in_family=df_family['name'].unique() # unique names of sibling in family
        siblings_number=len(siblings_in_family) # number of siblings in family
        siblings_res=list(combinations(siblings_in_family, 2)) # find all possible sibling pairs
        for pair in siblings_res:
            if pair[0] == pair[1]:
                continue
            else: 
                pair1=[pair[0], "sibling", pair[1], "unknown", "no info"]
                name_list.append(pair1)
                pair2=[pair[1], "sibling", pair[0], "unknown", "no info"]
                name_list.append(pair2)
            
### STEP 5: RECONSTRUCT GRANDPARENTS / GRANDCHILDREN:

grandchildren1=['Vater', 'Mutter']
grandchildren2=['Enkel', 'Enkelin']
grandchildren3=["GVm","GVv", "GMm", "GMv"]

for g1 in grandchildren1: # FIND GRANPARENTS / GRANDCHILDREN via KNOWN PARENTS
    df_grandchildren1=f2.loc[f2['function'] == g1]
    grandparents=df_grandchildren1['name'].unique() # unique names of parents = grandparents in data frame
    parents=df_grandchildren1['i'].unique() # unique names of children = parents in data frame
    for parent in parents:
        for s2 in siblings2:
            df_grandchildren4=f2.loc[f2['i'] == parent][f2['function'] == s2]
            grandchildren=df_grandchildren4['name'].unique() # unique names of grandchildren in family
            grandparents_per_parent=df_grandchildren1.loc[df_grandchildren1['i'] == parent]
            grandparents_values=grandparents_per_parent['name'].unique() # unique names of grandparents per parent
            
            grandchildren_res = list(list(zip(grandparents_values, element))
                           for element in product(grandchildren, repeat = len(grandparents_values)))
            grandchildren_flat=[x for y in grandchildren_res for x in y]
            #print(grandchildren_flat)
            # find all possible grandchild-grandparent pairs for each parent link
            for pair in grandchildren_flat:
                pair1=[pair[0], "grandparent-grandchild", pair[1], "unknown", "no info"]
                name_list.append(pair1)
                pair2=[pair[1], "grandparent-grandchild", pair[0], "unknown", "no info"]
                name_list.append(pair2)
    
for g2 in grandchildren2: # FIND GRANDPARENTS FOR KNOWN GRANDCHILDREN
    df_grandchildren2=f2.loc[f2['function'] == g2]
    for gc2 in df_grandchildren2.iterrows():
        gc_current2=gc2[1].to_frame()
        gc_reversed2=[gc_current2.loc['name'].values, "grandchild", gc_current2.loc['i'].values, "unknown", "no info"]
        name_list.append(gc_reversed2)
    

#print(df_grandchildren2)

for g3 in grandchildren3: # FIND GRANDCHILDREN FOR KNOWN GRANDPARENTS
    df_grandchildren3=f2.loc[f2['function'] == g3]
    for gc3 in df_grandchildren3.iterrows():
        gc_current3=gc3[1].to_frame()
        gc_reversed3=[gc_current3.loc['name'].values, "grandchild", gc_current3.loc['i'].values, "unknown", "no info"]
        name_list.append(gc_reversed3)
            
### STEP 5: SAVE TO NEW FILE

#print(name_list)
#print(len(name_list))

df = pd.DataFrame(name_list)

resultpath='C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\Results'
res_filename=input("Enter your file name: ")
res_file=os.path.join(resultpath, res_filename + ".xlsx")

df.to_excel(res_file, index=True)
    
print("Done.")

User cases

  • consolidating genealogical information from rel_pers column in factoid lists
  • reconstruction of sibling relations based on known parent-child-connections
  • reconstruction of implict grandparent-grandchild relations

Advanced script to query several columns across several spreadsheets based on user input

py
# Script to compare data from several EXCEL files

# written for the DigiKAR geohumanities project by Monika Barget,
# with kind support by https://stackoverflow.com/users/8479387/tlentali

# USE CASES: flexibly matching conditions across EXCEL columns
# check documentation for detailed description of script performance
# integration into GUI in progress

import csv
import pandas as pd
import numpy as np
import os
from collections import Counter

### STEP 1: READ DATA FROM SEVERAL EXCEL FILES IN FACTOID FORMAT

# Paths of locally synchronised EXCEL files for DigiKAR project
# all project members can adjust paths according to their own file structure

filenames="C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\InputLists"

# obligatory columns in valid factoid list

column_names = ["factoid_ID",
                "pers_ID",
                "pers_name",
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source",
                "source_site"]

# read all excel files in directory as one data frame

frame_list=[]
for item in os.listdir(filenames):
    file = os.path.join(filenames, item)
    df = pd.read_excel(file, sheet_name='FactoidList', axis=1, ignore_index=False, sort=False)
    frame_list.append(df)

f = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)
    
# read factoids from data frame

pers_f=(f[['pers_name']]) # retrieve data from selected column
pers_list=pers_f.values.tolist() # convert data frame to sorted list
pers_list_flat=[item for sublist in pers_list for item in sublist] # flatten list
pers_unique=pers_f.drop_duplicates() # remove duplicates
pers_unique_list=pers_unique.values.tolist() # write unique values to list
    
print("\n\nYour factoid list contains", len(pers_f), "entries.") # count data in selected column

#for i in [item for sublist in pers_unique_list for item in sublist]: # count person occurrences
    #print("\n", i, " / ", "Häufigkeit:", pers_list_flat.count(i), "\n") # print name and occurrences
    
### STEP 2: LET USER SELECT SEARCH CRITERIA

# queried names
print("Enter person names, separated by commas, or @.")
q_name=input()
# queried year  
print("Date(s):")
ex_year=input() 
# select type of time processing
print("No date selected (0), exact dates (1), data range (2), BEFORE date (3) or AFTER date (4)?")
z=input() 
# queried institution
print("Enter institutions, separated by commas, or @:")
q_inst=input() 
# queried title
print("Enter person titles, separated by commas, or @:")
q_title=input() 
# queried function
print("Enter person functions, separated by commas, or @:")
q_func=input() 
# queried related person
print("Enter related persons, separated by commas, or @:")
q_rel=input() 

# separate entries if lists of search terms have been submitted
qn=q_name.split(", ")
exy=ex_year.split(", ")
qi=q_inst.split(", ") 
qt=q_title.split(", ") 
qf=q_func.split(", ") 
qr=q_rel.split(", ")

in_lists=[[qn], [qi], [qt], [qf], [qr]]

# handling data input with different frequences from YYYY to YYYY-MM-DD

try:
    if len(exy[1]) & len(exy[0]) == 4:
        d1=pd.Period(exy[0], freq="Y") # convert input to Period with year frequence
        end_date=int(exy[1])+1 # add one year to expand end range
        d2=pd.Period(str(end_date), freq="Y") # convert input to Period with year frequence
    elif len(exy[1]) & len(exy[0]) == 7:
        d1=pd.Period(exy[0], freq="M") # convert input to Period with month frequence
        d2=pd.Period(exy[1], freq="M") # convert input to Period with month frequence
    elif len(exy[1]) & len(exy[0]) == 10:
        d1=pd.Period(exy[0], freq="D") # convert input to Period with day frequence
        d2=pd.Period(exy[1], freq="D") # convert input to Period with day frequence
    else:
        d1=pd.Period(exy[0], freq="D") # convert input to Period with day frequence
        d2=pd.Period(exy[1], freq="D") # convert input to Period with day frequence
    
except IndexError:
    pass
except ValueError:
    pass


### STEP 3: RUN QUERY BASED ON TIME CONDITIONS


result_df1=pd.DataFrame()
result_df2=pd.DataFrame()

## CASE 0: NO TIME SELECTED

if z=="0": # no dates 
    if "@" in qn and "@" in qi and "@" in qt and "@" in qf and "@" in qr:
        print("No search criteria selected!")
        
    else:
        
# define possible conditions and choices

        condlist = [(f['pers_name'].str.contains('|'.join(qn)) ^ f['pers_name'].isin(qn)),
                    f['inst_name'].isin(qi),
                    f['pers_title'].isin(qt),
                    f['pers_function'].isin(qf),
                    (f['rel_pers'].str.contains('|'.join(qr)) ^ f['rel_pers'].isin(qr)),]

        choicelist = [f['pers_name'], 
                      f['inst_name'], 
                      f['pers_title'],
                      f['pers_function'],
                      f['rel_pers']]

        output = np.select(condlist, choicelist)
        rows=np.where(output)
        new_array=f.to_numpy()
        result_array=new_array[rows]
        print(result_array)
        
## CASE 1: SEARCHING FOR EXACT DATES
        
elif z=="1": # get exact dates 
    print("Searching for dates:", exy)
    df_list=[]
    for x in exy:
        if len(x) == 4:
            date_searched=pd.Period(x, freq="Y") # convert input to Period with year frequence
            for n in range(0, len(pers_f)):
                try:
                    if date_searched == pd.Period(f['event_start'].iloc[n], freq="Y"):
                        date_found=pd.Period(f['event_start'].iloc[n])
                        f_match=f.iloc[[n]]
                        df_list.append(f_match)
                    else:
                        #print("Outside time frame.") # optional for data cleaning
                        continue
                except ValueError:
                    #print(ValueError.args, ":", f['event_start'].iloc[n]) # optional for data cleaning
                    pass
            
        elif len(x) == 7:
            date_searched=pd.Period(x, freq="M") # convert input to Period with month frequence
            for n in range(0, len(pers_f)):
                try:
                    if date_searched == pd.Period(f['event_start'].iloc[n], freq="M"):
                        date_found=pd.Period(f['event_start'].iloc[n])
                        f_match=f.iloc[[n]]
                        df_list.append(f_match)
                    else:
                        #print("Outside time frame.") # optional for data cleaning
                        continue
                except ValueError:
                    #print(ValueError.args, ":", f['event_start'].iloc[n]) # optional for data cleaning
                    pass
        
        elif len(x) == 10:  
            date_searched=pd.Period(x, freq="D") # convert input to Period with day frequence
            for n in range(0, len(pers_f)):
                try:
                    if date_searched == pd.Period(f['event_start'].iloc[n], freq="D"):
                        date_found=pd.Period(f['event_start'].iloc[n])
                        f_match=f.iloc[[n]]
                        df_list.append(f_match)
                    else:
                        #print("Outside time frame.") # optional for data cleaning
                        continue
                except ValueError:
                    #print(ValueError.args, ":", f['event_start'].iloc[n]) # optional for data cleaning
                    pass
    
    
    f_new = pd.concat([df_list], axis=1, ignore_index=False, sort=False)
        
    try:
        
# define possible conditions and choices

        condlist = [f_new['pers_name'].str.contains(str(qn)) ^ f_new['pers_name'].str.contains('|'.join(qn)) ^ f_new['pers_name'].isin(qn), 
                                f_new['inst_name'].isin(qi), 
                                f_new['pers_title'].isin(qt),
                                f_new['pers_function'].isin(qf),
                                f_new['rel_pers'].str.contains('|'.join(qr)) ^ f_new['rel_pers'].isin(qr)]

        choicelist = [f_new['pers_name'],
                                f_new['inst_name'], 
                                f_new['pers_title'],
                                f_new['pers_function'],
                                f_new['rel_pers']]

        output = np.select(condlist, choicelist)
        rows=np.where(output)
        new_array=f_new.to_numpy()
        result_array=new_array[rows]
        
    except ValueError:
        #print(ValueError.args, ":", f['event_start'].iloc[n])
        pass
    
## CASE 2: SEARCHING FOR DATE RANGE

elif z=="2": # get date range
    print("Searching for date range between", d1, "and", d2, "!") 
    for n in range(0, len(pers_f)):
        f_new=pd.DataFrame(columns=column_names)
        try:
            df_list=[]
            if d1 <= pd.Period(f['event_start'].iloc[n], freq="Y") <= d2:
                date_found=pd.Period(f['event_start'].iloc[n])
                f_match=f.iloc[[n]]
                df_list.append(f_match)
            else:
                #print("Outside time frame.")
                continue
        except ValueError:
            #print(ValueError.args, ":", f['event_start'].iloc[n])
            pass
    
   
    f_new = pd.concat(df_list, axis=1, ignore_index=False, sort=False)

# define possible conditions and choices

    if "#" in qn and "#" in qi and "#" in qt and "#" in qf and "#" in qr:
        result_df=f_new
    else:
        condlist = [(f_new['pers_name'].str.contains('|'.join(qn)) ^ f_new['pers_name'].isin(qn)),
                    f_new['inst_name'].isin(qi), 
                    f_new['pers_title'].isin(qt),
                    f_new['pers_function'].isin(qf),
                    f_new['rel_pers'].str.contains('|'.join(qr)) ^ f_new['rel_pers'].isin(qr)]

        choicelist = [f_new['pers_name'],
                    f_new['inst_name'], 
                    f_new['pers_title'],
                    f_new['pers_function'],
                    f_new['rel_pers']]

        output = np.select(condlist, choicelist)
        rows=np.where(output)
        new_array=f_new.to_numpy()
        result_array=new_array[rows]


## CASE 3: SEARCHING FOR DATES BEFORE      

elif z=="3": #get dates before
    print("Searching for dates before", pd.Period(exy[0], freq="D"), "!")
    nf=pd.DataFrame(columns=column_names)
    for n in range(0, len(pers_f)):
        try: 
            if pd.Period(f['event_start'].iloc[n], freq="D") <= pd.Period(exy[0], freq="D"):
                nf=nf.append(f.iloc[n], ignore_index=False, sort=False)
            elif pd.Period(f['event_before-date'].iloc[n], freq="D") <= pd.Period(exy[0], freq="D"):
                nf=nf.append(f.iloc[[n]], ignore_index=False, sort=False)
            else:
                continue
        except ValueError:
            pass
    # define possible conditions and choices
    if "@" in qn and "@" in qi and "@" in qt and "@" in qf and "@" in qr:
        result_df1=nf
    else:
        condlist = [(nf['pers_name'].str.contains('|'.join(qn)) ^ nf['pers_name'].isin(qn)), 
                    nf['inst_name'].isin(qi), 
                    nf['pers_title'].isin(qt),
                    nf['pers_function'].isin(qf),
                    nf['rel_pers'].str.contains('|'.join(qr)) ^ nf['rel_pers'].isin(qr)]

        choicelist = [nf['pers_name'],
                    nf['inst_name'], 
                    nf['pers_title'],
                    nf['pers_function'],
                    nf['rel_pers']]

        output = np.select(condlist, choicelist)
        rows=np.where(output)
        new_array=nf.to_numpy()
        result_array=new_array[rows]
        
## CASE 4: SEARCHING FOR DATES AFTER        

elif z=="4": # get dates after
    print("Searching for dates after", pd.Period(exy[0], freq="D"), "!")
    nf=pd.DataFrame(columns=column_names)
    for n in range(0, len(pers_f)):
        try: 
            if pd.Period(f['event_start'].iloc[n], freq="D") >= pd.Period(exy[0], freq="D"):
                nf=nf.append(f.iloc[[n]], ignore_index=False, sort=False)
            elif pd.Period(f['event_after-date'].iloc[n], freq="D") >= pd.Period(exy[0], freq="D"):
                nf=nf.append(f.iloc[[n]], ignore_index=False, sort=False)
            else:
                continue
        except ValueError:
            pass
        # define possible conditions and choices
        if "@" in qn and "@" in qi and "@" in qt and "@" in qf and "@" in qr:
             result_df1=nf
        else:
            condlist = [(nf['pers_name'].str.contains('|'.join(qn)) ^ nf['pers_name'].isin(qn)), 
                    nf['inst_name'].isin(qi), 
                    nf['pers_title'].isin(qt),
                    nf['pers_function'].isin(qf),
                    nf['rel_pers'].str.contains('|'.join(qr)) ^ nf['rel_pers'].isin(qr)]
            
            choicelist = [nf['pers_name'],
                      nf['inst_name'], 
                      nf['pers_title'],
                      nf['pers_function'],
                      nf['rel_pers']]

            output = np.select(condlist, choicelist)
            rows=np.where(output)
            new_array=nf.to_numpy()
            result_array=new_array[rows]
            
else:
    print("INVALID TIME OPERATOR!")
            
### STEP 4: convert result array into a dataframe

try:
    result_df2 = pd.DataFrame(result_array)
    result_df=pd.concat([result_df1, result_df2], axis=1, ignore_index=False, sort=False)
    
    #print("Elements found: ", result_df[4]) # optional für data cleaning
  
    # save to xlsx file

    resultpath='C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\Results'
    res_filename=input("Enter your file name: ")
    res_file=os.path.join(resultpath, res_filename + ".xlsx")

    result_df.to_excel(res_file, index=True)
    
    print("Done.") 
    
except NameError as er:
    print(er.args)
    print("No results retrieved! Try again.")
  • Files from directory DigiKAR_DATEN\\Python\\InputLists are read to a single dataframe.
  • User can select search criteria for person name, date, institution, person title, function and related person. Several keywords per field are accepted.
  • User can decide whether to search for exact dates, a date range or “before” and “after” dates.
  • Script handles time input first and writes matches to new data frame, then conditions for remaining fields are applied. Current version of script apply OR operator to the fields. In final GUI version, user will be able to select both AND or OR operations.
  • Results are written to new file. User is free to assign file name via script input.
  • Time management handles YYYY as well as YYYY-MM and YYYY-MM-DD inputs. Searching for dates before and after input date, the before and after columns in the original spreadsheet are consider alongside start and end dates.

Use cases

  • sample for creating more complex Boolean queries across several spreadsheets
  • adjustable to different spreadsheet formats
  • experimenting with data parsing and potential problems arising from early modern data