Data retrieval from several CSV/EXCEL tables
Simple analysis
Simple analysis of table columns with identical labels from several EXCEL files: XLSX_analysing-multiple-files.py
Use cases
- identifying chronologies in data sets
- data comparison
- data cleaning and data normalisation
Sample output for the above-mentioned script:
Sample console output of the (comparative) analysis of multiple EXCEL files with PANDAS in Python 3:
The factoid list currently contains 11 entries.
The names of the recorded persons are:
pers_name
0 Johann Caspar Abel
1 Johann Caspar Abel
2 Andreas Abley
3 Andreas Abley
4 Joannes Michael Abley
5 Joannes Adamus Abt
6 Friedrich Abt
7 Joannes Adamus Abt
8 Joannes Adamus Abt
9 NaN
10 NaN
These recorded events have no start date yet:
event_type start end pers_name pers_title \
0 birth 0000-00-00 1765-00-00 Johann Caspar Abel NaN
per_function place_name inst_name rel_pers source comment
0 child Frankfurt NaN NaN RPh 267r NaN
These are the first 9 persons in the person file:
pers_name
0 Philipp Carl von und zu Eltz
1 Carl Emerich Frantz Freiherr von Breidbach
2 Carl Emmerich Frantz Freiherr von Breidbach
3 Johann Frantz Jacob Anton Freiherr von Hoheneck
4 Frantz Godfried Carl Johann Friederich Anton G...
5 Carl Philipp Heinrich Freiherr von Greiffenclau
6 Carl Philipp Henrich Freiherr von Greiffenclau
7 Carl Philipp Henrich Freiherr von Greiffenclau
8 Hugo Frantz Carl Graf von und zu Eltz
9 Hugo Frantz Carl Graf von Eltz
These persons are recorded in both the factoid table and the person table:
['Johann Caspar Abel']
This is the most frequently mentioned place name in the factoid file:
0 Mainz
dtype: object
The place file holds the following information on the queried event location:
place_name Further information according to WHG Linked Data!
33 Mainz NaN
151 Mainz NaN
153 Mainz NaN
165 Mainz NaN
Done.
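The "persons recorded in both tables" step shown in the output above amounts to a set intersection over the `pers_name` columns. A minimal sketch with in-memory stand-ins for the EXCEL tables (the frame contents below are illustrative, not the project data, which the actual script reads via `pd.read_excel()`):

```python
import pandas as pd

# Illustrative stand-ins for the factoid and person tables; in the
# actual script these frames come from pd.read_excel().
factoids = pd.DataFrame({"pers_name": ["Johann Caspar Abel", "Andreas Abley", None]})
persons = pd.DataFrame({"pers_name": ["Johann Caspar Abel", "Hugo Frantz Carl Graf von Eltz"]})

# names recorded in both tables, with duplicates and NaN values dropped
shared = sorted(set(factoids["pers_name"].dropna()) & set(persons["pers_name"].dropna()))
print(shared)  # ['Johann Caspar Abel']
```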
Script for identifying biographic events
Script for identifying biographic events per person and sorting them by event-value and date XLSX_select-and-sort-events.py (1st version)
py
# Script to select events per person and sort them by event values and dates
# written for the DigiKAR geohumanities project in April 2022 by Monika Barget
import csv
import pandas as pd
import numpy as np
# Paths of locally synchronised EXCEL files for DigiKAR project
# all project members can adjust paths according to their own file structure
factoids='C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\FactoidList_Erfassung_Jahns_TEST.xlsx'
# classify events
#f=pd.read_excel(factoids)
#events_f=(f[['event_type']])
#search_event=events_f.drop_duplicates()
#print(search_event)
event_value_dict={"Sonstiges":0,
"Geburt":1,
"Taufe":2,
"Primäre Bildungsstation":3,
"Privatunterricht":3,
"Rezeption":4,
"Zulassung":9,
"Immatrikulation":10,
"Studium":11,
"Prüfungsverfahren":11,
"Graduation":12,
"Praktikum":13,
"Promotion":14,
"Wohnsitznahme": 20,
"Reise":20,
"Nobilitierung":20,
"Aufnahme":20,
"Aufschwörung":20,
"Eheschließung":20,
"Funktionsausübung":20,
"erfolglose Bewerbung":20,
"Rejektion":20,
"Aufenthalt":20,
"mittelbare Nobilitierung":20,
"Privilegierung":20,
"Wappenbesserung":20,
"Introduktion":30,
"Mitgliedschaft":30,
"Gesandtschaft":30,
"Präsentation":30,
"Vokation":39,
"Ernennung":40,
"Amtseinführung":41,
"Vereidigung":41,
"Amtsantritt":42,
"Beförderung":44,
"Ehrung":45,
"Entlassung":50,
"Suspendierung":50,
"Absetzung":50,
"Resignation":50,
"Rücktritt":50,
"Pensionierung":90,
"Pension":91,
"Tod":100}
# read person list
f=pd.read_excel(factoids)
pers_name_f=(f[['pers_name']])
search_df=pers_name_f.drop_duplicates() # remove duplicates
search_list=search_df['pers_name'].tolist()
# count no. of entries in flattened person list
no_person=len(search_list)
print("There are", no_person, "unique person names in this data set.")
# add event values from dict to data frame
f['event_value'] = f['event_type'].map(event_value_dict)
# iterate through unique persons to get their events
for name in search_list:
    print(name)
    res_df = f.loc[f['pers_name'] == name]
    res_sorted = res_df.sort_values(by=['event_value', 'event_after-date', 'event_start', 'event_before-date'])
    # write results to new sheets in EXCEL
    with pd.ExcelWriter('C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\JahnsEvents.xlsx', engine='openpyxl', mode='a') as writer:
        res_sorted.to_excel(writer, sheet_name=name[-14:], index=True)
print("Done.")
Use cases
- experimental reconstruction of biographic chronologies in cases where many events have no or only vague dating
- calculation of possible event time frames based on previous or following events
- merging identical events / adjusting event time frames based on information from different sources
The current sorting is four-fold and starts with the event classification. This can be adjusted in the code.
res_sorted = res_df.sort_values(
    by=[
        'event_value',
        'event_after-date',
        'event_start',
        'event_before-date',
    ])
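The effect of the event-value mapping and the subsequent sort can be illustrated on a toy frame. The rows and dates below are invented, and only a small subset of the `event_value_dict` is used:

```python
import pandas as pd

# invented sample rows; the values come from a subset of the
# event_value_dict defined in the script above
event_values = {"Geburt": 1, "Studium": 11, "Ernennung": 40, "Tod": 100}
f = pd.DataFrame({
    "pers_name": ["Johann Caspar Abel"] * 3,
    "event_type": ["Tod", "Geburt", "Studium"],
    "event_start": ["1800-00-00", "1730-00-00", "1750-00-00"],
})

# map the classification onto each row, then sort chronologically-by-class
f["event_value"] = f["event_type"].map(event_values)
res_sorted = f.sort_values(by=["event_value", "event_start"])
print(res_sorted["event_type"].tolist())  # ['Geburt', 'Studium', 'Tod']
```

Events without a usable date still receive a stable position through their class value, which is what makes the experimental chronology reconstruction possible.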
Relationship tracer
py
# Script to trace relationships between people coded in factoid lists
# written for the DigiKAR geohumanities project by Monika Barget,
# USE CASES: analysing explicit relationships mentioned in rel_pers to reconstruct implicit / hidden connections
import csv
import pandas as pd
import numpy as np
import os
from collections import Counter
from itertools import combinations
from itertools import product
### STEP 1: READ DATA FROM SEVERAL EXCEL FILES IN FACTOID FORMAT
# Paths of locally synchronised EXCEL files for DigiKAR project
# all project members can adjust paths according to their own file structure
filenames="C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\InputLists"
# obligatory columns in valid factoid list
column_names = ["factoid_ID",
"pers_ID",
"pers_name",
"alternative_names",
"event_type",
"event_after-date",
"event_before-date",
"event_start",
"event_end",
"event_date",
"pers_title",
"pers_function",
"place_name",
"inst_name",
"rel_pers",
"source_quotations",
"additional_info",
"comment",
"info_dump",
"source",
"source_site"]
# read all excel files in directory as one data frame
frame_list = []
for item in os.listdir(filenames):
    file = os.path.join(filenames, item)
    df = pd.read_excel(file, sheet_name='FactoidList', header=0)
    frame_list.append(df)
f = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)
# read factoids from data frame
pers_f=(f[['pers_name']]) # retrieve data from selected column
pers_list=pers_f.values.tolist() # convert data frame to sorted list
pers_list_flat=[item for sublist in pers_list for item in sublist] # flatten list
pers_unique=pers_f.drop_duplicates() # remove duplicates
pers_unique_list=pers_unique.values.tolist() # write unique values to list
### STEP 2: DEFINE RELATIONSHIP MARKERS TO BE QUERIED
name_list=[]
qr=["Ehefrau: ",
"Ehemann: ",
"Vater: ",
"Mutter: ",
"Tochter: ",
"Sohn: ",
"Bruder: ",
"Schwester: ",
"Schwiegermutter: ",
"Schwiegervater: ",
"Schwiegertochter: ",
"Schwiegersohn: ",
"Schwager: ",
"Schwägerin: ",
"Großmutter: ",
"Großvater: ",
"GVm: ",
"GVv: ",
"GMm: ",
"GMv: ",
"Cousin: ",
"Cousine: ",
"Adoptivvater: ",
"Tante: ",
"Onkel: ",
"Patin: ",
"Pate: "]
### STEP 2: GET UNIQUE NAMES AND THEIR FREQUENCIES
print("\n\nYour factoid list contains", len(pers_f), "entries.") # count data in selected column
for i in [item for sublist in pers_unique_list for item in sublist]: # iterate over unique persons
    #print("\n", i, " / ", "Häufigkeit:", pers_list_flat.count(i), "\n") # print name and occurrences
    ### STEP 3: ITERATE THROUGH UNIQUE PERSONS TO FIND RELATIONSHIPS
    df_new = f.loc[f['pers_name'] == i]
    #print("There are", len(df_new), "entries associated with this name.")
    try:
        condlist = [df_new['rel_pers'].notnull()]
        choicelist = [df_new['rel_pers']]
        output_list = np.select(condlist, choicelist).tolist()
        unique_out_set = set(output_list) # convert to set to remove duplicates
        unique_out_list = list(unique_out_set)
        #print(unique_out_list)
        #print(len(unique_out_list))
        for u in unique_out_list:
            if u == 0:
                continue
            else:
                try:
                    results = u.split("/")
                except AttributeError:
                    results = u
                for r in results:
                    #print(r)
                    #print(type(r))
                    if "$" in r:
                        pers_inf = r.split("$", 1)
                        #print("COMPLETE PERSON:", pers_inf)
                        person = pers_inf[0]
                        info = pers_inf[1]
                    else:
                        person = r
                        info = "none"
                    #print("PERSON:", person, "INFO:", info)
                    try:
                        pers_res = person.split(":", 1)
                        function = pers_res[0]
                        name_ident = pers_res[1]
                    except IndexError:
                        function = "unknown"
                        name_ident = person
                    if "#" in name_ident:
                        name = name_ident.split("#")[0]
                        ident = name_ident.split("#")[1]
                    else:
                        name = name_ident
                        ident = "none"
                    name_list.append([i, function, name, ident, info])
    except AttributeError:
        pass
### STEP 4: CONVERT NAME LIST TO DF:
f2 = pd.DataFrame(name_list)
f2.columns=["i", "function", "name", "ident", "info"]
print(f2)
### STEP 4: RECONSTRUCT SIBLINGS:
siblings1=['Bruder', 'Schwester']
siblings2=['Sohn', 'Tochter']
df_sibling_brother = f2.loc[f2['function'] == siblings1[0]]
for brother in df_sibling_brother.iterrows():
    brother_current = brother[1].to_frame()
    brother_reversed = [brother_current.loc['name'].values, "brother", brother_current.loc['i'].values, "unknown", "no info"]
    name_list.append(brother_reversed)
df_sibling_sister = f2.loc[f2['function'] == siblings1[1]]
for sister in df_sibling_sister.iterrows():
    sister_current = sister[1].to_frame()
    sister_reversed = [sister_current.loc['name'].values, "sister", sister_current.loc['i'].values, "unknown", "no info"]
    name_list.append(sister_reversed)
for s2 in siblings2:
    df_sibling2 = f2.loc[f2['function'] == s2]
    for child in df_sibling2.iterrows():
        child_current = child[1].to_frame()
        child_reversed = [child_current.loc['name'].values, "parent", child_current.loc['i'].values, "unknown", "no info"]
        name_list.append(child_reversed)
    parents = df_sibling2['i'].unique() # unique names of parents in dataframe
    for parent in parents: # find children per parent
        df_family = df_sibling2[df_sibling2['i'] == parent]
        siblings_in_family = df_family['name'].unique() # unique names of siblings in family
        siblings_number = len(siblings_in_family) # number of siblings in family
        siblings_res = list(combinations(siblings_in_family, 2)) # find all possible sibling pairs
        for pair in siblings_res:
            if pair[0] == pair[1]:
                continue
            else:
                pair1 = [pair[0], "sibling", pair[1], "unknown", "no info"]
                name_list.append(pair1)
                pair2 = [pair[1], "sibling", pair[0], "unknown", "no info"]
                name_list.append(pair2)
### STEP 5: RECONSTRUCT GRANDPARENTS / GRANDCHILDREN:
grandchildren1=['Vater', 'Mutter']
grandchildren2=['Enkel', 'Enkelin']
grandchildren3=["GVm","GVv", "GMm", "GMv"]
for g1 in grandchildren1: # FIND GRANDPARENTS / GRANDCHILDREN via KNOWN PARENTS
    df_grandchildren1 = f2.loc[f2['function'] == g1]
    grandparents = df_grandchildren1['name'].unique() # unique names of parents = grandparents in data frame
    parents = df_grandchildren1['i'].unique() # unique names of children = parents in data frame
    for parent in parents:
        for s2 in siblings2:
            df_grandchildren4 = f2.loc[(f2['i'] == parent) & (f2['function'] == s2)]
            grandchildren = df_grandchildren4['name'].unique() # unique names of grandchildren in family
            grandparents_per_parent = df_grandchildren1.loc[df_grandchildren1['i'] == parent]
            grandparents_values = grandparents_per_parent['name'].unique() # unique names of grandparents per parent
            grandchildren_res = list(list(zip(grandparents_values, element))
                                     for element in product(grandchildren, repeat=len(grandparents_values)))
            grandchildren_flat = [x for y in grandchildren_res for x in y]
            #print(grandchildren_flat)
            # find all possible grandchild-grandparent pairs for each parent link
            for pair in grandchildren_flat:
                pair1 = [pair[0], "grandparent-grandchild", pair[1], "unknown", "no info"]
                name_list.append(pair1)
                pair2 = [pair[1], "grandparent-grandchild", pair[0], "unknown", "no info"]
                name_list.append(pair2)
for g2 in grandchildren2: # FIND GRANDPARENTS FOR KNOWN GRANDCHILDREN
    df_grandchildren2 = f2.loc[f2['function'] == g2]
    for gc2 in df_grandchildren2.iterrows():
        gc_current2 = gc2[1].to_frame()
        gc_reversed2 = [gc_current2.loc['name'].values, "grandchild", gc_current2.loc['i'].values, "unknown", "no info"]
        name_list.append(gc_reversed2)
    #print(df_grandchildren2)
for g3 in grandchildren3: # FIND GRANDCHILDREN FOR KNOWN GRANDPARENTS
    df_grandchildren3 = f2.loc[f2['function'] == g3]
    for gc3 in df_grandchildren3.iterrows():
        gc_current3 = gc3[1].to_frame()
        gc_reversed3 = [gc_current3.loc['name'].values, "grandchild", gc_current3.loc['i'].values, "unknown", "no info"]
        name_list.append(gc_reversed3)
### STEP 5: SAVE TO NEW FILE
#print(name_list)
#print(len(name_list))
df = pd.DataFrame(name_list)
resultpath='C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\Results'
res_filename=input("Enter your file name: ")
res_file=os.path.join(resultpath, res_filename + ".xlsx")
df.to_excel(res_file, index=True)
print("Done.")
Use cases
- consolidating genealogical information from the `rel_pers` column in factoid lists
- reconstruction of sibling relations based on known parent-child connections
- reconstruction of implicit grandparent-grandchild relations
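The parsing at the heart of the tracer splits each `rel_pers` entry on `: `, `#`, and `$`. A standalone sketch of that logic, with a hypothetical helper name and an invented sample entry:

```python
# Hypothetical helper mirroring the split logic of the tracer script:
# rel_pers entries look like "Vater: Name#ID$extra info", where the
# "#ID" and "$extra info" parts are optional.
def parse_relation(entry):
    person, _, info = entry.partition("$")            # split off extra info
    function, sep, name_ident = person.partition(": ")
    if not sep:                                       # no relationship marker present
        function, name_ident = "unknown", person
    name, _, ident = name_ident.partition("#")        # split off identifier
    return function, name.strip(), ident or "none", info or "none"

print(parse_relation("Vater: Johann Caspar Abel#Q123$laut RPh 267r"))
# ('Vater', 'Johann Caspar Abel', 'Q123', 'laut RPh 267r')
```

Entries without a marker fall back to the function `"unknown"`, just as the script's `except` branch does.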
Advanced script to query several columns across several spreadsheets based on user input
py
# Script to compare data from several EXCEL files
# written for the DigiKAR geohumanities project by Monika Barget,
# with kind support by https://stackoverflow.com/users/8479387/tlentali
# USE CASES: flexibly matching conditions across EXCEL columns
# check documentation for detailed description of script performance
# integration into GUI in progress
import csv
import pandas as pd
import numpy as np
import os
from collections import Counter
### STEP 1: READ DATA FROM SEVERAL EXCEL FILES IN FACTOID FORMAT
# Paths of locally synchronised EXCEL files for DigiKAR project
# all project members can adjust paths according to their own file structure
filenames="C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\InputLists"
# obligatory columns in valid factoid list
column_names = ["factoid_ID",
"pers_ID",
"pers_name",
"alternative_names",
"event_type",
"event_after-date",
"event_before-date",
"event_start",
"event_end",
"event_date",
"pers_title",
"pers_function",
"place_name",
"inst_name",
"rel_pers",
"source_quotations",
"additional_info",
"comment",
"info_dump",
"source",
"source_site"]
# read all excel files in directory as one data frame
frame_list = []
for item in os.listdir(filenames):
    file = os.path.join(filenames, item)
    df = pd.read_excel(file, sheet_name='FactoidList', header=0)
    frame_list.append(df)
f = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)
# read factoids from data frame
pers_f=(f[['pers_name']]) # retrieve data from selected column
pers_list=pers_f.values.tolist() # convert data frame to sorted list
pers_list_flat=[item for sublist in pers_list for item in sublist] # flatten list
pers_unique=pers_f.drop_duplicates() # remove duplicates
pers_unique_list=pers_unique.values.tolist() # write unique values to list
print("\n\nYour factoid list contains", len(pers_f), "entries.") # count data in selected column
#for i in [item for sublist in pers_unique_list for item in sublist]: # count person occurrences
#print("\n", i, " / ", "Häufigkeit:", pers_list_flat.count(i), "\n") # print name and occurrences
### STEP 2: LET USER SELECT SEARCH CRITERIA
# queried names
print("Enter person names, separated by commas, or @.")
q_name=input()
# queried year
print("Date(s):")
ex_year=input()
# select type of time processing
print("No date selected (0), exact dates (1), data range (2), BEFORE date (3) or AFTER date (4)?")
z=input()
# queried institution
print("Enter institutions, separated by commas, or @:")
q_inst=input()
# queried title
print("Enter person titles, separated by commas, or @:")
q_title=input()
# queried function
print("Enter person functions, separated by commas, or @:")
q_func=input()
# queried related person
print("Enter related persons, separated by commas, or @:")
q_rel=input()
# separate entries if lists of search terms have been submitted
qn=q_name.split(", ")
exy=ex_year.split(", ")
qi=q_inst.split(", ")
qt=q_title.split(", ")
qf=q_func.split(", ")
qr=q_rel.split(", ")
in_lists=[[qn], [qi], [qt], [qf], [qr]]
# handling date input with different frequencies from YYYY to YYYY-MM-DD
try:
    if len(exy[0]) == 4 and len(exy[1]) == 4:
        d1 = pd.Period(exy[0], freq="Y") # convert input to Period with year frequency
        end_date = int(exy[1]) + 1 # add one year to expand end range
        d2 = pd.Period(str(end_date), freq="Y") # convert input to Period with year frequency
    elif len(exy[0]) == 7 and len(exy[1]) == 7:
        d1 = pd.Period(exy[0], freq="M") # convert input to Period with month frequency
        d2 = pd.Period(exy[1], freq="M") # convert input to Period with month frequency
    elif len(exy[0]) == 10 and len(exy[1]) == 10:
        d1 = pd.Period(exy[0], freq="D") # convert input to Period with day frequency
        d2 = pd.Period(exy[1], freq="D") # convert input to Period with day frequency
    else:
        d1 = pd.Period(exy[0], freq="D") # convert input to Period with day frequency
        d2 = pd.Period(exy[1], freq="D") # convert input to Period with day frequency
except IndexError:
    pass
except ValueError:
    pass
### STEP 3: RUN QUERY BASED ON TIME CONDITIONS
result_df1 = pd.DataFrame()
result_df2 = pd.DataFrame()
## CASE 0: NO TIME SELECTED
if z == "0": # no dates
    if "@" in qn and "@" in qi and "@" in qt and "@" in qf and "@" in qr:
        print("No search criteria selected!")
    else:
        # define possible conditions and choices
        condlist = [(f['pers_name'].str.contains('|'.join(qn)) ^ f['pers_name'].isin(qn)),
                    f['inst_name'].isin(qi),
                    f['pers_title'].isin(qt),
                    f['pers_function'].isin(qf),
                    (f['rel_pers'].str.contains('|'.join(qr)) ^ f['rel_pers'].isin(qr))]
        choicelist = [f['pers_name'],
                      f['inst_name'],
                      f['pers_title'],
                      f['pers_function'],
                      f['rel_pers']]
        output = np.select(condlist, choicelist)
        rows = np.where(output)
        new_array = f.to_numpy()
        result_array = new_array[rows]
        print(result_array)
## CASE 1: SEARCHING FOR EXACT DATES
elif z == "1": # get exact dates
    print("Searching for dates:", exy)
    df_list = []
    for x in exy:
        # pick the Period frequency from the input length
        if len(x) == 4:
            freq = "Y" # year frequency
        elif len(x) == 7:
            freq = "M" # month frequency
        else:
            freq = "D" # day frequency
        date_searched = pd.Period(x, freq=freq)
        for n in range(0, len(pers_f)):
            try:
                if date_searched == pd.Period(f['event_start'].iloc[n], freq=freq):
                    f_match = f.iloc[[n]]
                    df_list.append(f_match)
                else:
                    #print("Outside time frame.") # optional for data cleaning
                    continue
            except ValueError:
                #print(ValueError.args, ":", f['event_start'].iloc[n]) # optional for data cleaning
                pass
    f_new = pd.concat(df_list, axis=0, ignore_index=False, sort=False)
    try:
        # define possible conditions and choices
        condlist = [f_new['pers_name'].str.contains('|'.join(qn)) ^ f_new['pers_name'].isin(qn),
                    f_new['inst_name'].isin(qi),
                    f_new['pers_title'].isin(qt),
                    f_new['pers_function'].isin(qf),
                    f_new['rel_pers'].str.contains('|'.join(qr)) ^ f_new['rel_pers'].isin(qr)]
        choicelist = [f_new['pers_name'],
                      f_new['inst_name'],
                      f_new['pers_title'],
                      f_new['pers_function'],
                      f_new['rel_pers']]
        output = np.select(condlist, choicelist)
        rows = np.where(output)
        new_array = f_new.to_numpy()
        result_array = new_array[rows]
    except ValueError:
        #print(ValueError.args, ":", f['event_start'].iloc[n])
        pass
## CASE 2: SEARCHING FOR DATE RANGE
elif z == "2": # get date range
    print("Searching for date range between", d1, "and", d2, "!")
    df_list = []
    for n in range(0, len(pers_f)):
        try:
            if d1 <= pd.Period(f['event_start'].iloc[n], freq="Y") <= d2:
                f_match = f.iloc[[n]]
                df_list.append(f_match)
            else:
                #print("Outside time frame.")
                continue
        except ValueError:
            #print(ValueError.args, ":", f['event_start'].iloc[n])
            pass
    f_new = pd.concat(df_list, axis=0, ignore_index=False, sort=False)
    # define possible conditions and choices
    if "@" in qn and "@" in qi and "@" in qt and "@" in qf and "@" in qr:
        result_df1 = f_new
    else:
        condlist = [(f_new['pers_name'].str.contains('|'.join(qn)) ^ f_new['pers_name'].isin(qn)),
                    f_new['inst_name'].isin(qi),
                    f_new['pers_title'].isin(qt),
                    f_new['pers_function'].isin(qf),
                    f_new['rel_pers'].str.contains('|'.join(qr)) ^ f_new['rel_pers'].isin(qr)]
        choicelist = [f_new['pers_name'],
                      f_new['inst_name'],
                      f_new['pers_title'],
                      f_new['pers_function'],
                      f_new['rel_pers']]
        output = np.select(condlist, choicelist)
        rows = np.where(output)
        new_array = f_new.to_numpy()
        result_array = new_array[rows]
## CASE 3: SEARCHING FOR DATES BEFORE
elif z == "3": # get dates before
    print("Searching for dates before", pd.Period(exy[0], freq="D"), "!")
    nf = pd.DataFrame(columns=column_names)
    for n in range(0, len(pers_f)):
        try:
            if pd.Period(f['event_start'].iloc[n], freq="D") <= pd.Period(exy[0], freq="D"):
                nf = pd.concat([nf, f.iloc[[n]]], ignore_index=False, sort=False)
            elif pd.Period(f['event_before-date'].iloc[n], freq="D") <= pd.Period(exy[0], freq="D"):
                nf = pd.concat([nf, f.iloc[[n]]], ignore_index=False, sort=False)
            else:
                continue
        except ValueError:
            pass
    # define possible conditions and choices
    if "@" in qn and "@" in qi and "@" in qt and "@" in qf and "@" in qr:
        result_df1 = nf
    else:
        condlist = [(nf['pers_name'].str.contains('|'.join(qn)) ^ nf['pers_name'].isin(qn)),
                    nf['inst_name'].isin(qi),
                    nf['pers_title'].isin(qt),
                    nf['pers_function'].isin(qf),
                    nf['rel_pers'].str.contains('|'.join(qr)) ^ nf['rel_pers'].isin(qr)]
        choicelist = [nf['pers_name'],
                      nf['inst_name'],
                      nf['pers_title'],
                      nf['pers_function'],
                      nf['rel_pers']]
        output = np.select(condlist, choicelist)
        rows = np.where(output)
        new_array = nf.to_numpy()
        result_array = new_array[rows]
## CASE 4: SEARCHING FOR DATES AFTER
elif z == "4": # get dates after
    print("Searching for dates after", pd.Period(exy[0], freq="D"), "!")
    nf = pd.DataFrame(columns=column_names)
    for n in range(0, len(pers_f)):
        try:
            if pd.Period(f['event_start'].iloc[n], freq="D") >= pd.Period(exy[0], freq="D"):
                nf = pd.concat([nf, f.iloc[[n]]], ignore_index=False, sort=False)
            elif pd.Period(f['event_after-date'].iloc[n], freq="D") >= pd.Period(exy[0], freq="D"):
                nf = pd.concat([nf, f.iloc[[n]]], ignore_index=False, sort=False)
            else:
                continue
        except ValueError:
            pass
    # define possible conditions and choices
    if "@" in qn and "@" in qi and "@" in qt and "@" in qf and "@" in qr:
        result_df1 = nf
    else:
        condlist = [(nf['pers_name'].str.contains('|'.join(qn)) ^ nf['pers_name'].isin(qn)),
                    nf['inst_name'].isin(qi),
                    nf['pers_title'].isin(qt),
                    nf['pers_function'].isin(qf),
                    nf['rel_pers'].str.contains('|'.join(qr)) ^ nf['rel_pers'].isin(qr)]
        choicelist = [nf['pers_name'],
                      nf['inst_name'],
                      nf['pers_title'],
                      nf['pers_function'],
                      nf['rel_pers']]
        output = np.select(condlist, choicelist)
        rows = np.where(output)
        new_array = nf.to_numpy()
        result_array = new_array[rows]
else:
print("INVALID TIME OPERATOR!")
### STEP 4: convert result array into a dataframe
try:
    result_df2 = pd.DataFrame(result_array)
    result_df = pd.concat([result_df1, result_df2], axis=1, ignore_index=False, sort=False)
    #print("Elements found:", result_df[4]) # optional for data cleaning
    # save to xlsx file
    resultpath = 'C:\\Users\\mobarget\\Documents\\Seafile\\DigiKAR_DATEN\\Python\\Results'
    res_filename = input("Enter your file name: ")
    res_file = os.path.join(resultpath, res_filename + ".xlsx")
    result_df.to_excel(res_file, index=True)
    print("Done.")
except NameError as er:
    print(er.args)
    print("No results retrieved! Try again.")
- Files from the directory `DigiKAR_DATEN\\Python\\InputLists` are read into a single dataframe.
- Users can select search criteria for person name, date, institution, person title, function, and related person. Several keywords per field are accepted.
- Users can decide whether to search for exact dates, a date range, or "before" / "after" dates.
- The script handles the time input first and writes matches to a new data frame; the conditions for the remaining fields are then applied. The current version of the script applies an OR operator to the fields. In the final GUI version, users will be able to select AND as well as OR operations.
- Results are written to a new file. Users are free to assign a file name via script input.
- Time management handles `YYYY` as well as `YYYY-MM` and `YYYY-MM-DD` inputs. When searching for dates before or after the input date, the `before` and `after` columns in the original spreadsheet are considered alongside start and end dates.
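The length-based frequency handling described above can be condensed into a small helper (the function name and the dates below are invented for illustration; note that `pandas.Period` raises on comparisons between incompatible frequencies, so both sides of a comparison should use the same input granularity):

```python
import pandas as pd

# Sketch of the date handling: the input length selects the Period
# frequency (4 chars = YYYY, 7 = YYYY-MM, 10 = YYYY-MM-DD), and Periods
# of equal frequency compare chronologically.
def to_period(text):
    freq = {4: "Y", 7: "M", 10: "D"}.get(len(text), "D")
    return pd.Period(text, freq=freq)

d1, d2 = to_period("1760"), to_period("1765")
print(d1 <= to_period("1762") <= d2)  # True
```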
Use cases
- template for creating more complex Boolean queries across several spreadsheets
- adjustable to different spreadsheet formats
- experimenting with data parsing and potential problems arising from early modern data
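All four query cases share the same condlist/choicelist pattern, which can be tried in isolation. The rows and query terms below are invented:

```python
import numpy as np
import pandas as pd

# Invented sample rows and queries illustrating the row-matching pattern:
# np.select fills non-matching rows with the default 0, and np.where
# recovers the indices of rows that matched any condition.
f = pd.DataFrame({
    "pers_name": ["Johann Caspar Abel", "Andreas Abley"],
    "inst_name": ["Universität Mainz", "Universität Erfurt"],
})
qn = ["Abel"]                # queried name fragments
qi = ["Universität Trier"]   # queried institutions

condlist = [f["pers_name"].str.contains("|".join(qn)), f["inst_name"].isin(qi)]
choicelist = [f["pers_name"], f["inst_name"]]
output = np.select(condlist, choicelist)  # default 0 where nothing matches
rows = np.where(output)[0]
print(f.to_numpy()[rows])  # only the row for "Johann Caspar Abel"
```

Because the conditions are ORed (any match selects the row), adding an AND mode later only requires combining the boolean Series with `&` instead of stacking them in `condlist`.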