In this project, we aim to identify the effects of local measures taken against the COVID-19 pandemic and which measures work best in which district/region.
To have a suitable data basis, we first aggregate data about the structural details of each district (Landkreis/kreisfreie Stadt). The data is based on a combination of https://www.landatlas.de/, https://de.wikipedia.org/wiki/Liste_der_Landkreise_in_Deutschland and the COVID-19 data of the RKI.
The general strategy is:
Special add-on:
Please contact Alexander Blaessle (alexander.blaessle@gmail.com) for more details.
%load_ext autoreload
%autoreload 2
First, we load the required modules. complexheatmap is my own package and needs to be installed manually.
# Standard
import os
# Pandas/NP
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import complexheatmap as chm  # custom package providing annotated clustermaps and distance utilities
# ML
import sklearn.decomposition as skdc
import umap
We now read in the data from Landatlas. This data was already sanitized in previous steps.
data_dir='../data/'
data=pd.read_csv(os.path.join(data_dir,'landatlas_sanitized_v001.csv'))
data=data.rename(columns={'Unnamed: 0': 'district_id'})
There are cells with the value -99, which seem to have been entered where there is no data coverage, so we treat them as NaN. Later on, we will filter out features with poor coverage so that they do not distort the clustering.
data=data.replace(to_replace=-99,value=np.nan)
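As a quick diagnostic (not part of the original notebook), we can check which columns are most affected by missing values after mapping -99 to NaN:
# Columns with the most missing values after the -99 replacement
print(data.isna().sum().sort_values(ascending=False).head(10))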
I downloaded an additional table from Wikipedia to align the districts with their regions/Bundeslaender.
wik=pd.read_csv(os.path.join(data_dir,'LK_wiki.csv'),sep='\t',index_col=None)
# Get rid of unnecessary data
wik=wik.drop(columns='Karte')
# Get proper index and merge
wik.index=wik['KrS']
# Merge and fill in gaps
data.index=data['district_id']
data=pd.concat([data,wik],axis=1)
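Before filling gaps, it is useful to check how many districts were not matched by the Wikipedia table (a quick check I added, not in the original notebook):
# How many districts did not get a Bundesland assigned by the merge?
print(pd.isna(data['BL']).sum(), 'districts without BL information')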
There was no data for a number of districts, which turn out to be kreisfreie Staedte. For now, we mark their Bundesland as unknown.
data.loc[pd.isna(data['BL']),'BL']='unknown'
data.head()
We will also add the current COVID-19 data of the RKI. This has likewise been cleaned previously.
rki=pd.read_csv(os.path.join(data_dir,'RKI_COVID19_clean_new+total_v003.csv'))
We first add the newly infected cases.
# Get newly infected as matrix
new_inf=rki.pivot(index='districtId',values='new_infected',columns='date')
# Show as clustermap
sns.clustermap(new_inf,col_cluster=False,cmap='coolwarm')
# Add new infected to data frame
new_inf.columns=['ninf_%s'%str(col) for col in new_inf.columns.values]
data=pd.concat([data,new_inf],axis=1)
data=data[~pd.isna(data['district_id'])]
We do the same for the total infected.
tot_inf=rki.pivot(index='districtId',values='total_infected',columns='date')
# Show as clustermap
sns.clustermap(tot_inf,col_cluster=False,cmap='coolwarm')
# Add total infected to data frame
tot_inf.columns=['tinf_%s'%str(col) for col in tot_inf.columns.values]
data=pd.concat([data,tot_inf],axis=1)
data=data[~pd.isna(data['district_id'])]
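The pivot, prefix and merge pattern above is repeated for both count types. A small helper like the following (a sketch of my own; the function name and signature are not part of the original notebook) would avoid the duplication:
def add_pivoted_counts(data, rki, value_col, prefix):
    """Pivot an RKI count column into one column per date, prefix the columns, and merge into data."""
    piv = rki.pivot(index='districtId', values=value_col, columns='date')
    piv.columns = ['%s_%s' % (prefix, str(col)) for col in piv.columns.values]
    merged = pd.concat([data, piv], axis=1)
    # Drop rows that only exist on the RKI side of the merge
    return merged[~pd.isna(merged['district_id'])]
# Usage would then be:
# data = add_pivoted_counts(data, rki, 'new_infected', 'ninf')
# data = add_pivoted_counts(data, rki, 'total_infected', 'tinf')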
There are a number of features with poor data coverage, so we remove them right away.
# Find districts with least data coverage
data['n_nan']=pd.isna(data.values).sum(axis=1)
data=data[~pd.isna(data['district_id'])]
# Filter all columns that have more than 5% of nan values
cols_keep=[]
cutoff=0.05*data.shape[0]
for col in data.columns.values:
    if not pd.isna(data[col].values).sum()>cutoff:
        cols_keep.append(col)
    # print(col,pd.isna(data[col].values).sum())
data.head()
data=data[cols_keep]
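For reference, the same column filter can also be written as a one-liner in pandas (equivalent to the loop above; left commented out so the filter is not applied twice):
# Keep only columns with at most 5% missing values
# data = data.loc[:, data.isna().mean() <= 0.05]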
We can now perform a hierarchical clustering with respect to the structural data. For this, we first extract the structural data.
# Extract relevant data values
data.index=data['district_id']
mat=data.loc[:,~data.columns.str.contains('district|BL|tinf|ninf|Thuenen-Typologie')]
# Fill NaNs with mean so we can do PCA/heatmap
mat=mat.fillna(mat.mean())
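Note that the PCA performed further below is sensitive to feature scale. If that matters, the features could be z-scored first; this optional step is not part of the original analysis and is left commented out:
# Optional: z-score the features so that PCA is not dominated by columns with large numeric ranges
# mat = (mat - mat.mean()) / mat.std()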
We can now compute the pairwise distances between districts based on the Pearson correlation.
# Do PWDs
pwd=chm.utils.pw_dist(mat)
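If complexheatmap is not available, a correlation-based distance can be computed with pandas alone. This is a sketch that assumes pw_dist returns 1 minus the pairwise Pearson correlation between districts, which matches the description above but may differ in detail from the package implementation:
# Hypothetical stand-in for chm.utils.pw_dist: 1 - Pearson correlation between districts
# pwd = 1 - mat.T.corr(method='pearson')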
def classify_cont_col(data,col,n):
    """Classifies a continuous feature into n bins."""
    bins=np.linspace(0.99*data[col].values.min(),1.01*data[col].values.max(),n+1)
    dig=np.digitize(data[col].values, bins, right=False)
    data['%s_dig'%col]=dig
    return data
To simplify the illustration later, we divide up some key features into classes.
data=classify_cont_col(data,'Krankenhausbetten',5)
data=classify_cont_col(data,'tinf_2020-03-18',10)
data=classify_cont_col(data,'ninf_2020-03-18',10)
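As a quick sanity check (not in the original notebook), we can look at how the districts distribute over one of the resulting bins:
# Distribution of districts over the hospital-bed bins
print(data['Krankenhausbetten_dig'].value_counts().sort_index())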
cg=chm.hm.clustermap(pwd,meta=data,color=['Thuenen-Typologie','BL','Krankenhausbetten_dig','tinf_2020-03-18_dig','ninf_2020-03-18_dig'],figsize=(30,30),cmap='coolwarm')
# cg.ax_heatmap.set_xticks(range(len()))
_=cg.ax_heatmap.set_xticklabels(data['district_name'])
_=cg.ax_heatmap.set_yticklabels(data['district_name'])
The heatmap shows us a few results. First, we can see that the correlation values are generally high, so except for a few districts, most districts are fairly similar.
Moreover, we see a number of clusters with Thuenen-Typologie=5, indicating that highly developed areas cluster together.
We also see that this coincides with BL=unknown, indicating that most of them are kreisfreie Staedte.
As expected, these are also regions with a lot of hospital beds.
However, because of the lack of data, we cannot see any trends in terms of COVID-19 distribution yet.
We can also perform a PCA and see whether any interesting clusters emerge.
pca=skdc.PCA()
embed=pca.fit_transform(mat)
data['pca_1']=embed[:,0]
data['pca_2']=embed[:,1]
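A quick diagnostic (not part of the original notebook) is to check how much variance the two plotted components actually capture:
# Fraction of variance explained by the first two principal components
print(pca.explained_variance_ratio_[:2].sum())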
def cat_scatter(data,col,x,y,ax=None,figsize=(8,8)):
    """Categorical scatter plot: one color per unique value of col."""
    if ax is None:
        fig,ax=plt.subplots(1,figsize=figsize)
    for t in data[col].unique():
        tmp=data[data[col]==t]
        ax.scatter(tmp[x],tmp[y],label='%s = %s'%(str(col),str(t)))
    ax.legend()
    return ax
def cont_scatter(data,col,x,y,ax=None,cmap='coolwarm',figsize=(8,8)):
    """Continuous scatter plot: points colored by the value of col."""
    if ax is None:
        fig,ax=plt.subplots(1,figsize=figsize)
    sc=ax.scatter(data[x],data[y],c=data[col],cmap=cmap)
    # A colorbar is more informative than a legend for a continuous color scale
    ax.figure.colorbar(sc,ax=ax,label=col)
    return ax
cat_scatter(data,'Thuenen-Typologie','pca_1','pca_2')
cat_scatter(data,'BL','pca_1','pca_2')
What is really interesting is that most of the data points cluster together, and only the kreisfreie Staedte are clearly different.
A UMAP projection is another way to project multidimensional data into human-viewable plots.
A nice property of UMAP projections is that they respect both local and global structure.
embed = umap.UMAP(n_neighbors=5,
                  min_dist=0.3,
                  metric='correlation').fit_transform(mat)
data['umap_1']=embed[:,0]
data['umap_2']=embed[:,1]
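UMAP is stochastic, so repeated runs give slightly different embeddings. For reproducibility one could pass a fixed random_state (not set in the original run):
# embed = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation',
#                   random_state=42).fit_transform(mat)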
ax=cat_scatter(data,'Thuenen-Typologie','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')
ax=cat_scatter(data,'BL','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')
We see again that the kreisfreie Staedte show the most spread in the projection, indicating that rural districts are much more similar to each other than the cities are.
ax=cont_scatter(data,'tinf_2020-03-18','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')
Looking at the latest numbers of COVID-19 infections, however, there does not seem to be a trend yet; more data is needed.
There are a number of interesting insights coming from this analysis, which are probably well known to anyone actually working in this field (such as cities forming clusters while being more spread out than other regions). However, this data can be used to identify districts and regions that are similar, in order to then follow up with a positive deviance analysis.
data.to_csv(os.path.join(data_dir,'data_summarized.csv'))
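As a final check (not part of the original notebook), the summary file can be read back in to verify that it round-trips cleanly:
# Read the summary back and compare shapes
check = pd.read_csv(os.path.join(data_dir,'data_summarized.csv'), index_col=0)
print(check.shape, data.shape)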