District structure and the corona pandemic

In this project, we aim to identify the effects of local measures taken against the Covid-19 pandemic and to find out which measures work best in which district/region.

To have a suitable data basis, we first aggregate data about the structural details of each district (LK/Kreisfreie Stadt). The data is a combination of https://www.landatlas.de/, https://de.wikipedia.org/wiki/Liste_der_Landkreise_in_Deutschland and the Covid-19 data of the RKI.

The general strategy is:

  • Assemble all data.
  • Hierarchical clustering: do certain regions group together, and why?
  • Does the Covid-19 pandemic behave differently across the clusters?

Special add-on:

  • What are the measures taken by local/regional governments and the social media response, and how do these affect the spread of the virus? (TBD; right now, we DO NOT have enough data for this.)

Please contact Alexander Blaessle (alexander.blaessle@gmail.com) for more details.

In [1]:
%load_ext autoreload
%autoreload 2

Import modules

First, we load a bunch of modules. complexheatmap is my own package and needs to be installed manually.

In [2]:
# Standard
import os

# Pandas/NP
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import complexheatmap as chm

# ML
import sklearn.decomposition as skdc
import umap

Read and clean data

We now read in the data from landatlas. This data was already sanitized in previous steps.

In [3]:
data_dir='../data/'

data=pd.read_csv(os.path.join(data_dir,'landatlas_sanitized_v001.csv'))
In [4]:
data=data.rename(columns={'Unnamed: 0': 'district_id'})

There are cells with the value -99, which seems to have been entered where there was no data coverage, so we treat these as NaN. Later, we will filter such features so that our clustering is not distorted by bad features.

In [5]:
data=data.replace(to_replace=-99,value=np.nan)
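On a toy column (made-up values, not the notebook's actual data), replacing the sentinel with NaN makes summary statistics skip the missing entries:

```python
import pandas as pd
import numpy as np

# Toy column containing the -99 "no coverage" sentinel
s = pd.Series([21.1, -99, 18.2])
s = s.replace(-99, np.nan)

# The mean now averages only the two valid entries
mean_valid = s.mean()
```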

Read in additional data about districts and add to dataframe

I downloaded an extra table from Wikipedia to align the districts with regions/Bundeslaender.

In [6]:
wik=pd.read_csv(os.path.join(data_dir,'LK_wiki.csv'),sep='\t',index_col=None)
In [7]:
# Get rid of unnecessary data
wik=wik.drop(columns='Karte')

# Get proper index and merge
wik.index=wik['KrS']
In [8]:
# Merge and fill in gaps
data.index=data['district_id']
data=pd.concat([data,wik],axis=1)
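pd.concat with axis=1 aligns rows on the index, so districts missing from either frame get NaN values. A minimal sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Two frames sharing district IDs as index; district 1002 is
# missing from the right-hand (Wikipedia-style) table
left = pd.DataFrame({'Minijob': [21.1, 20.0]}, index=[1001, 1002])
right = pd.DataFrame({'BL': ['SH']}, index=[1001])

# Index-aligned merge: missing rows are filled with NaN
merged = pd.concat([left, right], axis=1)
```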

There was no data for a bunch of districts, which turn out to be Kreisfreie Staedte. For now, we mark their Bundesland as unknown.

In [9]:
data.loc[pd.isna(data['BL']),'BL']='unknown'
In [10]:
data.head()
Out[10]:
district_id Minijob Zahnarzt Krankenhausbetten Hausarzt_je_EW Grundsicherung Soziooekonomische_Lage Breitband Baulandpreise Facharzt ... Krippenkinder district_name KrS Landkreis/Kreis UZ BL Kreissitz Einw. Fl. Bev.
1001 1001.0 21.1 2.625000 101.9 66.1 52.9 NaN 99.7 105.6 5.111402 ... 35.2 SK Flensburg NaN NaN NaN unknown NaN NaN NaN NaN
1002 1002.0 20.0 2.265833 93.2 65.0 53.7 NaN 100.0 194.7 5.206772 ... 34.9 SK Kiel NaN NaN NaN unknown NaN NaN NaN NaN
1003 1003.0 18.2 4.010000 80.2 62.0 54.1 NaN 99.0 120.7 7.588854 ... 35.0 SK Lübeck NaN NaN NaN unknown NaN NaN NaN NaN
1004 1004.0 19.1 3.045833 87.4 55.4 42.4 NaN 98.4 89.2 6.807450 ... 32.8 SK Neumünster NaN NaN NaN unknown NaN NaN NaN NaN
1051 1051.0 24.9 7.313333 53.0 61.8 19.5 0.64 17.9 37.1 16.465871 ... 20.1 LK Dithmarschen 1051.0 Dithmarschen HEI, MED SH SH Heide 133.210 1.428,13 93.0

5 rows × 64 columns

Read in RKI data and add to data frame

We will also add the current covid-19 data of the RKI. This has also been cleaned previously.

In [11]:
rki=pd.read_csv(os.path.join(data_dir,'RKI_COVID19_clean_new+total_v003.csv'))

Newly infected

We first add the newly infected.

In [12]:
# Get newly infected as matrix
new_inf=rki.pivot(index='districtId',values='new_infected',columns='date')
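What pivot does here, sketched on made-up values: long-format records become a districts × dates matrix.

```python
import pandas as pd

# Long-format records, as in the cleaned RKI table (toy values)
rki_toy = pd.DataFrame({
    'districtId': [1001, 1001, 1002, 1002],
    'date': ['2020-03-17', '2020-03-18', '2020-03-17', '2020-03-18'],
    'new_infected': [3, 5, 0, 2],
})

# One row per district, one column per date
mat = rki_toy.pivot(index='districtId', columns='date', values='new_infected')
```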
In [13]:
# Show as clustermap
sns.clustermap(new_inf,col_cluster=False,cmap='coolwarm')
Out[13]:
<seaborn.matrix.ClusterGrid at 0x7f9ee164d1d0>
In [14]:
# Add new infected to data frame
new_inf.columns=['ninf_%s'%str(col) for col in new_inf.columns.values]

data=pd.concat([data,new_inf],axis=1)

data=data[~pd.isna(data['district_id'])]

Total infected

And also the total infected

In [15]:
tot_inf=rki.pivot(index='districtId',values='total_infected',columns='date')
In [16]:
# Show as clustermap
sns.clustermap(tot_inf,col_cluster=False,cmap='coolwarm')
Out[16]:
<seaborn.matrix.ClusterGrid at 0x7f9ee1640ad0>
In [17]:
# Add total infected to data frame
tot_inf.columns=['tinf_%s'%str(col) for col in tot_inf.columns.values]

data=pd.concat([data,tot_inf],axis=1)

data=data[~pd.isna(data['district_id'])]

Remove bad features

There are a bunch of features with poor coverage, so we remove them right away.

In [18]:
# Find districts with least data coverage
data['n_nan']=pd.isna(data.values).sum(axis=1)
In [19]:
data=data[~pd.isna(data['district_id'])]
In [20]:
# Filter all columns that have more than 5% of nan values
cols_keep=[]
cutoff=0.05*data.shape[0]
for col in data.columns.values:
    if not pd.isna(data[col].values).sum()>cutoff:
        cols_keep.append(col)
#     print(col,pd.isna(data[col].values).sum())
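The same column filter can be written compactly with pandas' isna/sum; a toy sketch using a 50% cutoff instead of 5% so the effect is visible on only four rows:

```python
import pandas as pd
import numpy as np

# Toy frame: column 'b' has too many NaNs and should be dropped
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [np.nan, np.nan, np.nan, 4],
                   'c': [1, 2, np.nan, np.nan]})

# Keep columns with at most 50% missing values
cutoff = 0.5 * df.shape[0]
cols_keep = [c for c in df.columns if df[c].isna().sum() <= cutoff]
```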
In [21]:
data.head()
Out[21]:
district_id Minijob Zahnarzt Krankenhausbetten Hausarzt_je_EW Grundsicherung Soziooekonomische_Lage Breitband Baulandpreise Facharzt ... tinf_2020-03-10 tinf_2020-03-11 tinf_2020-03-12 tinf_2020-03-13 tinf_2020-03-14 tinf_2020-03-15 tinf_2020-03-16 tinf_2020-03-17 tinf_2020-03-18 n_nan
1001 1001.0 21.1 2.625000 101.9 66.1 52.9 NaN 99.7 105.6 5.111402 ... 0 0 0 0 4 4 4 4 6 12
1002 1002.0 20.0 2.265833 93.2 65.0 53.7 NaN 100.0 194.7 5.206772 ... 3 7 11 12 13 13 18 20 25 8
1003 1003.0 18.2 4.010000 80.2 62.0 54.1 NaN 99.0 120.7 7.588854 ... 2 7 8 9 9 10 10 12 17 8
1004 1004.0 19.1 3.045833 87.4 55.4 42.4 NaN 98.4 89.2 6.807450 ... 0 0 0 0 0 0 1 1 1 12
1051 1051.0 24.9 7.313333 53.0 61.8 19.5 0.64 17.9 37.1 16.465871 ... 1 1 1 1 1 6 6 6 10 4

5 rows × 167 columns

In [22]:
data=data[cols_keep]

Hierarchical clustering of districts/Heatmaps

We can now perform a hierarchical clustering with respect to the structural data. For this, we first extract the structural data.

In [23]:
# Extract relevant data values
data.index=data['district_id']
mat=data.loc[:,~data.columns.str.contains('district|BL|tinf|ninf|Thuenen-Typologie')]
In [24]:
# Fill NaNs with mean so we can do PCA/heatmap
mat=mat.fillna(mat.mean())

We can now compute the pairwise distances between districts based on the Pearson correlation.

In [25]:
# Do PWDs
pwd=chm.utils.pw_dist(mat)
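complexheatmap's pw_dist is a custom helper; assuming it computes one minus the Pearson correlation between district rows (an assumption about the package, not a documented fact), an equivalent sketch in plain pandas is:

```python
import numpy as np
import pandas as pd

# Toy feature matrix: 4 districts x 3 structural features
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((4, 3)), index=['A', 'B', 'C', 'D'])

# Pairwise distance: 1 - Pearson correlation between rows
# (DataFrame.corr works column-wise, hence the transpose)
pwd_toy = 1 - toy.T.corr()
```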
In [26]:
def classifiy_cont_col(data,col,n):
    
    """Classifies a continuous feature into n bins."""
    
    # Widen the range slightly so min/max values fall inside the outer bins
    bins=np.linspace(0.99*data[col].values.min(),1.01*data[col].values.max(),n+1)
    dig=np.digitize(data[col].values, bins, right=False)
    data['%s_dig'%col]=dig
    return data

To simplify the illustration later, we divide some key features into classes.

In [27]:
data=classifiy_cont_col(data,'Krankenhausbetten',5)

data=classifiy_cont_col(data,'tinf_2020-03-18',10)
data=classifiy_cont_col(data,'ninf_2020-03-18',10)
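How the binning behaves on toy values: np.digitize returns, for each value, the 1-based index of the bin it falls into (1-based because the widened range keeps all values inside the bin edges).

```python
import numpy as np

vals = np.array([1.0, 2.5, 4.0, 5.0])
n = 5

# Same scheme as classifiy_cont_col: n+1 edges spanning a slightly
# widened range give n bins
bins = np.linspace(0.99 * vals.min(), 1.01 * vals.max(), n + 1)
classes = np.digitize(vals, bins, right=False)
```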

Final clustering

In [28]:
cg=chm.hm.clustermap(pwd,meta=data,color=['Thuenen-Typologie','BL','Krankenhausbetten_dig','tinf_2020-03-18_dig','ninf_2020-03-18_dig'],figsize=(30,30),cmap='coolwarm')

# cg.ax_heatmap.set_xticks(range(len()))
_=cg.ax_heatmap.set_xticklabels(data['district_name'])
_=cg.ax_heatmap.set_yticklabels(data['district_name'])
/home/alexander/anaconda3/lib/python3.7/site-packages/seaborn/matrix.py:603: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  metric=self.metric)

The heatmap shows us a few results. First, we can see that the correlation values are high, so except for a few districts, most districts are fairly similar.

Moreover, we see that there are a bunch of clusters with Thuenen-Typologie=5, indicating that highly developed areas cluster together.

We also see that this coincides with BL=unknown, indicating that most of them are Kreisfreie Staedte.

As expected, these are also regions with many hospital beds.

However, because of the lack of data, we cannot see any trends in terms of Covid-19 distribution yet.

PCA

We can also perform a PCA and see if there are any interesting clusters popping up.

In [29]:
pca=skdc.PCA()
In [30]:
embed=pca.fit_transform(mat)
In [31]:
data['pca_1']=embed[:,0]
data['pca_2']=embed[:,1]
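Before reading too much into the first two components, it is worth checking how much variance they actually explain; a minimal sketch on random data (not the district data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca_toy = PCA()
emb = pca_toy.fit_transform(X)

# Ratios sum to 1 over all components; the first two tell us
# how representative a 2D scatter of the embedding is
var_first_two = pca_toy.explained_variance_ratio_[:2].sum()
```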
In [32]:
def cat_scatter(data,col,x,y,ax=None,figsize=(8,8)):
    
    """Categrocial scatter plot"""
    
    if ax is None:
        fig,ax=plt.subplots(1,figsize=figsize)
        
    for i,t in enumerate(data[col].unique()):
        tmp=data[data[col]==t]
        ax.scatter(tmp[x],tmp[y],label='%s = %s'%(str(col),str(t)))
    ax.legend()
    
    return ax
In [33]:
def cont_scatter(data,col,x,y,ax=None,cmap='coolwarm',figsize=(8,8)):
    
    """Continuous scatter plot."""
    
    if ax is None:
        fig,ax=plt.subplots(1,figsize=figsize)
    
    # Use a colorbar instead of a legend, since the color encodes a continuous value
    sc=ax.scatter(data[x],data[y],c=data[col],cmap=cmap)
    plt.colorbar(sc,ax=ax,label=col)
    
    return ax
In [34]:
cat_scatter(data,'Thuenen-Typologie','pca_1','pca_2')    
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9edd314a90>
In [35]:
cat_scatter(data,'BL','pca_1','pca_2')    
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9edd2db390>

What is really interesting is that most of the data points cluster together; only the Kreisfreie Staedte are markedly different.

UMAP

A UMAP projection is another way to project multidimensional data into human-viewable plots.

A nice property of UMAP projections is that they respect both local and global structure.

In [36]:
embed = umap.UMAP(n_neighbors=5,
                      min_dist=0.3,
                      metric='correlation').fit_transform(mat)
In [37]:
data['umap_1']=embed[:,0]
data['umap_2']=embed[:,1]
In [42]:
ax=cat_scatter(data,'Thuenen-Typologie','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')
In [43]:
ax=cat_scatter(data,'BL','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')

We see again that the kreisfreie Staedte show the most spread in the projection, indicating that the countryside is much more comparable than the cities.

In [40]:
ax=cont_scatter(data,'tinf_2020-03-18','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')

Looking at the latest numbers of Covid-19 infections, however, there does not seem to be a trend yet; more data is needed.

Summary

There are a bunch of interesting insights coming from this analysis, which are probably well known to anyone actually working in this field (such as cities forming clusters while being more spread out than other regions). However, this data can be used to identify districts and regions that are similar, and then follow up with a positive deviance analysis.

In [41]:
data.to_csv(os.path.join(data_dir,'data_summarized.csv'))
In [ ]: