In this project, we aim to identify the effects of local measures taken against the COVID-19 pandemic and which measures work best in which district/region.
To have a suitable data basis, we first aggregate data about the structural details of each district (Landkreis/kreisfreie Stadt). The data is based on a combination of https://www.landatlas.de/, https://de.wikipedia.org/wiki/Liste_der_Landkreise_in_Deutschland and the COVID-19 data of the RKI.
The general strategy is:
Special add-on:
Please contact Alexander Blaessle (alexander.blaessle@gmail.com) for more details.
%load_ext autoreload
%autoreload 2
First, we load the required modules. complexheatmap is my own package and needs to be installed manually.
# Standard
import os
# Pandas/NP
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import complexheatmap as chm  # custom package providing annotated clustermaps and distance utilities
# ML
import sklearn.decomposition as skdc
import umap
We now read in the data from Landatlas. This data was already sanitized in previous steps.
data_dir='../data/'
data=pd.read_csv(os.path.join(data_dir,'landatlas_sanitized_v001.csv'))
data=data.rename(columns={'Unnamed: 0': 'district_id'})
There are cells with the value -99, which seem to have been entered where there is no data coverage, so we treat them as NaN. Later on, we will filter out features with poor coverage so that they do not distort the clustering.
data=data.replace(to_replace=-99,value=np.nan)
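As a quick diagnostic (not part of the original notebook), we can check which columns are most affected by missing values after mapping -99 to NaN:
# Columns with the most missing values after the -99 replacement
print(data.isna().sum().sort_values(ascending=False).head(10))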
I downloaded an additional table from Wikipedia to align the districts with their regions/Bundeslaender.
wik=pd.read_csv(os.path.join(data_dir,'LK_wiki.csv'),sep='\t',index_col=None)
# Get rid of unnecessary data
wik=wik.drop(columns='Karte')
# Get proper index and merge
wik.index=wik['KrS']
# Merge and fill in gaps
data.index=data['district_id']
data=pd.concat([data,wik],axis=1)
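Before filling gaps, it is useful to check how many districts were not matched by the Wikipedia table (a quick check I added, not in the original notebook):
# How many districts did not get a Bundesland assigned by the merge?
print(pd.isna(data['BL']).sum(), 'districts without BL information')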
There was no data for a number of districts, which turn out to be kreisfreie Staedte. For now, we mark their Bundesland as unknown.
data.loc[pd.isna(data['BL']),'BL']='unknown'
data.head()
We will also add the current COVID-19 data of the RKI. This has likewise been cleaned previously.
rki=pd.read_csv(os.path.join(data_dir,'RKI_COVID19_clean_new+total_v003.csv'))
We first add the newly infected cases.
# Get newly infected as matrix
new_inf=rki.pivot(index='districtId',values='new_infected',columns='date')
# Show as clustermap
sns.clustermap(new_inf,col_cluster=False,cmap='coolwarm')
# Add new infected to data frame
new_inf.columns=['ninf_%s'%str(col) for col in new_inf.columns.values]
data=pd.concat([data,new_inf],axis=1)
data=data[~pd.isna(data['district_id'])]
We do the same for the total infected.
tot_inf=rki.pivot(index='districtId',values='total_infected',columns='date')
# Show as clustermap
sns.clustermap(tot_inf,col_cluster=False,cmap='coolwarm')
# Add total infected to data frame
tot_inf.columns=['tinf_%s'%str(col) for col in tot_inf.columns.values]
data=pd.concat([data,tot_inf],axis=1)
data=data[~pd.isna(data['district_id'])]
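The pivot, prefix and merge pattern above is repeated for both count types. A small helper like the following (a sketch of my own; the function name and signature are not part of the original notebook) would avoid the duplication:
def add_pivoted_counts(data, rki, value_col, prefix):
    """Pivot an RKI count column into one column per date, prefix the columns, and merge into data."""
    piv = rki.pivot(index='districtId', values=value_col, columns='date')
    piv.columns = ['%s_%s' % (prefix, str(col)) for col in piv.columns.values]
    merged = pd.concat([data, piv], axis=1)
    # Drop rows that only exist on the RKI side of the merge
    return merged[~pd.isna(merged['district_id'])]
# Usage would then be:
# data = add_pivoted_counts(data, rki, 'new_infected', 'ninf')
# data = add_pivoted_counts(data, rki, 'total_infected', 'tinf')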
There are a number of features with poor data coverage, so we remove them right away.
# Find districts with least data coverage
data['n_nan']=pd.isna(data.values).sum(axis=1)
data=data[~pd.isna(data['district_id'])]
# Filter all columns that have more than 5% of nan values
cols_keep=[]
cutoff=0.05*data.shape[0]
for col in data.columns.values:
    if not pd.isna(data[col].values).sum()>cutoff:
        cols_keep.append(col)
    # print(col,pd.isna(data[col].values).sum())
data.head()
data=data[cols_keep]
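For reference, the same column filter can also be written as a one-liner in pandas (equivalent to the loop above; left commented out so the filter is not applied twice):
# Keep only columns with at most 5% missing values
# data = data.loc[:, data.isna().mean() <= 0.05]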
We can now perform a hierarchical clustering with respect to the structural data. For this, we first extract the structural data.
# Extract relevant data values
data.index=data['district_id']
mat=data.loc[:,~data.columns.str.contains('district|BL|tinf|ninf|Thuenen-Typologie')]
# Fill NaNs with mean so we can do PCA/heatmap
mat=mat.fillna(mat.mean())
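Note that the PCA performed further below is sensitive to feature scale. If that matters, the features could be z-scored first; this optional step is not part of the original analysis and is left commented out:
# Optional: z-score the features so that PCA is not dominated by columns with large numeric ranges
# mat = (mat - mat.mean()) / mat.std()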
We can now compute the pairwise distances between districts based on the Pearson correlation.
# Do PWDs
pwd=chm.utils.pw_dist(mat)
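If complexheatmap is not available, a correlation-based distance can be computed with pandas alone. This is a sketch that assumes pw_dist returns 1 minus the pairwise Pearson correlation between districts, which matches the description above but may differ in detail from the package implementation:
# Hypothetical stand-in for chm.utils.pw_dist: 1 - Pearson correlation between districts
# pwd = 1 - mat.T.corr(method='pearson')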
def classify_cont_col(data,col,n):
    """Classifies a continuous feature into n bins."""
    bins=np.linspace(0.99*data[col].values.min(),1.01*data[col].values.max(),n+1)
    dig=np.digitize(data[col].values, bins, right=False)
    data['%s_dig'%col]=dig
    return data
To simplify the illustration later, we divide up some key features into classes.
data=classify_cont_col(data,'Krankenhausbetten',5)
data=classify_cont_col(data,'tinf_2020-03-18',10)
data=classify_cont_col(data,'ninf_2020-03-18',10)
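As a quick sanity check (not in the original notebook), we can look at how the districts distribute over one of the resulting bins:
# Distribution of districts over the hospital-bed bins
print(data['Krankenhausbetten_dig'].value_counts().sort_index())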
cg=chm.hm.clustermap(pwd,meta=data,color=['Thuenen-Typologie','BL','Krankenhausbetten_dig','tinf_2020-03-18_dig','ninf_2020-03-18_dig'],figsize=(30,30),cmap='coolwarm')
# cg.ax_heatmap.set_xticks(range(len()))
_=cg.ax_heatmap.set_xticklabels(data['district_name'])
_=cg.ax_heatmap.set_yticklabels(data['district_name'])
The heatmap shows us a few results. First, we can see that the correlation values are generally high, so except for a few districts, most districts are fairly similar.
Moreover, we see a number of clusters with Thuenen-Typologie=5, indicating that highly developed areas cluster together.
We also see that this coincides with BL=unknown, indicating that most of them are kreisfreie Staedte.
As expected, these are also regions with a lot of hospital beds.
However, because of the lack of data, we cannot see any trends in terms of COVID-19 distribution yet.
We can also perform a PCA and see whether any interesting clusters emerge.
pca=skdc.PCA()
embed=pca.fit_transform(mat)
data['pca_1']=embed[:,0]
data['pca_2']=embed[:,1]
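A quick diagnostic (not part of the original notebook) is to check how much variance the two plotted components actually capture:
# Fraction of variance explained by the first two principal components
print(pca.explained_variance_ratio_[:2].sum())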
def cat_scatter(data,col,x,y,ax=None,figsize=(8,8)):
    """Categorical scatter plot: one color per unique value of col."""
    if ax is None:
        fig,ax=plt.subplots(1,figsize=figsize)
    for t in data[col].unique():
        tmp=data[data[col]==t]
        ax.scatter(tmp[x],tmp[y],label='%s = %s'%(str(col),str(t)))
    ax.legend()
    return ax
def cont_scatter(data,col,x,y,ax=None,cmap='coolwarm',figsize=(8,8)):
    """Continuous scatter plot: points colored by the value of col."""
    if ax is None:
        fig,ax=plt.subplots(1,figsize=figsize)
    sc=ax.scatter(data[x],data[y],c=data[col],cmap=cmap)
    # A colorbar is more informative than a legend for a continuous color scale
    ax.figure.colorbar(sc,ax=ax,label=col)
    return ax
cat_scatter(data,'Thuenen-Typologie','pca_1','pca_2')
cat_scatter(data,'BL','pca_1','pca_2')
What is really interesting is that most of the data points cluster together, and only the kreisfreie Staedte are clearly different.
A UMAP projection is another way to project multidimensional data into human-viewable plots.
A nice property of UMAP projections is that they respect both local and global structure.
embed = umap.UMAP(n_neighbors=5,
                  min_dist=0.3,
                  metric='correlation').fit_transform(mat)
data['umap_1']=embed[:,0]
data['umap_2']=embed[:,1]
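UMAP is stochastic, so repeated runs give slightly different embeddings. For reproducibility one could pass a fixed random_state (not set in the original run):
# embed = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation',
#                   random_state=42).fit_transform(mat)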
ax=cat_scatter(data,'Thuenen-Typologie','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')
ax=cat_scatter(data,'BL','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')
We see again that the kreisfreie Staedte show the most spread in the projection, indicating that rural districts are much more similar to each other than the cities are.
ax=cont_scatter(data,'tinf_2020-03-18','umap_1','umap_2')
ax.set_xlabel('UMAP 1')
_=ax.set_ylabel('UMAP 2')
Looking at the latest numbers of COVID-19 infections, however, there does not seem to be a trend yet; more data is needed.
There are a number of interesting insights coming from this analysis, which are probably well known to anyone actually working in this field (such as cities forming clusters while being more spread out than other regions). However, this data can be used to identify districts and regions that are similar, in order to then follow up with a positive deviance analysis.
data.to_csv(os.path.join(data_dir,'data_summarized.csv'))
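As a final check (not part of the original notebook), the summary file can be read back in to verify that it round-trips cleanly:
# Read the summary back and compare shapes
check = pd.read_csv(os.path.join(data_dir,'data_summarized.csv'), index_col=0)
print(check.shape, data.shape)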