Image: Michael Najjar, high altitude (2008-2010), dow jones_80-09
Spatial Analysis – Correlation
Module Summary
In this module we discuss analytic methods commonly used to interrogate spatial data, namely spatial correlation. Spatial correlation explores the relationship between properties in geographic space – asking whether an object's placement can tell us something about another phenomenon. For example, we might consider income or house prices in NYC and ask: Is there a relationship between the density of trees in a neighborhood and the income of its residents? Can a property's distance to Manhattan tell us anything about its price? Correlation helps us answer these questions quantitatively and visualize or describe them statistically.
For this module we will be using the Pandas, GeoPandas, Matplotlib, Seaborn and Shapely Python libraries. Be sure to pip install them on your machine if you don't already have them, and import them like so:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from shapely.geometry import Point, Polygon
Correlation
According to Waldo Tobler's First Law of Geography: "everything is related to everything else, but near things are more related than distant things." This simple but profound statement assumes that geographic distance can help explain why things are located where they are. Correlation is a widely used statistic that helps us understand the relationship between two variables. Spatial correlation helps us quantify the relationship between objects in geographic space and their numeric values, based on proximity. A real-world example of this is house prices. If you picked a random house in New York City from a listing website like Zillow and found that it was priced at $1.6M, there is a very high likelihood that the house next door (if similar in size and amenities) would be selling for a similar price, whereas a house 100 miles away could be selling for far more or far less. Distance and proximity can help explain phenomena in geographic space, from house prices, to the location of services and businesses, to topography and climate – near things are more closely related than farther things. Spatial correlation helps us:
- Understand whether things are clustered or dispersed over a geographic region.
- Measure spatial inequality, whether in income, access to social services, demographics, etc.
- Describe ecological and environmental conditions.
In its simplest sense, correlation is the linear relationship between a pair of variables. If the value of one variable goes up as the other increases, the relationship is said to be positively correlated; if the value of one variable goes down as the other increases, the relationship is said to be negatively correlated. The strength of the relationship (correlation) is usually represented by the (Pearson) correlation coefficient r. Although an intuitive understanding is more than enough to complete this demonstration and explore correlations independently, it can be written formally as:
r = cov(X, Y) / (σX σY)
Where:
- X is the first variable
- Y is the second variable
- cov is the covariance
- σX and σY are the standard deviations of the two variables
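As a quick sanity check on the formula, here is a minimal sketch with synthetic data (the names a and b stand in for X and Y), showing that the covariance divided by the product of the standard deviations matches NumPy's built-in correlation function:
import numpy as np

# synthetic example: b increases with a, plus some noise
rng = np.random.default_rng(0)
a = np.arange(100, dtype=float)
b = 2 * a + rng.normal(0, 10, size=100)

# Pearson r from the definition: covariance over the product of standard deviations
r_manual = np.cov(a, b, ddof=1)[0, 1] / (np.std(a, ddof=1) * np.std(b, ddof=1))
# NumPy's built-in correlation matrix gives the same value
r_builtin = np.corrcoef(a, b)[0, 1]
print(round(r_manual, 4), round(r_builtin, 4))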
For this demonstration we will look at the relationship between Brooklyn house prices and their distance to Manhattan. We will be studying the clustering of higher-priced houses around the island of Manhattan, the hypothesis being that the closer a house is to Manhattan, the higher the sale price. We will utilize multiple datasets provided by NYC Open Data, including the Citywide Annualized Calendar Sales Update and the Borough Boundaries GeoJSON.
Correlation can be visualized and explored in numerous ways. Typically it is shown via a scatter plot, which plots the numeric values of two variables along the X and Y axes. But it can also be shown spatially on a map, which is intuitive but less precise: the values of one variable color the objects, while proximity is read from their relative locations in geographic space. Let's first see how we could show the relationship between house prices and distance to Manhattan using points on a map.
The first step will be to load the sales data as a Pandas DataFrame:
path = '/Users/cbailey/Downloads/NYC_Citywide_Annualized_Calendar_Sales_Update.csv'
houses = pd.read_csv(path)
Next we will create a function to convert the latitude and longitude columns into a Shapely Point object, allowing us to convert the DataFrame into a GeoPandas GeoDataFrame:
def lat_lng_to_point(data):
    return Point([data['Longitude'], data['Latitude']])
houses['geom'] = houses.apply(lat_lng_to_point, axis=1)
gdf = gpd.GeoDataFrame(houses, geometry='geom', crs='EPSG:4326')
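As an aside, GeoPandas also has a vectorized points_from_xy() helper that performs the same conversion without a row-wise apply; a minimal equivalent sketch, assuming the same column names:
# equivalent conversion using GeoPandas' vectorized helper
houses['geom'] = gpd.points_from_xy(houses['Longitude'], houses['Latitude'])
gdf = gpd.GeoDataFrame(houses, geometry='geom', crs='EPSG:4326')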
Next we will load the NYC borough boundaries GeoJSON file as a GeoDataFrame:
path = '/Users/cbailey/Downloads/Borough Boundaries.geojson'
nyc_geo = gpd.read_file(path)
We will use the boundary of Brooklyn to filter the sales data, using the GeoPandas .sjoin() function:
bk = nyc_geo.loc[nyc_geo['boro_name']=='Brooklyn']
bk_houses = gpd.sjoin(gdf, bk)
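Note that sjoin expects both layers to share a coordinate reference system; a quick optional check that the join behaved as expected:
# optional check: both layers should be in EPSG:4326 before the spatial join
print(gdf.crs == bk.crs)  # True; otherwise reproject one side with .to_crs()
print(bk_houses.shape)    # number of sales that fall inside Brooklyn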
With every data analytics project, spatial or otherwise, there is always some data cleaning required to ensure your data is relevant. The sales data includes sales of every property type the City of New York tracks, so the first step in cleaning will be to keep only sales of single-family dwellings. And if we look at the distribution of sale prices, we can see that there are outliers in the dataset:
# filter the data to only have single family homes
mask = bk_houses['BUILDING CLASS CATEGORY']=='01 ONE FAMILY DWELLINGS'
bk_houses = bk_houses.loc[mask]
bk_houses['SALE PRICE'].hist(bins=20);
mask1 = (bk_houses['SALE PRICE'] > 1e6) & (bk_houses['SALE PRICE'] < 3e6)
bk_houses = bk_houses.loc[mask1]
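To confirm the filter removed the extreme values, a quick optional numeric summary:
# after filtering, min and max should now sit inside the $1M–$3M range
print(bk_houses['SALE PRICE'].describe())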
Now we are ready to plot the data on a map to explore any inherent relationships:
# plot the boundaries of Manhattan & Brooklyn as a base
ax = nyc_geo.loc[nyc_geo['boro_name'].isin(['Manhattan','Brooklyn'])].boundary.plot(
    figsize=(18,16), linewidth=0.75, color='black')
# draw rings around the centroid of Manhattan (optional)
mn_centroid = nyc_geo.loc[nyc_geo['boro_name']=='Manhattan'].boundary.centroid
for radius in [0.1, 0.15, 0.2]:
    mn_centroid.buffer(radius).boundary.plot(
        ax=ax, linestyle='--', color='grey', linewidth=0.75)
# plot individual houses using sales price as color
bk_houses[(bk_houses['SALE PRICE'] > 1e6) &
          (bk_houses['SALE PRICE'] < 3e6)].plot(
    ax=ax, column='SALE PRICE', legend=True, cmap='Spectral', alpha=0.5,
    legend_kwds={'shrink': 0.6})
ax.set_axis_off();
Next we need to measure the distance from every house in Brooklyn to the Manhattan shoreline. To do this we need to find the closest point on the Manhattan boundary/shoreline to a given house. Shapely's nearest_points() function (in shapely.ops) makes this process easier. Below we create a function called nearest that takes a point and a collection of geometries as input and returns the distance between the point and the closest point on those geometries. We will use the unary_union attribute in GeoPandas on the Manhattan geometry to get the union of its boundaries:
from shapely.ops import nearest_points

# get just the geometry of Manhattan
mn = nyc_geo.loc[nyc_geo['boro_name']=='Manhattan']

# nearest function to return the distance to the closest point
def nearest(point, pts):
    return nearest_points(point, pts)[1].distance(point)
# unionize the Manhattan geometry
mn_pt = mn.geometry.unary_union
# create a new column with the distances for every house in Brooklyn
bk_houses['dist_manhattan'] = bk_houses['geom'].apply(lambda x: nearest(x, mn_pt))
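As an aside, the same result can be obtained without apply(): GeoPandas' GeoSeries.distance() measures the distance from each geometry to the nearest point of another geometry in one vectorized call. Because the data is in EPSG:4326, these distances are in decimal degrees; reprojecting both layers with .to_crs() to a projected CRS (for example EPSG:2263, the New York State Plane) would give distances in feet. A minimal sketch of the vectorized version:
# vectorized alternative: distance from every house to the unioned Manhattan geometry
bk_houses['dist_manhattan'] = bk_houses['geom'].distance(mn_pt)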
To measure the correlation between two variables in Python, we can use the SciPy (scientific computing) library, which ships with most scientific Python distributions. Its statistical computing module, scipy.stats, contains a pearsonr function that we will use to analyze the relationship between house prices and their distances to Manhattan. pearsonr returns the correlation coefficient (a value between -1 and +1, where +1 is a perfect positive correlation and -1 a perfect negative correlation) and the p-value, which together tell us the strength and reliability of the relationship. However, correlation is usually visualized on a scatter plot to get an immediate visual understanding, and we can utilize the Matplotlib library to draw one:
from scipy.stats import pearsonr

pearsonr(bk_houses['SALE PRICE'], bk_houses['dist_manhattan'])
You should see approximate values of: (-0.4610971045980508, 0.0)
The first number is the correlation coefficient, which suggests a negative correlation: the smaller the distance to Manhattan, the higher the house price. A coefficient of -0.46 corresponds to an r² of roughly 0.21, meaning distance to Manhattan explains about 21% of the variation in Brooklyn sale prices, lending some support to our initial assumption. The second number is the p-value: the probability of observing a correlation at least this strong if there were in fact no relationship between the two variables. The lower the p-value, the more confident we can be in the result – a good thing.
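Since the coefficient itself is not a percentage, a quick way to express the "variance explained" figure above is to square it:
# unpack the coefficient and p-value, then square the coefficient
r, p = pearsonr(bk_houses['SALE PRICE'], bk_houses['dist_manhattan'])
print(r ** 2)  # roughly 0.21 for r of about -0.46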
Plotting this on a scatter plot will show:
plt.scatter(bk_houses['dist_manhattan'], bk_houses['SALE PRICE']);
To account for the size of each house we will also use the GROSS SQUARE FEET column. It has a text data type, so we need to convert it to a float by removing commas and dashes and then using the .astype() function:
sqft = bk_houses['GROSS SQUARE FEET'].str.replace(',|- ', '', regex=True).astype(float)
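A quick optional check that the conversion succeeded:
# the series should now be numeric (float64); also count any missing values
print(sqft.dtype, sqft.isna().sum())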
Next we will create a series of conditional masks to constrain the square footage, sale price, and year built, and to ensure there are no commercial units within a house:
mask1 = (sqft <= 2000) & (sqft >= 1000)
mask2 = (bk_houses['SALE PRICE'] > 1e6) & (bk_houses['SALE PRICE'] < 3e6)
mask3 = bk_houses['COMMERCIAL UNITS'] < 1.
mask4 = bk_houses['YEAR BUILT'] <= 1980
Finally, to ensure the sale date falls within 2019, we need to convert the SALE DATE column to a datetime object. We will utilize the Pandas .to_datetime() function to do this:
dates = pd.to_datetime(bk_houses['SALE DATE'])
mask5 = (dates < '2019-12-31') & (dates > '2019-01-01')
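As an aside, the same constraint can be written with the Pandas .dt accessor. Note that the strict inequalities above exclude January 1 and December 31, while the sketch below (using a hypothetical mask5_alt name) keeps the full calendar year:
# alternative: filter on the year component directly
mask5_alt = dates.dt.year == 2019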
Now if we apply all of these to our GeoDataFrame we should see a much smaller sample of data:
data = bk_houses[mask1 & mask2 & mask3 & mask4 & mask5]
print(data.shape)
You should see a sample of (217, 37). Now we can rerun the Pearson correlation and plot a scatter with a regression line to see the true nature of the relationship between house prices and distances to Manhattan. This time we will label the axes, make the dots transparent, and give them an edge color to make the visualization pop. We will use the Seaborn library, which makes this type of visualization easy:
print(pearsonr(data['SALE PRICE'], data['dist_manhattan']))
# output
(-0.33425418464725665, 4.6266592679351666e-07)
# define x and y variables used for plotting
x = data['dist_manhattan']
y = data['SALE PRICE']
# set the background color and size
sns.set(rc={'figure.figsize':(14.7,11.27)})
sns.set_style("white")
# create the Seaborn regression plot
# we will use dictionaries to change the line styles of the plot
ax = sns.regplot(x=x, y=y, ci=None,
scatter_kws={
'color':'orange',
'edgecolor':'black',
'alpha':0.75},
line_kws={
'color':'red',
'linestyle':'dashed',
'linewidth':0.75
});
# label the x and y axis
plt.xlabel('Distance to Manhattan')
plt.ylabel('Sale price $ (millions)')
# remove the top and right axis
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
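regplot draws the fitted regression line but does not report its parameters. If you want the slope and intercept of that line, a minimal sketch using NumPy's polyfit on the same x and y:
import numpy as np

# fit a straight line (degree-1 polynomial) to the same data shown in the plot
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # price change per unit of distance, and the intercept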
Challenge
Using the nearest-point function above, see if you can find the nearest tree to each house in the sales dataset.