2026-02-25
Albert on a bike (catmapper, Wikimedia)
BIKETOWN bikes (Steve Morgan, Wikimedia)
How BIKETOWN works (screenshot)
Here is data from August 2018:
| | RouteID | PaymentPlan | StartHub | StartLatitude | StartLongitude | StartDate | StartTime | EndHub | EndLatitude | EndLongitude | EndDate | EndTime | TripType | BikeID | BikeName | Distance_Miles | Duration | RentalAccessPath | MultipleRental |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8530077 | Casual | NW Couch at 11th | 45.523742 | -122.681813 | 8/1/2018 | 0:00 | NE Couch at MLK | 45.523588 | -122.661918 | 8/1/2018 | 0:09 | NaN | 6841 | 0075 BIKETOWN | 1.09 | 0:08:41 | mobile | False |
| 1 | 8530088 | Casual | NE Couch at MLK | 45.523588 | -122.661918 | 8/1/2018 | 0:01 | SW 2nd at Pine | 45.521421 | -122.672629 | 8/1/2018 | 0:08 | NaN | 7312 | 0987 BETRUE TRAINER | 0.62 | 0:07:24 | mobile | False |
| 2 | 8530156 | Subscriber | SW 3rd at Ankeny | 45.522479 | -122.673297 | 8/1/2018 | 0:07 | NaN | 45.522987 | -122.681129 | 8/1/2018 | 0:31 | NaN | 6145 | 0567 BIKETOWN | 2.62 | 0:23:36 | keypad | False |
| 3 | 8530170 | Subscriber | NaN | 45.520015 | -122.682115 | 8/1/2018 | 0:08 | NaN | 45.536330 | -122.685748 | 8/1/2018 | 0:18 | NaN | 7498 | 0181 BIKETOWN | 1.36 | 0:09:51 | keypad_rfid_card | False |
| 4 | 8530168 | Subscriber | NaN | 45.520015 | -122.682115 | 8/1/2018 | 0:08 | NaN | 45.536330 | -122.685748 | 8/1/2018 | 0:18 | NaN | 6879 | 0747 BIKETOWN | 1.36 | 0:09:39 | keypad_rfid_card | False |
First, we’d like actual spatial points in a sensible coordinate system (projected, with units of meters):
```python
# convert lon/lat (EPSG:4326) to a projected CRS in meters (EPSG:5070, CONUS Albers)
df['start_points'] = gpd.points_from_xy(df.StartLongitude, df.StartLatitude, crs="EPSG:4326").to_crs("EPSG:5070")
df['end_points'] = gpd.points_from_xy(df.EndLongitude, df.EndLatitude, crs="EPSG:4326").to_crs("EPSG:5070")
gdf = gpd.GeoDataFrame(df, geometry=df['start_points'], crs="EPSG:5070")
```
```python
gdf[:1000].explore()
```

We’ll make some maps with spatial smoothing to answer this, and evaluate if that was a good tool for the job.
In this class we’ve mostly not worried about computation. Modern computers deal with vectors of length \(10^7\)–\(10^8\) pretty much instantly.
But, what if you’ve got more? Options:
Considerations:
It’s really easy to run into “too much data” in spatial contexts, basically because \(N^2 \gg N\).
Why?
Images/rasters: my display is 2256 × 1504, which is about 3.4M pixels. It’s easy to have a lot of images.
Pairwise comparisons: many spatial operations require operating on “nearby” objects. Naively, that requires computing all \(O(N^2)\) pairwise distances.
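One standard way around forming the full \(O(N^2)\) distance matrix (a sketch for illustration, not something we use later) is a spatial index; here, `scipy.spatial.cKDTree`:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.uniform(size=(10_000, 2))  # 10k random points in the unit square

# Naive approach: form all O(N^2) pairwise distances, then threshold.
# For 10k points that's a 10k x 10k float64 array -- about 800 MB.

# Better: a k-d tree finds all pairs within distance r in roughly
# O(N log N), without ever forming the full matrix.
tree = cKDTree(pts)
pairs = tree.query_pairs(r=0.01)  # set of (i, j) index pairs within r
print(len(pairs), "nearby pairs")
```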
Suppose we have a bunch of points
\[ (x_1, y_1), \ldots, (x_n, y_n) \]
with density \(f(x,y)\), i.e.,
\[ f(x,y) \approx \frac{\#(\text{of points within $r$ of $(x,y)$})}{\pi r^2} \]
Using (for instance)
\[ \rho_\sigma(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}} \]
we can estimate \(f(x,y)\) by
\[ \hat f(x,y) = \sum_{i=1}^n \rho_\sigma(x_i-x, y_i-y) . \]
(interlude: why?)
Simulate 200 spatial locations with a density having two “bumps”. Plot these points. (rng.normal(size=n, loc=-3), rng.normal(size=n, loc=+3))
Make 20 “reference” locations.
Compute the kernel density estimate for each reference location, with \(\sigma\) chosen appropriately, and plot it. (scipy.stats.norm( ))
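One possible solution sketch for this exercise (placing the bumps at \((-3,-3)\) and \((+3,+3)\) is my reading of the hint; plot with `plt.scatter`):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(123)
n = 100  # per bump, so 200 points total

# Two "bumps": Gaussian clouds centered at (-3, -3) and (+3, +3)
x = np.concatenate([rng.normal(size=n, loc=-3), rng.normal(size=n, loc=+3)])
y = np.concatenate([rng.normal(size=n, loc=-3), rng.normal(size=n, loc=+3)])

# 20 reference locations along the diagonal through both bumps
ref = np.linspace(-6, 6, 20)

sigma = 1.0
# KDE: the 2D Gaussian kernel factors into a product of 1D normal pdfs;
# x[:, None] - ref[None, :] is an N x M array, so this costs O(N * M)
fhat = (norm(scale=sigma).pdf(x[:, None] - ref[None, :])
        * norm(scale=sigma).pdf(y[:, None] - ref[None, :])).sum(axis=0)
```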
Note that this required \(O(NM)\) operations, where \(N\) is the number of data points, and \(M\) is the number of reference locations.
This is not as bad as \(O(N^2)\)!
But, suppose we wanted “density estimate for each of our \(N\) points”. The naive method would use data points = reference points: \(O(N^2)\).
Better: get estimated density at a grid of references, and use linear interpolation.
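A minimal sketch of this grid-then-interpolate idea on simulated points (the grid size and bandwidth here are arbitrary choices for illustration):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

rng = np.random.default_rng(0)
pts = rng.normal(size=(5000, 2))  # N = 5000 data points

# Step 1: KDE at an M x M grid of references, with M^2 << N
grid = np.linspace(-4.0, 4.0, 50)
gx, gy = np.meshgrid(grid, grid, indexing="ij")
sigma = 0.5
dens = np.zeros_like(gx)
for px, py in pts:  # O(N * M^2); M is fixed, so this is linear in N
    dens += np.exp(-((gx - px)**2 + (gy - py)**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# Step 2: linear interpolation from the grid back to all N points -- cheap
interp = RegularGridInterpolator((grid, grid), dens)
dens_at_pts = interp(np.clip(pts, -4.0, 4.0))
```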
First, we’ll bin things into a 2D grid:
```python
sigma = 500  # kernel bandwidth, in meters
delta = 100  # grid cell size, in meters
xy = np.array([p.coords[0] for p in gdf['start_points']])  # (x, y) in meters
# extent = [ymin, ymax, xmin, xmax]
extent = [np.min(xy[:, 1]), np.max(xy[:, 1]), np.min(xy[:, 0]), np.max(xy[:, 0])]
xx = np.linspace(extent[2], extent[3], int(abs(extent[3] - extent[2]) / delta))
yy = np.linspace(extent[0], extent[1], int(abs(extent[1] - extent[0]) / delta))
heatmap, xedges, yedges = np.histogram2d(xy[:, 0], xy[:, 1], bins=[xx, yy])
plt.imshow(heatmap)
```
Then, we smooth it. (Question: how does this relate to how we did KDE above?)
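The notes don’t show the smoothing call itself; one common choice (an assumption on my part) is `scipy.ndimage.gaussian_filter`, with the bandwidth converted from meters to grid cells:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

sigma = 500  # kernel bandwidth, meters
delta = 100  # grid cell size, meters

# a stand-in for the binned counts from np.histogram2d above
rng = np.random.default_rng(0)
pts = rng.normal(size=(2000, 2), scale=1500)  # fake locations, in meters
edges = np.arange(-5000, 5000 + delta, delta)
heatmap, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=[edges, edges])

# gaussian_filter's sigma is in *pixels* (grid cells), so convert:
smoothed = gaussian_filter(heatmap, sigma=sigma / delta)
```

This connects back to the KDE: binning rounds each point to its cell center, and the blur then spreads each count out with the Gaussian kernel, so the result is a discretized KDE (up to normalization by the cell area).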
There are various ways to do this. One is to use contextily, as demonstrated here. First, we need to convert to EPSG:3857 (Web Mercator):
```python
xpts = gpd.points_from_xy(xx, np.full((len(xx),), np.mean(yy)), crs="EPSG:5070").to_crs("EPSG:3857")
ypts = gpd.points_from_xy(np.full((len(yy),), np.mean(xx)), yy, crs="EPSG:5070").to_crs("EPSG:3857")
gpd.GeoDataFrame(geometry=gpd.points_from_xy(
        np.concatenate([xx, np.full((len(yy),), np.mean(xx))]),
        np.concatenate([np.full((len(xx),), np.mean(yy)), yy])),
    crs="EPSG:5070").to_crs("EPSG:3857").explore()
```

Now, we can superimpose on the OSM map:
Let’s zoom in:
Above I set:

```python
sigma = 500  # meters
delta = 100  # meters
```
What are these numbers, in words, and are these good choices?
Explore different choices of these two parameters. To do this, write a function that takes `sigma` and `delta` and makes a plot.

There are various automatic methods for choosing these.
But… cross-validation!
For each bandwidth:
… but wait, what does this mean, here?
For each bandwidth:
Why log?:
If \(f\) and \(g\) are probability distributions, and \(x_1, \ldots, x_n\) are drawn from \(f\), then \[ L(g) = \sum_{i=1}^n \log g(x_i) \lessapprox L(f) , \] i.e., is maximized for \(g=f\). (see: cross entropy)
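A sketch of what this looks like for bandwidth selection in 1D (my own illustration, not code from the notes): compute the leave-one-out log-likelihood for each candidate \(\sigma\) and take the maximizer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(size=100, loc=-3), rng.normal(size=100, loc=3)])
n = len(x)

def loo_loglik(x, sigma):
    """Leave-one-out log-likelihood of the KDE with bandwidth sigma.

    Each x_i is scored under the density built from the *other* points;
    otherwise sigma -> 0 "wins" by putting a spike on every data point.
    """
    d2 = (x[:, None] - x[None, :]) ** 2           # O(N^2) pairwise distances
    k = np.exp(-d2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    np.fill_diagonal(k, 0.0)                      # drop the held-out point
    fhat = k.sum(axis=1) / (n - 1)
    return np.sum(np.log(fhat))

sigmas = np.linspace(0.1, 3.0, 30)
scores = [loo_loglik(x, s) for s in sigmas]
best = sigmas[int(np.argmax(scores))]
```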