This summer, I had the honor of speaking once again at the OpenGeoHub Summer School. This time, I wanted to challenge the students and myself by not just covering MovingPandas but introducing both MovingPandas and DVC for Mobility Data Science.
I’ve previously written about DVC and how it may be used to track geoprocessing workflows with QGIS & DVC. In my summer school session, we went into detail on how to use DVC to keep track of a MovingPandas movement data analytics workflow.
This post was written together with my fellow “Geocomputation with Python” co-authors Robin Lovelace, Michael Dorman, and Jakub Nowosad.
In this blog post, we talk about our experience teaching R and Python for geocomputation. The context of this blog post is the OpenGeoHub Summer School 2023 which has courses on R, Python and Julia. The focus of the blog post is on geographic vector data, meaning points, lines, polygons (and their ‘multi’ variants) and the attributes associated with them. We plan to cover raster data in a future post.
The QGIS swiss locator plugin makes life easier for many users in Switzerland by making the extensive geodata of swisstopo and opendata.swiss accessible. This includes a broad range of GIS layers, but also object information and a place name search.
Thanks to a funding project of the QGIS user group Switzerland, OPENGIS.ch was able to extend the plugin with an additional piece of functionality: the integration of WMTS as a data source, which is a pretty cool thing. But what exactly is the difference between WMS and WMTS?
WMS vs. WMTS
First, the similarities: both protocols, WMS and WMTS, are designed to transfer map images from a server to a client. In both cases, raster data (i.e. pixels) are transferred, and the images are already rendered, so no raw data is sent. This makes them well suited for presentation purposes: in the browser, in a desktop GIS, or for a PDF export.
The difference lies in the T, which stands for “tiled”. With a WMS (without tiling), arbitrary map extents can be requested. With a WMTS, the data is delivered as tiles in a precisely predefined grid.
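The difference also shows in the requests themselves. The URLs below are merely illustrative, loosely modelled on the swisstopo services:

# WMS GetMap: any bounding box and image size, rendered on request
https://wms.geo.admin.ch/?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&LAYERS=ch.swisstopo.pixelkarte-farbe&BBOX=...&WIDTH=800&HEIGHT=600

# WMTS GetTile (REST): one fixed tile, addressed by zoom level, column and row
https://wmts.geo.admin.ch/1.0.0/ch.swisstopo.pixelkarte-farbe/default/current/3857/{z}/{x}/{y}.jpeg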
The main advantage of WMTS lies in this standardization on a grid, which allows the tiles to be cached. Caching can happen on the server, which can precompute all tiles and return a file directly on request without having to render a new image. It also enables client-side caching: the browser (or, in the case of Swiss Locator, QGIS) can simply reuse each tile without contacting the server again. This speeds up response times enormously and makes working with the application much snappier.
So why still use WMS?
WMS, of course, has its own advantages. A WMS can deliver images optimized for exactly one request. For example, it can place all labels optimally so that they are not cut off at the map edge; with tiles and their many borders, this is more difficult. A WMS can also combine several requested layers with effects: blending modes are a powerful way to create visually appealing maps. Furthermore, a WMS can work at arbitrary resolutions (DPI), so that fonts and symbols are displayed at a comfortable size on any display while the map image itself stays sharp. The same applies to PDF exports.
A WMS also normally involves no caching. With a database behind it, the current state of the data is always delivered. This can be exactly what you want, for example to avoid accidentally delivering yesterday’s AV (official cadastral survey) dataset.
However, all of this requires the server to actually implement it accordingly. For the maps published by swisstopo via map.geo.admin.ch, the font size is fixed in the map image even for WMS and cannot be adjusted by the server.
In the case of QGIS Swiss Locator, the typical use case is loading background maps, e.g. orthophotos or national maps for orientation, and often also other data of a rather static nature. In this scenario, the advantages of WMTS shine. And that is why we would like to thank the QGIS user group Switzerland, on behalf of all Swiss QGIS users, for making this implementation possible!
The QGIS Swiss Locator is the Swiss Army knife of QGIS. Is there a tool you are missing that you would like to see integrated? Leave us a comment!
Kaggle’s “Taxi Trajectory Data from ECML/PKDD 15: Taxi Trip Time Prediction (II) Competition” is one of the most used mobility / vehicle trajectory datasets in computer science. However, in contrast to other similar datasets, Kaggle’s taxi trajectories are provided in a format that is not readily usable in MovingPandas since the spatiotemporal information is provided as:
TIMESTAMP: (integer) Unix Timestamp (in seconds). It identifies the trip’s start;
POLYLINE: (String): It contains a list of GPS coordinates (i.e. WGS84 format) mapped as a string. The beginning and the end of the string are identified with brackets (i.e. [ and ], respectively). Each pair of coordinates is also identified by the same brackets as [LONGITUDE, LATITUDE]. This list contains one pair of coordinates for each 15 seconds of trip. The last list item corresponds to the trip’s destination while the first one represents its start;
Therefore, we need to create a DataFrame with one point + timestamp per row before we can use MovingPandas to create Trajectories and analyze them.
But first things first. Let’s download the dataset:
import datetime
import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import opendatasets as od
from os.path import exists
from shapely.geometry import Point
input_file_path = 'taxi-trajectory/train.csv'
def get_porto_taxi_from_kaggle():
    if not exists(input_file_path):
        od.download("https://www.kaggle.com/datasets/crailtap/taxi-trajectory")
get_porto_taxi_from_kaggle()
df = pd.read_csv(input_file_path, nrows=10, usecols=['TRIP_ID', 'TAXI_ID', 'TIMESTAMP', 'MISSING_DATA', 'POLYLINE'])
df.POLYLINE = df.POLYLINE.apply(eval) # string to list
df
And now for the remodelling:
def unixtime_to_datetime(unix_time):
    return datetime.datetime.fromtimestamp(unix_time)

def compute_datetime(row):
    unix_time = row['TIMESTAMP']
    offset = row['running_number'] * datetime.timedelta(seconds=15)
    return unixtime_to_datetime(unix_time) + offset

def create_point(xy):
    try:
        return Point(xy)
    except TypeError:  # when there are nan values in the input data
        return None
new_df = df.explode('POLYLINE')
new_df['geometry'] = new_df['POLYLINE'].apply(create_point)
new_df['running_number'] = new_df.groupby('TRIP_ID').cumcount()
new_df['datetime'] = new_df.apply(compute_datetime, axis=1)
new_df.drop(columns=['POLYLINE', 'TIMESTAMP', 'running_number'], inplace=True)
new_df
And that’s it. Now we can create the trajectories:
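A minimal sketch (using the column names created above and assuming the coordinates are WGS84):

trajs = mpd.TrajectoryCollection(
    gpd.GeoDataFrame(new_df, crs=4326),  # turn the DataFrame into a GeoDataFrame
    traj_id_col='TRIP_ID', t='datetime')
print(trajs)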
Now our MovingPandas.TrajectoryCollection is ready for further analysis.
By the way, the plot above illustrates a new feature in the recent MovingPandas 0.16 release which, among other things, introduced plots with arrow markers that show the movement direction. Other new features include completely new support for custom distance, speed, and acceleration units. This means that, for example, instead of always getting speed in meters per second, you can now specify your desired output units, including km/h, mph, or nm/h (knots).
There have been posts on using PyQGIS in Jupyter notebooks before. However, I find the setup with *.bat files rather tricky.
This post presents a way to set up a conda environment with QGIS that is ready to be used in Jupyter notebooks.
The first steps are to create a new environment and install QGIS. I use mamba for the installation step because it is faster than conda, but you can use conda as well:
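A sketch of these steps; the environment name and the Python version pin are choices, not requirements:

conda create -n qgis python=3.9
conda activate qgis
mamba install -c conda-forge qgis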
If we now try to import the qgis module in Python, we get an error:
(qgis) PS C:\Users\anita> python
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:41:22) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import qgis
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'qgis'
To fix this error, we need to get the paths from the Python console inside QGIS:
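One way to do this is to run a snippet like the following in the QGIS Python console and note the output:

# Run inside the QGIS Python console to list the interpreter paths
import sys
for path in sys.path:
    print(path)

The reported paths can then be appended to sys.path in the notebook (or set via the PYTHONPATH environment variable of the conda environment) before importing qgis.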
Amongst all the processing algorithms already available in QGIS, sometimes the one thing you need is missing.
This happened not long ago, when we were asked to find a way to continuously visualise traffic on the Swiss motorway network (polylines) using frequently measured traffic volumes from discrete measurement stations (points) alongside the motorways. In order to keep working with the existing polylines and to be able to attribute more than one traffic value to each feature, we chose to work with M-values. M-values are a per-vertex attribute like X, Y or Z coordinates. They contain a measure value, which typically represents time or distance, but they can hold any numeric value.
In our example, traffic measurement values are provided on a separate point layer and should be attributed to the M-value of the nearest vertex of the motorway polylines. Of course, the motorway features should be of type LineStringM in order to hold an M-value. We then need to interpolate the M-values for each feature over all vertices in order to get continuous values along the line (i.e. a value on every vertex). This last part does not yet exist as a processing algorithm in QGIS.
This article describes how to write a feature-based processing algorithm based on the example of M-value interpolation along LineStrings.
Feature-based processing algorithm
The PyQGIS class QgsProcessingFeatureBasedAlgorithm is described as follows: “An abstract QgsProcessingAlgorithm base class for processing algorithms which operate “feature-by-feature”.
Feature based algorithms are algorithms which operate on individual features in isolation. These are algorithms where one feature is output for each input feature, and the output feature result for each input feature is not dependent on any other features present in the source. […]
Using QgsProcessingFeatureBasedAlgorithm as the base class for feature based algorithms allows shortcutting much of the common algorithm code for handling iterating over sources and pushing features to output sinks. It also allows the algorithm execution to be optimised in future (for instance allowing automatic multi-thread processing of the algorithm, or use of the algorithm in “chains”, avoiding the need for temporary outputs in multi-step models).”
In other words, when connecting several processing algorithms one after the other – e.g. with the graphical modeller – these feature-based processing algorithms can easily be used to fill in the missing bits.
Compared to the standard QgsProcessingAlgorithm, the feature-based class implicitly iterates over each feature when executing, which avoids writing wordy loops that explicitly fetch features and apply the algorithm to each one.
Just like for the QgsProcessingAlgorithm (a template can be found in the Processing Toolbar > Scripts > Create New Script from Template), there is quite some boilerplate code in the QgsProcessingFeatureBasedAlgorithm. The first part is identical to any QgsProcessingAlgorithm.
After the description of the algorithm (name, group, short help, etc.), the algorithm is initialised with def initAlgorithm, defining input and output.
While in a regular processing algorithm now follows def processAlgorithm(self, parameters, context, feedback), in a feature-based algorithm we use def processFeature(self, feature, context, feedback). This implies applying the code in this block to each feature of the input layer.
! Do not use def processAlgorithm in the same script, otherwise your feature-based processing algorithm will not work !
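To give an idea of the structure, here is a minimal skeleton of such an algorithm. The class and display names are illustrative; the base class already provides the input and output parameters:

from qgis.core import QgsProcessingFeatureBasedAlgorithm

class InterpolateMValues(QgsProcessingFeatureBasedAlgorithm):

    def createInstance(self):
        return InterpolateMValues()

    def name(self):
        return 'interpolatemvalues'

    def displayName(self):
        return 'Interpolate M-values'

    def outputName(self):
        return 'Interpolated'

    def processFeature(self, feature, context, feedback):
        # Transform the feature here and return it (see the full example below)
        return [feature]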
Interpolating M-values
This actual processing part can be copied almost 1:1 from any other standalone Python script; there is little specific syntax needed to make it a processing algorithm, really only the first line below.
In our M-value example:
def processFeature(self, feature, context, feedback):
    # Assumes the usual script-level imports: itertools, traceback,
    # numpy as np, sqrt from math, and the qgis.core classes used below.
    try:
        geom = feature.geometry()
        line = geom.constGet()
        vertex_iterator = QgsVertexIterator(line)
        vertex_m = []

        # Iterate over all vertices of the feature and extract M-value
        while vertex_iterator.hasNext():
            vertex = vertex_iterator.next()
            vertex_m.append(vertex.m())

        # Extract length of segments between vertices
        vertices_indices = range(len(vertex_m))
        length_segments = [sqrt(QgsPointXY(line[i]).sqrDist(QgsPointXY(line[j])))
                           for i, j in itertools.combinations(vertices_indices, 2)
                           if (j - i) == 1]

        # Get all non-zero M-value indices as an array, where interpolations have to start
        vertex_si = np.nonzero(vertex_m)[0]
        m_interpolated = np.copy(vertex_m)

        # Interpolate between all non-zero M-values - take segment lengths between vertices into account
        for i in range(len(vertex_si) - 1):
            first_nonzero = vertex_m[vertex_si[i]]
            next_nonzero = vertex_m[vertex_si[i + 1]]
            accum_dist = itertools.accumulate(length_segments[vertex_si[i]:vertex_si[i + 1]])
            sum_seg = sum(length_segments[vertex_si[i]:vertex_si[i + 1]])
            interp_m = [round(((dist / sum_seg) * (next_nonzero - first_nonzero)) + first_nonzero, 0)
                        for dist in accum_dist]
            m_interpolated[vertex_si[i]:vertex_si[i + 1]] = interp_m

        # Copy feature geometry and set interpolated M-values, attribute new geometry to feature
        geom_new = QgsLineString(geom.constGet())
        for j in range(len(m_interpolated)):
            geom_new.setMAt(j, m_interpolated[j])
        attrs = feature.attributes()
        feat_new = QgsFeature()
        feat_new.setAttributes(attrs)
        feat_new.setGeometry(geom_new)
    except Exception:
        s = traceback.format_exc()
        feedback.pushInfo(s)
        self.num_bad += 1
        return []
    return [feat_new]
In our example, we get the feature’s geometry, iterate over all its vertices (using the QgsVertexIterator) and extract the M-values as an array. This allows us to assign interpolated values where we don’t have M-values available. Such missing values are initially set to a value of 0 (zero).
We also extract the length of the segments between the vertices. By gathering the indices of the non-zero M-values of the array, we can then interpolate between all non-zero M-values, considering the length that separates the zero-value vertex from the first and the next non-zero vertex.
For the iteration over the vertices to extract the segment lengths, as well as for the actual interpolation between all non-zero M-value vertices, we use the itertools library. It provides different iterator building blocks that come in quite handy for our use case.
Finally, we create a new geometry by copying the one which is being processed and setting the M-values to the newly interpolated ones.
And that’s all there is really!
Alternatively, the interpolation can be done using the interp function of the numpy library. In some parts where our manual method gave no values, numpy.interp seemed more capable of interpolating. It remains to be judged which version produces the more realistic results.
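A sketch of that alternative, assuming vertex_m holds the raw M-values (with 0 for missing) and distances the cumulative distance of each vertex along the line:

import numpy as np

def interpolate_m_numpy(vertex_m, distances):
    vertex_m = np.asarray(vertex_m, dtype=float)
    distances = np.asarray(distances, dtype=float)
    known = np.nonzero(vertex_m)[0]  # indices of vertices with measured values
    # np.interp evaluates the piecewise-linear interpolant at every vertex
    return np.interp(distances, distances[known], vertex_m[known])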
Styling the result via M-values
The last step is styling our output layer in QGIS, based on the M-values (our traffic M-values are categorised from 1 [a lot of traffic -> dark red] to 6 [no traffic -> light green]). This can be achieved by using a Single Symbol symbology with a Marker Line type “on every vertex”. As a marker type, we use a simple round point. Stroke style is “no pen” and Stroke fill is based on an expression:
with_variable(
'm_value', m(point_n($geometry, @geometry_point_num)),
CASE WHEN @m_value = 6
THEN color_rgb(140, 255, 159)
WHEN @m_value = 5
THEN color_rgb(244, 252, 0)
WHEN @m_value = 4
THEN color_rgb(252, 176, 0)
WHEN @m_value = 3
THEN color_rgb(252, 134, 0)
WHEN @m_value = 2
THEN color_rgb(252, 29, 0)
WHEN @m_value = 1
THEN color_rgb(140, 0, 0)
ELSE
color_hsla(0,100,100,0)
END
)
And voilà! Wherever we have enough measurements on one line feature, we get our motorway network continuously coloured according to the measured traffic volume.
Motorway network – the different lanes are regrouped for each direction. M-values of the vertices closest to measurement points are attributed the measured traffic volume. The vertices are coloured accordingly.
Traffic on the motorway network after “manual” M-value interpolation.
Traffic on the motorway network after M-value interpolation using numpy.
One disclaimer at the end: We get this seemingly continuous styling only because of the combination of our “complex” polylines (containing many vertices) and the zoomed-out view of the motorway network. Because really, we’re styling many points and not directly the line itself. But in our case, this is working very well.
If you’d like to make your custom processing algorithm available through the processing toolbox in your QGIS, just put your script in the folder containing the files related to your user profile:
profiles > default > processing > scripts
You can directly access this folder by clicking on Settings > User Profiles > Open Active Profile Folder in the QGIS menu.
That way, it’s also available for integration in the graphical modeller.
Extract of the Graphical Modeler sequence. “Interpolate M-values neg” refers to the custom feature-based processing algorithm described above.
You can download the above-mentioned processing scripts (with numpy and without numpy) here.
New functions to add a timedelta column and get the trajectory sampling interval #233
As always, all tutorials are available from the movingpandas-examples repository and on MyBinder:
The new distance measures are covered in tutorial #11:
Computing distances between trajectories, as illustrated in tutorial #11
Computing distances between a trajectory and other geometry objects, as illustrated in tutorial #11
But don’t miss the great features covered by the other notebooks, such as outlier cleaning and smoothing:
Trajectory cleaning and smoothing, as illustrated in tutorial #10
If you have questions about using MovingPandas or just want to discuss new ideas, you’re welcome to join our discussion forum.
Speed and direction column names can now be customized
If you have questions about using MovingPandas or just want to discuss new ideas, you’re welcome to join our recently opened discussion forum.
As always, all tutorials are available from the movingpandas-examples repository and on MyBinder:
Besides other examples, the movingpandas-examples repo contains the following tech demo: an interactive app built with Panel that demonstrates different MovingPandas stop detection parameters.
To start the app, open the stopdetection-app.ipynb notebook and press the green Panel button in the Jupyter Lab toolbar:
Here’s a quick preview of the resulting app in action:
To create this app, I defined a single function called my_plot which takes the address and desired buffer size as input parameters. Using Panel’s interact and servable methods, I’m then turning this function into the interactive app you’ve seen above:
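As a sketch, the Panel part boils down to the following; my_plot is assumed to be defined as described, and the default address and slider range are made-up placeholders:

import panel as pn
pn.extension()

# pn.interact builds widgets from the function's parameters and re-runs
# my_plot whenever they change; servable() exposes the layout as an app.
app = pn.interact(my_plot, address='Bahnhofstrasse 1, Zürich', buffer_size=(10, 500))
app.servable()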
To open the Panel preview, press the green Panel button in the Jupyter Lab toolbar:
I really enjoy building spatial data exploration apps this way, because I can start off with a Jupyter notebook and – once I’m happy with the functionality – turn it into a pretty app that provides a user-friendly exterior and hides the underlying complexity that might scare away stakeholders.
Give it a try and share your own adventures. I’d love to see what you come up with.
Many of you certainly have already heard of and/or even used Leafmap by Qiusheng Wu.
Leafmap is a Python package for interactive spatial analysis with minimal coding in Jupyter environments. It provides interactive maps based on folium and ipyleaflet, spatial analysis functions using WhiteboxTools and whiteboxgui, and additional GUI elements based on ipywidgets.
This way, Leafmap achieves a look and feel that is reminiscent of a desktop GIS:
Recently, Qiusheng has started an additional project: the geospatial meta package which brings together a variety of different Python packages for geospatial analysis. As such, the main goals of geospatial are to make it easier to discover and use the diverse packages that make up the spatial Python ecosystem.
Besides the usual suspects, such as GeoPandas and of course Leafmap, one of the packages included in geospatial is MovingPandas. Thanks, Qiusheng!
I’ve tested the mamba install today and am very happy with how this worked out. There is just one small hiccup currently, which is related to an upstream jinja2 issue. After installing geospatial, I therefore downgraded jinja:
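The downgrade itself is a single command; note that the exact version pin below is my assumption of what resolves the upstream issue:

pip install "jinja2<3.1"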
Of course, I had to try Leafmap and MovingPandas in action together. Therefore, I fired up one of the MovingPandas example notebooks (here, the example on clipping trajectories using polygons). As you can see, the integration is pretty smooth since Leafmap already supports drawing GeoPandas GeoDataFrames and MovingPandas can convert trajectories to GeoDataFrames (both lines and points):
Clipped trajectory segments as linestrings in Leafmap
Leafmap includes an attribute table view that can be activated on user request to show, e.g., trajectory information
And, of course, we can also map the original trajectory points
Geospatial also includes the new dask-geopandas library which I’m very much looking forward to trying out next.
MovingPandas 0.9rc3 has just been released, including important fixes for local coordinate support. Sports analytics is just one example of movement data analysis that uses local rather than geographic coordinates.
Many movement data sources – such as soccer players’ movements extracted from video footage – use local reference systems. This means that x and y represent positions within an arbitrary frame, such as a soccer field.
Since Geopandas and GeoViews support handling and plotting local coordinates just fine, there is nothing stopping us from applying all MovingPandas functionality to this data. For example, to visualize the movement speed of players:
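A minimal sketch, assuming a DataFrame df with pitch coordinates in columns x and y, a timestamp column t, and a player id column:

import movingpandas as mpd

# crs=None keeps the coordinates in their local (non-geographic) reference system
trajs = mpd.TrajectoryCollection(df, traj_id_col='player', t='t', x='x', y='y', crs=None)
trajs.add_speed(overwrite=True)
trajs.plot(column='speed', linewidth=3, legend=True)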
Of course, we can also plot other trajectory attributes, such as the team affiliation.
But one particularly useful feature is the ability to use custom background images, for example, to show the soccer field layout:
The Kalman filter in action on the Geolife sample: smoother, less jiggly trajectories.
Top-Down Time Ratio generalization aka trajectory compression in action: reduces the number of positions along the trajectory without altering the spatiotemporal properties, such as speed, too much.
Behind the scenes, Ray Bell took care of moving testing from Travis to Github Actions, and together we worked through the steps to ensure that the source code is now properly linted using flake8 and black.
Being able to work with so many awesome contributors has made this release really special for me. It’s great to see the project attracting more developer interest.
As always, all tutorials are available from the movingpandas-examples repository and on MyBinder:
After writing “Towards a template for exploring movement data” last year, I spent a lot of time thinking about how to develop a solid approach for movement data exploration that would help analysts and scientists to better understand their datasets. Finally, my search led me to the excellent paper “A protocol for data exploration to avoid common statistical problems” by Zuur et al. (2010). What they had done for the analysis of common ecological datasets was very close to what I was trying to achieve for movement data. I followed Zuur et al.’s approach of an exploratory data analysis (EDA) protocol and combined it with a typology of movement data quality problems building on Andrienko et al. (2016). Then, I brought it all together in a Jupyter notebook implementation which you can now find on Github.
There are two options for running the notebook:
The repo contains a Dockerfile you can use to spin up a container including all necessary datasets and a fitting Python environment.
Alternatively, you can download the datasets manually and set up the Python environment using the provided environment.yml file.
The dataset contains over 10 million location records. Most visualizations are based on Holoviz Datashader with a sprinkling of MovingPandas for visualizing individual trajectories.
Point density map of 10 million location records, visualized using Datashader
Line density map for detecting gaps in tracks, visualized using Datashader
Example trajectory with strong jitter, visualized using MovingPandas & GeoViews
I hope this reference implementation will provide a starting point for many others who are working with movement data and who want to structure their data exploration workflow.
(If you don’t have institutional access to the journal, the publisher provides 50 free copies using this link. Once those are used up, just leave a comment below and I can email you a copy.)
Data sourcing and preparation is one of the most time-consuming tasks in many spatial analyses. Even though the Austrian data.gv.at platform already provides a central catalog, the individual datasets still vary considerably in their accessibility or readiness for use.
OGD.AT Lab is a new repository collecting Jupyter notebooks for working with Austrian Open Government Data and other auxiliary open data sources. The notebooks illustrate different use cases, including so far:
Accessing geodata from the city of Vienna WFS
Downloading environmental data (heat vulnerability and air quality)
Geocoding addresses and getting elevation information
Exploring urban movement data
Data processing and visualization are performed using Pandas, GeoPandas, and Holoviews. GeoPandas makes it straightforward to use data from WFS. Therefore, OGD.AT Lab can provide one universal gdf_from_wfs() function which takes the desired WFS layer as an argument and returns a GeoPandas.GeoDataFrame that is ready for analysis:
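A sketch of what such a helper can look like; the request parameters follow the data.wien.gv.at WFS conventions, and the layer name in the usage example is just an illustration:

import geopandas as gpd

def gdf_from_wfs(layer):
    """Fetch a WFS layer from the city of Vienna as a GeoDataFrame."""
    url = (
        'https://data.wien.gv.at/daten/geo?service=WFS&request=GetFeature'
        f'&version=1.1.0&typeName=ogdwien:{layer}&srsName=EPSG:4326'
        '&outputFormat=json'
    )
    return gpd.read_file(url)

# e.g. gdf = gdf_from_wfs('BAUMKATOGD')  # example layer: Vienna tree cadastre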
Many other datasets are provided as CSV files which need to be joined with spatial datasets before they can be used in spatial analysis. For example, the “Urban heat vulnerability index” dataset needs to be joined to statistical areas.
Another issue with many CSV files is that they use German number formatting, where commas are used as a decimal separator instead of dots:
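Pandas can deal with this directly when reading the file; the file name and separator below are assumptions:

import pandas as pd

# decimal=',' tells read_csv to interpret commas as the decimal separator
df = pd.read_csv('heat_vulnerability.csv', sep=';', decimal=',')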
Besides file access, there are also open services provided by other developers. For example, Manfred Egger developed an elevation service that provides elevation information for any point in Austria. In combination with geocoding services, such as Nominatim, this makes it possible to, for example, find the elevation for any address in Austria:
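A sketch of the idea; the elevation endpoint below is a placeholder rather than the actual service URL:

import requests
from geopy.geocoders import Nominatim

# Geocode the address, then pass the coordinates to the elevation service
location = Nominatim(user_agent='ogd-at-lab').geocode('Stephansplatz 1, 1010 Wien')
r = requests.get('https://example.org/elevation',  # placeholder URL
                 params={'lat': location.latitude, 'lon': location.longitude})
print(r.json())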
The utility functions for data access included in this repository will continue to grow as new data sources are included. Eventually, it may make sense to extract the data access function into a dedicated library, similar to geofi (Finland) or geobr (Brazil).
If you’re aware of any interesting open datasets or services that should be included in OGD.AT, feel free to reach out here or on Github through the issue tracker or by providing a pull request.
In the previous post, we explored how hvPlot and Datashader can help us to visualize large CSVs with point data in interactive map plots. Of course, the spatial distribution of points usually only shows us one part of the whole picture. Today, we’ll therefore look into how to explore other data attributes by linking other (non-spatial) plots to the map.
This functionality, referred to as “linked brushing” or “crossfiltering”, is under active development, and the following experiment was prompted by a recent Twitter thread launched by @plotlygraphs’ announcement of HoloViews 1.14:
Dash HoloViews was just released as part of @HoloViews version 1.14!
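For context, the two plots can be built roughly like this; a sketch where the DataFrame and column names are assumptions based on the AIS data shown below:

import hvplot.pandas  # noqa: registers the .hvplot accessor on DataFrames

map_plot = df.hvplot.scatter(x='x', y='y', datashade=True)  # point map
hist_plot = df.hvplot.hist('SOG')                           # speed histogram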
To link the two plots, we use HoloViews’ link_selections function:
from holoviews.selection import link_selections
linked_plots = link_selections(map_plot + hist_plot)
That’s all! We can now perform spatial filters in the map and attribute value filters in the histogram and the filters are automatically applied to the linked plots:
Linked brushing demo using ship movement data (AIS): filtering records by speed (SOG) reveals spatial patterns of fast and slow movement.
You’ve probably noticed that there is no background map in the above plot. I had to remove the background map tiles to get rid of an error in Holoviews 1.13.4. This error has been fixed in 1.14.0 but I ran into other issues with the datashaded Scatterplot.
Even with all their downsides, CSV files are still a common data exchange format – particularly between disciplines with different tech stacks. Indeed, “How to Specify Data Types of CSV Columns for Use in QGIS” (originally written in 2011) is still one of the most popular posts on this blog. QGIS continues to be quite handy for visualizing CSV file contents. However, there are times when it’s just not enough, particularly when the number of rows in the CSV is in the millions. The following example uses a 12 million point CSV:
To give you an idea of the waiting times in QGIS, I’ve run the following script which loads and renders the CSV:
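The original script is not reproduced here; a minimal sketch of the idea, with file path and field names as assumptions, looks like this:

# Run in the QGIS Python console
import time
from qgis.core import QgsVectorLayer, QgsProject

start = time.time()
uri = 'file:///tmp/points.csv?delimiter=,&xField=lon&yField=lat&crs=EPSG:4326'
layer = QgsVectorLayer(uri, 'points', 'delimitedtext')  # load the CSV as a point layer
QgsProject.instance().addMapLayer(layer)                # triggers rendering in the canvas
print(f'loading took {time.time() - start:.1f}s')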
Panning and zooming around are no fun either since rendering takes so long. Changing from a single symbol renderer to, for example, a heatmap renderer does not improve the rendering times. So we need a different solution if we want to efficiently explore large point CSV files.
The Pandas data analysis library is well-known for being a convenient tool for handling CSVs. However, it’s less clear how to use it as a replacement for desktop GIS for exploring large CSVs with point coordinates. My favorite solution so far uses hvPlot + HoloViews + Datashader to provide interactive Bokeh plots in Jupyter notebooks.
hvPlot provides a high-level plotting API built on HoloViews that provides a general and consistent API for plotting data in (Geo)Pandas, xarray, NetworkX, dask, and others. (Image source: https://hvplot.holoviz.org)
But first things first! Loading the CSV as a Pandas Dataframe takes 10.7 seconds. Pandas’ default plotting function (based on Matplotlib), however, takes around 13 seconds and only produces a static scatter plot.
Loading and plotting the CSV with Pandas
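In code, the Pandas baseline boils down to something like this; the file and column names are assumptions:

import pandas as pd

df = pd.read_csv('movement_data.csv')             # ~10.7 s for 12 million rows
df.plot(kind='scatter', x='lon', y='lat', s=0.1)  # static Matplotlib scatter, ~13 s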
hvPlot to the rescue!
We only need two more steps to get faster and interactive map plots (plus background maps!): First, we need to reproject the lat/lon values. (There’s a warning here, most likely since some of the input lat/lon values are invalid.) Then, we replace plot() with hvplot() and voilà:
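A sketch of those two steps, with column names as assumptions:

from datashader.utils import lnglat_to_meters
import hvplot.pandas  # noqa: registers the .hvplot accessor

# Project lon/lat to Web Mercator so the points line up with the tile map
df['x'], df['y'] = lnglat_to_meters(df['lon'], df['lat'])
# Swap plot() for hvplot(): Datashader aggregates, Bokeh renders interactively
df.hvplot.scatter(x='x', y='y', datashade=True, tiles='OSM')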
Plotting the CSV with Datashader
As you can see from the above GIF, the whole process barely takes 2 seconds and the resulting map plot is interactive and very responsive.
12 million points are far from the limit. As long as the Pandas DataFrame fits into memory, we are good and when the datasets get bigger than that, there are Dask DataFrames. But that’s a story for another day.