Missing data imputation in human mobility datasets using Gaussian Process Regression

In this brief post, we introduce Gaussian Processes (GPs), describe their benefits, and explain why we chose them for this project. Gaussian Process Regression (GPR) is a well-known non-parametric, Bayesian approach to regression. GPR works efficiently and accurately on small datasets, and it provides uncertainty estimates alongside its predictions. A GP places a prior distribution over an infinite space of functions; as the model receives training data, the posterior distribution narrows to the functions consistent with the observations, which makes GPR a flexible and highly expressive tool for modeling complex datasets.
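To make the idea of a posterior that narrows around the training data concrete, here is a minimal NumPy sketch of GP regression. It is not the code used in this project: the squared-exponential kernel, the toy sine data, the noise level, and the function names `rbf_kernel` and `gp_posterior` are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, lengthscale=1.0, variance=1.0, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train, lengthscale, variance)
    K += noise * np.eye(len(x_train))            # observation noise on the diagonal
    K_s = rbf_kernel(x_train, x_test, lengthscale, variance)
    L = np.linalg.cholesky(K)                    # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = variance - np.sum(v**2, axis=0)        # prior variance minus what the data explains
    return mean, np.maximum(var, 0.0)

# Toy data (assumed for illustration): four observations of a sine curve.
x_train = np.array([0.0, 1.0, 2.0, 6.0])
y_train = np.sin(x_train)

# One test point on top of an observation, one far from every observation.
x_test = np.array([1.0, 4.0])
mean, var = gp_posterior(x_train, y_train, x_test)
print("posterior mean:", mean)
print("posterior variance:", var)
```

In this sketch the variance at the test point that coincides with an observation is close to the noise level, while the variance at the point far from the data stays close to the prior variance. That gap is the uncertainty measurement mentioned above.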

In this work, we employ GPR to fill in the so-called “missing data” in human mobility trajectories, which commonly consist of discrete observations in time and space. Figure 1 shows an example of an anonymized user’s trajectory with (standardized) latitudes and longitudes on the y-axis and time on the x-axis. The blue line shows the posterior mean function, which depends on two kernel hyperparameters: (1) lengthscale and (2) variance. The lengthscale governs the temporal correlations of the data (i.e., after how many minutes should a training point become irrelevant to the prediction?), while the variance governs the spatial scale (i.e., how far can the mean function and the uncertainty band range?).
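The effect of the two hyperparameters can be sketched on a toy trajectory with a gap. Everything here is an assumption made for illustration: the timestamps (in minutes), the standardized latitude values, the squared-exponential kernel, the noise level, and the helper names `rbf` and `impute`. It is not the model behind Figure 1.

```python
import numpy as np

def rbf(a, b, lengthscale, variance):
    """Squared-exponential kernel between 1-D time arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def impute(t_obs, y_obs, t_missing, lengthscale, variance, noise=1e-3):
    """Zero-mean GP posterior mean and variance at the missing timestamps."""
    K = rbf(t_obs, t_obs, lengthscale, variance) + noise * np.eye(len(t_obs))
    K_s = rbf(t_obs, t_missing, lengthscale, variance)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = variance - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.maximum(var, 0.0)

# Hypothetical trajectory: observations every 10 minutes, nothing between minute 20 and 80.
t_obs = np.array([0.0, 10.0, 20.0, 80.0, 90.0, 100.0])
lat_obs = np.array([1.0, 1.1, 1.2, 1.0, 0.9, 0.8])   # standardized latitude
t_missing = np.array([50.0])                          # middle of the gap

results = {}
for lengthscale in (5.0, 30.0):                       # minutes
    mean, var = impute(t_obs, lat_obs, t_missing, lengthscale, variance=1.0)
    results[lengthscale] = (mean[0], var[0])
    print(f"lengthscale={lengthscale:>4} min  mean={mean[0]:.3f}  variance={var[0]:.3f}")
```

With the 5-minute lengthscale, the observations 30 minutes away are effectively irrelevant, so the prediction in the middle of the gap falls back to the prior mean of zero and the uncertainty returns to the kernel variance. With the 30-minute lengthscale, the same observations still inform the imputed value and the uncertainty is visibly smaller. The kernel variance acts as the ceiling on the uncertainty in both cases.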

For more information on GPR, check out this easy-to-follow video, or if you’re feeling more academic, this paper is excellent as well. Stay tuned for more updates on our progress!

Figure 1: GPR model inputs and outputs plotted against time 
