Sampling Bias in the Cuebiq Dataset across MSAs

We study the bias in mobility data across 11 Metropolitan Statistical Areas (MSAs) in the U.S.: New York NY, Los Angeles CA, Boston MA, Seattle WA, Baltimore MD, Tulsa OK, Fresno CA, Tyler TX, Champaign-Urbana IL, Sebring-Avon Park FL, and Cheyenne WY. These MSAs were selected to cover diverse socioeconomic and racial compositions.

Based on detected home locations, we analyze sampling bias using the ratio between the number of devices and the population. We assume that there is no bias among population groups within the same block group. The violin plot below shows the 6 MSAs with a population over 500,000. In Los Angeles, CA, both the White population and other racial groups are under-represented in our dataset, but the total bias for other racial groups reaches -9.60%, far larger than the -1.04% for the White population. A similar pattern exists in Fresno, CA. In Boston, MA, and Seattle, WA, the pattern is reversed: the White population is less represented than other races. In Tulsa, OK, both White and other racial groups are over-represented. In Baltimore, MD, the bias in the two groups appears balanced when outliers are excluded.
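To make the device-to-population ratio concrete, here is a minimal sketch of one way such a bias figure could be computed. It is not necessarily the exact formula behind the numbers above. It assumes that devices in each block group are split across population groups in proportion to their population shares (the "no bias within a block group" assumption), and that bias is the group's devices-per-person rate relative to a baseline rate. The function name `group_sampling_bias`, the `reference_rate` baseline, and the toy counts are all hypothetical.

```python
import numpy as np

def group_sampling_bias(devices, pop_by_group, reference_rate):
    """Relative sampling bias per population group (illustrative only).

    devices        : (n_block_groups,) devices whose home is in each block group
    pop_by_group   : (n_block_groups, n_groups) population counts per group
    reference_rate : baseline devices-per-person rate (an assumption of this sketch)
    """
    devices = np.asarray(devices, dtype=float)
    pop_by_group = np.asarray(pop_by_group, dtype=float)

    # Population share of each group inside every block group.
    total_pop = pop_by_group.sum(axis=1, keepdims=True)
    shares = np.divide(pop_by_group, total_pop,
                       out=np.zeros_like(pop_by_group), where=total_pop > 0)

    # No within-block-group bias: devices follow the population shares.
    est_devices = shares * devices[:, None]

    # Devices per person for each group, compared with the baseline rate.
    group_rate = est_devices.sum(axis=0) / pop_by_group.sum(axis=0)
    return group_rate / reference_rate - 1.0

# Toy example: two block groups, two population groups (made-up counts).
devices = [10, 30]
pop_by_group = [[100, 100],
                [300, 100]]
reference_rate = 40 / 600  # here: the overall rate of the toy area

bias = group_sampling_bias(devices, pop_by_group, reference_rate)
print(bias)  # positive = over-represented, negative = under-represented
```

With this definition, two groups can both be under-represented (or both over-represented) when the baseline rate comes from a wider area than the MSA itself.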

Missing data imputation in human mobility datasets using Gaussian Process Regression

In this brief, introductory post, we introduce Gaussian Processes (GPs), their benefits, and why we chose to use them in this project. Gaussian Process Regression (GPR) is a well-known, non-parametric, Bayesian approach to regression. GPR works efficiently and accurately on small datasets and provides uncertainty estimates for its predictions. A GP places a prior distribution over an infinite space of functions; as the model receives training data, the posterior narrows the set of plausible functions, making GPR a flexible and highly expressive tool for modeling complex datasets.

In this work, we employ GPR to fill in the so-called “missing data” in human mobility trajectories, which commonly consist of discrete observations in time and space. Figure 1 shows an example of an anonymized user’s trajectory with (standardized) latitudes and longitudes on the y-axis and time on the x-axis. The blue line shows the posterior mean function, which depends on two kernel parameters: (1) lengthscale and (2) variance. The lengthscale specifies the temporal correlations of the data (i.e., after how many minutes should a training point become irrelevant to the prediction?), while the variance sets the scale of the spatial variation (i.e., what are the bounds for the mean function and the uncertainty?).
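As a rough illustration of how the lengthscale and variance enter the prediction, here is a minimal NumPy sketch of GPR imputation along one coordinate. The post does not specify the kernel or library used, so the squared-exponential (RBF) kernel, the small `noise` term, the function names, and all numbers below are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(t1, t2, lengthscale, variance):
    # Squared-exponential kernel: correlation decays with the time difference.
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_impute(t_obs, y_obs, t_query, lengthscale=30.0, variance=1.0, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP at t_query."""
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    t_query = np.asarray(t_query, dtype=float)

    # Covariance among observations (plus observation noise) and
    # cross-covariance between query times and observations.
    K = rbf_kernel(t_obs, t_obs, lengthscale, variance) + noise * np.eye(len(t_obs))
    K_s = rbf_kernel(t_query, t_obs, lengthscale, variance)

    # Cholesky-based solve for numerical stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = K_s @ alpha

    v = np.linalg.solve(L, K_s.T)
    var = variance - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Toy trajectory: time in minutes, standardized latitude, with a gap
# between minute 20 and minute 120 (made-up values).
t_obs = [0, 10, 20, 120, 130, 140]
lat_obs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.0]
t_query = np.arange(0, 141, 10)

mean, std = gp_impute(t_obs, lat_obs, t_query, lengthscale=30.0, variance=1.0)
for t, m, s in zip(t_query, mean, std):
    print(f"t={t:3d} min  lat={m:+.2f}  +/- {s:.2f}")
```

In this sketch the uncertainty grows inside the gap and shrinks near observed points. A longer lengthscale lets distant observations influence the imputed values, while the variance caps how wide the uncertainty band can get. In practice both would typically be fitted to the data rather than fixed by hand, and longitude would be handled the same way.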

For more information on GPR, check out this easy-to-follow video; if you’re feeling more academic, this paper is excellent as well. Stay tuned for more updates on our progress!

Figure 1: GPR model inputs and outputs plotted against time