2022 – CIS Big Data

May 30, 2022 by bigdata4mobility

Sampling Bias in Cuebiq Dataset in MSAs

We study the bias in mobility data across 11 Metropolitan Statistical Areas (MSAs) in the U.S., including New York NY, Los Angeles CA, Boston MA, Seattle WA, Baltimore MD, Tulsa OK, Fresno CA, Tyler TX, Champaign-Urbana IL, Sebring-Avon Park FL, and Cheyenne WY. These MSAs were selected to cover diverse socioeconomic and racial compositions.

Based on detected home locations, we measure sampling bias using the ratio between the number of devices and the population. We assume that there is no bias among population groups within the same block group. The violin plot below shows the 6 MSAs with a population over 500,000. In Los Angeles, CA, both White and other racial groups are under-represented in our dataset, but the total bias for other racial groups (-9.60%) is much larger than that for the White population (-1.04%). A similar pattern exists in Fresno, CA. In Boston, MA, and Seattle, WA, the pattern is reversed: the White population is less represented than other races. In Tulsa, OK, both White and other racial groups are over-represented. In Baltimore, MD, the bias in the two groups appears balanced when outliers are excluded.
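The post does not give the exact bias formula, so the following is only a minimal sketch of one way the device-to-population ratio could be turned into a per-group bias. It assumes that each block group's devices are split across racial groups in proportion to their population (the "no bias within a block group" assumption above), and that bias is the relative difference between a group's estimated sampling rate and a reference rate. The function name, the input layout, the `reference_rate` parameter, and the numbers are all hypothetical.

```python
def group_sampling_bias(block_groups, reference_rate=None):
    """Relative sampling bias per population group (illustrative only).

    block_groups: list of dicts, each with
        'devices'    -> number of devices whose home is in the block group
        'population' -> dict mapping group name to resident count
    reference_rate: devices per resident to compare against; defaults to
        the pooled rate over the supplied block groups (an assumption).
    """
    est_devices = {}   # devices attributed to each group
    population = {}    # residents in each group
    total_devices = 0
    for bg in block_groups:
        bg_pop = sum(bg["population"].values())
        if bg_pop == 0:
            continue  # no residents, so no rate can be estimated
        rate = bg["devices"] / bg_pop  # same rate for every group in this block group
        total_devices += bg["devices"]
        for group, residents in bg["population"].items():
            est_devices[group] = est_devices.get(group, 0.0) + rate * residents
            population[group] = population.get(group, 0) + residents
    if reference_rate is None:
        reference_rate = total_devices / sum(population.values())
    return {
        g: est_devices[g] / population[g] / reference_rate - 1.0
        for g in population
        if population[g] > 0
    }


# Made-up block groups, not values from the Cuebiq dataset.
example = [
    {"devices": 30, "population": {"White": 200, "Other": 100}},
    {"devices": 10, "population": {"White": 50, "Other": 150}},
]
bias = group_sampling_bias(example)
for group, value in bias.items():
    print(f"{group}: {value:+.2%}")
```

With the pooled rate as the reference, group biases necessarily have mixed signs. A case like Los Angeles, where both groups are under-represented, implies a reference outside the MSA (for example, a rate pooled over all study areas), which can be supplied through `reference_rate`.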

May 30, 2022 by bigdata4mobility

Community Call Plan

The team plans to engage stakeholders and collaborating labs to develop and issue the community call to all mobility labs in the US and around the world. The community call will specify a set of mobility metrics to be submitted, along with a set of information that describes the data and the techniques used. The call will state a due date by which participating labs must submit their results to the PI team. After receiving the submissions, the PI team will conduct a meta-analysis of them. The results of the meta-analysis will be presented to all participating labs in a virtual meeting, also attended by the stakeholders. The culminating event of the research project is an in-person whole-community workshop to be held at the University of Washington involving all participating labs and stakeholders. This workshop has two important purposes: 1) to report back the findings of this community-coordinated effort and formulate (with all participating labs and stakeholders) recommendations for the research community and for the stakeholders (policymakers); and 2) to develop a result dissemination mechanism involving all participating labs and stakeholders. Planned dissemination methods include manuscripts, white papers, short essays for non-technical audiences, policy briefs, and short videos.

We expect to release the community call in late 2022 or early 2023. Please check back.

May 25, 2022 by bigdata4mobility

Missing data imputation in human mobility datasets using Gaussian Process Regression

In this brief post, we introduce Gaussian Processes (GPs), their benefits, and why we have chosen to use them in this project. Gaussian Process Regression (GPR) is a well-known, non-parametric, Bayesian approach to regression. GPR works efficiently and accurately on small datasets, and it provides uncertainty estimates along with its predictions. A GP prior is a distribution over an infinite space of functions. As the model receives training data, the posterior narrows this down to the functions that are consistent with the observations, making GPR a flexible and highly expressive tool for modeling complex datasets.

In this work, we employ GPR to fill in the so-called “missing data” in human mobility trajectories, which commonly consist of discrete observations in time and space. Figure 1 shows an example of an anonymized user’s trajectory with (standardized) latitudes and longitudes on the y-axis and time on the x-axis. The blue line shows the posterior mean function, which depends on two kernel parameters: (1) lengthscale and (2) variance. The lengthscale specifies the temporal correlations of the data (i.e., after how many minutes should a training point become irrelevant to the prediction?), while the variance specifies the spatial correlations (i.e., what are the boundaries for the mean function and the uncertainty?).
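To make the roles of the lengthscale and the variance concrete, here is a minimal, dependency-free sketch of GP regression on one coordinate of a trajectory. It assumes a zero-mean GP with a squared-exponential (RBF) kernel and a small fixed noise term. The post does not state which kernel or library the project uses, so the kernel choice, the function names, the parameter values, and the toy observations are illustrative assumptions only.

```python
import math


def rbf_kernel(t1, t2, lengthscale, variance):
    """Squared-exponential covariance between two time stamps (minutes)."""
    return variance * math.exp(-0.5 * ((t1 - t2) / lengthscale) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def forward_sub(low, b):
    """Solve low @ x = b for a lower-triangular matrix low."""
    x = []
    for i, row in enumerate(low):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def backward_sub(low, b):
    """Solve low.T @ x = b for a lower-triangular matrix low."""
    n = len(low)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / low[i][i]
    return x


def gp_posterior(t_train, y_train, t_query, lengthscale, variance, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at each query time."""
    n = len(t_train)
    k = [[rbf_kernel(t_train[i], t_train[j], lengthscale, variance)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    low = cholesky(k)
    alpha = backward_sub(low, forward_sub(low, y_train))  # (K + noise*I)^-1 y
    means, variances = [], []
    for tq in t_query:
        k_star = [rbf_kernel(tq, t, lengthscale, variance) for t in t_train]
        means.append(sum(ks * a for ks, a in zip(k_star, alpha)))
        v = forward_sub(low, k_star)
        variances.append(variance - sum(x * x for x in v))
    return means, variances


# Toy standardized latitudes with a gap between minute 20 and minute 60.
t_obs = [0.0, 10.0, 20.0, 60.0, 70.0]
lat_obs = [-1.0, -0.8, -0.5, 0.9, 1.0]
t_new = [10.0, 40.0, 1000.0]  # observed time, inside the gap, far away

mean, var = gp_posterior(t_obs, lat_obs, t_new, lengthscale=15.0, variance=1.0)
for t, m, v in zip(t_new, mean, var):
    print(f"t={t:7.1f} min  mean={m:+.3f}  variance={v:.4f}")
```

In this sketch the posterior uncertainty is smallest at observed times, grows inside the gap, and returns to the prior variance once the query is many lengthscales away from any observation. A shorter lengthscale makes observations lose influence sooner, and the variance sets the scale of both the mean function's excursions and the uncertainty band.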

For more information on GPR, check out this easy-to-follow video or, if you’re feeling more academic, this paper, which is excellent as well. Stay tuned for more updates on our progress!

Figure 1: GPR model inputs and outputs plotted against time