Policymakers, businesses and NGOs
need disaggregated data to guide their actions. But typically this data does
not exist below the first administrative level, if at all.
A recent paper in Science by Blumenstock et
al. suggests that anonymized mobile phone metadata might provide an answer. There has been prior work linking the volume of
call use, aggregated up, to population statistics. But this paper is a departure because it
focuses on “understanding how the digital footprints of a single individual can
be used to accurately predict that same individual’s economic characteristics”.
To do this, they gained access to
an anonymized 2009 database from Rwanda’s largest mobile operator containing
records of billions of interactions.
They then drew a geographically stratified random sample of 856 of these
individual subscribers (from 1.5 million), covering 30 districts and 300
cell towers. They conducted a phone interview with each of them (asking 75
questions) and got permission from each of them to merge the survey data with
the mobile phone interaction data.
The authors use a “structured
combinatorial method to automatically generate several thousand metrics from
the phone logs that quantify factors such as the total volume, intensity,
timing and directionality of communication; the structure of the individual’s
contact network; patterns of mobility and migration based on geospatial markers
in the data, and so forth”. The authors then use “elastic net
regularization” to remove less relevant metrics and arrive at a manageable model. Actual wealth (as constructed from answers
given in the phone survey) is correlated with predicted wealth (using the
regression model developed) at the individual level at r = 0.68.
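To give a feel for what this kind of regularization does, here is a minimal sketch of elastic net regression (the usual name for the technique) fitted by coordinate descent. It is not the authors' code: the data are synthetic, and the function names, the penalty settings (`alpha`, `l1_ratio`) and the 20 made-up "metrics" are all my own assumptions. Only the first two metrics actually drive the simulated wealth score, and the penalty is what pushes the coefficients of the irrelevant ones to exactly zero.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it falls inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, alpha=0.2, l1_ratio=0.5, n_iter=100):
    """Minimise (1/2n)*||y - Xb||^2 + alpha*(l1_ratio*|b|_1 + 0.5*(1 - l1_ratio)*|b|_2^2).

    Plain coordinate descent, no intercept (assumes roughly centred data).
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef, since coef starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of metric j with the residual that excludes metric j's own fit
            rho = sum(X[i][j] * (residual[i] + X[i][j] * coef[j]) for i in range(n)) / n
            new = soft_threshold(rho, alpha * l1_ratio) / (col_sq[j] + alpha * (1 - l1_ratio))
            delta = new - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new
    return coef

# Synthetic example (hypothetical data): 200 "subscribers", 20 phone metrics,
# of which only metrics 0 and 1 are related to the wealth score.
random.seed(0)
n_people, n_metrics = 200, 20
X = [[random.gauss(0, 1) for _ in range(n_metrics)] for _ in range(n_people)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0, 0.5) for row in X]

coef = elastic_net(X, y)
kept = [j for j, c in enumerate(coef) if c != 0.0]
print("metrics kept by the model:", kept)
```

In the paper the starting point is several thousand metrics rather than 20, and the penalty strength would normally be chosen by cross-validation rather than fixed by hand as it is here.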
Then they use this model to
generate out-of-sample predictions for the remaining 1.5 million Rwandan mobile
phone users (the whole point of the exercise: can we predict the wealth of
individual subscribers based solely on anonymized transaction records?) and
compare various predictions with 2007 and 2010 DHS survey data (using those
reporting mobile phone ownership or not in the DHS survey), which allow for
wealth indices to be constructed.
These out of sample predicted
wealth estimates are aggregated up to the district level (30 of them) and
compared to the DHS wealth estimates at the district level. For the 3 big districts (more than 400k people) the
correlations are tight. For the small
ones (less than 200k people) the correlations are looser. In fact the district wealth maps don’t look
THAT similar for the two data sources (Figure 3 A and B) even though the r value
is about 0.91. The r value for
correlations between the 2 data sources at the cluster level is 0.79.
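The aggregate-then-compare step can be sketched as follows. This is only an illustration under my own assumptions: the four districts, the individual predicted wealth values and the survey figures are invented, and the helper names `district_means` and `pearson_r` are hypothetical, not anything taken from the paper.

```python
from collections import defaultdict
from math import sqrt

def district_means(records):
    """Average individual-level values within each district.

    records: iterable of (district, value) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for district, value in records:
        totals[district][0] += value
        totals[district][1] += 1
    return {d: s / k for d, (s, k) in totals.items()}

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical predicted wealth for individual subscribers, tagged by district
predicted = [
    ("A", 0.2), ("A", 0.4),
    ("B", -0.1), ("B", 0.1),
    ("C", 1.0), ("C", 1.2),
    ("D", -0.6), ("D", -0.4),
]
# Hypothetical district-level wealth index from a household survey
survey = {"A": 0.25, "B": 0.05, "C": 0.9, "D": -0.7}

means = district_means(predicted)
districts = sorted(survey)
r = pearson_r([means[d] for d in districts], [survey[d] for d in districts])
print("district means from phone data:", means)
print("correlation with survey index:", round(r, 2))
```

With so few units, one very rich or very poor district can dominate the coefficient, which is one reason a high r can sit alongside maps that look different.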
The paper does not (as far as I
can tell) dwell on its limitations, but there are several. First, obviously, how representative are
mobile phone owners relative to the general population? It would have been good
to see some geographic data on this, based on the last census. Second, how representative is the sample of
856 of the 1.5 million mobile phone users? How were refusals dealt with? Third,
how contemporaneous are the survey data (year not given) and the phone data
(2009)? Fourth, what exactly is elastic net regularization? It would be nice to see the workings. Fifth, I’m not sure, but I’d be surprised if
the DHS is representative at the district level, so how valid are the
comparisons at that level? Sixth, I
would have liked to have seen the Spearman correlations (on the ranks of the
districts by wealth) to get a sense of what difference data source makes for
action. Seventh, what are the safeguards
for the subscribers whose records were used (presumably 1.5 million did not
give their permission)?
But these are all things that can
be worked on and improved (and perhaps they have been). I’m not going to be too critical because I
admire the creativity of the work and its potential value—mobile phones are
everywhere, survey enumerators are rare sightings.
I would like to see if the correlations are
any good with stunting of under-3s or under-5s, perhaps keying the model into
health-information-seeking behaviors used by subscribers.
When the data you want are
scarce, you have to creatively use the data you have to patch things over while
you make the case for investment in the former.
I would like to see those district rankings, however. That would give us a sense of how serious
misclassification is likely to be (do we implement an infant complementary feeding
programme in District A or District B?).
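To show what I mean by comparing ranks, here is a small sketch of a Spearman correlation, computed as the Pearson correlation of the ranks. The six districts and their wealth scores are invented for illustration, and the function names are my own. The example is built so that one rich district drives a very high Pearson r, while the ordering of the poorer districts (the ones a programme would target) differs between the two sources.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical district wealth scores from the two data sources
districts = ["A", "B", "C", "D", "E", "F"]
phone = [-0.9, -0.3, -0.2, 0.0, 0.1, 2.5]
dhs = [-0.8, -0.1, -0.3, 0.2, 0.0, 2.4]

r = pearson_r(phone, dhs)
rho = spearman_rho(phone, dhs)

def poorest(scores, k=2):
    """Names of the k districts with the lowest scores."""
    return {d for _, d in sorted(zip(scores, districts))[:k]}

print("Pearson r:", round(r, 3), "Spearman rho:", round(rho, 3))
print("poorest two (phone):", poorest(phone), "poorest two (DHS):", poorest(dhs))
```

In this made-up case the two sources agree on the poorest district but disagree on the second poorest, which is exactly the District A versus District B problem.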
1 comment:
Hi Lawrence,
Very interesting paper indeed! I applaud the innovative way of using cell-phone data to get disaggregated information at low cost. This is badly needed by program implementers and also for evaluation purposes.
In addition to the limitations you pointed out, however, when reading the paper I couldn't help wondering: why use DHS data to validate the poverty diagnosis the authors based on cell-phone interviews + big data, and why didn't they use poverty surveys?
Indeed, the map on page 17 of the national poverty analysis (http://eeas.europa.eu/delegations/rwanda/documents/press_corner/news/poverty_report_en.pdf) doesn't look much like the one in the article.
I would also have been interested, for the practical use of poverty data (e.g. program targeting), to see how the authors' results fit with the Ubudehe social categorization that the Government of Rwanda uses to identify and actually target the poorest.
This would be a useful add-on to the publication.
Nevertheless, I join you in admiring the innovative work the researchers have done, which highlights how we can make better use of data that already exist.
All the best
Yves Martin-Prével (IRD - Nutripass - Montpellier France; member of the IEG of the Global Nutrition Report)
PS: You are right; the Rwanda 2010 DHS is representative at the region (+Kigali) level, not at the district level.