
Traditional development evaluation has been characterized as ‘backward looking’ rather than forward looking and too focused on proving over improving. Some believe applying an ‘agile’ approach in development would be more useful — the assumption being that if you design a program properly and iterate rapidly and constantly based on user feedback and data analytics, you are more likely to achieve your goal or outcome without requiring expensive evaluations. The idea is that big data could eventually allow development agencies to collect enough passive data about program participants that there would no longer be a need to actively survey people or conduct a final evaluation, because there would be obvious patterns that would allow implementers to understand behaviors and improve programs along the way.

The above factors have made some evaluators and data scientists question whether big data and real-time availability of multiple big data sets, along with the technology that enables their collection and analysis, will make evaluation as we know it obsolete. Others have argued that it’s not the end of evaluation, but rather we will see a blending of real-time monitoring, predictive modeling, and impact evaluation, depending on the situation. Big questions remain, however, about the feasibility of big data in some contexts. For example, are big data approaches useful when it comes to people who are not producing very much digital data? How will the biases in big data be addressed to ensure that the poorest, least connected, and/or most marginalized are represented?

The Technology Salon on Big Data and Evaluation hosted during November’s American Evaluation Association Conference in Chicago opened these questions up for consideration by a roomful of evaluators and a few data scientists. We discussed the potential role of new kinds and quantities of data. We asked how to incorporate static and dynamic big data sources into development evaluation. We shared ideas on what tools, skills, and partnerships we might require if we aim to incorporate big data into evaluation practice. This rich and well-informed conversation was catalyzed by our lead discussants: Andrew Means, Associate Director of the Center for Data Science & Public Policy at the University of Chicago and Founder of Data Analysts for Social Good and The Impact Lab; Michael Bamberger, Independent Evaluator and co-author of Real World Evaluation; and Veronica Olazabal from The Rockefeller Foundation. The Salon was supported by ITAD via a Rockefeller Foundation grant.

What do we mean by ‘big data’?

The first task was to come up with a general working definition of what was understood by ‘big data.’ Very few of the organizations present at the Salon were actually using ‘big data’ and definitions varied. Some talked about ‘big data sets’ as those that could not be collected or analyzed by a human on a standard computer. Others mentioned that big data could include ‘static’ data sets (like government census data – if digitized — or cellphone record data) and ‘dynamic’ data sets that are being constantly generated in real time (such as streaming data input from sensors or ‘cookies’ and ‘crumbs’ generated through use of the Internet and social media). Others considered big data to be real time, socially-created and socially-driven data that could be harvested without having to purposely collect it or budget for its collection. ‘It’s data that has a life of its own. Data that just exists out there.’ Yet others felt that for something to be ‘big data’ multiple big data sets needed to be involved, for example, genetic molecular data crossed with clinical trial data and other large data sets, regardless of static or dynamic nature. Big data, most agreed, is data that doesn’t easily fit on a laptop and that requires a specialized skill set that most social scientists don’t have. ‘What is big data? It’s hard to define exactly, but I know it when I see it,’ concluded one discussant.

Why is big data a ‘thing’?

As one discussant outlined, recent changes in technology have given rise to big data. Data collection, data storage and analytical power are becoming cheaper and cheaper. ‘We live digitally now and we produce data all the time. A UPS truck has anywhere from 50-75 sensors on it to do everything from optimize routes to indicate how often it visits a mechanic,’ he said. ‘The analytic and computational power in my iPhone is greater than what the space shuttle had.’ In addition, we have ‘seamless data collection’ in the case of Internet-enabled products and services, meaning that a person creates data as they access products or services, and this can then be monetized, which is how companies like Google make their money. ‘There is not someone sitting at Google going — OK, Joe just searched for the nearest pizza place, let me enter that data into the system — Joe is creating the data about his search while he is searching, and this data is a constant stream.’

What does big data mean for development evaluation?

Evaluators are normally tasked with making a judgment about the merit of something, usually for accountability, learning and/or to improve service delivery, and usually looking back at what has already happened. In the wider sense, the learning from evaluation contributes to program theory, needs assessment, and many other parts of the program cycle.

This approach differs in some key ways from big data work, because most of the new analytical methods used by data scientists are good at prediction but not very good at understanding causality, which is what social scientists (and evaluators) are most often interested in. ‘We don’t just look at giant data sets and find random correlations,’ however, explained one discussant. ‘That’s not practical at all. Rather, we start with a hypothesis and make a mental model of how different things might be working together. We create regression models and see which performs better. This helps us to know if we are building the right hypothesis. And then we chisel away at that hypothesis.’

Some challenges come up when we think about big data for development evaluation because the social sector lacks the resources of the private sector. In addition, data collection in the world of international development is not often seamless because ‘we care about people who do not live in the digital world,’ as one person put it. Populations we work with often do not leave a digital trail. Moreover, we only have complete data about the entire population in some cases (for example, when it comes to education in the US), meaning that development evaluators need to figure out how to deal with bias and sampling.

Satellite imagery can bring in some data that was unavailable in the past, and this is useful for climate and environmental work, but we still do not have a lot of big data for other types of programming, one person said. What’s more, wholly machine-based learning and the kind of ‘deep learning’ made possible by today’s computational power are currently not very useful for development evaluation.

Evaluators often develop counterfactuals so that they can determine what would have happened without an intervention. They may use randomized controlled trials (RCTs), differentiation models, statistics and economics research approaches to do this. One area where data science may provide some support is in helping to answer questions about counterfactuals.

More access to big data (and open data) could also mean that development and humanitarian organizations stop duplicating data collection functions. Perhaps most interestingly, big data’s predictive capabilities could in the future be used in the planning phase to inform the kinds of programs that agencies run, where they should be run, and who should be let into them to achieve the greatest impact, said one discussant. Computer scientists and social scientists need to break down language barriers and come together more often so they can better learn from one another and determine where their approaches can overlap and be mutually supportive.

Are we all going to be using big data?

Not everyone needs to use big data. Not everyone has the capacity to use it, and it doesn’t exist for offline populations, so we need to be careful that we are not forcing it where it’s not the best approach. As one discussant emphasized, big data is not magic, and it’s not universally applicable. It’s good for some questions and not others, and it should be considered as another tool in the toolbox rather than the only tool. Big data can provide clues to what needs further examination using other methods, and thus most often it should be part of a mixed methods approach. Some participants felt that the discussion about big data was similar to the one 20 years ago on electronic medical records or to the debate in the evaluation community about quantitative versus qualitative methods.

What about groups of people who are digitally invisible?

There are serious limitations when it comes to the data we have access to in the poorest communities, where there are no tablets and fewer cellphones. We also need to be aware of ‘micro-exclusion’ (who within a community or household is left out of the digital revolution?) and intersectionality (how do different factors of exclusion combine to limit certain people’s digital access?) and consider how these affect the generation and interpretation of big data. There is also a question about the intensity of the digital footprint: How much data and at what frequency is it required for big data to be useful?

Some Salon participants felt that over time, everyone would have a digital presence and/or data trail, but others were skeptical. Some data scientists are experimenting with calibrating small amounts of data and comparing them to human-collected data in an attempt to make big data less biased, a discussant explained. Another person said that by digitizing and validating government data on thousands (in the case of India, millions) of villages, big data sets could be created for those that are not using mobiles or data.

Another person pointed out that generating digital data is a process that involves much more than simple access to technology. ‘Joining the digital discussion’ also requires access to networks, local language content, and all kinds of other precursors, she said. We also need to be very aware that these kinds of data collection processes affect people’s participation and input into data collection and analysis. ‘There’s a difference between a collective evaluation activity where people are sitting around together discussing things and someone sitting in an office far from the community getting sound bites from a large source of data.’

Where is big data most applicable in evaluation?

One discussant laid out areas where big data would likely be the most applicable to development evaluation:

[Slide: areas where big data is most applicable to development evaluation]

It would appear that big data has huge potential in the evaluation of complex programs, he continued. ‘It’s fairly widely accepted that conventional designs don’t work well with multiple causality, multiple actors, multiple contextual variables, etc. People chug on valiantly, but it’s expected that you may get very misleading results. This is an interesting area because there are almost no evaluation designs for complexity, and big data might be a possibility here.’

In what scenarios might we use big data for development evaluation?

This discussant suggested that big data might be considered useful for evaluation in three areas:

  1. Supporting conventional evaluation design by adding new big data generated variables. For example, one could add transaction data from ATMs to conventional survey-generated poverty indicators.
  2. Increasing the power of a conventional evaluation design by using big data to strengthen the sample selection methodology. For example, satellite images were combined with data collected on the ground and propensity score matching was used to strengthen comparison group selection for an evaluation of the effects of interventions on protecting forest cover in Mexico.
  3. Replacing a conventional design with a big data analytics design by replacing regression based models with systems analysis. For example, one could use systems analysis to compare the effectiveness of 30 ongoing interventions that may reduce stunting in a sample of villages. Real-time observations could generate a time-series that could help to estimate the effectiveness of each intervention in different contexts.
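The second scenario, strengthening comparison group selection with propensity scores, can be illustrated with a stylized sketch. Everything below is invented for illustration and is not the Mexico study: in a real evaluation the propensity model would be fit (e.g., via logistic regression) from many covariates, including satellite-derived ones, rather than hand-specified.

```python
# Stylized propensity score matching: each village has a satellite-derived
# covariate (baseline forest cover); every treated village is matched to the
# comparison village with the closest propensity score before outcomes
# (change in forest cover) are compared.
import math

# (village_id, baseline_forest_cover_pct, treated?, outcome_change)
villages = [
    (1, 80, True, -2), (2, 75, True, -1), (3, 90, True, -3),
    (4, 82, False, -6), (5, 40, False, -1), (6, 74, False, -5),
    (7, 88, False, -4), (8, 35, False, -2),
]

def propensity(cover):
    # Hypothetical logistic score: high-cover villages were likelier to be
    # enrolled. In practice these coefficients come from a fitted model.
    return 1 / (1 + math.exp(-(cover - 60) / 10))

treated = [v for v in villages if v[2]]
controls = [v for v in villages if not v[2]]

effects = []
for vid, cover, _, outcome in treated:
    score = propensity(cover)
    match = min(controls, key=lambda c: abs(propensity(c[1]) - score))
    effects.append(outcome - match[3])  # treated minus matched control

att = sum(effects) / len(effects)  # average treatment effect on the treated
print(round(att, 2))
```

In this toy data, matched comparison villages lost more forest than treated ones, so the estimated effect is positive, i.e., the intervention appears protective, under all the usual matching assumptions.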

It is important to remember construct validity too. ‘If big data is available, but it’s not quite answering the question that you want to ask, it might be easy to decide to do something with it, to run some correlations, and to think that maybe something will come out. But we should avoid this temptation,’ he cautioned. ‘We need to remember and respect construct validity and focus on measuring what we think we are measuring and what we want to measure, not get distracted by what a data set might offer us.’

What about bias in data sets?

We also need to be very aware that big data carries with it certain biases that need to be accounted for, commented several participants, notably when working with low-connectivity populations and geographies or when using data from social media sites that cater to a particular segment of the population. One discussant shared an example where Twitter was used to identify patterns in food poisoning, and suddenly the upscale, hipster restaurants in the city seemed to be the problem. Obviously these restaurants were not the sole source of the food poisoning; rather, a particular kind of person tended to use Twitter.
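One standard corrective for this kind of platform bias is post-stratification: reweighting observations so that each group counts in proportion to its share of the actual population rather than its share of the platform’s users. A toy sketch, with invented numbers:

```python
# Reweight raw Twitter food-poisoning reports so each (hypothetical)
# demographic group contributes in proportion to its population share
# rather than its (over- or under-represented) Twitter share.
reports = {"young_urban": 120, "older_urban": 15, "suburban": 10}
twitter_share = {"young_urban": 0.60, "older_urban": 0.15, "suburban": 0.25}
population_share = {"young_urban": 0.20, "older_urban": 0.35, "suburban": 0.45}

# Weight each group's reports by population_share / twitter_share
weighted = {
    g: reports[g] * population_share[g] / twitter_share[g]
    for g in reports
}
total = sum(weighted.values())
adjusted_shares = {g: round(weighted[g] / total, 2) for g in weighted}
print(adjusted_shares)
```

After reweighting, the apparent dominance of the heavily tweeting group shrinks considerably, though this only corrects for known group-level imbalances and still assumes reporting behaves similarly within each group.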

‘People are often unclear about what’s magical and what’s really possible when it comes to big data. We want it to tell us impossible things and it can’t. We really need to engage human minds in this process; it’s not a question of everything being automated. We need to use our capacity for critical thinking and ask: Who’s creating the data? How’s it created? Where’s it coming from? Who might be left out? What could go wrong?’ emphasized one discussant. ‘Some of this information can come from the metadata, but that’s not always enough to make certain big data is a reliable source.’ Bias may also be introduced through the viewpoints and unconscious positions, values and frameworks of the data scientists themselves as they are developing algorithms and looking for/finding patterns in data.

What about the ethical and privacy implications?

Big data carries serious ethical and privacy implications. Issues of consent and potential risk are critical considerations, especially when working with populations that are newly online and/or who may not have a good understanding of data privacy and how their data may be used by third parties who are collecting and/or selling it. However, one participant felt that a protectionist mentality is misguided. ‘We are pushing back and saying that social media and data tracking are bad. Instead, we should realize that having a digital life and being counted in the world is a right and it’s going to be inevitable in the future. We should be working with the people we serve to better understand digital privacy and help them to be more savvy digital citizens.’ It’s also imperative, she said, that aid and development agencies abandon our slow and antiquated data collection systems and use the new digital tools that are available to us.

How can we be more responsible with the data we gather and use?

Development and humanitarian agencies do need to be more responsible with data policies and practices, however. Big data approaches may contribute to negative data extraction tendencies if we mine data and deliver it to decision-makers far away from the source. It will be critical for evaluators and big data practitioners to find ways to engage people ‘on the ground’ and involve more communities in interpreting and querying their own big data. (For more on responsible data use, see the Responsible Development Data Book. Oxfam also has a responsible data policy that could serve as a reference. The author of this blog is working on a policy and practice guide for protecting girls’ digital safety, security and privacy as well.)

Who should be paying for big data sets to be made available?

One participant asked about costs and who should bear the expense of creating big data sets and/or opening them up to evaluators and/or data scientists. Others asked for examples of the private sector providing data to the social sector. This highlighted additional ethical and privacy issues. One participant gave an example from the healthcare space where there is lots of experience in accessing big data sets generated by government and the private sector. In this case, public and private data sets needed to be combined. There were strict requirements around anonymization and the effort ended up being very expensive, which made it difficult to build a business case for the work.

This can be a problem for the development sector, because it is difficult to generate resources for resolving social problems; there is normally only investment if there is some kind of commercial gain to be had. Some organizations are now hiring ‘data philanthropist’ positions that help to negotiate these kinds of data relationships with the private sector. (Global Pulse has developed a set of big data privacy principles to guide these cases.)

So, is big data going to replace evaluation or not?

In conclusion, big data will not eliminate the need for evaluation. Rather, it’s likely that it will be integrated as another source of information for strengthening conventional evaluation design. ‘Big Data and the underlying methods of data science are opening up new opportunities to answer old questions in new ways, and ask new kinds of questions. But that doesn’t mean that we should turn to big data and its methods for everything,’ said one discussant. ‘We need to get past a blind faith in big data and get more practical about what it is, how to use it, and where it adds value to evaluation processes,’ said another.

Thanks again to all who participated in the discussion! If you’d like to join (or read about) conversations like this one, visit Technology Salon. Salons run under Chatham House Rule, so no attribution has been made in this summary post.


The NYC Technology Salon on February 28th examined the connection between bigger, better data and resilience. We held morning and afternoon Salons due to the high response rate for the topic. Jake Porway, DataKind; Emmanuel Letouzé, Harvard Humanitarian Initiative; and Elizabeth Eagen, Open Society Foundations, were our lead discussants for the morning. Max Shron, Data Strategy, joined Emmanuel and Elizabeth for the afternoon session.

This post summarizes key discussions from both Salons.

What the heck do we mean by ‘big data’?

The first question at the morning salon was: What precisely do we mean by the term ‘big data’? Participants and lead discussants had varying definitions. One way of thinking about big data is that it is comprised of small bits of unintentionally produced ‘data exhaust’ (website cookies, cellphone data records, etc.) that add up to a dataset. In this case, the term big data refers to the quality and nature of the data, and we think of non-sampled data that are messy, noisy and unstructured. The mindset that goes with big data is one of ‘turning mess into meaning.’

Some Salon participants understood big data as datasets that are too large to be stored, managed and analyzed via conventional database technologies or managed on normal computers. One person suggested dropping the adjective ‘big,’ forgetting about the size, and instead considering the impact of the contribution of the data to understanding. For example, if there were absolutely no data on something and 1000 data points were contributed, this might have a greater impact than adding another 10,000 data points to an existing set of 10 million.

The point here was that when the emphasis is on big (understood as size and/or volume), someone with a small data set (for example, one that fits into an Excel sheet) might feel inadequate, yet their data contribution may be actually ‘bigger’ than a physically larger data set (aha! it’s not the size of the paintbrush…). There was a suggestion that instead of talking about big data we should talk about smart data.

How can big data support development?

Two frameworks were shared for thinking about big data in development. One from UN Global Pulse considers that big data can improve a) real-time awareness, b) early warning and c) real-time monitoring. Another looks at big data being used for three kinds of analysis: a) descriptive (providing a summary of something that has already happened), b) predictive (likelihood and probability of something occurring in the future), and c) diagnostic (causal inference and understanding of the world).

What’s the link between big data and resilience?

‘Resilience’ as a concept is contested, difficult to measure and complex. In its most simple definition, resilience can be thought of as the ability to bounce back or bounce forward. (For an interesting discussion on whether we should be talking about sustainability or resilience, see this piece). One discussant noted that global processes and structures are not working well for the poor, as evidenced by continuing cycles of poverty and glaring wealth inequalities. In this view, people are poor as a result of being more exposed and vulnerable to shocks; at the same time, their poverty increases their vulnerability, and it’s difficult to escape the cycle in which, over time, small and large shocks deplete assets. An assets-based model of resilience would help individuals, families and communities who are hit by a shock in one sphere — financial, human, capital, social, legal and/or political — to draw on the assets within another sphere to bounce back or forward.

Big data could help this type of assets-based model of resilience by predicting, or helping poor and vulnerable people predict, when a shock might happen so they can prepare for it. Big data analytics, if accessible to the poor, could help them to increase their chances of making better decisions now and for the future. Big data, then, should be made accessible and available to communities so that they can self-organize and decrease their own exposure to shocks and hazards and increase their ability to bounce back and bounce forward. Big data could also help various actors to develop a better understanding of the human ecosystem and contribute to increasing resilience.

Can ivory tower big data approaches contribute to resilience?

The application of big data approaches to efforts that aim to increase resilience and better understand human ecosystems often comes at things from the wrong angle, according to one discussant. We are increasingly seeing situations where a decision is made at the top by people who know how to crunch data yet have no way of really understanding the meaning of the data in the local context. In these cases, the impact of data on resilience will be low, because resilience can only truly be created and supported at the local level. Instead of large organizations thinking about how they can use data from afar to ‘rescue’ or ‘help’ the poor, organizations should be working together with communities in crisis (or supporting local or nationally based intermediaries to facilitate this process) so that communities can discuss and pull meaning from the data, contextualize it and use it to help themselves. They can also be better informed about what data exist about them and more aware of how these data might be used.

For the Human Rights community, for example, the story is about how people successfully use data to advocate for their own rights, and there is less emphasis on large data sets. Rather, the goal is to get data to citizens and communities. It’s to support groups to define and use data locally and to think about what the data can tell them about the advocacy path they could take to achieve a particular goal.

Can data really empower people?

To better understand the opportunities and challenges of big data, we need to unpack questions related to empowerment. Who has the knowledge? The access? Who can use the data? Salon participants emphasized that change doesn’t come by merely having data. Rather it’s about using big data as an advocacy tool to tell the world to change processes and to put things normally left unsaid on the table for discussion and action. It is also about decisions and getting ‘big data’ to the ‘small world,’ e.g., the local level. According to some, this should be the priority of ‘big data for development’ actors over the next 5 years.

Though some participants at the Salon felt that data on their own do not empower individuals, others noted that knowing your credit score or tracking how much you are eating or exercising can indeed be empowering to individuals. In addition, the process of gathering data can help communities understand their own realities better, build their self-esteem and analytical capacities, and contribute to achieving a more level playing field when they are advocating for their rights or for a budget or service. As one Salon participant said, most communities have information but are not perceived to have data unless they collect it using ‘Western’ methods. Having data to support and back information, opinions and demands can serve communities in negotiations with entities that wield more power. (See the book “Who Counts? The Power of Participatory Statistics” on how to work with communities to create ‘data’ from participatory approaches).

On the other hand, data are not enough if there is no political will to make changes in response to the data and to the requests or demands being made based on them. As one Salon participant said: “giving someone a data set doesn’t change politics.”

Should we all jump on the data bandwagon?

Both discussants and participants made a plea to ‘practice safe statistics!’ Human rights organizations wander in and out of statistics and don’t really understand how it works, said one person. ‘You wouldn’t go to court without a lawyer, so don’t try to use big data unless you can ensure it’s valid and you know how to manage it.’ If organizations plan to work with data, they should have statisticians and/or data scientists on staff or on call as partners and collaborators. Lack of basic statistical literacy is a huge issue amongst the general population and within many organizations, thought leaders, and journalists, and this can be dangerous.

As big data becomes more trendy, the risk of misinterpretation is growing, and we need to place more attention on the responsible use of statistics and data or we may end up harming people through bad decisions. These days, ‘everyone thinks they are experts who can handle statistics – bias, collection, correlation.’ And ‘as a general rule, no matter how many times you say the data show possible correlation not causality, the public will understand that there is causality,’ commented one discussant. And generally, he noted, ‘when people look at data, they believe them as truth because they include numbers, statistics, science.’ Greater statistical literacy could help people to not just read or access data and information but to use them wisely, to understand and question how data are interpreted, and to detect political or other biases. What’s more, organizations today are asking questions about big data that have been on statisticians’ minds for a very long time, so reaching out to those who understand these issues can be useful to avoid repeating mistakes and re-learning lessons that have already been well-documented.

This poor statistical literacy becomes a serious ethical issue when data are used to determine funding or actions that impact on people’s lives, or when they are shared openly, accidentally or in ways that are unethical. In addition, privacy and protection are critical elements in using and working with data about people, especially when the data involve vulnerable populations. Organizations can face legal action and liability suits if their data put people in harm’s way, as one Salon participant noted. ‘An organization could even be accused of manslaughter… and I’m speaking from experience,’ she added.

What can we do to move forward?

Some potential actions for moving forward included:

  • Emphasizing to donors that having big data does not mean that community level processes related to data collection, interpretation, analysis, and ownership should be eliminated to cut costs;
  • Evaluations and literature/documentation on the effectiveness of different tools and methods, and on when and in which contexts they might be applicable. This includes cost-benefit analyses of using big data, and evaluation of its impact on development and on communities when it is combined with community level processes versus used alone, without community involvement. Practitioners’ gut feeling is that big data without community involvement is irresponsible and ineffective in terms of resilience, and it would be good to have evidence to validate or disprove this;
  • More and better tools and resources to support data collection, visualization and use, and to help organizations with risk analysis, privacy impact assessments, and strategies and planning around the use of big data; case studies and a place to share and engage with peers; and creation of a ‘cook book’ to help organizations understand the ingredients, tools and processes of using data/big data in their work;
  • ‘Normative conventions’ on how big data should be used to avoid falling into tech-driven dystopia;
  • Greater capacity for ‘safe statistics’ among organizations;
  • A community space where frank and open conversations around data/big data can occur in an ongoing way with the right range of people and cross-section of experiences and expertise from business, data, organizations, etc.

In conclusion?

We touched upon all types of data and various levels of data usage for a huge range of purposes at the two Salons. One closing thought was around the importance of having a solid idea of what questions we are trying to answer before moving on to collecting data, and then understanding what data collection methods are adequate for our purpose, which ICT tools are right for which data collection and interpretation methods, what will be done with the data and why we are collecting them, how we’ll interpret them, and how data will be shared, with whom, and in what format.

See this growing list of resources related to Data and Resilience here and add yours!

Thanks to participants and lead discussants for the fantastic exchange, and a big thank you to ThoughtWorks for hosting us at their offices for this Salon. Thanks also to Hunter Goldman, Elizabeth Eagen and Emmanuel Letouzé for their support developing this Salon topic, and to Somto Fab-Ukozor for support with notes and the summary. Salons are held under Chatham House Rule, therefore no attribution has been made in this post. If you’d like to attend future Salons, sign up here!
