United Nations researchers had a sobering realization in 2010. For all of the official statistics and reports collected by member nations and the UN's various programs and agencies, precious little of the data supporting the organization's operations was truly up to date.
That left officials worldwide trying to respond to socio-economic crises, such as the financial meltdown and staggering unemployment, with data that was months and sometimes years old.
In response, the UN launched an initiative called Global Pulse to coordinate research on big data for development. For its efforts, the UN was recognized as a 2013 Computerworld Honors Laureate.
Global Pulse "came out of the very pointed recognition that there's a need for real-time information, especially in an age of hyperconnectivity when something that happens one place in the world can immediately impact somewhere else," says Anoush Tatevossian, a UN strategic communications officer and partnership manager.
The UN isn't the only organization working to harness the power of big data and analytics for social good. Several other 2013 Computerworld Honors Laureates gather and analyze information to do things like minimize the impact of climate change, identify at-risk students and improve financial services to underserved populations.
Celebrating its 25th year, the Computerworld Honors program recognizes organizations that create and use IT systems to promote and advance public welfare. On June 3, this year's 267 honorees will gather in Washington to celebrate their achievements. Here's a look at projects undertaken by four laureates.
Monitoring Mood Shifts
One of the first projects United Nations Global Pulse took up was analyzing social media chatter and sentiment to identify trends related to rising unemployment, and then informing policymakers of the likely effects. Using text mining and social media analytics tools from SAS, the team examined two years of data from 500,000 blogs, forums and news sites in the U.S. and Ireland, scanning for all references to unemployment and coping mechanisms. The team then compared and analyzed so-called mood scores, which were based on the tone and themes of various references to unemployment.
In the U.S., a rise in "hostile" or "depressed" mood scores occurred four months before the unemployment spike. Increases in "anxious" chatter in Ireland correlated with a spike five months later. Increased "confused" chatter preceded a spike by three months, while "confident" chatter decreased significantly two months out. A dashboard displayed trends such as mood change over time, and leading and lagging indicators of unemployment shocks.
Another analysis revealed that increased chatter about cutting back on groceries, increasing use of public transportation and downgrading one's automobile could predict an unemployment spike. After a spike, surges in conversations about canceled vacations, reduced healthcare spending, and foreclosures or evictions shed light on lagging economic effects. This kind of information is invaluable for policymakers trying to mitigate the negative effects of increased unemployment, Tatevossian says.
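The core of this approach is a lead/lag comparison: slide a mood-score series against an unemployment series and find the offset at which they line up best. The UN team used SAS text-mining tools; the sketch below is a stand-in in plain Python with invented toy data, not the actual Global Pulse model.

```python
# Sketch of the lead/lag comparison Global Pulse describes: find how many
# months a mood-score series leads an unemployment series. Data and lags
# here are invented for illustration.

def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def best_lead(mood, unemployment, max_lag=6):
    """Return the lag (in months) at which mood best predicts unemployment.

    A lag of k compares mood[t] against unemployment[t + k], i.e. mood
    leading unemployment by k months.
    """
    return max(
        range(1, max_lag + 1),
        key=lambda k: pearson(mood[:-k], unemployment[k:]),
    )

# Toy data: "hostile" chatter rises four months before unemployment does.
mood = [1, 1, 2, 5, 9, 9, 8, 8, 7, 7, 6, 6]
unemployment = [4, 4, 4, 4, 4, 4, 4, 8, 12, 12, 11, 11]
print(best_lead(mood, unemployment))  # → 4
```

A dashboard like the one described above would then plot each mood category at its best-fitting lead, marking it as a leading or lagging indicator.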
Global Pulse is set up as an innovation lab at the UN in New York and is highly dependent on partners like SAS. "We're really a space to learn how big data and some of the new analytic technologies could be useful to the UN system," she says.
"We're very interested in reaching out to the private sector because we're interested in taking what has been tried [there] and seeing how it can be applied to UN processes," says Tatevossian. For example, she compares Global Pulse's work on unemployment to work that consumer goods companies do on a daily basis. "All we did is re-jig [the tools] that are used for brand monitoring and we treated unemployment as a brand," she says.
The UN has opened an additional Global Pulse lab in Jakarta, Indonesia, and will soon open a third in Kampala, Uganda.
Taking the Planet's Pulse
Every minute of every day, hundreds of thousands of sensors collect a huge volume of climate-related data for use by scientists working on very specific research projects at government organizations such as the National Oceanic and Atmospheric Administration (NOAA) and NASA.
"What we figured is that there must be a way to leverage this data in a wider way, to help companies with some of the climate-related issues they're grappling with," says Travis Koberg, head of Computer Sciences Corp.'s Big Data and Analytics Group, a 2013 Computerworld Honors Laureate.
Enter ClimatEdge, a suite of risk management tools that applies big data analytics to historical data from NASA, NOAA and other public sources. By exploiting this previously underutilized climate data resource, ClimatEdge provides new insights to commercial and public interests that need to minimize risk and make better-informed decisions, says Koberg.
Electronics manufacturers, for example, might want to find alternatives to East Asian semiconductor supplies if Pacific storms fueled by warming ocean waters become more intense. Or homebuilders might want to prepare for fluctuations in lumber prices brought on by climate conditions that force pine beetles and other pests that prey on trees to expand their range and destroy more forests.
Most of the data CSC is collecting is structured. What distinguishes ClimatEdge as a big data project is the sheer volume of data being collected and analyzed, says Koberg. "The volume is on an order of magnitude scale that you can't do a lot with in Excel, which is what a lot of scientists use," he says.
ClimatEdge began producing reports in June 2012. It is updated on a continuing basis as CSC learns new ways to apply evolving data science principles and gains access to new data sources.
Looking ahead, Koberg says he expects CSC to tap into, combine and analyze other underused big data reservoirs. "Climate is just one area of data. Healthcare is another. We work with healthcare where we collect data across a health system and put it together with climate data to hypothesize about characteristics tied to a certain disease," he explains. "Over time, we're looking at all sorts of data in various domains."
Creating a Student Safety Net
An educated workforce is critical to competing successfully in the global economy. Yet the U.S. ranks below 10th among countries in educational attainment, and only about 35% of students who begin a four-year college degree program complete it at the institution where they began their studies. Moreover, over the past 20 to 30 years, retention and completion rates have steadily dropped even as global economic competition has intensified.
"I'd personally position the problem as a national crisis," says Josh Baron, senior academic technology officer at Marist College in Poughkeepsie, N.Y.
The Open Academic Analytics Initiative (OAAI), an Educause project funded by the Bill & Melinda Gates Foundation and a 2013 Computerworld Honors Laureate, aims to significantly reduce education interruption and dropout rates by identifying at-risk students via predictive analytics and then proactively intervening to support these students before they leave school.
Launched in May 2011, the OAAI collects data and log information from member schools' learning management systems and couples it with student aptitude and demographic data. A predictive model, built entirely in open source, is then applied to this data to identify at-risk students.
To build the model, researchers analyzed several semesters of data from Marist College students. Next, the team used BI systems, primarily Pentaho open-source software, to put the analytics systems together.
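The OAAI model itself was built with open-source BI tooling (Pentaho) and trained on real Marist data; as a rough illustration of the underlying idea, a toy logistic scorer over hypothetical LMS features might look like this. The feature names, weights, and threshold are all invented for the sketch.

```python
# Illustrative sketch only: combine LMS activity and aptitude features into
# a single logistic risk score and flag students above a risk threshold.
import math

# Hypothetical weights a trained model might learn: low login activity and
# low partial grades push failure risk up, so their coefficients are negative.
WEIGHTS = {"logins_per_week": -0.6, "partial_grade": -0.08, "sat_zscore": -0.4}
BIAS = 6.0

def risk_score(features):
    """Logistic probability of course failure from a dict of features."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def flag_at_risk(students, threshold=0.5):
    """Return IDs of students whose predicted risk exceeds the threshold."""
    return [sid for sid, feats in students.items()
            if risk_score(feats) > threshold]

students = {
    "s1": {"logins_per_week": 1, "partial_grade": 55, "sat_zscore": -1.0},
    "s2": {"logins_per_week": 8, "partial_grade": 88, "sat_zscore": 0.5},
}
print(flag_at_risk(students))  # → ['s1']
```

Flagged students would then be routed to the interventions the article describes, such as instructor outreach or extra online support.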
"We thought it important to have an open-source solution to reduce costs of ownership so that other institutions could adopt it," Baron explains. "The other big advantage of open source is everything becomes transparent and open to other institutions, especially the predictive model, which we've released to other institutions."
The OAAI team applied its model to 2,200 students over a period of two semesters, identifying those who might benefit from interventions, such as messages from instructors that express concern for a student's performance and/or additional online academic support.
In the end, the OAAI team determined that such interventions had a significant positive effect, with students who received them earning final course grades 7% higher.
"We far exceeded our expectations in terms of impact of the project," Baron says. "We weren't necessarily expecting to see statistically significant outcomes, so this has given us great encouragement."
Going forward, Baron says OAAI is seeking additional funding to continue the program, which is focused on low-income students who often drop out of school for economic reasons.
Bringing Social Issues to Light
Socially sensitive issues such as domestic violence, foeticide and child sexual abuse are taboo as topics of discussion in much of India. But these are precisely the issues that Bollywood actor and filmmaker Aamir Khan took on in his TV show Satyamev Jayate, which translates to Truth Alone Prevails. The show's goal was to prompt discussion of these rarely acknowledged societal problems and to learn how Indian people thought and felt about them, as a first step toward resolving them.
To achieve this goal, Persistent Systems Ltd., a 2013 Computerworld Honors Laureate, set about monitoring and analyzing a massive amount of data it collected from social media channels immediately after each 90-minute episode of the program aired.
"The show is a cross between Oprah and 60 Minutes," explains Mukund Deshpande, head of BI/analytics at Persistent Systems. "The goal was to use social media to connect directly with people and close the loop as a way to have a conversation with viewers."
The show was carried on 13 TV channels in India, and each episode was posted to YouTube within 30 minutes of its airing. Each show immediately elicited millions of messages on Facebook, Twitter and other online discussion forums. The challenge, says Deshpande, was to make sense of long, complex messages that were very emotional and often contained stories of people's personal encounters with abuse.
This created a big data problem in terms of both volume and network performance. The show drew a staggering 1.09 billion impressions across social channels. All structured and unstructured data was analyzed in real time, and the results were displayed on a so-called impact dashboard to convey the show's effect on legislation, society and individuals.
Persistent Systems designed and developed the custom end-to-end analytics process in three weeks. The project was implemented using the latest distributed computing technology and Hadoop.
Adding to the unstructured data challenge, social media responses were in "Hinglish" (Hindi words in Roman script embedded in English). This ruled out using existing tools to handle messages, which is why developers created a customized system to understand response sentiment.
Deep analytics extracted valuable insights, Deshpande says. The new system aggregated all of the unstructured data, then automatically filtered it to weed out spam and unrelated messages. Valid messages were tagged and rated: short messages praising the show were rated lower than longer messages and personal stories. Final selection was done manually, using triangulation to determine the top content.
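The triage pipeline described above can be sketched in miniature: drop spam and off-topic messages, then rate the rest so long personal stories outrank short praise. The real system handled Hinglish with custom sentiment tooling; the keywords and scoring below are invented purely for illustration.

```python
# Toy version of the filter-tag-rate pipeline: spam markers, topic words,
# and scoring rules are all hypothetical stand-ins.

SPAM_MARKERS = {"buy now", "click here", "free offer"}
TOPIC_WORDS = {"show", "episode", "abuse", "violence", "story"}

def is_relevant(message):
    """Keep messages that mention the show's topics and aren't spam."""
    text = message.lower()
    if any(marker in text for marker in SPAM_MARKERS):
        return False
    return any(word in text for word in TOPIC_WORDS)

def rate(message):
    """Score a message: length and first-person storytelling raise it."""
    score = min(len(message.split()), 50)          # longer = richer, capped
    if any(p in message.lower() for p in ("my ", "i was", "happened to")):
        score += 25                                # personal stories rank up
    return score

def top_messages(messages, k=1):
    """Return the k highest-rated relevant messages."""
    kept = [m for m in messages if is_relevant(m)]
    return sorted(kept, key=rate, reverse=True)[:k]

msgs = [
    "Great show!",
    "Click here for a free offer on the show",
    "The episode on domestic violence moved me; my aunt lived through "
    "this and I was never able to talk about what happened to her.",
]
print(top_messages(msgs))  # the long personal story ranks first
```

In the production system the final ranking was still checked by hand, matching the manual triangulation step the article describes.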
Deshpande says that social scientists have expressed interest in using a similar process to conduct a new kind of social research. "Usually, they form a small group of people and study them intensely for three to six months," he notes. "But what we have here is exactly the opposite of that. We don't have rich data about a small number of individuals but data about millions of people, including their age and gender and how they feel about particular issues. It would be a new way to do social science research."