Think your enterprise has a big data problem? Consider New York City: With a population of 8.2 million, it is far and away the largest city in the U.S. And it generates data--a lot of data--most of it in highly fragmented silos contained within the many municipal agencies and departments that keep the Big Apple running.
It has business identification numbers (BINs), borough block and lot (BBL) tax numbers, business licenses, parking tickets, health inspections, traffic violations, crimes, ambulance calls, fires and more.
"We're here at the Hilton in New York," says Michael Flowers, analytics director for the New York City Mayor's Office of Policy and Strategic Planning and director of the Financial Crime Task Force of the City of New York, speaking at the O'Reilly Strata Conference + Hadoop World. "It's a latitude and longitude; it's a postal address. It's a borough block and lot tax number. It's a building identification number and a number of other things that for each agency indicates where it needs to go and what it needs to do."
"But in terms of leveraging all that information on behalf of one another," Flowers says, "it becomes extraordinarily difficult from an ontology and taxonomy standpoint. Moreover, all of those pieces of information are stored in different parts of the city, so it's incredibly fragmented. The systems themselves range from the brand new and awesome to play with to things that really aren't that different from Pong--old mainframe systems that are really difficult to play with."
NYC's 311 Receives More than 65,000 Calls a Day
To make matters even more challenging, New York City's 311 non-emergency line receives more than 65,000 calls a day--ranging from complaints about noise and reports of potholes and broken sidewalks to questions about obtaining a copy of a deed or whether it's legal to keep piranhas.
"We gear our allocation of our agency resources based on basically a simple queue," Flowers says. "A call comes in and we respond to that call."
The only problem: Calls to 311 aren't necessarily a good predictor of where those resources really need to go. They are data, but they're not complete data.
So Flowers undertook a skunkworks project for New York City Mayor Michael Bloomberg. He and his team needed to show the New York City governmental community how the city's massive volumes of data could be used to allocate resources more effectively.
"We're trying to make your lives easier in any way we can while allocating those resources as effectively as we can so we don't have to tax you as heavily as we have in the past," Flowers explains. "What we needed to do was come up with a way to demonstrate the utility of a common platform. I needed to go out and demonstrate to the New York City governmental community that it made sense for us to put this information together and use it. That was our job; that's what the skunkworks was really about."
Skunkworks Big Data and Illegal Conversions
The project was ambitious. Flowers wanted to use the data to help identify buildings likely to house illegal conversions--for instance, a building safely zoned to house six human beings that a developer chops up to fit 60 people.
Based on complaints to the 311 line, you would assume the majority of illegal conversions take place in lower Manhattan. If you plot the complaints on a map, lower Manhattan "blows up like a tomato," Flowers says. But actually, inspectors from the Buildings Department are far more likely to find the illegal conversions in the outer boroughs: Brooklyn, Queens and the Bronx.
"We're sending our resources where the complaints are instead of where the conditions are," Flowers says. "And those conditions are very serious. Why are they serious? In the spring and summer of 2011, we had two buildings go up in flames that had been illegally converted. We had some firemen get seriously injured and we had people die."
Indeed, illegally converted buildings see far more fires per property, Flowers says. And, more importantly, firemen are much more likely to be injured or killed fighting such fires because exits tend to get blocked.
"My office was tasked with figuring out how to fix that," Flowers says.
Successful Data Project Starts with Talking to People
It might seem like an unlikely goal for a team like Flowers'. No one on his staff has an advanced degree, and everyone, except Flowers himself, is age 25 or younger. Additionally, because it was a skunkworks project, no one outside the team really understood what they were doing. But they were determined to make a difference. One of the first things Flowers did was to go out into the field and talk to the people on the front lines.
"We got out in the field," he says. "I talked to firemen. I talked to policemen. I talked to inspectors from the Buildings Department, from Housing Preservation & Development. I talked to the Water Department. I asked them: 'When you go to a place that's a dump, what do you see?' Then I replicated that in the data."
Flowers had his team study actual "vacates"--instances where an inspector had found a building so unsafe that it had to be emptied either in whole or in part.
"I didn't need to deconstruct the complaints," Flowers explains. "I deconstructed the problem. And I deconstructed the problem using city data."
Flowers' team looked for several telling metrics, including the following:
-- Is the building in a "high-risk neighborhood," which Flowers calls code for a neighborhood for poorer citizens who are much more likely to live in dangerous conditions?
-- Was the building built before 1938? The building code changed in that year, and buildings built after the change tend to be safer.
-- Is the building under foreclosure or a tax lien? "The reason those two are important is that it just speaks to the owner's financial condition," Flowers says. "I don't think there's anything revelatory in the fact that if a landlord is broke, they're going to treat their building like crap."
-- Have there been complaints? "Complaints do matter," Flowers says. "If there was a prior complaint and then a subsequent complaint six months later, it was much more likely that there was going to be a fire."
With the correct data identified, Flowers' team created a tool that was directly usable by the inspectors closest to the problem. Before inspectors had the tool, they found buildings so unsafe that they had to vacate them 13 percent of the time. Eighteen months after Flowers' project, that figure is 70 percent.
"We won because we had the right data," Flowers says. "The city's data is good and we used it in the right way."
"All we did was prioritize," he adds. "It was immediately actionable intelligence. That's why it works."
With the value of data-driven decision-making proven, Flowers says he has three goals to achieve before Mayor Bloomberg leaves office in January 2014:
-- Establish citywide analytics focused on leveraging agency resources more efficiently and effectively
-- Grow and enable the culture of data-driven resource allocation at the agency level
-- Push dynamic New York City data to the public, tech/entrepreneurial community and academia
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.