Ancestry.com preps for flood of census queries
- 31 March, 2012 03:49
For Ancestry.com, big data is about to get even bigger.
The subscription-based website for finding long-lost relatives already has 6.7 billion historical records and 4.8 billion people named in family trees on its website. But now it's adding the 1940 United States Federal Census, which the federal government will release on Monday.
The National Archives has turned the 1940 census paperwork into more than 3.8 million digital images. The online archive, being released after a 72-year waiting period, will be a gold mine for people just beginning to compile their family history, though it will become easier to use once the images are indexed.
When Ancestry.com's database and index are complete, users will be able to search more than 130 million census records using fields such as name, street address, county and state.
Scott Sorensen, Ancestry.com's senior vice president of engineering and its top IT executive, says that his staff has been busily preparing systems for the expected deluge of search requests.
The company learned its lesson two years ago from a huge spike in website traffic during the TV show "Who Do You Think You Are," in which celebrities such as Sarah Jessica Parker discover clues about their ancestors. At the first commercial break, many inspired viewers apparently dashed to their computers to try their hand at family research.
Ancestry.com had prepared for a 300 percent spike in traffic from TV viewers, but the website was slammed by traffic that was (in some cases) 21 times the usual level, which "brought us to our knees," Sorensen says.
Since then, the company has added servers and beefed up its network and infrastructure to support bigger surges in traffic, he says.
The company has nearly 5,000 servers at its data center and uses a variety of tools to handle its big data work, including the distributed data-processing framework Hadoop; traditional relational database software; statistical software called R; algorithms that employ machine learning, a form of artificial intelligence; and MongoDB, database software that creates linkages among the public family trees posted on the site.
The Provo, Utah-based company had about $400 million in sales last year and has about 1,000 employees, according to Hoovers.com. It currently has 1.7 million subscribers.
The key business goal at Ancestry.com is to broaden its customer base to include people who are curious about their ancestors but aren't experienced researchers. Sorensen's job is to use technology to make the discovery of ancestors as easy as possible, so that first-time searchers don't go away disappointed.
Consequently, his technology group works to improve customer metrics such as "time to first discovery" and (for long-time subscribers) "number of discoveries in a week." The company continues to enhance the "power-user tools" for sophisticated researchers, too, Sorensen says.
Three years ago, most ancestor discoveries were made through the company's custom search engine, but now more discoveries are made through "hinting," whereby Ancestry.com's artificial intelligence technology suggests likely connections or records.
"We take the massive amounts of data we have, and the billions of records that people have attached to the family trees, to do record linking and record matching," Sorensen says. "So you start with 40 million Smith names, and then 4 million John Smiths, but what you want are the four records about your great-great-grandfather John Smith. Our record-linking technology will try to surface those four records and give you a hint. We try to make those discoveries more automatic."
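The narrowing Sorensen describes, from millions of same-name records down to a handful of likely matches, can be illustrated with a toy scoring function. This is only a sketch of the general record-linking idea, not Ancestry.com's actual method: the Record fields, the weights, the two-year birth window and the threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified record; real census records carry many more fields."""
    given: str
    surname: str
    birth_year: Optional[int] = None
    state: Optional[str] = None


def match_score(person: Record, candidate: Record) -> int:
    """Return a score from 0 to 100; higher means a more likely match.

    The weights are illustrative assumptions, not tuned values.
    """
    score = 0
    if person.surname.lower() == candidate.surname.lower():
        score += 40
    if person.given.lower() == candidate.given.lower():
        score += 30
    if person.birth_year is not None and candidate.birth_year is not None:
        # Tolerate small discrepancies, since reported ages are often imprecise.
        if abs(person.birth_year - candidate.birth_year) <= 2:
            score += 20
    if person.state is not None and person.state == candidate.state:
        score += 10
    return score


def hints(person: Record, candidates: List[Record],
          threshold: int = 80, limit: int = 4) -> List[Record]:
    """Keep candidates scoring at or above the threshold, best first."""
    scored = [(match_score(person, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:limit]]


if __name__ == "__main__":
    ancestor = Record("John", "Smith", 1870, "Utah")
    pool = [
        Record("John", "Smith", 1871, "Utah"),
        Record("John", "Smith", 1920, "Ohio"),
        Record("Mary", "Smith", 1870, "Utah"),
    ]
    for hint in hints(ancestor, pool):
        print(hint)
```

A production system would presumably learn its weights from the records users have already attached to their trees rather than hard-coding them, which is where the machine learning mentioned earlier would come in.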
What does the future hold? Sorensen says he envisions a time when the company adds socio-economic data to the classic genealogical data to provide more colorful information and context about ancestors. He offered this example: "I can see [from the 1930 census] that my great-great-grandfather had a radio, and was the only person on the block to have a radio. Then [with socio-economic data] here's the additional color that shows what percentage of people had a radio in that time and place."
Mitch Betts is CIO magazine's executive editor. Follow him on Twitter: @mitchbetts.