Ancestry.com preps for flood of census queries

For Ancestry.com, big data is about to get even bigger.

The subscription-based website for finding long-lost relatives already has 6.7 billion historical records and 4.8 billion people named in family trees on its website. But now it's adding the 1940 United States Federal Census, which the federal government will release on Monday.

The National Archives has turned the 1940 census paperwork into more than 3.8 million digital images. The online archive—being released after a 72-year waiting period—will be a gold mine for people just beginning to compile their family history, though it will become easier to use once the images are indexed.

When Ancestry.com's database and index are complete, users will be able to search more than 130 million census records using fields such as name, street address, county and state.

Scott Sorensen, Ancestry.com's senior vice president of engineering and its top IT executive, says that his staff has been busily preparing systems for the expected deluge of search requests.

The company learned its lesson two years ago from a huge spike in website traffic during the TV show "Who Do You Think You Are," in which celebrities such as Sarah Jessica Parker discover clues about their ancestors. At the first commercial break, many inspired viewers apparently dashed to their computers to try their hand at family research. Ancestry.com had prepared for a 300 percent spike in traffic from TV viewers, but the website was slammed by traffic that was (in some cases) 21 times the usual pattern, which "brought us to our knees," Sorensen says.

Since then, the company has added servers and beefed up its network and infrastructure to support bigger surges in traffic, he says.

The company has nearly 5,000 servers at its data center and uses a variety of tools to handle its big data work, including the distributed data-processing framework Hadoop; traditional relational database software; statistical software called R; algorithms that employ machine learning, a form of artificial intelligence; and MongoDB, database software used to link the public family trees posted on the site.

The Provo, Utah-based company reportedly had about $400 million in sales last year and has about 1,000 employees. It currently has 1.7 million subscribers.

The key business goal at Ancestry.com is to broaden its customer base to include people who are curious about their ancestors but aren't experienced researchers. Sorensen's job is to use technology to make the discovery of ancestors as easy as possible, so that first-time searchers don't go away disappointed.

Consequently, his technology group works to improve customer metrics such as "time to first discovery" and (for long-time subscribers) "number of discoveries in a week." The company continues to enhance the "power-user tools" for sophisticated researchers, too, Sorensen says.
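The article does not say how these metrics are defined or computed. As a minimal sketch, assuming a hypothetical event log of (user, timestamp, kind) tuples and assuming "time to first discovery" means the gap between signup and a user's earliest discovery, the two metrics might be calculated like this:

```python
from datetime import datetime, timedelta


def time_to_first_discovery(events):
    """Return {user_id: timedelta} from signup to first discovery.

    `events` is an iterable of (user_id, timestamp, kind) tuples, where
    kind is "signup" or "discovery". This event shape is an assumption.
    """
    signup, first = {}, {}
    for user, ts, kind in events:
        if kind == "signup":
            signup[user] = min(ts, signup.get(user, ts))
        elif kind == "discovery":
            first[user] = min(ts, first.get(user, ts))
    # Users with no discovery yet are left out rather than given a fake value.
    return {u: first[u] - signup[u] for u in signup if u in first}


def discoveries_in_week(events, user_id, week_start):
    """Count one user's discoveries in the 7 days starting at week_start."""
    week_end = week_start + timedelta(days=7)
    return sum(
        1
        for user, ts, kind in events
        if user == user_id and kind == "discovery" and week_start <= ts < week_end
    )


# Made-up sample events for illustration only.
events = [
    ("ann", datetime(2012, 4, 2, 9, 0), "signup"),
    ("ann", datetime(2012, 4, 2, 9, 12), "discovery"),
    ("ann", datetime(2012, 4, 5, 20, 30), "discovery"),
    ("bob", datetime(2012, 4, 3, 14, 0), "signup"),
]

print(time_to_first_discovery(events))
print(discoveries_in_week(events, "ann", datetime(2012, 4, 2)))
```

Leaving out users who have not yet made a discovery is a design choice in this sketch; a production metric would need its own rule for those users.
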

Three years ago, most ancestor discoveries were made through the company's custom search engine, but now more discoveries are made through "hinting," whereby Ancestry.com's artificial intelligence technology suggests likely connections or records.

"We take the massive amounts of data we have, and the billions of records that people have attached to the family trees, to do record linking and record matching," Sorensen says. "So you start with 40 million Smith names, and then 4 million John Smiths, but what you want are the four records about your great-great-grandfather John Smith. Our record-linking technology will try to surface those four records and give you a hint," he explained. "We try to make those discoveries more automatic."

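The narrowing Sorensen describes, from every Smith down to a handful of likely records, resembles a common record-linkage pattern: block on a cheap key, then score the survivors. The sketch below is not the company's method. The `Record` fields, the weights, the threshold and the sample data are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    given: str
    surname: str
    birth_year: int
    state: str


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(person: Record, candidate: Record) -> float:
    """Weighted agreement score; the weights are illustrative, not tuned."""
    score = 0.4 * name_similarity(person.surname, candidate.surname)
    score += 0.3 * name_similarity(person.given, candidate.given)
    # Treat birth years within two years as agreeing (an assumed tolerance).
    score += 0.2 * (abs(person.birth_year - candidate.birth_year) <= 2)
    score += 0.1 * (person.state == candidate.state)
    return score


def hints(person, records, threshold=0.8, limit=4):
    """Return up to `limit` records that score at or above `threshold`."""
    # Blocking: keep only records with the same surname before scoring.
    block = [r for r in records if r.surname.lower() == person.surname.lower()]
    scored = [(match_score(person, r), r) for r in block]
    ranked = sorted(
        (pair for pair in scored if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [r for _, r in ranked[:limit]]


# Made-up sample data for illustration only.
ancestor = Record("John", "Smith", 1870, "Utah")
candidates = [
    Record("Mary", "Smith", 1870, "Utah"),
    Record("Jon", "Smith", 1870, "Utah"),
    Record("John", "Smith", 1920, "Ohio"),
    Record("John", "Smith", 1871, "Utah"),
    Record("John", "Smyth", 1870, "Utah"),
]

for record in hints(ancestor, candidates):
    print(record)
```

Exact-surname blocking keeps the comparison count small but misses spelling variants such as "Smyth"; a real system would likely block on a phonetic or normalized key instead.
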
What does the future hold? Sorensen says he envisions a time when the company adds socio-economic data to the classic genealogical data to provide more colorful information and context about ancestors. He offered this example: "I can see [from the 1930 census] that my great-great-grandfather had a radio, and was the only person on the block to have a radio. Then [with socio-economic data] here's the additional color that shows what percentage of people had a radio in that time and place."
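Sorensen's radio example comes down to a grouped percentage: the share of households in a given place and year that reported an item. A rough sketch of that calculation follows, assuming a hypothetical household record with `county`, `year` and `items` fields and made-up sample rows, not real census data.

```python
from collections import defaultdict


def ownership_share(households, item):
    """Return {(county, year): percent of households reporting `item`}.

    Each household is a dict with "county", "year" and "items" (a set).
    This record layout is an assumption made for the example.
    """
    totals = defaultdict(int)
    owners = defaultdict(int)
    for household in households:
        key = (household["county"], household["year"])
        totals[key] += 1
        owners[key] += item in household["items"]
    return {key: 100.0 * owners[key] / totals[key] for key in totals}


# Made-up sample rows for illustration only.
households = [
    {"county": "Utah County", "year": 1930, "items": {"radio"}},
    {"county": "Utah County", "year": 1930, "items": set()},
    {"county": "Utah County", "year": 1930, "items": set()},
    {"county": "Utah County", "year": 1930, "items": set()},
]

print(ownership_share(households, "radio"))
```
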

Mitch Betts is CIO magazine's executive editor. Follow him on Twitter: @mitchbetts.
