Coming to Our Census: A Look at the Atlanta Fed's Research Data Center
Tom Heintjes: Hello and welcome back to another Economy Matters podcast. I'm Tom Heintjes, managing editor of the Atlanta Fed's Economy Matters magazine, and today I'm speaking with Julie Hotchkiss, a research economist and senior adviser in the Atlanta Fed's research department. Julie is also the executive director of the Atlanta Research Data Center, or ARDC, and that's what we're going to discuss today. Welcome to the podcast, Julie.
Julie Hotchkiss: Thanks, Tom. I'm glad to be here and that you asked me to come and talk about the Atlanta RDC.
Heintjes: Julie, can you briefly describe what the ARDC is?
Hotchkiss: Oh yes, let's get that out of the way, for sure. Research data centers are secure computing labs where qualified researchers conduct approved statistical analysis using nonpublic data. The operative words here are "secure," "approved," and "nonpublic." But I expect we'll get more into that later.
Heintjes: Oh, we sure will. Julie, there are now 24 RDCs around the country, and the Atlanta Fed houses one of them. How did the RDCs begin? What need did they fill that wasn't filled by existing resources?
Hotchkiss: It's my understanding that the motivation by the Census Bureau to open satellite labs was to harness the brainpower, if you will, around the country to help address statistical issues that they face all the time. These would be questions about survey design, sampling techniques, and other types of statistical questions. I suspect that they reasoned that if researchers use Census data in their own research efforts related to these types of questions, then that just increased how much more the Census could learn.
Heintjes: That does sound like a mutually beneficial collaboration.
Hotchkiss: I think that it really is, and I think that that's what contributed to the success of this RDC network idea. Researchers, under very strict confidentiality constraints, have access to nonpublic microdata in order to expand their own research agendas, while at the same time the Census Bureau learns more about their data products, research designs, and things like that.
Now, before the satellite RDCs were opened, researchers actually had to travel to the Census Bureau in Maryland to access nonpublic microdata in order to fulfill their own research. By the end of 2017, there's going to be an additional six RDC labs, bringing the total to 30 labs. And the new labs that are being added are going to fill out the middle of the country, where there are not many RDC labs at the moment.
Heintjes: Right. Before we go any further, I want to add that we are talking about the Census Bureau, so I want to make clear that you're not speaking on behalf of the Census Bureau today.
Hotchkiss: That's absolutely true. My views are just those of a researcher, myself, who uses the RDC, and as executive director of the Atlanta RDC, where my role is really just to pay the bills and to be a cheerleader [laughter].
Heintjes: Speaking of the census, the U.S. has always had a decennial census, and data from the census go into any number of policy decisions. But why did the idea of microdata develop?
Hotchkiss: I'm not sure of the exact history, but the Census Bureau has been doing surveys in addition to the decennial census that we're all familiar with, for decades. I do know that every five years they do a survey of establishments, just like they do a survey of people every 10 years. The Census Bureau also conducts surveys for other government agencies, because they're really good at it, such as the Bureau of Labor Statistics, Department of Housing and Urban Development, the Bureau of Justice, the National Science Foundation, I could go on and on. So while it may surprise some people, the Census Bureau has been doing surveys focused on a variety of subjects for a very long time.
Heintjes: Right. RDCs began opening in the mid-'90s. Was it a matter of the technology reaching a point where the RDCs were feasible in a way they had not been prior to that?
Hotchkiss: As I mentioned before, before the RDCs were opened, a researcher would have to travel to the Census Bureau itself to do their research using nonpublic data. Again, I don't know exactly how it all transpired, but I'm sure it wasn't just the technical capability of accessing data remotely, because I was accessing data from other sources in the 1980s, but, rather, how to do it securely so that the confidentiality of the data wouldn't be compromised.
Heintjes: We'll touch on that again soon. I know that most RDCs are housed in universities, although a few are in other Reserve Banks and other types of institutions. How did the Atlanta Fed come to house an RDC?
Hotchkiss: By the time the Atlanta RDC was opened, there were only about 13 RDCs around the country. There was a noted absence in the Southeast. Two faculty members at Georgia State [University] had connections at the National Science Foundation and with the Census Bureau, and they were approached about building a consortium in Atlanta. These faculty were aware that there was already an RDC at the Chicago Fed, so they approached me about including the Atlanta Fed in the consortium with the possibility of housing the lab in the Fed building. It seemed to make a lot of sense to house a secure data center in a facility that knows about security, like the Atlanta Fed does. There are two other locations at Federal Reserve Banks—the one in Chicago that I mentioned that's been around much longer, and the Kansas City Fed that just opened this past year. And by the end of 2017, there should be a branch RDC located at the Philadelphia Fed.
Heintjes: Was it difficult to make a business case for the Atlanta Fed to house the RDC?
Hotchkiss: It really wasn't. Fed management was very excited about the possibility of engaging the broader Atlanta research community, and they jumped at the opportunity to participate. I wasn't sure how they were going to respond to the request. But they reasoned that being part of the consortium would provide staff here at the Atlanta Fed with the potential to have access to an important new resource. But in addition, by hosting the RDC, staff would be able to interact with researchers from around the region that would be coming here to use the RDC, and this would provide intellectual stimulus to the work that we're already doing here.
Heintjes: I imagine it was quite the learning experience for everyone concerned.
Hotchkiss: That's a bit of an understatement [laughter]. Other Feds, like Chicago, have a long history of sharing their space with tenants, but Atlanta didn't have that experience. The Atlanta RDC is the first—and still the only—tenant in the Atlanta Fed building, so there were some significant growing pains, but we've managed to work through it.
Heintjes: You've mentioned the consortium. What other institutions make up the ARDC consortium?
Hotchkiss: There are currently eight members in the Atlanta consortium. In addition to the Federal Reserve Bank and Georgia State University, Emory University, the University of Georgia, Georgia Tech, Florida State University, Clemson University, and the University of Tennessee all support the operations and use the Atlanta RDC.
Heintjes: Julie, is one of the roles of an RDC to help answer policy questions without the need for something like additional data collections?
Hotchkiss: It's interesting that you ask that, because the Census Bureau makes a very big point that they don't involve themselves in questions of policy—ever. However, they are very interested in learning more about the data they collect and about the population of people and businesses in the United States. And this is where the mutual benefit of the relationship between the Census Bureau and the research communities is illustrated. One of my favorite examples of the type of research question that can be answered through use of nonpublic data is where a researcher investigated the incentives of implementing performance-based standards in school systems. Using administrative data that are only available through the RDCs, the researcher was able to find that if a school system had performance-based standards in place, this greatly reduced the number of teachers taking second jobs. Presumably, the teachers wanted to focus their energies on doing well as a teacher so they wouldn't perform badly by those performance standards.
Heintjes: So, what's in it for the Census Bureau, if they don't care about the policy questions that come out of an RDC?
Hotchkiss: From each one of these projects, they learn more about the population underlying their surveys. For example, if they see in their surveys a drop in the number of people holding second jobs, they might wonder if that drop was a statistical phenomenon, like their survey is messing up somehow, or a change in behavior. The research I just described provides a nonstatistical explanation for the change in observed responses to the survey.
Heintjes: I see. Julie, you've noted several times in this conversation that access to an RDC is strictly controlled. Even other employees at institutions hosting an RDC, such as myself, can't walk in and begin accessing data. Briefly, how is "restricted-use microdata" different from, say, Census data that is publicly available?
Hotchkiss: That's a very good question. Take reported income, for example. Publicly available surveys, like the Current Population Survey [CPS], which is where we get our monthly estimates of the unemployment rate, by the way, ask respondents about their income. Well, if someone reports an income level in the six or seven digits, this information—along with other information they report, such as their occupation or county that they live in—might make them identifiable to, perhaps, their neighbors, if their neighbors happen to be using the CPS. So what the Census does is either "topcode" or replace the income variable, making the person harder to identify. So users of the public data see the fake income variable, but researchers in the RDC usually get to see the actual income variable on that person.
Heintjes: I see. Why does it matter whether researchers get to see the real income variable or not?
Hotchkiss: Well, for example, we can't really answer questions about things like income distribution or, say, growth in income inequality, unless we see all the incomes. So if you want to know something about that top 1 percent of earners—as a group, not even individually—we need to be able to know exactly who they are and what they look like.
Heintjes: Julie, do you have any other examples of how the microdata available to researchers differs from what Census makes available to the public?
Hotchkiss: There are examples in business data as well. The business data is what comes from the census of establishments; the Census Bureau only publishes aggregate values from these surveys—for example, average employment numbers or revenues of establishments in a certain industry. In the RDC, researchers may have access to individual establishment-level information—information that a business might not want their competitors to know about but is crucial, say, if we want to know something about worker productivity. Both of these examples—the example of income inequality and knowing more about worker productivity—illustrate how useful nonpublic data can be and also how important the safeguards are in making sure these details are never made publicly available.
Heintjes: What keeps researchers from divulging this highly confidential information they have access to?
Hotchkiss: You mean besides the potential $250,000 fine and five-year prison sentence? [laughter]
Heintjes: Yes, setting aside those consequences for the moment.
Hotchkiss: Besides the heavy penalties that someone would be subjected to if they revealed any of the data they have access to, each researcher has to undergo screening to obtain what's called Special Sworn Status. This special term means the Office of Personnel can track down friends, family members, and neighbors to make sure the person being allowed to access these data doesn't have some dark secret that would make them at the least cavalier or even worse in their use of these data.
Heintjes: Julie, to use the full name, the Federal Statistical Research Data Center doesn't only hold Census data—other federal agencies contribute data, including the Department of Transportation, the Agriculture Department, the Social Security Administration, the BLS and Energy Information Administration, just to name some of the better-known ones. To researchers, what advantages are brought by having these disparate agencies house their data in one repository?
Hotchkiss: You've touched on one of the most important aspects of the RDC network, and that is that one of the main advantages of having data from many surveys managed by one agency is that it allows researchers to see individuals and businesses when they show up in multiple sources. So, for example, an individual who is surveyed as part of the Current Population Survey is also included in the administrative records filed by his/her employer every quarter with the Department of Labor. The employer reports the individual's earnings in the administrative data, and the employee reports his or her earnings in the survey. An important question, then, is how accurate survey responses are. By comparing the survey response with the administrative data, we can learn something about how reliable survey responses are.
Heintjes: Can linkages and inferences be made across surveys conducted by different organizations?
Hotchkiss: Yes, absolutely. Establishments and firms are linkable across economic surveys. For example, establishments from the survey of pollution abatement can also be found in the economic census, and we can use data from the economic census to fill in some information that might not have been asked in the specific pollution abatement survey.
Heintjes: Given the wealth of data we have access to—well, you, not me—this must be a wonderful time to be a researcher. How have these resources changed how you go about your work?
Hotchkiss: Well, again—first, to be clear, even I don't have access to data except those included in my very specific, approved research project. But to your point, when we publish articles in top economics journals, we are often required to make data and programs available for replication purposes. Basically, it keeps us honest. However, when one uses nonpublic data, of course, we can't just turn over those data. We do make the programs available, and we're typically granted an exemption from providing data. Somebody just produced a report showing that the percent of articles published in top journals receiving this sort of data exemption increased from 8 percent in 2006 to 46 percent in 2014. What this suggests is that using nonpublic, proprietary data is becoming increasingly important for publishing in top journals. In economics, anyway.
Heintjes: I guess that kind of dynamic is kind of a chicken-and-egg question.
Hotchkiss: It is kind of hard to say which came first: researchers using restricted data or the demand by journals to use new and unique data. The reality is with us in the academic world. Using nonpublic, proprietary, or experimental data is becoming not only more common but, to a certain extent, expected.
Heintjes: Is there anything that's come out of RDC research that the man on the street might be familiar with?
Hotchkiss: There might be a better current example, but one recent project that comes to mind that has the potential of wide-ranging impact, whether the public actually is going to be aware of it or not, is being done by one of our own Fed economists. Whenever someone wants to start a business and expects to hire workers, they have to apply with the IRS for what's called an employer identification number, or EIN. Some researchers, including my colleague, had the idea that these applications for EINs might be useful in predicting business cycles. This is one of the hardest things that Fed economists try to do in setting monetary policy: trying to figure out when we're going into or coming out of a recession. We really have very few of what are called leading economic indicators of these cyclical turns. So as you might expect, as the economy starts to slow, the man and woman entrepreneur on the street knows it before we do, and there are fewer applications for EINs. And vice versa—when the economy starts to pick up again, we should see an increase in applications for EINs.
Heintjes: Wow. An undertaking like that must take some time.
Hotchkiss: The researchers have actually been working on this project for a few years now. They've been cleaning the data, and once they're finished putting it all together, it will be made available to researchers in the RDC. I believe they have plans to make a public version available using these data in the aggregate so that other researchers who don't have access to the RDC—or even you and I—could just go and have a look for ourselves.
Heintjes: So when a researcher is working in the RDC, do you help them execute queries or whatever they do in there?
Hotchkiss: The researcher has to be pretty much self-reliant in doing their research. The computing environment can be different than what we're used to when we are simply running regressions at our desk—and if I haven't mentioned it yet, let me tell you that you have to be physically in the lab in order to do the research. It's sort of back to the future, for those of us who remember having to go to the campus computer lab to run our regressions. But anyway, while they're in the lab, they have an important resource available to draw on, and that's our Census administrator, Melissa Banzhaf. She is a Census employee, and she is part of the process from start to finish—your best friend. She helps researchers as they develop proposals, as they try to track down definitions of variables in the data they are using, and in filing requests to release their results for presentations or journal publications. And I would be remiss not to mention that she likes chocolate chip cookies, so if you want to stay on her good side...I'm just saying.
Heintjes: Well, tell her about this podcast and put in a good word for me! Well, Julie, this has been a fascinating conversation, and I want to thank you for taking time and sharing your insights with us.
Hotchkiss: Thanks for asking, Tom—I really appreciate your asking. I never would have thought your listeners might be interested in something as nerdy as data!
Heintjes: Oh, we dominate the data-nerd demographic. And I also want to encourage listeners to visit the RDC's website, and you'll get a good idea of the facility's capabilities and purposes. Well, we're at the end of another Economy Matters podcast. I'm Tom Heintjes, managing editor of Economy Matters, and thanks for spending time with us. Please come back next month for another episode!