24th Annual Financial Markets Conference - Mapping the Financial Frontier: What Does the Next Decade Hold? - May 19–21, 2019
Research Session 2: Nonrivalry and the Economics of Data
Data is nonrival: a person's location history, medical records, and driving data can be used by any number of firms simultaneously without being depleted. In recent years, the importance of data in the economy has become increasingly apparent. Who should own the data?
Sudheer Chava: I'm Sudheer Chava. I'm a professor of finance at Georgia Tech. I'm very happy to moderate this session on the economics of data. With me I have Chris Tonetti, who is an associate professor of economics at Stanford—and the bio is there in the app. John Abowd is the discussant. He's the chief scientist of our U.S. Census Bureau. The questions you can ask through Pigeonhole, the app—the password is FEDFMC. Please direct all the questions through that.
Before we go to the session, I just want to mention a couple of things about the data—which, again, many of you probably are aware of. Right now, I'm just looking at some stats, and the data never sleeps. Even right now, I think, as we're sitting, as we're texting, as we are sleeping, there are billions of bytes of data being generated. And 2.5 quintillion—I don't even know what it is [laughs], but that's the amount of data being generated per day. More than 90 percent of all the data that's there has been generated in the last two years. As you know, that's going to even get... The velocity is going to become much faster, as many more Internet of Things come on the internet.
There's going to be much more data being generated. Again, some other interesting stats: each minute, Google has 4 million searches, and the Weather Channel receives around 18 million queries, and [for] Netflix—again, this is mind boggling—97,222 hours of video for every minute are being watched. YouTube has around 4.3 million videos—again, a minute—that people are watching, and mostly of cats and dogs [laughter] but, I think, lots of videos being watched.
There's around 30 million texts which are being generated. The bottom line is there's a huge amount of data generated, and I think it's very important to understand—what does it mean? The economics of the data. So I'm very happy to have Chris Tonetti talk about it. Chris?
Chris Tonetti: Thanks very much. As you mentioned, I'm a professor. Much of what I'm going to be talking about draws on an academic paper that I wrote titled "Nonrivalry and the Economics of Data" that's coauthored with my colleague Chad Jones. This is a paper about data, but very little data was actually used in the writing of this paper. It's mostly a theoretical paper. What we were trying to do is try to understand what's special about data—and a hint: it's in the title—it's that data is nonrival. And once you understand what's special about data, it leads to all sorts of interesting questions about the efficient use of data and about laws and markets that might be able to be developed to promote that efficient use. And so what we tried to do here is a very simple theoretical analysis that will point this out.
So just to get us all on the same page, data is just this all-encompassing thing. There's lots of different types of data. You can think about [how] Google has their search history, or Amazon has a history of purchases you've made. I think the Tesla and Waymo case is a pretty interesting one, where Tesla sells cars and attached to those cars are sensors. These sensors are collecting image data and driving behavior, and Tesla finds this to be particularly interesting data because they're trying to use it in order to estimate a self-driving car algorithm. Waymo, Google's self-driving car venture, is offering now cab ride services in Phoenix, driving people around—again, with a lot of sensors attached to these cars collecting data that they hope to use to estimate a self-driving car algorithm.
Another example is medical and genetic data. You go to a hospital and you get a scan, maybe of a tumor—is this cancerous or not? What they're trying to do is have machine-assisted diagnoses that can look at the tumor and, based upon a history of correctly labeled tumors and their status—perhaps achieved via biopsy—they'll say, "Okay, this tumor looks bad. We need to do something" or "No, we have a lot of evidence that this is one of the ones that's not so bad."
So there are lots of examples of data, and it just seems to be a more and more important use in our economy these days. So the traditional way that economists have modeled the use of data is thinking that people make choices in uncertain environments, and what data does is it helps to reduce the uncertainty that people face when they have to make their choices. In some way you can think of that as data informing models, either formal models—econometrics, statistical models—or informally, we have our mental models of how things work, and the more experience we get, that's basically collecting data, collecting information that helps us to then better make decisions.
As I just described, many modern goods and services have at their core algorithms that make choices for us—think about these self-driving cars, or the medical detection algorithms. And so one way of thinking about data, it is as a factor of production, or is an ingredient that can be used in order to improve the quality, or lower the cost of making a product. And once you take that perspective, you can realize that there are many factors of production. Most of economics is thinking about the use and supply and creation of these factors of production—machines, buildings, labor, land, et cetera. Why is data special? Why can't I just apply all of the economics that I was taught in grad school? Why do I have to try to think about the new economics of data?
And the key thing about data is that it's nonrival, and when you have nonrival goods, very different economics apply from rival goods. So what does it mean to be nonrival? When something's nonrival, it's a technological feature—it's the feature of the world—that it's infinitely usable. Another way of saying that is its use does not deplete its availability to be used. So if I drink a cup of coffee, that's a rival good. Nobody else can drink that same cup of coffee. Or if I go to a doctor—while that doctor's treating me, she can't treat another patient at the same time.
But when you think about data, it's technologically feasible for multiple engineers or multiple algorithms to be run or estimated on the same data at the same time. That can be true within firms, and that can be true across firms. And so when you have data being a nonrival good, this harkens back to a lot of the economic growth literature about ideas being nonrival, there tend to be large social gains from the wide use of nonrival goods. That's because it's technologically possible to use this over and over and over again.
Something a little bit different about data is that when you widely use certain goods, or sell a machine from one firm to another, you don't always have the privacy concerns that come along with it. So data is both nonrival, but also has a connection to the world of privacy. Now, firms might view data as a comparative advantage, something that gives them a leg up on their competition, so they might not want their competitors to have access to the data that they've invested in and they've collected. And so what we're going to do in this paper is try to think about a social planner who's taking the best interests of society in mind—what they might want to do, how data should be used and shared across firms. And then we're going to think about decentralized equilibria, or basically just marketplaces that will determine who gets to use data, when. And we're going to show that the social planner and consumers really care about the broad use of data, but firms care a lot about that last part—their comparative advantage—and that might lead to an inefficiency between what might happen in different property rights regimes.
These laws are being written right now on who gets to use data, when. Europe's got its GDPR [General Data Protection Regulation], which says that the protection of natural persons, in relation to the processing of personal data, is a fundamental right—but they do say that the right must be considered in relation to its function in society. And this is going to be the theme of what I'm trying to get across, is that we do need to protect privacy—but that's not the only thing we should be thinking about when we're trying to write legislation on data. There's a social gain from having that data widely shared, even if that comes with some privacy cost.
The California Consumer Privacy Act of 2018, which I think will become law in 2020—or be enforced—allows consumers to opt out of having their data sold. And I'm going to show you, in the context of the theory that I've developed, why this might not be the most efficient way to try to protect privacy.
The key point of the paper is that allocations with different degrees of data use will have different amounts of output, different amounts of societal welfare, and there are different ways to achieve different amounts of use of data. We're going to consider different property right regimes—will firms own the data, versus consumers own the data?—and build a simple mathematical model where there's a market for buying and selling data and see when firms own the data, how do they interact in that market? Compared to when consumers own the data: how do they interact in that market? How does that lead to the broad or narrow use of data, and what are the big macroeconomic implications of that in terms of output, welfare, consumption? Things along those lines.
This is reminiscent of property rights issues, where we had to invent patents in order to help protect the ideas. Data is similarly nonrival to ideas, and so we need to come up with some legal system to try to treat this. A key property that comes out of nonrivalry is increasing returns to scale. I won't go over the whole argument but the basic idea is when you have a factory—say you want to double the amount of cars you're producing, what's a very simple way to double the amount of cars you're producing? You just literally replicate the entire factory you're using to produce cars: double the machines, double the workers, double the buildings, double the land. If you've literally replicated it, it should be capable of producing just what your original one did and so, through that replication argument, you've doubled all these rival goods, these inputs into creating things, and you've doubled the output.
Now, in doing that I didn't have to reinvent the recipe for creating a car. I just used the same blueprint for creating a car in the second factory that I used in the first factory. And what this means is that when you increase ideas or data—this nonrival factor—together with increasing the rival factors, you get more than double the output. This is the core economics that leads to economic growth over the past 200, 250 years. And we're going to show you that this type of economics, with returns to scale, comes into play when you think about data as well.
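The replication argument can be checked with a few lines of arithmetic. The sketch below is only an illustration: the Cobb-Douglas form, the function name `output`, and every parameter value are assumptions chosen for the example, not the production function from the paper.

```python
def output(data, capital, labor, eta=0.1, alpha=0.3):
    """Toy production function (assumed form, not the paper's model).

    capital and labor are rival inputs with constant returns on their own;
    data is the nonrival input, entering with exponent eta.
    """
    return (data ** eta) * (capital ** alpha) * (labor ** (1 - alpha))


base = output(data=100.0, capital=50.0, labor=200.0)

# Replicate the factory: double only the rival inputs and reuse the same data.
replicated = output(data=100.0, capital=100.0, labor=400.0)

# Double the nonrival input as well.
everything_doubled = output(data=200.0, capital=100.0, labor=400.0)

ratio_rival_only = replicated / base          # 2, up to floating-point rounding
ratio_all_inputs = everything_doubled / base  # 2 ** (1 + eta), which exceeds 2

print(f"double rival inputs only: x{ratio_rival_only:.3f}")
print(f"double all inputs:        x{ratio_all_inputs:.3f}")
```

Doubling the rival inputs alone already doubles output because the same data is reused in the second factory, so doubling the data as well pushes output past double. That is the increasing-returns property described above, under these assumed functional forms.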
The simple message of the paper is going to be that if firms own the data, they may choose to keep it within the boundaries of the firm, and therefore a firm is going to learn from its own consumers. But when a firm shares data, all firms might learn from all consumers, and so that's going to be a big social gain, potentially.
Now, a firm may not want to do this if they have the choice not to, and that's totally reasonable and understandable from the perspective of the firm. But from the perspective of a society, we get to write the laws that we want to live under, and we might want to write laws that empower consumers to make choices over their data, and not leave it up to the firms to do what is in their best interest in profit maximizing.
The last thing I'll show you before I talk about the trade-offs in the model that I'm trying to illustrate is a lot of classic economics relies on Adam Smith's "invisible hand." We give names to it in academic economics—the first and second welfare theorems—and they basically say that markets get things right, pretty much. When you have prices set in a competitive market, you get efficient allocations. You can't make anybody better off without making somebody worse off.
It turns out when you have nonrival goods that Adam Smith's "invisible hand" doesn't work so well anymore. You don't necessarily get to live in the best of all possible worlds. I'm going to elaborate a little bit on that here, but it fundamentally breaks some of the most core economics that have been around for a long time. Indeed, Paul Romer, who won the Nobel Prize very recently, did so very much for pointing exactly this out. So we'll start to think about some of these interesting questions.
I know this isn't an academic conference, so I'm not actually going to go through this slide, but I just felt I'd be remiss in my duties if you thought I was just talking what came off the top of my head. There is a formal model, and if you're interested, there's a paper posted online. Just to give you the flavor of it, what we do as academics is we write down utility functions, or preferences, that describe what people want. Here, there's a parameter called "kappa" that describes how much people are concerned with having their data shared, how much concern they have for privacy. And then we write down production functions, and that says, what do you do by combining ingredients to get some output?
In the model we wrote down, data is helpful for firms to produce more or higher-quality goods. There's a parameter called "eta" that controls how much higher quality of a good a firm can produce if they get access to some more data.
Then there's this creative destruction term on the bottom that also has a parameter, called "delta," and that's going to try to capture how much are firms concerned, if their competitors have access to their data, that they might go out of business, or their competitors might have an advantage over them. What I'm going to do right now is just walk you through in words what we do formally and mathematically in the paper, which is talk about, what are the incentives in this environment? Who wants to share data? Why don't you want to share all data? And how different legal structures might lead to different amounts of data sharing.
Let's imagine this all-powerful, benevolent social planner who makes choices about the use of data. Imagine the planner in this simple example is going to choose which hospitals get to see which medical scans and biopsy results. This is going to be very helpful for doctors when they see a medical scan to know if they need to take further action.
So why might the social planner want each hospital to use data collected from patients at other hospitals? It's almost self-explanatory. As I said, you want doctors to be making informed decisions. If they've only seen one medical scan before—or two, or three, or five, whatever comes through their hospital—they're not going to have a well of data to draw upon so that when they see a new image, they can say with any amount of accuracy, "We need to take this course of action." It makes a ton of sense that you want to share data so that doctors can be making informed decisions, and you'll get better treatment, better quality care.
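The accuracy point can be illustrated with a small simulation. This is a hedged sketch and not anything from the paper: the malignancy rate, the number of hospitals, the scans per hospital, and the helper `estimate_error` are all hypothetical, and "diagnosis" is reduced to estimating a single rate from labeled scans.

```python
import random
import statistics


def estimate_error(n_scans, true_rate=0.2, trials=500, seed=0):
    """Average absolute error when estimating a malignancy rate from
    n_scans labeled scans. All numbers here are made up for illustration."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        positives = sum(rng.random() < true_rate for _ in range(n_scans))
        errors.append(abs(positives / n_scans - true_rate))
    return statistics.mean(errors)


scans_per_hospital = 25   # hypothetical
hospitals = 40            # hypothetical

# A hospital that sees only its own scans versus one that sees every scan.
siloed_error = estimate_error(scans_per_hospital)
pooled_error = estimate_error(scans_per_hospital * hospitals)

print(f"one hospital's own scans: {siloed_error:.3f}")
print(f"all hospitals pooled:     {pooled_error:.3f}")
```

Because the scans are data, every hospital can work from the pooled set at the same time, so each one gets the smaller error without taking anything away from the others.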
Why might the planner not want to make all medical data available to all hospitals? This is where there's some possibility for having some privacy concerns, and in the U.S., we have very strict laws over the privacy associated with medical data. And so there might be some trade-off—maybe no trade-off at all, but possibly some trade-off—where the planner might hold back. And so there is just this trade-off that we can formalize in the model between privacy and improved quality of medical services, and it's technologically feasible for every doctor to have access to every single medical scan and biopsy result, because it's data and because it's nonrival.
I couldn't say, "Why don't we give every doctor access to every MRI scanning machine that every hospital has?"—because that's just physically impossible. If one doctor is using it, another doctor can't. But the data could be shared, and so that's a reason why you might want to.
So now let's go to another world. Let's go to the world where firms own data, and they get to make the choice over who gets to use what data. Let's go and think about a different example. Let's think about the example of some companies trying to develop some self-driving algorithms—so, Tesla and Waymo, for example. Why might Tesla... And imagine there's a market for data now, so you can buy and sell it, and the social planner isn't just telling people what they have to do.
So Tesla's got some data. Why might they want to buy some data produced by Waymo, for example? Well, Waymo's driven more car miles than Tesla has. They're trying to estimate a self-driving car algorithm, which is basically just a machine-learning algorithm or some fancy regressions. These are very data-hungry algorithms. So Tesla is desperate to have more data so that it can have a higher-quality self-driving car, or assisted driving platform. This is something that consumers value, so Tesla would want to have more data to be able to increase the quality of its product.
Now, in the world where firms own data, say I bought a Tesla and I'm driving it around. I'm producing useful data as I'm consuming my Tesla, but Tesla owns it, so they get to choose what they want to do. So why might they sell data that's produced by people driving Teslas to their competitors? Well, why do firms do anything? They want money. And if you sell something at a positive price, you gain some revenue. So they might be willing to monetize this asset that they have.
They might not want to sell it to everyone. If Waymo is a close competitor, maybe they don't want to do that, but they might be willing to sell it to some degree. And note: I think it's important that Tesla will still have access to the data even if they sell it, which is quite a different thing when you're in a world of rival goods. But because it's nonrival, just because they sold it and somebody else has access to it, it doesn't mean they have any less access to it themselves.
And so you get to thinking that they might not want to sell their data to their competitors. They might want to hoard it, but that has some social cost. And so what's the social cost? It's exactly what I was describing when I was thinking about the medical context. It's that we're going to have each product—the Waymo cab rides, the Tesla car—more slowly to develop, or having lower quality self-driving car algorithms than we otherwise feasibly could, right now, if these companies were sharing their data. And I understand why they don't want to, or why they might not want to, but it's technologically feasible because data's nonrival.
And a simple example is, imagine if every car manufacturer in the U.S. could produce with every single factory's workers, robots, and machines at the same time. You would be able to produce a lot more cars than we otherwise could, but that's a nonsense statement because that's just physically impossible—but it's possible with data, because it's nonrival.
Now let's go to the third environment that we study in the model. We're going to imagine a world in which consumers own data as it's being created. So I own a Tesla. I'm driving it around. There's some interesting sensor data that's being created as I drive. What if I owned the data? Why might I want to sell that data back to Tesla? Same as before: financial incentives. If I can get a price for doing something I was going to do anyway—drive my car around—sure, I'd be willing to sell.
Now, why might I be willing to sell my Tesla data to Waymo? I don't care who's paying me. I want the money. If it's a valuable asset that I have, that I've created just by doing my normal everyday consumption, then I'm going to be willing to sell it to people who want to pay me for it.
Now I might not want to sell all my data to all firms. Maybe my location data is important to me, and so I want to maintain that privacy. So I might not sell it to everyone. But there's some asymmetry between how consumers think about selling data broadly to many firms, and how firms think about it. Consumers, they care about the quality or the price of the product. They don't necessarily care about who owns the profits created with the sale of the product. But the firm owners, they very much do care about who owns the profits from the sale. They do not want to be replaced by a competitor. And this leads to a difference in behavior on how consumers might want to sell their data across firms, compared to how firms might want to sell their data across firms.
So to just summarize the key forces in the model: firms might have an incentive to... If Tesla owns the data, they're going to use every drop of data that they collected that they find helpful to use to estimate their own self-driving car algorithm. And maybe that's using even more data than the consumer who drives the Tesla would want Tesla to use. But they might also restrict data sharing across competitors, because they don't want their competitors to replace them. And so they might want to hoard the data within the boundaries of the firm. Whereas consumers, if they own data, might be able to simultaneously respect their own privacy concerns, but sell data broadly, because they're going to weigh these two things out.
And here, a market and the price will reflect the value of that data, and the willingness of consumers to sell their data. And if you go back to a law... I don't want to say—I'm not a lawyer—that this is literally what the California Privacy Act was proposing, but imagine you have a law that just says, "We want to protect consumer privacy. We want to outlaw the sharing of data, or the selling of data, from one firm to another." That's going to be great at maximizing privacy gains, but you're going to have a missing scale effect. We're going to have worse quality products, because you're not being able to use this data that's technologically able to be used across all these firms.
So that's the basic message that I wanted to get across, and so let me just finish with a couple of concerns: How quantitatively important is this? How could we actually achieve this? And then some bigger picture questions before I wrap up.
So we've just designed a very simple model in order to illustrate these very basic forces. Maybe the joke is that economists spend a lot of time and a lot of energy formalizing common sense—and if that's the position, in some sense that's a success for me because I've just described to you something that you already agree with.
When you have a model, you can start to quantify things, but it's very difficult to quantify how important these things are. So how large are the privacy costs—and are they utility costs, per se? Do people just feel gross if others have their data, or is there some reason that people don't want firms to have their data? Maybe they're worried that if their health data is out there, insurance companies might find a way to not offer them insurance. Or if their purchase history is out there, firms might learn that they're not very price-sensitive, and so charge higher prices to them.
So I think we have to understand, where are the privacy cost concerns coming from, and how large are they? What are the returns inside the firm to having more data? Are we close to being in a world where we're saturated in data? I don't think so, but that's... Quantitatively, "how important is more data?" is a very, very interesting question.
And how substitutable are different types and sources of data? Another big open question is, are firms really concerned, if their data got out there to their competitors, that they would go out of business or lose some competitive advantage? To the degree that answer is "yes"—which I haven't seen a lot of formal research on, but I do think is the case—then there is the likelihood to be a big wedge between what's best for society and what might be achieved if you just live in a world where firms have ownership and control over the data as it's being created when consumers interact with their products.
And it's also an interesting thing to think about how the incentive to collect and to create data changes under different property rights regimes. If the firms don't own the data, are they still going to collect it? Are they going to process it? There are issues related to that. There are also difficulties in understanding how you want to implement consumers owning the data. There are technological concerns—maybe some of the blockchain stuff that was talked about this morning might be relevant. There's legal frameworks—how do you verify who's using the data? If I sell the data to somebody and they have access to it, well then, technically, maybe they can just go and sell it to another person, and I've just created a competitor for myself and not the market for my own data.
And so, are there licensing frameworks that we can enforce? How do you design the market? If I am the consumer and I own the data, who do I go sell it to—an intermediary? Do we form coalitions to sell? These are really important open questions. But the main takeaways are that there may be benefits to broadly using data across firms, that broad use is technologically possible because data's nonrival, and that markets might not deliver the optimal use of data without the right laws and institutions. This is especially important when you're thinking about nonrival goods, of which data and ideas are the two key examples.
I see this as a counterpoint to the position that protecting privacy should be the single mandate for policymakers when thinking about regulating data. There are trade-offs. Perhaps the maximum privacy protection would come at too large of a cost in terms of the restrictions that I've outlined.
And so let me finish by thinking about two big-picture issues that I think are interesting in the world of data, that come out of the issues of nonrivalry that I just outlined. One is a field of economics called "industrial organization." You can think that firms that use data might grow fast compared to those that don't, but I think a really interesting topic is that data sharing within the boundary of the firm is a force towards mergers or coalitions. If we outlaw... Or if, for market practices, we don't want to share data or sell data across firms, but it's very useful to have multiple divisions within a firm making use of this data, which is technologically feasible, that might be an incentive for firms to get larger, or for different firms that have complementary uses of data to start to form coalitions.
Now, how should anti-trust treat that? That's above my paygrade, but it's an interesting question. When firms realize that by selling to consumers, they're creating data that's valuable, that they may own, they might actually charge a lower price to consumers for their goods because they know that the consumers will then buy more and create more data. So anti-trust isn't so easy as just looking and seeing—are prices high because of monopolies? You can think about, there are examples of targeted mandatory data sharing—after an airplane crash, we share the safety data across firms—and so it doesn't have to be a "one size fits all." There might be context-specific cases where we want to think about that.
Now let me finish on this slide. I've done some research on technology diffusion and innovation, and the key thing we realize is that ideas diffuse across space, across people, across countries. That's a reason why there's a lot more people producing ideas in the U.S. and in China than in Hong Kong, just because of scale. There's just more people there. But you don't see such a proportional difference in GDP per person, or the living standards. That's because we're not just producing in our locale with ideas that were generated in our locale.
Now what about data? It's not so obvious. Ideas live in people. We talk. We don't have slavery, so people can move around. It might be technologically more feasible to encrypt data and prevent it from diffusing. And if that's true, you might have scale effects related to country size that we haven't seen before because of the diffusion of knowledge and ideas. And so larger countries may have an important advantage as data grows in importance, just because they have more people that generate more data that can be used by all the firms.
I think that's kind of holding institutions common across countries, thinking about different sizes. But what if we now think about holding the size of the country constant, but different institutions across countries? What if—just to take an example; I don't have any evidence this is happening—what if China mandates data sharing across state-owned enterprises, and the U.S. has no such policy to mandate data sharing across firms, or even goes so far as to outlaw the selling of data across firms? You could be missing the scale effect that might be very important towards progress and growth. Similarly, if consumers in different locales have different privacy concerns, they might be willing to sell more or share more of their data. That might lead to different rates of innovation and improvement in product quality.
So I think once you start thinking about data, there are lots of contexts in which it becomes particularly interesting. So I'll just finish on that: nonrival data leads to potential large social gains from the broad use of data. When firms own the data, it's possible that they might—both at the same time—privately use more data within the boundary of their own firm than you would want them to, but also sell less data to other firms than you would want them to—or than you would choose to if you had access to it. And so when you're thinking about laws and legal frameworks, laws that outlaw sharing could be very harmful. But laws that empower consumers to own the data and choose how much they want to sell to different firms could lead to large social gains.
Chava: John Abowd is the discussant today [applause].
John Abowd: Good afternoon. Thank you very much for the invitation to speak today. While I'm speaking in my official capacity, the views that I'm expressing are my own and not those of the Census Bureau.
So I'm just going to go over a few points from the paper, to make sure that you captured the main takeaways. And then I'm going to dive in, to exercise my own comparative advantage, and discuss what it means to treat privacy loss as a non-binary decision, which is something we've thought a lot about at the Census Bureau lately. I will take you through a couple of examples—as it turns out, I analyzed more real data for these slides—but not directly, because you asked me to discuss this paper.
All right. So, first of all, you should take away the key implications, and here I'm going to really need to read my slide. So if nonrival data is an input, then the data should be shared until the marginal social benefit—that's the extra variety in consumer goods that you get—equals the marginal social cost—which is the extra privacy loss to the consumers—and the social planner will achieve that in the context of the model, and that's the benchmark here that the other statements, the welfare statements that Chris discussed, came from.
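The "share until marginal benefit equals marginal cost" condition can be made concrete with a toy calculation. Everything in this sketch is an assumption made for illustration: the square-root benefit, the quadratic privacy cost, the `gain` parameter, and the grid search are not taken from the paper. Only the name `kappa` echoes the privacy-concern parameter mentioned in the presentation.

```python
def social_benefit(share, gain=1.0):
    """Concave benefit from sharing a fraction `share` of the data (assumed form)."""
    return gain * share ** 0.5


def privacy_cost(share, kappa=2.0):
    """Convex privacy cost; kappa weights the privacy concern (assumed form)."""
    return 0.5 * kappa * share ** 2


def planner_choice(gain=1.0, kappa=2.0, grid_points=10001):
    """Grid search for the share in [0, 1] that maximizes benefit minus cost."""
    grid = [i / (grid_points - 1) for i in range(grid_points)]
    return max(grid, key=lambda s: social_benefit(s, gain) - privacy_cost(s, kappa))


best = planner_choice()
marginal_benefit = 0.5 * 1.0 / best ** 0.5   # derivative of the assumed benefit
marginal_cost = 2.0 * best                   # derivative of the assumed cost

print(f"chosen share:     {best:.3f}")
print(f"marginal benefit: {marginal_benefit:.3f}")
print(f"marginal cost:    {marginal_cost:.3f}")
```

At the chosen share the two marginal terms roughly coincide. With a small enough privacy weight, for example `planner_choice(kappa=0.1)`, the search hits the corner at 1.0 and all data is shared.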
So if you assign the property rights to the consumer, then you come closer to this optimal social planner outcome than with any of the other property right assignments that he considered, because the consumers will properly internalize the privacy loss and they'll allow for almost optimal data sharing. Because there are some monopolistic components in the pricing associated with that data sharing, the consumer property rights model doesn't get all the way to the social planner's optimum.
If you assign the property rights to the firm, they will be suboptimal compared to the consumer property rights, and compared to the social planner, primarily because of the creative destruction factor that's in their model. Creative destruction works this way in their model: some percentage of the data goes to potential startups, and eventually one of those potential startups is your Microsoft to my IBM—I'm showing my age, right? So to prevent that, the firms sell less data than the consumers would sell to their rivals.
But finally, outlawing data sharing is a disaster in this context because you don't get the gains from nonrivalry in the use of data as an input. So those are the key takeaways. I think I got them all, and you should remember that.
So my main critique here—it's not really a critique, it's a discussion point—is that in this model, once the consumer surrenders her bits, they're gone forever. The privacy loss is complete. So privacy loss is a binary variable in this model: either you hold on to your bits, or you surrender your bits. But that needn't be the case. An enormous amount of the work in computer science, since the database reconstruction theorem was published in 2003, has been on privacy-preserving methods for releasing data. And in these contexts, it's not a binary choice. So you can have full privacy on the input bits—these are the things that enter Chris's model; they're the raw bits—and when you have full privacy on the input bits, it's secure storage via encryption, so nobody can see them, except someone to whom you provide the private key.
Or you can have full privacy on the output bits, which isn't as obvious in the context of the models that Chris was discussing. In his production function, the thing that the data are doing is feeding statistical models—machine learning is a statistical model. They're feeding statistical models, the output of which is what generates the product that the consumers are buying. And so if you have full privacy on that output message, then you've effectively prevented the idea from being disseminated. So if you relax these binary privacy loss assumptions, then you can get a model where the privacy loss is controlled to produce the accuracy in the modeling that fits the use case.
So the consumer's choice needn't be binary between releasing the bits or retaining the bits. The firm's choice might not be binary between selling the bits or not selling them. It could sell the bits in a form that only allows model estimation and doesn't allow direct identification of the underlying suppliers. If you're familiar with the way the Census Bureau publishes data, you might understand why we've become expert in these kinds of modeling decisions.
So this is what it looks like. On the X-axis is privacy loss—in these formal privacy models that's parameterized by a single parameter, but you don't need to think of it so technically here—and on the Y-axis is accuracy, and the accuracy measure is always relative to publishing the data in the clear, meaning the best you could do if I didn't make any privacy protection in the data publication. So the production possibility frontiers have the shape that I've drawn there. Chris has taken the position that many data users that I have encountered in the real world take: when we're talking about publishing data, we want to go to the full accuracy point because then the model estimation is much easier—I don't have to account for the error that went in due to privacy protection—but it's not the only point on that trade-off.
And similarly, encrypting the message isn't very useful either because then there's no accuracy in the model, and you don't get the benefits to the consumer. So that is a trade-off dimension that's not reflected in the model that we're talking about today. And both dimensions of this trade-off—the accuracy of the modeling, and the privacy loss when done with formal privacy—are also nonrival goods, and so they fit nicely into the framework that Chris's model has.
So what if your untrusted data recipients are the firms? So they're assumed to receive the data with full precision, but in their internal uses—for some reason, I thought his other input was L (labor) and not X—they don't actually require full precision. You can fit maximum likelihood, or other forms of artificial intelligence models, neural nets—you can fit them without full precision in the data. Google has an enormous team of computer scientists—they outnumber the people in this room—working on that every day.
In this case, the market could be structured so the competition would be over the precision of the harvested data—meaning, how much privacy protection, how much of a noise barrier do you put there before you release the data? But it still might fail because of the nonrival properties associated with the reuse of the model, the idea component of the model. On the other hand, once the data are harvested, they can be shared the same way they're shared in the current model without any additional privacy loss.
So there are some technologies that are currently in use—if you use the Chrome browser, you've been participating in a privacy-enhancing data analysis called "RAPPOR"—most people know the type of head algorithm in there. Apple is using such a thing, and Microsoft Windows 10 also has formally private, local differential privacy technology.
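The idea behind local differential privacy can be illustrated with classic randomized response, where each user perturbs a single bit before it leaves the device. The sketch below is not the algorithm used in Chrome, iOS, or Windows 10, which are considerably more elaborate. The function names and the choice of a single yes/no attribute are assumptions made here for illustration.

```python
import math
import random


def truth_probability(epsilon: float) -> float:
    """Probability of reporting the true bit; the ratio p / (1 - p) equals e^epsilon."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(true_bit: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true bit with probability p, otherwise report its opposite."""
    return true_bit if rng.random() < truth_probability(epsilon) else not true_bit


def estimate_proportion(reports: list, epsilon: float) -> float:
    """Unbiased estimate of the true share of 1-bits from the noisy reports."""
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    # E[observed] = p * pi + (1 - p) * (1 - pi); solve for pi.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    true_bits = [i < 30_000 for i in range(100_000)]  # true share is 0.3
    reports = [randomized_response(b, 1.0, rng) for b in true_bits]
    print(round(estimate_proportion(reports, 1.0), 3))
```

No individual report reveals the user's true bit with certainty, yet the aggregate share can still be recovered. The price is extra variance, which shrinks only as the number of reports grows.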
On the other hand, the way to get the full efficiency from the data, without sacrificing more privacy loss than you need to, is the trusted data recipient, or intermediary, model, which would require some market institutions that are not currently in place, or a trusted custodian like a statistical agency. So the trusted custodian receives the data in full precision. The data owner owns the private encryption keys, so gives permission to the custodian to use the data, but only if the data are published with a privacy-enhancing technology. So the models that come out—the published data products—have been protected with a value of the privacy loss that's been either market intermediated or intermediated by fiat.
These markets might work in acquiring or harvesting data from consumers. They'd almost certainly fail on the product side, because the information product is also nonrival and functions more like an idea than an input.
I won't take all the time for the examples I prepared, but these are real technologies and they can be used on real data. Two ways of doing it are the local model, which means you apply the privacy protections when the data come in the door. After that, anybody can use them without any further privacy loss. If you do this on publications, say from a census: that means you take the lowest level of geography—the block that you live in—and you apply the privacy protection to those data, and then you build all the aggregate tables from the protected block-level table.
In the central model, you consider all the tables you want to publish—block, block group, tract, county, state, nation—and you distribute the privacy loss across all of those tables and estimate them in aggregate, optimally. This gives you better accuracy for every privacy loss, so the central model dominates the local model in the technology, and that's what the next slide shows.
On the left is the application of the central model to the 1940 census data—the 1940 census data are public. So the blue line is the technology when applied to the equivalent of a block group in the 1940 census. And then the orange, green, and yellow lines are the county, state, and nation. So what you can see is that if you allocate the privacy loss budgets—so the same amount of privacy loss on both sides of this graph—if you allocate it optimally to the different geographical units, then you can get data that are very accurate, as long as the population in those cells increases as the geographic area gets more coarse.
On the right is what happens if you use the first algorithm. If you just apply the privacy loss straight out to the smallest geographic area, and then aggregate everything from that, you never get an improvement in accuracy. And in fact, this is what the Google engineers noticed, that basically doing the privacy filtering on the way in the door meant it was extremely difficult for them to estimate certain models as accurately as they wanted them. So they adopted a replacement technology for RAPPOR that has a hybrid of the central and local models.
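The contrast between the two approaches can be sketched with the Laplace mechanism on simple counts. This is a simplified illustration, not the Census Bureau's algorithm: it omits the joint, optimal estimation across tables described above. It assumes each count has sensitivity 1, that blocks are disjoint, and that budgets spent on different geographic levels add up. The function names and the two-level geography are assumptions made here.

```python
def laplace_variance(epsilon: float, sensitivity: float = 1.0) -> float:
    """Variance of Laplace noise with scale sensitivity / epsilon."""
    return 2.0 * (sensitivity / epsilon) ** 2


def bottom_up_variances(total_epsilon: float, blocks_per_county: int) -> tuple:
    """Spend the whole budget on block counts, then sum noisy blocks to get the county."""
    block_var = laplace_variance(total_epsilon)
    county_var = blocks_per_county * block_var  # independent noise terms add up
    return block_var, county_var


def split_budget_variances(total_epsilon: float, county_share: float) -> tuple:
    """Spend part of the budget on a direct county count and the rest on block counts."""
    eps_county = total_epsilon * county_share
    eps_block = total_epsilon - eps_county
    return laplace_variance(eps_block), laplace_variance(eps_county)


if __name__ == "__main__":
    print(bottom_up_variances(1.0, 100))
    print(split_budget_variances(1.0, 0.5))
```

With many blocks per county, the bottom-up county total accumulates noise from every block, while a directly measured county count does not. In this naive split the block counts become noisier. Estimating all levels together, as described in the talk, is what is meant to recover accuracy across levels.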
We used this mechanism to set the privacy loss parameter for the test data from the 2018 end-to-end census test, applying a social choice model that Ian Schmutte and I developed. Basically, at the block level, we set the privacy loss parameter at 0.25, which meant that the block-level data would have an average accuracy of 64 percent, but with the same parameter, the tract-level data—a tract has about 4,000 people in it; a block has about 30 people in it—is 98 percent accurate on average. The use case for these data is redrawing every legislative district in the country and enforcing Section 2 of the Voting Rights Act. Almost no legislative district is as small as a block, 30 people—most are substantially bigger than a tract, 4,000 people. On that use case, a small privacy loss of 0.25 is more than adequate to accomplish the legislative purpose.
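A back-of-the-envelope calculation shows why the same privacy loss parameter hurts a small cell far more than a large one. The sketch below assumes a single total count protected by Laplace noise with sensitivity 1. That is not the accuracy measure used in the end-to-end test, so it is not expected to reproduce the 64 percent and 98 percent figures. The function names are hypothetical.

```python
def expected_abs_noise(epsilon: float, sensitivity: float = 1.0) -> float:
    """Expected absolute value of Laplace noise, which equals its scale."""
    return sensitivity / epsilon


def expected_relative_error(epsilon: float, population: int) -> float:
    """Expected absolute noise as a fraction of the true population count."""
    return expected_abs_noise(epsilon) / population


if __name__ == "__main__":
    for label, population in [("block", 30), ("tract", 4000)]:
        error = expected_relative_error(0.25, population)
        print(f"{label}: expected relative error {error:.4f}")
```

At a parameter of 0.25 the expected noise is about four persons either way. That is material for a 30-person block and negligible for a 4,000-person tract, which is the intuition behind using the block as a building block rather than as a unit of analysis.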
When there's nonrivalry with non-binary privacy loss, when you can treat the data as an input, those would be supported by local implementation models like RAPPOR or iOS or Windows 10. Or you can treat the output information good, as it's distinct from the consumer good, as the product. And that's supported by statistical agency implementations—that's a little self-serving—but it's also supported by more recent algorithms that are in use in tech companies like Prochlo, or Google's privacy amplification machine-learning algorithms.
It's also the case that you can apply markets directly to the sale of the privacy protection. So there's a Vickrey-Clarke-Groves auction mechanism that will allocate privacy loss in an opt-in system, where the consumers only have to opt in if they're paid for their privacy loss. And there are other mechanisms that are also cited in Chris's paper. This is an area of research that has already exploded in computer science. It may explode in the social sciences as well because as the technology goes across the barrier from computer science into economics and sociology and demography, we have a strong incentive to understand how these data protection methods are going to affect our analyses—and so do all the firms that are in the data buying and selling business.
Thank you very much [applause].
Chava: We have time for a few questions. I guess one of the first ones is, the question from the audience is about more like the boundaries of the firm. As companies share data, the one question is in terms of, do joint ventures solve some of the data problems, in terms of the sharing? And a related one probably might be, as more and more companies share the data, there might be some suboptimal outcomes to the society. Maybe there might be a groupthink, everyone is using the same data, same machine learning models, the same data, and there might be some suboptimal outcomes. So some thoughts on that?
Tonetti: Sure. With respect to the joint ventures, I think that's likely to lead to a bilaterally or multilaterally optimal amount of data sharing within the people—or the firms—that are participating in the joint venture, but there is nothing that I can think of that would suggest that that would be achieving the societally optimal amount of data sharing. And so there could be competitors who a consumer would love to share that data with—increased competition, having multiple firms pushing towards improving products at the same time. Whereas the joint venture is... Another way of thinking about that is, "Let's form coalitions to monopolize the use of the data," so it would be broader than just within the particular firm, but still maybe not as broadly shared as a consumer or a benevolent social planner would want. And so I think it helps, but I don't see any reason why it would line up with what we as a society might want to achieve.
Abowd: So there's a very real sense in which certain ways of regulating data sharing could lead to a tragedy of the commons, and the best example is in biomedical data. A genome-wide association study takes the genomes from a relatively small sample—all of whom have the same phenotype—and publishes an enormous array of summary statistics about those genomes. So one of the first demonstrations that you needed to think about privacy-enhanced data analysis is, if you give me a genome with an error less than 10^-10, I can tell you whether that genome is in any particular GWAS [genome-wide association study]—which basically means that the GWAS identifies the people who have the phenotype. It doesn't identify all of them, but it identifies the ones who are in the sample that was used.
The National Institutes of Health actually recognized that that was a privacy violation, and now even if your data are only going to be published in a GWAS, you have to sign the full disclosure release for medical experiments. If we set privacy protection up so that you can't get the data analysis that leads to trying to find the alleles that are associated with various phenotypes, then we will have limited the data sharing to a great disadvantage to society.
The right way to handle that is to allow some error in the statistical analysis of the GWAS that accounts for the social loss of the privacy, but will allow the actual analysis to occur on the real genomes. Don't try to noise up the input data to that analysis; instead, noise up the output data by an amount that has been chosen to balance learning new things in diagnostics against the privacy risk associated with the shared genome data.
If we do things like Chris showed—ban data sharing—that will lead to suboptimal outcomes as well, but so would a full sharing of the genome here, because it's too easy to recover whether or not a particular genome is in a statistical study.
Chava: I think probably a related question, another one is, again, how the model is structured in terms of positive versus negative externalities, in the context of, let's say, the cancer patient, in terms of the sharing of the data? And also a related question, I guess, is (in terms of, this also seems to be a popular question), about credit defaults and particular information from credit bureaus.
Tonetti: I think they're great questions. So the first one that was raised is something that I worry about myself. What I was trying to do with this line of research is sort of push against what I thought was the widely held belief that we want more privacy, and so what we should be doing is pushing all towards more privacy. So in the model we wrote down, we kind of abstracted away from these negative events that can occur when there is too much sharing. And so there are classic economic examples where having too much information unravels a market, and insurance is the classic example there. And so I do think there are cases in which we might want to prevent or add a lot of noise to the amount of information that we release.
And so I think this is really a context-specific issue. There are some contexts in which we have to understand, what are the privacy costs? Are they just utility costs, that I just don't want everyone in this room to know my Google search history or which books I read—or just for no other reason than I don't want you to know? Or am I afraid that if that information is out there, that firms will treat me differently—they can either price discriminate against me or, and this is a very powerful example, quantity discriminate against me. That's something I think is a really legitimate issue, and I don't think we have a great understanding of what the privacy costs are in different contexts, and how to best handle market design and legal frameworks under those different contexts.
Chava: Another thing is, in terms of the nonrival nature of the data, I think where the value might depend on the industry—like, for example, if a hedge fund uses some alternate data, and then derives some alpha from it, another hedge fund which later comes in might not be able to use it. This can be examples of that, where...advantage matters. So maybe I think that will be another question.
Tonetti: Yes, I think that's a great point there. So I think there are two issues on the table. The first one that I read out of the question is, there are some places where there are markets for data, and firms do sell data—it's not like we're in a world where everybody's just hoarding the data in their own data silos. So I think that's evidence that there are markets that can work. The question is really about the incentives on, are the firms' incentives to sell their data in these data markets strong enough to get the right amount of sharing, aligned with what consumers or governments would want?
There could be issues where firms are over-sharing data that consumers wish they weren't. There could be issues where firms are not selling enough compared to what the consumer would do. The basic point that I'd like to make is that there's no reason why firms' incentives are structured necessarily to line up with what consumers would want. And so it's possible that you'll see something quantitatively off in the amount of data that's shared—even if qualitatively, in some context, you'll see some data being shared.
And I think, to go to your question, there are some goods, or some markets, in which trades are zero sum. If I make some money, you lose some money. In that context, it's a very different world that we're living in than the one that I was talking about, where more data just improves the quality of products that then we would all get to consume. I think in finance there's been a lot of research on information economics and price discovery and things like that. And I think that is a different context that deserves its own study.
Chava: So there seems to be a lot of questions that we've got on the census data. Maybe I'll put in a few of them that you can answer.
Abowd: Can you give me the first question? Chris, I'm going to take the first question first [laughter].
Chava: Good choice.
Abowd: Even before the citizenship question was potentially added to the 2020 census, the Census Bureau had been investigating how to use stronger privacy protections on the data—particularly the block-level data that are used to draw all the legislative districts. But in addition, we release block-level data that describe the sex and age of everyone, interacted, everyone who lives in that block. We have some very detailed tables that have historically been published from the census, and we have designed the protection system for the publications from the 2020 census to be much stronger than the ones that were used in the 2010 census—but still demonstrably fit for use, because basically the block is a pixel that's used to build arbitrary geographic areas. It's not the unit of analysis itself.
So it is very important to get a way of anonymizing the data that is trusted. The way we are handling that is we've already published the algorithms we used for the data that I showed you on the slides today, and we have published a wrapper that lets you load public data from the 1940 census into those algorithms and test the properties of our privacy protection software yourself.
And I'll take the other one.
Yes, but not just the Census Bureau, but most of the agencies of the federal government that ingest data are very concerned about interference during the process of taking the 2020 census. Those agencies also rely on our published data for their operations, and there is a consortium of chief information officers and their security staffs that meet regularly, under top secret SCI [sensitive compartmented information] security clearances, to assess and balance the algorithms that we use for detecting and preventing these.
Everyone understands that this is a game in which you have no incentive to show your hand until you're ready to attack, so the methods that we are going to use to determine if there's an attack coming, or a manipulation, are necessarily secret. But the National Security Agency, and other very sophisticated federal agencies, are part of this consortium. We're very concerned about this, and we're very concerned about developing appropriate defenses.
Chava: There seems to be one more. Again, I think this...I think, the census data.
Abowd: So this is an example where the trusted curator model produces more accurate statistics in a defensible way. So currently the IRS does allow the Census Bureau to take a lot of detailed microdata into its secure computational facility and use those data in the development of statistics. They don't go the other way. They don't go the other way with the Social Security Administration, either. The only things that go the other way are the same public use products that anyone in this room can use.
The issue, properly identified in the question, is the dependence on survey methods. So that is an issue in all the statistical agencies, and one of the solutions is to multiple-source the data. One of the reasons why you want to do the privacy protection on the things you're going to publish, rather than the things you ingest, is those production possibility frontiers that I showed you.
Statistical agencies are not big data. We have 330 million (roughly) persons that we expect to measure in the 2020 census. If we apply local differential privacy to those data on the way in the door, and we run into a situation where we need more data—well, if we use the algorithms that Google deployed in the Chrome browser, and if they need another billion observations, as you heard earlier today, they just wait a minute or two and they have another billion observations, and that improves the precision. If we need another billion observations, we need to wait three decades.
So in order to balance accuracy against privacy loss, it's better to prepare the analyses on the confidential data and then pass them through the privacy filter before they're published rather than trying to ingest them with the privacy filter. This will not always be true. There are some things that statistical agencies are likely to ingest, like consumer credit card data, where the volume of the transactions would justify making the filter on the way in. But in general, it's better to accept the data and do the analysis on the raw data because of the sample size restrictions.
The Foundations for Evidence-Based Policymaking Act of 2018 greatly broadened the scope for doing this in the statistical agencies—although not specifically with the IRS, because it continues to except data that have a statutory prohibition. But the default now is that a statistical agency can access those data rather than having to ask permission to access those data.
Tonetti: And in the private sector context, I think the issue of verifiability of data becomes really important. And so if you're selling some data—or if you're going to run somebody else's regression on your data set locally, and then fuzz it up before you hand it back to them—there needs to be some (perhaps) mechanism for verifying that you use real data, and didn't just make it up. And so that's another area that I think is really interesting that hasn't necessarily been solved.
Abowd: If you comb through the code we released from the 2018 end-to-end test, you will find one very peculiar set of measurements in there, that are in there because they allow us to state with a very high probability that the algorithm was actually run.
Chava: The last one, last question—maybe, I think. So again, there are a lot of advantages in terms of sharing the data between the companies, as you have shown. There's one question which is related, which is, there might be advantages in sharing the data across countries also. And then again, there are silos, I think, across countries (a related question, in terms of the value).
Tonetti: Yes. I think the short answer is that growth over the centuries has come from taking advantage of increasing returns to scale associated with nonrivalry. And to the degree to which data is becoming more important in the way we produce things, that's an optimistic reason for why we may be able to grow and progress faster than we have.
Chava: Any thoughts? Okay. I think we're right on time. Let's thank both Chris and John for the excellent session [applause].