
The Principle of Least Data


Introduction

Welcome to 2024.

We’re halfway through and we’ve already had a string of major cybersecurity breaches on a scale that, quite frankly, is concerning: Most recently, it was the AT&T breach with around 110 million user records stolen (3, 4), but before that it was the data of 560 million users breached at Ticketmaster (6), and 49 million user records breached in an attack on Dell (7) … to name just a few. The scale of the breaches, in terms of both the number of users affected and the amount of data stolen, is astounding. Perhaps it is time that we sit down and have a little chat about the data, and about why we are storing so much of it.

A Data Collection Feed Frenzy

What we are seeing is a significant increase in the collection of data (2, 3, 4, 5). This accelerated, but did not start, with the advent of Social Media and larger-scale online forums. It has existed in some form for far, far longer: Do you have a magazine membership, or a grocery store card? Do you have a frequent flyer card or a ‘club membership’ to get discounted products? The concept is the same.

In the physical world, when a large number of people gather there will be an opportunity to sell products to them. This is no different in the digital world. Where there are large numbers of people online, usually concentrated in ‘hubs’, there will be an opportunity to market and sell products to them.

But now we have the added ability to collect and analyze detailed metrics on each person, at scale: individualized pieces of data, and more importantly metadata, that outline everything from how long you spent looking at a picture to which friend-of-a-friend referred a link to you six months ago that you only clicked just now. Even an accidental click counts, because the record will show that you clicked it and then immediately closed the browser window or went back.

There has been some visibility into this problem, and more stringent data laws have been imposed. Documentaries such as “The Social Dilemma” have raised awareness to great effect, and newer generations are far more tech-savvy than the previous ones, though that does not necessarily mean they fare better than their predecessors at identifying such issues.

And this was before the advent of ‘big data’ and ‘AI’. Big Data is the inevitable result when organizations shamelessly collect every ounce of available data, regardless of the data’s utility. This information must be stored in vast repositories sometimes called data lakes: caverns of swirling liquid data, with huge pipes gulping up entire partitions and expanding them into n-dimensional arrays in the hope of gleaning some new insight, some fraction-of-a-percent improvement, or some new leverage over a close competitor. Every drop counts to them, but when asked how a particular piece of data contributes to their goal, they fall silent. Artificial Intelligence is the extension of this concept. How does one train a machine learning algorithm? With data, of course! And much of the prevailing wisdom (although this may be changing) leans towards “more data means better results”. There is an incentive to hoover up every drop of data available in the hope that it will further improve the algorithm. This has created an environment that encourages developers to err on the side of “more data collection” rather than less.

These enormous repositories of personal data, metadata, relationship links, and aggregations are the flame to which the hacker-moths are drawn. And at no point did I mention privacy and security in the act of collecting such data. At first it was a wave of open S3 buckets. They were the canaries in the coal mine, and they were simply the easiest to find: a quick search on a scanning website would give you those. Then, with the waterline slowly rising, adversaries had to search elsewhere. Lax (or non-existent) security controls were the next stop (and still are): perhaps it was an old test server using live data from the data lake, or simply that the access dashboard was publicly accessible, running unpatched PHP code, written as a prototype that got pushed to production. The breaches kept coming, and with each successive headline the scale of the breaches grew, a testament to the continued drive to collect and aggregate vast amounts of information. Over time, security controls improved (though much work still needs to be done), and the waterline rose once again.

Supply chain attacks emerged as a way to circumvent existing security controls by abusing the trust relationship between a vendor and their contractors and suppliers. One of the OGs? Target was hit through their HVAC vendor. Countless software projects were (and still are) breached through compromised libraries included in their code. And then there is the mound of gold that makes all dragons jealous: cloud-based storage, analytics, and centralized authentication for B2B, that is, service providers whose customers are themselves very large businesses. It’s simple: why spend all that time hacking each of these companies when an adversary can simply hit the service that they all use? One hack to rule them all. This is what happened to Snowflake. And this is how AT&T was subsequently breached. Around 110 million customers. The scale has changed. And not in a good way.

It is time to think about what we are storing, and why (1).
It is time to implement the Principle of Least Data.

The Principle of Least Data

The Principle of Least Data is simple: Only store the least amount of data required to do what you need to do.

You may have heard of this concept’s far-more-popular big brother, the Principle of Least Privilege: only give a user the minimum permissions required for them to do their job. This limits what a potentially compromised account can do. If it only has permissions to do simple things, it is harder for an adversary to do any major damage than if the account has full admin privileges, where the adversary doesn’t have to do anything else to start doing some real damage. Likewise, the Principle of Least Data significantly limits what is in a repository when it is breached (it is a matter of “when”, not “if”), and it thus reduces the impact of the breach on the organization’s customers. It does not eliminate the impact, as there was still a breach after all, but it changes a breach from “they got everything: usernames, passwords, credit card numbers, emails, texts, etc.” to “they just got your email and phone number”. Both are breaches. But they do not have the same impact.
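
To make the principle concrete, here is a minimal sketch of what it can look like in code. It is a hypothetical signup handler in Python; the field names and the store object are invented for illustration, not taken from any real codebase. The handler keeps an explicit allowlist of the fields it needs and drops everything else before anything is persisted.

    # Hypothetical signup handler illustrating the Principle of Least Data.
    REQUIRED_FIELDS = {"email", "display_name"}  # everything else is dropped at the door

    def handle_signup(request_payload: dict, store) -> None:
        """Persist only the fields the signup flow actually needs."""
        record = {k: v for k, v in request_payload.items() if k in REQUIRED_FIELDS}
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"signup is missing required fields: {sorted(missing)}")
        # The browser may have sent device details, referrer chains, precise
        # geolocation, and more; none of it is needed to create an account,
        # so none of it ever reaches storage.
        store.save("users", record)

The default is to drop. Anything an engineer later wants to keep has to be added to the allowlist deliberately, which is exactly the “what and why” conversation the principle is meant to force.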

There are some drawbacks to this approach:

  • It requires engineers and developers to think long and hard about WHAT they are going to collect and WHY they are collecting it. Arguably, this is something that you WANT your engineers to spend time on (it is the mark of a mature engineer), but it comes at a cost, and any additional expense is viewed with scrutiny when it comes to an organization’s bottom line. It is up to senior engineers to explain that spending the time (and thus cost) now to think about these problems will save time in the future, when these requirements may arrive as part of legislation with very hard deadlines and harsh penalties for non-compliance.
  • It limits the data that is available to the organization for use. This is the point, after all, but that does not mean organizations will like it. If collecting data is seen as potentially adding value (revenue), then limiting the collection of data is seen as hampering the organization’s revenue stream. This conveniently ignores the costs associated with storing the data (storage is cheap, but it is not free) and the increase in “data liability” the organization incurs. Data liability is the inherent risk in storing a piece of data. If you have one dollar stored in your house safe you may not be a target for elite thieves, but if you have ten billion dollars in cash stored in your house safe, then people would consider you negligent: the ‘dollar liability’ has passed a threshold where storing it that way no longer seems viable. For data, the liability is the risk of a breach and the impact of the disclosed data, both for the data owner (the person whose data it is) and for the organization (the data custodian). Breaches also come with a cost, and that cost is a function of the size of the breach: many laws fine an organization on a per-record basis, and even ransomware operators will charge more based on the number of systems and files encrypted (see the sketch after this list). Furthermore, legal cases have recently begun over the use of data for AI and for training machine learning models. The question of ownership is hotly debated, and these nascent cases will set the precedent going forward. It is a particularly precarious time for an organization to have no strategy for reducing its data collection.
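
As a rough illustration of how data liability scales with what is stored, a back-of-the-envelope estimate might look like the sketch below. The per-record cost is a placeholder of my own choosing, not a real statutory figure; actual fines and breach costs vary widely by law and jurisdiction.

    # Back-of-the-envelope data liability estimate (illustrative numbers only).
    def breach_liability(records_stored: int, cost_per_record: float = 150.0) -> float:
        """Rough breach exposure as a function of how many records are held.

        The exact figure matters less than the shape: liability grows with
        every record kept, so storing less directly shrinks the blast radius.
        """
        return records_stored * cost_per_record

    # Holding data on 1 million users versus 110 million users:
    print(f"${breach_liability(1_000_000):,.0f}")    # $150,000,000
    print(f"${breach_liability(110_000_000):,.0f}")  # $16,500,000,000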

Much like in defensive cybersecurity, if you do your job perfectly (a feat!) then the best you can hope for is that nothing happens (a good day). This does not look good to executives who may see the cybersecurity or IT department as a cost center: a significant cost with no measurable or tangible return. And it does not benefit a security practitioner to always claim “the sky is falling” by pointing to a string of recent hacks or breaches. One can derive metrics from, say, a firewall’s logs of blocked attacks, or from the number of incidents opened, their investigation time, and the number closed within a certain period, but how does one explain the value of “less data is more”? Especially in an “AI is everywhere” environment.

It is less tangible, and it is inherent in the everyday decisions a developer makes when writing a line of code, within a function, within a module, within a product that gets released to the public and handles people’s data. It is not a project with a start and end date, and it can’t be added on at the end of a sprint. It must be ingrained within each developer: the little voice in their ear asking, “Do you really need that data point?” … At least, in a perfect world.
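
That little voice can also be encoded as a default. The sketch below shows a hypothetical analytics helper (not from any particular framework) that coarsens or drops fields before an event ever leaves the application.

    # Hypothetical analytics helper: minimize an event before it is emitted.
    import datetime

    def minimize_event(event: dict) -> dict:
        """Keep only what the analytics question actually needs."""
        # Deliberately absent from the output: IP address, user agent, referrer
        # chain, precise timestamp, and anything else that rode along in `event`.
        return {
            "event_name": event["event_name"],
            # A daily bucket answers "how many signups per day?" just as well as a
            # millisecond timestamp, and reveals far less about any one person.
            "day": datetime.date.fromtimestamp(event["timestamp"]).isoformat(),
        }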

Some References

  1. https://arstechnica.com/tech-policy/2024/07/after-breach-senators-ask-why-att-stores-call-records-on-ai-data-cloud/
    • An increased interest in the concept of “Least Data”, though the term itself is not used
  2. https://cloud.google.com/blog/topics/threat-intelligence/unc5537-snowflake-data-theft-extortion
    • Mandiant tracking breaches associated with the Snowflake breach — with a fantastic breakdown
    • A great example of a large scale supply chain attack
  3. https://about.att.com/story/2024/addressing-illegal-download.html
    • AT&T’s announcement of their breach (a result of the Snowflake breach)
  4. https://arstechnica.com/tech-policy/2024/07/nearly-all-att-subscribers-call-records-stolen-in-snowflake-cloud-hack/
    • Further elaboration on the AT&T breach
  5. https://www.reuters.com/technology/moveit-hack-spawned-around-600-breaches-isnt-done-yet-cyber-analysts-2023-08-08/
    • Further evidence that attackers are utilizing supply chain attacks to breach customers at scale
  6. https://www.bbc.co.uk/news/articles/cw99ql0239wo
    • 2024 Ticketmaster hack — 560 million users affected
  7. https://www.bleepingcomputer.com/news/security/dell-api-abused-to-steal-49-million-customer-records-in-data-breach/
    • 2024 Dell breach — 49 million records