Put Your AI on a Data Diet

Unissant Team
October 10, 2024

Improving Privacy and Security through Data Minimization

by Vishal Deshpande, Chief Data Analytics Officer

One of the biggest challenges associated with AI algorithms is their appetite for vast amounts of data. While data is essential for development, it presents several challenges, particularly in the areas of:

  • Data privacy—Collecting massive amounts of data raises significant privacy issues, introducing the risk of exposing sensitive personal information.
  • Data security—Protecting large datasets from cyberattacks is a major challenge. A data breach can expose sensitive information and damage an agency’s reputation.

So, as I publish this blog, it’s October. Halloween is right around the corner. When I think about AI’s appetite for data, I’m reminded of the musical “Little Shop of Horrors”. The story’s antagonist is Audrey II, a grotesque, carnivorous plant with a gaping, toothy mouth. Its menacing demand: "Feed me!" As Audrey II grows in size and power, so does its murderous path of destruction.

It’s imperative that we keep AI systems from becoming like Audrey II. So, how do we control AI’s insatiable appetite and prevent it from evolving (or devolving) into a menacing beast? Frankly, we put AI on a strict data diet.

At first, it sounds counterintuitive—doesn’t AI need as much data as possible to learn and thrive? Well, yes and no. Enter the concept of data minimization.

Should my AI model go on a diet?

Data minimization means collecting only the data necessary for the AI model, then deleting that data when it’s no longer needed. This strategy is frequently employed during AI model development. During model training, for example, reducing the size of the dataset can accelerate training while lowering computational costs.
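
To make this concrete, here is a minimal sketch in Python of both halves of the practice: collect only the fields the model needs, then dispose of the data once it has served its purpose. The file name and column names are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
REQUIRED_COLUMNS = ["age_band", "diagnosis_code", "outcome"]

# Collect only the data necessary for the model...
df = pd.read_csv("patient_records.csv", usecols=REQUIRED_COLUMNS)

# ... train and evaluate the model on df ...

# ... then delete the data when it's no longer needed.
del df
```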

When designing a model, it’s important to understand the full lifecycle, considering data privacy and security impacts throughout. At Unissant, we consider data minimization when AI models include sensitive data, such as health records or financial data. Other strategies, such as synthetic data generation, which we’ll explore in our next blog, are also relevant.

How to put your AI model on a data diet

When designing AI systems, agencies should take a few strategies into consideration to accomplish data minimization.

  1. Understand your data requirements—Clearly define the specific data points needed for the AI model to achieve its objective. Apply data profiling to analyze existing datasets and understand their composition, quality, and relevance. Conduct a data quality assessment to ensure the most accurate, complete, and consistent data is included in your model (see the first sketch below).
  2. Apply data reduction techniques—Identify and retain only the most relevant features for your model. By using data sampling, agencies can create representative subsets of the data for training and testing. Machine learning libraries further automate data reduction (see the second sketch below).
  3. Take an iterative approach—Like a chef creating a masterful soup, you don’t want to throw in all your ingredients (or all your data) at once. Begin with a minimal dataset and gradually increase data volume as needed. Continuously assess model performance to determine whether additional data is needed (see the third sketch below).
  4. Integrate privacy-preserving techniques—Use techniques such as differential privacy to add “noise” to datasets, protecting individual privacy while maintaining data utility. [For example, if a dataset includes the ages of cancer patients, report a noisy average or an age range (60-64) rather than an exact value (62) to protect privacy.] Performing computations on encrypted data without decrypting it (homomorphic encryption) further protects privacy (see the fourth sketch below).
  5. Incorporate ethical considerations—Ensure that data minimization does not inadvertently introduce bias into the model. Explainable AI (XAI) practices dictate that data minimization steps be documented and their impact on model performance explained.
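
To illustrate step 1, here is a minimal profiling sketch using pandas. The file name and the “outcome” target column are hypothetical stand-ins for your own data.

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical dataset

# Composition: column types, non-null counts, and memory footprint.
df.info(memory_usage="deep")

# Quality: fraction of missing values per column, and duplicate rows.
print(df.isna().mean().sort_values(ascending=False))
print("duplicate rows:", df.duplicated().sum())

# Relevance: how strongly each numeric feature tracks the target.
print(df.corr(numeric_only=True)["outcome"].sort_values(ascending=False))
```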
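For step 2, one illustrative recipe using scikit-learn’s built-in tools (not the only way to reduce data): keep only the most relevant features, then draw a representative stratified sample for training and testing.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Retain only the 10 features with the highest ANOVA F-scores.
X_reduced = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Draw a representative (stratified) subset for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, train_size=0.5, stratify=y, random_state=42
)
print(X.shape, "->", X_train.shape)  # e.g., (569, 30) -> (284, 10)
```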
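Step 3 can be automated with a learning curve: train on growing slices of the data and stop adding once validation performance plateaus. A sketch on a small built-in dataset:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Begin with a minimal dataset and gradually increase data volume.
sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} samples -> {score:.3f} cross-validated accuracy")
# Once accuracy plateaus, more data adds cost without adding value.
```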
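And for step 4, a toy sketch of the Laplace mechanism, the standard way differential privacy adds calibrated noise to an aggregate. The ages and the epsilon value are made up for illustration; a real deployment would use a vetted library and a carefully chosen privacy budget.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism.

    Clipping each value to [lower, upper] bounds the sensitivity of
    the mean at (upper - lower) / n, which sets the noise scale."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

ages = np.array([62, 58, 71, 64, 60])  # toy sensitive records
print(dp_mean(ages, lower=40, upper=90, epsilon=1.0))  # noisy average age
```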

So, while we need to feed the proverbial beast—give our AI models data from which to learn—we benefit when we do so with the right data minimization practices in place. By putting our AI models on a strict data diet, we can mitigate privacy and security concerns while delivering higher-value models. Makes AI a little less scary, doesn’t it?

In my next blog, I’ll explain Profile-based Synthetic Data Generation—an approach Unissant is investing in heavily to help our clients optimize data privacy and security for AI models.
