By Ian Graham, VP and GM, Federal Health and Civilian, and Vishal Deshpande, Chief Data Analytics Officer
Artificial intelligence and machine learning (AI/ML) hold the power to rapidly transform healthcare and improve health outcomes. However, the success of AI/ML solutions depends on access to diverse and representative data. When data for specific socioeconomic or ethnic groups is scarce, the resulting AI/ML models can be skewed by bias.
Fortunately, advanced data science capabilities can help address this challenge. Let’s explore how two advanced techniques in synthetic data generation can enable more equitable AI-powered solutions.
Profile-Based Synthetic Data Generation: A Game Changer
At Unissant, we help agencies identify the most secure and ethical pathways to implement AI/ML models. We avoid using personally identifiable information, protected health information, or other confidential data in production systems. Rather, we recommend creating synthetic data to advance data privacy and security, mitigate bias, improve model performance, and accelerate AI development.
The idea of creating synthetic data is not new. However, traditional approaches have their limitations. Rule-based approaches work for simple scenarios. Statistical approaches capture general patterns, but they frequently miss specific details. While the resulting synthetic data may appear statistically similar to the source data, it often lacks the nuances of real production data and, as such, can perpetuate bias.
Advanced Techniques Overcome Bias through Data Diversity
Profile-based synthetic data generation, which involves creating synthetic data that adheres to specific demographic and clinical profiles, presents real opportunities when developing AI/ML models for healthcare contexts. With its ability to help mitigate bias, profile-based synthetic data generation can benefit a variety of federal health use cases—advancing medical research, empowering patient trend analytics, optimizing clinical workflows, improving patient safety, aiding in diagnosis, and facilitating personalized treatment.
Two advanced techniques stand out as particularly relevant for federal health contexts:
- Configurable attribute-level controls
- Scenario-based synthetic data generation
Conquering Data Scarcity: Configurable Attribute-level Controls
Configurable attribute-level controls allow us to fine-tune and customize data profiles to align with specific use cases. The synthetic data we create can be readily adjusted to meet domain-specific requirements such as demographic segmentation or behavioral modeling. Importantly, these controls address existing biases within the real-world data used to train the synthetic data generator. By enabling such precise adjustments, agencies can counteract skewed distributions, improve representation fairness, and ensure a more balanced, equitable dataset suitable for modeling and analysis.
One valuable application of attribute-level controls is in disease research. Clinical trials and large-scale studies may lack data for underrepresented minority populations, which can lead to biased models and treatments that may not be effective for all patient groups. By configuring attribute-level controls, researchers can generate synthetic datasets that better reflect the diversity of the U.S. population across race and ethnicity, socioeconomic status, age, and geography. This can be achieved by:
- Over-sampling underrepresented groups: By increasing the representation of minority groups in the synthetic data, researchers can ensure that their models are trained on a more diverse dataset.
- Adjusting attribute distributions: Researchers can fine-tune the distribution of attributes like age, sex, and comorbidities to match specific research questions or to address historical biases.
- Introducing synthetic noise: By adding random noise to sensitive attributes, researchers can protect patient privacy while still preserving the underlying patterns in the data.
By using these techniques, researchers can develop more accurate and equitable models for predicting disease risk, identifying optimal treatment strategies, and improving patient outcomes.
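To make the three adjustments above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any production generator: the function name, the group labels, the age bands, and every share value are hypothetical. Over-sampling is expressed by raising a group's target share, attribute distributions are set through per-band weights, and noise is a simple random jitter on age.

```python
import random
from collections import Counter

# Hypothetical shares: the labels and numbers are illustrative only.
OBSERVED_GROUP_SHARES = {"group_a": 0.80, "group_b": 0.15, "group_c": 0.05}
TARGET_GROUP_SHARES = {"group_a": 1 / 3, "group_b": 1 / 3, "group_c": 1 / 3}
AGE_BAND_SHARES = {(18, 39): 0.3, (40, 64): 0.4, (65, 90): 0.3}


def generate_profiles(n, group_shares, age_band_shares, age_noise=2, seed=0):
    """Draw n synthetic profiles under explicit attribute-level controls.

    group_shares and age_band_shares map attribute values to target shares.
    Raising a group's share over-samples it relative to the source data.
    """
    rng = random.Random(seed)
    groups, group_weights = zip(*group_shares.items())
    bands, band_weights = zip(*age_band_shares.items())
    profiles = []
    for _ in range(n):
        # Over-sampling: draw the group from the configured target shares.
        group = rng.choices(groups, weights=group_weights)[0]
        # Adjusted attribute distribution: draw an age band, then an age in it.
        low, high = rng.choices(bands, weights=band_weights)[0]
        age = rng.randint(low, high)
        # Synthetic noise: jitter the age by up to +/- age_noise years.
        noisy_age = max(0, age + rng.randint(-age_noise, age_noise))
        profiles.append({"group": group, "age": noisy_age})
    return profiles


if __name__ == "__main__":
    skewed = generate_profiles(3000, OBSERVED_GROUP_SHARES, AGE_BAND_SHARES)
    balanced = generate_profiles(3000, TARGET_GROUP_SHARES, AGE_BAND_SHARES)
    print("observed mix:", Counter(p["group"] for p in skewed))
    print("balanced mix:", Counter(p["group"] for p in balanced))
```

Note that a fixed jitter like this only blurs individual values. It carries no formal privacy guarantee, so a real deployment would need a properly analyzed privacy mechanism.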
Future-forward: Scenario-based Synthetic Data Generation
Scenario-based synthetic data generation goes beyond static replication by mimicking dynamic evolutionary patterns observed in real-world data. This capability is particularly beneficial for predicting and preparing for changes in data trends over time. For example:
- If census data or population studies reveal demographic shifts—such as age group distributions or migration patterns—over a specific timeframe (e.g., the next five years), the synthetic data generator can incorporate these trends to produce future-looking datasets.
- Similarly, in geospatial contexts, where changing environmental or economic conditions drive shifts in population density, the generator adapts synthetic profiles to reflect these projected outcomes.
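As a rough sketch of the demographic-shift example above, the Python below blends a baseline age distribution into a projected one, year by year, and samples a synthetic cohort for each year. Everything here is an assumption made for illustration: the function names, the age bands, the share values, and the choice of straight-line interpolation. A real generator would likely model trends in a richer way.

```python
import random
from collections import Counter

# Hypothetical age-band shares, chosen only for illustration.
BASELINE = {"0-17": 0.22, "18-64": 0.61, "65+": 0.17}
PROJECTED = {"0-17": 0.20, "18-64": 0.58, "65+": 0.22}


def interpolate_shares(baseline, projected, step, total_steps):
    """Linearly blend two distributions; step 0 is baseline, total_steps is projected."""
    t = step / total_steps
    return {k: (1 - t) * baseline[k] + t * projected[k] for k in baseline}


def generate_scenario(baseline, projected, years, records_per_year, seed=0):
    """Return {year_offset: list of age-band labels} for offsets 0..years."""
    rng = random.Random(seed)
    datasets = {}
    for year in range(years + 1):
        shares = interpolate_shares(baseline, projected, year, years)
        labels = list(shares)
        weights = [shares[label] for label in labels]
        datasets[year] = rng.choices(labels, weights=weights, k=records_per_year)
    return datasets


if __name__ == "__main__":
    scenario = generate_scenario(BASELINE, PROJECTED, years=5, records_per_year=20000)
    for year, records in scenario.items():
        print(year, Counter(records))
```

Each yearly dataset can then feed the same downstream models, which makes it possible to check how a model behaves as the population mix drifts toward the projected scenario.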
Decision-makers can now perform predictive modeling and anticipate challenges across a range of domains, including:
- Healthcare and epidemiology: Agencies can create synthetic data to simulate epidemic outbreaks and assess their potential impact on public health systems. This allows for proactive resource planning, intervention strategies, and crisis management.
- Health planning and policy: Testing the efficacy of different intervention strategies—such as vaccination campaigns, social distancing measures, or travel restrictions—can help optimize public health responses and even tailor strategies to urban, suburban, or rural populations.
- Healthcare market and economic analysis: When planning public health infrastructure investments, experts can generate data to forecast consumer behavior, market shifts, or provider allocation trends under specific scenarios.
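To illustrate how an outbreak scenario and an intervention might be compared, the sketch below uses a deliberately simple discrete-time SIR (susceptible, infected, recovered) model. The SIR model is our choice for this example and is not a method named above. The function name and all parameter values are hypothetical, and the intervention is represented only as a lower transmission rate.

```python
def simulate_outbreak(population, initial_infected, beta, gamma, days):
    """Discrete-time SIR model; returns the number infected at the start of each day.

    beta is the daily transmission rate and gamma is the daily recovery rate.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    infected_by_day = []
    for _ in range(days):
        infected_by_day.append(infected)
        new_infections = beta * susceptible * infected / population
        new_recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
    return infected_by_day


if __name__ == "__main__":
    # Hypothetical parameters: the intervention lowers beta from 0.30 to 0.18.
    baseline = simulate_outbreak(100_000, 10, beta=0.30, gamma=0.10, days=200)
    intervention = simulate_outbreak(100_000, 10, beta=0.18, gamma=0.10, days=200)
    print("baseline peak infected:", round(max(baseline)))
    print("intervention peak infected:", round(max(intervention)))
```

The daily counts from each run could serve as a synthetic time series for stress-testing resource plans. Tailoring to urban, suburban, or rural populations would require parameters that are estimated separately for each setting.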
By combining observed historical patterns with projected data movements, scenario-based synthetic data generation supports forward-looking modeling for complex, evolving use cases. This empowers organizations to remain agile and address emerging challenges with credible synthetic datasets.
Ethical and Future-Forward AI/ML in Healthcare
Profile-based synthetic data generation offers a powerful solution to address the challenges of bias and data scarcity in healthcare. By enabling the creation of diverse and representative synthetic datasets, this technology can help to improve the accuracy and fairness of AI/ML models.
Leveraging advanced techniques such as configurable attribute-level controls and scenario-based synthetic data generation, agencies can unlock the full potential of AI/ML. These techniques are highly relevant to federal health use cases including medical research, clinical decision-making, public health policy, and patient care. At Unissant, we’re excited to put these techniques to work for federal clients, helping narrow healthcare disparities today and improve outcomes in the future.