New AI model can generate real-world tabular data

A new AI model has been launched that can generate highly realistic tabular data – the type of structured data widely used in healthcare, finance, and social science. The new method, called TabCascade, addresses a long-standing challenge in artificial intelligence: while generative AI has made remarkable progress in creating realistic images, videos, text, proteins, and other forms of data, generating realistic tabular data has remained significantly more difficult.

The model was developed by PhD candidate Markus Mueller, Assistant Professor Kathrin Gruber, and Professor Dennis Fok from the Econometric Institute at Erasmus School of Economics. They will present their findings at the Forty-Third International Conference on Machine Learning, one of the top three AI conferences worldwide. This year, it will be held in Seoul, South Korea, from 6 to 11 July. The code for the model has also been made publicly available.

Existing AI methods have largely struggled to model the complexity of real-world data, particularly when it comes to generating realistic missing values in otherwise continuous data. Unlike images or text, which contain only one form of information, real-world tabular datasets often combine numerical values, categorical information, and incomplete or missing entries within the same dataset - and sometimes even within the same variable.

These mixed data types are common in practice and often carry important real-world meaning. For example, a missed doctor’s appointment recorded as a missing entry may provide important information about a patient’s health: it might indicate that the patient was too ill to attend or had already been admitted to the ER. Similarly, unanswered questions in an economic survey may reveal behavioural or social patterns.

To address these limitations, the researchers developed TabCascade, a new AI model specifically designed to capture the full complexity of real-world tabular data. The method works in two stages. It first learns the broad, “low-resolution” structure of a dataset - such as categorical patterns and coarse representations of numerical variables - and then gradually refines this information into a detailed, “high-resolution” representation, similar to how an image becomes sharper when moving from a rough sketch to a high-resolution picture.
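
As a rough illustration of the coarse-to-fine idea, and not the authors' implementation, the sketch below splits a single column that mixes numbers and missing entries into a "low-resolution" label (missing, or a bin index) and a "high-resolution" offset within the bin. The equal-width binning, the `MISSING` label, the function names, and the example income values are all assumptions made for this illustration.

```python
MISSING = "missing"  # hypothetical label for the "no value" state


def fit_bins(values, n_bins=4):
    """Equal-width bins over the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins
    if width == 0:  # constant column: avoid division by zero later
        width = 1.0
    return lo, width, n_bins


def to_coarse(value, bins):
    """Stage 1 view: a categorical label plus the detail left for stage 2."""
    lo, width, n_bins = bins
    if value is None:
        return MISSING, None
    position = (value - lo) / width
    idx = min(int(position), n_bins - 1)  # the maximum falls in the last bin
    return idx, position - idx


def to_fine(coarse, offset, bins):
    """Stage 2 view: refine the coarse label back into an exact value."""
    lo, width, _ = bins
    if coarse == MISSING:
        return None
    return lo + (coarse + offset) * width


# Made-up example column: monthly income with two unanswered entries.
income = [1200.0, None, 3400.0, 2500.0, None, 5200.0]
bins = fit_bins(income)
stage_one = [to_coarse(v, bins) for v in income]
coarse_labels = [label for label, _ in stage_one]
restored = [to_fine(label, offset, bins) for label, offset in stage_one]

print(coarse_labels)  # [0, 'missing', 2, 1, 'missing', 3]
```

In this toy encoding, "missing" is simply one more category at the coarse level, so a model of the coarse stage can learn when values tend to be absent before any numerical detail is filled in.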

This cascaded approach enables TabCascade to capture subtle statistical patterns, important relationships between variables, rare events, and complex mixed-type features that existing methods often miss. Importantly, it is the first diffusion- or flow-based generative model capable of realistically generating missing values alongside continuous and categorical information within the same framework. Existing models combine distinct diffusion processes for different data types into a single model with a unified training objective. As a result, these models implicitly favour the generation of one feature type over another, which can degrade overall sample quality. The cascaded structure of TabCascade, with its separate training objectives, ensures that all feature types get the necessary attention.

In benchmark tests, TabCascade significantly improved the realism of synthetic datasets. In one key evaluation, a detection test, the method scored more than 50 percent better than the next best-performing approach, meaning it was substantially harder for machine learning models to distinguish its synthetic samples from real-world data.
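
The idea behind such a detection test can be sketched as follows: train a classifier to tell real rows from synthetic rows, and check how often it succeeds. The example below is a generic version using a leave-one-out 1-nearest-neighbour classifier on made-up Gaussian data; the choice of classifier, the function name, and the data are assumptions for illustration and do not reproduce the evaluation protocol of the paper. An accuracy near 0.5 means the classifier is guessing, while an accuracy near 1.0 means the synthetic data is easy to spot.

```python
import random


def loo_1nn_accuracy(real, synthetic):
    """Leave-one-out accuracy of a 1-nearest-neighbour real-vs-synthetic classifier."""
    points = [(row, 0) for row in real] + [(row, 1) for row in synthetic]
    correct = 0
    for i, (row, label) in enumerate(points):
        best_dist, best_label = float("inf"), None
        for j, (other, other_label) in enumerate(points):
            if i == j:  # leave the query point itself out
                continue
            dist = sum((a - b) ** 2 for a, b in zip(row, other))
            if dist < best_dist:
                best_dist, best_label = dist, other_label
        correct += best_label == label
    return correct / len(points)


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample(n, shift):
    """Made-up two-column data: unit-variance Gaussians centred at `shift`."""
    return [(rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)) for _ in range(n)]


real = sample(100, 0.0)
close_synth = sample(100, 0.0)   # same distribution as the real data
far_synth = sample(100, 10.0)    # clearly different distribution

close_acc = loo_1nn_accuracy(real, close_synth)
far_acc = loo_1nn_accuracy(real, far_synth)
print(f"realistic synthetic data:   accuracy {close_acc:.2f}")
print(f"unrealistic synthetic data: accuracy {far_acc:.2f}")
```

Under this reading, a better generator pushes the classifier's accuracy down towards chance level.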

Markus Mueller, PhD candidate at Erasmus School of Economics

Kathrin Gruber, Assistant Professor at the Department of Econometrics at the Erasmus School of Economics

Dennis Fok, Professor of Econometrics and Data Science at Erasmus School of Economics

More information

The paper is titled “Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features”.

For more information, please contact Ronald de Groot, Media and Public Relations Officer at Erasmus School of Economics, rdegroot@ese.eur.nl, or +31 6 53 641 846.

A new study by researchers Markus Mueller, Kathrin Gruber, and Dennis Fok of Erasmus School of Economics explores how AI can generate realistic synthetic data.

Monday 3 Mar 2025
