A new AI model has been launched that can generate highly realistic tabular data – the type of structured data widely used in healthcare, finance, and social science. The new method, called TabCascade, addresses a long-standing challenge in artificial intelligence: while generative AI has made remarkable progress in creating realistic images, videos, text, proteins, and other forms of data, generating realistic tabular data has remained significantly more difficult.
The model was developed by PhD candidate Markus Mueller, Assistant Professor Kathrin Gruber, and Professor Dennis Fok from the Econometric Institute at Erasmus School of Economics. They will present their findings at the Forty-Third International Conference on Machine Learning, one of the global top three conferences on AI. This year, it will be held in Seoul, South Korea from 6-11 July. Additionally, the model's code has been made publicly available.
Existing AI methods have largely struggled to model the complexity of real-world data, particularly when it comes to generating realistic missing values in otherwise continuous data. Unlike images or text, which contain only one form of information, real-world tabular datasets often combine numerical values, categorical information, and incomplete or missing entries within the same dataset - and sometimes even within the same variable.
These mixed data types are common in practice and often carry important real-world meaning. For example, a missed doctor’s appointment recorded as a missing entry may provide important information about a patient’s health, as it might indicate that the patient was too ill to attend or had already been admitted to the ER. Similarly, unanswered questions in an economic survey may reveal behavioural or social patterns.
To address these limitations, the researchers developed TabCascade, a new AI model specifically designed to capture the full complexity of real-world tabular data. The method works in two stages. It first learns the broad, “low-resolution” structure of a dataset - such as categorical patterns and coarse representations of numerical variables - and then gradually refines this information into a detailed, “high-resolution” representation, similar to how an image becomes sharper when moving from a rough sketch to a high-resolution picture.
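The coarse-to-fine idea can be illustrated with a small Python sketch. This is not the researchers' implementation: the quantile binning, the `"missing"` label, and all function names below are assumptions chosen purely for illustration. It splits a numerical column with missing entries into a coarse categorical label (a bin index or a missing marker) and a fine offset inside the bin, and then shows that the two parts together recover the original values.

```python
import math
from bisect import bisect_right


def fit_bin_edges(values, n_bins=3):
    """Quantile-style bin edges from the observed (non-missing) values.

    Hypothetical helper: the binning rule is an assumption of this sketch.
    """
    observed = sorted(v for v in values if v is not None)
    edges = [observed[0]]
    for k in range(1, n_bins):
        edges.append(observed[(k * len(observed)) // n_bins])
    edges.append(observed[-1])
    return edges


def to_coarse_and_fine(value, edges):
    """Split one entry into (coarse label, fine offset in [0, 1])."""
    if value is None:
        # Missingness is treated as its own coarse category.
        return "missing", None
    n_bins = len(edges) - 1
    # Clamp so the maximum value falls into the last bin.
    idx = max(0, min(bisect_right(edges, value) - 1, n_bins - 1))
    lo, hi = edges[idx], edges[idx + 1]
    offset = 0.0 if hi == lo else (value - lo) / (hi - lo)
    return idx, offset


def from_coarse_and_fine(coarse, offset, edges):
    """Rebuild the detailed value from its coarse label and fine offset."""
    if coarse == "missing":
        return None
    lo, hi = edges[coarse], edges[coarse + 1]
    return lo + offset * (hi - lo)


# Made-up example column: a measurement with two missing entries.
column = [52.0, None, 61.5, 48.0, None, 70.2, 55.1, 66.3]
edges = fit_bin_edges(column)
encoded = [to_coarse_and_fine(v, edges) for v in column]
decoded = [from_coarse_and_fine(c, o, edges) for c, o in encoded]

for original, (coarse, offset) in zip(column, encoded):
    print(original, "->", coarse, offset)
```

In a cascaded setup of this kind, a first model would only need to generate the coarse labels, including the missing marker, and a second model would fill in the offsets conditional on those labels. How TabCascade actually defines and trains its two stages is described in the researchers' paper and code, not in this sketch.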
This cascaded approach enables TabCascade to capture subtle statistical patterns, important relationships between variables, rare events, and complex mixed-type features that existing methods often miss. Importantly, it is the first diffusion- or flow-based generative model capable of realistically generating missing values alongside continuous and categorical information within the same framework. Existing models combine distinct diffusion processes for different data types into a single model with a unified training objective. As a result, these models implicitly favour the generation of one feature type over another, which can degrade overall sample quality. The cascaded structure of TabCascade, with its separate training objectives, ensures that all feature types get the necessary attention.
In benchmark tests, TabCascade significantly improved the realism of synthetic datasets. In one key evaluation, which measures how well machine learning models can tell synthetic samples apart from real-world data, the method improved on the next best-performing approach by more than 50 percent, making its synthetic samples substantially harder to detect.
- More information
For more information, please contact Ronald de Groot, Media and Public Relations Officer at Erasmus School of Economics, rdegroot@ese.eur.nl, or +31 6 53 641 846.