IBM Synthetic Data Sets

An IBM Redpaper publication

Published on 03 February 2025, updated 03 February 2025

ISBN-10: 0738461997
ISBN-13: 9780738461991
IBM Form #: REDP-5748-00


Authors: Erik Altman, Dipali Aphale, Joy Deng, Yadu Nandan B, Saurabh Srivastava and Kelly Xiang

    Abstract

    IBM Synthetic Data Sets is a family of artificially generated, enterprise-grade datasets that enhance predictive artificial intelligence (AI) model training and large language models (LLMs) to benefit IBM Z® and IBM LinuxONE clients, ecosystems, and independent software vendors. These pre-built datasets are downloadable and packaged as comma-separated values (CSV) and data definition language (DDL) files, making them familiar to use and compatible with everything from databases and spreadsheets to hardware platforms and standard AI tools. The datasets also draw on IBM® industry expertise and domain knowledge of the financial services sector without using any real client seed data, which alleviates security concerns about Personally Identifiable Information (PII). Real data at client sites is often limited in scope to the organization's own transactions, and clients do not always know which transactions are fraudulent. To address this scenario, IBM Synthetic Data Sets were tailored for fraud detection use cases, so that clients can download them to develop predictive AI models and LLMs for financial services, or to optimize existing models for improved accuracy and risk mitigation.

    The IBM Synthetic Data Sets family includes the following datasets:

    * IBM Synthetic Data Sets for Payment Cards

    * IBM Synthetic Data Sets for Core Banking and Money Laundering

    * IBM Synthetic Data Sets for Homeowners Insurance

    This IBM Redbooks® publication introduces IBM Synthetic Data Sets and provides information about how IBM Synthetic Data Sets can enhance and optimize your predictive AI model training and LLMs.

    Table of Contents

    Executive Overview

    Introducing IBM Synthetic Data Sets

    Dataset deep dive

    Available editions

    Previewing data schemas

    Using real data versus synthetic data

    Data generation methodology

    Artificial intelligence ethics

    Legal usage terms

    Getting started

    Frequently asked questions

    Additional resources

    Appendix: Data schemas for IBM Synthetic Data Sets