Generate synthetic data in the browser

Supported datasets

Dataset	Type	Variables (inputs + target)	Description
Friedman 1	Regression	10 + 1	y = 10 * sin(Pi * x1 * x2) + 20 * (x3 - 0.5) ** 2 + 10 * x4 + 5 * x5 + e
Friedman 2	Regression	4 + 1	y = sqrt(x1 ** 2 + (x2 * x3 - 1 / (x2 * x4)) ** 2) + e
Friedman 3	Regression	4 + 1	y = atan((x2 * x3 - 1 / (x2 * x4)) / x1) + e
Peak	Regression	10 + 1	Peak Benchmark Problem. From: mlbench
Hastie	Classification	10 + 1	Binary classification problem used in Hastie et al.
Moons	Classification	2 + 1	Two interleaving half circles
Spirals	Classification	2 + 1	Two entangled spirals
Ringnorm	Classification	10 + 1	Ringnorm benchmark problem. From: Breiman, L. (1996). Bias, variance, and arcing classifiers
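
The Friedman 1 formula from the table can be sketched in a few lines of plain Python. This is an illustration only, not the code used by the app: the function name `friedman1`, the uniform [0, 1) inputs, and the Gaussian noise with standard deviation `noise_sd` are assumptions made for the example.

```python
import math
import random


def friedman1(n_samples, noise_sd=1.0, seed=0):
    """Return (X, y) for a Friedman 1 style dataset.

    Assumptions (not taken from the app): inputs are uniform on [0, 1),
    and the noise term e is Gaussian with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        # 10 input variables; only the first 5 enter the formula.
        row = [rng.random() for _ in range(10)]
        x1, x2, x3, x4, x5 = row[:5]
        signal = (
            10 * math.sin(math.pi * x1 * x2)
            + 20 * (x3 - 0.5) ** 2
            + 10 * x4
            + 5 * x5
        )
        X.append(row)
        y.append(signal + rng.gauss(0.0, noise_sd))
    return X, y


X, y = friedman1(5)
```

The last five inputs do not affect y at all, which is what makes this kind of dataset handy for checking whether a method can ignore irrelevant variables.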

Unlimited data size

In the real world, data collection is almost always an expensive and complex process. Artificial data is an easier and faster alternative for testing statistical and machine learning methods. As long as you have enough RAM and disk space, you can generate any number of records.
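
One way to keep memory use flat while producing many records is to generate them lazily, one at a time. The sketch below does this for a Moons-like dataset. The name `moons_stream`, the half-circle parametrization (borrowed from the common "two moons" construction), and the Gaussian noise are assumptions for illustration, not the app's implementation.

```python
import itertools
import math
import random


def moons_stream(noise_sd=0.1, seed=0):
    """Yield (x1, x2, label) records forever, one record at a time.

    Assumed geometry: class 0 lies on the upper unit half circle,
    class 1 on a shifted, flipped half circle that interleaves with it.
    """
    rng = random.Random(seed)
    while True:
        label = rng.randrange(2)
        t = rng.uniform(0.0, math.pi)
        if label == 0:
            x1, x2 = math.cos(t), math.sin(t)
        else:
            x1, x2 = 1.0 - math.cos(t), 0.5 - math.sin(t)
        yield (
            x1 + rng.gauss(0.0, noise_sd),
            x2 + rng.gauss(0.0, noise_sd),
            label,
        )


# Take as many records as needed; only this slice is held in memory.
first_rows = list(itertools.islice(moons_stream(), 1000))
```

Because the generator never builds the full dataset, the number of records is limited only by where they are written, not by the generator itself.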

Known generating functions

In many practical cases, observations are noisy and the data-generating function is not fully known, which makes model evaluation harder. Synthetic datasets, by contrast, have transparent rules and procedures under the hood. StatSim Gen uses mkdata, an open-source library whose code is publicly available on GitHub.
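
A known generating function gives a concrete yardstick for evaluation: a model cannot be expected to beat the noise level, so the error of the true function marks the best achievable score. The sketch below illustrates this with the Friedman 1 formula. The helper names, the uniform inputs, and the Gaussian noise with `noise_sd = 1.0` are assumptions made for the example.

```python
import math
import random


def friedman1_signal(row):
    """Noise-free Friedman 1 target for one row of inputs."""
    x1, x2, x3, x4, x5 = row[:5]
    return (
        10 * math.sin(math.pi * x1 * x2)
        + 20 * (x3 - 0.5) ** 2
        + 10 * x4
        + 5 * x5
    )


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(42)
noise_sd = 1.0  # assumed noise level
X = [[rng.random() for _ in range(10)] for _ in range(2000)]
signal = [friedman1_signal(row) for row in X]
y = [s + rng.gauss(0.0, noise_sd) for s in signal]

# Trivial baseline: always predict the mean of y.
mean_y = sum(y) / len(y)
baseline_rmse = rmse([mean_y] * len(y), y)

# "Oracle": the true generating function, whose error is only the noise.
oracle_rmse = rmse(signal, y)

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"oracle RMSE:   {oracle_rmse:.3f}")
```

Any fitted model should land between the two numbers; a score close to the oracle's means there is little left to learn from this data.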

Save results as CSV files

The comma-separated values (CSV) format is probably the most popular way to store tabular data, and most data processing libraries and programs support it. Save the results as a CSV file and then load it into another app. You can preview or profile CSV files using StatSim Preview and StatSim Profile, our free web apps, or fit an XGBoost model in StatSim Fit.
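
Loading such a file elsewhere usually takes only a few lines. The sketch below writes a small table to CSV and reads it back with Python's standard `csv` module. The column names `x1`, `x2`, `y`, the file name, and the toy linear target are hypothetical and chosen only for the example; the header of a file exported by the app may differ.

```python
import csv
import os
import random
import tempfile

rng = random.Random(0)

# Toy records (hypothetical): two inputs and a noisy linear target.
records = []
for _ in range(100):
    x1, x2 = rng.random(), rng.random()
    records.append((x1, x2, 2.0 * x1 - x2 + rng.gauss(0.0, 0.1)))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "synthetic.csv")

    # Write a header row followed by the data rows.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x1", "x2", "y"])
        writer.writerows(records)

    # Read the file back; every CSV field arrives as a string.
    with open(path, newline="") as f:
        loaded = [
            {name: float(value) for name, value in row.items()}
            for row in csv.DictReader(f)
        ]
```

Since CSV stores everything as text, the reader has to convert fields back to numbers, as done here with `float`.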

If you enjoyed the app, star us on GitHub. To report errors, create an issue.
