StatSim Gen Version: 0.0.2

Generate synthetic data in the browser

Synthetic datasets for machine learning experiments

Supported datasets

Dataset | Type | Variables | Description
Friedman 1 | Regression | 10 + 1 | y = 10 * sin(Pi * x1 * x2) + 20 * (x3 - 0.5) ** 2 + 10 * x4 + 5 * x5 + e
Friedman 2 | Regression | 4 + 1 | y = sqrt(x1 ** 2 + (x2 * x3 - 1 / (x2 * x4)) ** 2) + e
Friedman 3 | Regression | 4 + 1 | y = atan((x2 * x3 - 1 / (x2 * x4)) / x1) + e
Peak | Regression | 10 + 1 | Peak Benchmark Problem. From: mlbench
Hastie | Classification | 10 + 1 | Binary classification problem used in Hastie et al.
Moons | Classification | 2 + 1 | Two interleaving half circles
Spirals | Classification | 2 + 1 | Two entangled spirals
Ringnorm | Classification | 10 + 1 | Breiman, L. (1996). Bias, variance, and arcing classifiers
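
To make the Friedman 1 formula concrete, here is a minimal Python sketch of how such a dataset could be generated. It is not the mkdata implementation. The function name friedman1, its parameters, the uniform [0, 1) inputs and the Gaussian noise term follow the usual textbook definition of the problem and are assumptions here; the app's own defaults may differ.

```python
import math
import random


def friedman1(n_samples, n_features=10, noise=1.0, seed=None):
    """Return (X, y) for the Friedman 1 regression problem.

    Only the first five features enter the formula; any remaining
    features are irrelevant inputs that a model has to learn to ignore.
    """
    if n_features < 5:
        raise ValueError("Friedman 1 needs at least 5 features")
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        # Assumption: inputs are independent and uniform on [0, 1).
        x = [rng.random() for _ in range(n_features)]
        signal = (
            10 * math.sin(math.pi * x[0] * x[1])
            + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3]
            + 5 * x[4]
        )
        # Assumption: e is Gaussian with mean 0 and standard deviation `noise`.
        y.append(signal + rng.gauss(0.0, noise))
        X.append(x)
    return X, y


X, y = friedman1(n_samples=1000, seed=42)
```

Passing the same seed reproduces the same records, which is convenient when comparing methods on identical data.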


Unlimited data size

In the real world, data collection is almost always an expensive and complex process. Artificial data is an easier and faster alternative for testing statistical and machine learning methods. As long as you have enough RAM and disk space, you can generate any number of records.

Known generating functions

In many practical cases, observations are noisy and the data-generating function is not fully known, which makes model evaluation harder. Synthetic datasets, by contrast, have transparent rules and procedures under the hood. StatSim Gen uses mkdata, an open-source library whose code is publicly available on GitHub.
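
One practical benefit of a known generating function is that the best achievable error can be computed directly. The Python sketch below illustrates this with the Friedman 2 formula: it scores the true function against noisy targets, so the resulting mean squared error estimates the noise variance, a floor that no fitted model can be expected to beat on fresh data. The input ranges and the noise level of 125 follow the conventional definition of the problem and are assumptions here, as are the helper names friedman2_signal and oracle_mse; none of this is taken from mkdata.

```python
import math
import random


def friedman2_signal(x1, x2, x3, x4):
    """Noise-free Friedman 2 response."""
    return math.sqrt(x1 ** 2 + (x2 * x3 - 1 / (x2 * x4)) ** 2)


def oracle_mse(n_samples=20000, noise_sd=125.0, seed=0):
    """Mean squared error of the true function against noisy targets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Assumption: conventional Friedman 2 input ranges.
        x1 = rng.uniform(0, 100)
        x2 = rng.uniform(40 * math.pi, 560 * math.pi)
        x3 = rng.uniform(0, 1)
        x4 = rng.uniform(1, 11)
        signal = friedman2_signal(x1, x2, x3, x4)
        y = signal + rng.gauss(0.0, noise_sd)  # observed, noisy target
        total += (y - signal) ** 2
    return total / n_samples


floor = oracle_mse()
print(f"oracle MSE: {floor:.0f}, noise variance: {125.0 ** 2:.0f}")
```

A model whose test error sits close to this floor has little left to learn. With real data the floor is unknown, so the same comparison is not available.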

Save results as CSV files

The comma-separated values (CSV) format is probably the most popular way to store tabular data, and most data processing libraries and programs support it. Save the results as a CSV file, then load it into another app. You can preview or profile CSV files with StatSim Preview and StatSim Profile, our free web apps, or fit an XGBoost model in StatSim Fit.
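
For readers who want to reproduce the export step outside the browser, the following Python sketch generates a small Moons dataset and writes it in CSV form using only the standard library. The column names x1, x2 and label, the helper names moons_rows and write_csv, and the half-circle parameterization are illustrative assumptions; the files exported by the app may be laid out differently.

```python
import csv
import io
import math
import random


def moons_rows(n_samples, noise=0.1, seed=None):
    """Yield (x1, x2, label) rows for two interleaving half circles."""
    rng = random.Random(seed)
    for i in range(n_samples):
        label = i % 2  # alternate between the two classes
        t = rng.uniform(0.0, math.pi)
        if label == 0:
            x1, x2 = math.cos(t), math.sin(t)  # upper half circle
        else:
            x1, x2 = 1.0 - math.cos(t), 0.5 - math.sin(t)  # shifted lower half circle
        # Assumption: independent Gaussian jitter on both coordinates.
        yield x1 + rng.gauss(0.0, noise), x2 + rng.gauss(0.0, noise), label


def write_csv(rows, header, stream):
    """Write a header line and the rows to `stream`; return the row count."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


buffer = io.StringIO()
n_written = write_csv(moons_rows(6, seed=1), ["x1", "x2", "label"], buffer)
print(buffer.getvalue())
```

To write to disk instead of memory, pass a file opened with open("moons.csv", "w", newline="") as the stream. Because moons_rows is a generator, rows are written one at a time and the full dataset never has to fit in RAM.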

If you enjoyed the app, star us on GitHub. To report errors, create an issue.
