Dataset | Type | Variables (features + target) | Description |
---|---|---|---|
Friedman 1 | Regression | 10 + 1 | y = 10 * sin(Pi * x1 * x2) + 20 * (x3 - 0.5) ** 2 + 10 * x4 + 5 * x5 + e |
Friedman 2 | Regression | 4 + 1 | y = sqrt(x1 ** 2 + (x2 * x3 - 1 / (x2 * x4)) ** 2) + e |
Friedman 3 | Regression | 4 + 1 | y = atan((x2 * x3 - 1 / (x2 * x4)) / x1) + e |
Peak | Regression | 10 + 1 | Peak Benchmark Problem. From: mlbench |
Hastie | Classification | 10 + 1 | Binary classification problem used in Hastie et al |
Moons | Classification | 2 + 1 | Two interleaving half circles |
Spirals | Classification | 2 + 1 | Two entangled spirals |
Ringnorm | Classification | 10 + 1 | Breiman, L. (1996). Bias, variance, and arcing classifiers |
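
To make the table more concrete, here is a minimal sketch of the Friedman 1 formula in plain Python. It is an illustration only, not the mkdata implementation: the function name `friedman1`, its parameters, the uniform [0, 1) inputs, and the Gaussian noise term are all assumptions made for this example.

```python
import math
import random


def friedman1(n_samples, n_features=10, noise=1.0, seed=0):
    """Generate rows [x1, ..., xn, y] following the Friedman 1 formula.

    Hypothetical helper for illustration. Inputs are assumed uniform on
    [0, 1) and e is assumed Gaussian with standard deviation `noise`.
    """
    if n_features < 5:
        raise ValueError("Friedman 1 needs at least 5 features")
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        x = [rng.random() for _ in range(n_features)]
        # Only x1..x5 enter the formula; the remaining features are pure noise.
        y = (
            10 * math.sin(math.pi * x[0] * x[1])
            + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3]
            + 5 * x[4]
            + rng.gauss(0.0, noise)
        )
        rows.append(x + [y])
    return rows


rows = friedman1(n_samples=5)
print(len(rows), len(rows[0]))  # number of rows, then 10 features + 1 target
```

Each row holds 10 input columns followed by the target, which matches the "10 + 1" entry in the table.
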

In the real world, data collection is almost always an expensive and complex process. Artificial data is an easier and faster alternative for testing statistical and machine learning methods. As long as you have enough RAM and disk space, you can generate any number of records.

In many practical cases, observations are noisy, and the data-generating function is not fully known. That makes model evaluation harder. Luckily, all synthetic datasets have transparent rules and procedures under the hood. StatSim Gen uses mkdata, an open-source library whose code is publicly available on GitHub.
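
Here is a rough sketch of why a known generating function helps with evaluation. It uses the Friedman 1 formula with an assumed noise level and assumed uniform [0, 1) inputs (neither is taken from mkdata). Because the true function is known, the error of a perfect model can be measured directly and compared with a naive baseline.

```python
import math
import random


def true_friedman1(x):
    """Noise-free Friedman 1 target for one row of inputs."""
    return (
        10 * math.sin(math.pi * x[0] * x[1])
        + 20 * (x[2] - 0.5) ** 2
        + 10 * x[3]
        + 5 * x[4]
    )


rng = random.Random(42)
noise_sd = 1.0  # assumed noise level for this illustration
n = 20000

xs, ys = [], []
for _ in range(n):
    x = [rng.random() for _ in range(10)]
    xs.append(x)
    ys.append(true_friedman1(x) + rng.gauss(0.0, noise_sd))

# "Oracle" model: predicts with the true function, so only the noise remains.
mse_oracle = sum((y - true_friedman1(x)) ** 2 for x, y in zip(xs, ys)) / n

# Naive baseline: always predicts the mean of the targets.
mean_y = sum(ys) / n
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / n

print(f"oracle MSE:   {mse_oracle:.3f}")
print(f"baseline MSE: {mse_baseline:.3f}")
```

The oracle error should land close to `noise_sd ** 2`, which is the best any model can achieve on this data. With real data that floor is usually unknown.
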

The comma-separated values (CSV) format is probably the most popular for storing tabular data. Most data processing libraries and programs support it. Just save the results as a CSV file and then load it into another app. You can preview or profile CSV files using StatSim Preview and StatSim Profile, our free web apps, or fit an XGBoost model in StatSim Fit.
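
As a small illustration of the CSV round trip, the sketch below writes a few rows to a file and reads them back with Python's built-in `csv` module. The column names, the sample values, and the file name `sample.csv` are made up for this example and do not come from StatSim Gen.

```python
import csv
import os
import tempfile

# Hypothetical sample: two features and a class label, like the "2 + 1" datasets.
header = ["x1", "x2", "label"]
rows = [
    [0.12, 0.98, 0],
    [1.05, -0.47, 1],
    [0.51, 0.33, 0],
]

# Write to a temporary directory so the example leaves the working folder untouched.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# CSV stores everything as text, so convert the columns back to numbers on load.
with open(path, newline="") as f:
    loaded = [
        {"x1": float(r["x1"]), "x2": float(r["x2"]), "label": int(r["label"])}
        for r in csv.DictReader(f)
    ]

print(loaded[0])
```

The explicit `float` and `int` conversions matter because a CSV file carries no type information. Any app that loads the file has to infer or be told the column types.
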