BayesDB

Query the probable implications of your tabular data as easily as an SQL database lets you query the data itself.

Overview

BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

BayesDB is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables. No preprocessing or parameter adjustment is required, though experts can override BayesDB's default assumptions when appropriate.

BayesDB assumes that each row of your table is a sample from some fixed population or generative process, and estimates the joint distribution over the columns. BQL then allows you to draw Bayesian inferences about individuals and about the overall population or process. The estimates are currently provided by CrossCat, a new nonparametric Bayesian method for analyzing high-dimensional data tables.

Examples

INFER salary FROM mytable WHERE age > 30 WITH CONFIDENCE 0.95;

Fill in missing data with the INFER command. Unlike traditional regression, which requires training a separate supervised model for each column you want to predict, a single INFER statement can predict any set of columns.
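To make the role of a confidence threshold concrete, here is a minimal Python sketch of one plausible reading of WITH CONFIDENCE: a missing cell is filled only when the most probable value accounts for at least the requested share of the model's predictive draws. This is an illustration under stated assumptions, not BayesDB's implementation. The function `infer_with_confidence`, the categorical "salary band" values, and the sample draws are all hypothetical.

```python
from collections import Counter

def infer_with_confidence(samples, confidence):
    """Return the most frequent sampled value if its share of the samples
    reaches `confidence`; otherwise return None (leave the cell missing).

    `samples` stands in for draws from a model's predictive distribution
    for a single missing cell (an assumption made for this sketch).
    """
    if not samples:
        return None
    value, count = Counter(samples).most_common(1)[0]
    return value if count / len(samples) >= confidence else None

# Hypothetical predictive draws for two missing salary-band cells.
confident_draws = ["high"] * 97 + ["low"] * 3    # 97% agree on "high"
uncertain_draws = ["high"] * 60 + ["low"] * 40   # only 60% agree

filled = infer_with_confidence(confident_draws, 0.95)   # value is imputed
skipped = infer_with_confidence(uncertain_draws, 0.95)  # cell stays missing
```

Under this reading, raising the threshold trades coverage for reliability: fewer cells are filled, but each filled value is better supported.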

SIMULATE salary FROM mytable WHERE age > 30;

Easily simulate new probable observations based on CrossCat's estimate of the joint density of the data.
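The idea of simulating from a joint distribution under a condition can be sketched in Python with rejection sampling: draw whole rows from a generative model and keep only those that satisfy the WHERE clause. The toy generative process below (`toy_joint_sample`) and its numbers are invented for illustration and merely stand in for an estimated joint density. It is not CrossCat, and BayesDB may condition differently.

```python
import random

def toy_joint_sample(rng):
    """Draw one (age, salary) pair from a made-up generative process."""
    age = rng.uniform(20, 65)
    salary = 20000 + 1000 * age + rng.gauss(0, 5000)
    return age, salary

def simulate_salary(n, min_age, seed=0):
    """Simulate n salaries conditioned on age > min_age by rejection sampling."""
    rng = random.Random(seed)
    salaries = []
    while len(salaries) < n:
        age, salary = toy_joint_sample(rng)
        if age > min_age:          # keep only rows that satisfy the condition
            salaries.append(salary)
    return salaries

simulated = simulate_salary(5, min_age=30)
```

Rejection sampling is simple but wasteful when the condition is rarely met. It is used here only because it makes the link between the joint distribution and the conditional one easy to see.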

ESTIMATE PAIRWISE DEPENDENCE PROBABILITIES FROM mytable;

With just one command, estimate a pairwise function of columns, such as the probability that two columns are statistically dependent, the mutual information between them, or their correlation.
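As a rough illustration of what a pairwise function of columns looks like, the Python sketch below computes empirical mutual information for every pair of discrete columns in a small table. It uses plain frequency counts. BayesDB's dependence probabilities are model-based estimates, so this is a simplified stand-in, and the helper names and the toy table are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def pairwise(table, fn):
    """Apply fn to every unordered pair of columns in a dict of column lists."""
    return {(a, b): fn(table[a], table[b]) for a, b in combinations(table, 2)}

table = {
    "a": [0, 0, 1, 1],
    "b": [0, 0, 1, 1],  # identical to column a
    "c": [0, 1, 0, 1],  # unrelated to column a in this sample
}
mi = pairwise(table, mutual_information)
```

In this toy table the pair (a, b) has mutual information log 2, the maximum for a balanced binary column, while the pairs involving c have mutual information zero.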