Data mining / machine learning with BayesiaLab

Discover the knowledge buried in your databases

Do you have data that you are unable to analyse? Do you have questions about the relations between your database variables or the segmentation of your records? With BayesiaLab, it takes only a few minutes to discover the hidden relations in your data.

Step 1 - Importing data with the BayesiaLab wizard

  • Choice of database separators
  • Definition of missing and filtered values
  • Sampling and breakdown of data into learning/test sets
  • Variable typing (discrete, continuous, weight, or learning/test indicator)
  • Pre-processing: aggregation of the states of discrete variables, discretization of continuous variables, and choice of a target variable
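
Two of the wizard steps above, the learning/test breakdown and the discretization of a continuous variable, can be illustrated with a minimal Python sketch. This is not BayesiaLab code: the function names, the 20% test fraction, the use of None for missing values and the equal-frequency binning rule are all assumptions chosen for illustration.

```python
import bisect
import random


def split_learning_test(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into a learning set and a test set (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def equal_frequency_cuts(values, n_bins):
    """Return n_bins - 1 thresholds giving roughly equal counts per interval.

    Missing values (None here, by assumption) are ignored.
    """
    ordered = sorted(v for v in values if v is not None)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


def discretize(value, cuts):
    """Map a continuous value to the index of its interval; keep missing as None."""
    if value is None:
        return None
    return bisect.bisect_right(cuts, value)
```

For example, `equal_frequency_cuts([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)` yields two thresholds that place three values in each of the three intervals.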

Step 2 - Learning: several methods, a wide range of algorithms

Once the data is imported, BayesiaLab has created a network containing the variables of interest. The next task is to find the relations between them. BayesiaLab offers the following methods:

  • Association discovery (unsupervised learning) explores the set of direct probabilistic relations in the data.

    The number of Bayesian networks that can be built over a given number of variables is so large that an exhaustive search for the best network is impossible (except in extreme cases). The learning algorithms therefore rely on heuristics that reduce the search space. BayesiaLab comes with several structural learning algorithms (which discover the network structure and estimate the corresponding conditional probability tables). They are conceptually different and are listed below from the fastest to the slowest. Because the heuristics differ, the results of each method can differ. However, since every learning method uses the same metric (the MDL score), the resulting networks can be easily compared. The score is shown in the console and is also automatically inserted in the comment associated with the network. The lower the score, the better the network.
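
To give an idea of how quickly the number of candidate structures grows, the sketch below counts the labeled directed acyclic graphs on n nodes using Robinson's recurrence. It is a general combinatorial result, not something taken from BayesiaLab, and the function name is chosen for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming arc.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )
```

Five variables already admit 29,281 possible structures, which is why heuristic search is needed for realistic databases.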

    • Maximum spanning tree: by far the quickest unsupervised learning algorithm, because it relies on only two passes. The first pass computes the a priori weight of every binary relation between the variables; the second builds the maximum weight spanning tree from those relations. Even though the resulting network is not optimal, it can be used for a first imputation of the missing values, as the initial network before running Taboo or EQ, and for variable clustering when there are many variables.

      The user can choose between two scoring methods for this learning: Minimum Description Length and Pearson's correlation.

    • Taboo: structural learning that implements Taboo search in the space of Bayesian networks. This method is particularly useful for refining a network built by human experts or for updating a network learned on a different data set. Indeed, besides taking into account the a priori knowledge represented by a network and an equivalent number of cases, Taboo starts from the current network rather than from the fully unconnected network (no arcs), which is the starting point of SopLEQ and Taboo Order. Furthermore, fixed arcs (the blue ones) remain unchanged.
    • EQ: search method that explores the equivalence classes of Bayesian networks. This method is very efficient because it avoids many local minima and strongly reduces the size of the search space. Like Taboo, EQ can start from the current network. However, fixed arcs are treated as normal arcs.
    • SopLEQ: search method based on a global characterization of the data and on the exploitation of the equivalence properties of Bayesian networks.
    • Taboo Order: learning method that uses Taboo search in the space of orderings of the Bayesian network nodes. Indeed, finding the best Bayesian network for a fixed node order is an easy task: it only consists in choosing the parents of each node among the nodes that precede it in that order. This is the most complete search method, but also the most time-consuming.
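
The two passes of the maximum spanning tree method described above can be sketched as follows. This is an illustrative Python version, not BayesiaLab's implementation: it assumes numeric columns with non-zero variance, uses the absolute value of Pearson's correlation as the weight of each relation, and builds the tree with Kruskal's algorithm. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation of two equal-length numeric columns (non-zero variance assumed)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def maximum_spanning_tree(columns):
    """columns: dict mapping a variable name to its list of values.

    Returns the undirected edges of a maximum weight spanning tree.
    """
    # Pass 1: weight every pair of variables, strongest relation first.
    weighted = sorted(
        ((abs(pearson(columns[a], columns[b])), a, b)
         for a, b in combinations(columns, 2)),
        reverse=True,
    )
    # Pass 2: Kruskal's algorithm with a small union-find structure.
    parent = {name: name for name in columns}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, a, b in weighted:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge creates no cycle
            parent[root_a] = root_b
            tree.append((a, b))
    return tree
```

The result is an undirected tree with one edge fewer than the number of variables; turning it into a Bayesian network additionally requires choosing a root and orienting the arcs, which this sketch leaves out.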
  • Supervised learning focuses entirely on characterizing the target variable, so that its probabilistic profile can be established quickly.

    BayesiaLab provides several supervised learning algorithms:

    • Naive Bayes: Bayesian network with a predefined architecture in which the target node is the parent of all the other nodes. This structure states that the target node is the cause of all the other nodes and that knowing its value makes each node independent of the others. Although these strong assumptions are false in most cases, the small number of probabilities to estimate makes this structure very robust, and learning is very fast because only the probabilities have to be estimated.
    • Augmented naive Bayes: partially predefined structure that relaxes the strong conditional independence constraint mentioned above. It consists of a naive architecture enriched with the relations that hold between the child nodes given the value of the target node (their common parent).

      The prediction accuracy of this algorithm is better than that of the naive architecture, but the unsupervised search for the relations between child nodes can be time-consuming.

    • Tree Augmented Naive Bayes: partially predefined structure that relaxes the strong conditional independence constraint of the naive architecture.

      It consists of a naive architecture on which a maximum spanning tree is learned. Its prediction accuracy is better than that of the naive architecture but not as good as that of Augmented Naive Bayes; however, it is much quicker than the latter.

    • Sons & Spouses: structure in which the target node is the parent of a subset of nodes that may have other parents (spouses).

      This structure is to some extent an augmented naive architecture in which the set of children is not fixed a priori but is searched according to the marginal dependence of the nodes on the target. The algorithm therefore has the advantage of highlighting the nodes that are not correlated with the target. Its learning time is comparable to that of the augmented naive architecture.

    • Markov Blanket learning: algorithm that searches for the nodes belonging to the Markov Blanket of the target node, i.e. its fathers, sons and spouses. Knowing the values of all the nodes in this subset makes the target node independent of all the other nodes. Because the search is entirely focused on the target node, it finds the subset of really useful nodes much more quickly than the two previous algorithms. Furthermore, this method is a very powerful variable selection algorithm and is the ideal tool for analysing a variable: it yields a restricted number of connected nodes with different kinds of probabilistic relations:

      • fathers: nodes that bring more information jointly than alone;
      • sons: nodes having a direct probabilistic dependence with the target;
      • spouses: nodes that are marginally independent of the target but become informative once the value of the son is known.
    • Augmented Markov Blanket learning: algorithm that is initialized with the Markov Blanket structure and then uses an unsupervised search to find the probabilistic relations that hold between the variables belonging to this Markov Blanket. This unsupervised search takes additional time but gives better prediction results than plain Markov Blanket learning.
    • Minimal Augmented Markov Blanket learning: the variable selection performed by the Markov Blanket learning algorithm is based on a heuristic search. The set of selected nodes may therefore not be minimal, especially when there are several influence paths between the nodes and the target. In that case, the target analysis takes too many nodes into account. By applying an unsupervised learning algorithm to the set of selected nodes, Minimal Augmented Markov Blanket learning reduces this set and thus gives a more accurate target analysis. However, if the task is purely predictive (a scoring function, for example), the Augmented Markov Blanket algorithm is usually more accurate than its Minimal version because it uses more pieces of evidence.
    • Semi-Supervised learning: unsupervised learning algorithm that searches for the relationships between the nodes lying within a predefined distance of the target. This distance is computed using the Markov Blanket learning algorithm. Semi-supervised learning thus learns a network fragment centered on the target variable. It is very useful for tasks that involve many nodes, for example microarray analysis (thousands of genes), and for prediction tasks where the Markov Blanket nodes have missing values, since these nodes then no longer separate the target node from the other nodes.
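
The notions of fathers, sons and spouses used by the Markov Blanket algorithms above can be made concrete with a short Python sketch. It only reads the Markov Blanket off a network whose structure is already known; it is not the learning algorithm itself, and the representation (a dictionary mapping each node to the set of its parents) and the function name are assumptions made for illustration.

```python
def markov_blanket(parents, target):
    """Return (fathers, sons, spouses) of target in a known network.

    parents: dict mapping every node to the set of its parent nodes.
    """
    fathers = set(parents.get(target, ()))
    # Sons are the nodes that list the target among their parents.
    sons = {node for node, ps in parents.items() if target in ps}
    # Spouses are the other parents of the sons.
    spouses = {p for son in sons for p in parents[son]} - {target}
    return fathers, sons, spouses


# Hypothetical network: A -> T, T -> S, B -> S, S -> C
network = {"A": set(), "B": set(), "T": {"A"}, "S": {"T", "B"}, "C": {"S"}}
fathers, sons, spouses = markov_blanket(network, "T")
```

In this hypothetical network, A is a father, S is a son and B is a spouse of T, while C lies outside the Markov Blanket: once A, S and B are known, C brings no further information about T.
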
  • Unsupervised learning for the search for new concepts (segmentation, clustering) divides your records into semantically meaningful classes (partitions) that group together records sharing certain characteristics. You can therefore easily create a typology (customers, patients, products, etc.) and define policies adapted to each segment discovered.

    Here are the available clustering methods and their parameters:

    • A fixed number of classes: the algorithm tries to segment the data into a given number of classes (from 2 to 127). However, it may return fewer clusters than requested;
    • Automatic selection of the number of classes: a random walk is used to find the optimal number of classes, starting with the specified number of clusters and increasing that number until empty or unstable clusters appear or the specified maximum number of clusters is reached. The random walk is guided by the results obtained at each step;
    • Options:
      • the sample size option makes it possible to search for the optimal number of classes on data subsets in order to speed up convergence (one sample per step/trial). The partition with the best score is then used as the initial partition for the search on the entire data set;
      • the number of steps for the random walk. Since the search can be stopped at any time by clicking on the red light in the status bar while keeping the best clustering found, this number can be set very high;
      • a weight can also be given to each variable of the network. These weights, which default to 1, guide the clustering: a weight greater than 1 makes the variable count more during the clustering, and a zero weight makes the variable purely illustrative.

    At the end of the segmentation, the result is analysed automatically and a textual report is returned. This report is a "Target Report Analysis" and contains some additional information.

    A graphical representation of the created clusters can be generated. This graph displays three properties of the clusters found:

    • the color represents the purity of the clusters: the bluer a cluster, the purer it is
    • the size represents the prior probability of the cluster
    • the distance between two clusters represents the mean neighborhood of the clusters
  • The clustering of variables is particularly useful for discovering new concepts and synthesizing the hidden variables corresponding to the unsupervised classification. You can therefore discover the probabilistic relations between these latent variables and the manifest variables (hierarchical models, automatic structural equations).

Step 3 - Analysis using the created network

Examples of data mining applications