Français Search

Data mining / automatic learning with BayesiaLab

Discover the knowledge buried in your databases

Data mining avec BayesiaLab

Do you have data that you are unable to analyse? Do you have questions related to the relations between your database variables and the segmentation of your records? It will scarcely take you a few minutes to discover your data’s hidden relations with BayesiaLab.

Step 1 – Import of data using the BayesiaLab wizard

  • Choice of database separators
  • Definition of missing and filtered values
  • Sampling and breakdown of data into learning/test sets
  • Variable typing (discrete, continuous, weight, type of data (learning or test))
  • Pre-processing using the pooling procedure tools of discrete modalities and discretization of continuous variables, choice of a target variable

Step 2 – Learning: several methods, a wide range of algorithms

The data is imported : BayesiaLab has created a network with the interesting variables. It is now a question of finding the relations between them. Here are the methods offered by BayesiaLab :

  • By finding associations (unsupervised learning), the data’s set of direct probabilistic relations can be explored. + learn more »

    Le nombre de réseaux bayésiens possibles est tel qu'il est impossible (sauf cas extrêmes) de faire une recherche exhaustive du meilleur réseau. Les méthodes d'apprentissage reposent donc sur des heuristiques permettant de réduire l'espace de recherche. BayesiaLab propose cinq méthodes d'apprentissage structurel (structure du réseau bayésien et tables de probabilités) conceptuellement différentes, de la plus rapide à la plus lente. Les heuristiques mises en oeuvre étant différentes, les résultats entre chaque méthode peuvent différer. Par contre, toutes ces méthodes reposent sur la minimisation de la même métrique (score MDL).

    • Arbre de recouvrement maximal : s'effectuant en seulement deux passes, cet algorithme est de loin le plus rapide. La première passe consiste à calculer a priori le poids de toutes les relations binaires entre variables, la seconde consiste ensuite à construire l'arbre de recouvrement maximal avec ces relations. Même si le réseau obtenu n'est pas optimal, il peut être utilisé pour une première imputation des valeurs manquantes, comme réseau initial pour une recherche Taboo ou EQ, ou encore pour construire le graphe nécessaire à la segmentation de variables lorsqu'il y a beaucoup de variables. Il est possible de choisir parmi deux scores différents pour cet apprentissage : le Minimum Description Length et la corrélation de Pearson.
    • Taboo : apprentissage reposant sur la stratégie de recherche Taboo dans l'espace défini par les réseaux bayésiens. Cette méthode est particulièrement utile pour l'affinage de réseaux construits par expertise ou pour la mise à jour de réseaux appris sur des ensembles d'apprentissage différents. En effet, au delà de la prise en compte de la connaissance a priori définie par un réseau Bayésien et un nombre d'exemples équivalant, cet algorithme commence sa recherche à partir du réseau courant (et non à partir d'un réseau complètement déconnecté comme c'est le cas avec SopLEQ et Taboo Order). En outre, les arcs fixés (les bleus) restent inchangés et les arcs interdits sont pris en compte.
    • EQ : algorithme d'apprentissage recherchant les classes d'équivalence des réseaux bayésiens. Cette méthode est très efficace car elle permet d'éviter beaucoup de minima locaux et de diminuer fortement la taille de l'espace de recherche. Comme l'algorithme Taboo, EQ peut commencer sa recherche à partir du réseau courant. Cependant les arcs fixés restent inchangés et les arcs interdits sont pris en compte.
    • SopLEQ : algorithme d'apprentissage reposant sur une première phase de caractérisation globale des données et sur l'exploitation des propriétés d'équivalence de réseaux bayésiens. Les arcs fixés sont considérés comme des arcs normaux mais les arcs interdits sont pris en compte.
    • Taboo Order : algorithme d'apprentissage utilisant la recherche Taboo dans l'espace défini par les ordres de noeuds des réseaux bayésiens. En effet, trouver le meilleur réseau bayésien pour un ordre de noeuds donné est une tâche aisée qui consiste uniquement à choisir les parents de chaque noeud parmi l'ensemble des noeuds apparaissant avant lui dans l'ordre considéré. Cette méthode est la méthode de recherche la plus complète, mais aussi la plus coûteuse en temps de calcul. Les arcs fixés sont considérés comme des arcs normaux mais les arcs interdits sont pris en compte.
    « - less
  • Supervised learning focuses entirely on the characterization of the target variable, meaning that its probabilistic profile can be quickly established. + learn more »

    BayesiaLab puts at your disposal several learning algorithms :

    • Naive Bayes Naive Bayes : Bayesian network with a predefined architecture in which the target node is the parent of all the other nodes. This structure thus states that the target node is the cause of all the other nodes and that the knowledge of its value makes each node independent of the others. In spite of these strong assumptions, which are false in the majority of the cases, the low number of probabilities to estimate makes this structure very robust, with a very short learning time as only the probabilities have to be estimated.
    • Augmented naive Bayes Augmented naive Bayes : partially predefined structure allowing relaxing the strong constraint of conditional independence mentioned above. This architecture is made up of a naive architecture, enriched by the relations between the child nodes knowing the value of the target node (the common parent). The prediction accuracy of this algorithm is better than those obtained by the naive architecture, but the unsupervised search of the child relationships can be time consuming.
    • Tree Augmented Naive Bayes:Tree Augmented Naive Bayes : partially predefined structure allowing relaxing the strong constraint of conditional independence mentioned above. This architecture is made up of a naive architecture on which a maximum spanning tree is learned. The prediction accuracy of this algorithm is better than those obtained by the naive architecture, but not as good as obtained with Augmented Naive Bayes; however, this algorithm is much quicker than it.
    • Sons & Spouses : structure in which the target node is the parent of a subset of nodes having potentially other parents (spouses). Sons & Spouses : structure in which the target node is the parent of a subset of nodes having potentially other parents (spouses). This structure is to some extent an augmented naive architecture in which the children set is not fixed a priori, but searched according to the marginal dependence of the nodes on the target. This algorithm thus has the advantage of highlighting the nodes that are not correlated to the target. The learning duration is comparable with the augmented naive architecture one.
    • Markov Blanket learning Markov Blanket learning : algorithm that searches the nodes belonging to the Markov Blanket of the target node, i.e. fathers, sons and spouses. The knowledge of the values of each node of this subset of nodes makes the target node independent of the all the other nodes. The search of this structure, which is entirely focused on the target node, makes it possible to obtain the subset of the nodes that are really useful much more quickly than the two previous algorithms. Furthermore, this method is a very powerful selection algorithm and is the ideal tool for the analysis of a variable: a restricted number of connected nodes, different kinds of probabilistic relations:

      • fathers : nodes that bring more information jointly than alone;
      • sons : nodes having a direct probabilistic dependence with the target;
      • spouses : nodes those are marginally independent of the target but which become informative when knowing the value of the son.
    • Augmented Markov blanket Augmented Markov blanket learning : algorithm that is initialized with the Markov Blanket structure and that uses an unsupervised search to find the probabilistic relations that hold between each variable belonging to this Markov Blanket. This unsupervised search implies additional time cost but allows having better prediction results compared to the first version.
    • Minimal augmented Markov blanket learningMinimal augmented Markov blanket learning : the selection of the variables that is realized with the Markov Blanket learning algorithm is based on a heuristic search. The set of the selected nodes can then be non minimal, especially when there are various influence paths between the nodes and the target. In that case,the target analysis result takes into account too much nodes. By applying an unsupervised learning algorithm on the set of the selected nodes, the Minimal Augmented Market Blanket learning allows reducing this set of nodes, and it results then in a more accurate target analysis. However, if the task is a pure prediction task (as for example a scoring function), the Augmented Markov Blanket algorithm is usually more accurate than its Minimal version since it uses more "pieces of evidences".
    • Semi-Supervised learning Semi-Supervised learning : unsupervised learning algorithm that searches the relationships between the nodes that belong to a predefined distance of the target. This distance is computed by using the Markov Blanket learning algorithm. The semi-supervised learning algorithm allows learning a network fragment centered on the target variable. This algorithm is very useful for tasks that involve a lot of nodes, as for example in micro-arrays analysis (thousand of genes), and for prediction tasks where the Markov Blanket nodes have missing values, as these nodes do not allow to separate the target node from the other nodes anymore.
    « - less
  • Unsupervised learning for the search for new concepts (segmentation, clustering) allows you to divide your records into semantically significant classes (partitions), regrouping the records sharing certain characteristics. You can therefore easily create a typology (customers, patients, products, etc.) and define policies adapted to each segment discovered. + learn more »

    Voici les paramètres qui déterminent la segmentation effectuée :

    • A fixed number of classes : the algorithm tries to segment data according to a given number of classes (ranging from 2 to 127). However, it is possible to obtain less clusters than desired;
    • Automatic selection of the number of classes : a random walk is used to find the optimal number of classes, starting with the specified number of clusters and increasing that number until obtaining empty or unstable clusters, or reaching the specified maximum number of clusters. The random walk is guided by the results obtained at each step;
    • Options :
      • the sample size option makes it possible to search for the optimal number of classes on data subsets to improve the convergence speed (a sampling by step/trial). The partition obtaining the best score is then used as the initial partition for the search on the entire data set;
      • the number of steps for the random walk. Knowing that it is possible to stop the search by clicking on the red light of the status bar while preserving the best clustering, this number can be exaggeratedly great.
      • It is also possible to give a weight to each variable of the network. Those weights, with default value 1, are associated with the variables and permit to guide the clustering. A weight greater than 1 will imply that the variable will be more taken into account during the clustering. A zero weight will make the variable purely illustrative.
    « - less

    At the end of the segmentation, an automatic analysis of the obtained segmentation is carried out and returns a textual report. This report is a "Target Report Analysis" and contains some additional information.

    Cartographie de clusteringA graphical representation of the created clusters can be generated. This graph displays three properties of the found clusters:

    • the color represents the purity of the clusters: the more a cluster is blue, the more it is pure
    • the size represents the prior probability of the cluster
    • the distance between two clusters represents the mean neighborhood of the clusters
  • The clustering of variables is particularly useful for discovering new concepts and synthesizing the hidden variables corresponding to the unsupervised classification. You can therefore discover probabilistic relations i découvrir les relations probabilistes entre ces variables latentes et les variables manifestes (modèles hiérarchiques, équations structurelles automatiques).

Step 3: analysis, using the created network

Examples of data mining applications