Next Generation Data Classification and Linkage
Data classification and linkage is the task of identifying information that corresponds to the same entity across one or more data sources. Methods for tackling it fall into two broad categories. The first is the deterministic model, in which sets of rules, often very complex, are used to classify pairs of entities as links. The second is the probabilistic model, in which statistical or probabilistic approaches are used to classify pairs. Both kinds of model perform poorly when the data contain many missing values, typographical errors or non-standardized entities. Intelligent routines that make use of artificial neural networks, genetic algorithms and clustering algorithms can provide the next generation of models for data classification and linkage. Using an example prototype, this paper introduces data linkage and discusses its impact on humanity and community, current models and their pitfalls, new directions, and the technical and social issues facing next generation data classification and linkage systems. A new model for linkage is proposed. It highlights that identifying relationships within the attributes of a single entity, and not only relationships between attributes of different entities, is important for handling missing values and can provide better accuracy.
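The contrast between the two existing categories can be illustrated with a minimal sketch. It is not the prototype discussed in this paper. The record fields, the example values and the agreement probabilities in `M_U` are all hypothetical, and the scoring function uses a Fellegi-Sunter-style log-likelihood weight, which is one common probabilistic approach and is assumed here only for illustration.

```python
import math

# Hypothetical records; field names and values are illustrative only.
rec_a = {"first": "john", "last": "smith", "dob": "1980-02-11"}
rec_b = {"first": "jon", "last": "smith", "dob": "1980-02-11"}  # typo in first name
rec_c = {"first": "john", "last": "smith", "dob": None}         # missing date of birth


def deterministic_link(a, b):
    """Rule-based: link only if every field is present and exactly equal."""
    return all(a[f] is not None and a[f] == b[f] for f in a)


# Assumed per-field probabilities (not estimated from any data):
# m = P(fields agree | same entity), u = P(fields agree | different entities).
M_U = {"first": (0.90, 0.05), "last": (0.95, 0.02), "dob": (0.98, 0.01)}


def probabilistic_score(a, b):
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in M_U.items():
        if a[field] is None or b[field] is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


print(deterministic_link(rec_a, rec_b))   # exact rule rejects the typo
print(probabilistic_score(rec_a, rec_b))  # weighted evidence still favors a link
print(probabilistic_score(rec_a, rec_c))  # the missing field is simply skipped
```

In this sketch a single typographical error defeats the exact-match rule, while the probabilistic score remains positive. The missing date of birth is skipped in the score, so whatever that field could have contributed is lost. That limitation is the kind the proposed model aims to address by also using relationships within the attributes of an entity.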