Link Prediction using Supervised Learning

Apr 16th, 2013 by Santosh

Actual Paper : http://www.siam.org/meetings/sdm06/workproceed/Link%20Analysis/12.pdf

Below is a summary, understanding and takeaway of the paper.

Link Prediction problem is to predict the likelihood of two nodes sharing an edge in future, given the current state of the graph.

Co-authorship data set was used in this paper. This dataset also obey power-law distribution.

The paper explains the procedure to construct a dataset, select features, and use various algorithms to learn a model. Each feature is also evaluated with visual analysis and experimental results.

Dataset is constructed in following manner:

To predict a link, we partition the range of publication years into two non-overlapping sub-ranges. The first sub-range is selected as training years and the later one as the testing years. Then, we prepare the classification dataset, by choosing those author pairs, that appeared in the training years, but did not publish any papers together in those years. Each such pair either represents a positive example or a negative example, depending on whether those author pairs published at least one paper in the testing years or not. Coauthoring a paper in testing years by a pair of authors, establishes a link between them, which was not there in the training years.

Choosing correct features is most critical. Various kinds of feature selection

Proximity Features – Keyword Match Count

Cheaper to Compute
For example, for the case of coauthor ship network, two authors are close (in the sense of a social network) to each other, if their research work revolves around a larger set of identical keywords.

Aggregate Features – Sum of Papers, Sum of Neighbors, Sum of Keyword Counts, Sum of Classification Code, Sum of log(Secondary Neighbors Count, Log since they grow exponentially – Costly to compute).

Combine the attribute values of the corresponding nodes in a node-pair.
if either (or both) of the scientists are prolific, it is more likely that they will collaborate. Before aggregation, the individual measure is how prolific a particular scientist is and the corresponding individual feature is the number of different areas (s)he has worked on. Summing the value to combine these, yields an aggregated feature that is meaningful for the pair of authors for link prediction.

Topological Features – Clustering Index, Shortest distance, Shortest Distance in Author-KW graph

Applicable equally to all domains
The shorter the distance, the better the chance that they will collaborate.

Normalized the feature values to have zero mean and one standard deviation. For algorithms that have tunable parameters, like SVM, K-Nearest Neighbors, etc., separate validation set is used to find the optimum parameter values. Majority of the misclassification in this experiment is due to base error from some samples – found from bagging / ensemble. For link prediction, we may have to bias model for high recall, depends on the type of application. Degree of confidence can be used instead of binary classification.

Posted in Link Prediction

Data Mining

Link Prediction using Supervised Learning

Leave a Reply Cancel reply

Recent Posts

Categories

Recent Comments

Archives

Meta