
Actual Paper: http://www.islab.ntua.gr/attachments/article/63/06033365.pdf

Link prediction is framed as a classification problem for a competition. 94 features were computed and fed to random forests for classification; ROC is used for evaluation.

The study of network dynamics has numerous applications; for example, marketers use it to recommend products to a person based on their contacts. A graph representation lets us apply the power of existing graph theory to the problem. This paper treats each node as a user and each edge as a friendship. (In our project, each node is a photo and each edge is a relationship between photos.)

This problem is framed as classification, where the prediction is whether a link exists or not. In the real world, the question would rather be which nodes are most likely to create links, and with which other nodes.

In this paper, different similarity scores are considered as features. Scores should be normalized to a common scale wherever possible.

Instead of using the entire graph at once, sub-graphs are used to make a prediction for a pair of nodes. A sub-graph for a pair of nodes is created by taking the neighbour nodes of those two nodes and constructing a graph with only the edges between such nodes. This can be done at different levels (hops).
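The sub-graph construction described above can be sketched as follows. This is a minimal illustration with an adjacency-dict graph, not the paper's implementation; the function name and toy graph are my own.

```python
def pair_subgraph(adj, a, b, levels=1):
    """Collect a and b plus their neighbours up to `levels` hops,
    then keep only the edges between the collected nodes."""
    nodes = {a, b}
    frontier = {a, b}
    for _ in range(levels):
        frontier = {n for u in frontier for n in adj.get(u, set())} - nodes
        nodes |= frontier
    return {u: adj.get(u, set()) & nodes for u in nodes}

# Toy friendship graph as an adjacency dict of sets.
adj = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2, 5}, 5: {4}}
sub = pair_subgraph(adj, 1, 2, levels=1)
# At one level, node 5 (two hops from both 1 and 2) is excluded.
```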

The score is calculated by taking the top 10 related nodes of a node A, intersecting them with the neighbour nodes of node B, and summing the similarity between those intersection nodes and node A. When a node has few friends, it is difficult to get proper semantics, so neighbour nodes are considered. However, calculating the top 10 friends of node A can be expensive.
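The scoring step above can be sketched like this, assuming a precomputed pairwise similarity table (the data structures and names here are my own illustration):

```python
def topk_intersection_score(sim, adj, a, b, k=10):
    """Score pair (a, b): take the k nodes most similar to a,
    intersect them with b's neighbours, and sum their similarity to a.
    `sim[a]` maps other nodes to a precomputed similarity with a."""
    top_k = sorted(sim[a], key=sim[a].get, reverse=True)[:k]
    common = set(top_k) & adj.get(b, set())
    return sum(sim[a][n] for n in common)

sim = {"A": {"x": 0.9, "y": 0.5, "z": 0.1}}
adj = {"B": {"x", "z"}}
# Top-2 related nodes of A are {x, y}; only x is also B's neighbour.
score = topk_intersection_score(sim, adj, "A", "B", k=2)
```

The sort over all of A's similarities is what makes the top-k step expensive, as the summary notes.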

Weights for edges between pairs of nodes are calculated based on the degree of the nodes:

w = 1 / sqrt(1 + degree)

Using these weights, nearest neighbours are found.
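A small sketch of this weighting, assuming (my interpretation) that an edge is down-weighted by the degree of the endpoint it leads to, so high-degree hubs contribute less:

```python
import math

def edge_weight(adj, v):
    # w = 1 / sqrt(1 + degree) of the target node (assumed interpretation).
    return 1.0 / math.sqrt(1 + len(adj[v]))

def nearest_neighbours(adj, u, k=3):
    """Rank u's neighbours by the degree-based edge weight."""
    return sorted(adj[u], key=lambda v: edge_weight(adj, v), reverse=True)[:k]

adj = {1: {2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
# Node 2 has degree 1 (weight 1/sqrt(2)); node 3 has degree 2
# (weight 1/sqrt(3)), so node 2 ranks first among node 1's neighbours.
ranked = nearest_neighbours(adj, 1)
```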

ROC is used for evaluation.

Feasibility, scalability, and accuracy must be balanced.

Actual Paper: http://www.cs.utexas.edu/~yzhang/papers/osn-imc09.pdf

Estimating proximity between nodes in social networks is challenging due to the size and dynamic nature of the networks. Proximity is a central measure in the analysis of social networks. Proximity sketching and proximity embedding are proposed in this paper and demonstrated on the link prediction problem over multiple networks.

The simplest proximity estimates are based on the information flow between the nodes and between their neighbours. The Katz measure, rooted PageRank, and escape probability are used in this paper.
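As a reminder of what the Katz measure computes, here is a minimal truncated version: it sums, over path lengths l, beta^l times the number of length-l paths, obtained from powers of the adjacency matrix. This is a plain-Python sketch for small graphs, not the paper's scalable estimator.

```python
def katz_score(A, beta=0.1, max_len=5):
    """Truncated Katz measure via adjacency-matrix powers.
    A is a dense adjacency matrix (list of lists of 0/1)."""
    n = len(A)

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    power = A                      # A^1: length-1 paths
    score = [[beta * power[i][j] for j in range(n)] for i in range(n)]
    b = beta
    for _ in range(max_len - 1):   # add beta^l * A^l for l = 2..max_len
        power = matmul(power, A)
        b *= beta
        for i in range(n):
            for j in range(n):
                score[i][j] += b * power[i][j]
    return score

# Path graph 0 - 1 - 2: nodes 0 and 2 share no edge but one length-2 path,
# so their Katz score starts at beta**2 = 0.01.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
S = katz_score(A, beta=0.1, max_len=4)
```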

We studied the link prediction part of this paper in more depth, as we are interested in solving this problem for our project.

1. Basic link predictors: a link exists between a pair of nodes if the similarity between the nodes crosses a certain threshold.
2. Composite link predictors: combine multiple predictors and can be applied through Weka.

Proximity Measures:

Graph Based:
1. Shortest path distance between a pair of nodes in the existing graph.
2. Dijkstra's algorithm is inefficient for graphs with a large number of nodes; in this paper, they apply expanded ring search.

Node neighbourhood based:
1. Number of common neighbours.
2. More weight to common neighbours that have few friends.
3. Preferential attachment: product of the neighbour counts of both nodes.
4. PageRank: the PageRank of a node depends on the count of inbound links and the PageRank of outbound neighbours.
5. Path-ensemble based, e.g. the Katz measure.
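The first three neighbourhood measures above are easy to state in code. The degree-penalized variant here is the Adamic/Adar form (1 / log degree of each common neighbour), which is my assumption for "more weight to common neighbours with few friends":

```python
import math

def common_neighbours(adj, u, v):
    return len(adj[u] & adj[v])

def adamic_adar(adj, u, v):
    # Common neighbours weighted inversely by (log of) their degree.
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[u] & adj[v] if len(adj[z]) > 1)

def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])

# Toy undirected graph as an adjacency dict of sets.
adj = {
    "a": {"c", "d"}, "b": {"c", "d", "e"},
    "c": {"a", "b"}, "d": {"a", "b", "e"}, "e": {"b", "d"},
}
```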

For link prediction, they divided the data sets over time into three snapshots S1, S2, S3. Based on information from S1 and S2, they predict links that will be formed in S3.

In the friendship link prediction problem, we construct the positive set as the set of user pairs that were not friends in the first snapshot but become friends in the second snapshot. The negative set consists of user pairs that are friends in neither snapshot.

Accuracy is measured in terms of false positives (FP) and false negatives (FN):

FN = # of missed friend links / # of new friend links
FP = # of incorrectly predicted friend links / # of non-friend links
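As a tiny sanity check, the two rates can be computed directly from counts (the numbers below are made up for illustration):

```python
def error_rates(missed, new_links, wrong_predictions, non_links):
    """False-negative and false-positive rates as defined above."""
    fn = missed / new_links
    fp = wrong_predictions / non_links
    return fn, fp

# Hypothetical counts: 30 of 200 new links missed,
# 50 false alarms out of 10,000 non-links.
fn, fp = error_rates(missed=30, new_links=200,
                     wrong_predictions=50, non_links=10_000)
```

Note the asymmetry: because non-links vastly outnumber new links, even a tiny FP rate can mean many wrong predictions.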

The computational time of each proximity measure for a single pair of nodes is shown in the results.

Related work on Social Networks, Proximity Measures, Link Prediction, Dimensionality Reduction is studied.

SNAP is a tool for network analysis and graph mining. It is implemented in C++ and can scale up to millions of nodes, with efficient data structures to optimize memory.

The source code has many examples that explain how to build and traverse graphs using the tool. Many commonly used algorithms are also implemented in the examples section.

http://snap.stanford.edu/index.html

Original Paper: http://i.stanford.edu/~julian/pdfs/eccv2012.pdf

This paper tries to predict tags, labels, or recommendations for images based on the social-network metadata of the images, i.e., the connections between images, such as images uploaded to the same groups. The task is cast as a binary labeling problem using structured learning techniques.

The challenge is how to effectively use/model the social network data and predict from it.

Modeling is done as follows:

we propose a graphical model that treats image classification as a problem of simultaneously predicting binary labels for a network of photos. Nodes represent images, and edges represent relationships between images. Our intuition that images sharing common properties are likely to share labels allows us to exploit techniques from supermodular optimization, allowing us to efficiently make binary predictions on all images simultaneously.

Labels range from subjective to objective. Social network metadata provides context that is not inherent in the image. So, the model proposed in this paper outperforms SVM.

Following is a gist of the dataset:

  • The photo itself
  • Photo data, including the photo’s title, description, location, timestamp, view count, upload date, etc.
  • User information, including the uploader’s name, username, location, their network of contacts, etc.
  • Photo tags, and the user who provided each tag
  • Groups to which the image was submitted (only the uploader can submit a photo to a group)
  • Collections (or sets) in which the photo was included (users create collections from their own photos)
  • Galleries in which the photo was included (a single user creates a gallery only from other users’ photos)
  • Comment threads for each photo

Images sharing common tags are likely to share image labels, and similarly for other properties.
Feature vectors represent the presence or absence of each tag.
Memory matters, since the feature vectors are stored in memory.

For nodes, the 1000 most popular words / tags / groups are used as features after feature reduction.
For edges, indicators for the number of shared tags, groups, collections, etc. are used.
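The node features above amount to binary presence/absence vectors over a reduced vocabulary. A minimal sketch of that reduction (the data and function names are my own illustration, and 1000 is shrunk to 2 for the toy example):

```python
from collections import Counter

def build_vocab(photo_tags, size=1000):
    """Keep the `size` most popular tags (feature reduction)."""
    counts = Counter(t for tags in photo_tags.values() for t in tags)
    return [t for t, _ in counts.most_common(size)]

def tag_vector(tags, vocab):
    """Binary vector: presence/absence of each vocabulary tag."""
    tagset = set(tags)
    return [1 if t in tagset else 0 for t in vocab]

photo_tags = {
    "p1": ["beach", "sunset"],
    "p2": ["beach", "dog"],
    "p3": ["dog"],
}
vocab = build_vocab(photo_tags, size=2)  # "beach" and "dog" appear twice each
vec = tag_vector(photo_tags["p1"], vocab)
```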

Performance is compared against previous papers that use image tags alone, SVMs, etc. The comparison uses mean average precision and balanced error rate.

Actual Paper: http://arxiv.org/pdf/1011.4071.pdf
Citation: Backstrom, Lars and Leskovec, Jure. Supervised Random Walks: Predicting and Recommending Links in Social Networks. WSDM ’11.

The link prediction problem is to identify any links that may be formed in future, or missing links in a network.

The challenge, which this paper addresses, is how to combine network structure with node/edge data. Social networks are dynamic and keep growing, and studying a network at the edge level is also interesting.

1. Real networks are extremely sparse.
2. How to use features of the network to model links.

Combining the features of nodes and edges is challenging. A common approach is to calculate the number of neighbours, the degrees of nodes, the shortest path between nodes, etc.

In this paper, an edge strength function is optimized using PageRank-like random walk scores. The method is very general and can be applied in a variety of domains:
1. Friend recommendation on Facebook, Myspace.
2. Security: predicting hidden links in a social network.
3. Protein-protein interactions in systems biology.
4. Suggesting relevant pages for bloggers to link to.
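A rough sketch of the core idea: a random walk with restart whose transition probabilities are proportional to edge strengths computed from edge features. The weight vector `w` is assumed already trained here (the paper learns it by gradient descent on ranking constraints); the logistic strength function and all names are my simplification.

```python
import math

def edge_strength(features, w):
    """Edge strength = logistic(w . psi_uv), psi_uv being edge features."""
    return 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, features))))

def random_walk_scores(adj_feats, source, w, alpha=0.15, iters=50):
    """Random walk with restart to `source`; transition probabilities are
    proportional to the learned edge strengths."""
    nodes = list(adj_feats)
    p = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: alpha if n == source else 0.0 for n in nodes}
        for u in nodes:
            out = adj_feats[u]                      # {target: feature vector}
            total = sum(edge_strength(f, w) for f in out.values()) or 1.0
            for v, f in out.items():
                nxt[v] += (1 - alpha) * p[u] * edge_strength(f, w) / total
        p = nxt
    return p  # rank candidate link targets by p

# Toy graph: from "s", the edge to "a" has a stronger feature than to "b".
adj_feats = {"s": {"a": [1.0], "b": [0.0]}, "a": {"s": [0.0]}, "b": {"s": [0.0]}}
p = random_walk_scores(adj_feats, "s", w=[2.0])
```

Candidates with higher stationary probability are predicted as the source's next links.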

Challenge for many of the existing approaches is scalability.

First approach: a classification problem.
1. Class imbalance: a node is connected to only a few nodes compared to the many it is not connected to.
2. Feature selection for edges and nodes.
3. Proximity: many possibilities, as mentioned above.

Second approach: the task is to rank the nodes.

This paper combines both of them.

Experiments are first performed on synthetic data, with the ROC curve used for evaluation. Then real data sets, such as Facebook and co-authorship networks, are used.

This method outperforms existing supervised models and is far better than unsupervised models.

Project Problem statement:

We have a dataset of photos with tags, user IDs, location, and label as features/metadata/dimensions. We will create a graph of the photos based on common attributes; for example, a similar location and user ID would create an edge between two photos. Once the network is created based on our assumptions, we will apply data mining algorithms to predict a missing edge of a triangle based on the other two edges.

Actual Paper : http://www.siam.org/meetings/sdm06/workproceed/Link%20Analysis/12.pdf

Below is a summary, understanding and takeaway of the paper.

Link Prediction problem is to predict the likelihood of two nodes sharing an edge in future, given the current state of the graph.

A co-authorship data set was used in this paper. This dataset also obeys a power-law distribution.

The paper explains the procedure to construct a dataset, select features, and use various algorithms to learn a model. Each feature is also evaluated with visual analysis and experimental results.

The dataset is constructed in the following manner:

To predict a link, we partition the range of publication years into two non-overlapping sub-ranges. The first sub-range is selected as training years and the later one as the testing years. Then, we prepare the classification dataset, by choosing those author pairs, that appeared in the training years, but did not publish any papers together in those years. Each such pair either represents a positive example or a negative example, depending on whether those author pairs published at least one paper in the testing years or not. Coauthoring a paper in testing years by a pair of authors, establishes a link between them, which was not there in the training years.
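The construction above can be sketched directly. Under the stated rule, candidate pairs are authors who both appeared in the training years but never co-authored there; the label is whether they co-author in the testing years. The toy publication lists are my own.

```python
from itertools import combinations

def build_link_prediction_dataset(train_pubs, test_pubs):
    """`train_pubs`/`test_pubs` are lists of author lists, one per paper.
    Returns [((u, v), label)] with label 1 if the pair co-authors in the
    test years, 0 otherwise."""
    train_authors = {a for paper in train_pubs for a in paper}
    train_pairs = {frozenset(p) for paper in train_pubs
                   for p in combinations(paper, 2)}
    test_pairs = {frozenset(p) for paper in test_pubs
                  for p in combinations(paper, 2)}
    dataset = []
    for u, v in combinations(sorted(train_authors), 2):
        pair = frozenset((u, v))
        if pair in train_pairs:
            continue  # already linked in the training years
        dataset.append(((u, v), 1 if pair in test_pairs else 0))
    return dataset

train = [["A", "B"], ["B", "C"], ["D"]]
test = [["A", "C"], ["C", "D"]]
data = build_link_prediction_dataset(train, test)
```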

Choosing the right features is the most critical step. Three kinds of features are used:

Proximity Features – Keyword Match Count

  • Cheaper to compute.
  • For example, in a coauthorship network, two authors are close (in the sense of a social network) if their research work revolves around a larger set of identical keywords.

Aggregate Features – Sum of papers, sum of neighbours, sum of keyword counts, sum of classification codes, sum of log(secondary neighbour count) (the log is taken since these counts grow exponentially; costly to compute).

  • Combine the attribute values of the corresponding nodes in a node pair.
  • If either (or both) of the scientists is prolific, it is more likely that they will collaborate. Before aggregation, the individual measure is how prolific a particular scientist is, and the corresponding individual feature is the number of different areas (s)he has worked on. Summing the values yields an aggregated feature that is meaningful for link prediction on the pair of authors.

Topological Features – Clustering Index, Shortest distance, Shortest Distance in Author-KW graph

  • Applicable equally to all domains
  • The shorter the distance, the better the chance that they will collaborate.

Feature values are normalized to zero mean and unit standard deviation. For algorithms with tunable parameters, such as SVM and k-nearest neighbours, a separate validation set is used to find the optimal parameter values. The majority of misclassifications in this experiment are due to base error on some samples, as found via bagging/ensembles. For link prediction, we may have to bias the model toward high recall, depending on the type of application. A degree of confidence can be used instead of a binary classification.
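The normalization step is standardization, easily sketched per feature column (this is a generic illustration, not the paper's code):

```python
import math

def standardize(columns):
    """Scale each feature column to zero mean and unit standard deviation."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col)) or 1.0
        out.append([(x - mean) / std for x in col])
    return out

cols = standardize([[1.0, 2.0, 3.0]])
```

Standardizing puts features like keyword counts and shortest distances on a common scale, so distance-based learners such as k-nearest neighbours are not dominated by one feature.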