Wikipedia Network Maps
Added November 2008
I have been working on a really exciting project with Ian Pearce and Max Darham that attempts to visualize Wikipedia.
Our idea was as such:
How are articles on Wikipedia organically organized? For instance, if you read the article on Bicycles, there will likely be links to the inventor of the modern bicycle within the text of the article, and there will likely be links to rubber, steel, and all the other parts; the countries that were important in making advances in bicycle technology, and so forth. If we treat an article as a vertex and the link as a line or connection to another article or vertex, could we make a program that visually maps this data? We basically wanted to make a prototype datamine project and then use our information to draw specific conclusions about how Wikipedia looks, works, is organized, and if that bears any similarities to any other known information networks maps or types.
Our plan was as such:
We wanted to build a ruby program that visited a random article. You can do this for the English language by using http://en.wikipedia.org/wiki/Special:Random.
Then, we wanted to have the program store all the links from the page, remember the page and all of its links, then recurse through and look at all the links that were on that page, then go through those linked pages, and so forth until we had a huge array of all the articles and the articles that were linked to within a given article. Then, we wrote a little method that turned that information into a Python list, and then used Sage to create visuals of the data. Below are a few examples as of right now of a few graphs we have generated so far; these are the largest connect components, or the largest group of connected articles; there are others, but they don’t link to anything, and nothing links to them, so they are kind of boring. In the next few weeks, we will wrap up the project and have more complete graphs and conclusions to boot. Check back later.
Actual Sage Code Used to make it all work:
(Click here to see the sage)
g = eval(open(get_remote_file('http://devingaffney.com/files/wiki_data
/kg/kg_data.txt')).read())
wiki_graph = Graph(g)
wiki_digraph = DiGraph(g)
wg_conncomp = wiki_graph.connected_components_subgraphs()
wdg_conncomp = wiki_digraph.connected_components_subgraphs()
print ""
print "GRAPH"
print "Total articles:", wiki_graph.num_verts()
dh = wiki_graph.degree_histogram()
dh_plot = list_plot(dh, plotjoined=True)
gcount = 0
print "Degree histogram:"
for x in dh:
gcount = gcount+1
if x != 0:
print gcount, x
dh_plot.axes_range(0,200,0,60)
dh_plot.show()
graph_degree_total = 0
for deg in wiki_graph.degree_iterator():
graph_degree_total = deg+graph_degree_total
graph_average_degree = float(graph_degree_total/wiki_graph.num_verts())
print ""
print "GRAPH : LARGEST CONNECTED COMPONENT"
print "Total articles in largest connected component:", wg_conncomp[0].num_verts()
print "Diameter of largest connected component:", wg_conncomp[0].diameter()
array = []
counter = 0
for x in wg_conncomp[0].vertices():
for y in wg_conncomp[0].vertices():
length = wg_conncomp[0].shortest_path_length(x,y)
array.append(length)
for x in array:
counter = x+counter
average_path_length = float(counter/len(array))
print "Average path length:", average_path_length
print "Clustering Average:", wg_conncomp[0].clustering_average()
print "Degree Total:", graph_average_degree
print "Number of Cliques", wiki_graph.clique_number()
wg_conncomp[0].show(vertex_size=2, fontsize=2, figsize=[75,75],
filename='wikipedia_crcl.png', layout="circular")
print ""
print "DIGRAPH"
dh = wiki_digraph.degree_histogram()
dh_plot = list_plot(dh, plotjoined=True)
dgcount = 0
print "Degree histogram:"
for x in dh:
dgcount = dgcount+1
if x != 0:
print dgcount, x
digraph_degree_total = 0
for deg in wiki_digraph.out_degree_iterator():
digraph_degree_total = deg+digraph_degree_total
digraph_average_degree = float(digraph_degree_total/wiki_digraph.num_verts())
dh_plot.axes_range(0,200,0,60)
dh_plot.show()
print ""
print "DIGRAPH : LARGEST CONNECTED COMPONENT"
print "Total articles in largest connected component:", wdg_conncomp[0].num_verts()
print "Clustering Average:", wdg_conncomp[0].clustering_average()
print "Degree Total:", digraph_average_degree
wdg_conncomp[0].show(vertex_size=2, fontsize=2, figsize=[75,75],
filename='digraph_wikipedia.png')
(Click here to see the Ruby)
Example of Running the code:
$> load "run.rb"
$> run("kg", 350, 10)#kg is the prefix to search through,
350 is the stop point if we get that
many or know its not higher than that
10 is to stop the random searcher
(after a certain point, its useless to
try to collect all the single, unconnected
nodes that are out there.
Then, assuming you have altered the
code to search through the right place,
you should be running it at full speed.
Click to Download the wikipedia network maps ruby code
Back