
Graph of Related HTML5 Tags
A graph built by scraping web pages, counting each page's HTML5 tags, and graphing them using D3.js
I got the idea for a graph showing relationships between HTML tags after reading the MDN web docs page for HTML tags. The immediate issue was that there wasn't any data connecting individual tags, other than the categories that MDN displays them in. To remedy this, I took to crawling the internet and cataloging the use of tags on each page.
I formatted the data like this:
<URL>
<TAG>:<OCCURRENCES>,...
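Here is a minimal sketch of how data in this format could be read back in. It assumes the records strictly alternate between a URL line and a line of comma-separated tag counts; the function name `parse_records` and the sample URLs are made up for illustration and are not from the project code.

```python
def parse_records(text):
    """Parse alternating URL / tag-count lines into {url: {tag: count}}."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    pages = {}
    # Assumption: lines come in pairs, the URL first and its tag counts second.
    for url, counts_line in zip(lines[0::2], lines[1::2]):
        counts = {}
        for pair in counts_line.split(","):
            if not pair.strip():
                continue  # tolerate a trailing comma
            # Split on the last colon so "div:12" becomes ("div", "12").
            tag, _, occurrences = pair.rpartition(":")
            counts[tag.strip()] = int(occurrences)
        pages[url] = counts
    return pages


# Hypothetical sample data in the <URL> / <TAG>:<OCCURRENCES> format.
sample = """https://example.com/
div:12,p:5,a:7
https://example.org/about
div:3,section:2,p:4
"""

pages = parse_records(sample)
```

Keeping the result as one dictionary of counts per URL makes it easy to ask later which tags show up together on the same page.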
Once a healthy amount of this data had been collected, it had to be parsed and turned into a network graph for later display. The main decision in building the graph was how to weight tag relationships. After trying a few schemes for determining how important one tag is to another, I settled on this formula:
score = number_of_appearances_on_the_same_page /
sqrt(total_number_of_appearances)
Dividing by the square root of the total, rather than by the total itself, gives more weight to closely related tags while still leaving some weight for frequently used ones.
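To make the effect of the square root concrete, here is a small sketch of the formula with made-up numbers. The function name and the example counts are assumptions for illustration; exactly how the two inputs are counted is up to the parser.

```python
import math


def score(same_page_appearances, total_appearances):
    """Relatedness score: shared appearances over the square root of the total."""
    return same_page_appearances / math.sqrt(total_appearances)


# Hypothetical counts: a rare tag that always co-occurs,
# and a common tag that co-occurs only half of the time.
rare = score(4, 4)       # 4 / sqrt(4)    = 2.0
common = score(50, 100)  # 50 / sqrt(100) = 5.0

# For comparison, a plain ratio would rank them the other way around.
rare_ratio = 4 / 4       # 1.0
common_ratio = 50 / 100  # 0.5
```

With a plain ratio the rare tag wins outright; with the square root in the denominator the common tag still scores well because it shares many more appearances in absolute terms.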
From this, connections were generated by taking each tag's two most related tags. After these connections were found, duplicate connections were removed, and the tags and their connections were output as JSON for use in D3.js.
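The connection step can be sketched roughly like this. It assumes the scores are already available as a nested dictionary, treats connections as undirected when removing duplicates, and writes a `nodes`/`links` JSON shape with `source` and `target` keys, which is a common layout for D3 force graphs. The function name, the sample scores, and the exact JSON shape are assumptions, not the project's actual output.

```python
import json


def build_graph(scores, per_tag=2):
    """Link each tag to its highest-scoring tags and drop duplicate links."""
    edges = set()
    for tag, related in scores.items():
        # Pick the `per_tag` most related tags by score.
        top = sorted(related, key=related.get, reverse=True)[:per_tag]
        for other in top:
            # Sorting the pair makes (a, b) and (b, a) the same edge.
            edges.add(tuple(sorted((tag, other))))
    nodes = [{"id": tag} for tag in sorted(scores)]
    links = [{"source": a, "target": b} for a, b in sorted(edges)]
    return {"nodes": nodes, "links": links}


# Hypothetical scores: scores[tag][other] is how related `other` is to `tag`.
scores = {
    "ul": {"li": 9.0, "a": 4.0, "div": 3.0},
    "li": {"ul": 9.0, "a": 5.0, "div": 2.0},
    "a": {"li": 5.0, "ul": 4.0, "div": 4.5},
    "div": {"a": 4.5, "ul": 3.0, "li": 2.0},
}

graph = build_graph(scores)
graph_json = json.dumps(graph, indent=2)
```

Storing each edge as a sorted tuple in a set is what removes the duplicates: `ul` picking `li` and `li` picking `ul` end up as a single link.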
D3.js was great for displaying the graph. It has plenty of features, but the one I was most interested in was d3-force, which automatically lays out the node graph based on spacing requirements. Another useful module was d3-zoom, which allows the network graph to be zoomed. This was necessary because there were a lot of nodes to display.
Thanks for reading.
Get the code on GitHub!