Hacker News
Stanford PhD Dissertation Browser (stanford.edu)
33 points by abhaga on Dec 1, 2010 | 24 comments


Oh, for a second there I thought Stanford made all their dissertations available online. On second thought, why don't universities do this instead of trying to sell access? I've heard of people who put a $20 bill in the copy of their dissertation in the university library and found it intact years later.


At least for C.S. I would guess that most people will happily send you a PDF copy of their theses (if they're not available from their websites already). Just send them a friendly email.


You're right, but contacting everyone like that is not very practical; in most cases you're not even aware that a dissertation on the topic you're interested in exists.

IDEA BOLT: What if one creates a central website that offers free storage and search capabilities and asks people to upload their dissertations and theses?

What do you think about that idea?



Yep, but arXiv's content is very limited, e.g. very little EE research and economics, and nonexistent social sciences.


I don't know where to find the stats, but the number of categories has expanded greatly. I assume that trend will continue.

The hard part is convincing people to upload their work. Even in physics the use of the arXiv is not uniform. In some areas, almost every published and unpublished paper is posted, while in other areas of physics hardly any papers are. For example, compare the category Quantum Physics with Space Physics.


My university has free, online access to all Master's theses and PhD dissertations: http://scholar.lib.vt.edu/theses/


Neat? Yes.

Usable? No.

I'm not sure if there's some piece of information that's trying to be portrayed here (other than the quantity of dissertations per major), but as a browser, it's pretty useless. It employs mystery meat nav heavily, mostly because you have to hover over a dissertation to view any info about it. Imagine finding a paper you liked, and then going back 15 minutes later to try and find it.

If there is some data that's trying to be shown, it's not clear to me what it is. Some of the inner circles are in columns which seem to indicate correlation, but I can't figure out if that's accurate or not. For instance, if I click on philosophy, I see a dot in the direction of Food Research, and when I hover over that dot I see a thesis about "Practical reasoning and the varieties of agency".

What is trying to be portrayed here?


"What is trying to be portrayed here?"

How dissertations in an area overlap with other areas.

Calling it "Stanford PhD Dissertation Browser" sets up the wrong expectations, but it's an interesting infographic.


Yep, this is a really bad way to present the information. Having a collapsible columnar list for each department would be orders of magnitude more useful. Plus it would be in HTML, and you could use ctrl/cmd+f to search for dissertations by keyword.

Perfect example of bad data viz.


Yes and no. I found the way it shows the relationships between topics quite useful and visual. For example, you can see the close relationship between biology and psychology. I'm not sure a large table could do the job.


I see almost no correlation between bio and psych. In fact, I only see 2 articles that link those 2 topics at all.


Usable? Absolutely.

It's automatically clustering topics radially based on proximity to each other and to other disciplines.

Click on CS. Then click on CS again (now in the center). CS + Music is out by the Music field. CS + EE is all in a column pointed to the EE field.

It's a tool for discovery.


If it's a tool for discovery, then I still say it's not usable.

If I went looking for a list of articles linking CS + Music, this doesn't help me. I'd want to click CS, click Music, and then receive a list of articles matching that correlation. Instead I have to hover over each dot, and scrub through years to even see titles for things. Sure, I get the "excitement" of not knowing what I'm looking for, but that is not a usable tool for discovering articles.


Thanks for the feedback, folks. I'm actually a bit surprised this hit Hacker News without any of us authors posting it, but heck, I'll jump in.

Think of this as an experiment in exploring a document collection at a higher level than search. Specifically, what you're seeing is Stanford's dissertations through the lens of a text model that tries to distill high-level patterns in the data. It doesn't always succeed, but it often hits the mark. There are plenty of ways that the visualization and the underlying text model could be improved.

For the curious, I'll tell you a bit more on how the numbers are computed: we build a unigram language model of the contents of every Stanford department based on their dissertations. Then, we posit that every dissertation comes from a mixture of those department models (using a supervised topic model, Labeled LDA). This lets us infer, for every dissertation, a weighted mixture of departments that best characterizes that abstract. So, say, dissertation X is 60% computer science, 20% physics, and so on. These scores are aggregated to compute the average similarities between departments, and are sliced to give the view over time.
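
A minimal sketch of the aggregation step described above, assuming the per-dissertation mixtures have already been inferred (the Labeled LDA inference itself is not shown). The function name, the data layout, and the numbers are made up for illustration and are not taken from the actual tool:

```python
from collections import defaultdict

def department_similarity(dissertations):
    """Average the inferred department mixtures of all dissertations
    that share a home department.

    `dissertations` is a list of (home_department, mixture) pairs, where
    `mixture` maps department name -> weight and the weights sum to 1.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for home, mixture in dissertations:
        counts[home] += 1
        for dept, weight in mixture.items():
            totals[home][dept] += weight
    # Departments missing from a mixture count as weight 0 in the average.
    return {
        home: {dept: total / counts[home] for dept, total in row.items()}
        for home, row in totals.items()
    }

# Made-up mixtures, for illustration only.
example = [
    ("Computer Science", {"Computer Science": 0.6, "Physics": 0.2, "Linguistics": 0.2}),
    ("Computer Science", {"Computer Science": 0.8, "Linguistics": 0.2}),
    ("Linguistics", {"Linguistics": 0.7, "Computer Science": 0.3}),
]
sim = department_similarity(example)
```

Note that nothing forces `sim["Computer Science"]["Linguistics"]` to equal `sim["Linguistics"]["Computer Science"]`: each row is averaged over a different set of dissertations. Slicing by year would amount to filtering the input list before calling the function.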

So what you're looking at is, essentially, a visualization of word overlap between departments, measured by letting the dissertations in one department borrow words from another department. Which departments borrow the most words from which others?

When you zoom in two levels (click on a department twice), individual dissertations are plotted on a line between each dissertation's home department and its next-highest-scoring department. So the relative position of two dissertations near each other is not meaningful unless they are on the same radial line. Dissertations from other departments that have a high score for the central, focused department are also shown.
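
A rough sketch of that placement rule, under stated assumptions: the focused department sits at the origin, every other department has a fixed angle on an outer ring, and the distance along the line grows with the second department's share of the two top weights. The distance rule, the `angles` layout, and all names here are hypothetical; the description above only says which line a dissertation falls on.

```python
import math

def radial_position(mixture, home, angles, radius=1.0):
    """Place one dissertation on the line from the centre (its home
    department) towards its next-highest-scoring department.

    mixture: department -> weight for this dissertation
    angles:  department -> angle in radians on the outer ring (hypothetical layout)
    """
    others = {dept: w for dept, w in mixture.items() if dept != home}
    if not others:
        return None, (0.0, 0.0)  # single-department dissertation: stay at the centre
    second = max(others, key=others.get)
    denom = mixture.get(home, 0.0) + others[second]
    # Assumption: distance from the centre is the second department's share
    # of the combined home + second weight.
    share = others[second] / denom if denom > 0 else 0.0
    r = radius * share
    theta = angles[second]
    return second, (r * math.cos(theta), r * math.sin(theta))

# Made-up example: a CS dissertation leaning towards Linguistics.
angles = {"Linguistics": 0.0, "Physics": math.pi / 2}
mixture = {"Computer Science": 0.6, "Linguistics": 0.3, "Physics": 0.1}
second, (x, y) = radial_position(mixture, "Computer Science", angles)
```

Because only the angle of the second department is used, two dots are comparable only when they share that angle, which matches the caveat about radial lines.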

For instance, take a look at Computer Science in 2005. You'll see three dissertations along the radial line to Linguistics - those are the three students that graduated from the Stanford NLP group that year. There are plenty of other places you find similar things that work, and also places where things don't work as nicely as you'd expect.

The visualization Jason built was really interesting from the text modeling perspective, because it let us experiment with many model variations (LDA, tf-idf, etc.) to see how well each matched our intuitions. This model, though still wanting, was by far the best. Good enough, even, for us to put online for the world to play with, and for Hacker News to pick apart ;)


> I'm actually a bit surprised this hit hacker news without any of us authors posting it,

Picked it up from the NLP Lab twitter stream. BTW, thanks for the awesome NLP tool chain! :)


why'd you take the data down? I was writing a scraper in order to make your underlying data into happy open and accessible data: http://scraperwiki.com/scrapers/stanford-dissertations/edit/


I can't speak for Dan, but it's probably best if you don't do this! I think most of the data itself is UMI data from ProQuest, which seems to be licensed pretty strictly.


boo proprietary data!


I did my PhD in Computer and Systems Engineering (in the EE department of my school), and my thesis involved the use of Computer Vision and AI in the analysis of microscope images of human cells, so I did a lot of work with MDs and biologists. I thought it would be interesting to see which theses overlapped cell biology and Electrical Engineering.

The browser showed two overlaps: "Low-Power dynamic amplifiers for pipelined A/D conversion" and "Precision clock synthesis using direct modulation of front end multiplexers/demultiplexers in high speed serial link transceivers"

The first of these mentioned "cell phones" in the abstract. There was no evidence of any cell biology link in the second.

The visualization may be interesting, but I'm not so confident in the quality of the data.


Interesting setup, but it seems pretty wildly incorrect at times.

For instance:

Comp Sci -> Ethics: "Designing interactions that combine pen, paper, and computer"

Comp Sci -> Radiology: "Securing untrustworthy software using information flow control"

Comp Sci 98 -> Physiology: "Consistent overhead byte stuffing"

Could be a heck of a lot better. Especially given the long almost-locked-up pauses, and the inability to keep a block of text up when you move the mouse away. All the little things add up, making me doubt the creator used it themselves at all, aside from making sure it functioned.


I'm a little annoyed that they used the phrase "topic distance" but it looks like they're using something which is in some cases very asymmetric (the 'distance' between CS and EE is not the same as between EE and CS). As a visualization, it's not meaningful because they don't explain what it means for two topics to be 'close'.

I'm guessing they're using something like a KL divergence between the distributions over words (smoothed?) given each topic, but that could be way off.
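
To illustrate why a KL-style measure would be asymmetric, here is a small sketch with add-one smoothed unigram distributions. This only illustrates the guess above, not what the tool is known to compute, and the two word lists are invented toy data:

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes p and q share keys and q is nonzero everywhere."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)

# Invented toy "department" texts.
cs = "algorithm network circuit algorithm compiler".split()
ee = "circuit amplifier circuit network signal".split()
vocab = sorted(set(cs) | set(ee))

p_cs = unigram_dist(cs, vocab)
p_ee = unigram_dist(ee, vocab)

kl_cs_ee = kl_divergence(p_cs, p_ee)  # "distance" from CS to EE
kl_ee_cs = kl_divergence(p_ee, p_cs)  # "distance" from EE to CS
```

The two values differ for this toy data, so a measure like this is a divergence rather than a true distance. Smoothing matters too: without it, any word that one department never uses would make the divergence infinite.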


Is it possible to obtain the data for this somehow? I'd be interested in making a different, more straightforward columnar text-only view.


In Flash? C'mon!



