
Metrics-shmetrics. Once I stop seeing StackOverflow clones listed above StackOverflow's original pages, I will gladly believe that Google's search quality is "better than ever before."


I've been tracking how often this happens over the last month. It's gotten much, much better, and one additional algorithmic change coming soon should help even more.

I'm not saying that a clone will never be listed above SO, but it definitely happens less often than it did several weeks ago.


My experience is the exact opposite: I am seeing many, many more clone sites in my search results in the last few months. It feels like it increases when I accidentally click a clone site.

This happens for more than StackOverflow clones. Mailing lists, Linux man-pages, FAQs, published Linux articles, etc. all have clone pages that are obvious link farms (sometimes they even include ads that attempt to harm my computer) and that rank higher than the "official" (or at least less noisy) pages.

Ideally, I'd like to be able to completely remove domains from my results, as has been discussed elsewhere on HN. Hopefully Google's upcoming push into social networking will reintroduce a better-implemented "SearchWiki" feature...


I think there is a disconnect in the scale at which you are both commenting:

Comment #1:

  > I've been tracking how often this happens over the last month.
  > <snip>
  > it definitely happens less often compared to a several weeks
  > ago.
Comment #2:

  > I am seeing many, many more clone sites in my search
  > results in the last few months
You can't argue against "things have gotten better in the last week" with "things have been getting worse for the last 6 months."


Try DuckDuckGo. Gabriel has been aggressive about removing unsavory domains, and I've been fairly impressed. I think that Google probably can't be nearly as aggressive for political reasons.


The reason Google isn't doing the same thing as DuckDuckGo is most likely that manually banning domains, rather than improving the algorithms to avoid unwanted behavior, only works temporarily and only in select cases. There will always be new spam and content farm sites.


There seem to be a small number of large content farms (perhaps suggesting economies of scale are pretty important). In that case, manually killing them will work well for Herr Weinberg.


Over at blekko, we leave in a few large but marginal sites like eHow, and let users kill them with their personal spam slashtags. For smaller spam websites, we can frequently use AdSense IDs to kill them in groups.
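
Roughly, the grouping step looks something like this (a toy sketch, not the real pipeline; the input format is made up for illustration):

  import re
  from collections import defaultdict

  # Spam networks often reuse one AdSense publisher ID (ca-pub-<16 digits>)
  # across many throwaway domains, so grouping by that ID lets you target
  # whole clusters at once.
  PUB_ID = re.compile(r"ca-pub-\d{16}")

  def group_domains_by_adsense_id(pages):
      """pages: iterable of (domain, html) pairs from a crawl."""
      groups = defaultdict(set)
      for domain, html in pages:
          for pub_id in set(PUB_ID.findall(html)):
              groups[pub_id].add(domain)
      return groups

  # Any publisher ID shared by many low-quality domains is a candidate
  # for banning the whole group in one action.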


Is Google ever going to go on the record about companies like Demand Media and whether or not they get special treatment from Google?


One specific and ubiquitous example of webspam has been driving me nuts this week: An enterprising spammer has won huge on Amazon Web Services (AWS)-related keywords.

For example: https://encrypted.google.com/search?hl=en&q=aws+s3+emr+p...

Result #4 at the moment is "AWS Developer Forums: Interactions between S3, EMR and HDFS ..." on http://www.hackzq8search.appspot.com/developer...com/...

What's sublime about this example is that:

1. hackzq8search is a clone of AWS's websites (amazonwebservices.com, aws.typepad.com, etc.)

2. hackzq8search is hosted on appspot.com, Google's App Engine domain

3. hackzq8search is over quota, so the site doesn't show any content anyway.

Yet this site was the top search result, beating out the site it was cloning, time and time again on my AWS/EMR-related searches this week.

The one mitigating aspect is that hackzq8search's URL naming scheme is easily decodable: the hackzq8search URL includes the full URL of the cloned page, so I can write a Greasemonkey script to extract the proper original URL.
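
The rewriting logic is roughly this (sketched in Python rather than as the actual Greasemonkey script, and I'm guessing at the exact URL layout from the truncated examples above):

  def recover_original(clone_url):
      # Assumption: everything after the appspot host is the original URL
      # minus its scheme; the real scheme may differ slightly.
      host = "hackzq8search.appspot.com/"
      _, _, tail = clone_url.partition(host)
      return "http://" + tail if tail else clone_url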

I found a glimmer of optimism in that the site's search rankings have been slowly fading this week: I complained about https://encrypted.google.com/search?q=aws+s3+security+sox+pc... on Thursday, and by Friday the hackzq8search result was gone from the first search result page.

It's still not hard to slam some AWS-related keywords into Google and get these bogus results, though.


Someone else already reported this; there's been some weird stuff going on with AWS-related pages, e.g. http://news.ycombinator.com/item?id=2103401. I don't know the exact cause, but I know the indexing team is aware of this issue and working on it.


When searching for a PDF file, I rarely find the actual PDF among the top results. The top results are often PDF/ebook search engines themselves, and they clearly get to the top by gaming Google. I hope this gets fixed too.


If you know it's a pdf that you're looking for, you can add filetype:pdf to your query.
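
For example (an illustrative query; any terms work the same way):

  aws s3 developer guide filetype:pdf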


This is great for HN readers. For most people using Google, not so much, I think.


Have you noticed sites that scrape Google Groups content ranking higher than the group they've scraped? How can that possibly happen? It still seems rampant.


Thanks.


Why not just downrank the results from the SO clones?


Because that wouldn't solve the problem for clones of other sites, or clones in other languages. And the Stack Overflow cloners could just make other websites. That's why a primary instinct in search quality is to look for an algorithmic solution that goes to the root of the problem. That approach works across different languages and sites, and keeps working if someone makes new sites.

To be clear: the webspam team does reserve the right to take manual action to correct spam problems, and we do. That not only helps Google be responsive, it also improves our algorithms, because we can use that data to train better ones. With Stack Overflow, I especially wanted to see Google tackle this instance with algorithms first.


> Because that wouldn't solve the problem for clones of other sites, or clones in other languages.

Sites that are the victims of content cloning have to be very visible and valuable, so maybe a little manual curation could be warranted.

> the Stack Overflow cloners could just make other websites

Not really? The point is not to tag the clones but to tag the original; everything that is not the original and that has copied content is a clone, regardless of its name, domain or country.


That would be a rather awesome feature for evil people: just copy and paste your competition's content onto SO (or any other specially protected site) and their Google ranking will drop like a stone.


Why doesn't Google just count as 'original' whoever published the content first...


This was discussed in an earlier thread, and it seemed like the idea of finding the "original" gets really messy and gameable (and potentially oppressive if curation were used).

The primary input to search engines comes from web crawlers... the notion of "first" for duplicated content is already difficult to determine, and (I would guess) it would get much, much worse in the inevitable arms race if something like this were implemented.


But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?
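
To make the naive version of that concrete, here is my own sketch of "group the duplicates, keep the earliest copy" (not anything Google has described, and it bakes in exactly the assumption the reply below pokes at):

  def shingles(text, k=5):
      """Word k-grams of a page's text."""
      words = text.lower().split()
      return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

  def jaccard(a, b):
      return len(a & b) / len(a | b) if a | b else 0.0

  def pick_originals(pages, threshold=0.8):
      """pages: list of dicts with 'url', 'text', 'crawled_at' (sortable)."""
      groups = []
      for page in pages:
          sig = shingles(page["text"])
          for group in groups:
              if jaccard(sig, group["sig"]) >= threshold:
                  group["members"].append(page)
                  break
          else:
              groups.append({"sig": sig, "members": [page]})
      # Naive step: call the earliest-crawled copy in each group "the original".
      # Earliest-crawled is not the same as earliest-published, which is the catch.
      return [min(g["members"], key=lambda p: p["crawled_at"]) for g in groups]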

Because we don't understand what's hard, we think you're not really trying, and then we make up evil reasons to explain that.

I believe if people understood better the difficulties of spam fighting they would be more understanding.


> But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?

Not necessarily. The rate at which Google refreshes its crawl of a site, and how deeply it crawls, depend on how often the site updates and on its PageRank. If a scraper site updates more often and has higher PageRank than the sites it's scraping, Google will be more likely to find the content there than at its source. Identifying the scraper copy as canonical because it was encountered first would be wrong.


Do you think that if Google educated publishers and webmasters to use a rel="original" tag as a best practice, similar to the rel="canonical" tag, it would help Google group similar content together, identify the original content, and improve search quality?


What would stop the site that is scraping content from using the rel="original" tag too?


Matt, would it make more sense to put more weight on the number of web page visitors (as recorded by Google Toolbar), as opposed to the number of incoming links?


I agree with what you're saying: there needs to be an algorithmic approach.

But I'd like to say one other thing. Why is Google only doing something about web spam now, after people have pointed out how bad things have been getting? Has anybody considered creating a small team just to oversee public perception of the search results and stay on top of things in the future?


Can you provide a query where that is still the case? It hasn't happened to me for a few weeks, since Stack Overflow changed their title SEO.


Happened to me not ten minutes ago with the search string "pass json body to spring mvc"

The efreedom answer at the 5th position is actually the most relevant - the stackoverflow question from which it was copied doesn't even show up on the first page. There is one stackoverflow result on the first page, but it deals with a more complex related issue, not the simple question I was looking for.


In Google's most recent cache, the efreedom result has the word "pass" on the page due to some related links content near the bottom, whereas the stackoverflow page does not. If you modify your query to [parse json body to spring mvc], stackoverflow is at position #1, and efreedom is at position #4. This still has room for improvement, but it would seem like the simplest explanation is just the better match on your query terms.


Didn't notice that - that's good to know. That's actually exactly how I'd expect a good search engine to behave. As annoyed as I am when I get a junk result, I'd be even more pissed if Google dropped terms from my query just so it can return a more popular site.

Of course, then all the content-copy farms will respond by copying valid content plus word lists - hopefully Google knows how to detect that.


To be completely fair, the very first link (from the official SpringSource blog) also appears to answer your question.


True, it does... I just noticed this because I've actually gotten into the habit of scanning for stackoverflow results first; they're almost always right on the money, and it's less cognitive overhead to read a site format I'm familiar with, with extraneous discussion well tucked away.

It almost feels like a cache miss when I have to drop down to the official site/documentation, since that typically requires a greater time investment to read through to find the relevant sections.

I guess that's a tribute to how well stackoverflow works, most of the time. And also to how lazy I am.


Thanks for the concrete query--I'm happy to ping the indexing team to make sure it's not tickling any unusual bugs or coverage issues. Jeff's original blog post helped us uncover a few of those things to improve.


If you're looking for SO results why not use 'site:stackoverflow.com'? That would clear out everything else.
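
E.g., for the query discussed above:

  pass json body to spring mvc site:stackoverflow.com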


http://www.google.com/search?q=delayed_job+delay+priority...

Stack Overflow comes in at number 8, while clones are at 6 and 10.


Assuming we are looking at the same results, the pages at position #6 and #10 are not copies of the stackoverflow content at position #8. They are copies of http://stackoverflow.com/questions/1399293/test-priorities-o.... Unfortunately, the only place that the word "delay" (which is in your query) sometimes appears on that stackoverflow page is in the "related" links in the right column. At the time Google last crawled that stackoverflow page (see the cache), "delay" wasn't on the page, only "delayed". Whereas, the last time Google crawled the other two pages you mentioned, they did have "delay" on the page. Google should still be able to do better, but this little complication certainly makes things more difficult.


Yeah... it might not be the best example, as what I was searching for is not currently possible, so there are no correct results for it.


One UI issue we've struggled with is how to tell the user that there isn't a good result for their query. This comes up all the time when we evaluate changes that remove crap pages. For nearly any search you do, something will come up, just because our index is enormous. If the only thing in the result set that remotely matches the query intent is a nearly empty page on a scummy site, is that better or worse than having no remotely relevant results at all? I definitely lean towards it being worse, but many people disagree.


I also have one spam site example: http://www.google.com/search?q=internet+phone+service Look at the 3rd result, internetphoneguide.org.


What I find seriously bad is that even a huge site like stackoverflow has to optimize its search engine strategy to fight the problem. Little web sites are doomed.


SEO is supposed to be StackOverflow's core competency. They are completely aware most people end up on their site via Google. The search on their own site sucks.

The reason Q&A sites are so visible is that people tend to type questions in their search engines, so Q&A sites are a good match to those.


Not really; a big website cannot cover all keywords in its niche, no matter how big it is. The strategy for small sites is to focus on long-tail keywords (3-4 terms) and outrank the big guys.


Yes, but it sucks that Stack Overflow had to add that to the title to fight the spammers, because it is often distracting to see it in the search results.


Agreed. My ideal search engine wouldn't require real websites to play in the SEO arms-race to beat out the junk sites.


That ideal search engine would quickly find itself the target of people trying to gain an advantage by figuring out how it works.

And then another SEO cycle would start. Don't forget that before Google came along, nobody was trying to 'game the system' with backlinks and other trickery; the fact that Google is successful is what caused people to start gaming Google.


If it were "ideal", it wouldn't be game-able. I'm not going to claim that this ideal is possible!


Any real-world search engine is going to be analyzed until enough of its internal mechanisms are laid bare to allow gaming to some extent.

Typically you pretend the search engine is a black box, you observe what goes in to it (web pages, links between them and queries) and you try to infer its internal operations based on what comes out (results ranked in order of what the engine considers to be important).

Careful analysis will then reveal, to a greater or lesser extent, which elements matter most to it, and then the gaming will commence. Only by drastically changing the algorithm faster than the gamers can reverse-engineer its inner workings would a search engine be able to keep ahead, but there are only so many ways in which you can, realistically speaking, build a search engine with present technology.

Your ideal, I'm afraid, is not going to be built any time soon; if you have any ideas on how to go about it, I'm all ears.


I think the solution is a diversity of search engines, maybe even vertical search engines. These days I get such shitty results from Google for programming-related searches that I've started going straight to SO and searching there. If I don't find it there, I try Google, then Google Groups search.


I'm a programmer, and I'm as annoyed as you about the SO clones. But keep in mind the vast, vast majority of Google users couldn't care less about StackOverflow.

Moreover, the unique licensing around SO content, along with its mass, presents an interesting edge case for Google. They should of course fix it, but it's not indicative of the average or modal experience.



