What if you had written a post half a decade ago, had it dropped from the index (or at least it stopped appearing in any results) after another, unrelated site copied your entry, and then Google listed that copied (literally copy-and-pasted) page on the first page of search results? I wouldn't have thought this could happen with Google, but apparently it can. (We find the fact that this is possible more interesting than anything else.)
This could be viewed as being similar to something, oh I don't know, like having someone cheat on a test only to have the victim branded the cheater. Ouch! Not fun at all. That's apparently what happened to a post here when searching for this string, as Google's behaviour does appear counter-intuitive at first sight:
Device eth0 does not seem to be present delaying initialization
At first, the page on this site was not listed in any results. Personally I don't even care if it sits on page 5+, but not being listed at all while an exact copy of your content is the #1 entry is another thing altogether. Not until we submitted our page for reconsideration, asking why this was the case, did things change; we are now back in some results. Of course, one can check the Web Archive and Google's cache history to see the exact same content coming from the original for over half a decade. For a moment I thought the cause was the 301 redirects done earlier here to move content from the old links to new, easier-to-remember ones:
But isn't the 301 a permanent redirect? Or do we have to keep the 301 permanently in place to redirect? Not quite sure.
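For what it's worth, "permanent" in a 301 describes the nature of the move to clients and crawlers, not how long the server may serve it: the redirect rule has to stay configured for as long as the old URLs might still be requested. A minimal sketch of that idea, with hypothetical paths (a real site would put these rules in its web server configuration):

```python
# Minimal sketch of a permanent-redirect table. The paths are
# hypothetical examples, not this site's actual URLs.
OLD_TO_NEW = {
    "/2007/eth0-not-present.html": "/eth0-not-present",
}

def handle(path):
    """Return (status, headers) for a request path.

    The 301 must keep being served for old URLs: "permanent" tells
    the client the move will not be reversed; it does not mean the
    rule can be deleted after a while.
    """
    if path in OLD_TO_NEW:
        return 301, {"Location": OLD_TO_NEW[path]}
    return 200, {}
```

So removing the rule later would simply turn the old links into 404s for anyone (or any crawler) that never saw the redirect.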
One thing I would do in a search engine that big is to create an original-date index, or check within web archives for the original content, to prevent this sort of thing. That, I think, would help prevent a lot of the duplication online, especially if the duplication could be counted as an endorsement of the original, or lowered in importance if a reference to the original does not exist. Not quite sure, but just a thought.
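The original-date idea above could be sketched as a first-seen index: record the earliest crawl date for each distinct piece of content and treat the earliest URL as the original. A toy illustration (not Google's actual mechanism; the URLs are made up), using a hash of normalized text as the content key:

```python
import hashlib

def content_key(text):
    """Hash of whitespace-normalized, lowercased text, as a rough content identity."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

# first_seen maps content hash -> (earliest crawl date, URL)
first_seen = {}

def record_crawl(url, text, date):
    """Remember the earliest (date, url) pair seen for this content."""
    key = content_key(text)
    if key not in first_seen or date < first_seen[key][0]:
        first_seen[key] = (date, url)

def original_of(text):
    """Return the URL first seen with this content, if any."""
    entry = first_seen.get(content_key(text))
    return entry[1] if entry else None
```

Fed with crawl data (or Web Archive snapshots), any later copy of the same text would resolve back to the earliest URL, which could then be favored or credited in ranking.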
I do understand Google's seeming struggle here, which apparently goes unnoticed, and this goes out to its developers. The actual search logic must be fairly complex, and with so many updates over the years in the effort to be the better search engine, creating new logic to fix things on one end could conceivably end up breaking stuff on another. So I can see how we could be the victim of a well-meant update. But we feel some sort of index to track original content, however old, to prevent this sort of thing and protect original content might go a long way for Google.
Last but not least, we thank Google for bringing the post back into results. Thank You!
Aug 5 2013:
Opened up a topic for discussing this with Google. Hopefully we can get some interesting replies or a fiery one. ;)
April 7 2013:
So I did some comparisons of the various search engines out there to see the variation in results:
The copied content is on page 2. The page from this site is listed higher.
NOTE: Duplicated content is at the bottom.
So I got a reply from Google:
"We reviewed your site and found no manual actions by the webspam team that might affect your site's ranking in Google. There's no need to file a reconsideration request for your site, because any ranking issues you may be experiencing are not related to a manual action taken by the webspam team."
Which is interesting, because the site wasn't listed at all on any of the dozens of pages we checked earlier. Moreover, the site appeared again once we submitted the reconsideration request, and we must say this was a rather quick turnaround on Google's part. (Of course, thank you for that.)
Still, this brings us back to the original question above, which remains: the ordering versus the other engines.
Or perhaps we can give the benefit of the doubt and assume that some change we made in the last month removed something, unknown to us, that was irking Google. Perhaps the 301s, or a temporary 404 at some point while changes were being made? No idea. All we know is that after submitting the reconsideration, we're back in some search results.
But we're still in the same position as we were in terms of the original question. A summary is in order:
Given a paragraph placed on two separate sites a few months apart, with different markup around the exact same content, how would Google index and rank the two pages?
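For context on how an engine might even decide that two pages carry "the exact same content" despite different markup: one textbook technique (shingling with Jaccard similarity; an assumption here, not a confirmed part of Google's pipeline) compares sets of overlapping word n-grams after the markup has been stripped:

```python
def shingles(text, n=3):
    """Set of overlapping n-word shingles from plain text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "Device eth0 does not seem to be present delaying initialization"
page_b = "Device eth0 does not seem to be present delaying initialization"
print(jaccard(shingles(page_a), shingles(page_b)))  # identical text -> 1.0
```

Under a scheme like this, a literal copy-and-paste scores as a near-perfect match, which is exactly why the remaining question is about ordering and attribution, not detection.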
So we finally got a reply on the Google forum topic, which confirms what we suspected and is, unfortunately, still concerning:
black belt (Level 7):
You're late to the party, pal. Years ago this was a problem that ran many a publisher into serious cash flow problems. It's disappointing that it took so darn long for Google to start catching up to scrapers but it seems that they are getting better. I've found that you must help them help you, e.g. doing periodic audits on your site's content and reporting them to Google for removal from their SERPs. That's the only way I began to finally recover.
Google assigns a probability to each subset of outcomes, and it is recalculated each time the world is re-crawled and the index is rebuilt.
One factor that is probably used in Google's probability calculations is user behavior. So, if a webpage with the same content is better tuned to user behavior, the outcome can be tilted, in a markovesque eight-fold way. But there are hundreds of other factors, so a commitment scheme might be used if you're trying to work out how the algorithm behaves in general outside the Googleplex when applied to a specific set of webpages.
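Taking that reply at face value, the general shape it describes, many factors combined into a score that is recomputed on every index rebuild, can be sketched as a simple weighted sum. This is purely illustrative: the factor names and weights below are invented, not Google's actual signals.

```python
# Purely illustrative: invented factor names and weights. The point
# is only that the score is a function of many inputs, so two pages
# with identical content can still rank differently.
WEIGHTS = {"content_match": 0.5, "user_behavior": 0.3, "links": 0.2}

def score(page_factors):
    """Weighted sum of a page's factor values (each assumed in [0, 1])."""
    return sum(WEIGHTS[f] * page_factors.get(f, 0.0) for f in WEIGHTS)

original = {"content_match": 1.0, "user_behavior": 0.4, "links": 0.6}
copy_page = {"content_match": 1.0, "user_behavior": 0.7, "links": 0.2}
# Identical content_match, yet the other factors can tilt the ordering,
# which would let a scraped copy outrank the original.
```

On each rebuild these inputs are re-measured and the scores recomputed, which is consistent with a page dropping out of, and later reappearing in, the results.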