On Page SEO Part 2: An Introduction To Signals of Quality

In the previous tutorial we looked at some basic on page factors, including the alt attribute. It was suggested that every img tag should have an alt attribute, even if the image it refers to is purely decorative. These changes might at first seem a little pedantic, but they make for better accessibility and standards-compliant HTML.


Ensuring pages are accessible and standards compliant can mean a lot of work for webmasters trying to rectify things after a site has gone live, especially if every page contains multiple HTML errors. So is it worth all the bother? The simple fact is that accessible sites are generally more search engine friendly and can be viewed on a wider range of devices and browsers.
Making sure that every piece of HTML on every page validates and meets current accessibility standards is a signal that a business cares about every single visitor to its website. Spammers using ‘throwaway domains’ are more likely to shy away from this type of work because of the labor, time and expense involved.
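As a simple illustration of how this kind of check can be automated, here is a minimal sketch (assuming Python with the third-party requests and BeautifulSoup libraries, and a hypothetical example URL) that lists every img tag on a page that is missing an alt attribute:

    import requests
    from bs4 import BeautifulSoup

    def images_missing_alt(url):
        # Fetch the page and return the src of every img tag with no alt attribute.
        # An empty alt="" (for purely decorative images) still counts as present.
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return [img.get("src", "(no src)")
                for img in soup.find_all("img")
                if not img.has_attr("alt")]

    # Hypothetical example URL
    for src in images_missing_alt("https://www.example.com/"):
        print("missing alt:", src)

Running a check like this across a whole site before launch is far less painful than rectifying hundreds of validation errors afterwards.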
Signals of quality are rarely about relevance. For example, it’s easy to understand why allowing a page to go live as an ‘untitled document’ would harm relevancy; it’s not so obvious why including a telephone number would increase search engine rankings.
There is a distinct difference between quality and relevance, and search engines must balance both in order to deliver the best results. The task of identifying quality is becoming increasingly important due to the amount of low-quality content that is uploaded to the web every day.

Bayesian Filters

Bayesian filtering is used by most modern mail clients as a means of weeding out spam emails from legitimate ones. Search engines use it to categorize documents, and Google uses it to deliver relevant AdSense ads. How do Bayesian filters work? The process starts with a list of sites that have been classified as high quality and another list that has been classified as low quality. The filter looks at both and analyzes the characteristics common to each type of site.
Once the filter has been seeded and the initial analysis completed, it can be used to analyze every page on the web. The clever thing about Bayesian filters is that they continue to spot new characteristics and get smarter over time. Before we delve into any great detail on how Bayesian filters work, here are a couple of quotes from Matt Cutts regarding signals of quality that clearly show Google is addressing the problems caused by low-quality, mass-generated content.
“Within Google, we have seen a lot of feedback from people saying, Yeah, there’s not as much web spam, but there is this sort of low-quality, mass-generated content . . . where it’s a bunch of people being paid a very small amount of money. So we have started projects within the search quality group to sort of spot stuff that’s higher quality and rank it higher, you know, and that’s the flip side of having stuff that’s lower-quality not rank as high.”
“You definitely want to write algorithms that will find the signals of good sites. You know, the sorts of things like original content rather than just scraping someone, or rephrasing what someone else has said. And if you can find enough of those signals—and there are definitely a lot of them out there—then you can say, OK, find the people who break the story, or who produce the original content, or who produce the impact on the Web, and try to rank those a little higher. . . .”
Signals of quality have been mentioned in Google patents, and some specifics have been discussed by Google engineers, so hopefully the days of article mills and article spinners are numbered.

How Bayesian Filtering Works

Although it is known that search engines use Bayesian filtering, the exact algorithms are proprietary and unlikely to be made public. The way Bayesian filters behave, however, is well understood, so let’s start by looking at how Bayesian filtering works.
To begin, a large sample or white list of known good documents (authoritative, highly trusted pages) and a large sample of known bad documents (pages from splogs, scraper sites and the like) are analyzed and the characteristics of each page compared. When a large corpus of documents is compared programmatically, patterns or ‘signals’ emerge that were hitherto invisible. These signals can then be used to provide a numeric value (or percentage likelihood) indicating whether the characteristics of other pages lean towards those of the original sample of good documents or those of the original sample of bad documents.
A simple example would be to compare the words in the good documents with those in the bad documents. If it is discovered that many low-quality pages use terms like ‘buy cheap Viagra’ or have a section on each page for ‘sponsored links’, then other pages that do the same might be of low quality too. Conversely, if it is discovered that high-quality pages often contain a link to a privacy policy or display a contact telephone number, then other pages that do the same might also be high quality.
As the process continues, more signals are uncovered; in this way the filter learns to recognize other traits and whether they are good or bad. There are likely to be many signals of quality measured, each one adding to or subtracting from an overall score of a page’s quality.
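As a toy illustration only, the sketch below shows the general idea in Python: a word-based naive Bayes score with Laplace smoothing and equal class priors. The seed documents, word lists and priors are assumptions made up for this example and bear no relation to any search engine’s actual filter.

    import math
    from collections import Counter

    def train(docs):
        # Count word frequencies across a seeded corpus of documents.
        counts = Counter()
        for doc in docs:
            counts.update(doc.lower().split())
        return counts

    def low_quality_probability(page_text, good_counts, bad_counts):
        # Naive Bayes estimate that a page resembles the low-quality corpus,
        # using Laplace smoothing so unseen words do not zero out the score.
        good_total = sum(good_counts.values())
        bad_total = sum(bad_counts.values())
        vocab = len(set(good_counts) | set(bad_counts))
        log_good = log_bad = 0.0  # equal class priors assumed
        for word in page_text.lower().split():
            log_good += math.log((good_counts[word] + 1) / (good_total + vocab))
            log_bad += math.log((bad_counts[word] + 1) / (bad_total + vocab))
        # Convert the log scores into a probability of being low quality.
        return 1 / (1 + math.exp(log_good - log_bad))

    good = train(["privacy policy contact telephone original research"])
    bad = train(["buy cheap viagra sponsored links buy cheap"])
    print(low_quality_probability("cheap viagra and sponsored links", good, bad))

A real filter would of course look at far more than word counts, feeding in signals such as the presence of a privacy policy link or a contact telephone number, but the scoring principle is the same.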
This means that SEOs, web designers and webmasters need to adopt a holistic approach that takes into account information architecture, relevancy, accessibility, usability, quality, hosting and user experience.

The Link Structure of The Web

Although links will be covered in future tutorials, it makes sense to discuss some of the implications of recent changes in the link structure of the web now. Once upon a time, reciprocal links were all that were needed to achieve top search engine rankings. Because reciprocal links were easy to acquire and made it easy to promote sites of lesser quality until they outranked quality sites, the search engines stepped in and devalued reciprocal links along with PageRank.
One-way links were now the way to go, so a new market in selling one-way links emerged. Search engines again viewed this as a way to game the system, and paid links, if detected, were devalued so that they passed no value whatsoever. The nofollow attribute was introduced so that, amongst other reasons, links could be sold without penalty. It has since been adopted for other purposes and is used on millions of blogs and some of the most popular social sites.
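A link is marked as nofollow simply by adding rel="nofollow" to the anchor tag. The short sketch below (again assuming Python with requests and BeautifulSoup, and a hypothetical example URL) tallies how many links on a page are nofollowed versus followed:

    import requests
    from bs4 import BeautifulSoup

    def count_links(url):
        # Tally links that carry rel="nofollow" against those that do not.
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        followed = nofollowed = 0
        for a in soup.find_all("a", href=True):
            # BeautifulSoup parses rel as a list of tokens, e.g. ["nofollow"]
            if "nofollow" in (a.get("rel") or []):
                nofollowed += 1
            else:
                followed += 1
        return followed, nofollowed

    # Hypothetical example URL
    followed, nofollowed = count_links("https://www.example.com/")
    print("followed:", followed, "nofollowed:", nofollowed)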
URL shortening is also popular and again is used by some of the most popular sites on the web. The upshot of all this is that, although the web continues to grow, the ability of many millions of pages to link out and cast a vote for other pages has been removed. Of course you still get the traffic, which can be substantial if you make the front page of Digg. Because the link graph of the entire web is essentially in recession, search engines are again reevaluating the way they calculate rankings, and quality has many discernible signals.

The Need To Discern Quality

According to a study carried out by WebmasterWorld, the top 15 doorway domains are a haven for spam. The study analyzed popular search terms and discovered that more than 50% of the results were spam; 77% of the results from blogspot.com were found to be spam. The following list shows the level of spam found on the top 15 doorway domains:
Doorway Domain        Spam %
sitegr.com            100%
blog.hix.com          100%
blogstudio.com         99%
torospace.com          95%
home.aol.com           95%
blogsharing.com        93%
hometown.aol.de        91%
usaid.gov              85%
hometown.aol.com       84%
maxpages.com           81%
oas.org                78%
blogspot.com           77%
xoomer.alice.it        77%
netscape.com           74%
freewebs.com           52%
The study shows that, on the keywords tested, some of these domains are used exclusively by spammers, while others have a very high percentage of spam. The reason for this is that these sites provide free blog space, which is a magnet for spammers who need to generate links to low-quality splogs or scraper sites quickly.
The next list compares the percentage of spam sites by top-level domain (TLD):
TLD       Spam %
.info      68%
.biz       53%
.net       12%
.org       11%
.com        4%


This research highlights the incredible amount of spam that exists on the web, but it would be unfair to penalize every .info domain, for example, just because a high percentage of .info domains are used by spammers.
Conversely, it would be unwise to trust every .com even though, in general, they seem to be comparatively spam free. To discern quality, many signals have to be considered, covering every aspect of a website.
The next tutorial in this series will look at on page signals of quality and why quality score is the new PageRank.
