Project „New Search Engine“
My sets versus PageRank and Panda

1. Errors of search engines

Search engines make two kinds of errors: irrelevance and failure to distinguish spam.
Irrelevance means ordering irrelevant WWW pages ahead of relevant WWW pages. Relevant WWW pages are pages that are reasonably large, of good quality, and that correspond to the searched word.
Failure to distinguish spam means that „Black SEO“ WWW pages appear high in the order (pages that are formally correct but have no content, are wallpapered, pretend to be something other than what they contain, are copied, are link farms, and the like).
It can be estimated that the errors of search engines are split about 50:50 between irrelevance and failure to distinguish spam.
My sets are aimed above all at removing irrelevance, but they are also useful against spam.

2. Single WWW pages versus sets

Until now, search engines have evaluated single WWW pages.
Search engines proceed as follows: for the searched word and for each WWW page on which the searched word is found, they compute an ordering value, then sort these ordering values in descending order and construct the order of the found WWW pages accordingly.
The number of found pages on which the searched word occurs is often very large, from millions to hundreds of millions. To distinguish one million found pages, the ordering values must differ in the sixth decimal place. This leads to randomness and incorrectness, because very small differences (for example, a change of two words) decide the order.
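
The procedure described above (one ordering value per found page, sorted in descending order) can be sketched as follows. This is only an illustration: the function names and the example ordering values are assumptions and do not come from any real search engine. The second function shows how two values that agree to six decimal places can no longer be told apart.

```python
def rank_pages(ordering_values):
    """Return page names sorted by ordering value, highest first."""
    return sorted(ordering_values, key=ordering_values.get, reverse=True)


def tied_at_six_decimals(a, b):
    """True if two ordering values are equal when rounded to six decimals."""
    return round(a, 6) == round(b, 6)


# Hypothetical ordering values for one searched word.
values = {
    "page_c": 0.4100000,
    "page_a": 0.7312504,
    "page_b": 0.7312501,
}

order = rank_pages(values)
close_pair_tied = tied_at_six_decimals(values["page_a"], values["page_b"])
```

Here page_a is placed before page_b only because of a difference in the seventh decimal place, which is the kind of near-random decision the text describes.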
Three years ago I found that it is better to evaluate sets of Internet components (WWW pages, documents, images, audio, video…). To evaluate the sets I use the same criteria as for single WWW pages, but with different weights.
The advantages of the sets:
- the sets are much bigger than single WWW pages and differ much more from each other, so it is easier for the algorithm to evaluate them and to construct their order.
- the ordering values of the sets are at least 10 times larger than the ordering values of WWW pages, which removes some of the randomness and incorrectness. Mathematically, using the sets at least removes the ties between ordering values of WWW pages at the boundaries of the six-decimal decision intervals. As a rough estimate, ordering by sets will be about 10 percent better (more exact) than ordering by single WWW pages.
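
The idea of "same criteria, different weights" can be illustrated with a weighted sum. Everything concrete here is an assumption made for the sketch: the criteria names (size, quality), the weight values, and the choice to obtain a set's criterion value by summing over its components. The text does not specify any of these.

```python
# Hypothetical weights; the same criteria are used for pages and for sets.
PAGE_WEIGHTS = {"size": 0.5, "quality": 0.5}
SET_WEIGHTS = {"size": 0.25, "quality": 0.75}


def ordering_value(criteria, weights):
    """Weighted sum of criterion values."""
    return sum(weights[name] * value for name, value in criteria.items())


def set_criteria(components):
    """Assumed aggregation: add up each criterion over all components."""
    totals = {}
    for component in components:
        for name, value in component.items():
            totals[name] = totals.get(name, 0.0) + value
    return totals


# Three hypothetical components (pages, documents, images...) of one set.
components = [
    {"size": 2.0, "quality": 4.0},
    {"size": 4.0, "quality": 4.0},
    {"size": 6.0, "quality": 8.0},
]

page_value = ordering_value(components[0], PAGE_WEIGHTS)
set_value = ordering_value(set_criteria(components), SET_WEIGHTS)
```

With these made-up numbers the set's ordering value (15.0) is several times larger than that of a single page (3.0), so differences between sets are easier to resolve.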

3. PageRank versus sets

Google uses PageRank (of WWW pages) as one of its criteria. PageRank considers the links between WWW pages.
The evaluation of sets of Internet components differs substantially from the evaluation of PageRank: only the WWW pages that are part of the set are evaluated. These sets cannot be constructed by link exchange or by buying links. Using the sets, the weight of the criterion Rank for single WWW pages can be lowered. The weight of the criterion SetRank (the average or the sum of the Ranks of the WWW pages of the set) is lower still than the weight of Rank for single WWW pages.

The advantages of the sets:
- PageRank can be gamed relatively easily by link exchange or by buying links (especially links from the same field); using sets can eliminate such gaming to some extent (my algorithm does not use backlinks to construct the sets; only the pages that belong to the given set are considered).
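
SetRank, defined above as the average or the sum of the Ranks of the WWW pages of a set, can be written down directly. The function name, the mode parameter and the Rank values are assumptions for this sketch. How the pages of a set are chosen is not shown, because it is not described here.

```python
from statistics import mean


def set_rank(page_ranks, mode="average"):
    """SetRank as the average or the sum of the Ranks of the set's pages."""
    if mode == "average":
        return mean(page_ranks)
    if mode == "sum":
        return sum(page_ranks)
    raise ValueError("mode must be 'average' or 'sum'")


# Hypothetical Ranks of the three WWW pages that belong to one set.
ranks = [2.0, 4.0, 6.0]

average_rank = set_rank(ranks)
summed_rank = set_rank(ranks, mode="sum")
```

Only the pages inside the set contribute, so a link bought on a page outside the set does not change the result.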

4. Panda versus sets

Google introduced Panda. Panda evaluates sites (webs, WWW servers) in order to detect spam (Black SEO). When a site is evaluated as spam, the ordering value of every page within that site is decreased by an amount that depends on the degree of the spam.
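
The site-wide penalty described above can be sketched as follows. This is not Google's implementation, which is not public; it only restates the description in code. The function name, the site names and the penalty values are assumptions.

```python
def apply_site_penalty(ordering_values, site_of, site_penalty):
    """Subtract a flagged site's penalty from every page of that site."""
    return {
        page: value - site_penalty.get(site_of[page], 0.0)
        for page, value in ordering_values.items()
    }


# Hypothetical pages, their sites, and one site flagged as spam.
ordering_values = {"a/1": 5.0, "a/2": 3.0, "b/1": 4.0}
site_of = {"a/1": "a.example", "a/2": "a.example", "b/1": "b.example"}
site_penalty = {"a.example": 2.0}  # larger number = higher degree of spam

penalized = apply_site_penalty(ordering_values, site_of, site_penalty)
```

Every page of a.example loses the same amount, whether or not that particular page is spam.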
For Panda, the sets are whole sites, or alternatively (big) parts of these sites. It proceeds from the whole, from the top down, from „molecules“ to „atoms“. It is aimed only at antispam; it is of no use for solving irrelevance. Besides, Panda probably creates its sets only from WWW pages.
I construct my sets around every WWW page (alternatively, I determine that the given WWW page does not have any set). I proceed per partes, from the bottom up, from „atoms“ to „molecules“. Into the sets I put not only WWW pages but also other components of the Net (documents, images, audio, video). As for the theoretical development, I have about a two-year head start.
The advantages of the sets:
- the sets of Panda (consisting only of WWW pages) are less expressive than my sets (consisting of WWW pages plus other Net components).
- Panda cannot be used for solving irrelevance; my sets can.
- if Panda considers a whole site to be spam, it penalizes all the WWW pages of that site, which has negative consequences when a „clean“ site is considered spam by mistake; my sets are smaller, so a wrong decision has a smaller negative impact (only the set is penalized, not the whole site). In other words, applying the rules of Panda to my sets will make Panda more precise.
- as for Panda, the spammers already know that it concentrates on whole sites, so they can defend themselves by optimizing their sites; as for my sets, the spammers will not know (at least for some time) what the sets are, so they will not be able to defend themselves by optimizing these sets.
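
The point about wrong decisions can be made concrete by counting how many pages a single mistaken spam verdict touches. The groupings below are invented for the illustration; in particular, the membership of the sets is an assumption, since the construction of the sets is not described here.

```python
def affected_pages(groups, flagged):
    """Return the union of the pages of all flagged groups."""
    hit = set()
    for name in flagged:
        hit |= groups[name]
    return hit


# The same four hypothetical pages, grouped once by site and once by set.
sites = {"site.example": {"p1", "p2", "p3", "p4"}}
sets_ = {"set_1": {"p1", "p2"}, "set_2": {"p2", "p3"}}

hit_by_site = affected_pages(sites, ["site.example"])
hit_by_set = affected_pages(sets_, ["set_1"])
```

A mistaken verdict on the site penalizes all four pages, while a mistaken verdict on set_1 penalizes only two of them.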

5. Consequences

The order of found links obtained using the sets is better (more exact) than the order constructed from single WWW pages or by Panda.
My procedure for constructing the sets can be patented.

Implementing my algorithm is simple. Practically no changes to existing programs are necessary; only some files and programs need to be added.

6. Remark

Google certainly does not use my sets.
The proof is its order of links: e.g. when searching for the word „Lednice“, the link to the isolated WWW page is fourth.

7. Links



My sets