Geoff Pullum raises an interesting question about the proper terminology for citing Google hits as measurements of corpus frequency. SC applauds Prof. Pullum's efforts, but thinks that he could have pursued the analogy to existing computer terminology even further.
Briefly, Prof. Pullum's suggestion is that we should use Ghit (or Gh) as the unit of corpus frequency, with large quantities being noted as KGh, MGh, and GGh (analogous to existing KB, MB, and GB for numbers of bytes). However, he also lays down a condition which is difficult to satisfy as the number of hits gets high:
A pattern gets n Ghits if and only if searching the web using Google yields n distinct web pages that contain tokens of the pattern. I do think the pages should be distinct: it seems to me that duplicate pages should in principle be eliminated if the notion of a Ghit is to mean anything. Since it is perfectly possible for a page on the web to have an identical copy at a different URL (this probably happens quite a bit), it is clearly possible for copies of pages to come up as separate hits in the list when you run a Google search. That means that the number of items on the list returned by the Google search engine will only be a rough approximation to the actual Ghit count for your search string. It also will not be a measure of the number of occurrences of the string on the web: the number of occurrences will be higher than the Gh value because a page will often contain multiple occurrences.
In other words, when Google says you've got 50 hits, this doesn't necessarily mean you've got 50 Ghits. So SC proposes a revision to the nascent terminological reforms of Prof. Pullum.
Just as we distinguish between numbers of bits and bytes by use of capitalization, we can distinguish between total hits and unique occurrences with capitalization. When one wishes to write "one thousand bits", it is notated as 1 Kb; "one thousand bytes" is 1 KB. Similarly, one thousand raw hits could be notated as 1 kgh, and one thousand verifiably unique hits could be written as 1 KGh. Or, since Prof. Pullum feels strongly that Google ought to keep their capital letter, they could be 1 kGh and 1 KGH instead.
As for verifying the uniqueness of the pages involved, SC isn't sure how to go about doing it. One could always download all of the returned links, and then attempt to verify that the text is different in each of them. This gets impossible after about 1,000 hits (since Google won't return larger numbers of links), but at least it could be done to validate small-scale measurements. Perhaps there's some way to eliminate pages with identical text from Google's results at search-time, but SC isn't aware of it.
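For small result sets, the download-and-compare idea might look something like the following Python sketch. It assumes the page texts have already been fetched and are held in a dictionary keyed by URL; the names `normalize` and `count_unique_pages`, the sample URLs, and the choice to ignore case and whitespace differences are all SC's illustrative assumptions, not anything Google or Prof. Pullum provides. It only catches copies whose text is identical after that light normalization, so near-duplicates would still be counted separately.

```python
import hashlib
import re
from typing import Dict


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text, so that
    trivially different copies of a page compare as equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def count_unique_pages(pages: Dict[str, str]) -> int:
    """Count pages with distinct normalized text.

    `pages` maps URL -> already-downloaded page text (hypothetical input).
    """
    digests = {
        hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        for text in pages.values()
    }
    return len(digests)


# Made-up sample data: the second entry is a mirror of the first.
pages = {
    "http://example.com/a": "The quick brown fox.",
    "http://example.org/mirror": "The  quick brown fox.\n",
    "http://example.net/b": "A different page about the quick brown fox.",
}

raw_hits = len(pages)                    # what the result list shows
unique_hits = count_unique_pages(pages)  # distinct pages after deduplication
print(raw_hits, unique_hits)
```

Hashing the normalized text rather than comparing every pair of pages keeps the check linear in the number of pages, which matters little at 1,000 links but costs nothing either.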