A Preliminary Study of the Relationship between the h-Index and Excess Citations / Étude préliminaire de la relation entre l’indice de Hirsch (indice-h) et les citations excédentaires
This article presents a study of the average number of excess citations of papers in the h-core, denoted as e²/h, and of the ratio between the e-area and the h-area, denoted as e²/h². Real-world citation data from different countries are studied. It is found that, at the country level, a small set of publications generates a disproportionately large number of citations. Although different countries have different e²/h² values in different fields, average e²/h² values are all above 1. The e²/h values vary widely between fields, reflecting the general citation density in these fields. For cumulative data, the e²/h² and e²/h values each converge quickly. Neither a shifted Zipf nor an exponential model could fit the data.
h-index, h-core, excess citations
Introduction
In this article we study the structure of publication citation data at the country level, a standard topic in informetrics. Our approach is not based on summary statistics, nor does it use detailed publication citation information. Instead, we utilize the framework of the h-index and its resulting subdivisions of the publication citation curve (details are provided below). Yet the framework we provide in this paper can be applied to any topic, not just countries, for which the h-index approach is meaningful.
The introduction of the h-index by Hirsch (2005) has led to a profusion of investigations related to this indicator and to the introduction of a plethora of related h-type indicators. Reviews of these developments have been provided, for example, by Alonso et al. (2009) and Egghe (2010). Almost immediately after the introduction of the h-index by Hirsch, it was pointed out that the h-index approach can be applied not only to the lifetime publications and citations of a scientist but also to most source-item relations (Egghe and Rousseau 2006), the case of library books and their borrowings being a prime example (Y. X. Liu and Rousseau 2009). Moreover, the h-index approach can be applied to any time period; concretely, one may use any publication window and any (usually different) citation window (Liang and Rousseau 2009).
Consider a non-empty set of publications, Pub, and let r(p) denote the rank of publication p in Pub according to the number of citations received during a given citation period. This number of citations is denoted as c(p). Let c(r) denote the number of citations received by the publication ranked r. In a continuous context the h-index, h, is defined as the number h such that c(h) = h. For discrete data this equation may not have a solution or may have more than one solution. Hence, Hirsch (2005) originally defined the h-index as the highest rank such that the first h publications each received at least h citations. When discussing the h-index and related indicators we will assume a continuous context, while when considering real-world data we apply Hirsch’s original definition for discrete data.
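Hirsch's discrete definition can be illustrated with a minimal sketch. The function name `h_index` and the citation counts are hypothetical and serve only as an example; they are not taken from the data studied here.

```python
def h_index(citations):
    """Return the h-index under Hirsch's discrete definition.

    The h-index is the highest rank h such that the first h publications
    (ranked by decreasing citation count) each received at least h citations.
    """
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the publication at this rank still qualifies
        else:
            break         # counts only decrease from here, so we can stop
    return h


# Hypothetical citation counts c(p) for seven publications.
example = [10, 8, 5, 4, 3, 0, 0]
print(h_index(example))   # prints 4
```

In this example the fifth-ranked publication has only 3 citations, so the continuous equation c(h) = h has no integer solution and the discrete rule applies.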
The set Pub can be subdivided into two sub-sets: the first h publications, referred to as the h-core (Rousseau 2006), and the other publications, referred to as the h-tail. In some studies, such as J. Q. Liu et al. (2013), this tail is further subdivided into two parts: the uncited publications and the others. In this contribution we do not pay special attention to the uncited publications, except that we assume that such articles can exist, contrary to many other publications in which uncited publications are not considered (see, e.g., Egghe and Rousseau 2006; Egghe, Guns, and Rousseau 2011; Zhang 2013a).
Methodology
In the context of studies on the h-index, the area under the citation curve (see figure 1) can be divided into three sections. First, the area under the curve is divided into two: the section related to the citations received by the tail publications, called the tail area (denoted as t-area in figure 1), and the section related to the citations received by the publications in the h-core. This second section can, in turn, be subdivided into two disjoint sections: a square corresponding to h² citations (see figure 1), referred to as the h-area, and the remaining part, corresponding to the excess citations, denoted as the e-area. Let C_h denote the number of citations received by the publications in the h-core, and recall that the number of publications in the h-core is by definition equal to h. Using the concept of excess citations, denoted as e (Zhang 2009), we have C_h = e² + h² ≥ 0. The core, excess citations, the tail, and the e-t ratio have been studied, for example, by Chen et al. (2013), J. Q. Liu et al. (2013), Ye and Rousseau (2010), and Zhang (2010, 2013a, 2013b). In this article we focus on the ratios e²/h² and e²/h.
Using the geometric interpretation (see figure 1) we see that e²/h² is the ratio between the e-area and the h-area. This parameter can take a value equal to zero (no excess citations) and has no theoretical upper limit. Moreover, e²/h is the average number of excess citations to papers in the h-core. This parameter, too, has a lower limit equal to zero and no upper limit. These two ratios can be considered as structural measures.
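The subdivision into h-area, e-area, and t-area, and the two ratios built on it, can be sketched for discrete data as follows. The function name, dictionary keys, and sample citation counts are hypothetical choices made for illustration.

```python
def citation_areas(citations):
    """Split total citations into h-area, e-area and t-area (discrete data)."""
    ranked = sorted(citations, reverse=True)
    # With counts in decreasing order, the h-index is the number of ranks
    # whose citation count is at least the rank itself.
    h = sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)
    core_citations = sum(ranked[:h])      # C_h, citations to the h-core
    h_area = h * h                        # the h x h square
    e_area = core_citations - h_area      # excess citations, e^2
    t_area = sum(ranked[h:])              # citations to the h-tail
    return {
        "h": h,
        "core_citations": core_citations,
        "h_area": h_area,
        "e_area": e_area,
        "t_area": t_area,
        # Both ratios are undefined when the h-core is empty.
        "e2_over_h2": e_area / h_area if h else None,
        "e2_over_h": e_area / h if h else None,
    }


# Hypothetical citation counts for seven publications.
areas = citation_areas([10, 8, 5, 4, 3, 0, 0])
print(areas["h"], areas["e_area"], areas["e2_over_h2"], areas["e2_over_h"])
# prints: 4 11 0.6875 2.75
```

Here the three areas (16, 11, and 3) add up to the 30 citations in the sample, which is the decomposition described above.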
In the rest of this article we investigate real-world e²/h² and e²/h ratios based on ESI (Essential Science Indicators) and WoS (Web of Science) data, also paying attention to dynamic aspects (cf. Ye and Rousseau 2008).
Data collection follows the approach taken in earlier studies of h-indices and the h-core (Alonso et al. 2009; Egghe 2010; Ye and Rousseau 2010; Zhang 2009, 2013a, 2013b).
Data
We retrieved data from the ESI database for the period January 1, 2002, to December 31, 2012 (downloaded between March 21 and April 15, 2013). Data were selected for some fields and countries. Specifically, we chose agricultural sciences, chemistry, clinical medicine, engineering, and social sciences (general) as fields and the following countries: Canada, England, France, Germany, Italy, Japan, the United States (the G7 countries but with the UK replaced by England, for ease of data manipulation), Brazil, India, and China (representing the BRIC countries), complemented by three “smaller” countries, namely, Belgium, the Netherlands, and South Korea. We could not include Russia as there are obvious mistakes in the ESI data for this country. Moreover, we retrieved data from Thomson Reuters’s WoS for the time spans 2008, 2008–2009, 2008–2010, 2008–2011, and 2008–2012, for two fields: engineering and chemistry. (Downloads took place between April 29 and May 2, 2013; only the document type “article” was included.)
Informetric results are generally more meaningful when applied to larger units than to smaller ones. For this reason we work at the level of countries. This, however, does not imply that our methodology is inappropriate for smaller units. We note that the h-index for countries has been studied, for example, by Csajbók et al. (2007) and is one of the indicators provided in SCImago’s country rankings (SCImago, 2007–2014). Retrieved data are shown in the Appendix.
Results and comments
Denoting by P the total number of publications in the set under consideration and by C the total number of citations received, table 1 shows the ratios C_h/C, h/P, e²/h², and e²/h for 13 countries in five fields based on 11 years of data (2002–2012) from the ESI database.
The data shown in table 1 lead to five interesting conclusions:
1. At the country level, h-core publications make up a very small portion of all publications (usually under 1%).
2. The corresponding citations, C_h, also form a small part of all citations (about 10%).
3. The ratio C_h/h (about 30 on average) illustrates that a small set of publications generates a disproportionately large number of citations (Albarrán and Ruiz-Castillo 2011).
4. Average e²/h² values, that is, the ratios between the e-area and the h-area, are all above 1 (be it sometimes just slightly) but climb to 1.5 for the social sciences.
5. The average number of excess citations to papers in the h-core (e²/h) varies widely between fields, reflecting the general citation density in these fields.
We calculated the Pearson correlation coefficient, denoted as r, between C_h/C and h/P per field. Results are shown in table 2. These correlations are highly significant. Although this might be expected, we have no knowledge of any publication presenting this relation. We also calculated Pearson’s r between C_h/C and e/h and between h/P and e/h. These correlations are not significant at all (except to some extent for the field of engineering).
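As an illustration of the correlation computation, the following sketch calculates Pearson's r for two equally long series. The function name and the two sample series are hypothetical, not values from table 1 or table 2, and the sketch does not test for significance.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-country values of C_h/C and h/P within one field.
core_citation_share = [0.08, 0.10, 0.12, 0.09, 0.15]
core_publication_share = [0.004, 0.006, 0.008, 0.005, 0.011]
print(round(pearson_r(core_citation_share, core_publication_share), 3))
```

A significance test would additionally require the sample size (here, the number of countries per field) and a reference distribution, which are left out of this sketch.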
Bringing all fields together we observe a moderate (Pearson) correlation between the number of publications and e²/h (r = 0.675) and a weaker, negative correlation between the number of publications and e²/h² (r = −0.334).
For studying dynamic evolutions we chose the fields of engineering and chemistry. Data come from WoS and include Russia. Results are shown in table 3.
The data in table 3 show that the e²/h² and e²/h values converge to stable values, often, but not always, in a decreasing way. This indicates that in most cases the role of excess citations is more important in the beginning than later. This is not surprising, as publications in the h-core largely belong to the oldest period. This phenomenon has been observed in previous studies (e.g., in Figure 4 of Y. X. Liu et al. 2009 for the field of horticulture). For engineering and chemistry the e²/h² ratio is always smaller than 1, in contrast with the results obtained for countries. We further note that the United States has the highest e²/h values in these two fields.
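The cumulative-window calculation can be sketched as below. The function names and the per-year citation counts are hypothetical, and the sketch assumes that each publication carries a single citation count fixed at download time.

```python
def e_ratios(citations):
    """Return (h, e^2/h^2, e^2/h) for a list of citation counts."""
    ranked = sorted(citations, reverse=True)
    h = sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)
    if h == 0:
        return 0, None, None          # ratios undefined for an empty h-core
    e2 = sum(ranked[:h]) - h * h      # excess citations in the h-core
    return h, e2 / (h * h), e2 / h


def cumulative_ratios(citations_by_year, first_year, last_year):
    """Compute the ratios for the windows first_year, first_year..+1, and so on."""
    pooled = []
    results = {}
    for year in range(first_year, last_year + 1):
        # Each window adds the publications of one more publication year.
        pooled.extend(citations_by_year.get(year, []))
        results[(first_year, year)] = e_ratios(pooled)
    return results


# Hypothetical citation counts, grouped by publication year.
hypothetical = {2008: [12, 7, 3, 1], 2009: [9, 4, 2], 2010: [5, 1, 0]}
for window, (h, e2_h2, e2_h) in cumulative_ratios(hypothetical, 2008, 2010).items():
    print(window, h, e2_h2, e2_h)
```

With real data the dictionary would hold one list per publication year from 2008 to 2012, giving the five nested time spans used in table 3.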
Theoretical modelling
In this section we briefly report a negative result. Although we tried, we were not able to fit a theoretical model to our observations. Traditionally, negative results (“failures”) are not reported, but nowadays scientists realize that not doing so may lead to a waste of time and money when colleagues, unaware of earlier attempts, take the same approach (Fanelli 2012; Hobbs 2009). Specifically, in an attempt to find a theoretical model to explain our results we tried a shifted Zipf model (Burrell 2008; Egghe and Rousseau 2012a, 2012b; Glänzel 2008, 2010) and a decreasing exponential model.
Shifted Zipf (power) model
Consider first the framework of the shifted Zipf function, that is, the rank-frequency function

$$g(r) = \frac{B}{r^{\beta}} - 1$$

with B, β > 0. Egghe and Rousseau (2012a) showed that in this situation h is characterized by the relation $(h + 1)h^{\beta} = B$. Then, by definition, we have that the number of citations (items) received by the publications (sources) in the h-core is

$$C_h = \int_0^h g(r)\,dr = \frac{B}{1-\beta}\,h^{1-\beta} - h$$

valid for 0 < β < 1. This number is actually the square of the R-index.

For a shifted system the average μ = 1/(α − 2), where α is the exponent in the equivalent shifted size-frequency system. Following the mathematical deduction shown in Rousseau (2013), the shifted Zipf model leads to

$$\frac{e^2}{h^2} = \mu\,\frac{h+1}{h}$$

For large h, such as in the case of countries, universities, and even famous scientists, we may say that e²/h² ≈ μ. Indeed, even if h is only 25, the ratio (h + 1)/h = 1.04, which is very close to 1. Hence, in the shifted Zipf model e/h > 1 if and only if μ > 1.
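The approximation e²/h² ≈ μ can be checked numerically with a small sketch. It rests on two assumptions that are labelled as such: the rank-frequency function is taken to be g(r) = B/r^β − 1, a form consistent with the relation (h + 1)h^β = B quoted above, and the exponents are linked by β = 1/(α − 1), so that μ = 1/(α − 2) = β/(1 − β). The function name is hypothetical.

```python
def shifted_zipf_core(h, beta):
    """Closed-form h-core quantities for the assumed g(r) = B / r**beta - 1."""
    if not 0 < beta < 1:
        raise ValueError("beta must lie strictly between 0 and 1")
    B = (h + 1) * h ** beta                        # from g(h) = h
    core = B * h ** (1 - beta) / (1 - beta) - h    # integral of g over [0, h]
    e2 = core - h ** 2                             # excess citations
    mu = beta / (1 - beta)                         # assumed: beta = 1/(alpha - 1)
    return B, core, e2, mu


# The value h = 25 is the one mentioned in the text; beta = 0.5 is arbitrary.
h, beta = 25, 0.5
B, core, e2, mu = shifted_zipf_core(h, beta)
print(e2 / h ** 2, mu * (h + 1) / h)   # both are 1.04 up to rounding
```

For h = 25 and β = 0.5 the sketch gives μ = 1 and e²/h² = 1.04, which matches the factor (h + 1)/h = 1.04 mentioned in the text.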
Decreasing exponential model
As an alternative to the Zipfian model we consider an exponential size-frequency function
with a > 0. Then T, the total number of sources (articles), is
and the total number of items (citations) is
Hence, the average μ = 1/a. If 0 < a < 1, then this average is larger than 1; if a > 1, this average is smaller than 1.
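The steps leading to this average can be sketched as a short derivation. It assumes a continuous size-frequency function on j ≥ 0 and writes the normalising constant as K, a symbol introduced here only for illustration and not taken from the source; the symbol A for the total number of items is likewise an assumption.

```latex
% Assumed form of the exponential size-frequency function (K is hypothetical)
f(j) = K e^{-a j}, \qquad j \ge 0, \quad a > 0

% Total number of sources (articles)
T = \int_0^{\infty} K e^{-a j}\, dj = \frac{K}{a}

% Total number of items (citations)
A = \int_0^{\infty} j\, K e^{-a j}\, dj = \frac{K}{a^{2}}

% Average number of items per source
\mu = \frac{A}{T} = \frac{1}{a}
```

The constant K cancels in the ratio, so the average depends only on the decay parameter a, whatever normalisation is chosen.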
Following the mathematical deduction (Rousseau 2013), the decreasing exponential model yields
A comparison between these two models and the real data
However, comparing the theoretical values with real ones, the Zipf (power) model gives values that are (roughly) three times too big, while the exponential model gives values that are (roughly) three times too small (see table 4).
The results show that neither of these models is acceptable for our data. This might not be a surprise, as Redner (1998) has shown that the number of papers with x citations, denoted as N(x), has a power law decay for large x, while for smaller x the curve resembles a stretched exponential (Laherrère and Sornette 1998). Recently, Burrell (2013a, 2013b) also observed that in the context of h-index-related publication citations studies the power law assumption is not accurate. This leads to the problem of adapting Redner’s and Burrell’s ideas to the study of ratios in this contribution. Perhaps these observations suggest that a better model might be a shifted power function with an exponential cutoff (Ye 2011). Yet this is just a suggestion as we were not able to perform the calculations for such a model.
Conclusion
In this article on the structure of publication citation curves we observed the following.
1. Average e²/h² values (ratios between the e-area and the h-area) for the period 2002–2012, averaged over countries per field, are all above 1 (be it sometimes just slightly) but climb to 1.5 for the social sciences. Recall that in theory we could expect values anywhere between zero and infinity. Our study provides concrete information about the relation between h-core citations and excess citations.
2. e²/h (the average number of excess citations to papers in the h-core) varies widely between fields. Yet countries and disciplines that produce more publications and receive more citations than others (e.g., the United States and clinical medicine) have higher e²/h values.
3. The e²/h² and e²/h values converge when calculated over overlapping periods. This was observed for the fields of engineering and chemistry, but the phenomenon is believed to hold generally.
4. At the country level, h-core publications comprise just a very small portion of all publications (usually under 1%).
5. The corresponding citations also form a small part of all citations (about 10%).
6. The ratio C_h/h (about 30 on average), together with the two previous points, illustrates that a small set of publications generates a disproportionately large number of citations.
7. Perhaps unsurprisingly, we found that neither a shifted Zipf nor an exponential model could fit the data.
Restrictions and suggestions for further research
The above results deal with the country level. Similar studies at different levels, for example, at the institutional level or research group level, may clarify our observations. Moreover, the most important challenge we offer our colleagues is to develop a model explaining our findings.
huawn@nju.edu.cn
Department of Mathematics, KU Leuven
ronald.rousseau@uantwerpen.be
yye@nju.edu.cn
Acknowledgements
The authors are grateful for financial support from the National Natural Science Foundation of China, Grant no. 71173187, and the National Social Science Foundation of China, Major Key Project 12&ZD221. They thank anonymous reviewers for their input.