A Preliminary Study of the Relationship between the h-Index and Excess Citations / Étude préliminaire de la relation entre l'indice de Hirsch (indice-h) et les citations excédentaires

Xiaoyuan Yuan; Weina Hua; Ronald Rousseau; Fred Y. Ye

doi:10.1353/ils.2014.0011

Canadian Journal of Information and Library Science

Abstract

This article presents a study of the average number of excess citations of papers in the h-core, denoted as e²/h, and the ratio between the e-area and the h-area, denoted as e²/h². Real-world citation data from different countries are studied. It is found that at the country level, a small set of publications generates a disproportionately large number of citations. Although different countries have different e²/h² values in different fields, average e²/h² values are all above 1. The e²/h values vary widely between fields, reflecting the general citation density in these fields. For cumulative data, the e²/h² and e²/h values each converge quickly. Neither a shifted Zipf nor an exponential model could fit the data.

Résumé

Cet article présente une étude du nombre moyen de citations excédentaires d’articles dans le core-h, noté e²/h et le rapport entre la zone-e et la zone-h, noté e²/h². Les données étudiées sont des données de citations de pays du monde réel. On constate qu’au niveau des pays, un petit groupe de publications génère un montant disproportionné de citations. Bien que différents pays aient des valeurs e²/h² différentes dans différents domaines, les valeurs moyennes e²/h² sont toutes supérieures à l’unité. Les valeurs e²/h varient considérablement suivant les domaines, ce qui reflète la densité générale de citation dans ces domaines. Pour les données cumulatives, chacune des valeurs e²/h² et e²/h converge rapidement. Aucun modèle zipfien décalé ou exponentiel ne correspond aux données.

Keywords

h-index, h-core, excess citations

Mots-clés

indice-h, core-h, citations excédentaires

Introduction

In this article we study the structure of publication citation data at the country level, a standard topic in informetrics. Our approach is not based on summary statistics, nor does it use detailed publication citation information. Instead, we utilize the framework of the h-index and its resulting subdivisions of the publication citation curve (details are provided below). Yet the framework we provide in this paper can be applied to any topic, not just countries, for which the h-index approach is meaningful.

The introduction of the h-index by Hirsch (2005) has led to a profusion of investigations related to this indicator and the introduction of a plethora of related h-type indicators. Reviews of these developments have been provided, for example, by Alonso et al. (2009) and Egghe (2010). Almost immediately after the introduction of the h-index by Hirsch, it was pointed out that the h-index approach can be applied not only to the lifetime publications and citations of a scientist but also to most source-item relations (Egghe and Rousseau 2006), the case of library books and their borrowings being a prime example (Y. X. Liu and Rousseau 2009). Moreover, the h-index approach can be applied to any time period; concretely, one may use any publication window and any (usually different) citation window (Liang and Rousseau 2009).

Consider a non-empty set of publications, Pub, and let r(p) denote the rank of publication p in Pub according to the number of citations received during a given citation period. This number of citations is denoted as c(p). Let c(r) denote the number of citations received by the publication ranked r. In a continuous context the h-index, h, is defined as the number h such that c(h) = h. For discrete data this equation may not have a solution or may have more than one solution. Hence, Hirsch (2005) originally defined the h-index as the highest rank such that the first h publications each received at least h citations. When discussing the h-index and related indicators we will assume a continuous context, while when considering real-world data we apply Hirsch’s original definition for discrete data.
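As an illustration of the discrete definition above, the following minimal Python sketch computes the h-index from a list of citation counts. The function name `h_index` and the example counts are hypothetical and serve only to make the definition concrete; they are not taken from the data of this study.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Return the highest rank h such that the first h publications
    (ranked by decreasing citation count) each received at least h citations."""
    ranked = sorted(citations, reverse=True)  # c(1) >= c(2) >= ...
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the first `rank` publications all have >= rank citations
        else:
            break         # c(r) is non-increasing, so no later rank can qualify
    return h


# Hypothetical citation counts, for illustration only.
example = [10, 8, 5, 4, 3, 0]
print(h_index(example))
```

Uncited publications (a count of 0) are allowed in the input and simply fall in the h-tail.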

The set Pub can be subdivided into two sub-sets: the first h publications, referred to as the h-core (Rousseau 2006), and the other publications, referred to as the h-tail. In some studies, such as J. Q. Liu et al. (2013), this tail is further subdivided into two parts: the uncited publications and the others. In this contribution we do not pay special attention to the uncited publications, except that we assume that such articles can exist, contrary to many other publications in which uncited publications are not considered (see, e.g., Egghe and Rousseau 2006; Egghe, Guns, and Rousseau 2011; Zhang 2013a). We now turn to the further subdivisions used in this study.

Figure 1. Areas under the citation curve

Methodology

In the context of studies on the h-index, the area under the citation curve (see figure 1) can be divided into three sections. First, the area under the curve is divided into two: the section related to the citations received by the tail publications, called the tail area (denoted as t-area in figure 1), and the section related to the citations received by the publications in the h-core. This second section can, in turn, be subdivided into two disjoint sections: a square corresponding to h² citations (see figure 1), referred to as the h-area, and the remaining part, corresponding to the excess citations, denoted as the e-area. Let C_h denote the number of citations received by the publications in the h-core, and recall that the number of publications in the h-core is by definition equal to h. Using the concept of excess citations, denoted as e (Zhang 2009), we have C_h = e² + h², with e² ≥ 0. The core, excess citations, the tail, and the e-t ratio have been studied, for example, by Chen et al. (2013), J. Q. Liu et al. (2013), Ye and Rousseau (2010), and Zhang (2010, 2013a, 2013b). In this article we focus on the ratios e²/h² and e²/h.

Using the geometric interpretation (see figure 1) we see that e²/h² is the ratio between the e-area and the h-area. This parameter can take a value equal to zero (no excess citations) and has no theoretical upper limit. Moreover, e²/h is the average number of excess citations to papers in the h-core. This parameter also has a lower limit equal to zero and no upper limit. These two ratios can be considered as structural measures.
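The two structural ratios can be computed directly from a list of citation counts. The sketch below is a minimal illustration, assuming the discrete h-index and the relation C_h = e² + h²; the function name `h_core_structure`, the dictionary keys, and the example counts are hypothetical.

```python
def h_core_structure(citations):
    """Compute h, C_h, e^2 and the ratios e^2/h^2 and e^2/h for a list of citation counts."""
    ranked = sorted(citations, reverse=True)
    # For a non-increasing sequence, the number of ranks r with c(r) >= r equals h.
    h = sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)
    core_citations = sum(ranked[:h])   # C_h: citations received by the h-core
    e2 = core_citations - h * h        # excess citations: e^2 = C_h - h^2
    return {
        "h": h,
        "C_h": core_citations,
        "e2": e2,
        # Both ratios are undefined when h = 0 (no cited publications).
        "e2_over_h2": e2 / h**2 if h else None,
        "e2_over_h": e2 / h if h else None,
    }


# Hypothetical citation counts, for illustration only.
print(h_core_structure([10, 8, 5, 4, 3, 0]))
```

For these hypothetical counts the h-core holds four publications with 27 citations in total, so 11 of them are excess citations.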

In the rest of this article we investigate real-world e²/h² and e²/h ratios based on ESI (Essential Science Indicators) and WoS (Web of Science) data, also paying attention to dynamic aspects (cf. Ye and Rousseau 2008).

Data collection follows the approach taken in earlier studies of h-indices and the h-core (Alonso et al. 2009; Egghe 2010; Ye and Rousseau 2010; Zhang 2009, 2013a, 2013b).

Data

We retrieved data from the ESI database for the period January 1, 2002, to December 31, 2012 (downloaded between March 21 and April 15, 2013). Data were selected for some fields and countries. Specifically, we chose agricultural sciences, chemistry, clinical medicine, engineering, and social sciences (general) as fields and the following countries: Canada, England, France, Germany, Italy, Japan, the United States (the G7 countries but with the UK replaced by England, for ease of data manipulation), Brazil, India, and China (representing the BRIC countries), complemented by three “smaller” countries, namely, Belgium, the Netherlands, and South Korea. We could not include Russia as there are obvious mistakes in the ESI data for this country. Moreover, we retrieved data from Thomson Reuters’s WoS for the time spans 2008, 2008–2009, 2008–2010, 2008–2011, and 2008–2012, for two fields: engineering and chemistry. (Downloads took place between April 29 and May 2, 2013; only the document type “article” was included.)

Informetric results are generally more meaningful when applied to larger units than to smaller ones. For this reason we work at the level of countries. This, however, does not imply that our methodology is inappropriate for smaller units. We note that the h-index for countries has been studied, for example, by Csajbók et al. (2007) and is one of the indicators provided in SCImago’s country rankings (SCImago 2007–2014). Retrieved data are shown in the Appendix.

Results and comments

Denoting by P the total number of publications in the set under consideration and by C the total number of citations received, table 1 shows the ratios C_h/C, h/P, e²/h², and e²/h for 13 countries in five fields based on 11 years of data (2002–2012) from the ESI database.

The data shown in table 1 lead to five interesting conclusions:

1. At the country level, h-core publications make up a very small portion of all publications (usually under 1%).
2. The corresponding citations, C_h, also form a small part of all citations (about 10%).
3. The ratio C_h/h (about 30 on average) illustrates that a small set of publications generates a disproportionately large number of citations (Albarrán and Ruiz-Castillo 2011).
4. Average e²/h² values, that is, the ratios between the e-area and the h-area, are all above 1 (be it sometimes just slightly) but climb to 1.5 for the social sciences.
5. The average number of excess citations to papers in the h-core (e²/h) varies widely between fields, reflecting the general citation density in these fields.

Table 1. Values for 13 countries in five fields, 2002–2012 (Essential Science Indicators data)

Table 2. Pearson correlation between C_h/C and h/P per field

We calculated the Pearson correlation coefficient, denoted as r, between C_h/C and h/P per field. Results are shown in table 2. These correlations are highly significant. Although this might be expected, we have no knowledge of any publication presenting this relation. We also calculated Pearson’s r between C_h/C and e/h and between h/P and e/h. These correlations are not significant at all (except to some extent for the field of engineering).

Bringing all fields together, we observe a moderate (Pearson) correlation between the number of publications and e²/h (r = 0.675) and a weaker, negative correlation between the number of publications and e²/h² (r = −0.334).
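For readers who wish to repeat this kind of calculation, the following minimal sketch computes Pearson's r for two equally long lists of values. The function name `pearson_r` is hypothetical, and the example lists are invented purely to exercise the function; they are not the values of table 1 and do not reproduce the coefficients reported above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-country values of C_h/C and h/P, for illustration only.
ch_over_c = [0.08, 0.10, 0.12, 0.15]
h_over_p = [0.004, 0.006, 0.007, 0.010]
print(round(pearson_r(ch_over_c, h_over_p), 3))
```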

For studying dynamic evolutions we chose the fields of engineering and chemistry. Data come from WoS and include Russia. Results are shown in table 3.

Table 3. Evolution of e²/h² and e²/h for chemistry and engineering based on cumulative data (data from Web of Science)

The data in table 3 show that the e²/h² and e²/h values converge to stable values, often, but not always, in a decreasing way. This indicates that in most cases, the role of excess citations is more important in the beginning than later. This is not surprising as publications in the h-core largely belong to the oldest period. This phenomenon has been observed in previous studies (e.g., in Figure 4 of Y. X. Liu et al. 2009 for the field of horticulture). For engineering and chemistry the e²/h² ratio is always smaller than 1, in contrast with the results obtained for countries. We further note that the United States has the highest e²/h values in these two fields.
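The cumulative calculation can be sketched as follows, assuming each publication is represented by its publication year and a citation count taken at a single download date. The function names `ratios` and `cumulative_ratios` and the records are hypothetical; the sketch only shows how the time spans 2008, 2008–2009, and so on are built up.

```python
def ratios(citations):
    """Return (h, e^2/h^2, e^2/h) for a list of citation counts."""
    ranked = sorted(citations, reverse=True)
    h = sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)
    if h == 0:
        return 0, None, None          # ratios undefined without an h-core
    e2 = sum(ranked[:h]) - h * h      # excess citations of the h-core
    return h, e2 / h**2, e2 / h


def cumulative_ratios(records, first_year, last_year):
    """records: iterable of (publication_year, citations) pairs.
    Returns a dict keyed by (first_year, end_year) for each cumulative time span."""
    records = list(records)           # allow several passes over the data
    result = {}
    for end_year in range(first_year, last_year + 1):
        window = [cites for year, cites in records if first_year <= year <= end_year]
        result[(first_year, end_year)] = ratios(window)
    return result


# Hypothetical records (publication year, citations), for illustration only.
records = [(2008, 9), (2008, 4), (2009, 6), (2009, 1), (2010, 3)]
for span, values in cumulative_ratios(records, 2008, 2010).items():
    print(span, values)
```

In this toy data set the ratios decrease when the second year is added and stay unchanged when the third is added, which mimics, on a very small scale, the kind of stabilization described above.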

Theoretical modelling

In this section we briefly report a negative result. Although we tried, we were not able to fit a theoretical model to our observations. Traditionally, negative results (“failures”) are not reported, but nowadays scientists realize that not doing so may lead to a waste of time and money when colleagues, unaware of earlier attempts, take the same approach (Fanelli 2012; Hobbs 2009). Specifically, in an attempt to find a theoretical model to explain our results we tried a shifted Zipf model (Burrell 2008; Egghe and Rousseau 2012a, 2012b; Glänzel 2008, 2010) and a decreasing exponential model.

Shifted Zipf (power) model

Consider first the framework of the shifted Zipf function, that is, the rank-frequency function

(1) $c(r) = \dfrac{B}{r^{\beta}} - 1$

with B, β > 0. Egghe and Rousseau (2012a) showed that in this situation h is characterized by the relation (h + 1)h^β = B. Then, by definition, the number of citations (items) received by the publications (sources) in the h-core is

(2) $C_h = \int_0^h c(r)\,dr = \dfrac{B}{1-\beta}\,h^{1-\beta} - h = \dfrac{h(h+1)}{1-\beta} - h$

valid for 0 < β < 1. This number is actually the square of the R-index.

For a shifted system the average μ = 1/(α − 2), where α is the exponent in the equivalent shifted size-frequency system, so that β = 1/(α − 1) and β/(1 − β) = μ. Following the mathematical deduction shown in Rousseau (2013), the shifted Zipf model leads to

(3) $\dfrac{e^2}{h^2} = \dfrac{C_h - h^2}{h^2} = \dfrac{h+1}{h}\cdot\dfrac{\beta}{1-\beta} = \dfrac{h+1}{h}\,\mu$

For large h, such as in the case of countries, universities, and even famous scientists, we may say that e²/h² ≈ μ. Indeed, even if h is only 25, the ratio (h + 1)/h = 1.04, which is very close to 1. Hence, for large h, in the shifted Zipf model e/h > 1 if and only if μ > 1.
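The relation between e²/h² and μ can be checked numerically. The sketch below rests on two assumptions that are not spelled out in this copy of the text: that the shifted Zipf function has the form c(r) = B/r^β − 1, which is consistent with the stated relation (h + 1)h^β = B, and that β = 1/(α − 1), the usual link between the rank-frequency and size-frequency exponents. The function names and parameter values are hypothetical.

```python
import math


def solve_h(B, beta):
    """Solve (h + 1) * h**beta = B for h > 0 by bisection (left side increases with h)."""
    if B <= 0 or beta <= 0:
        raise ValueError("B and beta must be positive")
    lo, hi = 0.0, 1.0
    while (hi + 1) * hi**beta < B:    # enlarge the bracket until it contains the root
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if (mid + 1) * mid**beta < B:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def zipf_check(B, beta):
    """Return (h, e^2/h^2 from the integral of c(r), ((h + 1)/h) * mu)."""
    if not 0 < beta < 1:
        raise ValueError("the h-core integral converges only for 0 < beta < 1")
    h = solve_h(B, beta)
    core = B * h ** (1 - beta) / (1 - beta) - h   # integral of B/r^beta - 1 over (0, h]
    e2 = core - h * h
    alpha = 1 + 1 / beta                          # assumed relation beta = 1/(alpha - 1)
    mu = 1 / (alpha - 2)
    return h, e2 / h**2, (h + 1) / h * mu


# Hypothetical parameters, for illustration only.
h, ratio_from_integral, ratio_from_mu = zipf_check(B=1000.0, beta=0.5)
print(h, ratio_from_integral, ratio_from_mu)
```

Under these assumptions the two ratio values agree up to rounding, and both approach μ as h grows.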

Decreasing exponential model

As an alternative to the Zipfian model we consider an exponential size-frequency function

(4) $f(j) = K e^{-a j}, \quad j \ge 0$

with K, a > 0. Then T, the total number of sources (articles), is

(5) $T = \int_0^{\infty} K e^{-a j}\,dj = \dfrac{K}{a}$

and the total number of items (citations) is

(6) $\int_0^{\infty} j\,K e^{-a j}\,dj = \dfrac{K}{a^2}$

Hence, the average μ = 1/a. If 0 < a < 1, then this average is larger than 1; if a > 1, this average is smaller than 1.

The corresponding rank-frequency function is c(r) = (1/a) ln(T/r), so that h satisfies ln(T/h) = ah and C_h = h² + h/a. Following the mathematical deduction (Rousseau 2013), the decreasing exponential model yields

(7) $\dfrac{e^2}{h^2} = \dfrac{\mu}{h}, \qquad \dfrac{e^2}{h} = \mu$
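A numerical check of the exponential model can be sketched in the same way. The sketch assumes a size-frequency function of the form f(j) = K·exp(−aj) with T = K/a sources, which gives the rank-frequency function c(r) = (1/a)·ln(T/r). These forms, the symbol K, the function names, and the parameter values are assumptions made for illustration. Under them the excess citations satisfy e²/h = μ = 1/a.

```python
import math


def solve_h_exponential(T, a):
    """Solve ln(T / h) = a * h for 0 < h < T by bisection."""
    if T <= 0 or a <= 0:
        raise ValueError("T and a must be positive")
    lo, hi = 0.0, T                    # ln(T/h) - a*h is positive near 0 and negative at h = T
    for _ in range(200):
        mid = (lo + hi) / 2
        if math.log(T / mid) - a * mid > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exponential_check(T, a):
    """Return (h, e^2/h^2, e^2/h, mu) for the assumed exponential model."""
    h = solve_h_exponential(T, a)
    # Integral of (1/a) * ln(T/r) over (0, h] equals (h * ln(T/h) + h) / a.
    core = (h * math.log(T / h) + h) / a
    e2 = core - h * h
    mu = 1 / a
    return h, e2 / h**2, e2 / h, mu


# Hypothetical parameters, for illustration only.
h, e2_over_h2, e2_over_h, mu = exponential_check(T=10000.0, a=0.1)
print(h, e2_over_h2, e2_over_h, mu)
```

In this model e²/h² = μ/h, so for large h the e-area is small compared with the h-area, in contrast with the shifted Zipf model.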

A comparison between these two models and the real data

Comparing the theoretical values with the real ones, we find that the shifted Zipf (power) model gives values that are roughly three times too large, while the exponential model gives values that are roughly three times too small (see table 4).

The results show that neither of these models is acceptable for our data. This might not be a surprise, as Redner (1998) has shown that the number of papers with x citations, denoted as N(x), has a power law decay for large x, while for smaller x the curve resembles a stretched exponential (Laherrère and Sornette 1998). Recently, Burrell (2013a, 2013b) also observed that in the context of h-index-related publication citation studies the power law assumption is not accurate. This leads to the problem of adapting Redner’s and Burrell’s ideas to the study of the ratios considered in this contribution. Perhaps these observations suggest that a better model might be a shifted power function with an exponential cutoff (Ye 2011). Yet this is just a suggestion as we were not able to perform the calculations for such a model.

Table 4. Comparison of real e/h values with two modelling results

Conclusion

In this article on the structure of publication citation curves we observed the following.

1. Average e²/h² values (ratios between the e-area and the h-area) for the period 2002–2012, averaged over countries per field, are all above 1 (be it sometimes just slightly) but climb to 1.5 for the social sciences. Recall that in theory we could expect values anywhere between zero and infinity. Our study provides concrete information about the relation between h-core citations and excess citations.
2. e²/h (the average number of excess citations to papers in the h-core) varies widely between fields. Yet countries and disciplines that produce more publications and receive more citations than others (e.g., the United States and clinical medicine) have higher e²/h values.
3. The e²/h² and e²/h values converge when calculated over overlapping periods. This was observed for the fields of engineering and chemistry, but the phenomenon is believed to hold generally.
4. At the country level, h-core publications comprise just a very small portion of all publications (usually under 1%).
5. The corresponding citations also form a small part of all citations (about 10%).
6. The ratio C_h/h (about 30 on average), together with the two previous points, illustrates that a small set of publications generates a disproportionately large number of citations.
7. Perhaps unsurprisingly, we found that neither a shifted Zipf nor an exponential model could fit the data.

Restrictions and suggestions for further research

The above results deal with the country level. Similar studies at different levels, for example, at the institutional level or research group level, may clarify our observations. Moreover, the most important challenge we offer our colleagues is to develop a model explaining our findings.

Xiaoyuan Yuan

Library of Nanjing University
School of Information Management, Nanjing University
yuanxy@nju.edu.cn

Weina Hua

School of Information Management, Nanjing University
huawn@nju.edu.cn

Ronald Rousseau

Institute for Education and Information Sciences, IBW, University of Antwerp
Department of Mathematics, KU Leuven
ronald.rousseau@uantwerpen.be

Fred Y. Ye

School of Information Management, Nanjing University
yye@nju.edu.cn

Acknowledgements

The authors are grateful for financial support from the National Natural Science Foundation of China, Grant no. 71173187, and the National Social Science Foundation of China, Major Key Project 12&ZD221. They thank anonymous reviewers for their input.

References

Albarrán, P., and J. Ruiz-Castillo. 2011. “References Made and Citations Received by Scientific Articles.” Journal of the American Society for Information Science and Technology 62 (1): 40–49. http://dx.doi.org/10.1002/asi.21448.

Alonso, S., F. J. Cabrerizo, E. Herrera-Viedma, and F. Herrera. 2009. “h-Index: A Review Focused in Its Variants, Computation and Standardization for Different Scientific Fields.” Journal of Informetrics 3 (4): 273–89. http://dx.doi.org/10.1016/j.joi.2009.04.001.

Burrell, Q. L. 2008. “Extending Lotkaian Informetrics.” Information Processing and Management 44 (5): 1794–1807. http://dx.doi.org/10.1016/j.ipm.2008.03.002.

———. 2013a. “Formulae for the h-Index: A Lack of Robustness in Lotkaian Informetrics?” Journal of the American Society for Information Science and Technology 64 (7): 1504–14. http://dx.doi.org/10.1002/asi.22845.

———. 2013b. “The h-Index: A Case of the Tail Wagging the Dog?” Journal of Informetrics 7 (4): 774–83. http://dx.doi.org/10.1016/j.joi.2013.06.004.

Chen, D.-Z., M.-H. Huang, and F. Y. Ye. 2013. “A Probe into Dynamic Measures for h-Core and h-Tail.” Journal of Informetrics 7 (1): 129–37. http://dx.doi.org/10.1016/j.joi.2012.10.002.

Csajbók, E., A. Berhidi, L. Vasas, and A. Schubert. 2007. “Hirsch-Index for Countries Based on Essential Science Indicators Data.” Scientometrics 73 (1): 91–117. http://dx.doi.org/10.1007/s11192-007-1859-9.

Egghe, L. 2010. “The Hirsch Index and Related Impact Measures.” Annual Review of Information Science and Technology 44: 65–114.

Egghe, L., R. Guns, and R. Rousseau. 2011. “Thoughts on Uncitedness: Nobel Laureates and Fields Medalists as Case Studies.” Journal of the American Society for Information Science and Technology 62 (8): 1637–44. http://dx.doi.org/10.1002/asi.21557.

Egghe, L., and R. Rousseau. 2006. “An Informetric Model for the Hirsch-Index.” Scientometrics 69 (1): 121–29. http://dx.doi.org/10.1007/s11192-006-0143-8.

———. 2012a. “The Hirsch Index of a Shifted Lotka Function and Its Relation with the Impact Factor.” Journal of the American Society for Information Science and Technology 63 (5): 1048–53. http://dx.doi.org/10.1002/asi.22617.

———. 2012b. “Theory and Practice of the Shifted Lotka Function.” Scientometrics 91 (1): 295–301. http://dx.doi.org/10.1007/s11192-011-0539-y.

Fanelli, D. 2012. “Negative Results Are Disappearing from Most Disciplines and Countries.” Scientometrics 90 (3): 891–904. http://dx.doi.org/10.1007/s11192-011-0494-7.

Glänzel, W. 2008. “On Some New Bibliometric Applications of Statistics Related to the h-Index.” Scientometrics 77 (1): 187–96. http://dx.doi.org/10.1007/s11192-007-1989-0.

———. 2010. “The Role of the h-Index and the Characteristic Scores and Scales in Testing the Tail Properties of Scientometric Distributions.” Scientometrics 83 (3): 697–709. http://dx.doi.org/10.1007/s11192-009-0124-9.

Hirsch, J. E. 2005. “An Index to Quantify an Individual’s Scientific Research Output.” Proceedings of the National Academy of Sciences of the United States of America 102 (46): 16569–72. http://dx.doi.org/10.1073/pnas.0507655102.

Hobbs, R. 2009. “Looking for the Silver Lining: Making the Most of Failure.” Restoration Ecology 17 (1): 1–3. http://dx.doi.org/10.1111/j.1526-100X.2008.00505.x.

Laherrère, J., and D. Sornette. 1998. “Stretched Exponential Distributions in Nature and Economy: Fat Tails with Characteristic Scales.” European Physical Journal B 2 (4): 525–39. http://dx.doi.org/10.1007/s100510050276.

Liang, L. M., and R. Rousseau. 2009. “A General Approach to Citation Analysis and an h-Index Based on the Standard Impact Factor Framework.” In Proceedings of ISSI, ed. B. Larsen and J. Leta, 143–53. Rio de Janeiro: BIREME/PAHO/WHO and the Federal University of Rio de Janeiro.

Liu, J. Q., R. Rousseau, M. S. Wang, and F. Y. Ye. 2013. “Ratios of h-Cores, h-Tails and Uncited Sources in Sets of Scientific Papers and Technical Patents.” Journal of Informetrics 7 (1): 190–97. http://dx.doi.org/10.1016/j.joi.2012.11.002.

Liu, Y. X., I. K. Ravichandra Rao, and R. Rousseau. 2009. “Empirical Series of Journals h-Indices: The JCR Category Horticulture as a Case Study.” Scientometrics 80 (1): 59–74. http://dx.doi.org/10.1007/s11192-007-2026-z.

Liu, Y. X., and R. Rousseau. 2009. “Properties of Hirsch-Type Indices: The Case of Library Classification Categories.” Scientometrics 79 (2): 235–48. http://dx.doi.org/10.1007/s11192-009-0415-1.

Redner, S. 1998. “How Popular Is Your Paper? An Empirical Study of the Citation Distribution.” European Physical Journal B 4 (2): 131–34. http://dx.doi.org/10.1007/s100510050359.

Rousseau, R. 2006. “New Developments Related to the Hirsch Index.” Science Focus 1 (1): 23–25 [in Chinese]. English translation available: http://eprints.rclis.org/6376/.

———. 2013. “Modelling Some Structural Indicators in an h-Index Context: A Shifted Zipf and a Decreasing Exponential Model.” http://eprints.rclis.org/19896/.

SCImago. 2007–2014. SCImago Journal & Country Rank. http://www.scimagojr.com/countryrank.php.

Ye, F. Y. 2011. “A Theoretical Approach to the Unification of Informetric Models by Wave-Heat Equations.” Journal of the American Society for Information Science and Technology 62 (6): 1208–11. http://dx.doi.org/10.1002/asi.21498.

Ye, F. Y., and R. Rousseau. 2008. “The Power Law Model and Total Career h-Index Sequences.” Journal of Informetrics 2 (4): 288–97. http://dx.doi.org/10.1016/j.joi.2008.09.002.

———. 2010. “Probing the h-Core: An Investigation of the Tail-Core Ratio for Rank Distributions.” Scientometrics 84 (2): 431–39. http://dx.doi.org/10.1007/s11192-009-0099-6.

Zhang, C. T. 2009. “The e-Index, Complementing the h-Index for Excess Citations.” PLoS ONE 4 (5): e5429. http://dx.doi.org/10.1371/journal.pone.0005429.

———. 2010. “Relationship of the h-Index, g-Index, and e-Index.” Journal of the American Society for Information Science and Technology 61 (3): 625–28.

———. 2013a. “The h’ Index, Effectively Improving the h-Index Based on the Citation Distribution.” PLoS ONE 8 (4): e59912. http://dx.doi.org/10.1371/journal.pone.0059912.

———. 2013b. “A Novel Triangle Mapping Technique to Study the h-Index Based Citation Distribution.” Scientific Reports 3:1–5. http://dx.doi.org/10.1038/srep01023.

Appendix. Original data

Table A1. Statistical data from the Essential Science Indicators, 2002–2012

Table A2. Dynamic data for engineering and chemistry from Web of Science
