Whalen, R., Huang, Y., Sawant, A., Uzzi, B. & Contractor, N.: Natural Language Processing, Article Content & Bibliometrics: Predicting High Impact Science

Abstract: In this paper we advocate increased use of textual data to develop new bibliometric methods. To demonstrate text’s potential we propose a new bibliometric method that combines natural language processing with traditional bibliometric techniques to improve high impact science predictions. Relying upon the vast amounts of scholarly data now available online, we assemble a universe of scientific topics and use article text to measure the topical distance between citing and cited papers. We show that accounting for topical distance improves our ability to predict scientific impact. Citations from both topically distant and proximate papers provide more insight into an article’s impact potential than those from papers with middling similarity.
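The abstract's core measurement — topical distance between a citing and a cited paper, derived from article text — can be illustrated with a minimal sketch. The keyword sets and the cosine-over-binary-vectors formulation below are illustrative assumptions, not the paper's actual topic universe or weighting scheme:

```python
# Sketch: topical distance between a citing and a cited paper,
# computed as 1 - cosine similarity over binary keyword vectors.
# The keyword sets are hypothetical, for illustration only.
import math

def topical_distance(keywords_a, keywords_b):
    """Return 1 - cosine similarity of two binary keyword vectors."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 1.0  # no content to compare: maximally distant
    overlap = len(a & b)
    return 1.0 - overlap / math.sqrt(len(a) * len(b))

cited = {"citation analysis", "bibliometrics", "impact"}
citing = {"bibliometrics", "topic models", "nlp"}
print(round(topical_distance(cited, citing), 3))  # 1 shared term of 3x3 -> 0.667
```

Citing papers could then be binned into quartiles of this distance, which is the quantity the reviewers below discuss.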

License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

File: ASCW15_whalen-etal-nlp-article-content-and-bibliometrics.pdf


  1. I was part of the UK university research assessment exercise (REF). There, each output was initially assessed on three quality criteria: novelty, significance and rigour. In terms of the quartiles (and I think picking up on Christian’s point), it seems to me that the distant quartile relates strongly to significance – is this important work with broad impact in the field? The close quartile, by contrast, would be more about rigour and to some extent novelty – for someone who knows the field well, does the deep academic content of the work stand up?


  2. The paper rightly criticizes the flaws of metrics based solely on citation counts for scholarly impact assessment. Instead, the authors suggest taking the content of citing and cited papers into account, which apparently leads to promising results within the limited scope of the conducted experiment. Interestingly, citations from papers with less similar content are found to be an indicator of a paper with high potential. The given explanation (namely that such papers are relevant to a wider community) makes sense, and the results seem to support this hypothesis.

    However, although the suggested method might increase the precision of impact assessment, it does not challenge the concept of measuring impact by citation counts itself. While citations are clearly an indicator of some kind of impact, they tell us nothing about the actual quality of that impact. For example, a paper might receive many citations simply because it comes from a reputable author or institution, or because it makes a very provocative statement that comes in handy to illustrate a point. At the same time, the citing authors do not necessarily agree with the cited work, and perhaps they haven’t even read it. In such cases, citations are an indicator of neither high quality nor high impact – if we are interested in the actual contribution to the knowledge produced in a research area.

    From this perspective, the assumption that citations from papers with little overlapping content are particularly “valuable” is troubling, as it also means that such papers did not actually process much of the knowledge presented in the cited work. On the other hand, papers that engage closely with the cited work would appear less significant, although the actual scholarly impact on those citing authors is obviously much stronger.

    The argument made here is not meant to discourage the suggested method but to encourage taking it further. Instead of merely improving the precision of assessing scholarly impact in the traditional, limited understanding, the method of analyzing the contextual content of citations might be applied to arrive at a better and more differentiated understanding of impact itself. In any case, the proposed approach seems worth pursuing.


  3. Your result that the most topically distant citations (1st quartile) are the best predictors of high future impact seems plausible to me. However, I did not find an explanation of why this also holds for topically similar citations (4th quartile).

    You should add some more information about the analysed papers:
    – Did you only analyse document types “article” and “review”?
    – What was the publication date of the most recent articles? Since it takes some time for articles to accumulate citations, this could also have some influence on the model.
    – When did you collect citation data?


  4. This is an interesting paper on the development of textual analysis to improve bibliometric indicators. The authors propose a method that relies on full-text analysis of scientific papers to characterize citation relations. The main idea is to calculate the ‘topical distance’ among publications based on the presence of common keywords.
    The idea is indeed appealing and very relevant. Recent developments in bibliometrics research also point to the importance of having more nuanced methods in the characterization of citations (e.g. http://dx.doi.org/10.1002/asi.23367). I have the following questions and comments for the paper (with the aim of improving it):

    – The authors say that “[w]e now have sufficient access to data, methods, and computational power to consider not only the presence or absence of a citation, but also the content of both the citing and the cited articles”. This is a quite optimistic claim, but I wonder to what extent it is realistic. In other words, could the authors provide some figure on the number or share of worldwide scientific publications that can nowadays be analyzed with the same (or a similar) methodology as here? My impression is that this possibility is still very uncommon.

    – It is said that the paper will “distinguish between articles that influence particularly diverse areas of science and those that are only cited by articles within a relatively narrow subject area”. I would like to see here a discussion of why this is relevant. For example, if we have two papers with 10 citations each, one cited once each by papers from 10 different disciplines and the other cited only by papers from a single discipline, why would this aspect matter at all?

    – In the analysis, the ‘topical diversity’ of citations is used to predict the total number of citations of the publications, together with the local citations within the SN journals dataset. The results show that local citations alone predict total citations with an adjusted R-squared of 0.54, while the quartiles of the ‘topical similarity measure’ perform better (0.67). Here, I have the following questions:
    o Citations have a very skewed distribution, so one wonders whether a linear regression model is the best approach for this analysis. Also, how is the distribution of citations related to the ‘topical similarity measure’? (E.g. are highly cited papers in general more topically diverse, and lowly cited papers less diverse?)
    o Another question is whether a model including all 5 variables (i.e. local citations and the 4 quartile variables) would perform better than the two separate models.
    o To what extent could the citedness of the publications in each of the four quartiles affect the analysis? For example, it could be that publications in Q1 and Q4 are the most cited ones (from both the dissimilar (Q1) and similar (Q4) topical perspectives), and this drives the better predictability (i.e. highly cited publications are better predictors of highly cited publications).
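    The reviewer's skewness concern can be probed with a simple experiment: fit a one-predictor least-squares model on raw citation counts and on log-transformed counts and compare the fits. The sketch below uses synthetic heavy-tailed data, not the paper's SN dataset, and a hand-rolled R-squared for self-containment:

```python
# Sketch: probing whether plain OLS suits skewed citation counts.
# Synthetic data (NOT the paper's dataset): local citations predict
# total citations; a log1p transform tames the heavy right tail.
import math
import random

random.seed(0)
local = [random.randint(0, 20) for _ in range(500)]
# Heavy-tailed totals: a few papers are cited far more than most.
total = [int(x * random.lognormvariate(1.0, 0.8)) for x in local]

def ols_r2(xs, ys):
    """R^2 of a simple one-predictor least-squares fit with intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

log_total = [math.log1p(y) for y in total]
print("R^2 on raw counts:", round(ols_r2(local, total), 3))
print("R^2 on log counts:", round(ols_r2(local, log_total), 3))
```

A count model (e.g. negative binomial regression) would be the more principled answer to the skewness question; the log transform is only the cheapest diagnostic.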

    I just have a final reflection from an efficiency perspective. With the ‘topical distance analysis’ there is some improvement over the simpler analysis based on local citations. However, this improvement does not seem very large. One wonders whether the time and computing power required for the topic analysis is justified by the results obtained. For example, would a measure of topical (or cognitive) distance between publications based simply on cited references (see for example https://static.sys.kth.se/itm/wp/indek/indekwp10.pdf) be as good as the full-text analysis of keywords and topics? In some ways, full-text analysis seems more time-consuming than the analysis of references across publications. Some discussion of this issue would be very interesting.
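    The cheaper reference-based alternative the reviewer alludes to is essentially bibliographic coupling: two papers are topically close if their reference lists overlap. A minimal sketch with hypothetical reference IDs, using the Jaccard index as one possible similarity measure:

```python
# Sketch: reference-based (bibliographic coupling) similarity as a
# cheaper alternative to full-text topical analysis. The reference
# IDs are hypothetical; similarity is the Jaccard index of the lists.
def coupling_similarity(refs_a, refs_b):
    """Jaccard overlap of two reference lists (0 = disjoint, 1 = identical)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

paper_x = ["r1", "r2", "r3", "r4"]
paper_y = ["r3", "r4", "r5"]
print(coupling_similarity(paper_x, paper_y))  # 2 shared of 5 total -> 0.4
```

Reference lists are available even when full text is not, which speaks to the coverage concern raised earlier in this comment.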

    For the rest, congratulations on the interesting work.


  5. A further factor to consider is the intent of a citation in a later work: sometimes a subsequent paper makes extensive use of the earlier results, while at other times a paper simply includes a citation in a long list of loosely related papers. Being able to weight citations according to their significance using similar methods would be very valuable.


  6. The paper proposes a new method that combines NLP and traditional bibliometric methods for predicting high impact science. The preliminary results suggest that the number of citations from very dissimilar articles and the number of citations from very similar articles are significant predictors for the impact of a research work. The methods are simple but effective. The findings are interesting. There has been an increasing interest in using textual data for bibliometric studies. This study will likely generate fruitful discussions in the workshop.

    A couple of comments for the authors to consider:
    1. The position paper uses the number of citations within the dataset to predict the number of citations outside the dataset. An alternative evaluation approach is to select a certain time point as a cutoff and use past citations (or other features) to predict future citations.

    2. Besides using author keywords and WoS keywords as features for text representation, there are other methods (e.g. bag-of-words, topic models, or other NLP methods). It will be interesting to see how other methods impact the results.

    A minor comment is that I am not sure “term frequency minus inverse document frequency” is the common expression for TF-IDF. I think the general expression is simply TF-IDF.
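    For reference, TF-IDF multiplies (rather than subtracts) term frequency and inverse document frequency. A minimal sketch over a toy corpus of illustrative documents:

```python
# Sketch: TF-IDF weighting = term frequency * inverse document
# frequency. The toy corpus below is illustrative only.
import math

docs = [
    "citation counts measure impact",
    "topic models measure text similarity",
    "citation context matters",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tfidf(term, doc):
    """TF-IDF of a term in one tokenized document of the toy corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in tokenized if term in d)
    idf = math.log(n_docs / df)  # assumes the term occurs in the corpus
    return tf * idf

# "citation" appears in 2 of 3 docs (low idf); "impact" in only 1 of 3,
# so "impact" gets the higher weight within the first document.
print(tfidf("citation", tokenized[0]))
print(tfidf("impact", tokenized[0]))
```

The higher the share of documents containing a term, the closer its idf is to zero, which is what makes corpus-wide stopwords nearly weightless.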


  7. Interesting paper and a simple but effective idea!

