CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization

Published 17 Jun 2020 in cs.IR, cs.AI, and cs.CL | (2006.09595v1)

Abstract: The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As of May 2020, 128,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset Challenge. Here we present CO-Search, a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers during a time of crisis. The retriever is built from a Siamese-BERT encoder that is linearly composed with a TF-IDF vectorizer, and reciprocal-rank fused with a BM25 vectorizer. The ranker is composed of a multi-hop question-answering module, that together with a multi-paragraph abstractive summarizer adjust retriever scores. To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations, creating 1.3 million (citation title, paragraph) tuples for training the encoder. We evaluate our system on the data of the TREC-COVID information retrieval challenge. CO-Search obtains top performance on the datasets of the first and second rounds, across several key metrics: normalized discounted cumulative gain, precision, mean average precision, and binary preference.



Summary

The paper introduces CO-Search, a hybrid information retrieval system for COVID-19 literature combining semantic search (SBERT) with traditional methods, question answering, and abstractive summarization.
Evaluation on TREC-COVID datasets demonstrates CO-Search's effectiveness, securing top placements in metrics such as nDCG, P@5/10, MAP, and Bpref.
The system provides researchers and clinicians with efficient access to relevant COVID-19 information and highlights the potential of integrating contemporary neural and established IR methodologies.

An Overview of CO-Search: Integrating Advanced IR Techniques for COVID-19 Literature

The paper "CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization" describes an information retrieval (IR) system built to search and summarize the large corpus of COVID-19 scientific literature. The work addresses the need for efficient retrieval tools that help clinicians, researchers, and policymakers navigate a fast-growing body of COVID-19-related research.

CO-Search follows a retriever-ranker design that integrates semantic search with question answering (QA) and summarization. The retriever combines a Siamese-BERT (SBERT) encoder with two traditional keyword-based methods, TF-IDF and BM25. This hybrid strategy draws on both semantic representations and keyword statistics to improve retrieval accuracy over COVID-19-related documents.
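As a rough illustration of how a dense (SBERT-style) signal and a sparse (TF-IDF-style) signal might be blended, the sketch below mixes two cosine similarities with a single weight. The function names, the toy vectors, and the `weight` parameter are assumptions made for this example; the source only states that the encoder is "linearly composed" with the TF-IDF vectorizer and does not specify the exact form.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def hybrid_scores(query_dense, docs_dense, query_sparse, docs_sparse, weight=0.5):
    """Blend dense and sparse similarities per document.

    `weight` is a hypothetical mixing coefficient: 1.0 uses only the dense
    signal, 0.0 uses only the sparse signal.
    """
    scores = {}
    for doc_id in docs_dense:
        dense = cosine(query_dense, docs_dense[doc_id])
        sparse = cosine(query_sparse, docs_sparse[doc_id])
        scores[doc_id] = weight * dense + (1.0 - weight) * sparse
    return scores

# Toy vectors standing in for real SBERT embeddings and TF-IDF rows.
docs_dense = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
docs_sparse = {"a": [0.0, 1.0], "b": [0.0, 1.0]}
scores = hybrid_scores([1.0, 0.0], docs_dense, [0.0, 1.0], docs_sparse)
```

In this toy setup document `a` matches the query on both signals while `b` matches only on the sparse one, so `a` receives the higher blended score.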

The SBERT model is central to the architecture: it embeds queries and documents into a shared latent space so that semantic similarity between them can be measured directly. Because the dataset is domain-specific and relatively small, the authors train the encoder on a bipartite graph of document paragraphs and citations, which yields 1.3 million (citation title, paragraph) tuples. The SBERT encoder is linearly composed with a TF-IDF vectorizer, and the result is then combined with BM25 through reciprocal rank fusion, so the final ranking reflects both semantic and keyword evidence.
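Reciprocal rank fusion merges ranked lists using only each document's position in each list, which sidesteps the problem of comparing scores on different scales. The sketch below is a generic version of the technique, not the authors' code; the constant `k = 60` is a commonly used default and is an assumption here, as are the document identifiers.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Hypothetical outputs of a semantic retriever and a BM25 retriever.
semantic_ranking = ["d1", "d2", "d3"]
bm25_ranking = ["d2", "d3", "d1"]
fused_ranking = reciprocal_rank_fusion([semantic_ranking, bm25_ranking])
```

Here `d2` comes out first because it sits near the top of both lists, even though neither retriever's raw scores are consulted.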

The ranker module pairs a QA engine with an abstractive summarizer, and together they adjust the scores produced by the retriever. The QA system uses multi-hop reasoning to follow relations across paragraphs, boosting documents according to how well they answer the user's query. The summarizer is an encoder-decoder model that combines a BERT encoder with a modified GPT-2 decoder to generate concise summaries of the retrieved articles, helping users grasp the core information quickly.

Evaluation on the TREC-COVID challenge datasets demonstrates CO-Search’s effectiveness. On the datasets of the first and second rounds, the system obtains top performance across several metrics: normalized discounted cumulative gain (nDCG), precision at fixed rank cutoffs (P@5, P@10), mean average precision (MAP), and binary preference (Bpref). These results support its usefulness for automated retrieval over a dense and rapidly evolving research corpus.
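For readers less familiar with these metrics, the following sketch computes precision at k and nDCG at k for a single query. It assumes graded relevance judgments, a linear gain, and a log2 discount; official TREC tooling may differ in details, and the judgments and rankings shown are invented for illustration.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are in the relevant set."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k

def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked, judgments, k):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked]
    ideal = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded judgments: 2 = relevant, 1 = partially relevant, 0 = not relevant.
judgments = {"d1": 2, "d2": 0, "d3": 1}
relevant = {doc_id for doc_id, grade in judgments.items() if grade > 0}

good_ranking = ["d1", "d3", "d2"]
poor_ranking = ["d2", "d1", "d3"]
```

The `good_ranking` list orders documents exactly by their grades, so its nDCG is at the maximum, while `poor_ranking` places a non-relevant document first and is penalized by both metrics.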

In practical terms, CO-Search is positioned to support the global research community during a pandemic by providing access to relevant, up-to-date information, potentially guiding both academic inquiry and public health decision-making. In theoretical terms, its architecture illustrates the value of blending contemporary neural approaches with established IR methods, and suggests a direction for future IR systems that handle specialized, high-volume collections.

Future work could explore domain adaptation strategies that refine the SBERT embeddings with richer COVID-19-specific semantics. Real-time updates and dynamic retraining could also improve responsiveness to newly published literature. The authors’ commitment to open-sourcing the system lays a foundation for collaborative improvements by the broader research community and for applications beyond the current pandemic.
