- The paper presents WiCE, a novel dataset for fine-grained entailment evaluation built from claims in Wikipedia and the sources they cite.
- It introduces Claim-Split, which leverages GPT-3.5 to break complex claims into manageable subclaims for precise annotation.
- Analysis shows that context-aware models perform better, yet current systems still lag behind human-level verification.
NLP models are often required to verify the truthfulness of statements against provided evidence, a capability with applications ranging from fact-checking to document summarization. A new dataset, WiCE (Wikipedia Citation Entailment), aims to tackle these challenges by offering a more realistic and fine-grained textual entailment setup.
The dataset is rooted in Wikipedia: claims within articles are automatically identified and paired with the sources they cite as evidence. WiCE not only labels each claim as supported, partially supported, or not supported by the evidence, but also provides fine-grained annotations for sub-sentence units within the claim, indicating exactly which parts the evidence supports and which it does not.
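To make the annotation scheme concrete, here is a minimal sketch of a WiCE-style example record in Python. The field names and the label-derivation rule are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class EntailmentExample:
    """Hypothetical record for one claim and its cited evidence."""
    claim: str
    evidence_sentences: list[str]
    # Maps each subclaim to the indices of evidence sentences that support
    # it; an empty list means that subclaim is unsupported.
    subclaim_support: dict[str, list[int]] = field(default_factory=dict)

    def overall_label(self) -> str:
        """Derive a claim-level label from the sub-sentence annotations."""
        if not self.subclaim_support:
            return "not_supported"
        supported = [bool(idxs) for idxs in self.subclaim_support.values()]
        if all(supported):
            return "supported"
        if any(supported):
            return "partially_supported"
        return "not_supported"
```

Under this sketch, a claim whose subclaims are only partly backed by the evidence would receive the intermediate label:

```python
example = EntailmentExample(
    claim="The film premiered in 2001 and won three awards.",
    evidence_sentences=["The film premiered in 2001."],
    subclaim_support={"premiered in 2001": [0], "won three awards": []},
)
# example.overall_label() == "partially_supported"
```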
One notable innovation introduced alongside WiCE is an automatic claim decomposition strategy called Claim-Split. Using GPT-3.5, it breaks complex claims into more manageable subclaims, which makes annotation more efficient and can improve entailment model performance, since short subclaims are easier to evaluate than long, intricate statements.
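The idea can be sketched as few-shot prompting: show the model a worked decomposition, then ask it to split a new claim the same way. The instruction wording and the in-context example below are illustrative assumptions, not the paper's actual prompt.

```python
# One hypothetical in-context demonstration: (claim, its subclaims).
FEW_SHOT = [
    (
        "Ada Lovelace was a mathematician who wrote the first program.",
        [
            "Ada Lovelace was a mathematician.",
            "Ada Lovelace wrote the first program.",
        ],
    ),
]

def build_claim_split_prompt(claim: str) -> str:
    """Assemble a few-shot prompt asking an LLM to decompose a claim."""
    parts = ["Segment the following sentence into individual facts:"]
    for demo_claim, subclaims in FEW_SHOT:
        parts.append(f"Sentence: {demo_claim}")
        parts.extend(f"- {s}" for s in subclaims)
    parts.append(f"Sentence: {claim}")
    return "\n".join(parts)
```

The returned string would be sent to the LLM, whose bulleted output is then parsed back into a list of subclaims for annotation.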
WiCE is shown to pose new challenges for current entailment models, which are generally built for shorter texts. When evaluated on the dataset's real-world claims, existing models underperform: verifying a claim requires retrieving the relevant evidence from long documents and reasoning over it, which these models are not yet equipped to do.
The analysis underscores the importance of context and retrieval. Models trained to predict entailment from retrieved chunks of evidence, combined with surrounding context, outperform those that rely on individual sentences alone. Even so, these systems still fall short of human-level performance.
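The chunk-with-context setup can be sketched as follows: group evidence sentences into overlapping windows, then keep the windows most relevant to the claim. As a stand-in for a learned retriever, this sketch scores chunks by simple word overlap; the window sizes and scoring are illustrative assumptions, not the paper's configuration.

```python
def chunk_sentences(sentences: list[str], size: int = 3,
                    stride: int = 2) -> list[list[str]]:
    """Group evidence sentences into overlapping windows so each chunk
    carries some surrounding context."""
    chunks = []
    for start in range(0, max(len(sentences) - size + 1, 1), stride):
        chunks.append(sentences[start:start + size])
    return chunks

def top_chunks(claim: str, sentences: list[str],
               k: int = 2) -> list[list[str]]:
    """Rank chunks by word overlap with the claim and keep the top k
    (a crude proxy for a trained retrieval model)."""
    claim_words = set(claim.lower().split())

    def score(chunk: list[str]) -> int:
        return len(claim_words & set(" ".join(chunk).lower().split()))

    return sorted(chunk_sentences(sentences), key=score, reverse=True)[:k]
```

An entailment model would then score the claim against each selected chunk rather than against isolated sentences, which is the setup the analysis finds more effective.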
In summary, WiCE represents a step toward realistic assessment of models' ability to verify the factual correctness of real-world claims. Its supporting tools, such as Claim-Split and the fine-grained annotations, offer ways to both extend the dataset and improve model performance, highlighting the roles of context, retrieval, and evidence granularity in the continuing evolution of automated fact verification systems.