Competitive benchmarking of information extraction methods has considerably advanced the state of the art in this field. Nevertheless, methodological support for explainable benchmarking, which provides researchers with feedback on the strengths and weaknesses of their methods and guidance for their development efforts, is very limited. While metrics such as F1 and accuracy support the comparison of annotators, they do not help in explaining annotator performance. This work addresses the need for explainability by presenting Orbis, a powerful and extensible explainable evaluation framework which supports drill-down analysis, multiple annotation tasks and resource versioning. It therefore actively aids developers in better understanding evaluation results and identifying shortcomings in their systems. Orbis currently supports four information extraction tasks: content extraction, named entity recognition, named entity linking and slot filling. This article introduces a unified formal framework for evaluating these tasks, presents Orbis’ architecture, and illustrates how it (i) creates simple, concise visualizations that enable visual benchmarking, (ii) supports different visual classification schemas for evaluation results, (iii) aids error analysis, and (iv) enhances the interpretability, reproducibility and explainability of evaluations by adhering to the FAIR principles and by using lenses, which make implicit factors impacting evaluation results, such as tasks, entity classes, annotation rules and the target knowledge graph, more explicit.
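To make the notion of drill-down analysis concrete, the sketch below contrasts an aggregate F1 score with a per-entity-class breakdown of the same predictions. It is a minimal, hypothetical illustration with made-up data and helper names (`f1`, `drill_down`), not Orbis' actual API.

```python
# Minimal sketch of "drill-down" evaluation: aggregate F1 vs. per-entity-class F1.
# Hypothetical data and function names for illustration only; this is not Orbis' API.
from collections import defaultdict

def f1(tp, fp, fn):
    """Compute F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def drill_down(gold, predicted):
    """Count exact (span, class) matches and break the counts down by entity class."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    gold_set, pred_set = set(gold), set(predicted)
    for span, cls in pred_set:
        counts[cls]["tp" if (span, cls) in gold_set else "fp"] += 1
    for span, cls in gold_set - pred_set:
        counts[cls]["fn"] += 1
    return counts

# Toy annotations: (character span, entity class)
gold = [((0, 5), "PER"), ((10, 18), "ORG"), ((25, 31), "LOC"), ((40, 47), "LOC")]
pred = [((0, 5), "PER"), ((10, 18), "ORG"), ((25, 31), "ORG"), ((50, 55), "LOC")]

per_class = drill_down(gold, pred)
total = {k: sum(c[k] for c in per_class.values()) for k in ("tp", "fp", "fn")}

print(f"aggregate F1: {f1(**total):.2f}")   # a single number hides where errors occur
for cls, c in sorted(per_class.items()):
    print(f"  {cls:>3} F1: {f1(**c):.2f}")  # the drill-down view exposes per-class behaviour
```

On this toy data the aggregate F1 is 0.50, while the breakdown shows perfect PER extraction alongside systematic LOC/ORG confusions, the kind of pattern that a lens over entity classes is meant to surface.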
By Filip Ilievski, submitted on 25/Mar/2022. Suggestion: Major Revision

This paper describes a framework for "explainable" benchmarking, which extends prior benchmarking systems with more information extraction (IE) tasks, version control, and visualization tools. I find the goal of shedding light on information extraction evaluation to be very valuable. Similarly, evaluating IE tasks jointly is a good idea, especially for the tasks chosen in this work, which are compositional (SF>NEL>NER>CE). The framework's focus on versioning and visualization of the system and gold predictions would intuitively help tool developers debug and understand the behavior of their system as a function of the different benchmarks, tasks, and KG versions.

There are two main challenges that prevent me from suggesting acceptance at this stage.

1) Presentation - The paper should be better structured to make explicit its contributions, and how these contributions are justified by the proposed framework and its evaluation. The background section is surprisingly long, and it combines background information with decisions made in the Orbis system (e.g., in 3.1.2) and with in-depth discussion of challenges with evaluating some tasks (3.3). Meanwhile, Section 3 does not follow a consistent logic: the subsection on the entity linking task is very detailed, whereas the subsection on NER is much shorter. In some places, writing seems misplaced (e.g., the mention evaluation in 3.3.6 seems to belong in 3.2). Conversely, the Orbis section, which is arguably the key contribution of this work, is much shorter and starts by presenting irrelevant information, such as the dark and standard viewing modes of the tool. It is problematic that the set of functionalities of the Orbis system is never described clearly (only buried inside the writing) or illustrated with a schema. This would likely leave the reader with an understanding of only a subset of its functionalities.

2) Evaluation - The second major issue with this paper is that it falls short on evaluation. Section 5 discusses explainable features, but it is vague on whether and how these work. Section 6 discusses the impact of Orbis on tool development, but all of this is nominal and vague, using terms like "helped a lot" and "A smaller number of lenses is preferred to a higher number, as nobody has enough time to examine too many views". The paper critically needs a formal evaluation, perhaps as a user study on tool development, that measures to what extent the nominal claims hold in practice. This user study could compare the effectiveness/efficiency of tool development with Orbis vs. the other tools in Table 8. In addition, the paper would really benefit from a set of use cases that motivate different aspects of the tool and show how the Orbis tool fulfills them, ideally in comparison to other tools that don't. The term 'explainable' is overloaded nowadays, and my initial expectation was that the paper provides explanations for the decisions made by the human and the automated annotators. This is entirely out of scope of this paper. What is meant by 'explanation' is much weaker - the framework provides analysis of the system predictions based on several dimensions, like entity types.

* The related work section should include a comparison to Orbis.