[62] |
Alexander Erdmann, Tom Kenter, Markus Becker, and Christian Schallhart.
Frugal Paradigm Completion.
In Annual Conference of the Association for Computational
Linguistics (ACL'20), 2020.
[ » bib | » pdf ] Lexica distinguishing all morphologically related forms of each lexeme are crucial to many language technologies, yet building them is expensive. We propose a frugal paradigm completion approach that predicts all related forms in a morphological paradigm from as few manually provided forms as possible. It induces typological information during training which it uses to determine the best sources at test time. We evaluate our language-agnostic approach on 7 diverse languages. Compared to popular alternative approaches, ours reduces manual labor by 16-63% and is the most robust to typological variation.
|
[61] |
Sandy Ritchie, Eoin Mahon, Kim Heiligenstein, Nikos Bampounis, Daan van Esch,
Christian Schallhart, Jonas Fromseier Mortensen, and Benoît Brard.
Data-Driven Parametric Text Normalization: Rapidly Scaling
Finite-State Transduction Verbalizers to New Languages.
In Joint Spoken Language Technologies for Under-resourced
languages and Collaboration and Computing for Under-Resourced Languages
Workshop (SLTU-CCURL'20), 2020.
[ » bib | » pdf ] |
[60] |
Sandy Ritchie, Richard Sproat, Kyle Gorman, Daan van Esch, Christian
Schallhart, Nikos Bampounis, Benoît Brard, Jonas Fromseier Mortensen,
Millie Holt, and Eoin Mahon.
Unified verbalization for speech recognition & synthesis across
languages.
In INTERSPEECH'19, pages 3530-3534, 2019.
[ » bib | » pdf ] We describe a new approach to converting written tokens to their spoken form, which can be shared by automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems. Both ASR and TTS need to map from the written to the spoken domain, and we present an approach that enables us to share verbalization grammars between the two systems while exploiting linguistic commonalities to provide simple default verbalizations. We also describe improvements to an induction system for number names grammars. Between these shared ASR/TTS verbalizers and the improved induction system for number names grammars, we achieve significant gains in development time and scalability across languages.
|
[59] |
Georg Gottlob, Tim Furche, Giovanni Grasso, Christian Schallhart, and Giorgio
Orsi.
System for automatically generating wrapper for entire websites.
United States Paptent, 2019.
US 10,325,000 B2.
[ » bib | » pdf ] A system for automaticaly generating a wraper for an entire website, the wrapper characterizing the structure of the website, the system having a plurality of functional elements, including at least one annotation module to classify components of a page and generate an anotated, a page clasification module to identify functional and informational components of an anotated page, and an action module to identify an action to be taken to further navigate the website, wherein at least one of the anotation module, page clasification module and action module is operable in response to aplurality of domain-specific rules, where a domain is understood as a conceptual domain such as real estate, usedcars, or electronics.
|
[58] |
Tim Furche, Giovanni Grasso, Michael Huemer, Christian Schallhart, and Michael
Schrefl.
PEACE-Ful Web Event Extraction and Processing as Bitemporal Mutable
Events.
ACM Transactions on the Web (TWEB), 10(3), 2016.
[ » bib | » pdf ] The web is the largest bulletin board of the world. Events of all types, from flight arrivals to business meetings, are announced on this board. Tracking and reacting to such event announcements, however, is a tedious manual task, only slightly alleviated by email or similar notifications. Announcements are published with human readers in mind, and updates or delayed announcements are frequent. These characteristics have hampered attempts at automatic tracking.
|
[57] |
Tim Furche, Jinsong Guo, Sebastian Maneth, and Christian Schallhart.
Robust and Noise Resistant Wrapper Induction.
In SIGMOD'16, pages 773-784, 2016.
[ » bib | » pdf ] Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should not only select the annotated content accurately but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g., machine generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as subset of XPATH and show that even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise resistant queries. We evaluate the queries on snapshots from web pages that change over time as provided by the Internet Archive, and show that the induced queries are as robust as the human-made queries. The queries often survive hundreds sometimes thousands of days, with many changes to the relative position of the selected nodes (including changes on template level). This is due to the few and discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).
|
[56] |
Andreas Holzer, Christian Schallhart, Michael Tautschnig, and Helmut Veith.
Closure properties and complexity of rational sets of regular
languages.
Theoretical Computer Science (TCS), 605:62-79, 2015.
[ » bib | » pdf ] The test specification language FQL describes relevant test goals as regular expressions over program locations, such that each matching test case has an execution path matching this expression. To specify not only test goals but entire suites, FQL describes families of related test goals by regular expressions over extended alphabets: Herein, each symbol corresponds to a regular expression over program locations, and thus, a word in an FQL expression corresponds to a regular expression describing a single test goal. In this paper we provide a systematic foundation for FQL test specifications, which are in fact rational sets of regular languages (RSRLs). To address practically relevant problems like query optimization, we tackle open questions about RSRLs: We settle closure properties of general and finite RSRLs under common set theoretic operations. We also prove complexity results for checking equivalence and inclusion of star-free RSRLs, and for deciding whether a regular language is a member of a general or star-free RSRL.
|
[55] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi,
Christian Schallhart, and Cheng Wang.
DIADEM: Thousands of Websites to a Single Database.
Proceedings of the VLDB Endowment/International Conference on
Very Large Databases (PVLDB'14), 7(14):1845-1856, 2014.
[ » bib | » pdf ] The web is overflowing with implicitly structured data, spread over hundreds of thousands of sites, hidden deep behind search forms, or siloed in marketplaces, only accessible as HTML. Automatic extraction of structured data at the scale of thousands of websites has long proven elusive, despite its central role in the “web of data”.
|
[54] |
Martin Franz, Andreas Holzer, Stefan Katzenbeisser, Christian Schallhart, and
Helmut Veith.
CBMC-GC: An ANSI C Compiler for Secure Two-Party Computations.
In International Conference on Compiler Construction (CC'14),
pages 244-249, 2014.
[ » bib | » pdf ] Secure two-party computation (STC) is a computer security paradigm where two parties can jointly evaluate a program with sensitive input data, provided in parts from both parties. By the security guarantees of STC, neither party may learn any information on the other party’s input while performing the STC task. For a long time thought to be impractical, until recently, STC has only been implemented with domain-specific languages or hand-crafted Boolean circuits for specific computations. Our open-source compiler CBMC-GC is the first ANSI C compiler for STC. It turns C programs into Boolean circuits that fit the requirements of garbled circuits, a generic STC approach for circuits. Here, the size of the resulting circuits plays a crucial role since each STC step involves encryption and network transfer and is therefore extremely slow when compared to computations performed on modern HW architectures. We report on newly implemented circuit optimization techniques that substantially reduce the circuit sizes compared to the original release of CBMC-GC.
|
[53] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, and
Christian Schallhart.
The Ontological Key: Automatically Understanding and Integrating
Forms to Access the Deep Web.
VLDB Journal, 22(5), 2014.
615-640.
[ » bib | » pdf ] Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding provides applications, ranging from crawlers over meta-search engines to service integrators, with a key to this content. Yet, it has received little attention other than as component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data.
|
[52] |
Andreas Holzer, Christian Schallhart, Michael Tautschnig, and Helmut Veith.
On the structure and complexity of rational sets of regular
languages.
In Foundations of Software Technology and Theoretical Computer
Science (FSTTCS'13), 2013.
377-388.
[ » bib | » pdf ] In the recently designed and implemented test specification language FQL, relevant test goals are specified as regular expressions over program locations. To transition from single test goals to test suites, FQL describes these as regular expressions over alphabets where each symbol corresponds to a regular expression over program locations. Hence, each word in a test suite expression is a test goal specification. Such test suite specifications are in fact rational sets of regular languages (RSRLs). We show closure properties of RSRLs under common set theoretic operations for general and finite RSRLs. We also prove complexity results for checking equivalence and inclusion of star-free RSRLs and for checking whether a regular language is member of a general or star- free RSRL. As the star-free (and thus finite) case underlies FQL specifications, we provide a systematic foundation for FQL test specifications.
|
[51] |
Tim Furche, Giovanni Grasso, Michael Huemer, Christian Schallhart, and Michael
Schrefl.
PEACE-ful Web Event Extraction and Processing.
In 14th International Conference on Web Information System
Engineering (WISE'13), pages 523-526, 2013.
[ » bib | » pdf ] PEACE, our proposed tool, integrates complex event processing and web extraction into a unified framework to handle web event advertisements and to run a notification service atop. Its bitemporal schemata distinguish occurrence and detection time, enabling PEACE to deal with updates and delayed announcements, as often occurring on the web. To consolidate the arising event streams, PEACE combines simple events into complex ones. Depending on their occurrence and detection time, these complex events trigger actions to be executed.
|
[50] |
Tim Furche, Giovanni Grasso, Michael Huemer, Christian Schallhart, and Michael
Schrefl.
Bitemporal Complex Event Processing of Web Event Advertisements.
In 14th International Conference on Web Information System
Engineering (WISE'13), pages 333-346, 2013.
[ » bib | » pdf ] The web is the largest bulletin board of the world. Events of all types, from flight arrivals to business meetings, are announced on this board. Tracking and reacting to such event announcements, however, is a tedious manual task, only slightly alleviated by email or similar notifications. Announcements are published with human readers in mind, and updates or delayed announcements are frequent. These characteristics have hampered attempts at automatic tracking.
|
[49] |
Tim Furche, Giovanni Grasso, and Christian Schallhart.
Effective Web Scraping with OXPath.
In WWW (developer track), pages 23-26, 2013.
[ » bib | » pdf ] Even in the third decade of the Web, scraping web sites remains a challenging task: Most scraping programs are still developed as ad-hoc solutions using a complex stack of languages and tools. Where comprehensive extraction solutions exist, they are expensive, heavyweight, and proprietary.
|
[48] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew
Sellers.
OXPATH: A Language for Scalable Data Extraction, Automation, and
Crawling on the Deep Web.
VLDB Journal, 22(1):47-72, 2013.
VLDB Best Papers Issue.
[ » bib | » pdf ] The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth.
|
[47] |
Sagar Chacki, Christian Schallhart, and Helmut Veith.
Verification Across Intellectual Property Boundaries.
ACM Transactions on Software Engineering and Methodology
(TOSEM), 22(2):15:1-15:12, 2013.
[ » bib | » pdf ] In many industries, the importance of software components provided by third-party suppliers is steadily increasing. As the suppliers seek to secure their intellectual property (IP) rights, the customer usually has no direct access to the suppliers’ source code, and is able to enforce the use of verification tools only by legal requirements. In turn, the supplier has no means to convince the customer about successful verification without revealing the source code. This paper presents an approach to resolve the conflict between the IP interests of the supplier and the quality interests of the customer. We introduce a protocol in which a dedicated server (called the “amanat”) is controlled by both parties: the customer controls the verification task performed by the amanat, while the supplier controls the communication channels of the amanat to ensure that the amanat does not leak information about the source code. We argue that the protocol is both practically useful and mathematically sound. As the protocol is based on well-known (and relatively lightweight) cryptographic primitives, it allows a straightforward implementation on top of existing verification tool chains. To substantiate our security claims, we establish the correctness of the protocol by cryptographic reduction proofs.
|
[46] |
Ömer Gunes, Tim Furche, Christian Schallhart, Jens, Lehmann, and Axel Ngonga.
EAGER: Extending Automatically Gazetteers for Entity Recognition.
In Proceedings of the 3rd Workshop on the People’s Web Meets
NLP, pages 29-33, 2012.
[ » bib | » pdf ] Key to named entity recognition, the manual gazetteering of entity lists is a costly, error-prone process that often yields results that are incomplete and suffer from sampling bias. Exploiting current sources of structured information, we propose a novel method for extending minimal seed lists into complete gazetteers. Like previous approaches, we value WIKIPEDIA as a huge, well-curated, and relatively unbiased source of entities. However, in contrast to previous work, we exploit not only its content, but also its structure, as exposed in DBPEDIA. We extend gazetteers through Wikipedia categories, carefully limiting the impact of noisy categorizations. The resulting gazetteers easily outperform previous approaches on named entity recognition.
|
[45] |
Jens Lehmann, Tim Furche, Giovanni Grasso, Axel-Cyrille Ngonga Ngomo, Christian
Schallhart, Andrew Sellers, Christina Unger, Lorenz Bühmann, Daniel
Gerber, David Liu Konrad Höffner and, and Sören Auer.
DEQA: Deep Web Extraction for Question Answering.
In International Semantic Web Conference (ISWC'12), Lecture
Notes in Computer Science (LNCS), pages 131-147, 2012.
[ » bib | » pdf ] Despite decades of effort, intelligent object search remains elusive. Neither search engine nor semantic web technologies alone have managed to provide usable systems for simple questions such as “find me a flat with a garden and more than two bedrooms near a supermarket.” We introduce deqa, a conceptual framework that achieves this elusive goal through combining state-of-the-art semantic technologies with effective data extraction. To that end, we apply deqa to the UK real estate domain and show that it can answer a significant percentage of such questions correctly. deqa achieves this by mapping natural language questions to Sparql patterns. These patterns are then evaluated on an RDF database of current real estate offers. The offers are obtained using OXPath, a state-of-the-art data extraction system, on the major agencies in the Oxford area and linked through Limes to background knowledge such as the location of supermarkets.
|
[44] |
Tim Furche, Giovanni Grasso, Antonino Rullo, Christian Schallhart, and Andrew
Sellers.
Think before you Act! Minimising Action Execution in Wrappers.
In International Workshop on Searching and Integrating New Web
Data Sources (VLDS'12), pages 29-34, 2012.
[ » bib | » pdf ] Web wrappers access databases hidden in the deep web by first interacting with web sites by, e.g., filling forms or clicking buttons, to extract the relevant data from the thus unearthed result pages. Though the (semi-)automatic induction and maintenance of such wrappers has been extensively studied, the efficient execution and optimization of wrappers has seen far less attention.
|
[43] |
Tim Furche, Georg Gottlob, and Christian Schallhart.
DIADEM: Domains to Databases.
In International Conference on Database and Expert Systems
Applications (DEXA'12), pages 1-8, 2012.
invited.
[ » bib | » pdf ] What if you could turn all websites of an entire domain into a single database? Imagine all real estate offers, all airline flights, or all your local restaurants’ menus automatically collected from hundreds or thousands of agencies, travel agencies, or restaurants, presented as a single homogeneous dataset.
|
[42] |
Tim Furche, Giovanni Grasso, Andrey Kravchenko, and Christian Schallhart.
Turn the Page: Automated Traversal of Paginated Websites.
In International Conference on Web Engineering (ICWE'12),
volume 7387 of Lecture Notes in Computer Science (LNCS), pages
332-346, 2012.
[ » bib | » pdf ] Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Thus, human users and automatic tools alike have to traverse the pagination links when they crawl the site, extract data, or automate common tasks, where these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text), falling back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only fractions of the individual pages of a given site, rendering a highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages. We identify pagination links in a wide range of domains and sites with near perfect accuracy (99 these results with a novel framework for web block classification, BERyL, that combines rule-based reasoning for feature extraction and machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide settings range, adjusted to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
|
[41] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Ömer Gunes, Xiaonan Guo, Andrey
Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, and Cheng
Wang.
DIADEM: Domain-centric, Intelligent, Automated Data Extraction
Methodology.
In International Word Wide Web Conference (WWW'12), pages
267-270, 2012.
[ » bib | » pdf ] Search engines are the sinews of the web. These sinews have become strained, however: Where the web’s function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain milage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers—never to be sure a better offer isn’t just around the corner. What search engines are missing is understanding of the objects and their attributes published on websites.
|
[40] |
Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, and Cheng
Wang.
Automatically Learning Gazetteers from the Deep Web.
In International Word Wide Web Conference (WWW'12), pages
341-344, 2012.
[ » bib | » pdf ] Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach AMBER is able to identify records and their attributes with almost perfect accuracy (> 98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: From a small seed sample we achieve 94.4 locations in the 4th iteration.
|
[39] |
Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, and
Christian Schallhart.
OPAL: A Passe-partout for Web Forms.
In International Word Wide Web Conference (WWW'12), pages
353-356, 2012.
[ » bib | » pdf ] Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching.
|
[38] |
Jochen Kranzdorf, Andrew Sellers, Giovanni Grasso, Christian Schallhart, and
Tim Furche.
Visual OXPath: Robust Wrapping by Example.
In International Word Wide Web Conference (WWW'12), pages
369-372, 2012.
[ » bib | » pdf ] Good examples are hard to find, particularly in wrapper induction: Picking even one wrong example can spell disaster by yielding overgeneralized or overspecialized wrappers. Such wrappers extract data with low precision or recall, unless adjusted by human experts at significant cost.
|
[37] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, and
Christian Schallhart.
OPAL: Automated Form Understanding for the Deep Web.
In International Word Wide Web Conference (WWW'12), pages
829-838, 2012.
[ » bib | » pdf ] Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem.
|
[36] |
Andreas Holzer, Daniel Kroening, Christian Schallhart, Michael Tautschnig, and
Helmut Veith.
Proving Reachability using FShell (Competition Contribution).
In International Conference on Tools and Algorithms for the
Construction and Analysis of Systems (TACAS'12), Lecture Notes in Computer
Science (LNCS), pages 538-541, 2012.
[ » bib | » pdf ] FShell is an automated white-box test-input generator for C programs, computing test data with respect to user-specified code coverage criteria. The pillars of FShell are the declarative specification language FQL (FShell Query Language), an efficient back end for computing test data, and a mathematical framework to reason about coverage criteria. To solve the reachability problem posed in SV-COMP we specify coverage of ERROR labels. As back end, FShell uses bounded model checking, building upon components of CBMC and leveraging the power of SAT solvers for efficient enumeration of a full test suite.
|
[35] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart,
and Cheng Wang.
Little Knowledge Rules The Web: Domain-Centric Result Page
Extraction.
In International Conference on Web Reasoning and Rule Systems
(RR'11), pages 61-76, 2011.
[ » bib | » pdf ] Web extraction is the task of turning unstructured HTML into structured data. Previous approaches rely exclusively on detecting repeated structures in result pages. These approaches trade intensive user interaction for precision.
|
[34] |
Tim Furche, Georg Gottlob, Xiaonan Guo, Christian Schallhart, Andrew Sellers,
and Cheng Wang.
How the Minotaur turned into Ariadne: Ontologies in Web Data
Extraction.
In International Conference on Web Engineering (ICWE'11),
pages 13-27, 2011.
invited.
[ » bib | » pdf ] Humans require automated support to profit from the wealth of data nowadays available on the web. To that end, the linked open data initiative and others have been asking data providers to publish structured, semantically annotated data. Small data providers, such as most UK real-estate agencies, however, are overburdened with this task-often just starting to move from simple, table- or list-like directories to web applications with rich interfaces.
|
[33] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew
Sellers.
OXPath: A Language for Scalable, Memory-efficient Data Extraction
from Web Applications.
Proceedings of the VLDB Endowment/International Conference on
Very Large Databases (PVLDB'11), 4(11):1016-1027, 2011.
[ » bib | » pdf ] The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth.
|
[32] |
Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian
Schallhart.
Exploring the Web with OXPath.
In International Workshop on Linked Web Data Management
(LWDM'11), 2011.
[ » bib | » pdf ] OXPath is a careful extension of XPath that facilitates data extraction from the deep web. It is designed to facilitate the large-scale extraction of data from sophisticated modern web interfaces with client-side scripting and asynchronous server communication. Its main characteristics are (1) a minimal extension of XPath to allow page navigation and action execution, (2) a set-theoretic formal semantics for full OXPath, (3) and a sophisticated memory management that minimizespagebuffering.Inthisposter,webrieflyreview the main features of the language and discuss ongoing and future work.
|
[31] |
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, and
Christian Schallhart.
Real Understanding for Real Estate Forms.
In International Conference on Web Intelligence, Mining and
Semantics (WIMS'11), 2011.
[ » bib | » pdf ] Finding an apartment is a lengthy and tedious process. Once decided, one can never be sure not to have missed an even better offer which would have been just one click away. Form understanding is key to automatically access and process all the relevant—and nowadays readily available—data.
|
[30] |
Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian
Schallhart.
OXPath: Little Language, Little Memory, Great Value.
In International Word Wide Web Conference (WWW'11), pages
261-264, 2011.
[ » bib | » pdf ] Data about everything is readily available on the web—but often only accessible through elaborate user interactions. For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OX-Path, a novel approach to web extraction, with a system that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a light-weight interface, which is easy to use and embed. We highlight how OXPath guarantees optimal page buffering, storing only a constant number of pages for non-recursive queries.
|
[29] |
Andreas Holzer, Visar Januzaj, Stefan Kugele, Boris Langer, Christian
Schallhart, Michael Tautschnig, and Helmut Veith.
Seamless Testing for Models and Code.
In Fundamental Approaches to Software Engineering (FASE'11),
Lecture Notes in Computer Science (LNCS), pages 278-293, 2011.
[ » bib | » pdf ] This paper describes an approach to model-based testing where a test suite is generated from a model and automatically concretized to drive an implementation. Motivated by an industrial project involving DO-178B compliant avionics software, where the models are UML activity diagrams and the implementation is ANSI C, we developed a seamless testing environment based on our test specification language FQL. We demonstrate how to apply FQL to activity diagrams in such a way that FQL test specifications easily translate from UML to C code. Our approach does not require any additional glue or auxiliary code but is fully automatic except for straightforward source code annotations that link source and model. In this way, we can check for modeled but unimplemented behavior and vice versa, and we can also evaluate the degree of abstraction between model and implementation.
|
[28] |
Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian
Schallhart.
Taking the OXPath down the Deep Web.
In International Conference on Extending Database Technology
(EDBT'11), pages 542-545, 2011.
[ » bib | » pdf ] Although deep web analysis has been studied extensively, there is no succinct formalism to describe user interactions with AJAX-enabled web applications.
|
[27] |
Andreas Bauer, Martin Leucker, and Christian Schallhart.
Runtime Verification for LTL and TLTL.
ACM Transactions on Software Engineering and Methodology
(TOSEM), 20(4), 2011.
[ » bib | » pdf ] This paper studies runtime verification of properties expressed either in lineartime temporal logic (LTL) or timed lineartime temporal logic (TLTL). It classifies runtime verification in identifying its distinguishing features to model checking and testing, respectively. It introduces a three-valued semantics (with truth values true, false, inconclusive) as an adequate interpretation as to whether a partial observation of a running system meets an LTL or TLTL property.
|
[26] |
Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith.
An Introduction to Test Specification in FQL.
In Haifa Verification Confernce (HVC'10), volume 6504 of
Lecture Notes in Computer Science (LNCS), pages 9-22, 2010.
invited.
[ » bib | » pdf ] In a recent series of papers, we introduced a new framework for white-box testing which aims at a separation of concerns between test specications and test generation engines. We believe that establishing a common language for test criteria will have similar benets to testing as temporal logic had to model checking and SQL had to databases. The main challenge was to nd a specication language which is expressive, simple, and precise. This paper gives an introduction to the test specication language FQL and its tool environment.
|
[25] |
Visar Januzaj, Stefan Kugele, Boris Langer, Christian Schallhart, and Helmut
Veith.
New Challenges in the Development of Critical Embedded Systems-An
“aeromotive” Perspective.
In International Symposium on Leveraging Applications
(ISoLA'10), volume 6415 of Lecture Notes in Computer Science (LNCS),
pages 1-2, 2010.
[ » bib ] During the last decades, embedded systems have become increasingly important in highly safety-critical areas such as power plants, medical equipment, cars, and aeroplanes. The automotive and avionics domains are prominent examples of classical engineering disciplines where conflicts between costs, short product cycles and legal requirements concerning dependability, robustness, security, carbon footprint and spatial demands have become a pressing problem.
|
[24] |
Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith.
How did you specify your test suite ?
In Automated Software Engineering (ASE'10), pages 407-416,
2010.
[ » bib | » pdf ] Although testing is central to debugging and software certification, there is no adequate language to specify test suites over source code. Such a language should be simple and concise in daily use, feature a precise semantics, and of course, it has to facilitate suitable engines to compute test suites and assess the coverage achieved by a test suite.
|
[23] |
Somesh Jha, Stefan Katzenbeisser, Christian Schallhart, Helmut Veith, and
Stephen Chenney.
Semantic Integrity in Large-Scale Online Simulations.
ACM Transactions on Internet Technology (TOIT), 10(1), 2010.
[ » bib | » pdf ] As large-scale online simulations such as Second Life and World of Warcraft are gaining economic significance, there is a growing incentive for attacks against such simulation software. We focus on attacks against the semantic integrity of the simulation. This class of attacks exploits the client-server architecture and is specific to online simulations which, for performance reasons, have to delegate the detailed rendering of the simulated world to the clients. Attacks against semantic integrity often compromise the physical laws of the simulated world—enabling the user’s simulation persona to fly, walk through walls, or to run faster than anybody else.
|
[22] |
Andreas Bauer, Martin Leucker, and Christian Schallhart.
Comparing LTL Semantics for Runtime Verification.
Journal of Logic and Computation (JLC), 20(3):651-674, 2010.
[ » bib | » pdf ] When monitoring a system w.r.t. a property defined in a temporal logic such as LTL, a major concern is to settle with an adequate interpretation of observable system events; that is, models of temporal logic formulae are usually infinite words of events, whereas at runtime only finite but incrementally expanding prefixes are available.
|
[21] |
Andreas Bauer, Martin Leucker, Christian Schallhart, and Michael Tautschnig.
Don't care in SMT-Building flexible yet efficient
abstraction/refinement solvers.
Software Tools for Technology Transfer (STTT), 12(1):23-37,
2010.
[ » bib | » pdf ] This paper describes a method for combining “off-the-shelf” SAT and constraint solvers for building an efficient Satisfiability Modulo Theories (SMT) solver for a wide range of theories. Our method follows the abstraction/refinement approach to simplify the implementation of custom SMT solvers. The expected performance penalty by not using an interweaved combination of SAT and theory solvers is reduced by generalising a Boolean solution of an SMT problem first via assigning don’t care to as many variables as possible. We then use the generalised solution to determine a thereby smaller constraint set to be handed over to the constraint solver for a background theory.We showthat for many benchmarks and real-world problems, this optimisation results in considerably smaller and less complex constraint problems. The presented approach is particularly useful for assembling a practically viable SMT solver quickly, when neither a suitable SMT solver nor a corresponding incremental theory solver is available.We have implemented our approach in the ABsolver framework and applied the resulting solver successfully to an industrial case-study: the verification problems arising in verifying an electronic car steering control system impose non-linear arithmetic constraints, which do not fall into the domain of any other available solver.
|
[20] |
Johannes Kinder, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith.
Proactive Detection of Computer Worms Using Model Checking.
IEEE Transactions on Dependable and Secure Computing (TDSC),
4(7):424-438, 2010.
[ » bib | » pdf ] Although recent estimates are speaking of 200,000 different viruses, worms, and Trojan horses, the majority of them are variants of previously existing malware. As these variants mostly differ in their binary representation rather than their functionality, they can be recognized by analyzing the program behavior, even though they are not covered by the signature databases of current antivirus tools. Proactive malware detectors mitigate this risk by detection procedures which use a single signature to detect whole classes of functionally related malware without signature updates. It is evident that the quality of proactive detection procedures depends on their ability to analyze the semantics of the binary.
|
[19] |
Martin Leucker and Christian Schallhart.
A Brief Account of Runtime Verification.
Journal of Logic and Algebraic Programming (JLAP),
78(5):293-303, 2009.
[ » bib | » pdf ] In this paper, a brief account of the field of runtime verification is given. Starting with a definition of runtime verification, a comparison to well-known verification techniques like model checking and testing is provided, and applications in which runtime verification brings out its distinguishing features are pointed out. Moreover, extensions of runtime verification such as monitor-oriented programming, and monitor-based runtime reflection are sketched and their similarities and differences are discussed. Finally, the use of runtime verification for contract enforcement is briefly pointed out.
|
[18] |
Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith.
Query-Driven Program Testing.
In Verification, Model Checking, and Abstract Interpretation
(VMCAI'09), volume 5403 of Lecture Notes in Computer Science (LNCS),
pages 151-166, 2009.
[ » bib | » pdf ] We present a new approach to program testing which enables the programmer to specify test suites in terms of a versatile query language. Our query language subsumes standard coverage criteria ranging from simple basic block coverage all the way to predicate complete coverage and multiple condition coverage, but also facilitates on-the-fly requests for test suites specific to the code structure, to external requirements, or to ad hoc needs arising in program understanding/ exploration. The query language is supported by a model checking backend which employs the CBMC framework. Our main algorithmic contribution is a method called iterative constraint strengthening which enables us to solve a query for an arbitrary coverage criterion by a single call to the model checker and a novel form of incremental SAT solving: Whenever the SAT solver finds a solution, our algorithm compares this solution against the coverage criterion, and strengthens the clause database with additional clauses which exclude redundant new solutions. We demonstrate the scalability of our approach and its ability to compute compact test suites with experiments involving device drivers, automotive controllers, and open source projects.
|
[17] |
Wei Dong, Martin Leucker, and Christian Schallhart.
Impartial Anticipation in Runtime-Verification.
In Automated Technology for Verification and Analysis
(ATVA'08), volume 5311 of Lecture Notes in Computer Science (LNCS),
2008.
386-396.
[ » bib | » pdf ] In this paper, a uniform approach for synthesizing monitors checking correctness properties specified in linear-time logics at runtime is provided. Therefore, a generic three-valued semantics is introduced reflecting the idea that prefixes of infinite computations are checked. Then a conceptual framework to synthesize monitors from a logical specification to check an execution incrementally is established, with special focus on resorting to the automata-theoretic approach. The merits of the presented framework are shown by providing monitor synthesis approaches for a variety of different logics such as LTL, the linear-time μ-calculus, PLTLmod, S1S, and RLTL.
|
[16] |
Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith.
FSHELL: Systematic Test Case Generation for Dynamic Analysis and
Measurement.
In Computer Aided Verification (CAV'08), volume 5123 of
Lecture Notes in Computer Science (LNCS), pages 209-213, 2008.
[ » bib | » pdf ] Although the principal analogy between counterexample generation and white box testing has been repeatedly addressed, the usage patterns and performance requirements for software testing are quite different from formal verification. Our tool FSHELL provides a versatile testing environment for C programs which supports both interactive explorative use and a rich scripting language. More than a frontend for software model checkers, FSHELL is designed as a database engine which dispatches queries about the program to program analysis tools. We report on the integration of CBMC into FSHELL and describe architectural modifications which support efficient test case generation.Although the principal analogy between counterexample generation and white box testing has been repeatedly addressed, the usage patterns and performance requirements for software testing are quite different from formal verification. Our tool FSHELL provides a versatile testing environment for C programs which supports both interactive explorative use and a rich scripting language. More than a frontend for software model checkers, FSHELL is designed as a database engine which dispatches queries about the program to program analysis tools. We report on the integration of CBMC into FSHELL and describe architectural modifications which support efficient test case generation.
|
[15] |
Andreas Bauer, Martin Leucker, Christian Schallhart, and Michael Tautschnig.
Don't care in SMT-Building flexible yet efficient
abstraction/refinement solvers.
In Workshop On Leveraging Applications of Formal Methods,
Verification and Validation (ISoLA'07), pages 135-146, 2007.
[ » bib | » pdf ] This paper describes a method for combining “off-the-shelf” SAT and constraint solvers for building an efficient Satisfiability Modulo Theories (SMT) solver for a wide range of theories. Our method follows the abstraction/refinement approach to simplify the implementation of customSMT solvers. The expected performance penalty by not using an interweaved combination of SAT and theory solvers is reduced by generalising a Boolean solution of an SMT problem first via assigning don’t care to as many variables as possible. We then use the generalised solution to determine a thereby smaller constraint set to be handed over to the constraint solver for a background theory. We show that for many benchmarks and real-world problems, this optimisation results in considerably smaller and less complex constraint problems.
|
[14] |
Andreas Bauer, Martin Leucker, and Christian Schallhart.
The good, the bad, and the ugly, but how ugly is ugly?
In Workshop on Runtime Verification (RV'07), volume 4839 of
Lecture Notes in Computer Science (LNCS), pages 126-138, 2007.
[ » bib | » pdf ] When monitoring a system wrt. a property defined in some temporal logic, e. g., LTL, a major concern is to settle with an adequate interpretation of observable system events; that is, models of temporal logic formulae are usually infinite streams of events, whereas at runtime only prefixes are available. This work defines a four-valued semantics for LTL over finite traces, which extends the classical semantics, and allows to infer whether a system behaves (1) according to the monitored property, (2) violates the property, (3) will possibly violate the property in the future, or (4) will possibly conform to the property in the future, once the system has stabilised. Notably, (1) and (2) correspond to the classical semantics of LTL, whereas (3) and (4) are chosen whenever an observed system behaviour has not yet lead to a violation or acceptance of the monitored property.
|
[13] |
Somesh Jha, Stefan Katzenbeisser, Helmut Veith, and Stephen Chenny.
Enforcing Semantic Integrity on Untrusted Clients in Networked
Virtual Environments (Extended Abstract).
In IEEE Security and Privacy (SP'07), pages 179-186, 2007.
[ » bib | » pdf ] In the computer gaming industry, large-scale simulations of realistic physical environments over the Internet have attained increasing importance. Networked virtual environments (NVES) are typically based on a client-server architecture where part of the simulation workload is delegated to the clients. This architecture renders the simulation vulnerable to attacks against the semantic integrity of the simulation: malicious clients may attempt to compromise the physical and logical rules governing the simulation, or to alter the causality of events. This paper initiates the systematic study of semantic integrity in NVES from a security point of view. We present a new provably secure semantic integrity protocol which enables the server system to audit the local computations of the clients on demand.
|
[12] |
Sagar Chaki, Christian Schallhart, and Helmut Veith.
Verification Across Intellectual Property Boqundaries.
In Computer Aided Verification (CAV'07), volume 4590 of
Lecture Notes in Computer Science (LNCS), pages 82-94, 2007.
[ » bib | » pdf ] In many industries, the share of software components provided by third-party suppliers is steadily increasing. As the suppliers seek to secure their intellectual property (IP) rights, the customer usually has no direct access to the suppliers’ source code, and is able to enforce the use of verification tools only by legal requirements. In turn, the supplier has no means to convince the customer about successful verification without revealing the source code. This paper presents a new approach to resolve the conflict between the IP interests of the supplier and the quality interests of the customer. We introduce a protocol in which a dedicated server (called the “amanat”) is controlled by both parties: the customer controls the verification task performed by the amanat, while the supplier controls the communication channels of the amanat to ensure that the amanat does not leak information about the source code. We argue that the protocol is both practically useful and mathematically sound. As the protocol is based on well-known (and relatively lightweight) cryptographic primitives, it allows a straightforward implementation on top of existing verification tool chains. To substantiate our security claims, we establish the correctness of the protocol by cryptographic reduction proofs.
|
[11] |
Andreas Bauer, Martin Leucker, and Christian Schallhart.
Monitoring of Realtime Properties.
In Foundations of Software Technology and Theoretical Computer
Science (FSTTCS'06), volume 4337 of Lecture Notes in Computer Science
(LNCS), pages 260-272, 2006.
[ » bib | » pdf ] This paper presents a construction for runtime monitors that check real-time properties expressed in timed LTL (TLTL). Due to D’Souza’s results, TLTL can be considered a natural extension of LTL towards real-time. Moreover, a typical obstacle in runtime verification is solved both for untimed and timed formulae, in that standard models of linear temporal logic are infinite traces, whereas in runtime verification only finite system behaviours are at hand. Therefore, a 3-valued semantics (true, false, inconclusive) for LTL and TLTL on finite traces is defined that resembles the infinite trace semantics in a suitable and intuitive manner. Then, the paper describes how to construct, given a (T)LTL formula, a deterministic monitor with three output symbols that reads a finite trace and yields its according 3-valued (T)LTL semantics. Notably, the monitor rejects a trace as early as possible, in that any minimal bad prefix results in false as a return value.
|
[10] |
Andreas Bauer, Martin Leucker, and Christian Schallhart.
Runtime Reflection: Dynamic model-based analyis of component-based
distributed embedded systems.
In Modellierung von Automotive Systems, 2006.
[ » bib | » pdf ] Distributed embedded systems have pervaded the automotive domain, but often still lack measures to ensure adequate behaviour in the presence of unforeseen events, or even errors at runtime. As interactions and dependencies within distributed automotive systems increase, the problem of detecting failures which depend on the exact situation and environment conditions they occur in grows. As a result, not only the detection of failures is increasingly difficult, but also the differentiation between the symptoms of a fault, and the actual fault itself, i. e., the cause of a problem.
|
[9] |
Andreas Bauer, Martin Leucker, and Christian Schallhart.
Model-Based Runtime Analysis of Distributed Reactive Systems.
In Australian Software Engineering Conference (ASWEC'06), pages
243-252, 2006.
[ » bib | » pdf ] Reactive distributed systems have pervaded everyday life and objects, but often lack measures to ensure adequate behaviour in the presence of unforeseen events or even errors at runtime. As interactions and dependencies within distributed systems increase, the problem of detecting failures which depend on the exact situation and environment conditions they occur in grows. As a result, not only the detection of failures is increasingly difficult, but also the differentiation between the symptoms of a fault, and the actual fault itself, i. e., the cause of a problem.
|
[8] |
Christian Schallhart and Luca Trevisan.
Approximating Succinct MaxSat.
Journal of Logic and Computation (JLC), 15(4):551-557, 2005.
[ » bib | » pdf ] We study the approximability of the version of MaxSat where exponentially large instances are succinctly represented using circuits. First, we prove that the NP-hardness for approximating MaxSat can be lifted to a corresponding NEXP-hardness for approximating circuit-succinct MaxSat for some constant performance ratio. Second, we consider the approximability of circuit-succinct MaxSat with respect to lower complexity classes: In particular, we prove that computing (2-ε)-approximate solutions for circuit-succinct MaxSat is at least as hard as inverting one-way permutations. On the other hand, a simple randomized approximation algorithm computes a (2+ε)-approximate solution with high probability. Recall that the standard (not succinctly represented) version of the MaxSat problem is approximable to within a .78 factor [?] and that the MaxThreeSat problem is approximable to within a 7/8 factor [?].
|
[7] |
Axel Belinfante, Lars Franzen, and Christian Schallhart.
Model-Based Testing of Reactive Systems, volume 3472 of
Lecture Notes in Computer Science (LNCS), chapter Tools for Test Case
Generation, pages 391-438.
Springer, 2005.
[ » bib | » pdf ] The preceding parts of this book have mainly dealt with test theory, aimed at improving the practical techniques which are applied by testers to enhance the quality of soft- and hardware systems. Only if these academic results can be efficiently and successfully transferred back to practice, they were worth the effort.
|
[6] |
Johannes Kinder, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith.
Detecting Malicious Code by Model Checking.
In Conference on Detection of Intrusions and Malware and
Vulnerability Assessment (DIMVA'05), volume 3548 of Lecture Notes in
Computer Science (LNCS), pages 174-187, 2005.
[ » bib | » pdf ] The ease of compiling malicious code from source code in higher programming languages has increased the volatility of malicious programs: The first appearance of a new worm in the wild is usually followed by modified versions in quick succession. As demonstrated by Christodorescu and Jha, however, classical detection software relies on static patterns, and is easily outsmarted. In this paper, we present a flexible method to detect malicious code patterns in executables by model checking. While model checking was originally developed to verify the correctness of systems against specifications, we argue that it lends itself equally well to the specification of malicious code patterns. To this end, we introduce the specification language CTPL (Computation Tree Predicate Logic) which extends the well-known logic CTL, and describe an efficient model checking algorithm. Our practical experiments demonstrate that we are able to detect a large number of worm variants with a single specification.
|
[5] |
Jana Dittmann, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith.
Ensuring Media Integrity on Third-Party Infrastructures.
In IFIP International Information Security Conference, pages
493-508, 2005.
[ » bib | » pdf ] In many heterogeneous networked applications the integrity of multimedia data plays an essential role, but is not directly supported by the application. In this paper, we propose a method which enables an individual user to detect tampering with a multimedia file without changing the software application provided by the third party. Our method is based on a combination of cryptographic signatures and fragile watermarks, i.e., watermarks that are destroyed by illegitimate tampering. We show that the proposed system is provably secure under standard cryptographic assumptions.
|
[4] |
Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith.
Malware Engineering.
In Sicherheit 2005, Sicherheit - Schutz und Zuverlässigkeit, 2.
Jahrestagung, Fachbereich Sicherheit der Gesellschaft für Informatik, pages
139-148, 2005.
[ » bib | » pdf ] Starting from simple hand-crafted viruses, today’s malware has evolved to constitute highly infectious computer diseases. The technical development of malware was mainly driven by the wish to improve and accelerate both attacks and proliferation. Although these programs have incurred significant hazard and financial losses, their mechanisms are relatively simple and are amenable to effective countermeasures— once, the first attack has been launched. From a software technology point of view, malicious software in fact is often very similar to network services with the main difference that security holes are exploited to enforce participation in the protocol.
|
[3] |
Markus Holzer, Stefan Katzenbeisser, and Christian Schallhart.
Towards a formal semantics for odrl.
In First ODRL International Workshop, pages 137-148, 2004.
[ » bib | » pdf ] We give a brief overview of a new way to model the semantics of ODRL permissions in a formal manner by using finite-automata like structures. The constructed automata capture the sequence of actions that a user is allowed to perform according to a specific permission. In contrast to previous approaches, our semantics is able to model sell and lend permissions.
|
[2] |
Christian Schallhart.
Transaction Processing for Clustered Virtual Environments.
In NATO Advanced Research Workshop on Concurrent Information
Processing and Computing, pages 146-158, 2003.
invited.
[ » bib | » pdf ] This paper introduces Massively Multi-Player Online Role Games (MMORGS) which are currently amain focus of the gaming industry. MMORGS are Networked Virtual Environments (NVES) where a player can navigate his or her character through a large game-world which is populated by thousands of other players and non player characters. Based the problems arising in the context of MMORGS, we will motivate the development of a domain-independent middleware to support MMORG in particular and networked NVES in general. Finally, we will introduce APEIRON, a middleware which is based on a flexible transaction processing framework. APEIRON is designed to serve as basis for the next generation MMORG.
|
[1] |
Ulrich Neumerkel, Christoph Rettig, and Christian Schallhart.
Visualizing solutions with viewers.
In Workshop on Logic Programming Environment (LPE'97), pages
43-50, 1997.
[ » bib ] |
Copyright Notice. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder. Also, some of these works have been submitted for publication. Copyright may be transferred without further notice and this version may no longer be accessible.
Acknowledgement. This page has been generated with » bibtex2html 1.95.