[62] Alexander Erdmann, Tom Kenter, Markus Becker, and Christian Schallhart. Frugal Paradigm Completion. In Annual Conference of the Association for Computational Linguistics (ACL'20), 2020.
[ » bib | » pdf ]
Lexica distinguishing all morphologically related forms of each lexeme are crucial to many language technologies, yet building them is expensive. We propose a frugal paradigm completion approach that predicts all related forms in a morphological paradigm from as few manually provided forms as possible. It induces typological information during training which it uses to determine the best sources at test time. We evaluate our language-agnostic approach on 7 diverse languages. Compared to popular alternative approaches, ours reduces manual labor by 16-63% and is the most robust to typological variation.

[61] Sandy Ritchie, Eoin Mahon, Kim Heiligenstein, Nikos Bampounis, Daan van Esch, Christian Schallhart, Jonas Fromseier Mortensen, and Benoît Brard. Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages. In Joint Spoken Language Technologies for Under-resourced languages and Collaboration and Computing for Under-Resourced Languages Workshop (SLTU-CCURL'20), 2020.
[ » bib | » pdf ]
[60] Sandy Ritchie, Richard Sproat, Kyle Gorman, Daan van Esch, Christian Schallhart, Nikos Bampounis, Benoît Brard, Jonas Fromseier Mortensen, Millie Holt, and Eoin Mahon. Unified verbalization for speech recognition & synthesis across languages. In INTERSPEECH'19, pages 3530-3534, 2019.
[ » bib | » pdf ]
We describe a new approach to converting written tokens to their spoken form, which can be shared by automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems. Both ASR and TTS need to map from the written to the spoken domain, and we present an approach that enables us to share verbalization grammars between the two systems while exploiting linguistic commonalities to provide simple default verbalizations. We also describe improvements to an induction system for number names grammars. Between these shared ASR/TTS verbalizers and the improved induction system for number names grammars, we achieve significant gains in development time and scalability across languages.

[59] Georg Gottlob, Tim Furche, Giovanni Grasso, Christian Schallhart, and Giorgio Orsi. System for automatically generating wrapper for entire websites. United States Patent, 2019. US 10,325,000 B2.
[ » bib | » pdf ]
A system for automatically generating a wrapper for an entire website, the wrapper characterizing the structure of the website, the system having a plurality of functional elements, including at least one annotation module to classify components of a page and generate an annotated page, a page classification module to identify functional and informational components of an annotated page, and an action module to identify an action to be taken to further navigate the website, wherein at least one of the annotation module, page classification module and action module is operable in response to a plurality of domain-specific rules, where a domain is understood as a conceptual domain such as real estate, used cars, or electronics.

[58] Tim Furche, Giovanni Grasso, Michael Huemer, Christian Schallhart, and Michael Schrefl. PEACE-Ful Web Event Extraction and Processing as Bitemporal Mutable Events. ACM Transactions on the Web (TWEB), 10(3), 2016.
[ » bib | » pdf ]
The web is the largest bulletin board of the world. Events of all types, from flight arrivals to business meetings, are announced on this board. Tracking and reacting to such event announcements, however, is a tedious manual task, only slightly alleviated by email or similar notifications. Announcements are published with human readers in mind, and updates or delayed announcements are frequent. These characteristics have hampered attempts at automatic tracking.

PEACE provides the first integrated framework for event processing on top of web event ads, consisting of event extraction, complex event processing, and action execution in response to these events. Given a schema of the events to be tracked, the framework populates this schema by extracting events from announcement sources. This extraction is performed by little programs called wrappers that produce the events including updates and retractions. PEACE then queries these events to detect complex events, often combining announcements from multiple sources. To deal with updates and delayed announcements, PEACE’s schemas are bitemporal, to distinguish between occurrence and detection time. This allows complex event specifications to track updates and to react upon differences in occurrence and detection time. In case of new, changing, or deleted events, PEACE allows one to execute actions, such as tweeting or sending out email notifications. Actions are typically specified as web interactions, for example, to fill and submit a form with attributes of the triggering event.

Our evaluation shows that PEACE’s processing is dominated by the time needed for accessing the web to extract events and perform actions, accounting for 97.4% of the total runtime. Thus, PEACE requires only 2.6% overhead, and therefore, the complex event processor scales well even with moderate resources. We further show that simple and reasonable restrictions on complex event specifications and the timing of constituent events suffice to guarantee that PEACE only requires a constant buffer to process arbitrarily many event announcements.
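To make the bitemporal idea concrete, here is a minimal, self-contained Python sketch (not PEACE itself; all event fields and thresholds are invented) of a record that keeps occurrence time and detection time apart, and of a check that flags a delayed announcement:

```python
# Minimal sketch (not PEACE itself): a bitemporal event record keeps the
# occurrence time and the detection time apart; fields and threshold invented.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FlightEvent:
    flight: str
    occurrence: datetime        # when the arrival actually takes place
    detection: datetime         # when the announcement was extracted from the web
    retracted: bool = False     # later announcements may retract earlier ones

def is_delayed_announcement(ev: FlightEvent,
                            slack: timedelta = timedelta(hours=1)) -> bool:
    """Announcement detected long after its occurrence time: a candidate for
    triggering a 'late update' action such as an email notification."""
    return not ev.retracted and ev.detection - ev.occurrence > slack

ev = FlightEvent("LH123",
                 occurrence=datetime(2016, 3, 1, 10, 0),
                 detection=datetime(2016, 3, 1, 12, 30))
print(is_delayed_announcement(ev))   # True: detected 2.5 hours after occurrence
```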

[57] Tim Furche, Jinsong Guo, Sebastian Maneth, and Christian Schallhart. Robust and Noise Resistant Wrapper Induction. In SIGMOD'16, pages 773-784, 2016.
[ » bib | » pdf ]
Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should not only select the annotated content accurately but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g., machine generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPath and show that even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise resistant queries. We evaluate the queries on snapshots from web pages that change over time as provided by the Internet Archive, and show that the induced queries are as robust as the human-made queries. The queries often survive hundreds, sometimes thousands, of days, with many changes to the relative position of the selected nodes (including changes on template level). This is due to the few and discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).
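As an illustration of the kind of induced query described above, the following Python/lxml sketch runs an XPath wrapper that anchors on one discriminative intermediate node instead of relying on absolute positions; the markup and class names are invented:

```python
# Minimal sketch with lxml: a wrapper query that anchors on a discriminative
# intermediate node (the results container) rather than on absolute positions.
# The markup below is invented for illustration.
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="ads">...</div>
  <div id="results">
    <div class="item"><h2><a href="/p/1">First offer</a></h2></div>
    <div class="item"><h2><a href="/p/2">Second offer</a></h2></div>
  </div>
</body></html>""")

# Anchor node: the #results container; the rest of the query stays relative,
# so reordering or adding template blocks elsewhere does not break it.
titles = page.xpath("//div[@id='results']//div[contains(@class,'item')]//a/text()")
print(titles)   # ['First offer', 'Second offer']
```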

[56] Andreas Holzer, Christian Schallhart, Michael Tautschnig, and Helmut Veith. Closure properties and complexity of rational sets of regular languages. Theoretical Computer Science (TCS), 605:62-79, 2015.
[ » bib | » pdf ]
The test specification language FQL describes relevant test goals as regular expressions over program locations, such that each matching test case has an execution path matching this expression. To specify not only test goals but entire suites, FQL describes families of related test goals by regular expressions over extended alphabets: Herein, each symbol corresponds to a regular expression over program locations, and thus, a word in an FQL expression corresponds to a regular expression describing a single test goal. In this paper we provide a systematic foundation for FQL test specifications, which are in fact rational sets of regular languages (RSRLs). To address practically relevant problems like query optimization, we tackle open questions about RSRLs: We settle closure properties of general and finite RSRLs under common set theoretic operations. We also prove complexity results for checking equivalence and inclusion of star-free RSRLs, and for deciding whether a regular language is a member of a general or star-free RSRL.
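The following toy Python sketch illustrates the construction behind RSRLs as used in FQL: each symbol of the extended alphabet stands for a regular expression over program locations, and a word over these symbols expands into a single test-goal regex. All symbols and locations are invented, and this is not FQL syntax:

```python
# Toy illustration of RSRLs: symbols of the extended alphabet map to regexes
# over program locations; a word over the symbols expands into one test goal.
import re

LOC = "(?:L[0-9]+)"                    # one program location
alphabet = {
    "A": LOC + "*",                    # 'any locations'
    "c": "L3",                         # pass location L3 (e.g., a call site)
    "e": "L7",                         # pass location L7 (e.g., an error label)
}

def test_goal(word: str) -> str:
    """Expand a word over the extended alphabet into a regex over locations."""
    return "".join(f"(?:{alphabet[s]})" for s in word)

goal = test_goal("AcAeA")              # 'reach L3 and later L7'
path = "L1L2L3L5L7"                    # one execution path, written as locations
print(bool(re.fullmatch(goal, path)))  # True: this path covers the goal
```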

[55] Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, and Cheng Wang. DIADEM: Thousands of Websites to a Single Database. Proceedings of the VLDB Endowment/International Conference on Very Large Databases (PVLDB'14), 7(14):1845-1856, 2014.
[ » bib | » pdf ]
The web is overflowing with implicitly structured data, spread over hundreds of thousands of sites, hidden deep behind search forms, or siloed in marketplaces, only accessible as HTML. Automatic extraction of structured data at the scale of thousands of websites has long proven elusive, despite its central role in the “web of data”.

Through an extensive evaluation spanning over 10000 web sites from multiple application domains, we show that automatic, yet accurate full-site extraction is no longer a distant dream. DIADEM is the first automatic full-site extraction system that is able to extract structured data from different domains at very high accuracy. It combines automated exploration of websites, identification of relevant data, and induction of exhaustive wrappers. Automating these components is the first challenge. DIADEM overcomes this challenge by combining phenomenological and ontological knowledge. Integrating these components is the second challenge. DIADEM overcomes this challenge through a self-adaptive network of relational transducers that produces effective wrappers for a wide variety of websites.

Our extensive and publicly available evaluation shows that, for more than 90% of sites from three domains, DIADEM obtains an effective wrapper that extracts all relevant data with 97% average precision. DIADEM also tolerates noisy entity recognisers, and its components individually outperform comparable approaches.

[54] Martin Franz, Andreas Holzer, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith. CBMC-GC: An ANSI C Compiler for Secure Two-Party Computations. In International Conference on Compiler Construction (CC'14), pages 244-249, 2014.
[ » bib | » pdf ]
Secure two-party computation (STC) is a computer security paradigm where two parties can jointly evaluate a program with sensitive input data, provided in parts from both parties. By the security guarantees of STC, neither party may learn any information on the other party’s input while performing the STC task. Long thought to be impractical, STC has until recently only been implemented with domain-specific languages or hand-crafted Boolean circuits for specific computations. Our open-source compiler CBMC-GC is the first ANSI C compiler for STC. It turns C programs into Boolean circuits that fit the requirements of garbled circuits, a generic STC approach for circuits. Here, the size of the resulting circuits plays a crucial role, since each STC step involves encryption and network transfer and is therefore extremely slow when compared to computations performed on modern hardware architectures. We report on newly implemented circuit optimization techniques that substantially reduce the circuit sizes compared to the original release of CBMC-GC.
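A minimal sketch of why circuit size is the central cost metric: the toy Python code below "compiles" a greater-than comparison over n-bit inputs into a gate list and counts the gates. It is an illustration only, not CBMC-GC's translation or its optimizations:

```python
# Toy sketch (not CBMC-GC): build a ripple comparator 'a > b' over n-bit
# inputs as a Boolean gate list and count the gates, the quantity that the
# circuit optimisations above aim to reduce.
def greater_than_circuit(n_bits: int):
    gates = []                          # each gate: (output, op, input1, input2)
    fresh = iter(range(10_000))
    def new(op, x, y):
        out = f"w{next(fresh)}"
        gates.append((out, op, x, y))
        return out

    gt, eq = "const0", "const1"         # running 'greater so far' and 'equal so far'
    for i in reversed(range(n_bits)):   # from most to least significant bit
        a, b = f"a{i}", f"b{i}"
        not_b  = new("NOT", b, None)
        a_gt_b = new("AND", a, not_b)           # a_i > b_i
        bit_eq = new("XNOR", a, b)              # a_i == b_i
        gt     = new("OR", gt, new("AND", eq, a_gt_b))
        eq     = new("AND", eq, bit_eq)
    return gates, gt

gates, out = greater_than_circuit(32)
print(len(gates), "gates, result wire:", out)   # circuit size drives STC cost
```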

[53] Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, and Christian Schallhart. The Ontological Key: Automatically Understanding and Integrating Forms to Access the Deep Web. VLDB Journal, 22(5):615-640, 2014.
[ » bib | » pdf ]
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding provides applications, ranging from crawlers over meta-search engines to service integrators, with a key to this content. Yet, it has received little attention other than as a component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data.

In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL advances the state of the art: For form labeling, it combines features from the text, structure, and visual rendering of a web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect (> 97% accuracy in the evaluation domains) form interpretations. Yet, the effort to produce a domain schema is very low, as we provide a Datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of OPAL’s form interpretations through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms with no error, yet is implemented with only a handful of translation rules.
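The following Python/lxml sketch illustrates the form-labeling task on an invented form: it associates visible label text with fields via two simple structural cues (the for attribute and the nearest preceding text). OPAL's actual rules combine far richer textual, structural, and visual features:

```python
# Minimal sketch of form labelling: associate label text with form fields via
# simple structural cues. Markup is invented; OPAL uses many more features.
from lxml import html

form = html.fromstring("""
<form>
  <label for="minprice">Min price</label> <input id="minprice" name="min"/>
  <label for="maxprice">Max price</label> <input id="maxprice" name="max"/>
  Bedrooms <select name="beds"><option>1</option><option>2</option></select>
</form>""")

labelling = {}
for field in form.xpath(".//input | .//select"):
    fid = field.get("id")
    explicit = form.xpath(f".//label[@for='{fid}']/text()") if fid else []
    if explicit:
        labelling[field.get("name")] = explicit[0].strip()
    else:  # fall back to the closest preceding non-empty text node
        prev = field.xpath("preceding::text()[normalize-space()][1]")
        labelling[field.get("name")] = prev[0].strip() if prev else None

print(labelling)   # {'min': 'Min price', 'max': 'Max price', 'beds': 'Bedrooms'}
```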

[52] Andreas Holzer, Christian Schallhart, Michael Tautschnig, and Helmut Veith. On the structure and complexity of rational sets of regular languages. In Foundations of Software Technology and Theoretical Computer Science (FSTTCS'13), pages 377-388, 2013.
[ » bib | » pdf ]
In the recently designed and implemented test specification language FQL, relevant test goals are specified as regular expressions over program locations. To transition from single test goals to test suites, FQL describes these as regular expressions over alphabets where each symbol corresponds to a regular expression over program locations. Hence, each word in a test suite expression is a test goal specification. Such test suite specifications are in fact rational sets of regular languages (RSRLs). We show closure properties of RSRLs under common set theoretic operations for general and finite RSRLs. We also prove complexity results for checking equivalence and inclusion of star-free RSRLs and for checking whether a regular language is a member of a general or star-free RSRL. As the star-free (and thus finite) case underlies FQL specifications, we provide a systematic foundation for FQL test specifications.

[51] Tim Furche, Giovanni Grasso, Michael Huemer, Christian Schallhart, and Michael Schrefl. PEACE-ful Web Event Extraction and Processing. In 14th International Conference on Web Information System Engineering (WISE'13), pages 523-526, 2013.
[ » bib | » pdf ]
PEACE, our proposed tool, integrates complex event processing and web extraction into a unified framework to handle web event advertisements and to run a notification service atop. Its bitemporal schemata distinguish occurrence and detection time, enabling PEACE to deal with updates and delayed announcements, as often occurring on the web. To consolidate the arising event streams, PEACE combines simple events into complex ones. Depending on their occurrence and detection time, these complex events trigger actions to be executed.

We demonstrate PEACE’s capabilities with a business trip scenario, involving as raw events business trips, flight bookings, scheduled flights, and flight arrivals and departures. These events are scraped from the web and combined into complex events, triggering actions to be executed, such as updating Facebook status messages. Our demonstrator records and reruns event sequences at different speeds to show the system dealing with complex scenarios spanning several days.

[50] Tim Furche, Giovanni Grasso, Michael Huemer, Christian Schallhart, and Michael Schrefl. Bitemporal Complex Event Processing of Web Event Advertisements. In 14th International Conference on Web Information System Engineering (WISE'13), pages 333-346, 2013.
[ » bib | » pdf ]
The web is the largest bulletin board of the world. Events of all types, from flight arrivals to business meetings, are announced on this board. Tracking and reacting to such event announcements, however, is a tedious manual task, only slightly alleviated by email or similar notifications. Announcements are published with human readers in mind, and updates or delayed announcements are frequent. These characteristics have hampered attempts at automatic tracking.

PEACE provides the first integrated framework for event processing on top of web event ads. Given a schema of events to be tracked, the framework populates this schema through compact wrappers for event announcement sources. These wrappers produce events including updates and retractions. PEACE then queries these events to detect complex events, often combining announcements from multiple sources. To deal with updates and delayed announcements, PEACE’s schemas are bitemporal so as to distinguish between occurrence and detection time. This allows complex event specifications to track updates and to react to differences in occurrence and detection time. Our evaluation shows that extracting the event from an announcement dominates the processing of PEACE and that the complex event processor deals with several event announcement sources even with moderate resources. We further show that simple restrictions on the complex event specifications suffice to guarantee that PEACE only requires a constant buffer to process arbitrarily many event announcements.

[49] Tim Furche, Giovanni Grasso, and Christian Schallhart. Effective Web Scraping with OXPath. In WWW (developer track), pages 23-26, 2013.
[ » bib | » pdf ]
Even in the third decade of the Web, scraping web sites remains a challenging task: Most scraping programs are still developed as ad-hoc solutions using a complex stack of languages and tools. Where comprehensive extraction solutions exist, they are expensive, heavyweight, and proprietary.

OXPath is a minimalistic wrapping language that is nevertheless expressive and versatile enough for a wide range of scraping tasks. In this presentation, we want to introduce you to a new paradigm of scraping: declarative navigation. Instead of complex scripting or heavyweight, limited visual tools, OXPath turns scraping into a simple two-step process: pick the relevant nodes through an XPath expression and then specify which action to apply to those nodes. OXPath takes care of browser synchronisation, page and state management, making scraping as easy as node selection with XPath. To achieve this, OXPath does not require a complex or heavyweight infrastructure. OXPath is an open source project and has seen first adoption in a wide variety of scraping tasks.
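To see what the two-step paradigm amounts to, the sketch below emulates it imperatively with Selenium (URL and selectors are hypothetical); OXPath expresses the same navigation and extraction declaratively in a single expression and handles page and state management itself:

```python
# What the two-step paradigm (select nodes via XPath, then apply an action)
# looks like when emulated imperatively with a browser automation library.
# URL and selectors below are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.org/search")

# Step 1: pick the relevant node with an XPath expression.
box = driver.find_element(By.XPATH, "//form//input[@name='q']")
# Step 2: specify the action to apply to that node.
box.send_keys("3 bedroom flat Oxford")
box.submit()

# Extraction is again plain XPath node selection on the result page.
titles = [a.text for a in driver.find_elements(By.XPATH, "//h2[@class='result']/a")]
print(titles)
driver.quit()
```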

[48] Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew Sellers. OXPATH: A Language for Scalable Data Extraction, Automation, and Crawling on the Deep Web. VLDB Journal, 22(1):47-72, 2013. VLDB Best Papers Issue.
[ » bib | » pdf ]
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth.

To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) Interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies.

We introduce OXPATH as an extension of XPATH for interacting with web applications and extracting data thus revealed – matching all the above requirements. OXPATH’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPATH’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sub-languages and properties of OXPATH, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPATH outperforms existing commercial and academic data extraction tools by a wide margin.

[47] Sagar Chaki, Christian Schallhart, and Helmut Veith. Verification Across Intellectual Property Boundaries. ACM Transactions on Software Engineering and Methodology (TOSEM), 22(2):15:1-15:12, 2013.
[ » bib | » pdf ]
In many industries, the importance of software components provided by third-party suppliers is steadily increasing. As the suppliers seek to secure their intellectual property (IP) rights, the customer usually has no direct access to the suppliers’ source code, and is able to enforce the use of verification tools only by legal requirements. In turn, the supplier has no means to convince the customer about successful verification without revealing the source code. This paper presents an approach to resolve the conflict between the IP interests of the supplier and the quality interests of the customer. We introduce a protocol in which a dedicated server (called the “amanat”) is controlled by both parties: the customer controls the verification task performed by the amanat, while the supplier controls the communication channels of the amanat to ensure that the amanat does not leak information about the source code. We argue that the protocol is both practically useful and mathematically sound. As the protocol is based on well-known (and relatively lightweight) cryptographic primitives, it allows a straightforward implementation on top of existing verification tool chains. To substantiate our security claims, we establish the correctness of the protocol by cryptographic reduction proofs.

[46] Ömer Gunes, Tim Furche, Christian Schallhart, Jens Lehmann, and Axel Ngonga. EAGER: Extending Automatically Gazetteers for Entity Recognition. In Proceedings of the 3rd Workshop on the People’s Web Meets NLP, pages 29-33, 2012.
[ » bib | » pdf ]
Key to named entity recognition, the manual gazetteering of entity lists is a costly, error-prone process that often yields results that are incomplete and suffer from sampling bias. Exploiting current sources of structured information, we propose a novel method for extending minimal seed lists into complete gazetteers. Like previous approaches, we value WIKIPEDIA as a huge, well-curated, and relatively unbiased source of entities. However, in contrast to previous work, we exploit not only its content, but also its structure, as exposed in DBPEDIA. We extend gazetteers through Wikipedia categories, carefully limiting the impact of noisy categorizations. The resulting gazetteers easily outperform previous approaches on named entity recognition.

[45] Jens Lehmann, Tim Furche, Giovanni Grasso, Axel-Cyrille Ngonga Ngomo, Christian Schallhart, Andrew Sellers, Christina Unger, Lorenz Bühmann, Daniel Gerber, Konrad Höffner, David Liu, and Sören Auer. DEQA: Deep Web Extraction for Question Answering. In International Semantic Web Conference (ISWC'12), Lecture Notes in Computer Science (LNCS), pages 131-147, 2012.
[ » bib | » pdf ]
Despite decades of effort, intelligent object search remains elusive. Neither search engine nor semantic web technologies alone have managed to provide usable systems for simple questions such as “find me a flat with a garden and more than two bedrooms near a supermarket.” We introduce deqa, a conceptual framework that achieves this elusive goal through combining state-of-the-art semantic technologies with effective data extraction. To that end, we apply deqa to the UK real estate domain and show that it can answer a significant percentage of such questions correctly. deqa achieves this by mapping natural language questions to Sparql patterns. These patterns are then evaluated on an RDF database of current real estate offers. The offers are obtained using OXPath, a state-of-the-art data extraction system, on the major agencies in the Oxford area and linked through Limes to background knowledge such as the location of supermarkets.
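For illustration, a pattern of the kind such a question could be mapped to might look as follows; the RDF vocabulary is hypothetical, not DEQA's actual schema:

```python
# Illustration only: the kind of SPARQL pattern the example question could map
# to. The prefix, property names, and nearby-supermarket encoding are invented.
QUESTION = "find me a flat with a garden and more than two bedrooms near a supermarket"

SPARQL_PATTERN = """
PREFIX ex: <http://example.org/realestate#>
SELECT ?offer WHERE {
  ?offer a ex:Flat ;
         ex:hasGarden true ;
         ex:bedrooms ?beds ;
         ex:nearby ?poi .
  ?poi a ex:Supermarket .
  FILTER(?beds > 2)
}
"""
print(SPARQL_PATTERN)
```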

[44] Tim Furche, Giovanni Grasso, Antonino Rullo, Christian Schallhart, and Andrew Sellers. Think before you Act! Minimising Action Execution in Wrappers. In International Workshop on Searching and Integrating New Web Data Sources (VLDS'12), pages 29-34, 2012.
[ » bib | » pdf ]
Web wrappers access databases hidden in the deep web by first interacting with web sites by, e.g., filling forms or clicking buttons, to extract the relevant data from the thus unearthed result pages. Though the (semi-)automatic induction and maintenance of such wrappers has been extensively studied, the efficient execution and optimization of wrappers has seen far less attention.

We demonstrate that static and adaptive optimisation techniques, as used for query languages, significantly improve the wrapper execution performance. At the same time, we highlight differences between wrapper optimisation and common query optimisation for databases: (1) The runtime of wrappers is entirely dominated by page loads, while other operations (such as querying DOMs) have almost no impact, requiring a new cost model to guide the optimisation. (2) While adaptive query planning is otherwise often considered inessential, wrappers need to be optimised during run-time, since crucial information on the structure of the visited pages becomes only accessible at runtime. We introduce two basic, but highly effective optimisation techniques, one static, one adaptive, and show that they can easily cut wrapper evaluation time by one order of magnitude. We demonstrate our approach with wrappers specified in OXPath.

[43] Tim Furche, Georg Gottlob, and Christian Schallhart. DIADEM: Domains to Databases. In International Conference on Database and Expert Systems Applications (DEXA'12), pages 1-8, 2012. invited.
[ » bib | » pdf ]
What if you could turn all websites of an entire domain into a single database? Imagine all real estate offers, all airline flights, or all your local restaurants’ menus automatically collected from hundreds or thousands of agencies, travel agencies, or restaurants, presented as a single homogeneous dataset.

Historically, this has required tremendous effort by the data providers and whoever is collecting the data: Vertical search engines aggregate offers through specific interfaces which provide suitably structured data. The semantic web vision replaces the specific interfaces with a single one, but still requires providers to publish structured data.

Attempts to turn human-oriented HTML interfaces back into their underlying databases have largely failed due to the variability of web sources. In this paper, we demonstrate that this is about to change: The availability of comprehensive entity recognition together with advances in ontology reasoning have made possible a new generation of knowledge-driven, domain-specific data extraction approaches. To that end, we introduce diadem, the first automated data extraction system that can turn nearly any website of a domain into structured data, working fully automatically, and present some preliminary evaluation results.

[42] Tim Furche, Giovanni Grasso, Andrey Kravchenko, and Christian Schallhart. Turn the Page: Automated Traversal of Paginated Websites. In International Conference on Web Engineering (ICWE'12), volume 7387 of Lecture Notes in Computer Science (LNCS), pages 332-346, 2012.
[ » bib | » pdf ]
Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Thus, human users and automatic tools alike have to traverse the pagination links when they crawl the site, extract data, or automate common tasks, where these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text), falling back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only fractions of the individual pages of a given site, rendering a highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages. We identify pagination links in a wide range of domains and sites with near perfect accuracy (99%). We achieve these results with a novel framework for web block classification, BERyL, that combines rule-based reasoning for feature extraction and machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings, adjusted to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
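A minimal sketch of the rule-plus-learning combination described above: hand-written rules extract features from a candidate link block and an off-the-shelf classifier decides whether it is a pagination link. The features and training data below are invented and far simpler than BERyL's:

```python
# Sketch of rule-based feature extraction followed by a learned classifier.
# Features and training data are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

def features(block: dict) -> list:
    """Hand-written rules turn a candidate link block into a feature vector."""
    text = block["text"].strip().lower()
    return [
        int(text.isdigit()),                         # bare page number
        int(text in {"next", ">", ">>", "more"}),    # typical 'next' labels
        int("page" in block["href"].lower()),        # URL mentions paging
        block["y"],                                  # vertical render position
    ]

train = [
    ({"text": "2",       "href": "?page=2", "y": 900}, 1),
    ({"text": "Next",    "href": "?page=2", "y": 900}, 1),
    ({"text": "Home",    "href": "/",       "y": 10},  0),
    ({"text": "Imprint", "href": "/about",  "y": 950}, 0),
]
clf = DecisionTreeClassifier().fit([features(b) for b, _ in train],
                                   [label for _, label in train])
print(clf.predict([features({"text": ">>", "href": "/list?page=3", "y": 880})]))
```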

[41] Tim Furche, Georg Gottlob, Giovanni Grasso, Ömer Gunes, Xiaonan Guo, Andrey Kravchenko, Giorgio Orsi, Christian Schallhart, Andrew Sellers, and Cheng Wang. DIADEM: Domain-centric, Intelligent, Automated Data Extraction Methodology. In International World Wide Web Conference (WWW'12), pages 267-270, 2012.
[ » bib | » pdf ]
Search engines are the sinews of the web. These sinews have become strained, however: Where the web’s function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers—never to be sure a better offer isn’t just around the corner. What search engines are missing is understanding of the objects and their attributes published on websites.

Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy. The price is that for this task we need to provide DIADEM with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we show with a first prototype of DIADEM that, in contrast to the alchemists, DIADEM has found a viable formula.

[40] Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, and Cheng Wang. Automatically Learning Gazetteers from the Deep Web. In International World Wide Web Conference (WWW'12), pages 341-344, 2012.
[ » bib | » pdf ]
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach AMBER is able to identify records and their attributes with almost perfect accuracy (> 98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: From a small seed sample we achieve 94.4% coverage of UK locations in the 4th iteration.

[39] Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, and Christian Schallhart. OPAL: A Passe-partout for Web Forms. In International World Wide Web Conference (WWW'12), pages 353-356, 2012.
[ » bib | » pdf ]
Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching.

Automatic form understanding enables a broad range of applications, including crawlers, meta-search engines, and usability and accessibility support for enhanced web browsing. In this demonstration, we use a novel form understanding approach, OPAL, to assist in form filling even for complex, previously unknown forms. OPAL associates form labels to fields by analyzing structural properties in the HTML encoding and visual features of the page rendering. OPAL interprets this labeling and classifies the fields according to a given domain ontology. The combination of these two properties allows OPAL to deal effectively with many forms outside of the grasp of existing form filling techniques. In the UK real estate domain, OPAL achieves > 99% accuracy in form understanding.

[38] Jochen Kranzdorf, Andrew Sellers, Giovanni Grasso, Christian Schallhart, and Tim Furche. Visual OXPath: Robust Wrapping by Example. In International World Wide Web Conference (WWW'12), pages 369-372, 2012.
[ » bib | » pdf ]
Good examples are hard to find, particularly in wrapper induction: Picking even one wrong example can spell disaster by yielding overgeneralized or overspecialized wrappers. Such wrappers extract data with low precision or recall, unless adjusted by human experts at significant cost.

Visual OXPath is an open-source, visual wrapper induction system that requires minimal examples and eases wrapper refinement: Often it derives the intended wrapper from a single example through sophisticated heuristics that determine the best set of similar examples. To ease wrapper refinement, it offers a list of wrappers ranked by example similarity and robustness. Visual OXPath offers extensive visual feedback for this refinement which can be performed without any knowledge of the underlying wrapper language. Where further refinement by a human expert is needed, Visual OXPath profits from being based on OXPath, a declarative wrapper language that extends XPath with a thin layer of features necessary for extraction and page navigation.

[37] Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, and Christian Schallhart. OPAL: Automated Form Understanding for the Deep Web. In International World Wide Web Conference (WWW'12), pages 829-838, 2012.
[ » bib | » pdf ]
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem.

In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).

[36] Andreas Holzer, Daniel Kroening, Christian Schallhart, Michael Tautschnig, and Helmut Veith. Proving Reachability using FShell (Competition Contribution). In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'12), Lecture Notes in Computer Science (LNCS), pages 538-541, 2012.
[ » bib | » pdf ]
FShell is an automated white-box test-input generator for C programs, computing test data with respect to user-specified code coverage criteria. The pillars of FShell are the declarative specification language FQL (FShell Query Language), an efficient back end for computing test data, and a mathematical framework to reason about coverage criteria. To solve the reachability problem posed in SV-COMP we specify coverage of ERROR labels. As back end, FShell uses bounded model checking, building upon components of CBMC and leveraging the power of SAT solvers for efficient enumeration of a full test suite.

[35] Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, and Cheng Wang. Little Knowledge Rules The Web: Domain-Centric Result Page Extraction. In International Conference on Web Reasoning and Rule Systems (RR'11), pages 61-76, 2011.
[ » bib | » pdf ]
Web extraction is the task of turning unstructured HTML into structured data. Previous approaches rely exclusively on detecting repeated structures in result pages. These approaches trade intensive user interaction for precision.

In this paper, we introduce the Amber (“Adaptable Model-based Extraction of Result Pages”) system that replaces the human interaction with a domain ontology applicable to all sites of a domain. It models domain knowledge about (1) records and attributes of the domain, (2) low-level (textual) representations of these concepts, and (3) constraints linking representations to records and attributes. Parametrized with these constraints, otherwise domain-independent heuristics exploit the repeated structure of result pages to derive attributes and records. Amber is implemented in logical rules to allow an explicit formulation of the heuristics and easy adaptation to different domains.

We apply Amber to the UK real estate domain where we achieve near perfect accuracy on a representative sample of 50 agency websites.

[34] Tim Furche, Georg Gottlob, Xiaonan Guo, Christian Schallhart, Andrew Sellers, and Cheng Wang. How the Minotaur turned into Ariadne: Ontologies in Web Data Extraction. In International Conference on Web Engineering (ICWE'11), pages 13-27, 2011. invited.
[ » bib | » pdf ]
Humans require automated support to profit from the wealth of data nowadays available on the web. To that end, the linked open data initiative and others have been asking data providers to publish structured, semantically annotated data. Small data providers, such as most UK real-estate agencies, however, are overburdened with this task, often just starting to move from simple, table- or list-like directories to web applications with rich interfaces.

We argue that fully automated extraction of structured data can help resolve this dilemma. Ironically, automated data extraction has seen a recent revival thanks to ontologies and linked open data to guide data extraction. First results from the DIADEM project illustrate that high quality, fully automated data extraction at a web scale is possible, if we combine domain ontologies with a phenomenology describing the representation of domain concepts. We briefly summarise the DIADEM project and discuss a few preliminary results.

[33] Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew Sellers. OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications. Proceedings of the VLDB Endowment/International Conference on Very Large Databases (PVLDB'11), 4(11):1016-1027, 2011.
[ » bib | » pdf ]
The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth.

To address this trend, we identify four key requirements of web extraction: (1) Interact with sophisticated web application interfaces, (2) Precisely capture the relevant data for most web extraction tasks, (3) Scale with the number of visited pages, and (4) Readily embed into existing web technologies.

We introduce OXPath, an extension of XPath for interacting with web applications and for extracting information thus revealed. It addresses all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We validate experimentally the theoretical complexity and demonstrate that its evaluation is dominated by the page rendering of the underlying browser.

Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. OXPath is available under an open source license.

[32] Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. Exploring the Web with OXPath. In International Workshop on Linked Web Data Management (LWDM'11), 2011.
[ » bib | » pdf ]
OXPath is a careful extension of XPath that facilitates data extraction from the deep web. It is designed to facilitate the large-scale extraction of data from sophisticated modern web interfaces with client-side scripting and asynchronous server communication. Its main characteristics are (1) a minimal extension of XPath to allow page navigation and action execution, (2) a set-theoretic formal semantics for full OXPath, and (3) a sophisticated memory management that minimizes page buffering. In this poster, we briefly review the main features of the language and discuss ongoing and future work.

[31] Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, and Christian Schallhart. Real Understanding for Real Estate Forms. In International Conference on Web Intelligence, Mining and Semantics (WIMS'11), 2011.
[ » bib | » pdf ]
Finding an apartment is a lengthy and tedious process. Once decided, one can never be sure not to have missed an even better offer which would have been just one click away. Form understanding is key to automatically access and process all the relevant—and nowadays readily available—data.

We introduce opal (ontology-based web pattern analysis with logic), a novel, purely logical approach to web form understanding: opal labels, structures, and groups form fields according to a domain-specific ontology linked through phenomenological rules to a logical representation of a DOM. The phenomenological rules describe how ontological concepts appear on the web; the ontology formalizes and structures common patterns of web pages observed in a domain. A unique feature of opal is that all domain-independent assumptions about web forms are represented in rules, whereas domain-specific assumptions are represented in the ontology. This yields a coherent logical framework, robust in the face of changing web trends.

We apply opal to a significant, randomly selected sample of UK real estate sites, showing that straightforward rules suffice to achieve high precision form understanding.

[30] Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. OXPath: Little Language, Little Memory, Great Value. In International World Wide Web Conference (WWW'11), pages 261-264, 2011.
[ » bib | » pdf ]
Data about everything is readily available on the web—but often only accessible through elaborate user interactions. For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OXPath, a novel approach to web extraction, with a system that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a light-weight interface, which is easy to use and embed. We highlight how OXPath guarantees optimal page buffering, storing only a constant number of pages for non-recursive queries.

[29] Andreas Holzer, Visar Januzaj, Stefan Kugele, Boris Langer, Christian Schallhart, Michael Tautschnig, and Helmut Veith. Seamless Testing for Models and Code. In Fundamental Approaches to Software Engineering (FASE'11), Lecture Notes in Computer Science (LNCS), pages 278-293, 2011.
[ » bib | » pdf ]
This paper describes an approach to model-based testing where a test suite is generated from a model and automatically concretized to drive an implementation. Motivated by an industrial project involving DO-178B compliant avionics software, where the models are UML activity diagrams and the implementation is ANSI C, we developed a seamless testing environment based on our test specification language FQL. We demonstrate how to apply FQL to activity diagrams in such a way that FQL test specifications easily translate from UML to C code. Our approach does not require any additional glue or auxiliary code but is fully automatic except for straightforward source code annotations that link source and model. In this way, we can check for modeled but unimplemented behavior and vice versa, and we can also evaluate the degree of abstraction between model and implementation.

[28] Andrew Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. Taking the OXPath down the Deep Web. In International Conference on Extending Database Technology (EDBT'11), pages 542-545, 2011.
[ » bib | » pdf ]
Although deep web analysis has been studied extensively, there is no succinct formalism to describe user interactions with AJAX-enabled web applications.

Toward this end, we introduce OXPath as a superset of XPath 1.0. Beyond XPath, OXPath is able (1) to fill web forms and trigger DOM events, (2) to access dynamically computed CSS attributes, (3) to navigate between visible form fields, and (4) to mark relevant information for extraction. This way, OXPath expressions can closely simulate the human interaction relevant for navigation rather than rely exclusively on the HTML structure. Thus, they are quite resilient against technical changes.

We demonstrate the expressiveness and practical efficacy of OXPath to tackle a group flight planning problem. We use the OXPath implementation and visual interface to access the popular, highly-scripted travel site Kayak. We show, both visually and manually, how to formulate OXPath expressions to extract all booking information with just a few lines of code.

[27] Andreas Bauer, Martin Leucker, and Christian Schallhart. Runtime Verification for LTL and TLTL. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(4), 2011.
[ » bib | » pdf ]
This paper studies runtime verification of properties expressed either in linear-time temporal logic (LTL) or timed linear-time temporal logic (TLTL). It classifies runtime verification by identifying its distinguishing features with respect to model checking and testing, respectively. It introduces a three-valued semantics (with truth values true, false, inconclusive) as an adequate interpretation as to whether a partial observation of a running system meets an LTL or TLTL property.

For LTL, a conceptually simple monitor generation procedure is given, which is optimal in two respects: First, the size of the generated deterministic monitor is minimal, and, second, the monitor identifies a continuously monitored trace as either satisfying or falsifying a property as early as possible. The feasibility of the developed methodology is demonstrated using a collection of real-world temporal logic specifications. Moreover, the presented approach is related to the properties monitorable in general and is compared to existing concepts in the literature. It is shown that the set of monitorable properties does not only encompass the safety and co-safety properties but is strictly larger.

For TLTL, the same road map is followed by first defining a three-valued semantics. The corresponding construction of a timed monitor is more involved, yet, as shown, possible.
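As a toy illustration of the three-valued verdicts for LTL (not the paper's general monitor construction), the following Python monitor reads a trace incrementally and reports true, false, or inconclusive for the simple property "no error before done":

```python
# Toy monitor in the spirit of the three-valued semantics above: a Moore
# machine that reads a trace incrementally and emits a verdict after each
# event. The property is "no 'error' before 'done'"; this is an illustration,
# not the paper's general monitor construction.
def monitor(trace):
    verdict = "inconclusive"
    for event in trace:
        if verdict == "inconclusive":
            if event == "error":
                verdict = "false"     # a violation can never be undone
            elif event == "done":
                verdict = "true"      # property satisfied, whatever follows
        yield verdict

print(list(monitor(["init", "work", "done", "error"])))
# ['inconclusive', 'inconclusive', 'true', 'true']
print(list(monitor(["init", "error"])))
# ['inconclusive', 'false']
```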

[26] Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith. An Introduction to Test Specification in FQL. In Haifa Verification Conference (HVC'10), volume 6504 of Lecture Notes in Computer Science (LNCS), pages 9-22, 2010. invited.
[ » bib | » pdf ]
In a recent series of papers, we introduced a new framework for white-box testing which aims at a separation of concerns between test specifications and test generation engines. We believe that establishing a common language for test criteria will have similar benefits to testing as temporal logic had to model checking and SQL had to databases. The main challenge was to find a specification language which is expressive, simple, and precise. This paper gives an introduction to the test specification language FQL and its tool environment.

[25] Visar Januzaj, Stefan Kugele, Boris Langer, Christian Schallhart, and Helmut Veith. New Challenges in the Development of Critical Embedded Systems-An “aeromotive” Perspective. In International Symposium on Leveraging Applications (ISoLA'10), volume 6415 of Lecture Notes in Computer Science (LNCS), pages 1-2, 2010.
[ » bib ]
During the last decades, embedded systems have become increasingly important in highly safety-critical areas such as power plants, medical equipment, cars, and aeroplanes. The automotive and avionics domains are prominent examples of classical engineering disciplines where conflicts between costs, short product cycles and legal requirements concerning dependability, robustness, security, carbon footprint and spatial demands have become a pressing problem.

[24] Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith. How did you specify your test suite? In Automated Software Engineering (ASE'10), pages 407-416, 2010.
[ » bib | » pdf ]
Although testing is central to debugging and software certification, there is no adequate language to specify test suites over source code. Such a language should be simple and concise in daily use, feature a precise semantics, and of course, it has to facilitate suitable engines to compute test suites and assess the coverage achieved by a test suite.

This paper introduces the language FQL designed to fit these purposes. We achieve the necessary expressive power by a natural extension of regular expressions which matches test suites rather than individual executions. To evaluate the language, we show for a list of informal requirements how to express them in FQL. Moreover, we present a test case generation engine for C programs and perform practical experiments with the sample specifications.

[23] Somesh Jha, Stefan Katzenbeisser, Christian Schallhart, Helmut Veith, and Stephen Chenney. Semantic Integrity in Large-Scale Online Simulations. ACM Transactions on Internet Technology (TOIT), 10(1), 2010.
[ » bib | » pdf ]
As large-scale online simulations such as Second Life and World of Warcraft are gaining economic significance, there is a growing incentive for attacks against such simulation software. We focus on attacks against the semantic integrity of the simulation. This class of attacks exploits the client-server architecture and is specific to online simulations which, for performance reasons, have to delegate the detailed rendering of the simulated world to the clients. Attacks against semantic integrity often compromise the physical laws of the simulated world—enabling the user’s simulation persona to fly, walk through walls, or to run faster than anybody else.

We introduce the Secure Semantic Integrity Protocol (SSIP) which enables the simulation provider to audit the client computations. Then we analyze the security and scalability of SSIP. First, we show that under standard cryptographic assumptions SSIP will detect semantic integrity attacks. Second, we analyze the network overhead, and determine the optimum tradeoff between cost of bandwidth and audit frequency for our protocol.

[22] Andreas Bauer, Martin Leucker, and Christian Schallhart. Comparing LTL Semantics for Runtime Verification. Journal of Logic and Computation (JLC), 20(3):651-674, 2010.
[ » bib | » pdf ]
When monitoring a system w.r.t. a property defined in a temporal logic such as LTL, a major concern is to settle with an adequate interpretation of observable system events; that is, models of temporal logic formulae are usually infinite words of events, whereas at runtime only finite but incrementally expanding prefixes are available.

In this work, we review LTL-derived logics for finite traces from a runtime-verification perspective. In doing so, we establish four maxims to be satisfied by any LTL-derived logic aimed at runtime verification. As no pre-existing logic readily satisfies all of them, we introduce a new four-valued logic, Runtime Verification Linear Temporal Logic (RV-LTL), in accordance with these maxims. The semantics of RV-LTL indicates whether a finite word describes a system behaviour which either (i) satisfies the monitored property, (ii) violates the property, (iii) will presumably violate the property, or (iv) will presumably conform to the property in the future, once the system has stabilized. Notably, (i) and (ii) correspond to the classical semantics of LTL, whereas (iii) and (iv) are chosen whenever an observed system behaviour has not yet led to a violation or acceptance of the monitored property.

Moreover, we present a monitor construction for RV-LTL properties in terms of Moore machines signalizing the semantics of the so far obtained execution trace w.r.t. the monitored property.

[21] Andreas Bauer, Martin Leucker, Christian Schallhart, and Michael Tautschnig. Don't care in SMT: Building flexible yet efficient abstraction/refinement solvers. Software Tools for Technology Transfer (STTT), 12(1):23-37, 2010.
[ » bib | » pdf ]
This paper describes a method for combining “off-the-shelf” SAT and constraint solvers for building an efficient Satisfiability Modulo Theories (SMT) solver for a wide range of theories. Our method follows the abstraction/refinement approach to simplify the implementation of custom SMT solvers. The expected performance penalty by not using an interweaved combination of SAT and theory solvers is reduced by generalising a Boolean solution of an SMT problem first via assigning don’t care to as many variables as possible. We then use the generalised solution to determine a thereby smaller constraint set to be handed over to the constraint solver for a background theory. We show that for many benchmarks and real-world problems, this optimisation results in considerably smaller and less complex constraint problems. The presented approach is particularly useful for assembling a practically viable SMT solver quickly, when neither a suitable SMT solver nor a corresponding incremental theory solver is available. We have implemented our approach in the ABsolver framework and applied the resulting solver successfully to an industrial case-study: the verification problems arising in verifying an electronic car steering control system impose non-linear arithmetic constraints, which do not fall into the domain of any other available solver.
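The key generalisation step can be sketched in a few lines of Python: given the Boolean abstraction and one satisfying assignment, drop every variable whose value is not needed to keep the abstraction satisfied, so that only the remaining literals' theory atoms reach the constraint solver. The formula below is invented and the procedure is a toy, not ABsolver's implementation:

```python
# Toy sketch of don't-care generalisation; formula and assignment are invented.
def satisfies(clauses, assignment):
    """Each clause needs one literal made true by the (partial) assignment."""
    return all(any(assignment.get(abs(l)) == (l > 0) for l in clause)
               for clause in clauses)

def generalise(clauses, assignment):
    """Greedily drop variables whose value is not needed for satisfaction."""
    core = dict(assignment)
    for v in list(core):
        rest = {u: val for u, val in core.items() if u != v}
        if satisfies(clauses, rest):      # v is a don't care for this solution
            del core[v]
    return core

# Boolean skeleton (x1 or x2) and (x3 or not x1); each xi abstracts a theory atom.
clauses = [[1, 2], [3, -1]]
solution = {1: True, 2: True, 3: True}
print(generalise(clauses, solution))
# {2: True, 3: True} -- x1 became don't care, so only the atoms behind x2 and
# x3 need to be handed to the theory solver.
```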

[20] Johannes Kinder, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith. Proactive Detection of Computer Worms Using Model Checking. IEEE Transactions on Dependable and Secure Computing (TDSC), 7(4):424-438, 2010.
[ » bib | » pdf ]
Although recent estimates speak of 200,000 different viruses, worms, and Trojan horses, the majority of them are variants of previously existing malware. As these variants mostly differ in their binary representation rather than their functionality, they can be recognized by analyzing the program behavior, even though they are not covered by the signature databases of current antivirus tools. Proactive malware detectors mitigate this risk by detection procedures which use a single signature to detect whole classes of functionally related malware without signature updates. It is evident that the quality of proactive detection procedures depends on their ability to analyze the semantics of the binary.

In this paper, we propose the use of model checking—a well established software verification technique—for proactive malware detection. We describe a tool which extracts an annotated control flow graph from the binary and automatically verifies it against a formal malware specification. To this end, we introduce the new specification language CTPL, which balances the high expressive power needed for malware signatures with efficient model checking algorithms. Our experiments demonstrate that our technique indeed is able to recognize variants of existing malware with a low risk of false positives.
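
To give a flavour of the kind of check involved (and only that; this is not CTPL or the paper's model checker), the toy Python snippet below encodes a hand-made annotated control flow graph and tests a simplified worm-like pattern, "some register is pushed on a path from a GetModuleFileNameA call to a CopyFileA call"; the instructions, node ids, and register set are invented for the example.

    from collections import deque

    cfg = {  # node -> (instruction, successor nodes); a hand-made toy CFG
        0: (("call", "GetModuleFileNameA"), [1]),
        1: (("mov", "ebx", "eax"), [2]),
        2: (("push", "ebx"), [3]),
        3: (("call", "CopyFileA"), []),
    }

    def reachable(start, pred):
        """Is some node satisfying pred reachable from start (start included)?"""
        seen, todo = set(), deque([start])
        while todo:
            n = todo.popleft()
            if n in seen:
                continue
            seen.add(n)
            instr, succs = cfg[n]
            if pred(instr):
                return True
            todo.extend(succs)
        return False

    def matches_pattern():
        # CTPL's key addition to CTL is quantification over predicate arguments,
        # e.g. over register names; here we simply try a small fixed register set.
        for node, (instr, _) in cfg.items():
            if instr == ("call", "GetModuleFileNameA"):
                for reg in ("eax", "ebx", "ecx"):
                    if reachable(node, lambda i, r=reg: i == ("push", r)) and \
                       reachable(node, lambda i: i == ("call", "CopyFileA")):
                        return True
        return False

    print(matches_pattern())   # True on this toy graph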

[19] Martin Leucker and Christian Schallhart. A Brief Account of Runtime Verification. Journal of Logic and Algebraic Programming (JLAP), 78(5):293-303, 2009.
[ » bib | » pdf ]
In this paper, a brief account of the field of runtime verification is given. Starting with a definition of runtime verification, a comparison to well-known verification techniques like model checking and testing is provided, and applications in which runtime verification brings out its distinguishing features are pointed out. Moreover, extensions of runtime verification such as monitor-oriented programming, and monitor-based runtime reflection are sketched and their similarities and differences are discussed. Finally, the use of runtime verification for contract enforcement is briefly pointed out.

[18] Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith. Query-Driven Program Testing. In Verification, Model Checking, and Abstract Interpretation (VMCAI'09), volume 5403 of Lecture Notes in Computer Science (LNCS), pages 151-166, 2009.
[ » bib | » pdf ]
We present a new approach to program testing which enables the programmer to specify test suites in terms of a versatile query language. Our query language subsumes standard coverage criteria ranging from simple basic block coverage all the way to predicate complete coverage and multiple condition coverage, but also facilitates on-the-fly requests for test suites specific to the code structure, to external requirements, or to ad hoc needs arising in program understanding/exploration. The query language is supported by a model checking backend which employs the CBMC framework. Our main algorithmic contribution is a method called iterative constraint strengthening which enables us to solve a query for an arbitrary coverage criterion by a single call to the model checker and a novel form of incremental SAT solving: Whenever the SAT solver finds a solution, our algorithm compares this solution against the coverage criterion, and strengthens the clause database with additional clauses which exclude redundant new solutions. We demonstrate the scalability of our approach and its ability to compute compact test suites with experiments involving device drivers, automotive controllers, and open source projects.
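
A minimal sketch of the iterative-strengthening loop (a toy stand-in only: the paper's approach drives CBMC and an incremental SAT solver, whereas here a brute-force enumerator over a handful of Boolean variables plays the solver's role, and each coverage goal is simply a variable that must be true in at least one test):

    from itertools import product

    def solve(clauses, n_vars):
        """Return some satisfying assignment of the CNF over vars 1..n_vars, or None."""
        for bits in product([False, True], repeat=n_vars):
            model = {v + 1: bits[v] for v in range(n_vars)}
            if all(any((lit > 0) == model[abs(lit)] for lit in c) for c in clauses):
                return model
        return None

    def cover(program_cnf, goals, n_vars):
        """Collect tests until every goal is covered, strengthening the clause set."""
        clauses, uncovered, tests = list(program_cnf), set(goals), []
        while uncovered:
            # Strengthening clause: the next solution must hit an uncovered goal,
            # so solutions redundant w.r.t. the coverage criterion are excluded.
            model = solve(clauses + [sorted(uncovered)], n_vars)
            if model is None:
                break                                   # remaining goals are infeasible
            tests.append(model)
            uncovered -= {g for g in uncovered if model[g]}
        return tests

    # Toy "program": x1 is an input, x3/x4 encode which branch was taken.
    cnf = [[-1, 3], [1, 4]]                             # x1 -> branch A, not x1 -> branch B
    print(len(cover(cnf, goals=[3, 4], n_vars=4)))      # 2 tests for this toy example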

[17] Wei Dong, Martin Leucker, and Christian Schallhart. Impartial Anticipation in Runtime-Verification. In Automated Technology for Verification and Analysis (ATVA'08), volume 5311 of Lecture Notes in Computer Science (LNCS), pages 386-396, 2008.
[ » bib | » pdf ]
In this paper, a uniform approach for synthesizing monitors checking correctness properties specified in linear-time logics at runtime is provided. To this end, a generic three-valued semantics is introduced, reflecting the idea that prefixes of infinite computations are checked. Then a conceptual framework to synthesize monitors from a logical specification to check an execution incrementally is established, with special focus on resorting to the automata-theoretic approach. The merits of the presented framework are shown by providing monitor synthesis approaches for a variety of different logics such as LTL, the linear-time μ-calculus, PLTLmod, S1S, and RLTL.

[16] Andreas Holzer, Michael Tautschnig, Christian Schallhart, and Helmut Veith. FSHELL: Systematic Test Case Generation for Dynamic Analysis and Measurement. In Computer Aided Verification (CAV'08), volume 5123 of Lecture Notes in Computer Science (LNCS), pages 209-213, 2008.
[ » bib | » pdf ]
Although the principal analogy between counterexample generation and white box testing has been repeatedly addressed, the usage patterns and performance requirements for software testing are quite different from formal verification. Our tool FSHELL provides a versatile testing environment for C programs which supports both interactive explorative use and a rich scripting language. More than a frontend for software model checkers, FSHELL is designed as a database engine which dispatches queries about the program to program analysis tools. We report on the integration of CBMC into FSHELL and describe architectural modifications which support efficient test case generation.

[15] Andreas Bauer, Martin Leucker, Christian Schallhart, and Michael Tautschnig. Don't care in SMT: Building flexible yet efficient abstraction/refinement solvers. In Workshop On Leveraging Applications of Formal Methods, Verification and Validation (ISoLA'07), pages 135-146, 2007.
[ » bib | » pdf ]
This paper describes a method for combining “off-the-shelf” SAT and constraint solvers for building an efficient Satisfiability Modulo Theories (SMT) solver for a wide range of theories. Our method follows the abstraction/refinement approach to simplify the implementation of custom SMT solvers. The performance penalty expected from not using an interleaved combination of SAT and theory solvers is reduced by first generalising a Boolean solution of an SMT problem via assigning don't care to as many variables as possible. We then use the generalised solution to determine a smaller constraint set to be handed over to the constraint solver for the background theory. We show that for many benchmarks and real-world problems, this optimisation results in considerably smaller and less complex constraint problems.

The presented approach is particularly useful for assembling a practically viable SMT solver quickly, when neither a suitable SMT solver nor a corresponding incremental theory solver is available. We have implemented our approach in the ABSOLVER framework and applied the resulting solver successfully to an industrial case-study: The verification problems arising in verifying an electronic car steering control system impose non-linear arithmetic constraints, which do not fall into the domain of any other available solver.

[14] Andreas Bauer, Martin Leucker, and Christian Schallhart. The good, the bad, and the ugly, but how ugly is ugly? In Workshop on Runtime Verification (RV'07), volume 4839 of Lecture Notes in Computer Science (LNCS), pages 126-138, 2007.
[ » bib | » pdf ]
When monitoring a system w.r.t. a property defined in some temporal logic, e.g., LTL, a major concern is to settle on an adequate interpretation of observable system events; that is, models of temporal logic formulae are usually infinite streams of events, whereas at runtime only prefixes are available. This work defines a four-valued semantics for LTL over finite traces, which extends the classical semantics and allows one to infer whether a system (1) behaves according to the monitored property, (2) violates the property, (3) will possibly violate the property in the future, or (4) will possibly conform to the property in the future, once the system has stabilised. Notably, (1) and (2) correspond to the classical semantics of LTL, whereas (3) and (4) are chosen whenever an observed system behaviour has not yet led to a violation or acceptance of the monitored property.

Moreover, we present a monitor construction for RV-LTL properties in terms of a Moore machine signalling the semantics of the execution trace observed so far.

[13] Somesh Jha, Stefan Katzenbeisser, Christian Schallhart, Helmut Veith, and Stephen Chenney. Enforcing Semantic Integrity on Untrusted Clients in Networked Virtual Environments (Extended Abstract). In IEEE Security and Privacy (SP'07), pages 179-186, 2007.
[ » bib | » pdf ]
In the computer gaming industry, large-scale simulations of realistic physical environments over the Internet have attained increasing importance. Networked virtual environments (NVEs) are typically based on a client-server architecture where part of the simulation workload is delegated to the clients. This architecture renders the simulation vulnerable to attacks against the semantic integrity of the simulation: malicious clients may attempt to compromise the physical and logical rules governing the simulation, or to alter the causality of events. This paper initiates the systematic study of semantic integrity in NVEs from a security point of view. We present a new provably secure semantic integrity protocol which enables the server system to audit the local computations of the clients on demand.

[12] Sagar Chaki, Christian Schallhart, and Helmut Veith. Verification Across Intellectual Property Boundaries. In Computer Aided Verification (CAV'07), volume 4590 of Lecture Notes in Computer Science (LNCS), pages 82-94, 2007.
[ » bib | » pdf ]
In many industries, the share of software components provided by third-party suppliers is steadily increasing. As the suppliers seek to secure their intellectual property (IP) rights, the customer usually has no direct access to the suppliers’ source code, and is able to enforce the use of verification tools only by legal requirements. In turn, the supplier has no means to convince the customer about successful verification without revealing the source code. This paper presents a new approach to resolve the conflict between the IP interests of the supplier and the quality interests of the customer. We introduce a protocol in which a dedicated server (called the “amanat”) is controlled by both parties: the customer controls the verification task performed by the amanat, while the supplier controls the communication channels of the amanat to ensure that the amanat does not leak information about the source code. We argue that the protocol is both practically useful and mathematically sound. As the protocol is based on well-known (and relatively lightweight) cryptographic primitives, it allows a straightforward implementation on top of existing verification tool chains. To substantiate our security claims, we establish the correctness of the protocol by cryptographic reduction proofs.

[11] Andreas Bauer, Martin Leucker, and Christian Schallhart. Monitoring of Realtime Properties. In Foundations of Software Technology and Theoretical Computer Science (FSTTCS'06), volume 4337 of Lecture Notes in Computer Science (LNCS), pages 260-272, 2006.
[ » bib | » pdf ]
This paper presents a construction for runtime monitors that check real-time properties expressed in timed LTL (TLTL). Due to D’Souza’s results, TLTL can be considered a natural extension of LTL towards real-time. Moreover, a typical obstacle in runtime verification is solved for both untimed and timed formulae: standard models of linear temporal logic are infinite traces, whereas in runtime verification only finite system behaviours are at hand. Therefore, a 3-valued semantics (true, false, inconclusive) for LTL and TLTL on finite traces is defined that resembles the infinite trace semantics in a suitable and intuitive manner. The paper then describes how to construct, given a (T)LTL formula, a deterministic monitor with three output symbols that reads a finite trace and yields the corresponding 3-valued (T)LTL semantics. Notably, the monitor rejects a trace as early as possible, in that any minimal bad prefix results in false as a return value.

[10] Andreas Bauer, Martin Leucker, and Christian Schallhart. Runtime Reflection: Dynamic model-based analysis of component-based distributed embedded systems. In Modellierung von Automotive Systems, 2006.
[ » bib | » pdf ]
Distributed embedded systems have pervaded the automotive domain, but often still lack measures to ensure adequate behaviour in the presence of unforeseen events, or even errors, at runtime. As interactions and dependencies within distributed automotive systems increase, detecting failures which depend on the exact situation and environment conditions in which they occur becomes harder. As a result, not only is the detection of failures increasingly difficult, but so is the differentiation between the symptoms of a fault and the actual fault itself, i.e., the cause of a problem.

In this paper, we present a novel and efficient approach, built around a notion of software component similar to AUTOSAR, for dynamically analysing distributed embedded systems in the testing phase or even in standard operation: we provide a framework for detecting failures as well as identifying their causes. Our approach is based upon monitoring safety properties, specified in a language that can express dynamic system properties. From such specifications, so-called monitor components are generated automatically to detect violations by software components. Based on the results of the monitors, a dedicated diagnosis is then performed in order to identify explanations for the misbehaviour of the system. These may be used to store detailed error logs, or to trigger recovery measures.
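
The split between monitoring and diagnosis can be illustrated with a small consistency-style toy in Python (the component names, dependency relation, and failure-explanation rule are invented for the example and are simplified with respect to the paper's diagnosis framework):

    from itertools import combinations

    deps = {"brake_ctrl": {"sensor", "bus"}, "display": {"bus"},
            "sensor": set(), "bus": set()}              # component -> direct dependencies
    failed = {"brake_ctrl", "display"}                  # monitors that reported a violation
    ok = {"sensor"}                                     # monitors that reported correct output

    def closure(c):
        """c together with everything it (transitively) depends on."""
        out, todo = set(), [c]
        while todo:
            x = todo.pop()
            if x not in out:
                out.add(x)
                todo.extend(deps[x])
        return out

    def diagnoses():
        """Smallest sets of components whose failure explains all violations
        without blaming any component whose monitor reported correct output."""
        comps = sorted(deps)
        for size in range(1, len(comps) + 1):
            found = [set(cand) for cand in combinations(comps, size)
                     if all(set(cand) & closure(f) for f in failed)
                     and not set(cand) & ok]
            if found:
                return found
        return []

    print(diagnoses())   # [{'bus'}]: a bus fault explains both failed monitors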

[9] Andreas Bauer, Martin Leucker, and Christian Schallhart. Model-Based Runtime Analysis of Distributed Reactive Systems. In Australian Software Engineering Conference (ASWEC'06), pages 243-252, 2006.
[ » bib | » pdf ]
Reactive distributed systems have pervaded everyday life and objects, but often lack measures to ensure adequate behaviour in the presence of unforeseen events or even errors at runtime. As interactions and dependencies within distributed systems increase, detecting failures which depend on the exact situation and environment conditions in which they occur becomes harder. As a result, not only is the detection of failures increasingly difficult, but so is the differentiation between the symptoms of a fault and the actual fault itself, i.e., the cause of a problem.

In this paper, we present a novel and efficient approach for analysing reactive distributed systems at runtime: we provide a framework for detecting failures as well as identifying their causes. Our approach is based upon monitoring safety properties, specified in the linear-time temporal logic LTL (respectively, TLTL), from which monitor components are generated automatically to detect violations of these properties. Based on the results of the monitors, a dedicated diagnosis is then performed in order to identify explanations for the misbehaviour of the system. These may be used to store detailed log files, or to trigger recovery measures. Our framework is modular and layered, and incurs only minimal communication overhead, especially when compared to other, similar approaches. Further, we sketch first experimental results from our implementation, and describe how a variety of distributed systems can be built using our techniques.

[8] Christian Schallhart and Luca Trevisan. Approximating Succinct MaxSat. Journal of Logic and Computation (JLC), 15(4):551-557, 2005.
[ » bib | » pdf ]
We study the approximability of the version of MaxSat where exponentially large instances are succinctly represented using circuits. First, we prove that the NP-hardness for approximating MaxSat can be lifted to a corresponding NEXP-hardness for approximating circuit-succinct MaxSat for some constant performance ratio. Second, we consider the approximability of circuit-succinct MaxSat with respect to lower complexity classes: In particular, we prove that computing (2-ε)-approximate solutions for circuit-succinct MaxSat is at least as hard as inverting one-way permutations. On the other hand, a simple randomized approximation algorithm computes a (2+ε)-approximate solution with high probability. Recall that the standard (not succinctly represented) version of the MaxSat problem is approximable to within a .78 factor [?] and that the MaxThreeSat problem is approximable to within a 7/8 factor [?].
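
The (2+ε) upper bound rests on the classical observation that a uniformly random assignment satisfies each clause with probability at least 1/2 and hence, in expectation, at least half of all clauses. The sketch below shows that plain (non-succinct) version in Python; it illustrates the folklore bound only, not the circuit-succinct setting studied in the paper.

    import random

    def random_maxsat(clauses, n_vars, trials=200, seed=0):
        """Best number of satisfied clauses over a few uniformly random assignments.
        Clauses are lists of non-zero ints: literal v means var v true, -v false."""
        rng, best = random.Random(seed), 0
        for _ in range(trials):
            model = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
            sat = sum(any((lit > 0) == model[abs(lit)] for lit in c) for c in clauses)
            best = max(best, sat)
        return best

    cnf = [[1, 2], [-1, 3], [-2, -3], [1, -3]]
    print(random_maxsat(cnf, n_vars=3), "of", len(cnf), "clauses satisfied")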

[7] Axel Belinfante, Lars Frantzen, and Christian Schallhart. Model-Based Testing of Reactive Systems, volume 3472 of Lecture Notes in Computer Science (LNCS), chapter Tools for Test Case Generation, pages 391-438. Springer, 2005.
[ » bib | » pdf ]
The preceding parts of this book have mainly dealt with test theory, aimed at improving the practical techniques which are applied by testers to enhance the quality of software and hardware systems. These academic results are only worth the effort if they can be efficiently and successfully transferred back into practice.

In this chapter we will present a selection of model-based test tools which are (partly) based on the theory discussed so far. After a general introduction of every single tool we will hint at some papers which try to find a fair comparison of some of them.

Any selection of tools must be incomplete and might be biased by the background of the authors. We tried to select tools which represent a broad spectrum of different approaches. Also, to provide some insight into recent developments, new tools such as AsmL and AGEDIS have been added. Therefore, the tools differ a lot with respect to theoretical foundation, age, and availability. Due to commercial restrictions, only limited information was available on the theoretical basis of some of the tools. For the same reason, it was not always possible to obtain hands-on experience.

[6] Johannes Kinder, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith. Detecting Malicious Code by Model Checking. In Conference on Detection of Intrusions and Malware and Vulnerability Assessment (DIMVA'05), volume 3548 of Lecture Notes in Computer Science (LNCS), pages 174-187, 2005.
[ » bib | » pdf ]
The ease of compiling malicious code from source code in higher programming languages has increased the volatility of malicious programs: The first appearance of a new worm in the wild is usually followed by modified versions in quick succession. As demonstrated by Christodorescu and Jha, however, classical detection software relies on static patterns, and is easily outsmarted. In this paper, we present a flexible method to detect malicious code patterns in executables by model checking. While model checking was originally developed to verify the correctness of systems against specifications, we argue that it lends itself equally well to the specification of malicious code patterns. To this end, we introduce the specification language CTPL (Computation Tree Predicate Logic) which extends the well-known logic CTL, and describe an efficient model checking algorithm. Our practical experiments demonstrate that we are able to detect a large number of worm variants with a single specification.

[5] Jana Dittmann, Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith. Ensuring Media Integrity on Third-Party Infrastructures. In IFIP International Information Security Conference, pages 493-508, 2005.
[ » bib | » pdf ]
In many heterogeneous networked applications the integrity of multimedia data plays an essential role, but is not directly supported by the application. In this paper, we propose a method which enables an individual user to detect tampering with a multimedia file without changing the software application provided by the third party. Our method is based on a combination of cryptographic signatures and fragile watermarks, i.e., watermarks that are destroyed by illegitimate tampering. We show that the proposed system is provably secure under standard cryptographic assumptions.

[4] Stefan Katzenbeisser, Christian Schallhart, and Helmut Veith. Malware Engineering. In Sicherheit 2005, Sicherheit - Schutz und Zuverlässigkeit, 2. Jahrestagung, Fachbereich Sicherheit der Gesellschaft für Informatik, pages 139-148, 2005.
[ » bib | » pdf ]
Starting from simple hand-crafted viruses, today’s malware has evolved to constitute highly infectious computer diseases. The technical development of malware was mainly driven by the wish to improve and accelerate both attacks and proliferation. Although these programs have incurred significant hazards and financial losses, their mechanisms are relatively simple and are amenable to effective countermeasures, once the first attack has been launched. From a software technology point of view, malicious software is in fact often very similar to network services, with the main difference that security holes are exploited to enforce participation in the protocol.

In this position paper we outline the wide range of possible malware-specific engineering techniques which are not used in known viruses and worms, but are technically feasible and will therefore be realized in the foreseeable future—less likely by hackers than by organized illegal entities. The techniques we describe enable the malware to obfuscate its functionality, monitor and analyze its environment, and modify or extend itself in non-trivial ways. Consequently, future security policies and risk assessments have to account for these new classes of malware.

[3] Markus Holzer, Stefan Katzenbeisser, and Christian Schallhart. Towards a formal semantics for ODRL. In First ODRL International Workshop, pages 137-148, 2004.
[ » bib | » pdf ]
We give a brief overview of a new way to model the semantics of ODRL permissions in a formal manner by using finite-automata like structures. The constructed automata capture the sequence of actions that a user is allowed to perform according to a specific permission. In contrast to previous approaches, our semantics is able to model sell and lend permissions.
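
A finite-automaton-like structure of this kind can be pictured with a toy "lend" permission (the states, actions, and rules below are invented for the example, not taken from the paper's ODRL semantics): the asset may be played while held, lent out, and played again only after it is returned.

    transitions = {                      # (state, action) -> next state
        ("held", "play"):   "held",
        ("held", "lend"):   "lent",
        ("lent", "return"): "held",
    }

    def allowed(actions, state="held"):
        """Check whether a sequence of user actions is permitted by the automaton."""
        for action in actions:
            state = transitions.get((state, action))
            if state is None:
                return False
        return True

    print(allowed(["play", "lend", "return", "play"]))  # True
    print(allowed(["lend", "play"]))                    # False: cannot play while lent out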

[2] Christian Schallhart. Transaction Processing for Clustered Virtual Environments. In NATO Advanced Research Workshop on Concurrent Information Processing and Computing, pages 146-158, 2003. invited.
[ » bib | » pdf ]
This paper introduces Massively Multi-Player Online Role Games (MMORGs), which are currently a main focus of the gaming industry. MMORGs are Networked Virtual Environments (NVEs) where a player can navigate his or her character through a large game world which is populated by thousands of other players and non-player characters. Based on the problems arising in the context of MMORGs, we motivate the development of a domain-independent middleware to support MMORGs in particular and NVEs in general. Finally, we introduce APEIRON, a middleware which is based on a flexible transaction processing framework. APEIRON is designed to serve as the basis for the next generation of MMORGs.

[1] Ulrich Neumerkel, Christoph Rettig, and Christian Schallhart. Visualizing solutions with viewers. In Workshop on Logic Programming Environment (LPE'97), pages 43-50, 1997.
[ » bib ]

Copyright Notice. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder. Also, some of these works have been submitted for publication. Copyright may be transferred without further notice and this version may no longer be accessible.

Acknowledgement. This page has been generated with » bibtex2html 1.95.