Web Data Extraction
The information dispersed over the Internet is an enormous but elusive treasure. While tools for wrapping a priori known sites exist, data extraction from hitherto unknown sites remains a challenge. In the » DIADEM project [» PVLDB'14 | » WWW'12 | » DEXA'12 | » ICWE'11], we develop an approach for extracting data from arbitrary sites of a given domain, such as the real estate or rare book market. Thus, with DIADEM, it will be possible to obtain and continously maintain a database covering all offers of the respective market. Within DIADEM, I mainly work on three subprojects:
- In OXPath [» VLDB Journal'13 | » PVLDB'11 | » WWW'11 | » EDBT'11], we develop an extension to XPath which navigates within and between web-pages and extracts data thereof. It deals with modern, scripted sites in using a live DOM, provides access to visual features, allows to perform complex interaction sequences, e.g., for filling AJAX forms, and extracts the data marked relevant on the visited pages. We also provide a visual environment to assist the developing OXPath expressions [» WWW'12] and integrate OXPath into an event processing system [» WISE'13].
- In OPAL (Ontology-based web Pattern Analysis with Logic) [» VLDB Journal'14 | » WWW'12 | » WIMS'11], we utilize textual, structural, and visual information to classify form elements against a set of domain parameterized patterns, described in a suitable template language. With this approach, OPAL reaches an accuracy of 98% on several hundreds pages from the real estate and used car domain.
- AMBER [» RR'11] annotates attributes fundamental to the considered domain and identifies records in leveraging repeated attribute patterns. This contrasts to previous approaches that analyze only repeated structures in HTML code or its rendering, as no domain knowledge is available. AMBER also reaches an accuracy above 98% on hundreds of sites from different domains. Moreover, we demonstrated how AMBER can bootstrap parts of the necessary domain knowledge [» WWW'12].
Test Case Generation
Although central to software engineering, there appears to be no adequate language to specify test suites concisely, precisely, and workable for corresponding tools. For example, most certification standards for dependable software require test suites satisfying some structural coverage criteria, such as statement or decision coverage. But in most cases, the precise meaning of the required criteria remains ambiguous.
As a possible solution to this problem, the FShell Query Language (FQL) extends regular expressions to match test suites rather than individual executions [» ASE'10]. Its associated test input generator » FShell [» VMCAI'09 | » CAV'08] automatically generates test suites for ANSI C programs which satisfy FQL coverage specifications. Since FQL supports not only common coverage criteria but also specifically tailored variants thereof, FQL allows to use coverage criteria in tasks which are traditionally driven by ad-hoc testing, such as program exploration. We address some of the thereoretical issues arising in the design of the FQL language in [» FSTTCS'13].
Runtime Verification
In runtime verification (see [» JLAP'09] for an overview), properties of the system under scrutiny are continuously monitored during runtime---either as part of a testing procedure or as foundation for introspective systems which monitor their own behaviour, for example to achieve some degree of fault-tolerance. To this end, we need suitable specification languages to express the properties to be monitored and efficient methods to turn such specifications into runnable monitoring procedures.
In our approach, we developed for Linear Time Logic (LTL) both, a semantics for runtime verification and a corresponding monitoring approach. This semantics is impartial, i.e., it does not yield premature verdicts, and anticipatory, i.e., it provides a definitive verdict as early as possible. While monitor generation for this semantics is intractable in general, it proved to be efficient in practical cases. We lifted this approach to Timed LTL (TLTL) for real-time properties [» TOSEM'11] and generalized the approach towards other LTL variants [» ATVA'08]. We also compared our and other LTL semantics dealing with finite systems traces in [» JLC'10] for their usefulness in runtime verification.