Paper Abstracts

Big Dikes & Big Data

Ioannis Giotis, Frens-Jan Rumph, Gert-Jan van Dijk

The topic of this presentation is a real-world business case of sensor data analysis in the field of water management, where time constraints and high accuracy levels are essential. We focus on the description of a complete framework for the scalable storage and the efficient analysis of data. The proposed framework allows us to store and process data in linear time or faster and at the same time facilitates a representation that allows for rich query-by-example search tasks.

Research and development of cognitive visualization technology for large volumes of experimental data

Vladimir Vitkovskiy, Vladimir Gorohov, Sergey Komarinskiy, Alexsey Velichko, Olga Zhelenkova

Modern fundamental science produces ever larger volumes of experimental data. Astrophysical observation archives alone contain petabytes of data, and data analysis is becoming a huge problem for investigators. This challenge can be addressed with the help of modern information technologies and cognitive computer graphics. The SW project is a concept of pictorial, descriptive visualization of data contained in multidimensional catalogs and databases. The project aims to resolve fundamental problems in applying the methodology of experimental sciences, developed in phenomenology, to data reduction processes. The new methodology must ensure the successful application of software for the visualization of multidimensional data and of visual programming systems. New procedures and means of working with cognitive graphic figures give impetus to the development of fundamentally new algorithmic software for the visualization of experimental data. On the basis of the cognitive graphics concept, we developed the SW system for visualization and analysis. It allows researchers to train their intuition in order to increase their creative and scientific cognition. The Space Hedgehog is a cognitive visualization system that allows direct dynamic visualization of 6D objects in multidimensional data. The development of cognitive machine drawing is important and can also be applied to a variety of archives and data banks. The Space Hedgehog system is capable of representing the full content of terabytes of multidimensional datasets. Furthermore, the above-mentioned techniques of cognitive drawing can be used very effectively in network technologies.

Flooding Landscape Maps (WOLK)-project

Mark Verlaat

In collaboration with the engineering firm Tauw and Hanze University Groningen, the Geo Services made a 3D version of flooding landscape maps. This map uses a lot of geographical Big Data, such as a Digital Elevation Model (DEM, 8 height points per m2), the Key Registration Topography (scale 1:10,000), the Key Registration Addresses and Buildings, and the Waterways Registration. This particular scenario is located in Groningen, where, with the help of a script, a rainfall of 60 mm is simulated. With a flooding landscape map you can see at a glance where the water goes and where it causes nuisance in case of extremely heavy rainfall. A municipality or county can thus determine which measures should be taken to prevent, for example, disruption to traffic or flooding of buildings.

Efficient visualization solutions to turn Big Data into insight

Parisa Noorishad, Hugo Buddelmeijer, David Williams, Edwin Valentijn, Milena Ivanova, Jos Roerdink

We propose an efficient solution for turning big data into insight using a data-centric information system such as Astro-WISE. Big data processing is handled very well by the request-driven approach in Astro-WISE. We extend this approach to the visual analysis domain and make the data model more intelligent, such that all data handling is automated and scalable. In the visualization domain, scientists have both their domain-specific visualization tools and generic visual analytics tools, which interoperate to help them make the best decisions. The benefits of our method are highlighted by searching for distant quasars in the 1500 square degree optical KiDS survey, which is one of the use cases of our project.

Applied Query Driven Visualization

Hugo Buddelmeijer

Survey repositories need to provide more than periodic data releases: science often requires data that is not captured in such releases. This mismatch between the data releases and the needs of scientists is solved in the information system Astro-WISE by extending its request-driven data handling into the analysis domain. This leads to Query Driven Visualization, where all data handling is automated and scalable because data is pulled by the visualization. Astro-WISE is data-centric: new data creates itself automatically if no suitable existing data can be found to fulfil a request. This allows scientists to visualize exactly the data they need, without any manual data management, freeing their time for research.
The benefits of query driven visualization are highlighted by searching for distant quasars in the 1500 square degree optical KiDS survey. Minimizing the time between observation and (spectral) follow-up requires treating KiDS as a living survey, because the window of opportunity would be missed by waiting for data releases. The results from the default processing pipelines are used for a quick and broad selection of quasar candidates, and more precise measurements of source properties are subsequently requested to downsize the candidate set, requiring partial reprocessing of the images. Finally, the raw and reduced pixels themselves are inspected by eye to rank the final candidate list. The quality of the resulting candidate list and the speed with which it was produced were only achievable due to query driven visualization of the living archive.
This work is in part funded by the research programme of the Netherlands eScience Center (www.nlesc.nl).

Visual exploration and selection in high-dimensional point cloud datasets

Ingyun Yu, Bilkis J. Ferdosi, Hugo Buddelmeijer, Scott Trager, Michael H.F. Wilkinson, Konstantinos Efstathiou, Petra Isenberg, Tobias Isenberg, and Jos B.T.M. Roerdink

Data selection is a fundamental task in visualization because it serves as a prerequisite to many follow-up interactions. Efficient spatial selection in 3D point cloud datasets consisting of thousands or millions of particles can be particularly challenging. We present a structure-aware selection technique, CloudLasso, that supports the selection of subsets in large 3D particle datasets in an interactive and visually intuitive manner. Furthermore, we integrated our selection technique into a visual analytics system for subspace exploration of high-dimensional datasets on touch-sensitive displays. Such datasets commonly appear in astronomical applications, which were also our original motivation. Our system has a number of analytical components for ranking subspaces in high-dimensional data in terms of their relevance for clustering, and for dimension reordering in parallel coordinate plots. Particle selection is done using the CloudLasso method, which suits the visual analytics system very well: the user just needs to draw a 2D lasso around the intended target on the 2D surface, and the method automatically performs the selection in 3D space. At the same time, the selection volume also helps the user better distinguish the clusters in the analytical components. Moreover, the touch-sensitive display allows users to visually analyze and explore the data in a collaborative fashion. An observational study using astronomical datasets was carried out to evaluate the complete system.
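The screen-space half of such a lasso selection can be sketched in a few lines; this is a generic illustration, not the CloudLasso algorithm itself, which additionally uses the local particle density to bound the selection in depth:

```python
def point_in_polygon(x, y, poly):
    """Ray-casting point-in-polygon test for a 2D lasso contour."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray at height y
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def lasso_select(points3d, lasso, project=lambda p: (p[0], p[1])):
    """Naive screen-space selection: keep the 3D points whose 2D
    projection falls inside the lasso polygon."""
    return [p for p in points3d if point_in_polygon(*project(p), lasso)]

# A square 'lasso' and two particles; only the first projects inside it.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
picked = lasso_select([(1, 1, 5.0), (3, 1, 2.0)], square)
```

The `project` function stands in for whatever camera projection the viewer uses.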

On the Problem of Organizing Heterogeneous Information

Olga Zhelenkova, Vladimir Vitkovskiy

The diverse and heterogeneous data that a researcher ends up with require new ways of organizing the information for efficient and comfortable work. The virtual observatory provides users with web services for retrieving data from distributed information resources, and with tools for visualization and analysis of the extracted data. These means greatly increase the efficiency of working with digital data. However, studies involving modern astronomical catalogues are still time-consuming. We believe this is due to the following problems: (1) the knowledge gained in the analysis of catalogues is not stored anywhere; (2) there is no tool for updating compiled user data when new releases of surveys or new catalogues appear; (3) software for supporting and organizing compiled data, making it easier to work with, is needed; (4) existing formats of data representation do not support solutions to the above-named problems. For operating on request results, the interactive sky atlas Aladin, an excellent tool for working with different astronomical data, uses a stack: a collection of planes with the results of requests. For a researcher who works with heterogeneous resources, it would be useful to store all collected information about one object in one file. We think this idea can be expanded to also include in the stack planes with a semantic description of an object and/or repeated references to web services for data actualization. To standardize data structures similar to an Aladin stack, we propose to develop an extension of the FITS format.

Big Data analytics in the Geo-Spatial Domain

Romulo Goncalves, Milena Ivanova, Martin Kersten, Henk Scholten, Sisi Zlatanova, Foteini Alvanaki, Pirouz Nourian and Eduardo Dias

Big data collections in many scientific domains have inherently rich spatial and geo-spatial features. Spatial location is among the core aspects of data in Earth observation sciences, astronomy, and seismology, to name a few. The goal of our project is to design an efficient data management layer for a generic geo-spatial analysis system, with a focus on three-dimensional (3D) city models. Digital 3D city models play a crucial role in research of urban phenomena; they form the basis for flow simulations (e.g. wind streams, water runoff and heat island effects), urban planning, and analysis of underground formations. Urban scenes consist of large collections of semantically rich objects with a large number of properties, such as material and color. Modeling and storing these properties and the relationships between them is best handled in a relational database. The provision of spatial and geo-spatial features in database systems needs to be extended and brought to maturity to fulfill the requirements of real-world scientific applications. A class of DBMSs called column stores has proven efficient for analytic applications on extremely large data sets, and column stores have become the de-facto standard for managing large data warehouses. Although column stores have a proven track record in business analytics, their pros and cons for GIS applications are not yet well understood. Our goal is to have a spatial DBMS which iteratively loads data from different sources and converts it into a common format to enable 3D operations and analyses, such as 3D intersections, and semantic properties management.

Fishing on the Sea of Big Data: The Swarming Agent Method

Daniel Caputo

While big data may appear on the surface to be a new problem, it is in fact older than mankind. Consider the amount of data that passes through your eyes each moment: it is processed, analyzed, and finally the important bits are focused on for more detail. The world around us is full of data, and has been since before man first tread upon it. Other creatures are also very effective at filtering useful signals out of the noise that surrounds them, often with a vastly reduced computational, communication, and memory capacity. Not entirely unlike modern computers. We propose a data analysis method based on the 'swarm intelligence' of schooling fish. While relatively dumb, with limited communication and memory, many fish manage to organize themselves into schools which provide them, collectively, with an enhanced ability to locate food and evade predators. This algorithm, the swarming agent method, is based on a few simple rules of behavior, only very limited inter-agent communication, and limited memory. This method, classified as an artificial life algorithm, does not fall victim to the premature convergence that plagues many similar algorithms. In fact, it outperforms all other artificial life algorithms in both speed and accuracy.
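The abstract does not give the behavioral rules, but a toy sketch of a swarming-agent search in this spirit (agents drift toward the currently best-performing agent, with random jitter to resist premature convergence) could look like the following; all parameters and the test function are illustrative:

```python
import random

def swarm_minimize(f, dim=2, n_agents=30, steps=200, seed=0):
    """Toy swarming-agent search: each agent nudges toward the best
    neighbour it sees (cohesion) and adds noise so the school keeps
    exploring instead of collapsing prematurely onto one point."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_agents)]
    for _ in range(steps):
        fitness = [f(p) for p in pos]
        best = pos[fitness.index(min(fitness))]
        for p in pos:
            for d in range(dim):
                p[d] += 0.1 * (best[d] - p[d])  # cohesion toward the best agent
                p[d] += rng.gauss(0, 0.05)      # jitter: limited "memory", no premature lock-in
    fitness = [f(p) for p in pos]
    return min(zip(fitness, pos))

# Minimize a simple bowl-shaped test function.
sphere = lambda p: sum(x * x for x in p)
best_val, best_pos = swarm_minimize(sphere)
```

A real swarming-agent method would also include separation and alignment rules and restrict each agent's view to nearby neighbours; this sketch keeps only the core feedback loop.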

Application tracing for storage optimisation

Y.G. Grange, Y. Kim, C. Wu, H.A. Holties

The Square Kilometre Array (SKA) will be the next-generation radio telescope, consisting of many small antennae distributed over South Africa and Australia. When operational, the SKA will produce exabytes of data per year to be stored and processed. To store this amount of data in a cost-effective way, it will be necessary to use a storage system consisting of multiple storage tiers (e.g. tape, disk, SSD). To allow efficient analysis and distribution, the storage system needs to make predictions based on data access patterns so that optimal pre-fetching and data placement decisions can be taken. For this mechanism to be useful in a production environment, applications running on the system need to be traced in such a way that the relevant information can be acquired, while the tracing tool should incur only minimal overhead in terms of system resources, so that it does not interfere with the processing it aims to assist. This proves to be a challenging requirement. We report on our investigations of how to obtain traces using the LOFAR software as a prototype for the SKA software. We discuss the different levels at which one can tackle this problem, which bits of information can be obtained at each level, and the effect of some methods on the performance of the system.

Future Big Data capabilities at the RUG-CIT

Haije Wind

The University of Groningen has a long computational tradition, running for the last 50 years. With the IBM Blue Gene it briefly hosted one of the top 10 supercomputers worldwide. Currently, we are working on the ICT strategy program for 2016-2020. In this program, which is still under construction, the current (Big Data) infrastructure and visualisation facilities will be upgraded with new technology and capabilities. Our partners and users come from a wide range of scientific fields, each of them posing different demands on our data center and services. We are looking at different types of GPUs, which will be tested in our next Linux cluster. One of the HPC trends is that the variety of compute nodes is growing. On the other hand, clock speeds and numbers of cores (except for GPUs) are not growing as fast as in previous years. Is it time to invest in smarter software instead of just buying faster computers? We are working with a team of data scientists who can advise researchers facing Big Data challenges.

The European Research Center for Exascale Technology

TBA

Coming soon...

Target moving to Data Federations - Euclid as an ultimate case

Edwin A. Valentijn

Target was one of the first to explore a data-centric approach to the processing, handling and storage of Big Data. Using the Astro-Wise concepts, developed in 2000-2008, Target has deployed its Wise technology in various domains: ranging from astronomical optical wide-field imagers (among others OmegaCAM@VST with its KiDS survey), the LOFAR Long Term Archive and the Muse@VLT imaging spectrometer, to handwritten text (Monk) and the medical cohorts of the Lifelines survey. Target Holding has rolled out various applications in industry, particularly in the field of internet data mining. The project/domain-oriented approach has been very fruitful and will be enhanced in the Data Federations approach, which Target is now setting up for 2015-2019. In these federations the North of the Netherlands plays a key role, while the data is federated across the Netherlands and Europe. Pooling data federations in a computing center has the advantage that common hardware and data modeling techniques (data servers and databases) can be applied, while recognizing in each federation the differences in requirements on security, the availability of standards and protocols, and the various cultures in the different domains. The case of the Euclid satellite Ground Segment, with 8 data centers federated over Europe and serving 1000+ researchers, is an interesting example of a data-centric approach. The Euclid experiment requires a common information system to handle, administrate and monitor the Europe-wide processing and storage in a zoo of hardware. The Euclid Archive System will provide such an information system in a living archive used for the production, quality control, analysis and dissemination of 10+ Petabytes of space data, from which the users require 1-2 orders of magnitude higher precision than ground-based surveys. In Groningen, the first prototype has been built.

Why and How to Tier Your Data

Slavisa Sarafijanovic

The data volume growth trends require not only scaling storage device capacity and price adequately, but also rethinking data tiering techniques and solutions. To achieve both low storage cost and good I/O performance, frequently accessed and performance-critical data is typically stored on a tier of fast, expensive devices, while rarely accessed and low-performance-tolerant data is stored on a tier of slow, cheap devices. In this talk, I will give an overview of a typical approach for modeling data volume and I/O workloads, and outline some of the existing heuristics for dimensioning the number of devices in each tier and for placing the data on the tiers. I will present a new technique for storage dimensioning and data placement that significantly improves storage cost and/or performance compared to the state-of-the-art heuristics. Unlike those heuristics, the new technique is based on a queuing model of the storage system. Part of the talk is about a practical approach for adding a tape tier to an SSD- and disk-based storage system, aimed at storing rarely accessed data very cheaply.
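A minimal version of the access-frequency heuristic described above (hottest data on the fast tier until it is full) can be sketched as follows; object names, sizes and capacities are made up for illustration, and the queuing-model technique presented in the talk is not shown:

```python
def place_data(objects, fast_capacity):
    """Greedy placement heuristic: rank objects by 'heat density'
    (accesses per unit of capacity) and fill the fast tier first;
    everything that does not fit goes to the slow tier.
    objects = list of (name, size, accesses_per_day)."""
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    fast, slow, used = [], [], 0
    for name, size, accesses in ranked:
        if used + size <= fast_capacity:
            fast.append(name)
            used += size
        else:
            slow.append(name)
    return fast, slow

# Sizes in GB, access rates per day; capacity of the fast tier is 80 GB.
objs = [("logs", 100, 1), ("index", 10, 500), ("archive", 500, 0.1), ("db", 50, 200)]
fast, slow = place_data(objs, fast_capacity=80)
```

The point of the heuristic, and its weakness, is that it looks only at access frequency and ignores queuing effects on the devices, which is exactly what the queuing-model approach improves on.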

Lexical and semantic matching for biobank data integration

Chao Pang, Dennis Hendriksen, Martijn Dijkstra, K. Joeri van der Velde, Hans Hillege, Morris Swertz

Harmonization of data across studies requires the identification of “content equivalent” measurements or variables. However, searching thousands of available data items and harmonizing differences in terminology, data collection, and structure are arduous and time-consuming tasks. To simplify how data items are described and named, researchers typically assess the potential to harmonize data using a ‘mapping step’: in this process the key variables are matched, usually by hand, with ‘data dictionaries’ from each biobank listing all of the available data items. To simplify this process, the authors (in collaboration with the Molgenis project) developed the semi-automatic system BiobankConnect. It automates the way terminologies are recognized through ontologies, or formalized knowledge representations, which describe concepts and the relationships between them. HTN and high blood pressure, for instance, are both synonyms of hypertension. The system also identifies the most relevant data item for a given variable, using string-matching algorithms. BiobankConnect provides an easy user interface to significantly speed up the harmonization of biobanks by automating a considerable part of the work. It is available for download as a MOLGENIS open source app at http://www.github.com/molgenis, with a manual and demo available at http://www.biobankconnect.org.
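The combination of ontology-based synonym expansion and string matching can be illustrated with a small sketch; the tiny synonym table and the use of Python's difflib are illustrative stand-ins for BiobankConnect's actual ontologies and matching algorithms:

```python
import difflib

# Tiny hypothetical ontology: each concept lists its synonyms.
SYNONYMS = {
    "hypertension": ["hypertension", "htn", "high blood pressure"],
    "body mass index": ["body mass index", "bmi", "quetelet index"],
}

def match_variable(query, data_items):
    """Rank biobank data items against a query variable, expanding the
    query with ontology synonyms before fuzzy string matching."""
    expansions = [query.lower()]
    for concept, syns in SYNONYMS.items():
        if query.lower() in syns:
            expansions = syns  # the query names a known concept; use all its synonyms
    def score(item):
        return max(difflib.SequenceMatcher(None, e, item.lower()).ratio()
                   for e in expansions)
    return sorted(data_items, key=score, reverse=True)

items = ["High blood pressure (measured)", "Smoking status", "Weight in kg"]
ranked = match_variable("HTN", items)
```

Without the synonym expansion, "HTN" shares almost no characters with "High blood pressure (measured)" and the match would be missed; the ontology step is what makes the lexical matching work.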

Wearable Technologies: Impact of Big Data on Personal Health

Henk Hindriks

2014 has been proclaimed the year of Wearable Technologies. A massive number of new devices that can track people's physical and mental condition are hitting the market at an unprecedented rate. But where is all this personal health-related data stored? Somewhere “in the cloud”, mostly in proprietary databases owned by the device or app builders. What if this data could be connected to other databases and shared with others? This brings many challenges: quality of data, data sharing and access, privacy and safety issues, as well as many legal questions.

The LifeLines Cohort Study and Biobank: Data and IT Solutions

S. Scholtens, A. Dotinga, M.J. Bonthuis, J.L. van der Ploeg, M.A. Swertz, Ronald P. Stolk

LifeLines is a prospective population-based cohort study and biobank that will follow 167,229 individuals in the northern part of the Netherlands for at least 30 years in a three-generation design. High-quality data, by means of physical examinations, questionnaires and analysis of biomaterial, are collected to push forward research on 'Healthy Ageing'. In addition, linkage is established with medical registries, national registry data and environmental exposures. LifeLines aims to make the data available to researchers worldwide. When the data is processed for a data release, the dataset is pseudonymized and moved to a new database in a separate geographic location. The level of possible re-identification in each dataset is assessed and the aggregation level is adjusted. Each researcher selects the requested data using the LifeLines data catalogue (www.lifelines.net) and gets access to a tailor-made dataset with only the data needed to answer the research question. The data is made available to researchers by means of a virtual environment: the LifeLines workspace. LifeLines and partners have developed an innovative ICT and data management solution for making the LifeLines data available to researchers while maximally protecting the privacy of the participants. These solutions can be implemented in other biobank and cohort studies.
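The pseudonymization step can be illustrated with a keyed one-way hash, which maps each participant to a stable token that cannot be reversed without the key; this is a generic sketch, not the actual LifeLines procedure, and the identifiers and key below are made up:

```python
import hmac
import hashlib

def pseudonymize(participant_id, secret_key):
    """Keyed one-way pseudonym: the same participant always maps to the
    same token, but the mapping cannot be inverted without the key."""
    digest = hmac.new(secret_key, participant_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

token = pseudonymize("LL-000123", b"data-release-2015-key")
```

Using a keyed HMAC rather than a plain hash prevents anyone without the key from confirming a guessed identity by re-hashing it.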

Data collection and machine learning to optimize radiotherapy treatment of individual patients

A van der Schaaf, NM Sijtsema, JA Langendijk

Patients exhibit a large variability in their response to radiotherapy. Therefore, tumor control and complication rates can be optimized by tailoring the radiotherapy treatment to the characteristics of each individual patient. For this purpose accurate models are necessary that describe the probability of tumor control and the risk of complications in relation to a large number of patient characteristics (e.g. genetics, age, sex, medical image features) and treatment characteristics (e.g. addition of chemotherapy, treatment with photons or protons, dose distribution). To build these models, a large quantity of patient and treatment characteristics and treatment outcome measures is collected in so-called prospective data registration (PRODARE) programs, and machine learning techniques are used to identify prognostic variables. At the department of radiation oncology of the UMCG, a database with extensive information on patient, tumour and dose-volume characteristics as well as follow-up information is available, containing data of over 6,000 patients. This database is continuously growing with data of approximately 1,000 patients annually. Based on this data, we developed multivariable Normal Tissue Complication Probability (NTCP) models to estimate the risk of several complications for head and neck cancer patients. These models are used to optimize the dose distribution with currently available radiation delivery technology and to identify patients that will benefit most from treatment with protons in the future. In the near future this dataset will be extended to include genomic information and detailed treatment and imaging data. For this purpose a new database structure will be developed.
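A multivariable NTCP model of this kind is commonly a logistic function of a linear combination of patient and treatment features; the sketch below uses made-up coefficients and feature names, not the fitted UMCG models:

```python
import math

def ntcp(features, betas, intercept):
    """Multivariable logistic NTCP model: the complication probability
    is a sigmoid of a linear combination of prognostic variables."""
    s = intercept + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical two-variable model: mean dose to an organ (Gy) and age (decades).
p = ntcp([26.0, 6.5], betas=[0.12, 0.3], intercept=-5.0)
```

Comparing such a model evaluated on a photon plan versus a proton plan for the same patient gives the predicted risk reduction, which is how models like these support the selection of patients for proton therapy.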

The Euclid Netherlands Science Data Centre: preparing to map the dark universe and more!

Rees Williams

Euclid is ESA's mission to map the dark universe. Due for launch in 2020, it will produce an unprecedented amount of data for a satellite mission. The Euclid Netherlands Science Data Centre (SDC-NL), located at the University of Groningen, will have special responsibility for the organisation of the Euclid data processing via its development of the Euclid Information System and the Euclid Distributed Storage System. In addition, the data centre will take a lead role in the processing of the massive amounts of ground-based data needed to fulfill the Euclid mission objectives. The Euclid Netherlands Data Centre has been able to undertake such a major role in the Euclid mission due to the key expertise available in its community. In particular, the use of WISE technology underlies the development of the Euclid Information System. Similarly, the expertise acquired during the KiDS survey on the VST telescope in Chile underlies the systems used for processing external data. The ambition of the Euclid Netherlands Science Data Centre is not limited to the Euclid mission. The intention is to found a permanent Satellite Science Data Centre, which can apply its skills to a range of projects in the future.

From Terabytes of pixels to 35 numbers in the Early Universe: the WISE approach to Big Data quality control

Gijs Verdoes Kleijn

In the context of Target, the Groningen OmegaCEN astronomical datacenter coordinates the data handling for a diverse ensemble of astronomical spectroscopic and imaging surveys. Instruments will continue to operate into the 2020s, yielding dozens of survey datasets and establishing a Petabyte-regime pool of astronomical Big Data. A set of globally spread and relatively independent international science teams uses this pool of astronomical Big Data for very different science goals. To this end Target-OmegaCEN operates information systems based on the WISE technology. They are federated information systems which pool databases, compute and storage facilities across countries. Each survey team collaborates in its own project environment to perform calibration, quality control, scientific analysis and public data releases. In this presentation, we focus on massive quality control of the suite of released data products. The WISE approach makes it possible to go through successive cycles of improving Big Data quality as instrumental behavior, insight and algorithms change.

LOFAR Long Term Archive: The first fifteen Petabyte

G.A. Renting (ASTRON), H.A. Holties (ASTRON), F. Dijkstra (RUG), C. Schrijvers (SARA), W.-J. Vriend (RUG)

The LOFAR Long Term Archive (LTA) is a distributed information system that provides integrated services for data analysis as well as long-term preservation of astronomical datasets and their provenance for the LOFAR telescope. LOFAR started full operations with the start of Cycle 0 on December 1, 2012. In this presentation we present the experiences and results of the first year of full-scale operational use of the LOFAR archive. We describe how the very different modes of data collection (Interferometry, Tied Array, Fly's Eye, Transient Buffer Board) and processing (Radio Sky Imaging, Radio Sky Survey, Pulsar Search and Characterization, Transient and Cosmic-Ray detection) are supported, and how the different types of current and future data products (MeasurementSets, RadioSkyImages, BeamFormedData, CosmicRayImages, DynamicSpectra, Time-Series, Rotation Measure Synthesis) are modelled and represented. We will relate our experiences and results in handling, storing and providing to users millions of data products and Petabytes of data distributed across 4 sites in two countries and multiple organizations. Some of the challenges we faced will be detailed: tape, disk and network hardware, the software to manage the data and metadata streams, the management and organisational complexities, and the user interfaces and user support needed. The volume and multitude of data products required a paradigm shift in thinking for us and our users, not just in how to handle and use big data, but also in the challenges faced in supporting operations on this scale with limited resources.

Poster Abstracts

Herschel/HIFI data processing

Russell Shipman

As part of the Herschel Science Observatory's 3.5-year mission, the Heterodyne Instrument for the Far-Infrared (HIFI) performed over 8500 astronomical observations, which are stored in the Herschel Science Archive for the astronomical community. These data have extremely high legacy value. Our task is to calibrate data from the HIFI instrument, to identify instrument artefacts, to improve the metadata about a given observation, and to provide documentation to the astronomical community. We have developed a standard processing pipeline which is very flexible and extensible. By applying new algorithms, introducing new calibration data, and creating new metadata regarding the quality of the archived observations, we provide future astronomers with high-quality legacy data.

JWST-MIRI IFU characterization and data analysis

Fred Lahuis

The NASA James Webb Space Telescope (JWST) is the next major infrared space mission, scheduled for a 2018 launch. On board is a suite of four sensitive near- and mid-IR instruments. Combined with the collecting power and spatial resolution of its 6.5 meter dish, this provides a huge discovery potential, ranging from searches for the first light after the Big Bang to the characterization of planets around stars in our local Universe. One of the instruments is the European/US Mid-Infrared Instrument (MIRI), offering imaging, coronagraphy and low and medium resolution spectroscopy (LRS and MRS) between wavelengths of 5 and 28.7 microns. NOVA, the Netherlands Research School for Astronomy, designed and built the Spectrometer Main Optics Module (SMO) for the medium resolution spectrometer. Most of the work for the SMO was performed at ASTRON, in collaboration with TNO/TPD for the optical design. MIRI was the first instrument delivered to NASA, in 2012, and is currently in one of the first satellite integration and testing phases. In this poster, I will present a number of calibration and data analysis aspects of the MIRI MRS, which is equipped with integral field units (IFUs) to provide imaging spectroscopy over the full MIRI wavelength range.

Classifying bird tracking: From acceleration to behavior

Stefan Verhoeven, Christiaan Meijer, Elena Ranguelova, Judy Shamoun-Baranes, Willem Bouten

GPS trackers and accelerometers keep getting smaller thanks to the miniaturization of components for mobile devices. Ecologists can therefore now use them on medium-sized birds, such as godwits and gulls. These trackers generate massive amounts of data, much more than ecologists are used to handling. For the interpretation and classification of the data, tools and web-based services were developed. With the accelerometer, behavior can now be studied remotely, day and night. Classification of the accelerometer data and GPS tracks will help to better understand the birds' behavior throughout their daily routines and annual cycles. To cope with all these data, we are building a classification pipeline. The pipeline uses Weka implementations of classification algorithms. Features are derived from the GPS tracker coordinates, acceleration, speed, etc. A visualization for training data and classifications has been made for easy use and cross-domain applications. Combining GPS tracker data with satellite and video data allows us to build better classifiers.
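The feature-derivation step can be sketched as follows; the exact feature set used in the pipeline is not specified in the abstract, so the features below (per-axis statistics, a simple dynamic body acceleration measure, and GPS speed) are illustrative:

```python
import statistics

def segment_features(ax, ay, az, speed):
    """Summary features for one fixed-length accelerometer segment, of
    the kind that could be fed to a Weka classifier: per-axis mean and
    standard deviation, a dynamic body acceleration measure, and speed."""
    feats = {}
    for name, axis in (("x", ax), ("y", ay), ("z", az)):
        feats[f"mean_{name}"] = statistics.mean(axis)
        feats[f"std_{name}"] = statistics.pstdev(axis)
    # Dynamic body acceleration: mean absolute deviation, summed over axes.
    feats["odba"] = sum(
        statistics.mean(abs(v - statistics.mean(axis)) for v in axis)
        for axis in (ax, ay, az))
    feats["speed"] = speed
    return feats

# One short segment of tri-axial acceleration plus the GPS-derived speed (m/s).
f = segment_features([0.1, 0.2, 0.1], [0.0, 0.0, 0.0], [1.0, 1.1, 0.9], speed=12.5)
```

The static component (the per-axis means, dominated by gravity) indicates body posture, while the dynamic component separates, for instance, flapping flight from soaring or resting.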

Challenges for visualization of HI in galaxies

D. Punzo, J.M. van der Hulst, J.B.T.M. Roerdink

APERTIF surveys will produce 2048x2048x16384-pixel data cubes covering 3 × 3 degrees over a bandwidth of 300 MHz every day. HI surveys will detect hundreds of well-resolved sources, thousands of sources with a limited number of resolution elements, and tens of thousands of objects which are at best marginally resolved. The second class of sources contains a wealth of morphological and kinematic information, but extracting it quantitatively is difficult due to the complexity of the data. Our aim is to develop a fully interactive visualization tool with quantitative and comparative capabilities which will enable flexible and fast interaction with the data. Full 3D visualization, coupled to modeling, provides additional capabilities that help in the discovery of subtle structures in the 3D domain.

Interactive visualization and exploration of billion row catalogues: the era of Gaia

Maarten Breddels

The Gaia satellite, launched in December 2013, will determine the positions, velocities and astrophysical properties of a billion stars in the Milky Way. This will lead to a high-dimensional and very large catalogue. To unravel the mysteries of our own Milky Way, we need a method not only to visualize this catalogue, but also to aid in finding interesting subspaces that may be richer in information content. These ideas are implemented in the program 'vaex' (Visualization And EXploration). Here we describe some of its salient features. On a relatively modern desktop machine, 1D and 2D histograms of a billion rows can be visualized in less than one second. Updates are made in the background, so that the program stays responsive and feels fluid. Volume rendering for 3D visualization and selection in 3D are in progress. Basic 1D and 2D selections can be made by lasso, rectangle, or x/y selections, in combination with logical operators (and, or, xor, invert). A selection made in one window is visible in all windows, a feature known as 'linked views'. This also allows selection of a single object, which is then shown in all views and in tabular form, for the identification and inspection of e.g. outliers. To cope with the large dimensionality, we implemented ranking of subspaces: showing the 2D and 3D subspaces sorted by the mutual information of all the data, or of the selection made in the user interface, one can focus on the most important subspaces. Since no program can implement all algorithms, and because some algorithms live in different frameworks or languages, we have a method to share the data with other processes without copying it, using SAMP in combination with memory mapping.
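The single-pass binned count at the heart of such interactive histograms, with a selection mask implementing linked views, can be sketched as follows; vaex itself vectorizes and parallelizes this over memory-mapped columns, which this pure-Python sketch does not:

```python
def histogram1d(values, vmin, vmax, bins, mask=None):
    """One-pass 1D binned count. An optional boolean mask restricts the
    count to a selection, so the same selection can drive every plot
    ('linked views') without touching the underlying column."""
    counts = [0] * bins
    scale = bins / (vmax - vmin)
    for i, v in enumerate(values):
        if mask is not None and not mask[i]:
            continue                      # row is outside the linked-view selection
        if not (vmin <= v < vmax):
            continue                      # row is outside the plotted range
        counts[int((v - vmin) * scale)] += 1
    return counts

values = [0.5, 1.5, 1.7, 2.5, 3.9]
selection = [True, True, False, True, True]   # e.g. a lasso made in another view
counts = histogram1d(values, 0.0, 4.0, bins=4, mask=selection)
```

Because the operation is a single sequential pass, it streams well over memory-mapped data and its cost grows linearly with the number of rows, which is what makes sub-second billion-row histograms feasible on a desktop.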

OraVC – the database objects version control tool for Oracle

Andrey Tsyganov

Many users of databases are at some point confronted with problems caused by the inevitable evolution of the database schema. These might include changes to the source code or to the schema objects. In such cases it is often not easy to use well-known source code version control tools like SVN, Git or CVS, because they rely on developer discipline and do not protect against a rogue client. Implementing version control at the database level solves this problem. In particular, OraVC is a tool that makes it possible to track changes to schema objects directly in the Oracle database. This can bring a crucial reduction in effort in large business application development environments with parallel development of Oracle packages, triggers, or Java objects.