This chapter presents an overview of how DQP implements its distributed query functionality. The chapter should be considered an optional chapter for those that are interested. You do not need to know this information in order to deploy or use DQP.
The most important feature to note is that DQP does not store any data locally but instead queries each of the distributed data resources as required to obtain the data necessary to execute each query. For those familiar with data integration terminologies DQP uses the Global as View (GAV) approach within a mediator/wrapper architecture.
Before a DQP resource executes its first query it must construct the schema of the virtual database. To do this, DQP obtains the database schema from each of the resources in the federation, applies the appropriate prefix to each table and thus creates the schema of the virtual database which consist of each table in the distributed resources.
When a DQP resource receives a query the following actions take place:
The following example illustrates the process. In this example, we have four OGSA-DAI servers. Each of the three data resources (MySurvey, JoesSurvey and PostcodeStats) are on different OGSA-DAI servers and the DQP resource is on the fourth OGSA-DAI server. Consider the following query to discover the average age of males in the two surveys grouped by the region type:
SELECT regionType, avg(age) as avgAge FROM ((SELECT age, gender, postcode FROM MySurvey_People) UNION ALL (SELECT age, gender, postcode FROM JoesSurvey_People)) as u JOIN PostcodeStats_PostcodeRegionType r ON u.postcode = r.postcode WHERE gender = 'M' GROUP BY regionType
The abstract syntax tree for this query is shown in Figure 148.1, “AST for SQL query sent to DQP.” and the initial query plan is Figure 148.2, “DQP initial query plan.”.
The final, optimised query plan shown in Figure 148.3, “DQP query plan.”. This final query plan is coloured in to show on which OGSA-DAI service each of the operators will execute. OGSA-DAI's default strategy is to distribute the work to the remote OGSA-DAI servers whenever possible rather than simply pulling all the data back to the DQP mediator and carrying out the computation there.
This optimised query plan results in four OGSA-DAI workflows that are executed in parallel. As an example, the workflow sent to the OGSA-DAI server that hosts the MySurvey resource is shown in Figure 148.4, “Example DQP OGSA-DAI workflow”. Notice that this workflow exchanges data with other workflows using OGSA-DAI's data sink functionality, this can be seen by the inclusion of the ReadFromDataSink and DeliverToDataSink activities in the workflow.