Chapter 133. How DQP works

This chapter presents an overview of how DQP implements its distributed query functionality. The chapter should be considered an optional chapter for those that are interested. You do not need to know this information in order to deploy or use DQP.

The most important feature to note is that DQP does not store any data locally but instead queries each of the distributed data resources as required to obtain the data necessary to execute each query. For those familiar with data integration terminologies DQP uses the Global as View (GAV) approach within a mediator/wrapper architecture.

Before a DQP resource executes its first query it must construct the schema of the virtual database. To do this, DQP obtains the database schema from each of the resources in the federation, applies the appropriate prefix to each table and thus creates the schema of the virtual database which consist of each table in the distributed resources.

When a DQP resource receives a query the following actions take place:

  1. The query is parsed according to DQP's SQL grammar (see Section 132.3.1, “Grammar”) to produce an abstract syntax tree
  2. The abstract syntax tree is used to build a query plan
  3. The query plan is optimised for efficient execution. This includes deciding on which OGSA-DAI server each operation will be executed
  4. The optimised query plan in converted into multiple OGSA-DAI workflows that are executed in parallel to run the query.

The following example illustrates the process. In this example, we have four OGSA-DAI servers. Each of the three data resources (MySurvey, JoesSurvey and PostcodeStats) are on different OGSA-DAI servers and the DQP resource is on the fourth OGSA-DAI server. Consider the following query to discover the average age of males in the two surveys grouped by the region type:

SELECT regionType, avg(age) as avgAge
FROM 
  ((SELECT age, gender, postcode FROM MySurvey_People) UNION ALL 
   (SELECT age, gender, postcode FROM JoesSurvey_People)) as u
  JOIN PostcodeStats_PostcodeRegionType r ON u.postcode = r.postcode
WHERE gender = 'M'
GROUP BY regionType

The abstract syntax tree for this query is shown in Figure 133.1, “AST for SQL query sent to DQP.” and the initial query plan is Figure 133.2, “DQP initial query plan.”.

AST for SQL query sent to OGSA-DAI.
Abstract Syntax Tree (AST) for query sent to DQP. This has been scaled to fit the page click here to see it at full size.

Figure 133.1. AST for SQL query sent to DQP.


DQP query plan before optimisers are applied.
Initial DQP query plan before any optimisers are applied. For each operator box, the top section shows the schema of the data output by that operator and the bottom section shows the operator name. The middle section shown any addition properties of the operator, for example for the TABLE_SCAN operator it shows the SQL query to be executed. This has been scaled to fit the page click here to see it at full size.

Figure 133.2. DQP initial query plan.


The final, optimised query plan shown in Figure 133.3, “DQP query plan.”. This final query plan is coloured in to show on which OGSA-DAI service each of the operators will execute. OGSA-DAI's default strategy is to distribute the work to the remote OGSA-DAI servers whenever possible rather than simply pulling all the data back to the DQP mediator and carrying out the computation there.

DQP query plan after optimisers have been applied.
Final DQP query plan after optimisers have been applied. The operators have been coloured to show which node they will be executed on. For each operator the top section shows the schema of the data output by that operator, the bottom section shows the operator name and execution node. The middle section shown any addition properties of the operator, for example for the TABLE_SCAN operator it shows the SQL query to be executed. This has been scaled to fit the page click here to see it at full size.

Figure 133.3. DQP query plan.


This optimised query plan results in four OGSA-DAI workflows that are executed in parallel. As an example, the workflow sent to the OGSA-DAI server that hosts the MySurvey resource is shown in Figure 133.4, “Example DQP OGSA-DAI workflow”. Notice that this workflow exchanges data with other workflows using OGSA-DAI's data sink functionality, this can be seen by the inclusion of the ReadFromDataSink and DeliverToDataSink activities in the workflow.

OGSA-DAI workflow sent to MySurvey resource to execute the DQP query.
OGSA-DAI workflow sent to MySurvey resource to execute the DQP query. This has been scaled to fit the page click here to see it at full size.

Figure 133.4. Example DQP OGSA-DAI workflow