The DQP functionality was originally designed and developed by at the University of Manchester and the University of Newcastle upon Tyne as part of the OGSA-DAI project and the MyGrid project. Previous versions of DQP have been released as a separate product from OGSA-DAI using the title OGSA-DQP. From OGSA-DAI release 3.2 the DQP functionality has been rewritten and is now simply an extension to OGSA-DAI rather than a separate product. For those familiar with previous versions of DQP, details of how this version differs can be found in Section 147.5, “How this version of DQP differs from previous versions”.
DQP allows the tables from multiple distributed relational databases to be queried, using SQL, as if there were multiple tables in a single database. Consider two distributed data resources. The first is called MySurvey and contains survey data of people:
| Data Resource: MySurvey | |||
|---|---|---|---|
| Table: People | |||
| name | gender | age | postcode |
| Alan | M | 20 | AB1 2GB |
| Bob | M | 50 | EH5 2TG |
| Cath | F | 25 | JK3 4RE |
The second data resource is called PostcodeStats and maps postcodes to classifications of a region type:
| Data Resource: PostcodeStats | |
|---|---|
| Table: PostcodeRegionType | |
| postcode | regionType |
| AB1 2GB | City |
| FG2 5GB | Rural |
| EH5 2TG | City |
| JK3 4RE | Rural |
If these two distributed relational data resources could be queried as if they were two tables in a single database then useful questions could be answered. DQP allows the creation of a virtual data resource that appears to the user to contain these two tables. By default, table names in the virtual data resource are prefixed with the original resource names so the two tables in our virtual data resource become:
| Data Resource: MyDQPResource | |||
|---|---|---|---|
| Table: MySurvey_People | |||
| name | gender | age | postcode |
| Alan | M | 20 | AB1 2GB |
| Bob | M | 50 | EH5 2TG |
| Cath | F | 25 | JK3 4RE |
| Data Resource: MyDQPResource | |
|---|---|
| Table: PostcodeStats_PostcodeRegionType | |
| postcode | regionType |
| AB1 2GB | City |
| FG2 5GB | Rural |
| EH5 2TG | City |
| JK3 4RE | Rural |
The MyDQPResource resource can be used to execute SQL queries that use both these tables. For example, to obtain the region type that Alan lives in, the SQL query is:
SELECT regionType FROM MySurvey_People p JOIN PostcodeStats_PostcodeRegionType r ON p.postcode = r.postcode WHERE name = 'Alan'
which produces result:
| regionType | | City |
The SQL query to find the the average age for people living in each region type would be:
SELECT regionType, avg(age) as avgAge FROM MySurvey_People p JOIN PostcodeStats_PostcodeRegionType r ON p.postcode = r.postcode GROUP BY regionType
which produces the result:
| regionType | avgAge | | City | 35.0 | | Rural | 25.0 |
The DQP federation will contain one table for each table in each of the distributed resources. Consider the following new relational resource that we now wish to add to our federation.
| Data Resource: JoesSurvey | |||
|---|---|---|---|
| Table: People | |||
| name | gender | age | postcode |
| Darren | M | 20 | FG2 5GB |
| Ed | M | 31 | EH5 2TG |
| Fiona | F | 29 | FG2 5GB |
This table will appear in our data federation as JoesSurvey_People. This will be an entirely different table from the MySurvey_People table. If we wish to consider these two tables to be portions of a single table then we must use a UNION operator in our SQL queries. For example, the query to obtain the average age for each region type now becomes:
SELECT regionType, avg(age) as avgAge FROM ((SELECT age, postcode FROM MySurvey_People) UNION ALL (SELECT age, postcode FROM JoesSurvey_People)) as u JOIN PostcodeStats_PostcodeRegionType r ON u.postcode = r.postcode GROUP BY regionType
It will be possible in the future to combine DQP with OGSA-DAI's support of views to hide the need for union operators such as this from the client. For example, the DQP resource could be wrapped in a view resource that defines a new People table as:
SELECT * FROM ((SELECT name, age, postcode FROM MySurvey_People) UNION ALL (SELECT name, age, postcode FROM JoesSurvey_People))
As the time of writing the OGSA-DAI Views extension pack is not compatible with DQP. Check the OGSA-DAI website to see if a new version of the OGSA-DAI views extension pack that is compatible with DQP has been released.
The virtual federated resources that DQP provides can be configured statically at the server or dynamically by a client. Once the DQP resource has been created it can be used by clients just like any other relational resource.
This section shows DQP's SQL grammar, functions and some example SQL queries.
DQP supports a subset of SQL 92. The grammar that DQP aims to support is shown below. This release of DQP does not support the whole grammar. The following features are not supported:
statement
: query_expression order_by? ';'? EOF
;
query_expression
: query_term (set_op query_term)*
;
query_term
: query ( 'INTERSECT' query)*
;
set_op
: (setop='UNION' | setop='EXCEPT') 'ALL'?
;
query
: sub_query
| 'SELECT' set_quantifier? select_list
'FROM' table_expression
('WHERE' s1=search_condition)?
('GROUP BY' column_list)?
('HAVING' s2=search_condition)?
;
set_quantifier
: 'DISTINCT' | 'ALL';
sub_query
: '(' query_expression ')';
select_list
: '*'
| derived_column (',' derived_column)*;
derived_column
: value_expression ('AS'? ID)?
;
order_by
: 'ORDER' 'BY' sort_specification (',' sort_specification)* ;
sort_specification
: column_name | INT | reserved_word_column_name
;
reserved_word_column_name
: (tableid=ID'.')?
(s='DATE' | s='TIMESTAMP' | s='TIME' | s='INTERVAL' | s='YEAR' |
s='MONTH' | s='DAY' | s='HOUR' | s='MINUTE' | s='SECOND' )
;
value_expression
: string_value_expression
| numeric_value_expression
;
numeric_value_expression
: factor (('+'|'-') factor)* ;
factor : numeric_primary (('*'|'/') numeric_primary)*;
numeric_primary
: ('+'|'-')? value_expression_primary;
value_expression_primary
: '(' value_expression ')'
| function
| column_name
| literal
| sub_query
;
literal : INT | FLOAT | NUMERIC | STRING | datetime | interval |
'NULL' | 'TRUE' | 'FALSE';
datetime
: ('DATE' | 'TIMESTAMP' | 'TIME') STRING
| (tableid=ID'.')?(s='DATE' | s='TIMESTAMP' | s='TIME')
;
interval
: 'INTERVAL' STRING ('YEAR' | 'MONTH' | 'DAY' | 'HOUR' | 'MINUTE' | 'SECOND')
| (tableid=ID'.')?(s='INTERVAL' | s='YEAR' | s='MONTH' | s='DAY' | s='HOUR' |
s='MINUTE' | s='SECOND')
;
function: (name=ID) '(' value_expression? (',' value_expression)* ')'
| (name=ID) '(' '*' ')'
;
string_value_expression
: (column_name | STRING) ('||' (column_name | STRING))+
;
table_expression
: table_reference
;
table_reference
: table (',' table_reference)*
;
join_type
: 'RIGHT' 'OUTER'? 'JOIN'
| 'LEFT' 'OUTER'? 'JOIN'
| 'FULL' 'OUTER'? 'JOIN'
| 'INNER'? 'JOIN'
;
table : non_join_table (join_type non_join_table 'ON' search_condition)*
;
non_join_table
: table_name correlation_specification?
| table_function correlation_specification
| sub_query correlation_specification
;
table_function
: name=ID '(' table_function_subquery? (',' table_function_subquery)*
(','? table_function_param)* ')'
;
table_function_subquery
: sub_query correlation_specification
;
table_function_param
: search_condition
| value_expression
;
relation
: table_name
| table_function
| query
;
search_condition
: boolean_factor ('OR' boolean_factor)*;
boolean_factor
: boolean_term ('AND' boolean_term)*
;
boolean_term
: boolean_test
| 'NOT' boolean_term ;
boolean_test
: boolean_primary;
boolean_primary
: predicate | '(' search_condition ')';
predicate
: comparison_predicate | like_predicate | in_predicate | null_predicate |
exists_predicate | between_predicate;
null_predicate
: row_value 'IS' 'NULL'
| row_value 'IS' 'NOT' 'NULL'
;
in_predicate
: row_value 'NOT' 'IN' in_predicate_tail
| row_value 'IN' in_predicate_tail
;
in_predicate_tail
: sub_query
| '(' (value_expression (',' value_expression)*) ')';
between_predicate
: value=row_value 'BETWEEN' btw1=row_value 'AND' btw2=row_value
| value=row_value 'NOT' 'BETWEEN' btw1=row_value 'AND' btw2=row_value
;
exists_predicate
: 'EXISTS' sub_query;
comparison_predicate
: row_value ('=' | '<>' | '!=' | '<' | '>' | '>=' | '<=')
('ALL'|'SOME'|'ANY')? row_value;
like_predicate
: row_value 'LIKE' row_value
| v1=row_value 'NOT' 'LIKE' v2=row_value;
row_value
: value_expression | 'NULL' ;
correlation_specification
: ('AS')? ID;
table_name
: ID;
column_list
: (column_name | reserved_word_column_name)
(',' (column_name | reserved_word_column_name))*;
column_name
: (tableid=ID'.')?columnid=ID
;
ID : ('a'..'z' | 'A'..'Z') ( ('a'..'z' | 'A'..'Z') | ('0'..'9') | '_' )*
| '`' (~('\''|'\n'|'\r'|'`')) ( (~('\''|'\n'|'\r'|'`')) )* '`' ;
FLOAT : ('0'..'9')+ '.' ('0'..'9')+ ;
INT : ('0'..'9')+ ;
NUMERIC : (INT | FLOAT) 'E' INT ;
STRING : '"' (~('"'|'\n'|'\r'))* '"'
| '\'' (~('\''|'\n'|'\r'))* '\'';
WS : (' ' | '\t' | '\r' | '\n' ) {skip();} ;
DQP supports aggregation and scalar functions. These are extensibility points of DQP so if the function you require is not provided it is possible to write your own and deploy them into DQP, see Chapter 151, DQP for service developers.
DQP supports the aggregation functions shown in Table 147.1, “DQP aggregation functions.”. All the aggregation functions ignore null values except COUNT(*) which counts all the rows regardless of the data values they contain.
| Name | Supported types | Description | Examples |
|---|---|---|---|
| COUNT | All | Returns a count of the number of rows returned. | COUNT(*), COUNT(colName) |
| AVG | Short, Integer, Long, Float, Double, BigDecimal | Returns the average value of the argument. | AVG(colName) |
| MAX | Short, Integer, Long, Float, Double, BigDecimal | Returns the maximum value of the argument. | MAX(colName) |
| MIN | Short, Integer, Long, Float, Double, BigDecimal | Returns the minimum value of the argument. | MIN(colName) |
| SUM | Short, Integer, Long, Float, Double, BigDecimal | Returns the sum of the values of the argument. | SUM(colName) |
| STDDEV | Short, Integer, Long, Float, Double, BigDecimal | Returns the population standard deviation. | STDDEV(colName) |
| VARIANCE | Short, Integer, Long, Float, Double, BigDecimal | Returns the population standard variance. | VARIANCE(colName) |
| MEDIAN | Short, Integer, Long, Float, Double, BigDecimal | Returns the median of the values of the argument. | MEDIAN(colName) |
Table 147.1. DQP aggregation functions.
DQP supports the scalar functions shown in Table 147.2, “DQP scalar functions.”.
| Name | Supported signatures | Description | Examples |
|---|---|---|---|
| GetHourOfDay | (Time)->Long, (Timestamp)->Long | Returns the hour of the day. | GetHourOfDay(colName) |
| GetDate | (Timestamp)->Date | Extracts the date from a timestamp. | GetData(colName) |
| AddDays | (Date, Number)->Date, (Timestamp, Number)->Timestamp | Adds the given number of whole days. | AddDays(colName, 2) |
| SubDays | (Date, Number)->Date, (Timestamp, Number)->Timestamp | Subtracts the given number of whole days. | SubDays(colName, 2) |
| CurrentDate | ()->Date | Returns the current date. | CurrentDate() |
| CurrentTime | ()->Time | Returns the current time. | CurrentTime() |
| DateDiff | (Date,Date)->Integer, (Date,Timestamp)->Integer, (Timestamp, Timestamp)->Integer, (Timestamp,Date)->Integer | Returns the number of whole days between the two given values. Return value will be negative if the first date chronologically precedes the second date, and positive if first date is chronologically after the second date. | DateDiff(colName1, colName2) |
Table 147.2. DQP scalar functions.
This section shows some example SQL queries that DQP supports. If you are having difficulty forming a query that DQP can execute try adapting those shown here.
Simple SELECT... FROM...WHERE... queries:
SELECT * FROM MyResource_People
SELECT name, age, gender FROM MyResource_People
SELECT name, age FROM MyResource_People WHERE age > 18 AND gender = 'F'
SELECT name, age FROM MyResource_People WHERE name LIKE 'J%'
SELECT DISTINCT age FROM MyResource_People
Implicit join queries:
SELECT * FROM MyResource_People p, OtherResource_PostcodesStats pc WHERE p.postcode = pc.postcode
Explicit join queries:
SELECT * FROM MyResource_People p JOIN OtherResource_PostcodesStats pc ON p.postcode = pc.postcode
Union queries:
SELECT intT, dateTimeT FROM ((select intT, datetimeT from MySQLResource_mysql_types) UNION ALL (select intT, datetimeT from Resource2_mysql_types) ) sub
Using aggregate functions:
SELECT COUNT(*), AVG(p.age) FROM MyResource_People p JOIN OtherResource_PostcodesStats pc ON p.postcode = pc.postcode
Using scalar functions:
SELECT GetHourOfDay(startTime) FROM MyResource_runRecord
SELECT * FROM MyResource_runRecord WHERE GetHourOfDay(startTime) > 10
Using grouping:
SELECT C.eid, MAX(A.cruisingrange) FROM MySQLResource_certified C, MySQLResource_aircraft A WHERE C.aid = A.aid AND A.aid>10 GROUP BY C.eid
SELECT C.eid, MAX(A.cruisingrange) FROM MySQLResource_certified C, Resource2_aircraft A WHERE C.aid = A.aid GROUP BY C.eid HAVING COUNT(C.eid) > 3
SELECT d, h, AVG(i) FROM (SELECT GetDate(startTime) AS d, GetHourOfDay(startTime) AS h, numValues AS i FROM MyResource_runRecord) AS sub GROUP BY d, h ORDER BY d, h
Ordering:
SELECT name, age, gender FROM MyResource_People ORDER BY age
Using subqueries:
SELECT DISTINCT E.ename FROM MySQLResource_employees E WHERE E.salary < ( SELECT MIN (F.price) FROM MySQLResource_flights F WHERE F.origin = 'Los Angeles' AND F.destination = 'Honolulu' )
SELECT Temp.name, Temp.AvgSalary FROM ( SELECT A.aid, A.aname AS name, AVG (E.salary) AS AvgSalary FROM MySQLResource_aircraft A, Resource2_certified C, Resource3_employees E WHERE A.aid = C.aid AND C.eid = E.eid AND A.cruisingrange > 1000 GROUP BY A.aid, A.aname ) AS Temp ORDER BY Temp.name
Queryies using various combinations of supported functionality:
SELECT d, h, AVG(i) FROM (SELECT GetDate(datetimeT) AS d, GetHourOfDay(datetimeT) AS h, intT AS i FROM ((select intT, datetimeT from Resource1_runRecord) UNION ALL (select intT, datetimeT from Resource2_runRecord) ) sub1 WHERE datetimeT >= '1978-09-29' and datetimeT < '1978-11-30' ) AS sub2 GROUP BY d, h ORDER BY d, h
All the existing optimisers are essentially heuritic and do not make use of a cost model.
Join ordering makes use of very basic cardinality estimates. This model could be considerable improved and it an extensibility point in OGSA-DAI.
Many SQL operations are not pushed to the database even when it is possible to do so and are instead executed inside OGSA-DAI activities. This is true for all functions, ordering and group by operations.
Most join implementations store the data from one side of the join in memory. This can lead to excessive use of memory. Alternative approaches not yet implemented include writing to disk, executing a streaming merge join on sorted data streams or using a parallel hash join algorithm to distribute the load to multiple OGSA-DAI servers.
The errors returned to the user if the SQL query does not parse are often vague and unhelpful.
There is no dynamic fault tolerence so DQP fails if one of the resources it federates is unavailable.
This version of DQP is a major redesign and rewrite from the previous OGSA-DQP release (titled OGSA-DQP 3.2). The main differences include:
Publications describing the current DQP release.
Publications describing the design and applications of the previous OGSA-DQP release. Note that the design and implementation details differ significantly from the current version of DQP.