Conditional functional dependencies validation for XML data: an approach based on attribute grammar

The representation and exchange of information originating from different data sources is an increasingly common need for companies and industries that want to integrate their operations and to publish and trade information with government and other enterprises. For this purpose, many standards based on the XML language were created to allow effective data communication and exchange in a particular domain. To ensure data quality for XML data, this paper presents an approach based on the verification of conditional functional dependencies. Conditional dependencies are an extension of traditional database dependencies with the ability to enforce bindings of semantically related data values. The basis of our verification method is a generic grammarware for validating XML integrity constraints in one tree traversal. We use an attribute grammar to describe XML documents and constraints.


Introduction
The XML language has become the preferred format for data integration and information exchange between organizations, and has emerged as a well-accepted data model for heterogeneous data in many application domains. With the growing use of XML as a format for permanent storage of data, the topic of integrity constraints has gained importance and has been identified as a challenging subject of XML research, since there is a need for principles, algorithms and techniques that manage XML data efficiently while taking its semi-structured nature into account. Although XML offers many advantages, such as the ability to contain both data and information about the data and to provide an extensible and adaptable data format, it is hard to express semantic information in XML. The definition and use of integrity constraints is one of the topics of XML semantics, which is fundamental to other XML research areas, such as normalization, query optimization and data quality.
Several types of integrity constraints have been studied in the context of XML, such as key constraints, functional dependencies, inclusion dependencies and path constraints (Buneman et al., 2001; Deutsch and Tannen, 2005; Hartmann and Trinh, 2006; Karlinger et al., 2009). Those constraints are a mechanism to express how elements contained in an XML document are associated with each other. Functional dependency is an important kind of integrity constraint, and many definitions of XML functional dependencies with different notions were proposed in Arenas and Libkin (2004), Liu et al. (2003), and Wang and Topor (2005). As XML databases are often designed without much consideration of integrity constraints, tools for dependency discovery can be very useful to ensure data quality and to support knowledge exploration. For that reason, algorithms for dependency discovery have been proposed in the XML field (Trinh, 2008).
A novel extension of the traditional functional dependency, referred to as conditional functional dependency, was recently introduced. It was conceived to capture the notion of correct data in a specific situation. A conditional dependency represents a weaker form of dependency defined to act within a scope of data limited by conditions on several attributes. Only the tuples that satisfy these conditions should be evaluated; this conditional application allows contextualizing analysis on data and, in addition, finding and correcting specific inconsistencies. In this way, conditional functional dependencies for XML (XCFD) extend functional dependencies with a conditional expression that allows the application of a functional dependency only over parts of an XML document, representing a specific subset of data. As an example, suppose a database of bank accounts in which we want to check that (i) if the bank is B1, then a unique account number identifies each customer, and (ii) if the credit card is PlatinumCard for bank B2, then the card type identifies the interest rate. Those conditions are useful to understand the characteristics of a given data subset, and also to assess quality.
Lima, Rezende and Oliveira | Conditional functional dependencies validation for XML data
In this work, an XML document is seen as a structure composed of an unranked node-labeled tree and some functions for handling this tree. A path expression defines a way of navigating XML trees and is fundamental for the specification of integrity constraints in an XML context. We express conditional functional dependencies (XCFD) based on path expressions and verify them using a homogeneous formalism, introduced in Bouchou et al. (2011), founded on attribute grammars and finite state automata. The validation of an XML document is done in one tree traversal, in the document reading order, going top-down until leaf nodes are reached and then bottom-up, to finish each node visit, until the root node. In the top-down direction, the validation process uses attributes to specify the role of each node visited with respect to the defined constraints. In the bottom-up direction, the values concerned by the constraints are pulled up and treated via some other attributes. Our approach for XCFD validation does not depend on a document schema and is established by a traversal function that receives an XML document and a set of XCFDs, and checks whether each constraint is respected. The validation result is a Boolean value that is the conjunction of the Boolean results of the individual XCFDs, as shown in Figure 1.

Paper organization:
In Related Work we discuss related work on XML constraints and their verification. Next, we introduce our basic definitions. In Validation of conditional functional dependencies for XML we show our algorithms, based on an attribute grammar, to check whether a set of XCFDs is satisfied by an XML document, and we discuss the verification process. In Framework for XML integrity constraints validation with conditions, we consider aspects of design patterns to implement the verification proposal, taking into account the homogeneous formalism. Finally, we present the conclusion of this paper and perspectives for future work.

Related Work
In the relational context, many works are being developed to improve the quality of integrated data, with renewed interest in the application of classical dependencies (Fan, 2008; Liu et al., 2011; Bakhtouchi et al., 2011) and conditional dependencies (Bohannon et al., 2007; Fan et al., 2008; Ma et al., 2014) for the preservation of semantics and for the detection and repair of possible inconsistencies. In a similar way, classical and conditional dependencies on XML data have been proposed for data semantic verification. Several approaches for XML functional dependencies have been proposed (Buneman et al., 2001; Liu et al., 2003; Wang and Topor, 2005; Shahriar and Liu, 2008; Bouchou et al., 2012; Tan and Zhang, 2011), and also for conditional functional dependencies (Vo et al., 2011).
The implementation of constraint validators has received less attention. The performance of our approach is comparable to the implementations in Vincent and Liu (2005) and Shahriar and Liu (2009) but, contrary to them, it is intended to be a generic model for XML constraints validation. The ideas guiding this work are the ones outlined in Bouchou et al. (2011). An incremental validation method for keys is defined in Bouchou et al. (2007), using the generic model and considering multiple updates in an XML tree. In Gire and Idabal (2010), the notion of incremental validation is considered via the static verification of functional dependencies with respect to updates. However, in that work, XFDs are defined as tree queries.

Basic definitions
The basic definitions used throughout this paper are given below.
Definition 1. XML Document: Let Σ = ele ∪ att ∪ {data} be an alphabet where ele is the set of element names and att is the set of attribute names. An XML document is represented by a tuple T = (t, type, value). The tree t is the function t: dom(t) → Σ, where Σ is a set of tags and dom(t) is a set of positions. Given a tree position p, the function type(t, p) returns a value in {data, element, attribute}. We also recall that, in an XML tree, attributes are unordered while elements are ordered.
A tree representing an XML document containing bank account information is illustrated in Figure 2. At each node, its position and its label (e.g., t(0) = bank and t(0.0.1) = card) are shown together, and values (in italic) are associated to leaves.
A path for an XML tree t is defined by a sequence of tags or labels. The path language PL is used to define integrity constraints over XML trees. In PL, [] represents the empty path, l is a tag in Σ, the symbol "/" is the concatenation operation, "//" represents a finite (possibly empty) sequence of tags, and "_" is any tag in Σ. The language PL is a common fragment of regular expressions and XPath. A path P is valid if it conforms to the syntax of PL and, for all tags l in P, if l = data or l ∈ att then l is the last symbol in P. We consider that a path P defines a finite-state automaton A_P having XML labels as its input alphabet.
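To make the path language concrete, the following is a minimal sketch that compiles a PL path into a regular expression over label sequences. It is only a stand-in for the automaton A_P: the function names `pl_to_regex` and `is_instance`, and the encoding of a label sequence as tags each followed by "/", are assumptions made for this illustration.

```python
import re


def pl_to_regex(path: str) -> "re.Pattern[str]":
    """Translate a PL path into a regex over words like 'bank/account/'.

    Assumed encoding: every tag of a label sequence is followed by '/'.
    """
    if path == "[]":          # the empty path
        path = ""
    parts = []
    for token in re.findall(r"//|/|[^/]+", path):
        if token == "//":     # finite, possibly empty, sequence of tags
            parts.append(r"(?:[^/]+/)*")
        elif token == "/":    # plain concatenation adds nothing
            continue
        elif token == "_":    # any single tag
            parts.append(r"[^/]+/")
        else:                 # a concrete tag l
            parts.append(re.escape(token) + "/")
    return re.compile("".join(parts))


def is_instance(path: str, labels: list) -> bool:
    """True if the label sequence t(v_1)/.../t(v_n) is accepted for `path`."""
    word = "".join(label + "/" for label in labels)
    return pl_to_regex(path).fullmatch(word) is not None


print(is_instance("/bank//number",
                  ["bank", "checkingAccounts", "account", "number"]))
```

A regular expression is used here purely for brevity; an explicit automaton would be needed to read one tag at a time during the tree traversal.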
Definition 2. Instance of a path P over t: Let P be a path in PL, A_P the finite-state automaton defined according to P, and L(A_P) the language accepted by A_P. Let I = v_1/…/v_n be a sequence of positions such that each v_i is a direct descendant of v_(i-1) in t. Then I is an instance of P over t if and only if the sequence t(v_1)/…/t(v_n) ∈ L(A_P).
A tuple is formed by the values or position values found at the end of a path instance for a path P over a tree t. The notion of tuple is important for integrity constraints, since tuples compose the values that may be compared and verified. A functional dependency in XML (XFD) is denoted X → Y (where X and Y are sets of paths) and it imposes that for each pair of tuples t_1 and t_2, if t_1[X] = t_2[X] then t_1[Y] = t_2[Y]. An XFD has a single path on the right-hand side and possibly more than one path on the left-hand side. This approach generalizes the proposals in Arenas et al. (2004), Liu et al. (2003), and Wang and Topor (2005).
The dependency can be imposed in a specific part of the document and, for this reason, a context path can be specified. We distinguish two kinds of equality in an XML tree, namely, value and node equality. Two nodes are value equal when they are roots of isomorphic subtrees. Two nodes are node equal when they are the same position. To combine both equality notions we use the symbol E, which can be represented by V for value equality, or N for node equality.
Definition 3. Functional dependency for XML: Given an XML tree t, a functional dependency for XML (XFD) is an expression of the form (C, ({P_1[E_1], …, P_k[E_k]} → Q[E])) where C is a path representing the context path, ending at the context node; {P_1, …, P_k} is a non-empty set of paths in t and Q is a single path in t, both P_i and Q starting at the context node. The set {P_1, …, P_k} is the left-hand side or determinant of an XFD, and Q is the right-hand side or the dependent path. The symbols E_1, …, E_k, E represent the equality type associated to each dependency path. When the symbols E or E_1, …, E_k are omitted, value equality is the default choice.

Definition 4. Satisfaction of functional dependency for XML: Let T be an XML document and (C, ({P_1[E_1], …, P_k[E_k]} → Q[E])) an XFD. The set S = {C/P_1, …, C/P_k, C/Q} gathers all paths from the root for this constraint. Each instance I of set S is defined by I_S = {t_1, …, t_k, t_q}, where t_i (i in [1..k]) is the tuple formed by the values (or position values, depending on E_i) found following C/P_i[E_i] for instance I, and t_q is the tuple obtained following C/Q[E]. The document represented by tree t satisfies the constraint if and only if, for any two instances of S, namely I_S and I_R, that coincide at least on their prefix C, we have: if the tuples t_i of I_S and I_R are equal with respect to E_i for every i in [1..k], then the tuples t_q of I_S and I_R are equal with respect to E.
Conditional functional dependencies are similar to functional dependencies, but with the peculiarity of having a conditional expression that establishes a restriction associated to the values contained in a database.
Definition 5. Conditional functional dependency for XML: Given an XML tree t, a conditional functional dependency for XML (XCFD) is an expression of the form (C, (Cond, {P_1[E_1], …, P_k[E_k]} → Q[E])) where C, P_1, …, P_k, Q and the equality symbols are defined similarly as in functional dependencies for XML, and Cond is a combination of expressions of the form: expr_1 θ_1 expr_2 θ_2 … θ_(n-1) expr_n. Each expression expr_i is a relational expression and represents a condition. The conditions are connected by logical operators (θ_i for operations ⋀, ⋁); thus, several specific conditions can be imposed in the constraint definition. An expression expr_i is defined as a sentence of the type PC ϕ vc, where PC is a path expression, ϕ is a relational operator (=, ≠, <, >, ≤, ≥) and vc is the expected value.

Definition 6. Satisfaction of conditional functional dependency for XML: Let T be an XML document and η = (C, (Cond, {P_1[E_1], …, P_k[E_k]} → Q[E])) an XCFD. The set S = {C/PC_1, …, C/PC_n, C/P_1, …, C/P_k, C/Q} gathers all paths from the root for constraint η, C/PC_1, …, C/PC_n being the conditional paths. Each instance I of set S is defined by I = {e_1, …, e_n, t_1, …, t_k, t_q}, where each e_i is a value obtained from C/PC_i and t_1, …, t_k, t_q are tuples obtained similarly to functional dependencies for an instance I.
The document represented by tree t satisfies the constraint η if and only if, for any two instances of S, namely I_S and I_R, that coincide at least on their prefix C, we have: if the conditional values e_1, …, e_n satisfy Cond in both I_S and I_R and the tuples t_i of I_S and I_R are equal with respect to E_i for every i in [1..k], then the tuples t_q of I_S and I_R are equal with respect to E.
Example 1: In the context of a banking environment we consider a situation where checking accounts are integrated, as illustrated in Figure 2. There are many kinds of checking accounts (personal, salary, university, business, etc.) and for each one we can find differences in the type of data stored, in the operations and also in the restrictions that can be conditionally applied. For example:
1. If the checking account is of type salary, then it must be an individual personal account. In this situation, it is not possible to associate other clients to this account.
2. Many employees can be associated to a business account, but the same employee cannot be associated to more than one business account.
3. Pre-approved credit is offered to personal accounts depending on the average balance factor, if this factor is greater than 0.
Those rules can be translated into three conditional functional dependencies, respectively:
η_1 = (/bank/checkingAccounts, (/account/type = "salary", {account/number} → account/idClient))
η_2 = (/bank/checkingAccounts, (/account/type = "business", {account/employees/employee/idClient, account/employees/employee/idEmp} → account/number))
η_3 = (/bank/checkingAccounts, (/account/type = "personal" ⋀ /account/averageBalanceFactor > 0, {account/averageBalanceFactor} → account/pre-approvedCredit))
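As a point of reference for what η_1 requires, here is a naive check written directly against a small document. It is not the one-pass attribute-grammar method of this paper; it simply illustrates the semantics. The sample documents and the function name `check_eta1` are assumptions for this sketch, and the document shape (type, number and idClient as children of account) is inferred from the paths of η_1.

```python
import xml.etree.ElementTree as ET


def check_eta1(xml_text: str) -> bool:
    """Naive check of eta_1: within each checkingAccounts context, among
    accounts whose type is "salary", number must determine idClient."""
    root = ET.fromstring(xml_text)              # assumed root element: bank
    for context in root.findall("checkingAccounts"):
        seen = {}                               # number -> idClient
        for account in context.findall("account"):
            if account.findtext("type") != "salary":
                continue                        # condition not met: skip
            number = account.findtext("number")
            client = account.findtext("idClient")
            if seen.setdefault(number, client) != client:
                return False                    # same number, two clients
    return True


GOOD = """<bank><checkingAccounts>
  <account><type>salary</type><number>43523</number><idClient>CC2423</idClient></account>
  <account><type>business</type><number>43523</number><idClient>CC9999</idClient></account>
</checkingAccounts></bank>"""

BAD = """<bank><checkingAccounts>
  <account><type>salary</type><number>43523</number><idClient>CC2423</idClient></account>
  <account><type>salary</type><number>43523</number><idClient>CC9999</idClient></account>
</checkingAccounts></bank>"""

print(check_eta1(GOOD), check_eta1(BAD))
```

The business account in the first document shares a number with the salary account but is ignored, because the condition restricts the dependency to salary accounts only.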

Finite state automata for functional dependencies in XML
We use finite-state automata (FSA) or transducers (FST) to formalize paths in integrity constraints. The input alphabet for the finite-state automata is the set of XML tags. The output alphabet for transducers is composed of our equality symbols (for XFDs) and also of the expected values in conditions with the respective relational operator (for XCFDs). We denote an FSA by a 5-tuple A = (Θ, V, Δ, e, F) where Θ is a finite set of states; V is the alphabet; e ∈ Θ is the initial state; F ⊆ Θ is the set of final states; and Δ: Θ × V → Θ is the transition function. An FST is a 7-tuple A = (Θ, V, Γ, Δ, e, F, λ) such that:
(i) (Θ, V, Δ, e, F) is an FSA;
(ii) Γ is an output alphabet;
(iii) λ is a function from F to Γ indicating the output associated to each final state.
From Definition 3 we know that in an XFD, path expressions C, P_i and Q (i ∈ [1, k]) specify the constraint context, the determinant paths and the dependent path, respectively. These paths define path instances on an XML tree t. To verify whether a path instance corresponds to one of these paths we use the following automata and transducers:
- The context automaton M = (Θ, Σ, Δ, e, F) expresses path C. The alphabet Σ is composed of XML document tags.
- The determinant transducer T' = (Θ', Σ, Γ', Δ', e', F', λ') expresses paths P_i (i ∈ [1, k]). The set of output symbols is Γ' = {V, N} × ℕ* such that V (value equality) and N (node equality) are the equality types to be associated to each path. Each path is numbered because there may be more than one path in the determinant side. Thus, the output function λ' associates a pair (equality, rank) to each final state q ∈ F'.
- Path Q is expressed by the dependent transducer T" = (Θ", Σ, Γ", Δ", e", F", λ"). The set of output symbols is Γ" = {V, N} and the output function λ" associates a symbol V or N to each final state q ∈ F".
For conditional functional dependencies, there is another set of paths, representing the conditions that must be verified. For this purpose, a new transducer is used to formalize the paths in the part Cond, specified in Definition 5:
- All paths contained in the Boolean expressions defined in Cond are expressed by the conditional transducer T_C = (Θ_c, Σ, Γ_c, Δ_c, e_c, F_c, λ_c). The set of output symbols is Γ_c = Σ × {=, ≠, <, >, ≤, ≥} × ℕ* such that the expected values defined in conditions and also the respective relational operation can be associated to each conditional path. Thus, the output function λ_c associates a triple (vc, ϕ, rank) to each final state q ∈ F_c.
A finite-state automaton is a machine that can be in one of a finite number of states and that, under certain conditions, can switch to another state by a transition. When the machine starts working it begins from an initial state; in the case of XML data, this is the state associated with the root node. Figure 3 illustrates FSAs and FSTs for XCFDs η_1 and η_3 defined in Example 1.
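The tuples above map naturally onto a small data structure. The sketch below encodes the machines for η_1 as deterministic transducers; since Figure 3 is not reproduced here, the state names (e0, p0, q0, c0, ...) and the class name `FST` are assumptions, and paths using "//" or "_" would need a richer transition function than this dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FST:
    delta: dict                    # (state, tag) -> next state
    start: str                     # initial state e
    finals: set                    # final states F
    output: dict = field(default_factory=dict)  # lambda: final state -> output

    def run(self, tags):
        """Return the state reached after reading `tags`, or None if stuck."""
        state = self.start
        for tag in tags:
            state = self.delta.get((state, tag))
            if state is None:
                return None
        return state

    def emits(self, tags):
        """Output symbol if `tags` ends in a final state, else None."""
        state = self.run(tags)
        return self.output.get(state) if state in self.finals else None


# Context automaton M for C = /bank/checkingAccounts (an FSA: no output).
M = FST({("e0", "bank"): "e1", ("e1", "checkingAccounts"): "e2"}, "e0", {"e2"})
# Determinant transducer T' for account/number, output (equality, rank).
T_DET = FST({("p0", "account"): "p1", ("p1", "number"): "p2"},
            "p0", {"p2"}, {"p2": ("V", 1)})
# Dependent transducer T" for account/idClient, output equality type.
T_DEP = FST({("q0", "account"): "q1", ("q1", "idClient"): "q2"},
            "q0", {"q2"}, {"q2": "V"})
# Conditional transducer T_C for account/type = "salary", output (vc, op, rank).
T_COND = FST({("c0", "account"): "c1", ("c1", "type"): "c2"},
             "c0", {"c2"}, {"c2": ("salary", "=", 1)})

print(T_DET.emits(["account", "number"]), T_COND.emits(["account", "type"]))
```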

Attribute grammar
The general process for validating integrity constraints in XML documents can be performed with the use of an attribute grammar. Attribute grammars are extensions of context-free grammars that allow specifying not only the syntax, but also the semantics of a language. We consider a context-free grammar G = (V_N, V_T, P, B) where V_N is the set of nonterminal symbols, V_T is the set of terminal symbols, P is the list of productions and B is the start symbol. In order to attach extra information to a symbol, we add semantic rules to its productions. In a semantic rule we can create attributes that may represent anything: a string, a number, a type, a memory location (Aho et al., 1988). Those rules are declarative specifications describing how the attached attributes are computed. Two types of attributes can be found in a semantic rule: synthesized and inherited. Synthesized attributes carry information from the leaves of a tree to its root, while inherited ones transport information in the opposite direction, from root to leaves.
An attribute grammar is a triple GA = (G, A, F) where G is a context-free grammar, A is the set of attributes and F is a set of semantic rules attached to the productions. For X ∈ V_N ∪ V_T, we have A(X) = S(X) + I(X), i.e., A(X) is the disjoint union of S(X), the set of synthesized attributes of X, and I(X), the set of inherited attributes of X. For each production p: X_0 → X_1 … X_n, the set F_p contains the semantic rules that handle the set of attributes of p and describe its semantic features. Consequently, the semantic parsing of a sentence is executed using the set of actions associated to each production rule. In each action definition, the values of attribute occurrences are calculated in terms of other attribute values.
In this work we assume that G is a simple grammar describing any XML tree. To verify integrity constraints, one may augment G with semantic rules, using attributes that carry the information needed for the validation. Consider a context-free grammar G with three generic production rules: (i) a production for the root, in which α_1 … α_m are the direct descendant nodes of the root node; (ii) a production for an internal node, which must have at least one direct descendant; and (iii) a production for a leaf node.
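One plausible way to write the three generic rules is sketched below. The nonterminal names ROOT and A, and the choice of data as the terminal for a leaf, are assumptions made for this illustration; only the shape of the rules follows from the description.

```latex
\begin{align*}
\text{(i)}\quad   & \mathit{ROOT} \rightarrow \alpha_1 \dots \alpha_m \\
\text{(ii)}\quad  & A \rightarrow \alpha_1 \dots \alpha_m, \quad m \geq 1 \\
\text{(iii)}\quad & A \rightarrow \mathit{data}
\end{align*}
```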
The grammar G can be augmented by semantic rules, containing grammar attributes, that define the exact actions to be performed for integrity constraints validation. The parsing of an XML document is done by a top-down traversal of its tree using open-tag and close-tag events. In the descending direction, the validation process defines the role of each node regarding the constraints being verified by using the finite state automata that formalize their paths. This information is stored in an inherited attribute, since it is calculated in the descending direction. When leaves are reached, an upward trajectory begins in which the encountered values that concern the constraints are treated and stored in synthesized attributes.

Validation of conditional functional dependencies for XML
The conditional functional dependencies validation process receives a set of XCFDs and an XML document. The validation of an XCFD is accomplished using an attribute grammar approach, wherein inherited and synthesized attributes are associated with each node in the XML tree. Each association takes advantage of one of two parser events over the XML tree to be validated: open-tag and close-tag. Considering ω a set of XCFDs to be verified, the traversal of the XML tree is performed according to Algorithm 1.

Algorithm 1 - Validation of conditional functional dependencies
Input:
(i) ω: a set of z XCFDs
(ii) Doc: an XML document
Output: the Boolean value true if the document satisfies the set of XCFDs, otherwise false.
Local Variables:
(i) tg: document tag referring to a node
(ii) inhStack: stack to store tuples of inherited attributes
(iii) synStack: stack to store tuples of synthesized attributes
(iv) InhAttList: z-tuple to organize inherited attributes for XCFDs
(v) SynAttList: z-tuple to organize synthesized attributes for XCFDs
(1) for each XCFD η_i ∈ ω do
(2) build M_ηi, T'_ηi, T"_ηi, Tcond_ηi; // FSA and FSTs for XCFDs
(3) push (NULL, …, NULL) into synStack;
(4) push ({M_η1.e0}, …

Algorithm 1 expresses a non-recursive function that uses stacks to direct the traversal of an unranked XML tree for the verification of z XCFDs. Two stacks are used to organize the association between grammar attributes and tree nodes. The first stack, inhStack, is responsible for storing, for each node that is found (at the open-tag event), a z-tuple containing the inherited attributes that were calculated for all XCFDs at that point. The second one, synStack, is used for saving the z-tuple of the synthesized attributes computed, at the close-tag event, for all constraints during the tree visit. At the end of the tree traversal, the constraints verification is finally computed at the root node using its associated z-tuple in synStack. If no violations were found for any XCFD, then the function calculateResult returns true; otherwise, it returns false.
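The two-stack organization can be sketched as follows. This is a simplified illustration rather than Algorithm 1 itself: it carries a single inherited and a single synthesized value per node instead of z-tuples, and the names `traverse`, `inherit_path` and `collect_leaves` are assumptions. The standard library's `iterparse` is used to obtain open-tag ("start") and close-tag ("end") events.

```python
import io
import xml.etree.ElementTree as ET


def traverse(xml_bytes, inherit, synthesize, root_inh):
    """One pass over the document using an inherited and a synthesized stack."""
    inh_stack = [root_inh]   # sentinel: inherited value "above" the root
    syn_stack = [[]]         # sentinel: will receive the root's result
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":                 # open-tag: compute inherited value
            inh_stack.append(inherit(inh_stack[-1], elem.tag))
            syn_stack.append([])             # slot for the children's results
        else:                                # close-tag: compute synthesized value
            inh = inh_stack.pop()
            children = syn_stack.pop()
            text = (elem.text or "").strip()
            syn_stack[-1].append(synthesize(inh, elem.tag, text, children))
    return syn_stack[0][0]


def inherit_path(parent, tag):
    """Inherited value: the tuple of tags from the root down to this node."""
    return parent + (tag,)


def collect_leaves(inh, tag, text, children):
    """Synthesized value: (path, value) pairs of all leaves below this node."""
    if not children:
        return [("/".join(inh), text)]
    return [pair for child in children for pair in child]


DOC = (b"<bank><account><number>43523</number>"
       b"<idClient>CC2423</idClient></account></bank>")
print(traverse(DOC, inherit_path, collect_leaves, ()))
```

In the setting of this paper, `inherit` would play the part of the conf computation and `synthesize` the part of the tuple construction and comparison, each applied once per XCFD.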

Grammar attributes and their computation
In this section we detail the inherited and synthesized attributes used in Algorithm 1 and their computation. The grammar attributes are responsible for storing values and partial results of treatments and comparisons between the encountered values that concern the defined constraints. One inherited attribute is used for each constraint at each node to assign the role of a node tag with respect to a given XCFD. On the other hand, various synthesized attributes are needed for each constraint at each node, because they are used to create the tuples for the encountered values (determinant and dependent side of the dependencies), to compute and store values referring to the conditions, and to store the result of manipulations and comparisons between dependency values.

Inherited attribute
The inherited attribute used in the XCFD verification is named conf. As shown in Algorithm 1, line 7, it is calculated for each constraint when an opening tag is found. The computation of this attribute uses the automaton and transducers that were built from the path expressions given by an XCFD, and also the conf attribute value associated to the parent node. The conf attribute calculation for each node (at the open-tag event) for each XCFD is specified in Algorithm 2, considering the rules for the XML grammar defined in the section Attribute grammar.

Algorithm 2 - Calculation of conf attribute
Function name: calculateInhAttributes
Input:
(i) A: tag opening (current node)
(ii) ParentConf: attribute conf (from parent node)
(iii) M, T', T", Tcond (FSA and FSTs for an XCFD η)
…
(7) else
(8) conf := {K.q' | Δ_K(q, A) = q'};
(9) return conf;

The conf attribute is calculated according to Algorithm 2 and it stores a set of configurations of type M.e, where M is an FSA or FST, and e is a state of M. Line (1) of Algorithm 2 specifies the case where the current node is the tree root. In this case the attribute conf is calculated by initializing the FSA M (for the context path) and executing a transition using the root tag. If the current node is not the root (line (3)), then we must check whether a final state of FSA M has been reached. If this is the case, then transitions for FSTs T', T" and T_C from their initial states may be executed to check whether this node is in their paths. The corresponding configurations are then stored in conf. Figure 4 shows the computation of attribute conf for XCFD η_1 using the corresponding FSA and FSTs depicted in Figure 3.
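Since only the last lines of the algorithm are available here, the following is a sketch of one reading of the description, not the algorithm itself. Configurations are modeled as (machine name, state) pairs; the dictionary layout, the state names and the per-configuration treatment of the context final state are all assumptions.

```python
def calculate_conf(tag, parent_conf, machines, is_root=False):
    """Compute conf for the node opened by `tag`.

    machines: name -> {"delta": {(state, tag): state}, "start": s, "finals": set};
    the entry named "M" is the context automaton.
    """
    def step(name, state):
        return machines[name]["delta"].get((state, tag))

    if is_root:                          # root: start M and read the root tag
        nxt = step("M", machines["M"]["start"])
        return {("M", nxt)} if nxt is not None else set()

    conf = set()
    for name, state in parent_conf:
        if name == "M" and state in machines["M"]["finals"]:
            # The parent is a context node: start every transducer on this tag.
            for other in machines:
                if other != "M":
                    nxt = step(other, machines[other]["start"])
                    if nxt is not None:
                        conf.add((other, nxt))
        else:                            # conf := {K.q' | Delta_K(q, A) = q'}
            nxt = step(name, state)
            if nxt is not None:
                conf.add((name, nxt))
    return conf


# Hypothetical machines for eta_1 (state names are illustrative only).
ETA1 = {
    "M": {"delta": {("e0", "bank"): "e1", ("e1", "checkingAccounts"): "e2"},
          "start": "e0", "finals": {"e2"}},
    "T'": {"delta": {("p0", "account"): "p1", ("p1", "number"): "p2"},
           "start": "p0", "finals": {"p2"}},
    'T"': {"delta": {("q0", "account"): "q1", ("q1", "idClient"): "q2"},
           "start": "q0", "finals": {"q2"}},
    "Tc": {"delta": {("c0", "account"): "c1", ("c1", "type"): "c2"},
           "start": "c0", "finals": {"c2"}},
}

conf = calculate_conf("bank", None, ETA1, is_root=True)
for tag in ["checkingAccounts", "account", "number"]:
    conf = calculate_conf(tag, conf, ETA1)
print(conf)
```

Walking down bank, checkingAccounts, account and number leaves only the determinant transducer in a live configuration, which is how the role of the number node with respect to η_1 is recognized.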

Synthesized attributes for leaf nodes
When a leaf node is found, it is necessary to verify whether the data contained in this node concerns any dependency being verified and, if so, the data is collected in synthesized attributes. To define the synthesized attributes, we recall that an XCFD has the form (C, (Cond, {P_1[E_1], …, P_k[E_k]} → Q[E])). Initially, we define the attributes ds_i, i in [1..k], to store values for the paths P_i, the attribute dc to save values concerning path Q, and dcond_j, j in [1..n], to store the values obtained from the conditional paths in Cond. Also, an attribute inters is defined to gather all values found for a constraint in tuples <lcond, ldep>, where lcond = <dcond_1, …, dcond_n>, and ldep is a tuple <lds, dc> that stores the determinant part and the dependent part of the dependency (respectively l_1 and l_2). For this purpose, lds = <ds_1, …, ds_k>.
An extra attribute, called c, is defined to store a Boolean value representing the result of the validation for a context. At a leaf node, this result is not calculated yet. Synthesized attributes are grouped in a structure specified by (dcond_j, ds_i, dc, {inters}, c). Empty values are filled with the symbol ε as detailed in Algorithm 3.

Algorithm 3 - Calculation of synthesized attributes for leaf nodes
Function name: calculateSynAttributesLeaf
Input:
(i) A: tag closing (current node)
…

… the parent node. After closing all child nodes, for the node labeled account, we have inters = {<<true>, <<43523>, CC2423>>}. At the node labeled checkingAccounts all intersection sets are combined and, as it is a context node, all tuples contained in the intersection set are verified (according to Definition 6). If no violations are identified, the result of the verification is true, and it is stored in attribute c.

Auxiliary algorithms
This section defines the auxiliary algorithms used in the previous main algorithms. In line (22) of Algorithm 3, a function eval is used to evaluate a relational expression. This function is detailed in Algorithm 5. It can be seen in lines (16-18) of Algorithm 4 that when a given node is reached in the XML document and, at the same time, a final state of the context path is reached, a check is executed to ensure that the XCFD instances respect the value constraints specified in the condition expression. The function validateCondition uses the intersection values (from the attribute inters associated to the current node), verifies whether each tuple is complete, and resolves the values contained in the conditional part of the tuple. Algorithm 5 specifies this process.
…
for each tuple t_j in curInters (j ≠ i) do
(7) newInters := newInters ∪ combine(t_i, t_j);
(8) return newInters

When an empty value is found in an intersection tuple, we must try to replace it by looking it up in all other intersection tuples that are in the same set for a node. This is shown in line (6) of Algorithm 7. For two intersection tuples, the combination between them is defined in Algorithm 8.
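The two helper ideas mentioned here, evaluating a relational expression and filling empty slots of one intersection tuple from another, can be sketched as follows. The bodies of the corresponding algorithms are not available in this text, so the numeric coercion in `eval_condition`, the flat tuple layout (dcond, ds, dc) and the rule used by `combine` are assumptions; only the double loop of `merge_inters` mirrors the surviving pseudocode lines.

```python
import operator

EPS = None  # stands for the empty value ε

OPS = {"=": operator.eq, "≠": operator.ne, "<": operator.lt,
       ">": operator.gt, "≤": operator.le, "≥": operator.ge}


def eval_condition(value, op, expected):
    """Evaluate `value op expected`, comparing numerically when both parse."""
    try:
        left, right = float(value), float(expected)
    except ValueError:
        left, right = value, expected      # fall back to string comparison
    return OPS[op](left, right)


def combine(t1, t2):
    """Fill the empty slots of t1 with the corresponding values of t2."""
    return tuple(b if a is EPS else a for a, b in zip(t1, t2))


def merge_inters(cur_inters):
    """newInters := union of combine(t_i, t_j) over all pairs with j != i."""
    new_inters = set()
    for i, ti in enumerate(cur_inters):
        for j, tj in enumerate(cur_inters):
            if j != i:
                new_inters.add(combine(ti, tj))
    return new_inters


partial = [(True, "43523", EPS), (EPS, EPS, "CC2423")]
print(merge_inters(partial))
print(eval_condition("0.5", ">", "0"))
```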

Framework for XML integrity constraints validation with conditions
We propose a general framework to validate integrity constraints for XML data. In this section, the architectural design for this environment is treated. The framework is based on a homogeneous formalism used to express different integrity constraints and it is expected to validate not only traditional and conditional functional dependencies, but also traditional and conditional inclusion dependencies. The software is being developed in Java, and the choice of this technology is justified by its portability between different platforms. The component used to manipulate a set of XML documents is SAXParser and the server used is Tomcat, which allows integrating Java console and web applications. This research aims to develop software in which only non-proprietary APIs, that is, open-source components, are used.
Since the same formalism, based on path expressions and evaluated using FSAs and FSTs, is used to define and validate integrity constraints, design patterns are very useful to define the software architecture with the purpose of facilitating code reuse and flexibility, as discussed in Kuchana (2004) and Gamma et al. (1994). We use UML diagrams to represent different views of our system model. Figure 6 partially illustrates the Class Diagram outlining the integrity constraints concerned. The dashed lines mark dependencies that are not yet implemented. The AbstractConstraint class defines the common characteristics of all constraints and is designed to be a base class for integrity constraints.
Figure 7 shows the organization of the application packages for XML validation. The proposed environment applies the MVC (Model-View-Controller) architectural design pattern, which is, in our case, useful to separate the data model and its validation rules from the user interface. This approach helps to improve communication among developers, to increase the understanding of the contents of each folder, and allows better organization of the application.
The Model package contains the application object and is divided into three other sub-packages. The Validator package is responsible for organizing the constraints validation. The Constraints package includes all basic features of restrictions. Synthesized and inherited attributes are defined specifically for each constraint type. The Assistant package contains helper classes used for building temporary structures, namely stacks, lists and hash tables. The Automata package includes all classes that involve the definition of finite automata and their corresponding operations. The Controller package contains classes for specifying actions used by Struts 2 to perform validation operations through the web interface and to redirect actions. The View package implements the system interface and allows the user to interact through the graphical user interface (GUI) or browser user interface (BUI). For usability, the user can insert, edit or remove constraints for validation, and reuse the XML file sent in the last upload, avoiding unnecessary bandwidth use. It is also possible to view detailed results of the validation, thus creating a standard for the application.
In this project, the behavioral pattern Observer is implemented, where an object (called the subject) maintains a list of its dependents (called observers). This design makes the application extensible, so that it can accommodate new types of integrity constraints that are independent from each other. The Observer pattern is also a key part of the Model-View-Controller (MVC) pattern.
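The subject and observer roles can be sketched with a SAX content handler that forwards open-tag and close-tag events to independent validators. The framework itself is written in Java; this Python version only illustrates the pattern. The class name ValidationSaxParser follows the subject named in the description of Figure 8, while `ConstraintObserver`, `attach`, `detach`, `on_open` and `on_close` are hypothetical names.

```python
import xml.sax


class ConstraintObserver:
    """Hypothetical observer: one instance per constraint validator."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def on_open(self, tag):
        self.events.append(("open", tag))

    def on_close(self, tag):
        self.events.append(("close", tag))


class ValidationSaxParser(xml.sax.ContentHandler):
    """Subject: keeps a dynamic list of observers and notifies each of them."""

    def __init__(self):
        super().__init__()
        self._observers = []

    def attach(self, observer):
        self._observers.append(observer)

    def detach(self, observer):
        self._observers.remove(observer)

    def startElement(self, name, attrs):      # open-tag event
        for observer in self._observers:
            observer.on_open(name)

    def endElement(self, name):               # close-tag event
        for observer in self._observers:
            observer.on_close(name)


subject = ValidationSaxParser()
xcfd_validator = ConstraintObserver("XCFD")
xfd_validator = ConstraintObserver("XFD")
subject.attach(xcfd_validator)
subject.attach(xfd_validator)
xml.sax.parseString(b"<bank><account/></bank>", subject)
print(xcfd_validator.events)
```

Because each observer keeps its own state, a new constraint type can be added by attaching another observer without touching the existing validators.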

Conclusions
The research method employed in this work applies basic concepts from database and compiler theory to create a set of algorithms that validate conditional functional dependencies in XML documents. It also uses an analytical approach that intends to explain how XCFD validation can be performed. The materials used here involved the XML SAXParser, the Tomcat server, and theoretical computer science concepts such as functional dependencies, attribute grammars and design patterns such as Observer. With these tools we were able to define algorithms capable of validating any well-defined conditional functional dependency in an XML document, as presented in this article.
An attribute grammar can be defined for the validation of integrity constraints over XML data by performing various annotations and calculations that are associated to the tree nodes during one tree traversal of an XML document. The main advantage of this proposal is that it is based on a generic method founded on finite automata and grammar attributes, which can be adapted to the validation of other types of restriction. The validation of conditional functional dependencies allows the quality of the XML data to be analyzed under specific conditions imposed on the data (semantic imposition), which can be quite useful in data integration environments.
The use of the behavioral pattern Observer gives our framework low coupling between the validation classes. Thus, each class can perform its validation process without interference from the validation of other types of constraints, since for every event occurring during the tree traversal each class is individually notified and performs its own response action. Furthermore, this approach gives the opportunity to extend the system to other types of constraints validation and contributes to the identification and management of inconsistent information.
As a continuation of this work, we propose a module that corrects possible inconsistencies raised during the validation process. Those corrections might consider data semantics and error type, and can be formally defined with insertion, exclusion and substitution operations over branches and values. Another future improvement is the processing of integrity constraints validation over XML collections stored in distributed storage, using a map-reduce framework.

Figure 1. Validation structure. Overview of the verification process for a set of XCFDs.

Figure 2. XML tree.Fragment of the XML document containing information about bank accounts.

Figure 7. Packages.Overview of the packages organization.

Figure 8. Framework classes.The class diagram for validators of a set of constraints.

Figure 8 illustrates a class diagram (from package Validator) that models the characteristics of the Observer pattern in this framework. The scenario contains the subject ValidationSaxParser and a set of observers. The subject maintains a dynamic list of observers, and they can change their state when receiving notification from the subject.