Data integration means merging data from several heterogeneous sources. While performing data integration, you have to deal with several issues, such as data redundancy, inconsistency, and duplication.
In this section, we discuss data integration in detail, along with the issues and challenges faced during integration. Further, we discuss the techniques and tools used for data integration.
Content: Data Integration in Data Mining
- What is Data Integration?
- Issues in Data Integration
- Data Integration Techniques
- Data Integration Tools
- Key Takeaways
What is Data Integration?
Data integration merges data from several heterogeneous sources to obtain meaningful, unified data. The sources may include multiple databases, files, or data cubes. The integrated data must be free of inconsistencies, discrepancies, redundancies, and disparities.
Data integration is important because it provides a unified view of scattered data while also maintaining the accuracy of that data. This helps the data-mining program mine useful information, which in turn helps executives and managers make strategic decisions for the betterment of the enterprise.
Issues in Data Integration
While integrating data, we have to deal with several issues, which are discussed below.
1. Entity Identification Problem
Since the data is unified from heterogeneous sources, how do we match the real-world entities across the data? For example, suppose we have customer data from two different data sources. An entity from one source has the attribute customer_id, while the corresponding entity from the other source has customer_number. How is a data analyst, or the system, to recognize that these two attributes refer to the same entity?
Here, schema integration can be achieved using the metadata of each attribute. An attribute's metadata includes its name, its meaning in the given context, its data type, and the range of values it accepts, along with the rules the attribute follows for null, blank, or zero values. Analyzing this metadata prevents errors in schema integration.
Structural integration is achieved by ensuring that the functional dependencies and referential constraints of an attribute in the source system match those of the same attribute in the target system.
For example, suppose that in one system a discount is applied to the entire order, while in another system the discount is applied to each individual item in the order. This difference must be caught before the data from these two sources is integrated into the target system.
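The schema-matching step above can be sketched in code. In this minimal sketch, a hand-built metadata mapping renames each source's attributes onto one canonical schema; all field and source names here are illustrative, not part of any standard API.

```python
# Hypothetical sketch: reconcile two source schemas onto one canonical schema
# using a hand-built metadata mapping (all names are illustrative).

CANONICAL_MAP = {
    "source_a": {"customer_id": "customer_key", "cust_name": "name"},
    "source_b": {"customer_number": "customer_key", "full_name": "name"},
}

def to_canonical(record, source):
    """Rename a record's fields to the canonical schema for its source."""
    mapping = CANONICAL_MAP[source]
    return {mapping.get(k, k): v for k, v in record.items()}

a = to_canonical({"customer_id": 42, "cust_name": "Ada"}, "source_a")
b = to_canonical({"customer_number": 42, "full_name": "Ada"}, "source_b")
print(a["customer_key"] == b["customer_key"])  # -> True: same entity key
```

In a real system the mapping would be driven by the attribute metadata (name, meaning, type, value range) rather than hard-coded, but the renaming step itself looks the same.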
2. Redundancy and Correlation Analysis
Redundancy is one of the big issues in data integration. Redundant data is data that is unimportant or no longer needed. Redundancy can also arise from attributes that can be derived from other attributes in the data set.
For example, if one data set has the customer's age and another data set has the customer's date of birth, then age is a redundant attribute, as it can be derived from the date of birth.
Inconsistencies in attribute naming also raise the level of redundancy. Redundancy can be discovered using correlation analysis: the attributes are analyzed to detect their interdependence, thereby detecting the correlation between them.
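The correlation check can be sketched as follows: a minimal, stdlib-only Pearson correlation is computed between the two numeric attributes from the age/date-of-birth example, and a near-perfect correlation flags one of them as redundant (the 0.95 threshold is an arbitrary illustration).

```python
# Minimal sketch: flag a numeric attribute pair as redundant when their
# Pearson correlation is close to +1 or -1 (threshold chosen arbitrarily).

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

birth_year = [1990, 1985, 2000, 1978]
age = [2024 - y for y in birth_year]  # age is fully derivable from birth year

r = pearson(birth_year, age)
print(abs(r) > 0.95)  # -> True: keep one attribute, drop the other
```

For categorical attributes, a chi-square test plays the same role that Pearson correlation plays here for numeric ones.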
3. Tuple Duplication
Along with redundancy, data integration also has to deal with duplicate tuples. Duplicate tuples may appear in the resulting data if a denormalized table has been used as a source for data integration.
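Removing exact duplicate tuples is straightforward to sketch: each row is reduced to a hashable key, and only the first occurrence of each key is kept (the sample rows are made up for illustration).

```python
# Sketch: remove exact duplicate tuples while preserving first-seen order.

def dedupe(rows):
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the tuple
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"customer_key": 1, "city": "Pune"},
    {"customer_key": 2, "city": "Delhi"},
    {"customer_key": 1, "city": "Pune"},  # duplicate from a denormalized source
]
print(len(dedupe(rows)))  # -> 2
```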
4. Data Conflict Detection and Resolution
A data conflict means the data merged from different sources does not match. For instance, attribute values may differ across data sets because they are represented differently in each one: the price of a hotel room may be expressed in different currencies in different cities. These kinds of issues are detected and resolved during data integration.
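The currency example above can be sketched by converting every price to one reference currency before merging. The exchange rates below are made up purely for illustration.

```python
# Sketch: resolve a representation conflict by converting every price to one
# reference currency before merging (the rates below are illustrative only).

RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}

def normalize_price(amount, currency):
    """Convert a price to USD so values from different sources are comparable."""
    return round(amount * RATES_TO_USD[currency], 2)

room_a = normalize_price(100, "EUR")   # source A stores prices in euros
room_b = normalize_price(9000, "INR")  # source B stores prices in rupees
print(room_a, room_b)  # both values are now comparable in USD
```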
Data Integration Techniques
1. Manual Integration
This technique avoids the use of automation: the data analyst collects the data, cleans it, and integrates it by hand to produce useful information.
This technique can work for a small organization with a small data set, but it is tedious for large, complex, or recurring integrations, because the entire time-consuming process has to be repeated manually.
2. Middleware Integration
Middleware software is employed to collect information from the different sources, normalize it, and store it in the resulting data set. This technique is adopted when an enterprise wants to integrate data from legacy systems into modern systems.
The middleware acts as an interpreter between the legacy systems and the modern systems, much like an adapter that connects two systems with different interfaces. However, it can be applied only to certain systems.
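The adapter idea can be sketched as follows: each source gets a thin wrapper exposing the same fetch() interface, so downstream integration code needs no per-source logic. All class names, method names, and record formats here are illustrative.

```python
# Sketch of the adapter idea behind middleware integration: each source gets a
# wrapper with a common fetch() interface (all names are illustrative).

class LegacyCsvAdapter:
    """Wraps a legacy export of "id;name" lines."""
    def __init__(self, raw_lines):
        self.raw_lines = raw_lines
    def fetch(self):
        for line in self.raw_lines:
            cid, name = line.split(";")
            yield {"customer_key": int(cid), "name": name}

class ModernApiAdapter:
    """Wraps an already-structured modern source."""
    def __init__(self, records):
        self.records = records
    def fetch(self):
        for r in self.records:
            yield {"customer_key": r["id"], "name": r["name"]}

sources = [
    LegacyCsvAdapter(["1;Ada", "2;Grace"]),
    ModernApiAdapter([{"id": 3, "name": "Edsger"}]),
]
merged = [row for src in sources for row in src.fetch()]
print(len(merged))  # -> 3 rows in one normalized shape
```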
3. Application-Based Integration
This technique makes use of a software application to extract, transform, and load data from the heterogeneous sources. The application also makes data from disparate sources compatible, easing its transfer from one system to another.
This technique saves time and effort, but it is a little complicated, as designing such an application requires technical knowledge.
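The extract-transform-load flow such an application performs can be sketched in a few lines; the function names, sample records, and in-memory "target" below are all hypothetical stand-ins for real source and target systems.

```python
# Minimal ETL sketch (all names illustrative): extract from two sources,
# transform to a common schema, and load into an in-memory target.

def extract():
    src1 = [{"customer_id": 1, "spend": "120.5"}]      # e.g. from a database
    src2 = [{"customer_number": 2, "spend": "80"}]     # e.g. from a flat file
    return src1, src2

def transform(src1, src2):
    """Unify field names and coerce types into one common schema."""
    rows = [{"customer_key": r["customer_id"], "spend": float(r["spend"])}
            for r in src1]
    rows += [{"customer_key": r["customer_number"], "spend": float(r["spend"])}
             for r in src2]
    return rows

def load(rows, target):
    target.extend(rows)

target = []
load(transform(*extract()), target)
print(target)
```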
4. Uniform Access Integration
This technique integrates data from even more disparate sources, but here the location of the data is not changed: the data stays in its original location.
The technique creates only a unified view that represents the integrated data. No separate storage is required, because only the integrated view is presented to the end user.
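A rough sketch of this idea: the "view" below is a generator that merges the sources lazily when queried, leaving the underlying data where it is rather than copying it into new storage (the source names and rows are made up).

```python
# Sketch: a "virtual view" that leaves data in place and merges it lazily
# only when queried, instead of copying it into separate storage.

sources = {
    "crm":   [{"customer_key": 1, "name": "Ada"}],
    "sales": [{"customer_key": 2, "name": "Grace"}],
}

def unified_view():
    """Generator: nothing is copied until the view is actually iterated."""
    for rows in sources.values():
        yield from rows

print(sum(1 for _ in unified_view()))  # -> 2; the source lists are untouched
```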
5. Data Warehousing
This technique is loosely related to uniform access integration, with the difference that the unified view is materialized in physical storage. This allows the data analyst to handle more complex queries.
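To contrast with uniform access, this sketch materializes the unified view into the warehouse's own storage in a one-time load step, so that queries run against the stored copy (the sources and figures are illustrative).

```python
# Sketch contrasting with uniform access: the warehouse *materializes* the
# unified view into its own storage, so queries hit a single stored copy.

sources = {
    "crm":   [{"customer_key": 1, "spend": 120.5}],
    "sales": [{"customer_key": 2, "spend": 80.0}],
}

# One-time load step: copy everything into the warehouse's storage.
warehouse = [row for rows in sources.values() for row in rows]

# Queries now run against the stored copy only, not the live sources.
total_spend = sum(r["spend"] for r in warehouse)
print(total_spend)  # -> 200.5
```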
Data Integration Tools
- An on-premise data integration tool integrates data from local sources and uses middleware software to connect to legacy databases.
- An open-source data integration tool is the best option if you want to avoid expensive enterprise solutions, but you will have to handle the security and privacy of your data yourself.
- A cloud-based data integration tool provides an 'integration platform as a service'.
Key Takeaways
- Data integration merges data from heterogeneous sources.
- Data integration has to deal with challenges such as redundant data, duplicate data, inconsistent data, and legacy systems.
- Data integration can be performed manually, using middleware, through an application, via uniform access, or using data warehousing.
- Several tools are available on the market for performing data integration.
So, this is all about data integration: the issues faced while integrating data, the techniques used, and the data integration tools.