Thursday, July 1, 2010

Data Warehouse in Agricultural

A Data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. In other words Data warehouse is a database that is used to hold data for reporting and analysis.

Economic foundation and productivity growth depends on agricultural sectors. Agriculture is the driving force behind the way of live and source of earnings for the majority of peoples. More than 60 percents of population are living in rural areas and the majority are farmers. The rural communities as a main producer for country food productivity and food security earn only 11 percents of Gross Domestic Product (GDP). The arrival of information age guides this country to new development strategies.

National Electronics and Computer technology Center (NECTEC) in collaboration with the Ministry of Agriculture, has launched "Agriculture Information Network" as a response to the unmet information requirements of the agricultural sector. Farmers should gain benefit from the contents provided which include risk assessment, agriculture warning system and agricultural knowledge base, which aim to improve technology, productivity, income and stability of India agriculture sector through the age of Information Technology. The data warehouse consists of common databases and geo-spatial databases from various departments and organizations in the country and abroad. Farmers can get access to the contents through Internet by themselves or from groups of professional people called "Information Brokers".



Keywords: Data Warehouse, Agriculture, IT





1. Introduction

A Data warehouse [1] is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. In other words Data warehouse is a database that is used to hold data for reporting and analysis.



Goals of Data Warehousing

* To facilitate reporting as well as analysis
* Maintain an organizations historical information
* Be an adaptive and resilient source of information
* Be the foundation for decision making



Data Warehouse Architecture

Data warehouse Architecture comprises of

* Operational source systems
* A data staging area
* One or more conformed data marts
* A data warehouse database



Operational Source Systems

Operational source systems [1] are developed to capture and process original business transactions. These systems are designed for data entry, not for reporting, but it is from here the data in data warehouse gets populated.



Data Staging Area

Data staging area is where the raw operational data is extracted, cleaned, transformed and combined so that it can be reported on and queried by users. This area lies between the operational source systems and the user database and is typically not accessible to users.



Data staging is a major process that includes the following sub procedures:

* Extraction

The extract step is the first step of getting data into the data warehouse environment. Extracting means reading and understanding the source data, and copying the pas that are needed to the data staging for further work.

* Transformation

Once the data is extracted into the data staging area, there are many transformation steps, including



1. Cleaning the data by correcting misspellings, resolving domain conflicts, dealing with missing data elements, and parsing into standard formats.

2. Purging selected fields from the legacy data that are not useful for data warehouse.

3. Combining data sources by matching exactly on key values or by performing fuzzy matches on non-key attributes.

4. Creating surrogate keys for each dimension record in order to avoid dependency on legacy defined keys, where the surrogate key generation process enforces referential integrity between the dimension tables and fact tables.

5. Building the aggregates for boosting the performance of common queries.

* Loading and indexing

At the end of transformation process, the data is in the form of load record images. Loading in the data warehouse environment usually takes the form of replicating the dimensional tables and fact tables and presenting these tables to bulk loading facilitates each recipient data mart. Bulk loading is a very important capability that is to be contrasted with record-at-a time loading, which is far slower. The target data mart must then index the newly arrived data for query performance.



Data Mart

Data mart is a logical subset of an enterprise-wide data warehouse. For example, a data warehouse for a retail chain is constructed incrementally from individual, conformed data marts dealing with separate subject areas such as product sales. Dimensional data marts are organized by subject area such as sales, finance, and marketing and coordinated by data category such as customer, product, and location. These flexible information stores allows data structures to respond to business changes-product line additions, new staff responsibilities, mergers, consolidations, and acquisitions.



Data Warehouse Database

A data warehouse database contains the data that is organized and stored specifically for direct user queries and reports. It differs from an OLTP database in the sense that it is

designed primarily for reads not writes. An OLAP application is a system designed for few but complex (read only) request. An OLTP application is a system designed for many but simple concurrent (and updating) requests.



Metadata

Metadata defines the content and location of the data in the data warehouse, relationships between the operational databases and the data warehouse and the business views of the data in the data in the warehouse as accessible to the end-user tools. Metadata is searched by user to find the subject areas and the definitions of the data.

For decision support, the pointers required to data warehouse are provided by the metadata. Therefore, it acts as logical link between the decision support system application and the data warehouse. Thus, any data warehouse design should assure that there is a mechanism that populates and maintains the metadata repository and that all access paths to data warehouse have metadata as an entry point. In other words there should be no direct access permitted to the data-warehouse data if it does the user metadata definitions to gain the access. Meta data definition can be done by the user in any given data warehousing environment. The software environment as decided by the software tools used will provide a facility for metadata definition in a metadata repository.



OLAP Vs OLTP



OLTP (Online Transactional Processing)

* OLTP servers handle mission-critical production data accessed through simple queries
* Usually handles queries of an automated nature
* OLTP applications consist of a large number of relatively simple transactions.
* Most often contains data organised on the basis of logical relations between normalised tables

• OLAP (Online Analytical Processing)

* OLAP servers handle management-critical data accessed through an iterative analytical investigation
* Usually handles queries of an ad-hoc nature
* supports more complex and demanding transactions
* contains logically organised data in multiple dimensions



2. Warehouse Schema Design

Dimensional modeling is a term used to refer a set of data modeling techniques that have

gained popularity and acceptance for data warehouse implementation. Dimensional modeling is one of the key techniques in data warehousing. Two types of tables are used in dimensional modeling: Fact tables and dimensional tables







Fact Tables

These are used to record actual facts and measures in the business. Facts are numeric data items that are of interest to the business. Example, telecommunication- length of call in minutes, average number of calls.



Dimensional Tables

Dimensional tables establish the context of the facts. Dimensional tables store fields that describe the facts. Example, telecommunication- call origin, call destination. A schema is a fact table plus its related dimensional table.



3. Crucial Decision in Designing a Data Warehouse

The job of designing and implementing a data warehouse [3] is very challenging and difficult one, even though at the same time, there is a lot of focus and importance attached to it. The designer of the data warehouse may be asked by the top management:"take all enterprise data and build a data warehouse such that the management can get answer to all their questions". This is daunting task with responsibility being visible and exciting. But how to get started? Where to start? Which data should be put first? Where is that data available? Which query should be answered? How would bring down the scope of project to something smaller and manageable, yet be scalable to gradually upgrade to upgrade to comprehensive data warehouse environment finally?

The recent trend is to build data marts for before a real large data warehouse is built. People want something smaller, so as to get manageable results before proceeding to the real data warehouse.

RALPH KIMBALL identified a nine step method as follows:

Step 1: Choose the subject matter.

Step 2: Decide the what the fact table represents.

Step 3: Identify and confirm the dimension.

Step 4: Choose the facts.

Step 5: Store precalculation in the fact table.

Step 6: Define the dimension and tables.

Step 7: Define the duration of database and periodicity of updation.

Step 8: Track slowly the change in dimension.

Step 9: Decide the query priorities and query modes.

All the above steps are required before the data warehousing is implemented. The final step or step 10 is implemented a simple data warehouse or data mart. The approach should be ‘from simpler to complex'. First only a few data marts are identified, designed and implemented. A data warehouse then will emerge gradually.

Let us discuss the above mentioned steps in detail. Interaction with the user is essential for obtaining answers to many questions. The user to be interviewed includes top management, middle management, executives, as also operational users, in addition to sales force and marketing teams. A clear picture emerges from the entire project on the data warehousing as to what are their problems and how they can be possibly solved with the help of data warehousing.

4. Various Technology Considerations

The following or technological issues [3] are required to be considered for designing and implementing a data warehouse:

1. The hardware platforms for Data Warehouse

2. DBMS for supporting data warehouse

3. Communication and network infrastructure for a Data Warehouse

4. The system management /operating system platforms

5. The software tools for building, operating and using Data Warehouse



Hardware Platform

Organization normally tend to utilize the already existing hardware platform for data warehouse development however the disk storage requirements for a data warehouse will be significantly large, especially in comparison with single application.

If data warehouse or data mart is small in data size, normal Pentium server will be probably sufficient with not very high reliability standards. However for a regular large data warehouse application the server has to be specialized for the tasks associated with a data warehouse. A mainframe, for example is well suited for this purpose, as a data warehouse server. What are the features required for a successful data warehouse server? Firstly it should be able to support large data volume and complex query processing. In addition, it has to be highly scalable. As the user population keeps on growing, the network traffic and the access traffic increase significantly. Therefore, the requirement of data warehouse server is the scalable high performance for data loading and ad hoc query processing as well as the ability to support large database in a reliable and efficient manner. If the querying is going to be on a large public data network then multiprocessor configuration will be required for parallel query processing. In case of a complex server of configuration with multiple processors and large I/O bandwidth a proper balance needs to be made between I/O and processing power.



DBMS Selection

Next to hardware solutions a factor most critical is the database selection. This determines the speed performance of the data warehousing environment. The requirement Of a DBMS for data warehousing and requirement are scalability and high volume storage and processing and throughput in traffic.

The majority of established RDBMS vendors have implemented various degree of parallelism in their products. Even though all the vendors have implemented various degrees of parallelism in their products. Even though all the well known vendors-IBM, ORACLE SYBASE-support parallel database processing, some of them have improved their architectures so as to better suit the specialized requirement of the data warehouse. The RDBMS products provide additional modules for OLAP cubes. The correct choice of OLAP server DB server and web server can be made by the designer or user of Data warehouse depending on the requirement.



Communication and Networking Infrastructure

Data warehouse can be internet enabled or intranet enabled as the choice may be. If web enabled the networking is taken care by the internet. If only Intranet based then the appropriate LAN operational environment should be accessible to all the identified users. Thus network expansion may be required as per the needs. In web enabled data warehouses, issues of security privacy and accessibility need to be considered carefully .Accordingly web enablement facilities should be ensured in the software tools used for data warehouse development.



Stages in Implementation

A data warehouse cannot be purchased and installed. Its implementation requires the integration of implementation of many products. Following are the steps of the Data Warehouse implementation:

Step 1: Collect and analyze business requirement.

Step2: Create a data model and physical design and data warehouse after deciding the

appropriate hardware platform.

Step 3: Define the data sources

Step 4: Choose the DBMS and software platform for data warehouse.

Step 5: Extract the data from operational data sources, translate it, clean-up and load into the

data warehouse model or data mart.

Step 6: Choose database access and reporting tools.

Step 7: Choose database connectivity software.

Step 8: Choose the data analysis (OLAP) and presentation (client GUI) software.

Step 9: Keep refreshing the data warehouse periodically.



Access Tools

With the exception of SAS(of SAS institute), all the Data Warehouses /OLAP vendors are not currently providing comprehensive single-window software tools capable of handling all aspects of data warehousing project implementation .SAS alone meets the requirement largely independently as it has its own database internally with a capability of import data from any vendor DBMS software. Therefore one can implement a data warehousing and data mining solution independently with SAS.

The best way to choose a group of tools is to understand the capability and compatibility of different type of access to the data and reporting by selecting best tool in market for that kind of access. The types of access and reporting are as follows:

* 1. Time series analysis
* 2. Data visualization, graphing, charting and pivoting
* 3. Complex textual search (text mining)
* 4. General stastical analysis.
* 5. Artificial intelligence techniques for hypothesis testing, trends discovery, identification and validation of data clusters and segments(also useful for data mining)
* 6. Mapping of specifial information into geographic information system
* 7. Ad hoc user-specific queries
* 8. Predefined repeatable queries
* 9. Drilling down interactically
* 10. Reporting the analysis by drilling down
* 11. Complex queries with multi-table forces, multilevel sub-queries, sophisticated search criteria.

In some applications, the user requirement may exceed the capability of tools. A number of query tools are available in the market today which enables an ordinary user to build customized reports by easily composing and executing ad hoc queries without any necessity to have the knowledge of the underlying design details or data base technology, SQL, or even the data model



5. Its Applications in Agriculture



Project: Agriculture Information System Network (AGRISNET)

Department of Agriculture and Cooperation (DAC) [2] have taken steps to establish "Agricultural Information System Network (AGRISNET)" in collaboration with NIC. The Proposal recommends (i) the state-of-the-art IT infrastructure requirements to establish AGRISNET as the INTRANET over NICNET, (ii) development of databases and information systems for decision support for evaluation, monitoring and policy formulations, and (iii) human resources development, (iv) multi-media based training and demonstration of transfer of technology to strengthen Farm Research and Education using broadcast VSATs, (v) special interest groups in respect of subjects, problems, programmes, schemes, etc, and above all, to make Indian Agriculture on-line for INTERNET and INTRANET access through AGRISNET Nodes. AGRISNET Nodes are envisaged to be established at

• DAC Hqrs (Krishi Bhawan),

• DAC Attached Offices and its regional offices,

• DAC Subordinate Offices and its regional units,

• DAC Public Sector Undertakings (NSC&SFCI) and sub-units,

• DAC Autonomous Organizations,

• Apex Cooperative Organizations

• State Agriculture Departments

• NCT/UT Agriculture Departments

• District Agriculture Offices and

• Block Agriculture offices

In this direction, IFFCO has taken up a project in association with Indian Space Research Organisation (ISRO) to utilise satellite based remote sensing data and Geographical Information Systems (GIS). Attention may be drawn to the fact that the developed countries have been utilising precision farming with the help of IT tools for a long time. While this will take a long time for our country due to small holdings, it is to be noted that GIS has an invaluable role to play even in the existing conditions. Remote sensing and GIS information can provide warnings on evolving crop stresses, crop vigour, etc.

The IFFCO-ISRO GIS project extends support for efficient and timely availability of IFFCO's fertiliser to farmers though better logistics & efficient operations. It endeavours to provide farmers' advisory services to provide decision support to farmers on land related issues, weather, etc. In addition to the GIS based services, effort is being made to create databases that contain information of interest to the farmers. These include recommendation on package of practices for major cereals, pulses, horticulture, floriculture and animal husbandry, etc.Information on all the inputs such as seeds, fertiliser, sources, current availability, prices, availability of credit, alternatives available and terms and conditions, etc. are sought to be provided. An important service envisaged is to provide access to the nearest expert in case of stress or any other problem witnessed in the crops. Facilities are sought to be provided to encourage and share farm experiences by forging various crop forums. Many of the agricultural extension services are also proposed to be made online using aspects of multimedia.

In order to encourage farmers to obtain best possible price, information on various agricultural output markets (mantis) is also being provided. The objective of this activity is to provide status of price at different mandies to facilitate farmer to move his produce to the mandi where he can expect better price. Other areas of interest to farmers such as distance education, location specific news, etc. are also planned. Access to other related sites of interest such those relating to courts, health, etc. are also sought to be provided.



6. Conclusions

Analytical exploration of vast amount of agricultural data can best be support by appropriate application of Data Warehousing and OLAP technologies. A Data Warehouse provides efficient and reliable structure of storage for vast amount data while OLAP techniques provide mechanisms for analysis of this data.



7. References

[1] Data warehouse and its applications in Agriculture, Anil Rai, Indian Agricultural Statistics Research Institute Library Avenue, New Delhi.

[2] Information Technology in Agriculture, S.C. Mittal.

[3] Data Warehousing concepts, Techniques, Products and Applications, C.S.R.Prabhu.

No comments:

Post a Comment