FAQ
1. What is the disadvantage of File Management System over DBMS?
Ans:
Some of the disadvantages of file management system over database management system are:
- Data redundancy and inconsistency
- Difficulty in integrating data into new enterprise-level applications because of varying formats
- Lack of support for concurrent updates by multiple users
- Lack of inherent security
2. Are relational databases the only possible type of database models?
Ans:
No. Apart from the relational model, other models include the network and hierarchical models. However, these two are now largely obsolete; relational and object-oriented models are preferred today.
3. What is referential integrity and how is it achieved in a relational database?
Ans:
Referential integrity is a feature of a DBMS that prevents the user from entering inconsistent data. It is mainly achieved by defining foreign key constraints on tables.
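As a small illustration (using Python's built-in sqlite3 module; the table and column names are invented for this sketch), a foreign key constraint rejects a row that would break referential integrity:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")          # ask SQLite to enforce foreign keys
conn.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE emp (
                    emp_id  INTEGER PRIMARY KEY,
                    name    TEXT,
                    dept_id INTEGER REFERENCES dept(dept_id))""")

conn.execute("INSERT INTO dept VALUES (10, 'Sales')")
conn.execute("INSERT INTO emp VALUES (1, 'Asha', 10)")       # consistent: dept 10 exists
try:
    conn.execute("INSERT INTO emp VALUES (2, 'Ravi', 99)")   # dept 99 does not exist
except sqlite3.IntegrityError as e:
    print("Rejected by foreign key constraint:", e)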
4. What are the higher normal forms?
Ans:
A normal form higher than 3NF is the Boyce-Codd Normal Form (BCNF). BCNF differs from 3NF when a relation has more than one composite, overlapping candidate key.
------------------------------------------------------------------------------------------------------------
5. On which layer of application architecture does data warehouse operate?
Ans:
A data warehouse operates at the data (server-side) tier of the application architecture; it is a server-side repository for storing data.
6. What is a data warehouse?
Ans:
A data warehouse is a huge repository of data used for very complex business analysis.
7. What are the benefits of data warehousing?
Ans:
The main benefit of data warehousing is to keep data in such a form that complex business analysis can be done in minimum amount of time.
8. What are the application areas of a data warehouse?
Ans:
There are various application areas of a data warehouse. Some of these are:
- Airlines
- Meteorology
- Logistics
- Insurance
9. Where can you use data warehouse successfully?
Ans:
It is ideal to implement data warehousing when there is a large amount of historical data that needs to be processed for extensive analytical purposes.
10. When and where is a data mart useful?
Ans:
A data mart helps to provide data for conducting analysis at a specialized level. It is typically used at the strategic business unit level, such as the department level of a business unit.
11. What does historical data support in data warehouse?
Ans:
Historical data supplies both pre-processed and raw (non-pre-processed) data for conducting business analysis.
-----------------------------------------------------------------------------------------------------------
Solutions to Chapter Three Questions
12. What is query redirection?
Ans.
When the available data grows beyond a certain size, partitioning becomes essential. Query redirection means that queries are directed to the appropriate partitions that store the data required by the query.
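A minimal sketch of the idea in Python (the monthly partition-naming scheme is an assumption for illustration): the query is routed only to the partitions that can contain the requested date range.

from datetime import date

def redirect(start: date, end: date) -> list[str]:
    # Collect only the monthly partitions that can contain rows in [start, end].
    parts, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        parts.append(f"sales_{y}_{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return parts

print(redirect(date(2024, 11, 5), date(2025, 1, 20)))
# ['sales_2024_11', 'sales_2024_12', 'sales_2025_01']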
FAQ
13. What are the benefits of OLTP?
Ans:
Online Transaction Processing (OLTP) systems assist in storing current business transactional data. They also support a large number of concurrent users accessing the data at the same time.
14. Why can OLTP not provide historical data for analysis?
Ans:
Data in a data warehouse comes from OLTP systems. However, OLTP data cannot be used directly for analysis because it is not organized to return results quickly from billions of records. In a data warehouse, data is classified into various categories, so results can be returned quickly.
15. Why is the data in the data warehouse not stored in a normalized form as in OLTP?
Ans:
The objective of storing data in a normalized form in OLTP is to reduce redundancy and minimize disk storage. The key objective in a data warehouse is to enhance query response time: the easier the access to data, the better the query response time. Hence, the normalization rules do not matter in a data warehouse.
16. An integral part of OLTP is its support for hundreds of concurrent users. The number of concurrent users supported by a data warehouse is comparable to OLTP. Is this statement true or false? Justify your answer.
Ans:
The statement is false. The number of people involved in data analysis is very low compared to the front-end users who work with transactional data. Moreover, the CPU usage per user is much higher in a data warehouse than for OLTP users.
17. Explain why a data warehouse does not use current or OLTP data for analysis.
Ans:
The main purpose of a data warehouse is to provide historical data to analyze business trends. Therefore, the warehouse needs a snapshot of events over time, not only the current data.
18. What is the advantage of MOLAP as storage model?
Ans:
MOLAP dimensions provide better query performance. Here the contents of the dimension are processed and stored on the Analysis Server and not on a Relational Server.
19. What kind of data does a fact table contain?
Ans:
A fact table contains numeric data.
20. What are the different OLAP storage models?
Ans:
Following are the different OLAP storage models: MOLAP, ROLAP and HOLAP.
21. A data analysis has to be done in the fastest possible means on data stored in Multi-dimensional format. Which storage model is best suited in this case?
Ans:
MOLAP.
---------------------------------------------------------------------------------------------------------
FAQ
22. Name the 2 important parameters that decide the granularity of partitions.
Ans:
Two important factors that decide the granularity of partitions are the overall size of the system and its manageability. The two must be balanced against each other while deciding on a partitioning strategy.
23. Are there any disadvantages of data partitioning?
Ans:
Data partitioning is by and large an advantageous technique for improving performance. However, it increases the implementation complexity and imposes constraints in query design.
24. Can partitions be indexed?
Ans:
Yes, partitions can be indexed if supported by the platform. For example, in Oracle 9i, you can create various types of partitions in indexes.
25. If you have a huge amount of historical data, which is too old to be useful often but cannot be discarded, then can partitioning help?
Ans:
Essentially the answer to this question depends on various factors such as availability of resources and design strategies. However, you can partition data on the basis of the date it was last accessed and keep the historical data on a separate partition. In fact, you can use striping to keep it on a separate disk to improve access speed to the more useful data.
------------------------------------------------------------------------------------------------------------
FAQ
26. Are there any risks associated with aggregation?
Ans:
The main risk associated with aggregates is that of increase in disk storage space.
27. Once created, is an aggregate permanent?
Ans:
No, aggregates keep changing as per the need of the business. In fact, they can be taken offline or put online anytime by the administrator. Aggregates, which have become obsolete, can also be deleted to free up disk space.
28. Can operations such as MIN and MAX value be determined once a summary table has been created?
Ans:
Operations such as MIN and MAX cannot be determined correctly once the summary table has been created. To be available, their values must be calculated and stored at the time the summary table is derived from the base table.
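A small sketch using Python's sqlite3 module (table and column names are illustrative): MIN and MAX are stored as extra columns when the summary is derived, because they cannot be recovered afterwards from SUM and COUNT alone.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10), ("north", 250), ("south", 40), ("south", 90)])

# MIN/MAX are computed at the time the summary is derived from the base table.
conn.execute("""CREATE TABLE sales_summary AS
                SELECT region,
                       SUM(amount) AS total,
                       COUNT(*)    AS n,
                       MIN(amount) AS min_amount,
                       MAX(amount) AS max_amount
                FROM sales GROUP BY region""")

for row in conn.execute("SELECT * FROM sales_summary ORDER BY region"):
    print(row)   # ('north', 260.0, 2, 10.0, 250.0) ...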
29. How much storage increase might be required in the data warehouse system when using aggregates?
Ans:
Storage needs typically increase by a factor of one, and sometimes even two, when aggregates are used.
---------------------------------------------------------------------------------------------------------
FAQ
30. What are conformed dimensions?
Ans:
A conformed dimension is one whose meaning is the same regardless of the fact table that refers to it.
31. What are virtual data marts?
Ans:
Virtual data marts are logical views of multiple physical data marts based on user requirement.
32. Which tool supports data mart based data warehouse architectures?
Ans:
Informatica is commonly used for implementing data mart based data warehouse architectures.
33. Is the data in data marts also historical like in data warehouses?
Ans:
The data in data marts is historical only to some extent. In fact, it is not the same as the data in data warehouse because of the difference in the purpose and approaches of the two.
------------------------------------------------------------------------------------------------------------
FAQ
34. How can you classify metadata?
Ans:
You can classify metadata according to its use as:
- Administrative metadata: metadata used for managing the data, in terms of statistics such as time of creation, access rights, and last access time.
- Structural metadata: metadata that describes the structure of the data.
- Descriptive metadata: metadata that describes the purpose or functionality of the data.
35. What is backroom metadata?
Ans:
Backroom metadata is the metadata related to the process of extracting, cleaning, and loading. It is of use to the DBA and business users but not to the end-user.
36. What is a metadata catalogue?
Ans:
A metadata catalogue is the same as a metadata repository. It is also called metadatabase.
37. Are there any tools for metadata management?
Ans:
Yes, there are various tools that facilitate metadata management. One such Windows based tool is Saphir. SQL Server 2000 also enables metadata management to some extent.
-----------------------------------------------------------------------------------------------------------
FAQ
38. Are system and process management devoid of any manual intervention, considering that the process manager is a tool and not a person?
Ans:
No. Although the system and process managers are themselves tools that automate system and process management in data warehouses, they must be configured and at times handled through manual intervention. These tasks are typically done by the database administrator.
39. Does SQL Server also provide system managers?
Ans:
Yes. SQL Server includes various components that enable system management through its management and security services.
40. What is Oracle Warehouse Builder(OWB)?
Ans:
It is one of the commonly used data warehouse development tools, with advanced features such as support for large databases, automated summary management, and an embedded multidimensional OLAP engine. Unlike SQL Server, which runs only on the Windows platform, OWB can be used on all platforms. It is also faster, more reliable, and more scalable than SQL Server.
41. What is replication?
Ans:
Replication is the process of creating multiple copies of data on the same or different platform and keeping the copies in sync.
------------------------------------------------------------------------------------------------------------
FAQ
42. What is KDD Process?
Ans:
The unifying goal of the KDD process is to extract knowledge from data in the context of large databases. It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge, according to the specifications of measures and thresholds, using a database along with any required preprocessing, subsampling, and transformations of that database.
43. What is Data Visualization?
Ans:
Data visualization presents data in three dimensions and in color to help users view complex patterns. Visualization tools also provide advanced manipulation capabilities to slice, rotate, or zoom objects to identify patterns.
44. What are the constituents of Multidimensional objects?
Ans:
Dimensions and Measures.
45. What does level specify within dimension?
Ans:
Levels specify the contents and structure of the dimension's hierarchy.
46. What is data mining?
Ans:
Data mining is the process of finding new and potentially useful knowledge from data.
47. What does Data Mining Software do?
Ans:
Data mining software searches large volumes of data, looking for patterns that accurately predict behavior, such as which customers are most likely to maintain a relationship with the company. Common techniques employed by data mining software include neural networks, decision trees, and standard statistical modeling.
48. What is Oracle Data Mining?
Ans:
Oracle Data Mining is enterprise data mining software that combines the ease of a Windows-based client with the power of a fully scalable, multi-algorithmic, UNIX server-based solution. Oracle Data Mining provides comprehensive predictive modeling capabilities that take advantage of parallel computing techniques to rapidly extract valuable customer intelligence information. Oracle Data Mining can optionally generate deployable models in C, C++, or Java code, delivering the "power of prediction" to call center, campaign management, and Web-based applications enterprise-wide.
49. How does data mining differ from OLAP?
Ans:
Simply put, OLAP compares and data mining predicts. OLAP performs roll-ups, aggregations, and calculations, and it compares multiple results in a clearly organized graphical or tabular display. Data mining analyzes data on historical cases to discover patterns and uses the patterns to make predictions or estimates of outcomes for unknown cases. An analyst may use OLAP to discover a business problem, and then apply data mining to make the predictions necessary for a solution. An OLAP user can apply data mining to discover meaningful dimensions that should be compared. OLAP may be used to perform roll-ups and aggregations needed by the data mining tool. Finally, OLAP can compare data mining predictions or values derived from predictions.
50. What are some typical data mining applications?
Ans:
Following are some of the data mining applications:
- Customer retention
- Cross selling
- Response modeling / target marketing
- Profitability analysis
- Product affinity analysis
- Fraud detection
------------------------------------------------------------------------------------------------------------
FAQ
51. What is Noisy Data?
Ans:
Noise is a random error or variance in data. It can arise because of:
- Faulty data collection and data entry mistakes, such as typing errors
- Data transmission and storage problems
- Inconsistency in naming conventions
Noise makes data inaccurate for predictions and renders it of little use to mining systems.
52. Which are the major data mining tasks?
Ans:
The main data mining tasks include:
- Classification
- Clustering
- Associations
- Prediction
- Characterization and Discrimination
- Evolution Analysis
53. What are some other Data Mining Languages and standardization of primitives apart from DMQL?
Ans:
Some other Data Mining Languages and standardizations of primitives apart from DMQL include:
- MSQL
- Mine Rule
- Query flocks based on Datalog syntax
- OLE DB for DM
- CRISP-DM
54. Which Data Mining tools are used commercially?
Ans:
Some Data Mining tools used commercially are:
- Clementine
- Darwin
- Enterprise Miner
- Intelligent Miner
- MineSet
55. How can noisy data be smoothed?
Ans:
Noisy data can be smoothed using the following techniques (a small binning sketch follows the list):
- Binning
- Clustering
- Computer/human inspection
- Regression
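A minimal sketch of smoothing by binning in Python, assuming equal-depth bins of size three and replacement by bin means (the values are illustrative):

def smooth_by_bin_means(values, bin_size=3):
    # Sort the values, split them into equal-depth bins, and replace every
    # value in a bin with the bin's mean.
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34]))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]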
---------------------------------------------------------------------------------------------------------
FAQ
56. What are variations to Apriori algorithm?
Ans:
Following are some variations of the Apriori algorithm that improve the efficiency of the original algorithm (a minimal sketch of the base algorithm follows the list):
- Transaction reduction: reducing the number of transactions scanned in future iterations
- Partitioning: partitioning the data to find candidate itemsets
- Sampling: mining on a subset of the given data
- Dynamic itemset counting: adding candidate itemsets at different points during the scan
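For reference, a minimal sketch of the base Apriori loop in Python, covering frequent-itemset generation only (the transactions and minimum support are illustrative):

from itertools import combinations

def apriori(transactions, min_support=2):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = {}
    k_sets = [frozenset([i]) for i in sorted(items)]
    k = 1
    while k_sets:
        # Count the support of each candidate with one pass over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: build (k+1)-item candidates from the surviving k-itemsets.
        keys = list(survivors)
        k_sets = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

txns = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"}, {"milk"}]
print(apriori(txns))   # frequent 1- and 2-itemsets with their support counts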
57. Which is the best approach when we are interested in finding all possible interactions among a set of attributes?
Ans:
The best approach to find all possible interactions among a set of attributes is association rule mining.
58. What is overfitting in a neural network?
Ans:
Overfitting is a common problem in neural network design. It occurs when a network has memorized the training set but has not learned to generalize to new inputs. An overfitted network produces a relatively small error on the training set but a much larger error when new data is presented to it.
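A small illustration in Python (assuming NumPy is available; the data is synthetic): a degree-9 polynomial memorizes ten noisy training points, giving near-zero training error but a larger error on held-out points than a simple linear fit.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, 10)   # the underlying relation is linear
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.1, 10)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")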
59. What is a backpropagation neural network?
Ans:
Backpropagation is a neural network algorithm for classification that employs a method of gradient descent. It searches for a set of weights that model the data so as to minimize the mean squared distance between the network's class prediction and the actual class label of the data samples. Rules may be extracted from trained neural networks to help improve the interpretability of the learned network.
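A minimal sketch of the gradient-descent weight update at the heart of backpropagation, shown for a single sigmoid unit in plain Python (the tiny data set and learning rate are illustrative; a full network repeats such updates layer by layer, from the output back to the input):

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny training set: two inputs with a 0/1 class label (an AND-like pattern).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [random.uniform(-0.5, 0.5) for _ in range(2)]
b, lr = 0.0, 0.5

for epoch in range(5000):
    for (x1, x2), target in data:
        out = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = target - out                      # prediction error
        delta = err * out * (1 - out)           # gradient term for squared error
        w[0] += lr * delta * x1                 # move the weights down the gradient
        w[1] += lr * delta * x2
        b += lr * delta

print([round(sigmoid(w[0] * x1 + w[1] * x2 + b), 2) for (x1, x2), _ in data])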
----------------------------------------------------------------------------------------------------------
FAQ
60. What is the difference between KDD and data mining?
Ans:
KDD refers to the overall process of discovering useful knowledge from data. It also includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data-mining step.
Data mining refers to the application of algorithms for extracting patterns from data without the additional steps of the KDD process. It is essentially the modeling step in the KDD process.
61. Is data stored for data mining different from other operational systems?
Ans:
The two systems differ in the usage patterns associated with data mining. Some of the basic differences are:
- Operational systems support transactional processing whereas data mining systems support analytical processing.
- Operational systems are process-oriented whereas data mining systems are subject-oriented.
- Operational systems are concerned with current data whereas data mining systems deal with historical data.
- Operational systems are updated frequently and have volatile data whereas data mining systems have non-volatile data and are rarely changed.
- Operational systems are optimized for fast updates whereas data mining systems are optimized for fast retrievals.
62. What are the primary goals of the KDD process?
Ans:
The two primary goals of the KDD process are:
- Prediction of unknown or future values
- Description, which focuses on finding human-interpretable patterns
63. Does KDD need to be scalable?
Ans:
KDD should be scalable because of the ever-increasing data in enterprises. Limiting the size of a data mining system will affect the accuracy of predictions.
------------------------------------------------------------------------------------------------------------
FAQ
64. What are different types of Query Answering Mechanisms?
Ans:
Query answering can be classified into two categories based on the method of response:
- Direct query answering: the query is answered by returning exactly what was asked for.
- Intelligent query answering: the intent of the query is analyzed, and generalized, neighborhood, or associated information relevant to the query is provided.
65. Are there any social concerns related to data mining?
Ans:
The most important social issue is the privacy of data pertaining to individuals. KDD poses a threat to privacy: the discovered patterns often classify individuals into categories, revealing confidential personal information. Moreover, it raises very sensitive and controversial issues, such as those involving race, gender or religion. In addition, it may correlate and disclose confidential, sensitive facts about individuals.
66. What is Market Basket Analysis?
Ans:
Market Basket Analysis is one of the most common and useful techniques for data analysis for marketing. The purpose of market basket analysis is to determine what products customers purchase together in a market. This can be helpful to a retailer, merchant or manufacturer or any other type of organization interested in studying consumer buying patterns.
67. What is Visual Data Mining?
Ans:
Visual data mining integrates data mining and data visualization in order to discover implicit and useful knowledge from large data sets.
------------------------------------------------------------------------------------------------------------
I. Note: Answer all the questions.
68. What is Normalization? What are the different forms of Normalization?
The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated, it does not lead to anomalies of data (the student is advised to refer to any book on database management systems for details, if interested).
The idea is to ensure that when combined, the data available is consistent. However, in data warehousing, one may even tend to break a large table into several “denormalized” smaller tables. This may lead to a lot of extra space being used, but it helps in an indirect way: it avoids the overhead of joining the data during queries.
69. Define data warehouse. What are the roles of education in a data warehousing delivery process?
Data Warehouse: In its simplest form, a data warehouse is a collection of key pieces of information used to manage and direct the business for the most profitable outcome. It helps decide, for example, the amount of inventory to be held, the number of employees to be hired, the amount to be procured on loan, and so on.
The above definition may not be precise, but that is how data warehouse systems are. Different authors give different definitions, but we keep this idea in mind and proceed. A data warehouse is a large collection of data and a set of process managers that use this data to make information available. The data can be metadata, facts, dimensions and aggregations. The process managers can be load managers, warehouse managers or query managers. The information made available is such that it allows the end users to make informed decisions.
Roles of education in a data warehousing delivery process:-
This has two roles to play: one, to make people, especially top-level policy makers, comfortable with the concept; the other, to aid the prototyping activity. To take care of the education role, an initial (usually scaled-down) prototype is created and people are encouraged to interact with it. This helps achieve both the aims listed above: the users become comfortable with the use of the system, and the warehouse developer becomes aware of the limitations of the prototype, which can then be improved upon.
70. What are process managers? What are the different types of process managers?
Process Managers: These are responsible for the smooth flow, maintenance and upkeep of data into and out of the database.
The main types of process managers are:-
i). Load manager: to take care of source interaction, data transformation and data load.
ii). Warehouse manager: to take care of data movement, metadata management and performance monitoring.
iii). Query manager: to control query scheduling and monitoring.
We shall look into each of them briefly. Before that, we look at a schematic diagram that defines the boundaries of the three types of managers.
71. Give the architectures of data mining systems.
72. What are the guidelines for a KDD environment?
It is customary in the computer industry to formulate rules of thumb that help information technology (IT) specialists apply new developments. In setting up a reliable data mining environment, we may follow these guidelines so that the KDD system works in the manner we desire.
i). Support extremely large data sets
ii). Support hybrid learning
iii). Establish a data warehouse
iv). Introduce data cleaning facilities
v). Facilitate working with dynamic coding
vi). Integrate with decision support system
vii). Choose extendible architecture
viii). Support heterogeneous databases
ix). Introduce client/server architecture
x). Introduce cache optimization
PART - B
II. Answer any FIVE full questions.
73. With the help of a diagram, explain the architecture of a data warehouse.
The architecture of a data warehouse is indicated below. Before we proceed further, we should be clear about the concept of architecture: it only gives the major items that make up a data warehouse. The size and complexity of each of these items depend on the actual size of the warehouse itself, the specific requirements of the warehouse and the actual details of implementation.
74. Indicate the important function of a Load Manager, Warehouse Manager.
Important function of Load Manager:
i) To extract data from the source (s)
ii) To load the data into a temporary storage device
iii) To perform simple transformations to map it to the structures of the data ware house.
Important function of Warehouse Manager:
i) Analyze the data to confirm data consistency and data integrity .
ii) Transform and merge the source data from the temporary data storage into the ware house.
iii) Create indexes, cross references, partition views etc.,.
iv) Check for normalization.
v) Generate new aggregations, if needed.
vi) Update all existing aggregations
vii) Create backups of data.
viii) Archive the data that needs to be archived.
75. Differentiate between vertical partitioning and horizontal partitioning.
In horizontal partitioning, we simply place the first few thousand entries in one partition, the second few thousand in the next, and so on. This can be done by partitioning by time, wherein all data pertaining to the first month / first year is put in the first partition, that of the next period in the second partition, and so on. Other alternatives are partitioning on different-sized dimensions, partitioning on other dimensions, partitioning on the size of the table, and round-robin partitioning. Each of these has certain advantages as well as disadvantages.
In vertical partitioning, some columns are stored in one partition and the other columns of the same row in a different partition. This can be achieved either by normalization or by row splitting. We will look into their relative trade-offs.
76. What is a schema? Distinguish between facts and dimensions.
A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is interested only in the sales details of that item, whereas a tax practitioner looking at the same data is interested only in the amounts received by the company and the profits made.
The star schema looks like a good solution to the problem of warehousing. It simply states that one should identify the facts, store them in the read-only area, and let the dimensions surround them. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction.
i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.
iii) Ensure that facts do not include those candidates that are actually dimensions, with a set of facts attached to them.
iv) Ensure that dimensions do not include those candidates that are actually facts.
77. What is an event in data warehousing? List any five events.
An event is defined as a measurable, observable occurrence of a defined action. If this definition seems quite vague, it is because it encompasses a very large set of operations. The event manager is software that continuously monitors the system for the occurrence of an event and then takes any action that is suitable (note that the event is a “measurable and observable” occurrence). The action to be taken is normally specific to the event.
A partial list of the common events that need to be monitored are as follows:
i). Running out of memory space.
ii). A process dying
iii). A process using excessive resources
iv). I/O errors
v). Hardware failure
78. What is summary table? Describe the aspects to be looked into while designing a summary table.
The main purpose of using summary tables is to cut down the time taken to execute a specific query. The main methodology involves minimizing the volume of data being scanned each time the query is to be answered. In other words, partial answers to the query are already made available. For example, in the above-cited example of the mobile market, if one expects that
i) citizens above 18 years of age,
ii) with salaries greater than 15,000, and
iii) with professions that involve travelling
are the potential customers, then every time the query is to be processed (maybe every month or every quarter), one will have to look at the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries beforehand, and then combine them as and when the query is raised.
Summary tables are designed by following the steps given below:
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate multiple facts into the summary table.
iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.
79. List the significant issues in automatic cluster detection.
Most of the issues related to automatic cluster detection are connected to the kinds of questions we want answered in the data mining project, or to the data preparation required for their successful application.
i). Distance measure
Most clustering techniques use the Euclidean distance formula as the distance measure (the square root of the sum of the squares of the distances along each attribute axis).
Non-numeric variables must be transformed and scaled before the clustering can take place. Depending on these transformations, the categorical variables may dominate the clustering results, or they may be completely ignored.
ii). Choice of the right number of clusters
If the number of clusters k in the K-means method is not chosen so as to match the natural structure of the data, the results will not be good. The proper way to alleviate this is to experiment with different values for k. In principle, the best k value will exhibit the smallest intra-cluster distances and the largest inter-cluster distances.
iii). Cluster interpretation
Once the clusters are discovered they have to be interpreted in order to have some value for the data mining project.
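A minimal k-means sketch in plain Python on one-dimensional data (the points, k and iteration count are illustrative; in practice one would use a library implementation and experiment with several values of k):

def kmeans(points, k=2, iters=20):
    centroids = points[:k]                       # naive initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest centroid (Euclidean distance in 1-D).
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 8.0, 8.3, 7.9])
print(centroids)   # roughly [1.0, 8.07]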
81. Define data marting. List the reasons for data marting.
The data mart stores a subset of the data available in the warehouse, so that one need not always scan through the entire content of the warehouse. It is similar to a retail outlet. A data mart speeds up queries, since the volume of data to be scanned is much smaller. It also helps to have tailor-made processes for different access tools, to impose control strategies, and so on.
Following are the reasons for which data marts are created:
i) Since the volume of data scanned is small, they speed up the query processing.
ii) Data can be structured in a form suitable for a user access tool.
iii) Data can be segmented or partitioned so that they can be used on different platforms and also different control strategies become applicable.
82. Explain how to categorize data mining systems.
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among other classifications are the following:
a) Classification according to the type of data source mined: this classification categorizes data mining systems according to the type of data handled such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.
b) Classification according to the data model drawn on: this classification categorizes data mining systems based on the data model involved such as relational database, object-oriented database, data warehouse, transactional, etc.
c) Classification according to the kind of knowledge discovered: this classification categorizes data mining systems based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.
d) Classification according to mining techniques used: Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used such as machine learning, neural networks, genetic algorithms, statistics, visualization, database oriented or data warehouse-oriented, etc.
83. List and explain different kind of data that can be mined.
Different kind of data that can be mined are listed below:-
i). Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level.
ii). Relational Databases: A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships.
iii). Data Warehouses: A data warehouse as a storehouse, is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under the same unified schema.
iv). Multimedia Databases: Multimedia databases include video, images, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system.
v). Spatial Databases: Spatial databases are databases that in addition to usual data, store geographical information like maps, and global or regional positioning.
vi). Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis.
vii). World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis and a massive number of users are accessing its resources daily.
84. Give the syntax for task relevant data specification.
Syntax for task-relevant data specification:-
The first step in defining a data mining task is the specification of the task-relevant data, that is, the data on which mining is to be performed. This involves specifying the database and tables or data warehouse containing the relevant data, conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved. DMQL provides clauses for the specification of such information, as follows (a rough SQL analogue follows the list):-
i). use database (database_name) or use data warehouse (data_warehouse_name): The use clause directs the mining task to the database or data warehouse specified.
ii). from (relation(s)/cube(s)) [where(condition)]: The from and where clauses respectively specify the database tables or data cubes involved, and the conditions defining the data to be retrieved.
iii). in relevance to (attribute_or_dimension_list): This clause lists the attributes or dimensions for exploration.
iv). order by (order_list): The order by clause specifies the sorting order of the task relevant data.
v). group by (grouping_list): the group by clause specifies criteria for grouping the data.
vi). having (conditions): The having clause specifies the condition by which groups of data are considered relevant.
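As an illustration only (this is ordinary SQL, not DMQL), the clauses above correspond roughly to the parts of a relational retrieval; the database, table and attribute names below are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")                     # stands in for "use database retail"
conn.execute("CREATE TABLE customers (profession TEXT, income REAL, country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [("teacher", 30000, "India"), ("teacher", 32000, "India"),
                  ("pilot", 90000, "India"), ("pilot", 95000, "US")])

rows = conn.execute("""
    SELECT profession, AVG(income)      -- "in relevance to profession, income"
    FROM customers                      -- "from customers"
    WHERE country = 'India'             -- "where (condition)"
    GROUP BY profession                 -- "group by profession"
    HAVING COUNT(*) >= 2                -- "having (conditions)"
    ORDER BY AVG(income)                -- "order by ..."
""").fetchall()
print(rows)    # [('teacher', 31000.0)]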
85. Explain the designing of GUI based on data mining query language.
A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. But novice users may find a data mining query language difficult to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining.
A data mining GUI may consist of the following functional components:-
a) Data collection and data mining query composition - This component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.
b) Presentation of discovered patterns – This component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.
c) Hierarchy specification and manipulation - This component allows for concept hierarchy specification , either manually by the user or automatically. In addition , this component should allow concept hierarchies to be modified by the user or adjusted automatically based on a given data set distribution.
d) Manipulation of data mining primitives – This component may allow the dynamic adjustment of data mining thresholds, as well as the selection, display and modification of concept hierarchies. It may also allow the modification of previous data mining queries or conditions.
e) Interactive multilevel mining – This component should allow roll-up or drill-down operations on discovered patterns.
f) Other miscellaneous information – This component may include on-line help manuals, indexed search , debugging and other interactive graphical facilities.
86. Explain how decision trees are useful in data mining.
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that decision trees represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language like SQL so that records falling into a particular category may be retrieved.
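A minimal sketch in Python of how a learned tree reduces to readable rules (the attributes, thresholds and class labels are invented for illustration):

def classify(customer):
    # Each path from root to leaf is one rule, e.g.:
    # IF income > 50000 AND age <= 30 THEN class = "likely buyer"
    # The same rule could be written as a SQL WHERE clause to retrieve the records.
    if customer["income"] > 50000:
        if customer["age"] <= 30:
            return "likely buyer"
        return "occasional buyer"
    return "unlikely buyer"

print(classify({"income": 72000, "age": 28}))   # likely buyer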
87 Identify an application and also explain the techniques that can be incorporated in solving the problem using data mining techniques.
Write yourself...
88. Write short notes on:
i) Data Mining Querying Language
ii) Schedule Manager
iii) Data Formatting.
i) Data Mining Querying Language
A data mining language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification and evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.
ii) Schedule manager
Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse. Hence it is more desirable to have specially designed schedulers to manage the operations.
iii) Data formatting
This is the final data preparation step; it represents syntactic modifications to the data that do not change its meaning but are required by the particular modelling tool chosen for the DM task (a small sketch follows the list). These include:
a) Reordering of the attributes or records: some modelling tools require reordering of the attributes (or records) in the dataset, such as putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example).
b) Changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, or replacing special characters with an allowed set of special characters.
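A small Python sketch of the two kinds of formatting step just described (the field names and the 8-character limit are illustrative):

import random

records = [{"age": 25, "income": 30000, "target": "yes"},
           {"age": 40, "income": 52000, "target": "no"},
           {"age": 31, "income": 41000, "target": "yes"}]

# a) reorder the attributes so the target attribute comes last
ordered = [[r["age"], r["income"], r["target"]] for r in records]

# a) randomize the order of the records (often required by neural networks)
random.shuffle(ordered)

# b) constraint-related change: trim strings to a maximum allowed length
trimmed = [[v[:8] if isinstance(v, str) else v for v in row] for row in ordered]
print(trimmed)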
---------------------------------------------------------------------------------------------------------------
89. With a neat diagram, explain the main parts of a computer.
A computer has three basic parts:
i). The central processing unit (CPU), which does all the arithmetic and logical operations. It can be thought of as the heart of any computer, and computers are identified by the type of CPU that they use.
ii). The memory, which holds the programs and data. All the computers that we come across these days are what are known as “stored program computers”: the programs are stored beforehand in the memory, and the CPU accesses these programs line by line and executes them.
iii). The input/output devices, which facilitate the interaction of the users with the computer. The input devices are used to send information to the computer, while the output devices accept the processed information from the computer and make it available to the user.
Diagram:-
91. Briefly explain the types of memories.
There are two types of memories: primary memory, which is embedded in the computer and is the main source of data to the computer, and secondary memory, such as floppy disks, CDs, etc., which can be carried around and used on different computers. Secondary memory costs much less than primary memory, but the CPU can access data only from primary memory. The main advantage of computer memories, both primary and secondary, is that they can store data indefinitely and accurately.
92. Describe the basic concept of databases.
The Concept of Database :-
We have seen in the previous section how data can be stored in a computer. Such stored data becomes a “database” – a collection of data. For example, if all the marks scored by all the students of a class are stored in the computer memory, it can be called a database. From such a database, we can answer questions like: Who has scored the highest marks? In which subject have the maximum number of students failed? Which students are weak in more than one subject? Of course, appropriate programs have to be written to do these computations. Also, as the database becomes very large and more and more data keeps getting added at different periods of time, there are several other problems about “maintaining” these data, which will not be dealt with here.
Since handling of such databases has become one of the primary jobs of the computer in recent years, it becomes difficult for the average user to keep writing such programs. Hence, special languages – called database query languages – have been devised, which make such programming easy; these languages help in getting specific “queries” answered easily.
93. With examples, explain the different views of data.
Data is normally stored in tabular form; unless storage in other formats becomes advantageous, we store data in what are technically called “relations” or, in simple terms, “tables”.
The views are mainly of two types:
i). Simple View
ii). Complex View
Simple view:
- It is created by selecting only one table.
- It does not contain functions.
- DML operations (SELECT, INSERT, UPDATE, DELETE, MERGE, CALL, LOCK TABLE) can be performed through a simple view.
Complex view:
- It is created by selecting more than one table.
- It can contain functions.
- DML operations cannot always be performed through a complex view.
94. Briefly explain the concept of normalization.
Normalization is dealt with in several chapters of any book on database management systems. Here, we take the simplest definition, which suffices for our purpose: no field should have subfields.
Again consider the following student table.
Here, under the field marks, we have 3 subfields: marks for subject1, marks for subject2 and marks for subject3.
However, it is preferable to split these subfields into regular fields as shown below.
Quite often, the original table, which comes with subfields, will have to be modified suitably by the process of “normalization”.
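A minimal sketch in Python of the split just described, with illustrative field names: the composite marks field becomes ordinary fields in the same record.

unnormalised = {"roll_no": 1, "name": "Asha",
                "marks": {"subject1": 78, "subject2": 65, "subject3": 90}}

# The composite "marks" field is flattened into three regular fields.
normalised = {"roll_no": unnormalised["roll_no"],
              "name": unnormalised["name"],
              "marks_subject1": unnormalised["marks"]["subject1"],
              "marks_subject2": unnormalised["marks"]["subject2"],
              "marks_subject3": unnormalised["marks"]["subject3"]}
print(normalised)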
95. Explain the concept of the data warehouse delivery process in detail.
The concept of the data warehouse delivery process:-
This section deals with the data warehouse from a different viewpoint: how the different components that go into it enable the building of a data warehouse. The study helps us in two ways:
i) to have a clear view of the data warehouse building process.
ii) to understand the working of the data warehouse in the context of the components.
Now we look at the concepts in detail:-
i). IT Strategy: The company must have an overall IT strategy, and data warehousing has to be a part of the overall strategy.
ii). Business case analysis: This looks like an obvious thing, but it is most often misunderstood. An overall understanding of the business and the importance of its various components is a must. This ensures that one can clearly justify the appropriate level of investment that goes into the data warehouse design and also the amount of returns accruing.
iii). Education: This has two roles to play - one, to make people, especially top-level policy makers, comfortable with the concept; the other, to aid the prototyping activity.
iv). Business Requirements: As discussed earlier, it is essential that the business requirements are fully understood by the data warehouse planner. This ensures that the warehouse is incorporated adequately into the overall setup of the organization.
v). Technical blueprints: This is the stage where the overall architecture that satisfies the requirements is delivered.
vi). Building the vision: Here the first physical infrastructure becomes available. The major infrastructure components are set up, and the first stages of loading and generation of data start.
vii). History load: Here the system is made fully operational by loading the required history into the warehouse, i.e. whatever data is available over the previous years is put into the data warehouse to make it fully operational.
viii). Ad hoc query: Now we configure a query tool to operate against the data warehouse.
ix). Automation: This phase automates the various operational processes, such as:
a) Extracting and loading of data from the sources.
b) Transforming the data into a suitable form for analysis.
c) Backing up, restoration and archiving.
d) Generate aggregations.
e) Monitoring query profiles.
x). Extending scope: There is no single mechanism by which this can be achieved. As and when needed, a new set of data may be added, new formats may be included, or even major changes may be involved.
xi). Requirement evolution: Business requirements will constantly change during the life of the warehouse. Hence, the processes that support the warehouse also need to be constantly monitored and modified.
96. What are three major activities of a data warehouse? Explain.
Three major activities of a data warehouse are:-
i) Populating the warehouse (i.e. inclusion of data)
ii) Day-to-day management of the warehouse
iii) Ability to accommodate changes
i). The processes to populate the warehouse have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily / weekly basis depending on the quantum of the data population to be incorporated.
ii). The day-to-day management of the data warehouse is not to be confused with the maintenance and management of hardware and software. When large amounts of data are stored and new data is continually added at regular intervals, maintenance of the “quality” of data becomes an important element.
iii). Ability to accommodate changes implies that the system is structured in such a way as to be able to cope with future changes without the entire system being remodelled. Based on these, we can view the processes that a typical data warehouse scheme should support as follows.
97. Explain the extract and load process of a data warehouse.
Extract and Load Process: This forms the first stage of the data warehouse. External physical systems, like the sales counters which give the sales data and the inventory systems that give inventory levels, constantly feed data to the warehouse. Needless to say, the format of this external data has to be monitored and modified before loading it into the warehouse. The data warehouse must extract the data from the source systems, load it into its databases, remove unwanted fields (either because they are not needed or because they are already there in the database), add new fields / reference data and finally reconcile it with the other data. We shall see a few more details of these broad actions in the subsequent paragraphs.
i). A mechanism should be evolved to control the extraction of data, check its consistency, etc. For example, in some systems, the data is not authenticated until it is audited.
ii). Having a set of consistent data is equally important. This is especially so when several online systems are feeding the data.
iii). Once data is extracted from the source systems, it is loaded into a temporary data storage before it is “cleaned” and loaded into the warehouse.
98. In what ways data needs to be cleaned up and checked? Explain briefly.
Data needs to be cleaned up and checked in the following ways :-
i) It should be consistent with itself.
ii) It should be consistent with other data from the same source.
iii) It should be consistent with other data from other sources.
iv) It should be consistent with the information already available in the data ware house.
While it is easy to list out the needs of “clean” data, it is more difficult to set up systems that automatically clean up the data. The normal course is to suspect the quality of data if it does not meet the normal standards of common sense, or if it contradicts the data from other sources or the data already available in the data warehouse. Normal intuition doubts the validity of the new data, and effective measures like rechecking, retransmission, etc. are undertaken. When none of these is possible, one may even resort to ignoring the entire set of data and getting on with the next set of incoming data.
99. Explain the architecture of data warehouse.
The architecture of a data warehouse is indicated below. Before we proceed further, we should be clear about the concept of architecture: it only gives the major items that make up a data warehouse. The size and complexity of each of these items depend on the actual size of the warehouse itself, the specific requirements of the warehouse and the actual details of implementation.
100. Briefly explain the functions of each manager of the data warehouse.
The Warehouse Manager: The warehouse manager is a component that performs all operations necessary to support the warehouse management process. Unlike the load manager, the warehouse management process is driven by the extent to which the operational management of the data warehouse has been automated.
The warehouse manager can easily be termed the most complex of the warehouse components, and it performs a variety of tasks. A few of them are listed below.
i) Analyze the data to confirm data consistency and data integrity.
ii) Transform and merge the source data from the temporary data storage into the ware house.
iii) Create indexes, cross references, partition views etc.,.
iv) Check for normalization.
v) Generate new aggregations, if needed.
vi) Update all existing aggregations
vii) Create backups of data.
viii) Archive the data that needs to be archived.
101. Explain the star schema to represent the sales analysis.
Star schemas are database schemas that structure the data to exploit a typical decision support enquiry. When the components of typical enquiries are examined, a few similarities stand out.
i) The queries examine a set of factual transactions - sales, for example.
ii) The queries analyze the facts in different ways - by aggregating them on different bases / graphing them in different ways.
The central concept of most such transactions is a “fact table”. The surrounding references are called dimension tables. The combination can be called a star schema.
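A minimal sketch of a star schema for sales analysis using Python's sqlite3 module (the table and column names are illustrative): the fact table holds numeric measures plus keys into the surrounding dimension tables, and a typical query aggregates the facts along a dimension.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, day TEXT, month TEXT);

    -- The fact table: numeric measures plus foreign keys into the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units_sold INTEGER,
        amount     REAL
    );
""")

# A typical decision-support query aggregates the facts along a dimension.
query = """SELECT p.category, SUM(f.amount)
           FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
           GROUP BY p.category"""
print(conn.execute(query).fetchall())   # empty here, since no rows were loaded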
102. What do you mean by partition of data? Explain briefly.
Partitioning of data :-
In most warehouses, the size of the fact data tables tends to become very large. This leads to several problems of management, backup, processing, etc. These difficulties can be overcome by partitioning each fact table into separate partitions.
Data warehouses tend to exploit these ideas by partitioning the large volume of data into data sets. For example, data can be partitioned on a weekly / monthly basis, so as to minimize the amount of data scanned before answering a query. This allows the data to be scanned to be minimized without the overhead of using an index, which improves the overall efficiency of the system. However, having too many partitions can be counterproductive, and an optimal size and number of partitions is of vital importance.
Partitioning generally helps in the following ways.
i) Assists in better management of data
ii) Ease of backup / recovery since the volumes are less.
iii) The star schemes with partitions produce better performance.
iv) Since several hardware architectures operate better in a partitioned environment, the overall system performance improves.
103. Describe the terms data mart and Meta data.
Data mart :-
A data mart is a subset of the information content of a data warehouse, stored in its own database. The data of a data mart may have been collected through a warehouse or, in some cases, directly from the source. In a crude sense, if you consider a data warehouse to be a wholesale shop of data, a data mart can be thought of as a retailer.
Meta data:-
Metadata is simply data about data. Data normally describes objects: their quantity, size, how they are stored, etc. Similarly, metadata stores data about how the data (of objects) is stored, and so on.
Metadata is useful in a number of ways. It can map data sources to the common view of information within the warehouse. It is helpful in query management, to direct a query to the most appropriate source, etc.
The structure of metadata is different for each process. This means that for each volume of data, there are multiple sets of metadata describing the same volume. While this is a very convenient way of managing data, managing the metadata itself is not a very easy task.
104. Enlist the differences between fact and dimension.
This ensures that key dimensions are not fact tables.
Consider the following example :-
Let us elaborate a little on the example. Consider a customer A. If there is a situation where the warehouse is building profiles of customers, then A becomes a fact: against the name A, we can list his address, purchases, debts, etc. One can ask questions like how many purchases A has made in the last 3 months; here A is a fact. On the other hand, if the data is likely to be used to answer questions like “how many customers have made more than 10 purchases in the last 6 months”, and one uses the data of A, as well as that of other customers, to give the answer, then it becomes part of a fact table. The rule is, in such cases, avoid making A a candidate key.
105. Explain the designing of star-flake schema in detail.
A starflake schema, as defined previously, is a schema that combines denormalised star and normalized snowflake schemas. It is most appropriate for decision-support data warehouses. Generally, the detailed transactions are stored within a central fact table, which may be partitioned horizontally or vertically. A series of combining database views is created so that user access tools can treat the fact table partitions as a single, large table.
The key reference data is structured into a set of dimensions that can be referenced from the fact table. Each dimension is stored in a series of normalized tables (snowflakes), with an additional denormalised star dimension table.
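A minimal sketch of this arrangement, again in Python with sqlite3 and purely illustrative names: the fact table is split into monthly partitions, a combining view presents them as one table, and one dimension is held as normalized snowflake tables with a denormalised dimension view on top.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two horizontal partitions of the central fact table.
CREATE TABLE sales_fact_2024_01 (date_key INTEGER, product_key INTEGER, amount REAL);
CREATE TABLE sales_fact_2024_02 (date_key INTEGER, product_key INTEGER, amount REAL);

-- A combining view lets user access tools treat the partitions as one large table.
CREATE VIEW sales_fact AS
    SELECT * FROM sales_fact_2024_01
    UNION ALL
    SELECT * FROM sales_fact_2024_02;

-- One dimension stored as normalized snowflake tables ...
CREATE TABLE product     (product_key INTEGER PRIMARY KEY, name TEXT, subcategory_key INTEGER);
CREATE TABLE subcategory (subcategory_key INTEGER PRIMARY KEY, name TEXT, category_key INTEGER);
CREATE TABLE category    (category_key INTEGER PRIMARY KEY, name TEXT);

-- ... plus a denormalised star dimension view built on top of them.
CREATE VIEW product_dim AS
    SELECT p.product_key, p.name AS product, s.name AS subcategory, c.name AS category
    FROM product p
    JOIN subcategory s ON p.subcategory_key = s.subcategory_key
    JOIN category    c ON s.category_key    = c.category_key;
""")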
106. What is query redirection? Explain.
Query Redirection :-
One of the basic requirements for the successful operation of a starflake schema (or any schema, for that matter) is the ability to direct a query to the most appropriate source. Once the available data grows beyond a certain size, partitioning becomes essential. In such a scenario, to optimize the time spent on querying, each query should be directed to the appropriate partitions that store the data it requires.
The basic method is to design the access tool in such a way that it automatically determines the location (partition) to which the query is to be redirected.
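The sketch below shows one possible (hypothetical) way such a redirector could be written in Python: given the date range a query touches, it returns only the monthly partitions that overlap that range.

from datetime import date

# Hypothetical monthly partitions and the date range each one covers.
PARTITIONS = {
    "sales_fact_2024_01": (date(2024, 1, 1), date(2024, 1, 31)),
    "sales_fact_2024_02": (date(2024, 2, 1), date(2024, 2, 29)),
    "sales_fact_2024_03": (date(2024, 3, 1), date(2024, 3, 31)),
}

def redirect(query_start, query_end):
    # Keep only the partitions whose date range overlaps the query's range.
    return [name for name, (lo, hi) in PARTITIONS.items()
            if query_start <= hi and query_end >= lo]

print(redirect(date(2024, 2, 10), date(2024, 3, 5)))
# ['sales_fact_2024_02', 'sales_fact_2024_03'] - the January partition is never scanned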
107. In detail, explain the multidimensional schema.
Multidimensional schemas :-
Before we close, we look at the interesting concept of multiple dimensions. This is a very convenient method of analyzing data when it goes beyond normal tabular relations.
For example, a store may maintain, as a single table, the month-by-month sales of one item across each of its 10 outlets.
This is a two-dimensional table. On the other hand, if the company wants data on all items sold by its outlets, this can be obtained simply by superimposing the two-dimensional tables for each of these items, one behind the other. It then becomes a three-dimensional view.
A query then, instead of looking for a two-dimensional rectangle of data, looks for a three-dimensional cuboid of data.
There is no reason why the dimensioning should stop at three. In fact, almost all queries can be thought of as extracting a multidimensional unit of data from a multidimensional volume of the schema.
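The outlet/item/month example can be sketched as a small cube in Python (using numpy, with made-up sizes): a two-dimensional question addresses a rectangular slice, while a three-dimensional question addresses a cuboid and aggregates along one axis.

import numpy as np

outlets, items, months = 10, 4, 3                      # illustrative sizes
cube = np.random.randint(0, 100, size=(outlets, items, months))

# Two-dimensional question: monthly sales of item 0 in every outlet
# is a rectangular slice of the cube.
slice_2d = cube[:, 0, :]

# Three-dimensional question: total sales of every item in every outlet,
# aggregated over the months, addresses the whole cuboid.
totals = cube.sum(axis=2)

print(slice_2d.shape, totals.shape)                    # (10, 3) (10, 4)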
108. Why is partitioning needed in a large data warehouse?
Partitioning is needed in any large data warehouse to improve performance and manageability. It helps query redirection send queries to the appropriate partition, thereby reducing the overall time taken for query processing.
109. Explain the types of partitioning in detail.
i). Horizontal partitioning :-
This essentially means that the table is partitioned after the first few thousand entries, the next few thousand entries, and so on. This is because, in most cases, not all the information in the fact table is needed all the time. Horizontal partitioning therefore reduces query access time by directly cutting down the amount of data to be scanned by the queries.
ii). Vertical partitioning :-
As the name suggests, a vertical partitioning scheme divides the table vertically, i.e. each row is split across two or more partitions, each holding a subset of the columns.
iii). Hardware partitioning :-
Needless to say, the data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design with respect to the specific hardware architecture.
110. Explain the mechanism of row splitting.
Row Splitting :-
The method involves identifying the less frequently used fields and moving them into another table. This ensures that the frequently used fields can be accessed with much less computational effort.
Note that row splitting neither reduces nor increases the overall storage needed, whereas normalization may change the overall storage space required. In row splitting the mapping between the resulting tables is one-to-one, whereas normalization may produce one-to-many relationships.
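A minimal sketch of row splitting, assuming a hypothetical customer table in which name and balance are the frequently used fields and fax and postal_notes are rarely used (Python, sqlite3):

import sqlite3

conn = sqlite3.connect(":memory:")

# Row splitting: the same rows and the same total data, but the rarely used
# columns move to a second table with a one-to-one mapping on the key.
conn.executescript("""
CREATE TABLE customer_hot  (id INTEGER PRIMARY KEY, name TEXT, balance REAL);
CREATE TABLE customer_cold (id INTEGER PRIMARY KEY REFERENCES customer_hot(id),
                            fax TEXT, postal_notes TEXT);
""")

# Frequent queries now scan only the narrow "hot" table.
conn.execute("SELECT name, balance FROM customer_hot WHERE balance > 1000")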
111. Explain the guidelines used for hardware partitioning.
Guidelines used for hardware partitioning :-
Needless to say, the data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design with respect to the specific hardware architecture. Obviously, the exact details of the optimization depend on the hardware platform. Normally the following guidelines are useful :-
i). Maximize the use of the available processing power, disk and I/O bandwidth.
ii). Reduce bottlenecks at the CPU and I/O.
112. What is aggregation? Explain the need for aggregation. Give an example.
Aggregation : Data aggregation is an essential component of any decision-support data warehouse. It helps to ensure cost-effective query performance, which means that the cost incurred to answer a query is more than offset by the benefit of the answer. Aggregation attempts to achieve this by reducing the processing power needed to answer the queries. However, too much aggregation leads to unacceptable levels of operational cost.
Too little aggregation may not improve performance to the required level. A fine balance between the two is essential to meet the requirements stated above. One rule of thumb often suggested is that about three out of every four queries should be answered efficiently by the aggregations, while the fourth takes its own time to be processed. A second, though minor, advantage of aggregation is that it brings out the overall trends in the data. Such trends may not be obvious when looking at individual records, whereas aggregated data helps us draw conclusions easily. For example, a summary table holding monthly sales totals per product makes a yearly sales trend obvious, while the detailed transaction records alone do not.
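As a small illustration (Python, sqlite3, invented data): the summary table below pre-aggregates the detailed fact rows per month and product, so trend queries read a handful of summary rows instead of rescanning every transaction.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (sale_month TEXT, product TEXT, amount REAL);
INSERT INTO sales_fact VALUES
    ('2024-01', 'pen', 5.0), ('2024-01', 'pen', 7.0), ('2024-02', 'book', 40.0);

-- Pre-aggregated summary table built from the detail rows.
CREATE TABLE sales_summary AS
    SELECT sale_month, product, SUM(amount) AS total, COUNT(*) AS n_sales
    FROM sales_fact
    GROUP BY sale_month, product;

CREATE INDEX idx_summary_month ON sales_summary (sale_month);
""")

print(conn.execute("SELECT * FROM sales_summary ORDER BY sale_month").fetchall())
# [('2024-01', 'pen', 12.0, 2), ('2024-02', 'book', 40.0, 1)]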
113. Explain the different aspects of designing a summary table.
Summary tables are designed by following the steps given below :-
i). Decide the dimensions along which aggregation is to be done.
ii). Determine the aggregation of multiple facts.
iii). Aggregate multiple facts into the summary table.
iv). Determine the level of aggregation and the extent of embedding.
v). Design time into the table.
vi). Index the summary table.
114. Give the reasons for creating the data mart.
The following are the reasons for which data marts are created :-
i). Since the volume of data scanned is small, they speed up query processing.
ii). Data can be structured in a form suited to the user access tool.
iii). Data can be segmented or partitioned so that it can be used on different platforms, and different control strategies become applicable.
115. Explain the two stages in setting up data marts.
There are two stages in setting up data marts :-
i). Decide whether data marts are needed at all. The facts listed above may help you to decide whether it is worthwhile to set up data marts or to operate from the warehouse itself. The problem is similar to that of a merchant deciding whether or not to set up retail shops.
ii). If you decide that setting up data marts is desirable, then the following steps have to be gone through before you can finalize the actual data-marting strategy:
a) Identify the natural functional splits of the organization.
b) Identify the natural splits of data.
c) Check whether the proposed access tools have any special data base structures.
d) Identify the infrastructure issues, if any, that can help in identifying the data marts.
e) Look for restrictions on access control; they can serve to demarcate parts of the warehouse.
116. What are the disadvantages of data marts?
There are certain disadvantages :-
i). The cost of setting up and operating data marts is quite high.
ii). Once a data-marting strategy is put in place, the data mart formats become fixed. It may be fairly difficult to change the strategy later, because the data mart formats also have to be changed.
117. What is the role of access control issues in data mart design?
Role of access control issue in data mart design :-
This is one of the major constraints in data mart design. Any data warehouse, with its huge volume of data, is more often than not subject to various access controls governing who may access which part of the data. The easiest case is where the data is partitioned so clearly that a user of one partition cannot access any other data. In such cases, each partition can be put into its own data mart, and each user can access only his own data.
In the data warehouse, the data pertaining to all these marts is stored, but the partitioning is retained. If a super user wants an overall view of the data, suitable aggregations can be generated.
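A toy sketch of this idea in Python, with invented group and mart names: each user group is mapped to exactly one data mart, and anything outside that mapping is refused.

# Hypothetical mapping of user groups to the single data mart each may use.
MART_OF_GROUP = {
    "sales_team":   "sales_mart",
    "finance_team": "finance_mart",
}

def authorised_mart(group):
    # Each group reaches exactly one mart; any other request is refused.
    try:
        return MART_OF_GROUP[group]
    except KeyError:
        raise PermissionError(f"group {group!r} has no mart assigned")

print(authorised_mart("sales_team"))   # sales_mart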
118. Explain the purpose of using metadata in detail.
Metadata will be used for the following purposes :-
i). Data transformation and loading.
ii). Data management.
iii). Query generation.
119. Explain the concept of metadata management.
Metadata should be able to describe data as it resides in the data warehouse. This helps the warehouse manager control data movements. The purpose of the metadata is to describe the objects in the database. Some of the descriptions are outlined below; a small sketch of such a structure follows the list.
· Tables
- Columns
* Names
* Types
· Indexes
- Columns
* Name
* Type
· Views
- Columns
* Name
* Type
· Constraints
- Name
- Type
- Table
* Columns
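A minimal sketch of what such object-describing metadata might look like as a data structure (Python; every table, column, index and constraint name here is invented):

# Object-describing metadata, mirroring the outline above.
warehouse_metadata = {
    "tables": {
        "sales_fact": {"columns": {"amount": "REAL", "date_key": "INTEGER"}},
    },
    "indexes": {
        "idx_sales_date": {"table": "sales_fact", "columns": ["date_key"]},
    },
    "views": {
        "monthly_sales": {"columns": {"month": "TEXT", "total": "REAL"}},
    },
    "constraints": {
        "pk_sales": {"type": "PRIMARY KEY", "table": "sales_fact",
                     "columns": ["date_key", "product_key"]},
    },
}

# The warehouse manager can consult this structure before moving data,
# e.g. to check which columns a table exposes:
print(list(warehouse_metadata["tables"]["sales_fact"]["columns"]))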
120. How does the query manager use the metadata? Explain in detail.
Metadata is also required to generate queries. The query manager uses the metadata to build a history of all queries run and to generate a query profile for each user or group of users.
We list below a few of the commonly used items of query metadata; the names are self-explanatory, and a small sketch of such a record follows the list.
o Query
o Table accessed
    § Column accessed
        · Name
        · Reference identifier
o Restrictions applied
    § Column name
    § Table name
    § Reference identifier
    § Restrictions
o Join criteria applied
    § Column name
    § Table name
    § Reference identifier
    § Column name
    § Table name
    § Reference identifier
o Aggregate function used
    § Column name
    § Reference identifier
    § Aggregate function
o Group by criteria
    § Column name
    § Reference identifier
    § Sort direction
o Syntax
o Resources
    § Disk
    § Read
    § Write
    § Temporary
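A minimal sketch of what one such record might look like (Python; every field value is invented), with records accumulated per user to build the query history and profile mentioned above:

# One hypothetical query-metadata record kept by the query manager.
query_profile = {
    "user": "analyst_01",
    "syntax": "SELECT region, SUM(amount) FROM sales_fact ...",
    "tables_accessed": ["sales_fact", "store_dim"],
    "columns_accessed": ["region", "amount"],
    "restrictions": [{"table": "sales_fact", "column": "sale_date",
                      "predicate": ">= '2024-01-01'"}],
    "joins": [{"left": "sales_fact.store_key", "right": "store_dim.store_key"}],
    "aggregates": [{"function": "SUM", "column": "amount"}],
    "group_by": ["region"],
    "resources": {"disk_reads": 12840, "disk_writes": 0, "temp_kb": 512},
}

query_history = []          # one list per user or group of users
query_history.append(query_profile)
print(len(query_history))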
121. Why do we need different managers for a data warehouse? Explain.
Need for managers in a data warehouse :-
Data warehouses are not just large databases. They are complex environments that integrate many technologies. They are not static; they change continuously, both in content and in structure. Thus, there is a constant need for maintenance and management. Since huge amounts of time, money and effort are involved in the development of a data warehouse, sophisticated management tools are always justified.
When computer systems were in their initial stages of development, there used to be an army of human administrators who went around doing all the administration and management. But such a scheme became both unwieldy and error-prone as the systems grew in size and complexity. Further, most of the management practices were ad hoc in nature and subject to human error and fatigue.
122. With a neat diagram, explain the boundaries of the process managers.
A schematic diagram defines the boundaries of the three types of process managers: the load manager, the warehouse manager and the query manager.
123. Explain the responsibilities of each manager of the data warehouse.
Warehouse Manager :-
The warehouse manager is responsible for maintaining the data of the warehouse. It should also create and maintain a layer of metadata. Some of the responsibilities of the warehouse manager are:
o Data movement
o Metadata management
o Performance monitoring
o Archiving.
Data movement includes the transfer of data within the warehouse, aggregation, and the creation and maintenance of tables, indexes and other objects of importance. The warehouse manager should be able to create new aggregations as well as remove old ones. Creating additional rows and columns, keeping track of the aggregation processes and creating metadata are also its functions.
124. What are the different system management tools used for data warehouse?
The different system management tools used for a data warehouse are :-
i). Configuration managers
ii). Schedule managers
iii). Event managers
iv). Database managers
v). Backup and recovery managers
vi). Resource and performance monitors.
-----------------------------------------------------------------------------------------------------