We had a guest lecture this week, and it was fun. Speakers from different industries shared their experience with BI/DM and how these two applications help them in their daily operations and marketing.
Most of the information the guests shared felt like a recap of my major project. Since we did data mining for Estee Lauder Companies, it was very easy for us to follow the content and discussion during the guest lecture. I had fun doing the lecture activity quiz; it was just like writing down some of the findings from our major project.
I believe that BI/DM is very important to any organization and industry. It is very difficult for an organization to grow without BI. BI helps an organization understand its customers, come up with more cost-effective marketing campaigns, reduce customer churn and improve the company's KPIs.
In conclusion, I think I made the right choice in selecting BI as one of my electives. The lecturers and tutors are very experienced in this industry. I have benefited a lot from this subject: the lectures helped us build up an understanding of Business Intelligence, while the lab lessons gave us experience in building a dashboard or scorecard. Personally I enjoyed doing the project very much, as it built up my confidence in using the software. This is a great subject that I will recommend to my juniors.
Happy learning
Chenyuan
Monday, February 1, 2010
Monday, January 18, 2010
Week 15 Future of Data Warehouse, Data Mining and Data Visualisation
This week's lecture is all about trends and the future development of the data warehouse, and I got an overall picture of data warehouse development. I was very surprised by the storage media covered in the lecture. According to Ms Chong, the future of data warehousing is not high-performance disk storage but an array of alternative storage. This involves two forms of storage: near-line storage, an automated silo where tape cartridges are handled automatically, and secondary storage, which is slower and less expensive, such as CD-ROMs and floppy disks.
I think this trend toward alternative storage is very reasonable: data placed there is written once and left alone, so it does not need to be updated at high speed. If data is accessed less often as it ages, it can be moved to secondary storage, freeing resources for the new data and making access to newer data more efficient.
I have read through this article from the Corporate Information Factory, written by W. H. Inmon. The article mentions that high-performance disk storage plays only a secondary role in the future of data warehousing; the real future of data warehousing is in storage media collectively known as "alternative storage". This supports what Ms Chong taught us in the lecture.
I have extracted from the article some reasons why high-performance disk storage is not the choice for a data warehouse:
- Secondary storage is a form of disk storage whose disks are slower, significantly less expensive and less cached than high-performance storage.
- Data warehouse data is very stable.
- Far more data can be stored on near-line and/or secondary storage.
- Secondary and near-line storage are getting cheaper at a faster rate than high-performance storage.
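To picture how such a tiering policy could work, here is a tiny Python sketch of my own (the thresholds are made up, not from the lecture or the article): data migrates to cheaper tiers as it ages and is accessed less often.

```python
from datetime import datetime, timedelta

# Made-up thresholds: fresh data stays on high-performance disk,
# older and less-used data migrates to cheaper tiers.
NEARLINE_AGE = timedelta(days=90)
SECONDARY_AGE = timedelta(days=365)

def choose_tier(last_accessed, now):
    """Pick a storage tier based on how long ago the data was last accessed."""
    age = now - last_accessed
    if age >= SECONDARY_AGE:
        return "secondary"      # slowest and cheapest (e.g. optical media)
    if age >= NEARLINE_AGE:
        return "near-line"      # automated tape silo
    return "high-performance"   # fast disk for new, frequently read data

print(choose_tier(datetime(2009, 1, 5), datetime(2010, 1, 18)))   # secondary
print(choose_tier(datetime(2010, 1, 10), datetime(2010, 1, 18)))  # high-performance
```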
The future of data mining is to react more quickly and offer better service, and to do it all with fewer people and at a lower cost.
The increase in hardware speed and capacity makes it possible to analyze data sets that were too large just a few years ago. However, as the available data grows exponentially, the industry is looking into automated procedures for data mining.
Data mining is also being used to protect private information, for example in intrusion detection:
- One current intrusion detection technique is misuse detection: scanning for malicious activity patterns known by signatures.
- Another technique is anomaly detection, where there is an attempt to identify malicious activity based on deviations from norms.
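Both techniques are easy to picture in code. Here is a toy Python sketch of my own (the signatures and numbers are invented): misuse detection matches known bad patterns, while anomaly detection flags values that deviate too far from the norm.

```python
# Hypothetical signatures and traffic values, purely for illustration.
SIGNATURES = {"DROP TABLE", "../../etc/passwd", "rm -rf /"}

def misuse_detect(event):
    """Misuse detection: flag activity that matches a known malicious signature."""
    return any(sig in event for sig in SIGNATURES)

def anomaly_detect(value, history, threshold=3.0):
    """Anomaly detection: flag values deviating strongly from the norm (z-score)."""
    mean = sum(history) / len(history)
    std = (sum((x - mean) ** 2 for x in history) / len(history)) ** 0.5
    return std > 0 and abs(value - mean) / std > threshold

print(misuse_detect("GET /../../etc/passwd"))        # True: known pattern
print(anomaly_detect(900, [10, 12, 9, 11, 10, 13]))  # True: far outside the norm
```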
Data visualization has in recent years become an established area of study in academia. Many universities now have faculty members who focus on visualization and a few have excellent programs that serve the needs of many graduate students who produce worthwhile research studies and prototype applications.
The authors expect that data visualization will continue for the next few years to pursue and mature the trends that have already begun. Dashboards, visual analytics, and even simple graphs will continue to develop and conform to best practices. They have also seen evidence that newer efforts are emerging that will soon develop into full-blown trends.
According to Stephen Few of Perceptual Edge, another expression of data visualization that has captured the imagination of many in the business world in recent years is geo-spatial visualization. The popularity of Google Earth and other similar Web services has contributed a great deal to this interest.

Another trend that has made the journey in recent years from the academic research community to commercial software tackles the problem of displaying large sets of quantitative data in the limited space of a screen. The most popular example of this is the treemap.
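Out of curiosity I sketched the core idea of a treemap in Python. This is only a one-level "slice" layout with made-up figures; real treemaps use smarter squarified layouts that keep the rectangles closer to square:

```python
def treemap(items, x, y, w, h):
    """Lay items out in rectangles whose areas are proportional to their values."""
    total = sum(value for _, value in items)
    rects = []
    for label, value in items:
        rw = w * value / total          # slice the width proportionally
        rects.append((label, x, y, rw, h))
        x += rw
    return rects

sales = [("North", 40), ("South", 25), ("East", 20), ("West", 15)]
for label, rx, ry, rw, rh in treemap(sales, 0, 0, 100, 60):
    print(f"{label:5s} rect at ({rx:5.1f}, {ry}) size {rw:.1f} x {rh}")
```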
That's all I would like to share for this week.
Happy Reading
Cheers
Chenyuan
Week 14 Implementing Enterprise Business Intelligence Systems
Implementing a BI system is not easy. Before any implementation, an organization must first identify what it wants to achieve, understand what is important for the organization together with the data sources and analytical capabilities it already has, and finally determine the gap between its current resources and its goals for BI.
A BI Program Management Office (BI PMO) has to be created. The BI PMO will:
- Establish project priorities and obtain funding for projects
- Oversee the enforcement of standards, policies and procedures, as well as the development of data models and ETL code
- Identify and create the overarching BI architecture
- Determine how each project will further develop or fit into the existing architecture

The corporate information factory shows the direction of information flow inside an organization. Poor change management will be one of the causes of failure when implementing a BI system.
Without data quality, reporting and analytical processes will be unreliable and dangerous for decision making.
Managing data quality:
- Define what data quality means to your company
- Educate key stakeholders on:
  - Data quality theories
  - Case studies from other organisations
  - Specific problems with your company's data
- Obtain key stakeholders' support
- Decide how to establish and maintain ongoing data quality programs
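As a small illustration of what an ongoing data quality program might automate, here is a pandas sketch of my own (the table and the rules are invented):

```python
import pandas as pd

# A toy customer table with some typical quality problems.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "age": [34, 29, 29, 210],
})

report = {
    "duplicate_ids": int(df["customer_id"].duplicated().sum()),
    "missing_emails": int(df["email"].isna().sum()),
    # missing or malformed addresses both fail the '@' check below
    "invalid_emails": int((~df["email"].fillna("").str.contains("@")).sum()),
    "out_of_range_ages": int((~df["age"].between(0, 120)).sum()),
}
print(report)  # {'duplicate_ids': 1, 'missing_emails': 1, 'invalid_emails': 2, 'out_of_range_ages': 1}
```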
An organization can either outsource its Business Intelligence Competency Center (BICC) or set up a BICC team in house. According to SAS, a BICC is a cross-functional team with a permanent, formal organizational structure. It is owned and staffed by the client and has defined tasks, roles, responsibilities and processes for supporting and promoting the effective use of business intelligence and performance management across the organization.
Your BICC might have some of these functional areas:
- BI Program Management Office: Defines and monitors implementation of the BI strategy. Responsible for consistent BI deployment, standards, technology assessments, knowledge management, best practices and business analytics expertise.
- Data Stewardship: Data definition, ownership and quality, metadata management, data standards.
- Vendor Management: Vendor evaluation, relationship management, user licenses.
- Information Management: Coordination with the data warehouse team to integrate operational data into the BI environment.
- Information Delivery: Working with the business community to promote proper use of information, data mart design and utilization, analytical analysis, user training and support.
A BICC can be either a permanent structure or a virtual team. As a permanent structure it will be a separate division; the advantage is that experience and knowledge will be shared.
A virtual team consists of staff from different departments or companies, which requires no internal reorganization or shifting of budget, but it lacks communication and alignment between members.
There are three different ways of funding a BICC. The first is listing it as a cost center; this runs the risk of an under-appreciated center, as it is not a revenue-generating center.
The second method is an internal billing system, which charges users for help given on projects and analysis. The disadvantage of this method is that it can limit the use and growth of the BICC.
The third method is a subscription-based billing model. It is very difficult to implement, as it is hard to get agreement upfront and usage is difficult to estimate.
According to SAS, big organizations favour the cost-center funding model, while SMBs are more likely to use the internal billing system.
Finally, to wrap up this week's lecture, I would like to share this article with my friends. It contains a lot of information, such as the benefits of having a BICC, the staff members in a BICC and guidance for setting up a BICC.
Happy Learning
Cheers
Chenyuan
Sunday, January 17, 2010
Week 13 Text Mining and Web Mining
What we learnt this week is totally new to me and broadened my understanding of data mining and business intelligence.
In this week's lecture Ms Chong introduced text mining to us. Text mining is performing data mining on unstructured data. It is important for any data analyst to be equipped with this skill, as 80% of organizational information is in unstructured textual form.
Some examples of unstructured textual data:
- Remarks of a call centre officer
- Open-ended questions from a survey
- Web sites
- Annual reports
During the training stage, the user has to create an attribute dictionary. The attributes are words that appear in the data; only words that occur with a minimum frequency are included (usually the user determines the minimum frequency).
After creating the attribute dictionary, the user has to remove common words which are useless for data mining (e.g. the, from, of, a).
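Here is a tiny Python sketch of the training stage as I understand it (the documents and thresholds are my own toy examples): count word frequencies, drop the common stop words, and keep only words that meet the minimum frequency.

```python
from collections import Counter

STOP_WORDS = {"the", "from", "of", "a", "to", "and"}
MIN_FREQ = 2  # usually the user determines this threshold

docs = [
    "the customer called about a late delivery",
    "customer asked about delivery charges",
    "the customer cancelled the order",
]

# Count word frequencies across all documents, drop stop words,
# and keep only attributes that meet the minimum frequency.
counts = Counter(word for doc in docs for word in doc.split())
dictionary = {w: n for w, n in counts.items()
              if w not in STOP_WORDS and n >= MIN_FREQ}
print(dictionary)  # {'customer': 3, 'about': 2, 'delivery': 2}
```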
Some challenges of textual data mining
Common subjects for textual data mining are email documents and telephone transcripts. Performing textual data mining on these two kinds of information involves more difficult problems.
A common problem faced during textual data mining is spelling and grammar errors in the data (e.g. customer, cust, customar, csmr). Most of the time the data analyst has to group the different variants such as customer, cust and customar under one single attribute before doing any analysis (a small sketch follows below).
Other common problems are semantic analysis and syntax analysis.
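For the spelling-variant problem, a simple lookup table is a reasonable first pass; a sketch of my own:

```python
# Mapping spelling variants onto one canonical attribute (illustrative only).
VARIANTS = {
    "cust": "customer",
    "customar": "customer",
    "csmr": "customer",
}

def normalise(word):
    """Collapse known misspellings and abbreviations into a single attribute."""
    return VARIANTS.get(word.lower(), word.lower())

tokens = ["Customer", "cust", "customar", "csmr"]
print([normalise(t) for t in tokens])  # ['customer', 'customer', 'customer', 'customer']
```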
How do we apply text mining to our daily operations?
Text mining can be applied to stop email spam or phishing through analysis of the document content.
It can automatically process a message or email and route it to the most appropriate department.
It can identify the most common problems reported to a help center.
Sometimes an organization will receive hundreds of resumes; text mining can help to match resumes to open positions.
Text mining can also help us monitor website activity and find out how users behave when browsing the website. The website admin can use this information to improve the website structure and make it more user friendly.
Just like any other data mining, text mining involves the 7 KDD steps.
I would like to recommend this page to my friends, as it explains typical applications for text mining in detail along with the different approaches to text mining.
Web mining is also covered in this lecture.
There are three different types of web mining:
- Web content mining
- Web structure mining
- Web usage mining
The lecture showed a sample Web server log file; this is where most of the information for web usage mining is gathered. The log file contains referring pages, IP addresses, and dates and times. With this log file the user can create a session file, which tracks the pages viewed by each individual user.
A Web server log file is very useful, but it has some disadvantages. It is sometimes difficult to differentiate individual user sessions, as the same host address may be used by multiple users. To create a more accurate session file, the data analyst can combine referring pages with the host address to identify each individual user. Cookies are the best choice if they are allowed to be placed on the user's computer.
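Here is a rough Python sketch of that idea, using a simplified made-up log format rather than a real server log: requests are grouped per host, and a request whose referrer does not match the previous page starts a new session.

```python
import re
from collections import defaultdict

# Toy log lines (illustrative only): host, timestamp, page requested, referrer.
LOG = [
    '10.0.0.1 [16/Jan/2010:10:01:02] "GET /home" "-"',
    '10.0.0.1 [16/Jan/2010:10:01:40] "GET /products" "/home"',
    '10.0.0.1 [16/Jan/2010:10:02:05] "GET /home" "-"',
]
PATTERN = re.compile(r'(\S+) \[(.*?)\] "GET (\S+)" "(.*?)"')

# Combine the host address with the referring page: a request whose referrer
# is not the previous page from that host starts a new session.
sessions = defaultdict(list)           # host -> list of sessions (page lists)
for line in LOG:
    host, _, page, referrer = PATTERN.match(line).groups()
    host_sessions = sessions[host]
    if host_sessions and referrer == host_sessions[-1][-1]:
        host_sessions[-1].append(page)     # continues the current session
    else:
        host_sessions.append([page])       # new session for this host

print(dict(sessions))  # {'10.0.0.1': [['/home', '/products'], ['/home']]}
```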
I find this case study of web usage mining very interesting.
Saturday, January 16, 2010
Week12 Regression and Neural Networks
This week we learnt Regression and Neural Networks (NN). Both techniques were taught in year two, but this time the topics are covered in more detail.

Regression analysis relates one or more numeric input attributes to a single numeric output attribute. The focus is on the relationship between a dependent variable and one or more independent variables.

The lecture covered three different models:
- Linear Regression: a straight line graph
- Nonlinear Regression: usually a curve
- Logistic Regression: categorical data such as y = 0 or 1
Output of Regression
R² measures how much of the variation in the actual values is explained by the regression line; the model is more accurate with a higher R². To increase R², a data analyst can increase the number of attributes.
Adding attributes will always increase the R² value, while adjusted R² penalizes the calculation for the number of independent variables.
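To see where these numbers come from, here is a small Python sketch with toy numbers of my own: a least-squares line with R² and adjusted R² computed by hand.

```python
# Simple linear regression by least squares, with R² and adjusted R²
# computed by hand (p is the number of independent variables).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]
n, p = len(xs), 1

mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # residual error
ss_tot = sum((y - my) ** 2 for y in ys)                         # total variation
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalises extra attributes

print(f"y = {b0:.2f} + {b1:.2f}x, R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
```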
There are a lot of good examples of regression models here. The next thing I would like to share is Neural Networks.
A neural network is a computer model that operates like a human brain: the machine possesses simultaneous memory storage and works with ambiguous information.
NN can be used for both supervised and unsupervised learning. Only numeric data can be used for NN, and the relationships between input and output need not be linear. NN is usually used in areas like approving loan applications and fraud prevention.
There are two types of NN:
Feed Forward NN
In a feed-forward network, instances pass from the input layer through one or more hidden layers to the output layer, with the connection weights adjusted during supervised training.

Kohonen Neural Networks
The other type of NN is the Kohonen Neural Network, which performs unsupervised mining. Unlike a Feed Forward NN, there is no hidden layer. Instances input into the network are assigned to the most appropriate cluster, represented by the nodes at the output layer, using the input values and connection weights.
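That assignment step is simple to sketch in Python (the weight vectors below are made up): each instance joins the output node whose weight vector is nearest.

```python
# Winner-take-all assignment in a Kohonen network: each output node holds a
# weight vector, and an instance joins the closest node's cluster.
def assign_cluster(instance, weights):
    """Return the index of the output node whose weights are nearest."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(weights)), key=lambda i: dist2(instance, weights[i]))

# Two output nodes with made-up "trained" weight vectors.
node_weights = [[0.1, 0.2], [0.9, 0.8]]
print(assign_cluster([0.15, 0.25], node_weights))  # 0 (nearest to node 0)
print(assign_cluster([1.00, 0.70], node_weights))  # 1 (nearest to node 1)
```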
You may find this page interesting, as it covers the history and background of NN and has many good examples.
Friday, December 18, 2009
Week 5 & 6 Information Dashboard Design
In these two weeks' lectures Ms Chong went through dashboard design in detail. I think the two lectures are helpful for our project, as they provide us with guidelines for effective dashboard design.
Dashboards are visual displays that fit on a single computer screen and present the information needed to achieve specific objectives.
There are three different types of dashboard.
Dashboards for Strategic Purposes
This type of dashboard is usually for top management, where real-time data is not required.
I would like to recommend this article to my friends. I totally agree with it, and find it very detailed with a lot of information and facts.
Dashboards for Analytical Purposes
This type of dashboard should allow drill-down and use more sophisticated display media compared with dashboards for strategic purposes.
Dashboards for Operational Purposes
Real-time information is presented, and the dashboard is able to alert the user to abnormalities.
Visualization is a very powerful way of presenting solutions and findings. The dashboard is one of the most popular tools, but creating an effective dashboard is not easy. People have limited short-term memory, so a dashboard should not contain too much information, or the user will be overwhelmed and remember none of it. Show 3 to 9 chunks of visual information at a time and fit everything into one screen.
An effective dashboard allows the user to focus on important data and alerts the user when necessary.
I have summarized the key points for creating an effective dashboard from the two lectures.
- Use appropriate colors, with emphasis colors for attention grabbing.
- Pay attention to the 6 principles of visual perception.
- Only include absolutely needed information (it is a dashboard, not a detailed report!).
- Condense carefully so that the meaning doesn't decrease.
- Use visual display mechanisms that can be easily understood.
- Reduce non-data pixels and enhance data pixels.
Personally I found this example from the week 5 lecture very interesting. It demonstrates the importance of good dashboard design.
As you can see, using the appropriate color makes the user's job much easier: in one glance the user is able to pick out the needed information, which is the number of fives.
The tips above will be useful when it comes to dashboard creation, but I think they are not enough: we must also understand the different types of display media. I believe it is very important to know when to use which display medium and to use them appropriately.
The diagram below demonstrates the importance of choosing the correct display medium for different information.

There are two different display media in this graph: one is a table and the other is a line chart. The line chart is the better display medium in this case, since patterns and trends are easily spotted. The information is very hard to understand when presented in table format; in fact it is almost unreadable and not useful at all.
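To show how little effort the better medium takes, here is a short matplotlib sketch with made-up numbers that presents monthly figures as a line chart:

```python
import matplotlib.pyplot as plt

# The same monthly figures as a line chart: the trend that is nearly
# invisible in a table jumps out immediately (values are made up).
months = ["Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
revenue = [102, 98, 110, 125, 131, 152]

plt.plot(months, revenue, marker="o")
plt.title("Revenue by month")
plt.ylabel("Revenue ($k)")
plt.show()
```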
Some tips from me for choosing the right display media:
- Try to avoid using thermometers, as they do not display the maximum and minimum numbers.
- Bar and column graphs are used for comparisons and usually involve more than one measure.
- Use an interval scale for bar and column charts.
- Stacked bars are good for comparing wholes while seeing the information in a little more detail.
- Scatter plots are good for correlation.
That's all I would like to share on Information Dashboard Design.
Happy learning
Cheers~
Wednesday, December 16, 2009
Week 4 Data Warehouse and OLAP
This week there were a lot of things for me to absorb. The lecture was very packed and we almost ran out of time :)
This week I learned the differences between a data warehouse and an operational database. Ms Chong also went through the different types of data warehouse schemas and dimension tables. The different OLAP servers were covered in the lecture as well.
Difference between data warehouse and operational database
A data warehouse is made up of both internal and external data. Extracting all the necessary data from the different sources and performing the ETL process will create the data warehouse.
There are three different designs of a data warehouse:
- Star Schema
- Snowflake Schema
- Constellation Schema
Star Schema
Star schema architecture is the simplest data warehouse design. It is a data modeling technique used to map multidimensional decision support data into a relational database.
You may check out this link for detailed information about the star schema. The website explains the structure of a star schema and the different components in it; I think it will be helpful, as there are examples given.
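To make the star shape concrete, here is a toy example of my own in Python with sqlite3: a central fact table joined to its dimension tables for a typical decision-support query.

```python
import sqlite3

# A tiny star schema (my own toy example): one fact table whose foreign
# keys point at the surrounding dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Lipstick'), (2, 'Perfume');
INSERT INTO dim_date    VALUES (1, 'Nov'), (2, 'Dec');
INSERT INTO fact_sales  VALUES (1, 1, 120.0), (1, 2, 90.0), (2, 2, 300.0);
""")

# A typical decision-support query: aggregate the fact table by dimensions.
for row in con.execute("""
    SELECT p.name, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date    d ON d.date_id    = f.date_id
    GROUP BY p.name, d.month
"""):
    print(row)
```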
Snowflake Schema
Snowflake schema architecture is a more complex variation of the star schema design. The main difference is that the dimensional tables in a snowflake schema are normalized, so they have a typical relational database design.
Snowflake schemas are generally used when a dimensional table becomes very big and when a star schema can't represent the complexity of the data structure.
I would like to recommend this blog, as it provides both the advantages and disadvantages of the snowflake schema.
Constellation Schema
A constellation schema is made up of two or more fact tables and is usually used for the bottom-up approach. The different fact tables are linked by dimension tables.
Slowly changing dimensions are dimensions which change over time. There are three types of slowly changing dimensions:
Type 1
- Overwrites the previous dimension information
- Does not track changes
- Usually results in inaccurate analysis
Type 2
- Adds four supplementary attributes to track the history of a dimension
- Allows tracking of the entire history but requires large data storage
Type 3
- Implements only two additional attributes
- Tracks only the current and original state of a dimension member
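Here is a small Python sketch of my own showing the difference between Type 1 and Type 2 (Type 3 would simply keep an "original" value alongside the current one):

```python
from datetime import date

# Type 1: overwrite in place, so the history is lost (sketch with plain dicts).
customer = {"id": 7, "city": "Singapore"}
customer["city"] = "Shanghai"          # previous value is gone

# Type 2: add rows with validity attributes so the entire history is kept.
history = [
    {"id": 7, "city": "Singapore", "valid_from": date(2008, 1, 1),
     "valid_to": date(2009, 12, 31), "is_current": False},
    {"id": 7, "city": "Shanghai",  "valid_from": date(2010, 1, 1),
     "valid_to": None, "is_current": True},
]
current = next(r for r in history if r["is_current"])
print(customer["city"], "|", current["city"])   # Shanghai | Shanghai
```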
An OLAP (Online Analytical Processing) cube is a data structure that allows fast analysis of data.
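As a miniature illustration of what a cube gives you, here is a pandas sketch of my own that aggregates a small fact table along two dimensions, the way an OLAP cube pre-computes totals:

```python
import pandas as pd

# A toy fact table aggregated along two dimensions (made-up figures).
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 80, 95],
})
cube = sales.pivot_table(values="amount", index="region",
                         columns="quarter", aggfunc="sum")
print(cube)
```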
Relational OLAP (ROLAP)
- More real-time and flexible than a cube
- Query response is generally slower
- Low storage requirement
- Greater scalability
Multidimensional OLAP (MOLAP)
- Processes faster
- Implemented for cubes with frequent use and rapid query response
Hybrid OLAP (HOLAP)
- Contains all the advantages of MOLAP and ROLAP, but requires large volumes of storage space.
Cheers~