We had a guest lecture this week, and it was fun. People from different industries shared their experience with BI/DM and how these two applications help them in their daily operations and marketing.
Most of the information shared by the guests felt like a recap of my major project. Since we did data mining for Estee Lauder Companies, it was easy for us to follow the content and discussion during the guest lecture. I also had fun doing the lecture activity quiz; it was just like writing down some of the findings from our major project.
I believe that BI/DM is very important to any organization and industry. It is very difficult for an organization to grow without BI. BI helps an organization understand its customers, come up with more cost-effective marketing campaigns, reduce customer churn and improve the company's KPIs.
In conclusion, I think I made the right choice in selecting BI as one of my electives. The lecturers and tutors are very experienced in this industry. I have benefited a lot from this subject: the lectures helped us build up an understanding of Business Intelligence, while the lab lessons gave us experience in building a dashboard or scorecard. Personally, I enjoyed doing the project very much as it built up my confidence in using the software. This is a great subject that I will recommend to my juniors.
Happy learning
Chenyuan
Monday, January 18, 2010
Week 15 Future of Data Warehouse, Data Mining and Data Visualisation
This week's lecture is all about trends and future developments of the data warehouse, and it gave me an overall picture of data warehouse development. I was very surprised by the storage media covered in the lecture. According to Ms Chong, the future of data warehousing is not high-performance disk storage but an array of alternative storage, which involves two forms of storage. Near-line storage involves an automated silo where tape cartridges are handled automatically. Secondary storage is slower and less expensive, such as CD-ROMs and floppy disks.
I think this trend towards alternative storage is very reasonable: data that is placed there once and left alone does not need to be updated at high speed. If data gets accessed less often as it ages, it can be moved to secondary storage, freeing the high-performance resources for new data and making access to newer data more efficient.
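To make this idea concrete, here is a small Python sketch of an age-based tiering rule. The folder layout and the 90-day cut-off are just my own made-up assumptions for illustration, not something from the lecture.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

# Hypothetical locations and cut-off: partitions older than 90 days
# are moved from high-performance disk to cheaper secondary storage.
HOT_STORAGE = Path("/warehouse/hot")         # high-performance disk
SECONDARY_STORAGE = Path("/warehouse/cold")  # slower, cheaper storage
CUTOFF = date.today() - timedelta(days=90)

def archive_old_partitions():
    """Move date-named partition folders (e.g. 2010-01-18) past the cut-off."""
    for partition in HOT_STORAGE.iterdir():
        try:
            partition_date = date.fromisoformat(partition.name)
        except ValueError:
            continue  # skip folders that are not named by date
        if partition_date < CUTOFF:
            shutil.move(str(partition), str(SECONDARY_STORAGE / partition.name))

if __name__ == "__main__":
    archive_old_partitions()
```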
I have read through an article from the Corporate Information Factory written by W. H. Inmon. The article mentions that high-performance disk storage plays only a secondary role in the future of data warehousing; the real future of data warehousing lies in storage media collectively known as "alternative storage". This supports what Ms Chong taught us in the lecture.
I have extracted from the article some reasons why high-performance disk storage is not the only choice for a data warehouse:
- Secondary storage is a form of disk storage whose disks are slower, significantly less expensive and less heavily cached than high-performance storage.
- Data warehouse data is very stable.
- Far more data can be stored on near-line and/or secondary storage.
- Secondary storage and near-line storage are getting cheaper at a faster rate than high-performance storage.
The future of data mining is to help organizations react more quickly and offer better service, and to do it all with fewer people and at a lower cost.
The increase in hardware speed and capacity makes it possible to analyze data sets that were too large just a few years ago. However, as the amount of available data grows exponentially, the industry is looking into automated procedures for data mining.
Data mining is also being used to protect private information, for example through intrusion detection:
- One current intrusion detection technique is misuse detection: scanning for malicious activity patterns that are known by their signatures.
- Another technique is anomaly detection, which attempts to identify malicious activity based on deviations from normal behaviour (a minimal sketch of this idea follows below).
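Here is a minimal Python sketch of the anomaly detection idea, using made-up daily login counts and a simple three-standard-deviation rule; real intrusion detection systems are of course far more sophisticated.

```python
import statistics

# Hypothetical example: daily login counts for one user account.
# The last value is far from the user's normal behaviour.
daily_logins = [4, 5, 3, 6, 4, 5, 4, 48]

history, latest = daily_logins[:-1], daily_logins[-1]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Flag the latest observation if it deviates more than 3 standard
# deviations from the historical norm (a simple anomaly rule).
z_score = (latest - mean) / stdev
if abs(z_score) > 3:
    print(f"Possible anomaly: {latest} logins (z-score {z_score:.1f})")
```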
Data visualization has in recent years become an established area of study in academia. Many universities now have faculty members who focus on visualization and a few have excellent programs that serve the needs of many graduate students who produce worthwhile research studies and prototype applications.
The authors expect that data visualization will continue for the next few years to pursue and mature the trends that have already begun. Dashboards, visual analytics and even simple graphs will continue to develop and conform to best practices. They have also seen evidence that newer efforts are emerging which will soon develop into full-blown trends.
According to Stephen Few of Perceptual Edge, another expression of data visualization that has captured the imagination of many in the business world in recent years is geo-spatial visualization. The popularity of Google Earth and similar Web services has contributed a great deal to this interest.

Another trend that has made the journey in recent years from the academic research community to commercial software tackles the problem of displaying large sets of quantitative data in the limited space of a screen. The most popular example of this is the treemap.
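Out of curiosity I sketched how a very basic treemap layout could be computed in Python. This is only a one-level slice-and-dice layout with made-up sales figures; real treemap tools normally use the more sophisticated squarified algorithm.

```python
# Minimal slice-and-dice treemap layout with made-up sales figures.
# Each category gets a rectangle whose area is proportional to its value;
# the split direction can alternate at each level in a full treemap.
def slice_and_dice(items, x, y, w, h, horizontal=True):
    total = sum(value for _, value in items)
    rects = []
    offset = 0.0
    for name, value in items:
        share = value / total
        if horizontal:
            rects.append((name, x + offset, y, w * share, h))
            offset += w * share
        else:
            rects.append((name, x, y + offset, w, h * share))
            offset += h * share
    return rects

sales = [("Skincare", 50), ("Makeup", 30), ("Fragrance", 15), ("Other", 5)]
for name, rx, ry, rw, rh in slice_and_dice(sales, 0, 0, 100, 60):
    print(f"{name:10s} rectangle at ({rx:5.1f},{ry:4.1f}) size {rw:5.1f} x {rh:4.1f}")
```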
That's all I would like to share for this week.
Happy Reading
Cheers
Chenyuan
Week 14 Implementing Enterprise Business Intelligence Systems
Implementing a BI system is not easy. Before any implementation, an organization must first identify what it wants to achieve, understand what is important to the organization together with the data sources and analytical capabilities it already has, and finally determine the gap between its current resources and its goals for BI.
A BI Program Management Office (BI PMO) has to be created. The BI PMO will:
- Establish project priorities and obtain funding for projects
- Oversee the enforcement of standards, policies and procedures as well as the development of data models and ETL code.
- Identify and create the overarching BI architecture
- Determine how each project will further develop or fit into the existing architecture

The corporate information factory shows the direction of information flow inside an organization. Change management issues are one of the causes of failure when implementing a BI system.
Without good data quality, reporting and analytical processes will be unreliable and dangerous for decision making.
Managing data quality:
- Define what data quality means to your company
- Educate key stakeholders on:
- Data quality theories
- Case studies from other organisations
- Specific problems with your company’s data
- Obtain key stakeholders’ support
- Decide how to establish and maintain ongoing data quality programs (a small automated check is sketched below)
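As a small illustration of what an ongoing data quality check could look like, here is a Python/pandas sketch with a made-up customer table; the column names and rules are my own assumptions, not part of the lecture.

```python
import pandas as pd

# Hypothetical customer extract; column names are made up for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
    "country": ["SG", "SG", "MY", None],
})

# Completeness: percentage of missing values per column.
missing_pct = customers.isna().mean() * 100
print("Missing values (%):\n", missing_pct.round(1))

# Uniqueness: duplicated customer IDs usually indicate a load problem.
dup_ids = customers["customer_id"].duplicated().sum()
print("Duplicate customer_id rows:", dup_ids)
```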
An organization can either outsource its Business Intelligence Competency Center (BICC) or set up a BICC team in-house. According to SAS, a Business Intelligence Competency Center (BICC) is a cross-functional team with a permanent, formal organizational structure. It is owned and staffed by the client and has defined tasks, roles, responsibilities and processes for supporting and promoting the effective use of business intelligence and performance management across the organization.
Your BICC might have some of these functional areas:
- BI Program Management Office: Defines and monitors implementation of the BI strategy. Responsible for consistent BI deployment, standards, technology assessments, knowledge management, best practices and business analytics expertise.
- Data Stewardship: Data definition, ownership and quality, metadata management, data standards.
- Vendor Management: Vendor evaluation, relationship management, user licenses.
- Information Management: Coordination with the data warehouse team to integrate operational data into the BI environment.
- Information Delivery: Working with the business community to promote proper use of information, data mart design and utilization, analytical analysis, user training and support.
A BICC can be either a permanent structure or a virtual team. A permanent structure is a separate division; the advantage is that experience and knowledge are shared within the team.
A virtual team consists of staff from different departments or companies and requires no internal reorganization or shifting of budgets, but it can lack communication and alignment between members.

There are three different ways of funding a BICC. The first is to list it as a cost center; this runs the risk of the center being under-appreciated, as it is not a revenue-generating center.
The second method is an internal billing system, which charges users for help given on projects and analysis. The disadvantage of this method is that it can limit the use and growth of the BICC.
The third method is a subscription-based billing model. It is difficult to implement because it is hard to get agreement upfront and usage is difficult to estimate.
According to SAS, big organizations favour cost-center funding, while SMBs are more likely to use the internal billing system.
Finally, before ending this week's entry, I would like to share this article with my friends. It contains a lot of information, such as the benefits of having a BICC, the staff members in a BICC and guidance for setting up a BICC.
Happy Learning
Cheers
Chenyuan
Sunday, January 17, 2010
Week 13 Text Mining and Web Mining
What we have learnt this week is totally new to my understanding of data mining and business intelligence.
In this week's lecture Ms Chong introduced text mining to us. Text mining is performing data mining on unstructured data. It is important for any data analyst to be equipped with this skill, as around 80% of organizational information is in unstructured textual form.
Some examples of unstructured textual data are:
- Remarks of a call centre officer
- Open questions from a survey
- Web sites
- Annual reports

During the training stage, the user has to create an attribute dictionary. Attributes are the words that appear in the data, and only words that occur above a minimum frequency (usually decided by the user) are included.
After creating the attribute dictionary, the user has to remove common words which are useless for data mining (e.g. the, from, of, a).
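To see how this training stage could work, here is a small Python sketch that builds an attribute dictionary from some made-up call-centre remarks, keeps only words above a minimum frequency and then drops the stop words; the documents and the threshold are just assumptions for illustration.

```python
from collections import Counter

# Made-up call-centre remarks standing in for unstructured text.
documents = [
    "customer called about late delivery of order",
    "customer asked about refund for damaged order",
    "delivery delayed, customer unhappy about order status",
]

STOP_WORDS = {"the", "from", "of", "a", "about", "for"}
MIN_FREQUENCY = 2  # usually the user decides this threshold

# Count word occurrences across all documents (very naive tokenisation).
counts = Counter(
    word for doc in documents for word in doc.lower().replace(",", " ").split()
)

# Keep words above the minimum frequency, then drop common stop words.
attribute_dictionary = {
    word: freq for word, freq in counts.items()
    if freq >= MIN_FREQUENCY and word not in STOP_WORDS
}
print(attribute_dictionary)
```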
Some challenges of textual data mining:
Common sources for textual data mining are email documents and telephone transcripts. Performing textual data mining on these two sources involves more difficult problems.
Common problems faced during textual data mining are spelling and grammar errors in the data (e.g. customer, cust, customar, csmr). Most of the time the data analyst has to group all the different variants, such as customer, cust and customar, under one single attribute before doing any analysis.
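One simple way to group such variants is to map each of them to the closest canonical attribute. The sketch below uses difflib from the Python standard library; the canonical list and the similarity cut-off are my own assumptions.

```python
from difflib import get_close_matches

# Canonical attributes the analyst wants to keep.
canonical = ["customer", "delivery", "refund"]

# Variants seen in the raw text (from the lecture example).
raw_terms = ["customer", "cust", "customar", "csmr", "refund", "refnd"]

# Map each raw term to the closest canonical attribute, if any is close enough.
mapping = {}
for term in raw_terms:
    match = get_close_matches(term, canonical, n=1, cutoff=0.6)
    mapping[term] = match[0] if match else term  # keep unmatched terms as-is

print(mapping)
```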
Other common challenges are semantic analysis and syntax analysis.
How do we apply text mining to our daily operations?
Text mining can be applied to stop email spam or phishing through analysis of the document content.
It can automatically process a message or email and route it to the most appropriate department.
It can identify the most common problems reported to a help centre.
Sometimes an organization will receive hundreds of resumes; text mining can help to filter and match resumes to open positions.
Text mining can also help us monitor website activities and find out how users behave when browsing the website. The website admin can use this information to improve the website structure and make it more user-friendly.
Just like any other data mining, text mining involves the seven KDD steps.
I would like to recommend this page to my friends, as it explains typical applications for text mining in detail as well as the different approaches to text mining.
Web mining is also covered in this lecture.
There are three different types of web mining:
- Web content mining
- Web structure mining
- Web usage mining

Most of the information for web usage mining is gathered from the Web server log file, which records referring pages, IP addresses and dates and times. With this log file the user can create a session file which tracks the pages viewed by each individual user.
The Web server log file is very useful, but it has some disadvantages. It is sometimes difficult to differentiate individual user sessions, because the same host address may be used by multiple users. To create a more accurate session file, the data analyst can combine referring pages with the host address to identify each individual user. Cookies are the best choice if they are allowed to be placed on the users' computers.
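Here is a rough Python sketch of this sessionization idea, grouping made-up log entries by host address plus referring page and starting a new session after 30 minutes of inactivity; the log format and the timeout are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up, already-parsed log entries: (host, referrer, timestamp, page).
log_entries = [
    ("10.0.0.5", "google.com", "2010-01-17 10:00:00", "/home"),
    ("10.0.0.5", "google.com", "2010-01-17 10:02:10", "/products"),
    ("10.0.0.5", "partner.sg", "2010-01-17 10:03:00", "/home"),  # likely a different user
    ("10.0.0.5", "google.com", "2010-01-17 11:30:00", "/home"),  # same user, new session
]

SESSION_TIMEOUT = timedelta(minutes=30)
sessions = defaultdict(list)  # (host, referrer) -> list of sessions

for host, referrer, ts, page in log_entries:
    when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    user_sessions = sessions[(host, referrer)]
    # Start a new session if this user has none yet or has been idle too long.
    if not user_sessions or when - user_sessions[-1][-1][0] > SESSION_TIMEOUT:
        user_sessions.append([])
    user_sessions[-1].append((when, page))

for user, user_sessions in sessions.items():
    print(user, "->", [[page for _, page in s] for s in user_sessions])
```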
I find this case study of web usage mining very interesting.
Saturday, January 16, 2010
Week12 Regression and Neural Networks
This week we learnt Regression and Neural Networks (NN). Both techniques were taught in year two; this time the topics are covered in more detail.

Regression analysis relates one or more numeric input attributes to a single numeric output attribute. The focus is on the relationship between a dependent variable and one or more independent variables.

The lecture covered three different regression models:
- Linear Regression: a straight-line graph
- Nonlinear Regression: usually a curve
- Logistic Regression: categorical output such as y = 0 or 1
Output of Regression
R2 measures how much of the variation in the actual values is explained by the fitted line; a model with a higher R2 fits the data better. To increase R2, the data analyst can increase the number of attributes.
Adding attributes will always increase the R2 value, while adjusted R2 adjusts the calculation to penalize for the number of independent variables.
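To check my understanding, here is a small numpy sketch that fits a straight line to some made-up data and computes both R2 and adjusted R2; the numbers are invented purely for illustration.

```python
import numpy as np

# Made-up data: advertising spend (input) vs sales (output).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Fit a straight line y = b1*x + b0 (linear regression).
b1, b0 = np.polyfit(x, y, deg=1)
predicted = b1 * x + b0

# R2: share of the variation in y explained by the fitted line.
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R2 penalizes for the number of independent variables (k).
n, k = len(y), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"y = {b1:.2f}x + {b0:.2f}, R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
```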
There are a lot of good examples of regression models from here.
The next thing I would like to share is Neural Networks.
A neural network is a computer model that operates like a human brain: the machine possesses simultaneous memory storage and works with ambiguous information.
NN can be used for both supervised and unsupervised learning. Only numeric data can be used for NN, and the relationships between input and output need not be linear. NN is usually used in areas like loan application approval and fraud prevention.
There are two types of NN:
Feed Forward NN
A feed-forward NN is used for supervised mining: the input values are passed forward through one or more hidden layers of weighted connections to produce the output.

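Here is a tiny numpy sketch of a single forward pass through such a network, with made-up inputs and random weights; biases and the training step are left out to keep it short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example: 3 numeric inputs (e.g. scaled applicant attributes),
# one hidden layer of 4 nodes, and a single output node
# (e.g. probability that a loan application is approved).
rng = np.random.default_rng(42)
inputs = np.array([0.2, 0.8, 0.5])

hidden_weights = rng.normal(size=(3, 4))   # input -> hidden connections
output_weights = rng.normal(size=(4, 1))   # hidden -> output connections

hidden_activations = sigmoid(inputs @ hidden_weights)
output = sigmoid(hidden_activations @ output_weights)
print("Predicted approval probability:", float(output[0]))
```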
Kohonen Neural Networks

The other type of NN is the Kohonen Neural Network, which is used for unsupervised mining. Unlike a feed-forward NN, there is no hidden layer in it. Instances input into the network are assigned to the most appropriate cluster, represented by the nodes at the output layer, using the input values and connection weights.
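The sketch below shows this assignment idea in its simplest form: each instance is assigned to the output node with the closest weights, and that node's weights are nudged towards the instance. A real Kohonen network also updates neighbouring nodes, which I have left out; the data and learning rate are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up numeric instances (already normalised) and 3 output nodes,
# each represented by a weight vector of the same dimension.
instances = rng.random((10, 2))
node_weights = rng.random((3, 2))
LEARNING_RATE = 0.3

for instance in instances:
    # Assign the instance to the closest output node (its cluster).
    distances = np.linalg.norm(node_weights - instance, axis=1)
    winner = int(np.argmin(distances))
    # Nudge the winning node's weights towards the instance.
    node_weights[winner] += LEARNING_RATE * (instance - node_weights[winner])

print("Final cluster centres (node weights):\n", node_weights.round(2))
```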
You may find this page interesting, as it gives historical background about NN and many good examples.