Sunday, January 17, 2010

Week 13 Text Mining and Web Mining

What we have learnt this week is totally new to my understanding of data mining and business intelligence.

In this week's lecture Ms Chong introduced text mining to us. Text mining means performing data mining on unstructured data. It is an important skill for any data analyst, as about 80% of organizational information is said to be in unstructured textual form.

Some examples of unstructured textual data are:

  • Remarks of a call centre officer
  • Open-ended questions from a survey
  • Web sites
  • Annual reports

Text mining usually involves three stages: training, filtering and classifying.

During the training stage, the user has to create an attribute dictionary. The attributes are the words that appear in the data. Only words that occur with at least a minimum frequency are included (usually the user determines the minimum frequency).

After creating the attribute dictionary, the user has to remove common words that are useless for data mining (e.g. the, from, of, a).
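
To make the training and filtering steps more concrete, here is a minimal Python sketch. It is only my own illustration, not the tool used in the lecture: the function name `build_attribute_dictionary`, the sample remarks and the tiny stop-word list are all assumptions, and a real stop-word list would be much longer.

```python
import re
from collections import Counter

# Assumed stop-word list, taken from the examples in this post only.
STOP_WORDS = {"the", "from", "of", "a"}

def build_attribute_dictionary(documents, min_frequency=2):
    """Count words across all documents, then keep only the words that
    reach the minimum frequency and are not stop words."""
    counts = Counter()
    for doc in documents:
        # Lower-case the text and split it into alphabetic words.
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return {
        word: n
        for word, n in counts.items()
        if n >= min_frequency and word not in STOP_WORDS
    }

# Hypothetical call centre remarks, made up for this example.
remarks = [
    "Customer called about the late delivery of a parcel",
    "The customer asked for a refund",
    "Refund issued after late delivery",
]
dictionary = build_attribute_dictionary(remarks, min_frequency=2)
print(dictionary)
```

Raising `min_frequency` makes the dictionary smaller, which is the trade-off the user controls when choosing the minimum frequency.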

Some challenges of Textual data mining

Some common sources of textual data are email documents and telephone transcripts. Performing text mining on these two kinds of information involves more difficult problems.

Some common problems faced during textual data mining are spelling and grammar errors in the data (e.g. customer, cust, customar, csmr). Most of the time the data analyst has to group all the different variants, such as customer, cust and customar, under one single attribute before doing any analysis.

Other common problems are semantic analysis and syntax analysis.
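
One possible way to group such variants is sketched below in Python. This is my own assumption of how it could be done, not a method from the lecture: abbreviations go through a hand-maintained lookup table (`ABBREVIATIONS`, hypothetical), and misspellings are matched against a small list of canonical terms (`CANONICAL_TERMS`, also hypothetical) using the standard library's `difflib.get_close_matches`. The 0.8 cutoff is an arbitrary choice.

```python
import difflib

# Hypothetical vocabulary of "correct" attribute names.
CANONICAL_TERMS = ["customer"]

# Hand-maintained table for abbreviations that are too short
# or too different to be caught by fuzzy matching.
ABBREVIATIONS = {"cust": "customer", "csmr": "customer"}

def normalize_term(word, cutoff=0.8):
    """Map a word to its canonical attribute, or return it unchanged."""
    word = word.lower()
    if word in ABBREVIATIONS:
        return ABBREVIATIONS[word]
    # Fuzzy match catches misspellings such as "customar".
    matches = difflib.get_close_matches(word, CANONICAL_TERMS, n=1, cutoff=cutoff)
    return matches[0] if matches else word

tokens = ["customer", "cust", "customar", "csmr", "refund"]
normalized = [normalize_term(t) for t in tokens]
print(normalized)
```

A lower cutoff would catch more misspellings but would also risk merging unrelated words, so the value needs checking against real data.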

How do we apply Text Mining to our daily operations?

Text mining can be applied to stop email spam or phishing through analysis of the document content.

It can automatically process a message or email and route it to the most appropriate department.

It can identify the most common problems reported to a help center.

Sometimes an organization receives hundreds of resumes; text mining can help to filter the resumes and match them to open positions.

Text mining can also help us to monitor website activity and find out how users behave when browsing the website. The website admin can use this information to improve the website structure and make it more user friendly.

Just like any other data mining, text mining involves the 7 KDD (knowledge discovery in databases) steps.

I would like to recommend this page to my friends, as it explains typical applications for text mining in detail, along with the different approaches to text mining.

Web Mining is also covered in this lecture

There are three different types of web mining:
  • Web content mining
  • Web structure mining
  • Web usage mining

The above image shows a web server log file. This is where most of the information for web usage mining is gathered. The log file contains referring pages, IP addresses, and the date and time of each request. With this log file, the user can create a session file that tracks the pages viewed by each individual user.

A web server log file is very useful, but it has some disadvantages. It is sometimes difficult to differentiate individual user sessions, because the same host address may be used by multiple users. To create a more accurate session file, the data analyst can combine referring pages with the host address to identify each individual user. Cookies are the best choice if they are allowed to be placed on the users' computers.
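
The idea of combining the host address with the referring page can be sketched roughly as follows. This is my own simplified Python illustration, not the lecture's method: the function `build_sessions`, the sample entries and the `(host, page, referrer)` tuple layout are assumptions, the entries are assumed to be already sorted by date and time, and "-" stands for a request with no referring page.

```python
from collections import defaultdict

def build_sessions(log_entries):
    """Group page views into sessions per host address.

    A request joins the most recent session from the same host that has
    already visited its referring page; otherwise it starts a new session.
    """
    sessions_by_host = defaultdict(list)  # host -> list of sessions
    for host, page, referrer in log_entries:
        sessions = sessions_by_host[host]
        for session in reversed(sessions):
            if referrer in session:
                session.append(page)
                break
        else:
            # No session from this host contains the referrer.
            sessions.append([page])
    return dict(sessions_by_host)

# Hypothetical log entries: two users share the host 10.0.0.5.
entries = [
    ("10.0.0.5", "/home", "-"),
    ("10.0.0.5", "/products", "/home"),
    ("10.0.0.5", "/home", "-"),
    ("10.0.0.5", "/cart", "/products"),
    ("10.0.0.5", "/contact", "/home"),
    ("10.0.0.9", "/home", "-"),
]
sessions = build_sessions(entries)
for host, host_sessions in sessions.items():
    print(host, host_sessions)
```

This heuristic can still guess wrongly, for example when two users on the same host follow the same path, which is why cookies give a more reliable result when they are available.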

I find this case study of web usage mining very interesting.
