Tutorials

Four tutorials will be offered during the conference.

Tutorial 1: Detecting Clones, Copying and Reuse on the Web

Time & Date: 13:30-15:00, April 16th
Speakers: Xin Luna Dong and Divesh Srivastava

Abstract:
In recent years, the Web has made a vast amount of useful information available. However, the Web technologies that have enabled sources to share their information have also made it easy for sources to copy from each other and often publish without proper attribution. Understanding the copying relationships between sources has many benefits, including helping data providers protect their own rights, improving various aspects of data integration, and facilitating in-depth analysis of information flow.

The importance of copy detection has led to a substantial amount of research in many disciplines of Computer Science, based on the type of information considered. The Information Retrieval community has devoted considerable effort to finding plagiarism, near-duplicate web pages and text reuse. The Multimedia community has considered techniques for copy detection of images and video, especially in the presence of distortion. The Software Engineering community has examined techniques to detect clones of software code. Finally, the Database community has focused on mining and making use of overlapping information between structured sources across multiple databases and more recently on copy detection of structured data across sources.

In this seminar, we explore the similarities and differences between the techniques proposed for copy detection across the different types of information. We do this with illustrative examples that would be of interest to data management researchers and practitioners. We also examine the computational challenges associated with large-scale copy detection, indicating how copying can be detected efficiently, and identify a range of open problems for the community.

Bio:
Xin Luna Dong is a researcher at AT&T Labs-Research. She received her Ph.D. from the University of Washington in 2007, a Master’s Degree from Peking University in China in 2001, and a Bachelor’s Degree from Nankai University in China in 1998. Her research interests include databases, information retrieval and machine learning, with an emphasis on data integration, data cleaning, personal information management, and Web search. She has led the Solomon project, whose goal is to detect copying between structured sources and to leverage the results in various aspects of data integration, and the Semex personal information management system, which received the Best Demo award (one of top-3) at SIGMOD’05. She was the associate editor of IEEE Data Engineering Bulletin 9/2011, co-chaired WebDB’10, and has served on the program committees of SIGMOD’11, VLDB’11, PVLDB’10, WWW’10, ICDE’10, VLDB’09, etc. She has recently presented two tutorials at SIGMOD and VLDB.

Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech. from the Indian Institute of Technology, Bombay. He is on the board of trustees of the VLDB Endowment and an associate editor of the ACM Transactions on Database Systems. He has served as the program committee co-chair of many conferences, including VLDB 2007. His research interests and publications span a variety of topics in data management. He has presented tutorials on “Data Stream Query Processing” (with Nick Koudas) at VLDB 2003 and ICDE 2005, on “Record Linkage: Similarity Measures and Algorithms” (with Nick Koudas and Sunita Sarawagi) at VLDB 2005 and SIGMOD 2006, on “Anonymized Data: Generation, Models, Usage” (with Graham Cormode) at SIGMOD 2009 and ICDE 2010, and on “Information Theory for Data Management” (with Suresh Venkatasubramanian) at VLDB 2009 and SIGMOD 2010.

Tutorial 2: Query Processing over Uncertain and Probabilistic Databases

Time & Date: 15:30-17:00, April 16th
Speakers: Lei Chen and Xiang Lian

Abstract:
Recently, query processing over uncertain data has become increasingly important in many real applications such as location-based services (LBS), sensor network monitoring, object identification, and moving object search. In many of these applications, data are inherently uncertain and imprecise; thus, we can either assign a probability to each data object or model each object as an uncertainty region. Based on these models, we have to redefine and study queries over uncertain data. In this tutorial, we will first introduce data models that are used to model uncertain and probabilistic data. Then, we will discuss various types of queries together with their query processing techniques. After that, we will introduce recent trends in query processing over uncertain non-traditional databases, such as sets and graphs. Finally, we will highlight some future research directions.

Bio:
Lei Chen received the BS degree in computer science and engineering from Tianjin University, Tianjin, China, in 1994, the MA degree from Asian Institute of Technology, Bangkok, Thailand, in 1997, and the PhD degree in computer science from the University of Waterloo, Waterloo, Ontario, Canada, in 2005. He is currently an associate professor in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include probabilistic and uncertain databases, multimedia and time series databases, privacy-preserved data publishing, and sensor and p2p network data management. He has published more than 100 conference and journal papers. He received best paper awards at DASFAA 2009 and 2010. He is currently an associate editor of IEEE Transactions on Knowledge and Data Engineering (TKDE). He is a PC track chair for ACM SIGMM 2011 and IEEE ICDE 2012. He has served as a PC member for SIGMOD, VLDB, ICDE, SIGMM, and WWW. He is a member of the ACM and IEEE. He also serves as the vice-chairman of the ACM Hong Kong Chapter.

Xiang Lian received the BS degree from the Department of Computer Science and Technology, Nanjing University, in 2003. He obtained the PhD degree from the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, in 2009. From 2009 to 2011, he worked as a post-doctoral fellow in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. He is now an assistant professor in the Department of Computer Science at the University of Texas - Pan American. His research interests include query processing over uncertain databases, streaming time series, spatial databases, and inconsistent probabilistic databases.

Tutorial 3: Data Stream Mining and Its Applications

Time & Date: 10:45-12:15, April 17th
Speakers: Latifur Khan and Wei Fan

Abstract:
Data streams are continuous flows of data. Examples of data streams include network traffic, sensor data, call center records and so on. Their sheer volume and speed pose a great challenge to the data mining community. Data streams demonstrate several unique properties: infinite length, concept-drift, concept-evolution, feature-evolution and limited labeled data. Concept-drift occurs in data streams when the underlying concept of the data changes over time. Concept-evolution occurs when new classes evolve in streams. Feature-evolution occurs when the feature set varies over time in data streams. Data streams also suffer from scarcity of labeled data, since it is not possible to manually label all the data points in the stream. Each of these properties adds a challenge to data stream mining.

Multi-step methodologies and techniques, and multi-scan algorithms, suitable for knowledge discovery and data mining, cannot be readily applied to data streams. This is due to well-known constraints such as bounded memory, high-speed data arrival, the need for online and timely data processing, and the need for one-pass techniques (i.e., raw data is forgotten once processed). In spite of the success and extensive study of stream mining techniques, there is no single tutorial dedicated to a unified study of the new challenges introduced by evolving stream data, such as change detection, novelty detection, and feature evolution. This tutorial presents an organized picture of how to apply various data mining techniques to data streams: in particular, how to handle classification and clustering in evolving data streams by addressing these challenges. The importance of research in data stream mining is evident in the recent launch of large-scale stream processing prototypes in many important application areas. At the same time, the commercialization of stream processing (e.g., IBM InfoSphere Streams) brings new challenges and research opportunities to the Data Mining (DM) community. In this tutorial, a number of applications of stream mining will be presented, such as adaptive malicious code detection, online malicious URL detection, evolving insider threat detection, and textual stream classification.

Bio:
Latifur R. Khan is currently an Associate Professor in the Computer Science department at the University of Texas at Dallas (UTD), where he has taught and conducted research since September 2000. He received his Ph.D. and M.S. degrees in Computer Science from the University of Southern California (USC), USA, in August 2000 and December 1996, respectively. His research work is supported by grants from NASA, the Air Force Office of Scientific Research (AFOSR), the National Science Foundation (NSF), the Nokia Research Center, and Raytheon. Dr. Khan's research areas cover data mining, multimedia information management, and the semantic web. He has published more than 160 papers in data mining and database conferences, such as ICDM, ECML/PKDD, PAKDD, AAAI, and ACM Multimedia, and in journals such as VLDB, TKDE, Bioinformatics, KAIS, etc. Dr. Khan has served as a PC member of several conferences such as KDD, ICDM, SDM, and PAKDD. Dr. Khan is currently serving on the editorial boards of a number of journals, including IEEE Transactions on Knowledge and Data Engineering (TKDE).

Wei Fan received his PhD in Computer Science from Columbia University in 2001 and has been working at IBM T.J. Watson Research since 2000. He has published more than 80 papers in top data mining, machine learning and database conferences, such as KDD, SDM, ICDM, ECML/PKDD, SIGMOD, VLDB, ICDE, AAAI, ICML, IJCAI, etc. Dr. Fan has served as Associate Editor of TKDD, Area Chair and Senior PC of SIGKDD'06/10, SDM'08/10/11/12 and ICDM'08/09/10, sponsorship co-chair of SDM'09, and award committee member of ICDM'09/11. His main research interests and experience are in various areas of data mining and database systems, such as risk analysis, high performance computing, extremely skewed distribution, cost-sensitive learning, data streams, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, novel applications and commercial data mining systems. His thesis work on intrusion detection has been licensed by a start-up company since 2001. His co-teamed submission that uses Random Decision Tree (www.dice4dm.com) won the ICDM'08 Championship. His co-authored paper in ICDM'06 that uses "Randomized Decision Tree" to predict skewed ozone days won the best application paper award. His co-authored paper in KDD'97 on the distributed learning system "JAM" won the runner-up best application paper award. He received IBM Outstanding Technical Achievement Awards in 2010 for his contribution to building InfoSphere Streams.

Tutorial 4: Storing, Querying, Summarizing, and Comparing Molecular Networks: The State-Of-The-Art

Time & Date: 10:30-12:00, April 18th
Speakers: Sourav S Bhowmick and Boon-Siew Seah

Abstract:
A grand challenge of systems biology is to model the cell. The cell can be viewed as an integrated network of cellular functions. Each cellular function is defined by an interconnected ensemble of molecular networks and represents the backbone of molecular activity within the cell. The critical role played by these networks, along with rapid advancement in high-throughput techniques, has led to an explosion in molecular interaction data. In this tutorial, we explore the data management and mining techniques that have been proposed in the literature for storing, querying, summarizing, and comparing molecular networks and pathways. The tutorial offers an introduction to these issues and a synopsis of the state-of-the-art.

Bio:
Sourav S Bhowmick is an Associate Professor in the School of Computer Engineering, Nanyang Technological University and the Director of the Centre for Advanced Information Systems (CAIS). He is currently a Visiting Associate Professor at the Biological Engineering Division, Massachusetts Institute of Technology (MA, USA). He also holds the position of Singapore-MIT Alliance (SMA) Fellow in the Computation and Systems Biology program (2005–2012). Sourav received his Ph.D. in computer engineering in 2001. Sourav’s current research interests include tree and graph data management, HCI-aware data management, database usability, social media & web data management, data mining, and computation & systems biology. He has published more than 120 papers in major international database, data mining, and bioinformatics conferences and journals such as VLDB, IEEE ICDE, ACM WWW, ACM SIGMOD, ACM SIGKDD, ACM MM, ACM CIKM, ACM BCB, ACM SIGHIT, IEEE TKDE, ACM CS, Information Systems, and DKE. He is serving as a PC member of various database, data mining, and bioinformatics conferences and workshops and as a reviewer for various database and data mining journals. He is serving as a program chair/co-chair of several international workshops and conferences. He is a member of the editorial boards of several international journals. Sourav has been a tutorial speaker at several international conferences such as ER 2006, APWeb 2008, WAIM 2008, PAKDD 2009 and 2011, and DASFAA 2011. Sourav has received Best Paper Awards at ACM CIKM 2004 and ACM BCB 2011 for papers related to evolution mining and biological network summarization, respectively. He was also nominated for the Excellence in Teaching Award for three consecutive years (2003–2005).

Boon-Siew Seah is a senior doctoral student of Sourav in the School of Computer Engineering, Nanyang Technological University. His research interests include molecular network mining and graph mining. His research has been published in BCB 2011, SIGHIT 2012, and BMC Bioinformatics. He also received a Best Paper Award at ACM BCB 2011.