ETL

Related: BI, In-memory databases

Important:
http://dssresources.com/papers/features/langseth/langseth02082004.html - printed
http://www.sitepoint.com/kiba-etl-done-right/ - printed
http://thibautbarrere.com/2015/04/05/how-to-write-solid-data-processing-code/ - printed
https://gigaom.com/2011/07/04/big-data-in-real-time-is-no-fantasy/ - should read the comments

Videos:
http://www.youtube.com/watch?v=W5CTngwuYPY - ETL Essential Fundamentals
http://www.youtube.com/watch?v=E6MmE68j82A - Oracle Data Integrator for Scalable High Performance ETL Architecture
http://www.youtube.com/watch?v=rADepO14VHA - ETL Testing: How to Achieve 100% Data Validation
http://www.youtube.com/watch?v=9Akvz2x0az4 - Data Warehouse tutorial. Creating an ETL.
http://www.youtube.com/watch?v=Vq_CKBQd4zY - ETL & ELT using Informatica
http://www.youtube.com/watch?v=I_Ae3suaL-U - What is SSIS, SSAS and SSRS (part 1) with sample demo

Articles:
http://www.firstretail.com/blog/Time_for_Semantic_ETL - printed
http://dsrealtime.wordpress.com/2007/11/23/what-is-real-time-etl-anyway/ - printed
http://searchitchannel.techtarget.com/feature/Supporting-real-time-BI-with-real-time-ETL - printed
http://www.sqlstream.com/solutions/continuous-etl/ - printed
http://my.safaribooksonline.com/book/databases/business-intelligence/9780470635179/real-time-data-integration/introduction_to_real-time_etl - Safari book
http://www.information-management.com/blogs/data_warehousing_etl_soa_service_oriented_architecture-10018216-1.html - not accessible
http://www.developer.com/db/best-practices-etl-development-for-data-warehouse-projects.html
https://msdn.microsoft.com/en-us/library/cc671624.aspx
http://searchbusinessintelligence.techtarget.in/tip/Six-ETL-best-practices-followed-by-Shoppers-Stop
http://casertaconcepts.com/resources/best-practices/data-integrationetl-best-practices/
https://developer.gooddata.com/article/recommended-etl-development-practices
http://www.kimballgroup.com/category/etl-and-data-quality/page/2/
https://community.informatica.com/thread/34785?db=5
http://us.analytics8.com/images/uploads/general/2011_A8_ETL_Best_Practices.pdf
http://www.cptech.com/wp-content/uploads/2014/07/ETL_Check_List_2014.pdf
http://www.element61.be/e/compete.asp?ServiceId=136

Books:
http://homes.dcc.ufba.br/~mauricio052/Material%20Artigo/Wiley%20Publishing%20-%20The%20Data%20Warehouse%20ETL%20Toolkit.pdf
http://www.cs.uoi.gr/~pvassil/publications/2009_DB_encyclopedia/Extract-Transform-Load.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.5382&rep=rep1&type=pdf
http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764567578.html
http://www.amazon.com/The-Data-Warehouse%C3%82-ETL-Toolkit-Techniques/dp/0764567578
https://docs.oracle.com/cd/E11882_01/owb.112/e10935.pdf
http://erwin.com/content/whitepapers/ca_erwin_building_etl_processes_sql.pdf
https://fenix.tecnico.ulisboa.pt/downloadFile/3779572230311/licao..
http://users.itk.ppke.hu/~szoer/DW/Kimball%20&%20Caserta%20-The%20Data%20Warehouse%20ETL%20Toolkit%20%5BWiley%202004%5D.pdf

What are popular ETL tools?

  • Pentaho Kettle
  • Data Junction (Pervasive Software)
  • SQL*Loader (Oracle)
  • InfoSphere DataStage (IBM)
  • PR3systems.com
  • Informatica
  • DT/Studio (Embarcadero Technologies)
  • Ab Initio
  • CloverETL (http://www.cloveretl.com/)
  • Oracle Warehouse Builder
  • Oracle Data Integrator
  • Microsoft SQL Server Integration Services (SSIS)
  • TransformOnDemand (Solonde)
  • Transformation Manager
  • DataMirror
  • SQLstream Blaze
  • Flume
  • Storm
  • Spark
  • MetaMatrix
  • Tibco
  • DataStax
  • Appistry
  • Accenture
  • HStreaming
  • Kiba

Do we need real-time ETL?

Real-time ETL can be complex and expensive. Whether it is worth the cost is a question only you can answer: look at your business situation and requirements and decide whether you genuinely need data in the warehouse in real time, or whether scheduled batch loads are good enough.

Do we need to have ETL?

Setting up ETL can be complex, and it brings an ongoing obligation to validate and test that the ETL is working correctly. Can we avoid ETL altogether? Could database replication serve instead?
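Whichever route we take, some validation is unavoidable. A minimal sketch of one such check follows, comparing row counts between a source table and its target copy; sqlite3, the file names, and the table names are hypothetical stand-ins, and a real pipeline would add checksums and column-level checks.

    # Sketch of a basic ETL validation check: source vs. target row counts.
    # Database paths and table names are illustrative assumptions.
    import sqlite3

    def row_count(db_path, table):
        # Count rows in one table of a SQLite database.
        with sqlite3.connect(db_path) as conn:
            return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    source_count = row_count("source.db", "orders")
    target_count = row_count("warehouse.db", "fact_orders")

    if source_count != target_count:
        print(f"Validation failed: {source_count} source rows vs {target_count} loaded rows")
    else:
        print("Row counts match")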

What is ETL and when do we need it?

Every day, more and more companies are thinking about building a data warehouse to better analyze their data and to provide valuable information for reducing costs or boosting sales. A data warehouse is a common repository of information about a company's activities and operations, sourced from transactional data. It allows us to ask real-life business questions such as which brand sold the most products this month, or who is my top-performing salesperson. Because transactional databases are heavily used every day to store the company's operational data, they cannot also be used to answer the complex business questions that enable a manager to make good decisions. So how does data from the everyday transactional databases get moved or copied into the data warehouse? This is where ETL comes into play.

Companies know that they have valuable data throughout the organization. The problem is that this data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats: enterprise databases, Access databases, Excel files, flat files, or data from external suppliers and business partners. For example, a financial institution might have information on a customer in several departments, with each department listing that customer's information in a different way.

The membership department might list the customer by name, whereas the accounting department might list the customer by an ID number. ETL can bring all this data together and consolidate it into a uniform presentation. The purpose of every ETL effort is to help the business increase sales or reduce costs by making good decisions based on facts, so every ETL project starts with the business requirements. Several other requirements are involved as well, such as compliance, security, data cleansing, data archiving and management requirements, and working within the constraints of existing or expanding resources. These functions are combined into one tool that extracts data out of one database or data source and places it into another.

ETL is the software that enables businesses to consolidate and move their data. The data can come from any source and can be in different formats; ETL is powerful enough to handle such disparities.

We may need ETL if:

  • we have to work with data from multiple sources
  • we do not have access to the OLTP systems

What are the 4 basic steps in ETL?

  • The first step in an ETL process is to map the data between the source systems and the target database.
  • The second step is to cleanse the source data in the staging area.
  • The third step is to transform the cleansed source data.
  • The fourth step is to load the transformed data into the target system. (A minimal sketch of all four steps follows this list.)
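A minimal sketch of these four steps, assuming in-memory lists of dictionaries stand in for the source system, staging area, and target table; the column names and the cents-to-dollars rule are purely illustrative.

    # Sketch of map -> cleanse -> transform -> load on toy data.
    COLUMN_MAP = {"cust_nm": "customer_name", "amt": "amount"}  # step 1: source-to-target mapping

    def map_columns(row):
        # Rename source columns to the target names.
        return {COLUMN_MAP.get(k, k): v for k, v in row.items()}

    def cleanse(rows):
        # Step 2: drop rows missing required fields in the staging area.
        return [r for r in rows if r.get("customer_name")]

    def transform(rows):
        # Step 3: apply a business rule (convert amount from cents to dollars).
        return [{**r, "amount": r["amount"] / 100} for r in rows]

    def load(rows, target):
        # Step 4: write the cleansed, transformed rows to the target.
        target.extend(rows)

    source_rows = [{"cust_nm": "Acme", "amt": 1250}, {"cust_nm": None, "amt": 300}]
    staging = [map_columns(r) for r in source_rows]
    warehouse = []
    load(transform(cleanse(staging)), warehouse)
    print(warehouse)   # [{'customer_name': 'Acme', 'amount': 12.5}]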

Extraction is the process of collecting data, often from many locations and from different heterogeneous systems. The raw data is typically written directly to a staging area with only minimal restructuring. This keeps the original extract as simple and fast as possible, and gives the flexibility to restart the extract in the event of an interruption.
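As a sketch of such a raw extract, the snippet below copies source rows to a staging file with only minimal restructuring, so the extract stays fast and can simply be rerun after an interruption; the CSV source and file names are assumptions for illustration.

    # Raw extract to a staging file: no cleansing or business logic yet.
    import csv
    import json

    def extract_to_staging(source_csv, staging_path):
        with open(source_csv, newline="") as src, open(staging_path, "w") as staging:
            for row in csv.DictReader(src):
                # One JSON record per line, written as-is from the source.
                staging.write(json.dumps(row) + "\n")

    extract_to_staging("orders_export.csv", "orders_staging.jsonl")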

One important function of ETL is cleansing the data. The consolidation step also includes handling duplicate or fragmentary records, so that what passes from the extraction portion of the process to the load portion is easier to assimilate and store.
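A small sketch of the cleansing rules just described, collapsing duplicate records and dropping fragmentary ones; the customer_id key and the required fields are assumptions.

    # Cleansing: drop fragmentary records and collapse duplicates.
    def cleanse(rows, required=("customer_id", "email")):
        seen = set()
        cleaned = []
        for row in rows:
            if any(not row.get(field) for field in required):
                continue                      # fragmentary record: skip it
            if row["customer_id"] in seen:
                continue                      # duplicate: keep the first occurrence
            seen.add(row["customer_id"])
            cleaned.append(row)
        return cleaned

    print(cleanse([
        {"customer_id": 1, "email": "a@example.com"},
        {"customer_id": 1, "email": "a@example.com"},   # duplicate
        {"customer_id": 2, "email": None},              # fragmentary
    ]))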

After extraction, the data is transformed according to the specific business logic involved so that it can be sent to the target repository. Transformation applies rules or lookup tables and combines or aggregates the data with other data. For example, a person's sex might be recorded as male in three different ways across three systems: a capital M, the word "male", or the number zero. ETL software recognizes that these entries mean the same thing and converts them to the target format. In addition, the ETL process can standardize name and address fields, verify telephone numbers, or expand records with additional fields containing demographic data from other systems. This portion of the ETL equation is the most powerful: ETL can transform not only data from different departments but also data from entirely different sources. For example, data in an email program such as Microsoft Outlook could be transformed right alongside data from an SAP manufacturing application.
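The lookup-table transformation in the example above might look roughly like this; the target codes ("M", "F", "U") are assumptions, not a standard.

    # Lookup-table transformation: map the three source encodings to one target code.
    SEX_LOOKUP = {"M": "M", "MALE": "M", "0": "M",
                  "F": "F", "FEMALE": "F", "1": "F"}

    def standardize_sex(value):
        # Normalize the raw value, then look it up; "U" marks unknown values.
        return SEX_LOOKUP.get(str(value).strip().upper(), "U")

    for raw in ["M", "male", 0, "Female", "?"]:
        print(raw, "->", standardize_sex(raw))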

The final step in the process is loading the data into the target repository. Although this step may seem trivial, it is usually the point where exception and error handling become very important. Each organization needs to decide how these exceptions and errors are addressed so that the final data is fully understood. ETL processes can run in batch mode, as scheduled executions of a job or set of jobs, or in real time using techniques such as triggers or events.
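A sketch of a load step with the kind of exception handling described above: rows that violate a constraint or are malformed go to a reject list instead of silently disappearing. sqlite3 and the table layout are only stand-ins.

    # Load with exception handling: bad rows are diverted, not lost.
    import sqlite3

    def load(rows):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE fact_sales (order_id INTEGER PRIMARY KEY, amount REAL)")
        rejects = []
        for row in rows:
            try:
                conn.execute("INSERT INTO fact_sales VALUES (?, ?)",
                             (row["order_id"], row["amount"]))
            except (sqlite3.IntegrityError, KeyError) as exc:
                rejects.append((row, str(exc)))    # keep the row and the reason
        conn.commit()
        return conn, rejects

    conn, rejects = load([{"order_id": 1, "amount": 9.99},
                          {"order_id": 1, "amount": 5.00},   # duplicate key
                          {"amount": 3.50}])                 # missing order_id
    loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    print(loaded, "loaded,", len(rejects), "rejected")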

ETL tools can support star or snowflake schema databases, and some provide specific utilities such as handling slowly changing dimensions.
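For illustration, a Type 2 slowly changing dimension update might be handled roughly as below: when a tracked attribute changes, the current row is closed out and a new current row is added so history is preserved. The dictionary layout and attribute names are assumptions.

    # Type 2 slowly changing dimension: close the old row, add a new current one.
    from datetime import date

    dim_customer = [
        {"customer_id": 42, "city": "Boston", "valid_from": date(2020, 1, 1),
         "valid_to": None, "current": True},
    ]

    def apply_scd2(dim, customer_id, new_city, change_date):
        # Find the current row for this customer, if any.
        current = next((r for r in dim if r["customer_id"] == customer_id and r["current"]), None)
        if current and current["city"] != new_city:
            current["valid_to"], current["current"] = change_date, False
            dim.append({"customer_id": customer_id, "city": new_city,
                        "valid_from": change_date, "valid_to": None, "current": True})

    apply_scd2(dim_customer, 42, "Chicago", date(2024, 6, 1))
    for row in dim_customer:
        print(row)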

A typical ETL process flow hides many details and considerations along the way. As a reminder, ETL follows the extract, transform, load order. ELT, by contrast, is the process of extract, load, and transform, where the transformation is accomplished by database utilities such as stored procedures, triggers, or SQL code. It is common for organizations to use a blend of ETL and ELT approaches.
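To contrast with the ETL sketches above, an ELT flow loads the raw rows first and lets the database do the transformation in SQL; sqlite3 and the table names here are illustrative stand-ins.

    # ELT sketch: load raw data, then transform inside the database with SQL.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (region TEXT, amount_cents INTEGER)")

    # Load: copy the extracted rows in essentially untouched.
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                     [("EU", 1250), ("EU", 250), ("US", 900)])

    # Transform: the database does the work (aggregation and unit conversion).
    conn.execute("""
        CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount_cents) / 100.0 AS total_dollars
        FROM raw_sales
        GROUP BY region
    """)
    print(conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())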

The ETL approach is probably the most popular. ETL tools provide programming capabilities for complex operations, and some allow external references to C++ or Java code for even more extensibility. Some ETL tools scale extremely well and generally support a variety of hardware and database vendors. As with any technological solution, there are alternatives and different approaches to the same problem. The ELT approach can have attractive benefits for certain solutions, depending on the target database's capabilities, so it is important to stay open-minded when designing and developing a data warehousing solution. ETL tools, for their part, have matured to the point where they do what they were designed to do very well, and they are often packaged with other best-of-breed tools such as data profiling, data cleansing, job scheduling, and business process modeling to provide a complete and comprehensive solution.

Before ETL tools evolved, this entire process was done manually with SQL code written by programmers, and each interface was custom-developed. The work was tedious and cumbersome and involved many resources, complex coding, and long hours. ETL tools eliminate these difficulties and, if chosen wisely, can significantly reduce development costs.
