Friday, October 24, 2014

Course Outline: Business Intelligence and Data Warehousing Architecture


Section 0: Introductions
  • Instructor & class introductions
Section 1: The Architectures
  • The Four Architectures
    • Information Architecture
    • Data Architecture
    • Technology Architecture
    • Product Architecture
  • Data Integration Framework (DIF)
    • Architecture
    • Processes & Data Stores
    • Standards
    • Tools
    • Resources & Skills

Section 2: DIF Processes
  • Data Preparation
    • Data Sourcing
    • Data Cleansing
    • Data Quality
    • Data Transformation
    • Data Loading

  • Data Franchising
    • Data Filtering
    • Data Summarization & Aggregation
    • Data Transformation
    • Data Loading
  • Information Access & Analytics
    • Information Access & Reporting
    • Analytics & Performance Management

  • Metadata Management
    • Inter-tool interfaces
    • Audit & What-If Capability

  • Data Management
    • Data Modeling
    • Data Profiling
    • Database Management
Section 3: Data Store Components
  • Data Modeling Basics
    • Conceptual, Logical & Physical Models
    • Entity-Relationship & Dimensional Modeling

  • Data Structure Concepts
    • Facts, Dimensions, Reference
    • Types of Keys

  • Data Structure Options
    • Star
    • Snowflake
    • Normalized (3NF)
    • Denormalized
    • Others
    • Why do these structures matter?

  • Metadata
    • Technical
    • Business
    • Process
    • Why does metadata matter?

Section 4: DIF Data Stores
  • DIF Data Stores
    • Data Sources
    • Data Warehouse
    • Data Marts
    • Cubes
    • Data Shadow Systems
    • Operational Data Stores (ODS)
    • Data Staging
    • Best Practices & Best Fit Considerations

  • DIF Architectural Options
    • Data Warehouse vs. Data Mart
    • Stand-alone, Federated & Hub and Spoke
    • “Closed loop”
    • Comparison of Architectural Options
Section 5: DIF Tools & Technology
  • Extract, Transform & Loading (ETL)
  • Enterprise Information Integration (EII)
  • Enterprise Application Integration (EAI)
  • Data Profiling
  • Data Quality & Cleansing
  • Metadata Management
  • What about unstructured data?
  • Searching for information

Section 6: DIF Standards
  • Project management
  • Software development
  • Technology and products
  • Architecture
  • Data
Section 7: Conclusions
  • Highlights
  • References & Resources

Data Architect, Data Warehouse Architect and BI Solution Architect

What is the difference between a data warehouse architect and a data architect? And how about a BI solution architect?

Both data architects and data warehouse architects do data modelling, as in ERWin or Embarcadero ER Studio work. But a data warehouse architect specialises in designing the data model for a data warehouse, whereas a data architect specialises in designing the data model for the databases used by transactional systems.

A data warehouse architect does a lot more than just data modelling. They also do the ETL and the infrastructure, areas that a data architect doesn’t normally cover.
For a data architect to be able to call themselves a data warehouse architect, they not only need to learn how to create a data model for a data warehouse (i.e. dimensional modelling); they also need to understand the ETL architecture, and they need to understand the servers. For example, they need to be able to write the specification for the production data warehouse servers (CPU, memory, disks) and handle other server matters such as clustering, mirroring and DR. They also need to understand physical database features such as table partitioning, file groups and materialised views.
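
As a rough illustration of the physical-design side, the sketch below builds monthly range-partitioning DDL in a SQL Server flavour for a hypothetical FactSales table partitioned on OrderDate; the object names and the twelve-month window are made up for the example, not taken from any particular project.

```python
from datetime import date


def monthly_boundaries(start_year: int, months: int) -> list[str]:
    """Return first-of-month boundary dates for a RANGE RIGHT partition function."""
    boundaries, year, month = [], start_year, 1
    for _ in range(months):
        boundaries.append(date(year, month, 1).isoformat())
        month += 1
        if month > 12:
            month, year = 1, year + 1
    return boundaries


def partition_ddl(table: str, column: str, boundaries: list[str]) -> str:
    """Build SQL Server-style partition function and scheme DDL for a date column."""
    values = ", ".join(f"'{b}'" for b in boundaries)
    return (
        f"CREATE PARTITION FUNCTION pf_{table}_{column} (date)\n"
        f"AS RANGE RIGHT FOR VALUES ({values});\n"
        f"-- in production each partition often maps to its own filegroup\n"
        f"CREATE PARTITION SCHEME ps_{table}_{column}\n"
        f"AS PARTITION pf_{table}_{column} ALL TO ([PRIMARY]);\n"
        f"-- the fact table is then created ON ps_{table}_{column}({column})"
    )


if __name__ == "__main__":
    print(partition_ddl("FactSales", "OrderDate", monthly_boundaries(2014, 12)))
```

The same idea, partitioning a large fact table by date, carries over to Oracle or Teradata, just with different syntax.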

In my book I explain that there are two sides to data warehouse architecture: the first is the logical architecture and the second is the physical architecture. A data warehouse architect needs to understand both.

The data warehouse architect role, in my opinion, demands four separate skills: data architecture, ETL, database platform and physical infrastructure. By “database platform” I mean SQL Server knowledge, Oracle knowledge, Teradata knowledge, Netezza knowledge, etc. For example, “SQL Server 2008 Data Warehousing Features” is a “must know” for a DW architect on SQL Server, whereas UPIs and USIs (unique primary and secondary indexes) are a “must know” for a DW architect on Teradata. If we design a DW on Oracle 11g R2, we need to know Oracle-specific DW features such as initialisation parameter settings, partition-wise joins, compression, parallel execution, etc.

A BI architect is more on the application side, as in SharePoint architecture, Hyperion architecture, Reporting Services architecture and Analysis Services architecture. For example: a BI system where Reporting Services runs on SharePoint, plus Excel Services and PerformancePoint Services (PPS) on SharePoint, and SSAS cubes reading from a warehouse or mart. On top of that there is some custom .NET coding for authentication or customised security, plus some self-service BI using Qlikview or PowerPivot.

Back to the data warehouse architect: the ETL aspect of the job is sometimes quite demanding. There is something called ETL architecture in warehousing, which is basically:

a) the architecture of the overnight batch, i.e. the structure of the tasks and workflows, the execution order of the workflows, the backup, the reconciliation, the checking and alerting, and the data quality. The overnight batch is not only about data loading / ETL; it also has data-serving elements, such as pre-processing reports (stored as PDFs, ready to be served for split-second response times) and refreshing OLAP cubes.

b) the architecture of the continuous feed throughout the day for real time warehousing

c) the physical infrastructure of the ETL, i.e. the servers, the databases, the data connections

d) the methods used to extract and load the data, i.e. sliding windows, partition swapping, flipping twin tables, identifying incremental extraction, change data capture, change tracking mechanisms, (filtered) replication or mirroring between warehouse and mart, and how to extract data from cubes (a minimal sketch of a watermark-based incremental load follows below).
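
To make items a) and d) more concrete, here is a minimal sketch of one step in an overnight batch: a watermark-based incremental load (a simple stand-in for full change data capture) followed by a row-count reconciliation and an alert. It uses sqlite3 purely as a stand-in for the real source and warehouse connections, and the table and column names (src_orders, stg_orders, etl_watermark, updated_at) are made up for the example.

```python
import logging
import sqlite3  # stand-in for the real source and warehouse connections

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_batch")


def get_watermark(dw: sqlite3.Connection) -> str:
    """Read the high-water mark left by the previous successful load."""
    row = dw.execute(
        "SELECT last_loaded_at FROM etl_watermark WHERE table_name = 'stg_orders'"
    ).fetchone()
    return row[0] if row else "1900-01-01 00:00:00"


def incremental_load(src: sqlite3.Connection, dw: sqlite3.Connection) -> None:
    """Extract only rows changed since the watermark, load them, then reconcile."""
    watermark = get_watermark(dw)
    rows = src.execute(
        "SELECT order_id, amount, updated_at FROM src_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()

    # Upsert into staging (assumes order_id is the primary key of stg_orders).
    dw.executemany(
        "INSERT OR REPLACE INTO stg_orders (order_id, amount, updated_at) "
        "VALUES (?, ?, ?)",
        rows,
    )

    # Reconciliation: compare what was extracted with what landed in staging.
    loaded = dw.execute(
        "SELECT COUNT(*) FROM stg_orders WHERE updated_at > ?", (watermark,)
    ).fetchone()[0]
    if loaded != len(rows):
        log.error("ALERT: extracted %d rows but staged %d", len(rows), loaded)
        raise RuntimeError("reconciliation failed; stop downstream steps")

    # Advance the watermark only after a successful, reconciled load.
    new_mark = max((r[2] for r in rows), default=watermark)
    dw.execute(
        "UPDATE etl_watermark SET last_loaded_at = ? WHERE table_name = 'stg_orders'",
        (new_mark,),
    )
    dw.commit()
    log.info("Loaded %d changed rows; watermark advanced to %s", loaded, new_mark)


if __name__ == "__main__":
    source = sqlite3.connect("source.db")        # the transactional system
    warehouse = sqlite3.connect("warehouse.db")  # the warehouse staging area
    incremental_load(source, warehouse)
```

In a real overnight batch this would be just one task in an ordered workflow (extract and load, reconcile, refresh the cubes, pre-render the reports), with the alert wired into whatever checking and monitoring the team already runs.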

If a data warehouse architect only understands dimensional modelling, the company will have problems in the ETL and infrastructure. The servers might not be ‘fit for purpose’; for example, the disk configuration might not be optimised for warehousing.

How about “BI solution architect”? A data warehouse is the back end; Business Intelligence is the front end. Data warehousing is about the data model, the ETL and the databases. BI is about the reports, the OLAP cubes, the analytical applications, the data mining, the KPIs, the dashboards, the scorecards and the performance management. It is a common perception that a BI solution architect is a “front end” person, an application person. This perception is incorrect.

A “solution architect” is responsible for the whole solution, not only the front end but also the back end. In the case of a BI solution architect, he or she needs to look after both the BI front end (reports, cubes, performance management, etc.) and the back end (data warehouse, ETL); it is impossible to be responsible for the BI without also looking after the data warehouse. In some companies the role is called “data warehouse solution architect” instead of “BI solution architect”. That doesn’t mean the role is only responsible for the back-end data warehouse; it is also responsible for the BI front end. Some companies have not one but several BI solution architects, each responsible for a certain area.

There are several different types of architect in IT, for example: system architect, data architect, solution architect, information architect and enterprise architect. This segregation of duties only happens in very large organisations (enterprises), for example a banking group with 3,000 IT staff. If the IT staff numbers only 100, one architect usually covers multiple functions. A system architect is responsible for infrastructure, including networks and servers. I have already covered the data architect and the solution architect. An information architect is responsible for the flow of information throughout the enterprise, including databases and data quality. An enterprise architect is responsible for all the applications in the enterprise, making sure they run in sync and adhere to the standards.


Wednesday, October 15, 2014

JasperSoft BI vs Pentaho Business Intelligence Solutions

·         Reports: Jaspersoft is known for its pixel-perfect reporting and uses iReport for designing reports, so for reporting Jaspersoft is the stronger candidate. Pentaho uses Pentaho Report Designer.
·         Dashboards: Pentaho provides much more capability and interactivity in dashboards. Dashboards designed in Pentaho are superior to Jaspersoft’s, both in functionality and aesthetically. Pentaho CE uses CDE/CDF, while Pentaho EE uses PDD. In Jaspersoft, dashboard functionality is present only in the Enterprise edition.
·         Mobile BI compatibility: Jaspersoft scores higher than Pentaho here. It has BI clients for the iOS and Android platforms, and both functionality-wise and user-interface-wise Jaspersoft mobile BI is better; there is even an open-source Apple iPhone application. Jaspersoft also provides a native mobile API, which helps developers create mobile applications. Pentaho users often have to rely on third-party tools to enhance the mobile BI experience, and Pentaho mobile BI is only present in the Enterprise edition.


Tableau vs Jaspersoft

JasperSoft is one of the top Open Source BI tools in the market (Pentaho is another tool positioned in the same space), finding a place in Gartner’s Magic Quadrant over the last few years. JasperSoft provides a comprehensive BI platform. The JasperSoft Enterprise Edition has a reporting server, ad hoc query, in-memory analysis, dashboarding, a reports library, OLAP and ETL. Due to its embeddable architecture and relaxed GNU General Public License terms and conditions, JasperSoft earns more than half of its business from OEMs, SaaS providers and other companies that integrate JasperSoft into their software offerings or internal applications.
Tableau, though not Open Source, is the undisputed leader in Gartner’s Magic Quadrant and is loved by business users and dashboard consumers across the globe for its visualization capabilities. In this respect, a head-to-head between JasperSoft and Tableau might seem like an apples-and-oranges comparison, but we find more and more existing and potential clients approaching us with exactly that question. It ties back to my previous point that large-scale implementation of traditional enterprise BI is expensive for SMBs; presently, both Tableau and JasperSoft offer relatively low TCO (for small implementations / numbers of consumers) and are hence mentioned together. So let’s get cracking:


COST
JasperSoft: JasperSoft Professional, 4-core single instance with unlimited users: Premium Support 37,500 USD/year, Standard Support 29,000 USD/year. Additional cores are added in multiples of two at 5,500 USD x 2/year. Generally, one core handles 10-12 heavy users (ad-hoc slice & dice) or 30-50 light users (business users viewing static reports). Cloud: Jaspersoft for AWS at 0.40 USD/hour (AWS costs 13 cents, so a total of 0.53 USD/hour, or roughly 3,700 USD/year), plus 1,500 USD/month or 7,000 USD/year for phone support.
Tableau: Tableau Desktop at 2,000 USD/user (perpetual license). Tableau Server at 1,000 USD/user (perpetual license) plus 20% maintenance/year; it also requires additional Desktop licenses for development. Cloud: Tableau Online at 500 USD/25 users/year, plus at least one Tableau Desktop Professional at 2,000 USD/user.

TALENT
JasperSoft: You have to identify third-party resources that can help you with implementation and future projects; the talent pool is limited though. Technical resources are required with at least a basic understanding of J2EE web application functionality and maintenance design.
Tableau: Anyone is a potential basic Tableau user, but more advanced users are not easy to come by.

VENDOR SUPPORT
JasperSoft: The community is helpful. Mixed reviews about support though: while some have complained about bad support with no solution offered, others have vouched for its superior support; this might have to do with the support contract. “The service has been at such a level that even if we did not require the Enterprise solution to fit our needs, the support alone would be worth the cost.” (a satisfied customer)
Tableau: Tableau support is great. The community is helpful as well.

EASE OF USE
JasperSoft: The install isn’t point-and-click; some effort is involved in server setup, maintenance and report generation, costing both time and money if tech support is solicited. The GUI could be made more intuitive, and it would be helpful to have more tutorials.
Tableau: Very easy to install and to quickly begin deriving insights, even for non-technical users. The GUI is extremely easy to use and numerous tutorials are available on the web.

STABILITY
JasperSoft: Many bugs reported; while JasperSoft has made improvements in catching certain errors, some users have reported seeing Java error dumps occasionally.
Tableau: Mostly error-free; unstable features are turned off immediately and helpful error messages are generally displayed.

APPLICATION INTEGRATION
JasperSoft: Integration into applications is great (especially Java apps).
Tableau: Same as above.

WEB PORTAL INTEGRATION
JasperSoft: Great, though still not a flawless process; it may require custom Java code to work.
Tableau: Easy and error-free.

EMBEDDED INSIGHTS
JasperSoft: The USP of JasperSoft: the server product is easily configurable with the company UI and can be deployed across webpages, portals and applications with absolutely no user limitations.
Tableau: You can embed views from Tableau Server across webpages, portals and applications. These follow the same licensing and permission restrictions used on the server; generally, people loading a webpage with an embedded view must also have an account on Tableau Server. If you have a core-based license you can alternatively select Enable Guest, which allows users to load the view without logging in.

ETL
JasperSoft: Uses the open-source Talend Studio for ETL, an unintuitive but powerful tool. Talend is a code generator, so jobs are translated into the corresponding defined language (Java or Perl), compiled and executed. It is necessary to have an expert in Java or Perl to take full advantage of the tool.
Tableau: No ETL functionality; some basic SQL and nifty connectors to various data sources, traditional and Big Data.

OLAP
JasperSoft: The Mondrian OLAP Workbench is a great tool for designing and deploying OLAP cubes for multi-dimensional analysis. However, users do encounter occasional glitches and need to find workarounds to meet a defined business requirement.
Tableau: This form of multi-dimensional modelling is completely absent in Tableau, replaced by optimized columnar storage (Extracts); Tableau does connect to OLAP databases though.

VISUALIZATION & DASHBOARDING
JasperSoft: While many of the advanced report-writing features are available through the GUI, customization may require knowledge of Java. Also, the client GUI tools are not as friendly as they could be.
Tableau: Report and dashboard building is a breeze: beautiful reports with great design built in, and dashboards with filters are easily set up.

AD-HOC SLICE & DICE
JasperSoft: Ad-hoc reporting does not work for everyone and can still be quite technical for non-technical users.
Tableau: Ad-hoc reporting is a breeze.

EXPORTING REPORTS
JasperSoft: Allows pixel-perfect reporting and exporting reports in various formats, including PDF, Excel, Excel with pagination, CSV, DOCX, RTF, Flash, ODT, ODS, Excel 2007, and Excel 2007 with pagination.
Tableau: Not as comprehensive; you can export views and data as an image or PDF file, Microsoft Access data, a crosstab to Microsoft Excel, or a Tableau packaged workbook.

SOFTWARE UPGRADES
JasperSoft: JasperSoft releases new software versions relatively often and the upgrade process is relatively unintuitive. Users have mentioned issues going between versions of Jasper Reports.
Tableau: Tableau makes it a point to ensure the process is hassle-free, with minimum disruptions that are well documented. However, some rework of workbooks may be necessary to replace features that have been edited or removed.
Among other things, the 2013 Gartner BI Report states the following about Tableau and JasperSoft:
Tableau: Tableau’s user counts grew by 75% from last year, but the platform is still largely departmentally deployed, albeit across multiple departments, with just below the survey average for user deployment size. Tableau’s products often fill an unmet need in large organizations that already have a BI standard, which makes them frequently deployed as a complementary capability to an existing BI platform. As a result, Tableau is still less likely to be considered an enterprise BI standard than the products of most other vendors. Given the success of Tableau and other interactive visualization vendors, leading traditional BI platform vendors – including Microsoft, MicroStrategy, IBM and SAP – delivered their own data discovery capabilities in 2012, mostly in the image of Tableau’s functionality, and are integrating and bundling data discovery capabilities with their BI and analytic platforms for free, or at a low cost, to proactively meet their customers’ business user ease-of-use requirements – and, more importantly, to defend their installed bases from the adoption and expansion of Tableau and the other data discovery vendors.
JasperSoft: JasperSoft earned below-average scores for ease of use for end users and developers. Despite these results, JasperSoft customers still report below-average BI platform ownership costs. Moreover, its customers maintain a positive view of the vendor’s future, and report successes with JasperSoft’s product (as defined by expanded usage) over the past year. One explanation for this paradox is that the value organizations derive from JasperSoft’s lower cost deployments is in line with their level of investment and expectations.
A final word about JasperSoft though – while JasperSoft’s price differentiation is important, it has also consistently experimented with product and service innovation, the latest example being JasperSoft for AWS, a cloud analytics server available on the Amazon Web Services Marketplace, which you can buy by the hour, paying only for what you use. Its recent acquisition by TIBCO may well be the first step towards Open Source BI’s future dominance.