There are no specific requirements needed to attend this course.
[overview] =>
To meet regulatory compliance requirements, CSPs (Communication Service Providers) can tap into Big Data analytics, which not only helps them meet compliance but, within the scope of the same project, can also increase customer satisfaction and thus reduce churn. In fact, since compliance is tied to the quality of service defined in a contract, any initiative towards meeting compliance will also improve the “competitive edge” of the CSPs. It is therefore important that regulators are able to advise and guide a set of Big Data analytics practices for CSPs that are of mutual benefit to both regulators and CSPs.
The course consists of 8 modules (4 on day 1 and 4 on day 2).
[category_overview] =>
[outline] =>
1. Module 1: Case studies of how telecom regulators have used Big Data analytics to enforce compliance
2. Module 2: Reviewing millions of contracts between CSPs and their users using unstructured Big Data analytics (an illustrative extraction sketch follows the topics below)
Elements of NLP (Natural Language Processing)
Extracting SLAs (service level agreements) from millions of contracts
Some of the known open source and licensed tools for contract analysis (eBravia, IBM Watson, KIRA)
Automatic discovery of contracts and conflicts through unstructured data analysis
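To make the contract-analysis topics above concrete, here is a minimal, illustrative Python sketch that pulls candidate SLA clauses out of contract text with simple keyword and number matching. It is not the method used by eBravia, IBM Watson or KIRA; the keyword list, sample text and sentence-splitting rule are assumptions for illustration only.
import re
# Assumed SLA keywords and sample text, for illustration only.
SLA_KEYWORDS = ("availability", "uptime", "latency", "call drop", "refund", "penalty")
NUMBER = re.compile(r"\d+(\.\d+)?\s*(%|ms|hours?|days?)", re.IGNORECASE)
def extract_sla_clauses(contract_text: str) -> list[str]:
    """Return sentences that mention an SLA keyword together with a numeric target."""
    sentences = re.split(r"(?<=[.!?])\s+", contract_text)
    return [
        s.strip()
        for s in sentences
        if any(k in s.lower() for k in SLA_KEYWORDS) and NUMBER.search(s)
    ]
sample = ("The provider guarantees 99.5% uptime per calendar month. "
          "Refunds are issued within 30 days of a verified outage. "
          "This agreement is governed by local law.")
print(extract_sla_clauses(sample))  # prints the uptime and refund sentences only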
3. Module 3: Extracting structured information from unstructured customer contracts and mapping it to the quality of service obtained from IPDR data and crowdsourced app data; metrics for compliance; automatic detection of compliance violations
4. Module 4: Using an app-based approach to collect compliance and QoS data: releasing a free regulatory mobile app to users to track and analyze automatically. In this approach the regulatory authority releases a free app and distributes it among users; the app collects data on QoS, spam, etc. and reports it back in analytics dashboard form:
Intelligent spam detection engine (for SMS only) to assist the subscriber in reporting
Crowdsourcing of data about offending messages and calls to speed up detection of unregistered telemarketers
Updates about action taken on complaints within the App
Automatic reporting of voice call quality (call drops, one-way connections) for users who have the regulatory app installed
Automatic reporting of Data Speed
5. Module 5: Processing regulatory app data for automatic alarm generation (alarms are generated and emailed/SMSed to stakeholders automatically; a rule sketch follows the topics below):
Implementation of dashboard and alarm service
Microsoft Azure based dashboard and SNS alarm service
AWS Lambda Service based Dashboard and alarming
AWS/Microsoft Analytic suite to crunch the data for Alarm generation
Alarm generation rules
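As a concrete illustration of Module 5, the following Python sketch applies one possible alarm rule to crowd-sourced QoS records. The record fields, the 5% threshold and the send_alarm() stub are assumptions; in a real deployment the stub would be replaced by an email/SMS/SNS integration.
from collections import defaultdict
CALL_DROP_THRESHOLD = 0.05  # assumed: alarm if more than 5% of calls drop in a region
def send_alarm(region: str, rate: float) -> None:
    # Stand-in for the email/SMS/SNS delivery to stakeholders described above.
    print(f"ALARM: call-drop rate {rate:.1%} in {region}")
def check_call_drops(records: list) -> None:
    calls, drops = defaultdict(int), defaultdict(int)
    for r in records:  # e.g. {"region": "Quito", "dropped": True}
        calls[r["region"]] += 1
        drops[r["region"]] += r["dropped"]
    for region, total in calls.items():
        rate = drops[region] / total
        if rate > CALL_DROP_THRESHOLD:
            send_alarm(region, rate)
check_call_drops([
    {"region": "Quito", "dropped": True},
    {"region": "Quito", "dropped": False},
    {"region": "Guayaquil", "dropped": False},
])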
6. Module 6: Using IPDR data for QoS and compliance (IPDR Big Data analytics; a PySpark aggregation sketch follows the topics below):
Metered billing by service and subscriber usage
Network capacity analysis and planning
Edge resource management
Network inventory and asset management
Service-level objective (SLO) monitoring for business services
Quality of experience (QoE) monitoring
Call Drops
Service optimization and product development analytics
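To illustrate the kind of IPDR aggregation Module 6 refers to, here is a minimal PySpark sketch that computes a call-drop rate and average throughput per operator. The input path, column names and the 2% threshold are assumptions for illustration.
# Illustrative PySpark aggregation over IPDR records; schema and paths are assumed.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("ipdr-qos").getOrCreate()
ipdr = spark.read.parquet("s3://example-bucket/ipdr/")  # hypothetical location
qos = (ipdr.groupBy("operator")
           .agg(F.avg(F.col("dropped").cast("double")).alias("call_drop_rate"),
                F.avg("throughput_kbps").alias("avg_throughput_kbps")))
qos.filter(F.col("call_drop_rate") > 0.02).show()  # assumed compliance threshold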
7. Module 7: Customer service experience and a Big Data approach to CSP CRM:
Compliance on Refund policies
Subscription fees
Meeting SLA and Subscription discount
Automatic detection of not meeting SLAs
8. Module 8: Big Data ETL for integrating different QoS data sources and combining them into a single dashboard with alarm-based analytics (a minimal integration sketch follows the topics below):
Using a PaaS cloud such as AWS Lambda or Microsoft Azure
Using a hybrid cloud approach
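As a minimal sketch of the Module 8 idea, the following AWS Lambda-style handler (Python) merges two assumed QoS feeds and publishes one dashboard metric through CloudWatch. The event fields, record layout and metric namespace are hypothetical; an Azure implementation would use its own SDK instead.
# Illustrative Lambda-style handler; feed names and fields are assumptions.
import json
import boto3
cloudwatch = boto3.client("cloudwatch")
def handler(event, context):
    app_records = json.loads(event["app_qos"])    # from the regulatory app feed
    ipdr_records = json.loads(event["ipdr_qos"])  # from the IPDR pipeline
    merged = app_records + ipdr_records
    drop_rate = sum(r["dropped"] for r in merged) / max(len(merged), 1)
    cloudwatch.put_metric_data(
        Namespace="RegulatorQoS",  # assumed namespace
        MetricData=[{"MetricName": "CallDropRate", "Value": drop_rate}],
    )
    return {"records": len(merged), "call_drop_rate": drop_rate}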
[language] => en
[duration] => 14
[status] => published
[changed] => 1700037381
[source_title] => Big Data Analytics for Telecom Regulators
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
)
[2] => bdatr
[3] => Array
(
[outlines] => Array
(
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
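As a small, concrete illustration of the Hub, Link and Satellite components covered in this outline, the following Python sketch creates simplified tables in SQLite. The table layouts are a teaching simplification (hash keys, load dates and record sources only), not the full Data Vault 2.0 standard.
# Simplified Hub / Link / Satellite tables in SQLite, for illustration only.
import sqlite3
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_bk TEXT, load_date TEXT, record_source TEXT);
CREATE TABLE hub_product  (product_hk  TEXT PRIMARY KEY, product_bk  TEXT, load_date TEXT, record_source TEXT);
CREATE TABLE link_order   (order_hk    TEXT PRIMARY KEY, customer_hk TEXT, product_hk TEXT, load_date TEXT, record_source TEXT);
CREATE TABLE sat_customer (customer_hk TEXT, load_date TEXT, name TEXT, email TEXT, record_source TEXT,
                           PRIMARY KEY (customer_hk, load_date));
""")
con.execute("INSERT INTO hub_customer VALUES ('c1', 'CUST-001', '2024-01-01', 'crm')")
con.execute("INSERT INTO sat_customer VALUES ('c1', '2024-01-01', 'Ada', 'ada@example.com', 'crm')")
print(con.execute("SELECT customer_bk, name FROM hub_customer JOIN sat_customer USING (customer_hk)").fetchall())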
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
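As a minimal sketch of what the hands-on part of this course works towards, the following PySpark Structured Streaming job reads a Kafka topic and prints records to the console. The broker address and topic name are placeholders, and the spark-sql-kafka connector package must be available on the cluster.
# Minimal Structured Streaming read from Kafka; broker and topic are assumed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "qos-events")                    # assumed topic
          .load()
          .select(col("value").cast("string").alias("payload")))
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()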
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
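For a first feel of Ignite from Python, here is a minimal sketch using the pyignite thin client against a local node on the default thin-client port. The cache name and key are illustrative; the course itself goes much further (SQL, distributed joins, persistence).
# Minimal put/get via the pyignite thin client; assumes a local Ignite node on port 10800.
from pyignite import Client
client = Client()
client.connect("127.0.0.1", 10800)
cache = client.get_or_create_cache("qos_cache")   # assumed cache name
cache.put("region:quito:call_drop_rate", 0.03)
print(cache.get("region:quito:call_drop_rate"))
client.close()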
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
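To show the single-model idea in practice, here is a minimal Apache Beam pipeline in the Python SDK that runs a batch word count on the local (direct) runner; the sample lines are made up for illustration.
# Minimal Beam (Python SDK) pipeline on the direct runner.
import apache_beam as beam
with beam.Pipeline() as pipeline:
    (pipeline
     | "Create" >> beam.Create(["spark flink beam", "beam beam"])
     | "Split"  >> beam.FlatMap(lambda line: line.split())
     | "Count"  >> beam.combiners.Count.PerElement()
     | "Print"  >> beam.Map(print))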
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Some of the topics included in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
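As a very small taste of the PyFlink DataStream API used in this course, the sketch below runs a local job that maps words to (word, 1) pairs and prints them. It assumes the apache-flink Python package is installed; the input collection is illustrative.
# Minimal PyFlink DataStream job run locally; input data is illustrative.
from pyflink.datastream import StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
stream = env.from_collection(["flink", "stream", "flink"])
stream.map(lambda word: (word, 1)).print()
env.execute("word-pairs-demo")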
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
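As a minimal sketch of the kind of exercise this course builds on, the following PySpark snippet creates a small DataFrame and runs a grouped aggregation; the data is made up for illustration.
# Minimal PySpark DataFrame aggregation; toy data for illustration only.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()
df = spark.createDataFrame(
    [("alice", 34, "EC"), ("bob", 45, "EC"), ("carol", 29, "PE")],
    ["name", "age", "country"],
)
df.groupBy("country").agg(F.avg("age").alias("avg_age")).show()
spark.stop()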
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
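To make the vertices-and-edges idea tangible before moving to distributed engines such as GraphX and Pregel, here is a small single-machine illustration using networkx: users are vertices, "follows" relationships are edges, and PageRank ranks the users. It is a stand-in for the distributed tools covered in the course, not a replacement for them.
# Single-machine graph modeling and PageRank with networkx, for illustration only.
import networkx as nx
g = nx.DiGraph()
g.add_edges_from([
    ("alice", "bob"), ("bob", "carol"),   # vertices = users, edges = "follows"
    ("carol", "alice"), ("dave", "carol"),
])
ranks = nx.pagerank(g, alpha=0.85)
for user, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {score:.3f}")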
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distribution big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + Mapreduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommended algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Theme model (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parser, word2vec (vector to word)
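As one common way to approach the text clustering ("theme model") items above on a small scale, the sketch below combines TF-IDF vectors with k-means in scikit-learn; the documents and the number of clusters are assumptions for illustration.
# TF-IDF + k-means text clustering sketch (scikit-learn); toy documents only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
docs = [
    "network outage refund policy",
    "refund for dropped calls",
    "new handset promotion discount",
    "discount on family data plan",
]
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for doc, label in zip(docs, labels):
    print(label, doc)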
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark
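For orientation, here is a minimal spark.ml (DataFrame-based) example that assembles two numeric columns into a feature vector and fits a logistic regression; the toy data and column names are illustrative only.
# Minimal spark.ml step: feature assembly + logistic regression on toy data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("spark-ml-demo").getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.0, 0.1), (1.0, 3.0, 0.9), (0.0, 0.5, 0.2), (1.0, 2.5, 0.8)],
    ["label", "calls_per_day", "drop_ratio"],
)
assembler = VectorAssembler(inputCols=["calls_per_day", "drop_ratio"], outputCol="features")
model = LogisticRegression(maxIter=20).fit(assembler.transform(df))
model.transform(assembler.transform(df)).select("label", "prediction").show()
spark.stop()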
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Real-life applications for this data mining technique include marketing, fraud detection, telecommunication and manufacturing.
In this instructor-led, live course, we introduce the processes involved in KDD and carry out a series of exercises to practice the implementation of those processes.
Audience
Data analysts or anyone interested in learning how to interpret data to solve problems
Format of the Course
After a theoretical discussion of KDD, the instructor will present real-life cases which call for the application of KDD to solve a problem. Participants will prepare, select and cleanse sample data sets and use their prior knowledge about the data to propose solutions based on the results of their observations.
[category_overview] =>
[outline] =>
Introduction
KDD vs data mining
Establishing the application domain
Establishing relevant prior knowledge
Understanding the goal of the investigation
Creating a target data set
Data cleaning and preprocessing
Data reduction and projection
Choosing the data mining task
Choosing the data mining algorithms
Interpreting the mined patterns
Summary and conclusion
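To tie the steps above together, here is a tiny end-to-end sketch of a KDD-style pass in Python: clean the data, select the target set, run a mining algorithm and interpret the result. The toy customer data and the choice of a decision tree are assumptions for illustration.
# Tiny KDD-style pass: cleaning, selection, mining, interpretation (toy data).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
raw = pd.DataFrame({
    "age":     [25, 40, None, 33, 52],
    "spend":   [120, 300, 80, None, 450],
    "churned": [0, 1, 0, 0, 1],
})
clean = raw.fillna(raw.mean(numeric_only=True))          # data cleaning and preprocessing
X, y = clean[["age", "spend"]], clean["churned"]         # creating the target data set
model = DecisionTreeClassifier(max_depth=2).fit(X, y)    # choosing and running the mining task
print(dict(zip(X.columns, model.feature_importances_)))  # interpreting the mined patterns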
[language] => en
[duration] => 21
[status] => published
[changed] => 1700037259
[source_title] => Knowledge Discovery in Databases (KDD)
[source_language] => en
[cert_code] =>
[weight] => -987
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => kdd
)
)
[codes] => Array
(
[0] => datavault
[1] => sparkstreaming
[2] => ksql
[3] => apacheignite
[4] => beam
[5] => apex
[6] => storm
[7] => nifi
[8] => nifidev
[9] => flink
[10] => sparkpython
[11] => graphcomputing
[12] => aitech
[13] => spmllib
[14] => kdd
)
)
[4] => Array
(
[regions] => Array
(
[ec_4966] => Array
(
[tid] => ec_4966
[title] => Guayaquil
[sales_area] => ec_ecuador
[venues] => Array
(
[ec_15661446] => Array
(
[vid] => ec_15661446
[title] => Guayaquil - Mall del Sol
[vfdc] => 175.00
[prices] => Array
(
[1] => Array
(
[remote guaranteed] => 5437
[classroom guaranteed] => 5787
[remote guaranteed per delegate] => 5437
[delegates] => 1
[adp] => 937
[classroom guaranteed per delegate] => 5787
)
[2] => Array
(
[remote guaranteed] => 6374
[classroom guaranteed] => 6844
[remote guaranteed per delegate] => 3187
[delegates] => 2
[adp] => 937
[classroom guaranteed per delegate] => 3422
)
[3] => Array
(
[remote guaranteed] => 7311
[classroom guaranteed] => 7902
[remote guaranteed per delegate] => 2437
[delegates] => 3
[adp] => 937
[classroom guaranteed per delegate] => 2634
)
[4] => Array
(
[remote guaranteed] => 8248
[classroom guaranteed] => 8960
[remote guaranteed per delegate] => 2062
[delegates] => 4
[adp] => 937
[classroom guaranteed per delegate] => 2240
)
[5] => Array
(
[remote guaranteed] => 9185
[classroom guaranteed] => 10015
[remote guaranteed per delegate] => 1837
[delegates] => 5
[adp] => 937
[classroom guaranteed per delegate] => 2003
)
[6] => Array
(
[remote guaranteed] => 10122
[classroom guaranteed] => 11070
[remote guaranteed per delegate] => 1687
[delegates] => 6
[adp] => 937
[classroom guaranteed per delegate] => 1845
)
[7] => Array
(
[remote guaranteed] => 11060
[classroom guaranteed] => 12131
[remote guaranteed per delegate] => 1580
[delegates] => 7
[adp] => 937
[classroom guaranteed per delegate] => 1733
)
[8] => Array
(
[remote guaranteed] => 12000
[classroom guaranteed] => 13184
[remote guaranteed per delegate] => 1500
[delegates] => 8
[adp] => 937
[classroom guaranteed per delegate] => 1648
)
[9] => Array
(
[remote guaranteed] => 12933
[classroom guaranteed] => 14247
[remote guaranteed per delegate] => 1437
[delegates] => 9
[adp] => 937
[classroom guaranteed per delegate] => 1583
)
[10] => Array
(
[remote guaranteed] => 13870
[classroom guaranteed] => 15300
[remote guaranteed per delegate] => 1387
[delegates] => 10
[adp] => 937
[classroom guaranteed per delegate] => 1530
)
)
)
)
)
[ec_4967] => Array
(
[tid] => ec_4967
[title] => Quito
[sales_area] => ec_ecuador
[venues] => Array
(
[ec_15661447] => Array
(
[vid] => ec_15661447
[title] => Quito - Av Eloy Alfaro
[vfdc] => 200.00
[prices] => Array
(
[1] => Array
(
[remote guaranteed] => 5437
[classroom guaranteed] => 5837
[remote guaranteed per delegate] => 5437
[delegates] => 1
[adp] => 937
[classroom guaranteed per delegate] => 5837
)
[2] => Array
(
[remote guaranteed] => 6374
[classroom guaranteed] => 6874
[remote guaranteed per delegate] => 3187
[delegates] => 2
[adp] => 937
[classroom guaranteed per delegate] => 3437
)
[3] => Array
(
[remote guaranteed] => 7311
[classroom guaranteed] => 7911
[remote guaranteed per delegate] => 2437
[delegates] => 3
[adp] => 937
[classroom guaranteed per delegate] => 2637
)
[4] => Array
(
[remote guaranteed] => 8248
[classroom guaranteed] => 8948
[remote guaranteed per delegate] => 2062
[delegates] => 4
[adp] => 937
[classroom guaranteed per delegate] => 2237
)
[5] => Array
(
[remote guaranteed] => 9185
[classroom guaranteed] => 9985
[remote guaranteed per delegate] => 1837
[delegates] => 5
[adp] => 937
[classroom guaranteed per delegate] => 1997
)
[6] => Array
(
[remote guaranteed] => 10122
[classroom guaranteed] => 11022
[remote guaranteed per delegate] => 1687
[delegates] => 6
[adp] => 937
[classroom guaranteed per delegate] => 1837
)
[7] => Array
(
[remote guaranteed] => 11060
[classroom guaranteed] => 12061
[remote guaranteed per delegate] => 1580
[delegates] => 7
[adp] => 937
[classroom guaranteed per delegate] => 1723
)
[8] => Array
(
[remote guaranteed] => 12000
[classroom guaranteed] => 13096
[remote guaranteed per delegate] => 1500
[delegates] => 8
[adp] => 937
[classroom guaranteed per delegate] => 1637
)
[9] => Array
(
[remote guaranteed] => 12933
[classroom guaranteed] => 14130
[remote guaranteed per delegate] => 1437
[delegates] => 9
[adp] => 937
[classroom guaranteed per delegate] => 1570
)
[10] => Array
(
[remote guaranteed] => 13870
[classroom guaranteed] => 15170
[remote guaranteed per delegate] => 1387
[delegates] => 10
[adp] => 937
[classroom guaranteed per delegate] => 1517
)
)
)
)
)
)
[remote] => Array
(
[1] => Array
(
[remote guaranteed] => 5437
[remote guaranteed per delegate] => 5437
[adp] => 937
)
[2] => Array
(
[remote guaranteed] => 6374
[remote guaranteed per delegate] => 3187
[adp] => 937
)
[3] => Array
(
[remote guaranteed] => 7311
[remote guaranteed per delegate] => 2437
[adp] => 937
)
[4] => Array
(
[remote guaranteed] => 8248
[remote guaranteed per delegate] => 2062
[adp] => 937
)
[5] => Array
(
[remote guaranteed] => 9185
[remote guaranteed per delegate] => 1837
[adp] => 937
)
[6] => Array
(
[remote guaranteed] => 10122
[remote guaranteed per delegate] => 1687
[adp] => 937
)
[7] => Array
(
[remote guaranteed] => 11060
[remote guaranteed per delegate] => 1580
[adp] => 937
)
[8] => Array
(
[remote guaranteed] => 12000
[remote guaranteed per delegate] => 1500
[adp] => 937
)
[9] => Array
(
[remote guaranteed] => 12933
[remote guaranteed per delegate] => 1437
[adp] => 937
)
[10] => Array
(
[remote guaranteed] => 13870
[remote guaranteed per delegate] => 1387
[adp] => 937
)
)
[currency] => USD
)
[5] => Array
(
[0] => 5
[1] => 5
[2] => 4
[3] => 4
[4] => 5
)
[6] => Array
(
[479923] => Array
(
[title] => Apache NiFi for Developers
[rating] => 5
[delegate_and_company] => Pedro
[body] => I liked the virtual machine environments because he could easily toggle between the views and help if we were struggling with the material.
[mc] =>
[is_mt] => 0
[nid] => 479923
)
[445523] => Array
(
[title] => Python and Spark for Big Data (PySpark)
[rating] => 5
[delegate_and_company] => Aurelia-Adriana - Allianz Services Romania
[body] => I liked that it was practical. Loved to apply the theoretical knowledge with practical examples.
[mc] =>
[is_mt] => 0
[nid] => 445523
)
[422075] => Array
(
[title] => Apache NiFi for Administrators
[rating] => 4
[delegate_and_company] => Rolando García - OIT para México y Cuba
[body] => Muy poco, se me dificulto mucho y mas por que entre desfasado, no tome los primeras sesiones.
[mc] =>
[is_mt] => 0
[nid] => 422075
)
[404743] => Array
(
[title] => Data Vault: Building a Scalable Data Warehouse
[rating] => 4
[delegate_and_company] => john ernesto ii fernandez - Philippine AXA Life Insurance Corporation
[body] => how the trainor shows his knowledge in the subject he's teachign
[mc] =>
[is_mt] => 0
[nid] => 404743
)
[283902] => Array
(
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[rating] => 5
[delegate_and_company] => Laura Kahn
[body] => This is one of the best hands-on with exercises programming courses I have ever taken.
[mc] => This is one of the best hands-on with exercises programming courses I have ever taken.
[is_mt] => 0
[nid] => 283902
)
)
[7] => 4.6
[8] => 1
[9] => 1
[10] =>
)
)
Big Data Analytics for Telecom Regulators Training Course
This instructor-led, live training in Ecuador (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
This instructor-led, live training in Ecuador (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
This instructor-led, live training in Ecuador (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. It's power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from withing their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available Scala in the future. Please contact us to arrange.
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Some of the topics included in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
In this instructor-led, live training in Ecuador (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apachi NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
In this instructor-led, live training in Ecuador, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Custom develop their own Apache Nifi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
This instructor-led, live training in Ecuador (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
In this instructor-led, live training in Ecuador, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
In this instructor-led, live training in Ecuador, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It is divided into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
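As a minimal, hedged sketch of the spark.ml (DataFrame-based) side of MLlib, the following Python fragment assembles a feature vector and fits a linear regression inside a pipeline; the column names and toy rows are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
    # toy data: two features (x1, x2) and a label (y)
    df = spark.createDataFrame(
        [(1.0, 2.0, 3.0), (2.0, 4.0, 6.1), (3.0, 6.0, 9.2)], ["x1", "x2", "y"])
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="y")
    model = Pipeline(stages=[assembler, lr]).fit(df)   # chain feature prep + estimator
    model.transform(df).select("y", "prediction").show()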
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark.
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Real-life applications for this data mining technique include marketing, fraud detection, telecommunication and manufacturing.
In this instructor-led, live course, we introduce the processes involved in KDD and carry out a series of exercises to practice the implementation of those processes.
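To make the KDD steps concrete, here is a minimal Python sketch (using scikit-learn and a stand-in data set, both assumptions on our part) that walks through selection, preprocessing and a simple mining step; the actual exercises may use entirely different data and tools:

    from sklearn.datasets import load_iris            # stand-in for a selected data set
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    X = load_iris().data                               # selection
    X = StandardScaler().fit_transform(X)              # preprocessing / transformation
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # data mining step
    print(labels[:10])                                 # interpretation starts here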
Audience
Data analysts or anyone interested in learning how to interpret data to solve problems
Format of the Course
After a theoretical discussion of KDD, the instructor will present real-life cases which call for the application of KDD to solve a problem. Participants will prepare, select and cleanse sample data sets and use their prior knowledge about the data to propose solutions based on the results of their observations.
There are no specific requirements needed to attend this course.
[overview] =>
To meet regulatory compliance, CSPs (communication service providers) can tap into Big Data analytics, which not only helps them meet compliance but, within the scope of the same project, also lets them increase customer satisfaction and thus reduce churn. In fact, since compliance is related to the quality of service tied to a contract, any initiative towards meeting compliance will improve the “competitive edge” of the CSPs. It is therefore important that regulators are able to advise/guide a set of Big Data analytic practices for CSPs that will be of mutual benefit to both regulators and CSPs.
The course consists of 8 modules (4 on day 1, and 4 on day 2)
[category_overview] =>
[outline] =>
1. Module 1: Case studies of how telecom regulators have used Big Data analytics to enforce compliance
2. Module 2: Reviewing millions of contracts between CSPs and their users using unstructured Big Data analytics
Elements of NLP (Natural Language Processing)
Extracting SLAs (service level agreements) from millions of contracts
Some of the known open source and licensed tools for contract analysis (eBrevia, IBM Watson, KIRA)
Automatic discovery of contracts and conflicts from unstructured data analysis
3. Module 3: Extracting structured information from unstructured customer contracts and mapping it to the quality of service obtained from IPDR data and crowdsourced app data; metrics for compliance; automatic detection of compliance violations
4. Module 4: Using an app-based approach to collect compliance and QoS data: release a free regulatory mobile app to users to track and analyze automatically. In this approach the regulatory authority releases a free app, distributes it among users, and the app collects data on QoS, spam, etc. and reports it back in analytic dashboard form:
Intelligent spam detection engine (for SMS only) to assist the subscriber in reporting
Crowdsourcing of data about offending messages and calls to speed up detection of unregistered telemarketers
Updates about action taken on complaints within the App
Automatic reporting of voice call quality (call drops, one-way connections) for users who have the regulatory app installed
Automatic reporting of Data Speed
5. Module 5: Processing of regulatory app data for automatic alarm generation (alarms are generated and emailed/SMSed to stakeholders automatically; a minimal alarm-publishing sketch in Python follows this outline):
Implementation of dashboard and alarm service
Microsoft Azure-based dashboard and SNS alarm service
AWS Lambda-based dashboard and alarming
AWS/Microsoft analytics suite to crunch the data for alarm generation
Alarm generation rules
6. Module 6: Using IPDR data for QoS and compliance (IPDR Big Data analytics):
Metered billing by service and subscriber usage
Network capacity analysis and planning
Edge resource management
Network inventory and asset management
Service-level objective (SLO) monitoring for business services
Quality of experience (QOE) monitoring
Call Drops
Service optimization and product development analytics
7. Module 7: Customer service experience and a Big Data approach to CSP CRM:
Compliance with refund policies
Subscription fees
Meeting SLAs and subscription discounts
Automatic detection of unmet SLAs
8. Module 8: Big Data ETL for integrating different QoS data sources into a single dashboard with alarm-based analytics:
Using a PaaS cloud such as AWS Lambda or Microsoft Azure
Using a Hybrid cloud approach
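As flagged in Module 5, a minimal alarm-publishing sketch in Python might look as follows. The AWS SNS topic ARN, the threshold rule and the measured values are placeholders, and stakeholders are assumed to be already subscribed to the topic by email/SMS:

    import boto3

    contracted_speed_mbps = 50.0        # value extracted from the contract/SLA (placeholder)
    measured_speed_mbps = 31.5          # value reported by the regulatory app (placeholder)

    sns = boto3.client("sns", region_name="us-east-1")
    if measured_speed_mbps < 0.8 * contracted_speed_mbps:       # example alarm rule
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:qos-alarms",   # placeholder ARN
            Message=(f"QoS violation: measured {measured_speed_mbps} Mbps, "
                     f"contracted {contracted_speed_mbps} Mbps"),
        )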
[language] => en
[duration] => 14
[status] => published
[changed] => 1700037381
[source_title] => Big Data Analytics for Telecom Regulators
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
)
[2] => bdatr
[3] => Array
(
[outlines] => Array
(
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
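A minimal Structured Streaming sketch in Python, reading from a Kafka topic and counting values on the console, might look like the following; the broker address and topic name are assumptions, and the spark-sql-kafka connector package must be available to Spark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
              .option("subscribe", "events")                         # assumed topic
              .load())
    counts = (events.selectExpr("CAST(value AS STRING) AS value")
              .groupBy("value").count())
    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()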
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
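As an illustrative sketch only (assuming a local ksqlDB server listening on its default port), a stream can be declared by posting a SQL statement to the server's REST endpoint from Python; the topic and column names here are invented:

    import requests

    stmt = """
    CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
      WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
    """
    resp = requests.post("http://localhost:8088/ksql",               # assumed server address
                         json={"ksql": stmt, "streamsProperties": {}})
    print(resp.status_code, resp.json())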
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
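A minimal sketch using the Python thin client (the pyignite package), assuming an Ignite node is already running locally on the default thin-client port:

    from pyignite import Client

    client = Client()
    client.connect("127.0.0.1", 10800)                  # default thin-client port
    cache = client.get_or_create_cache("quotes")        # distributed key-value cache
    cache.put(1, "hello from the in-memory grid")
    print(cache.get(1))
    client.close()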
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
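As a small sketch of the unified model, the following Python pipeline counts words with the default DirectRunner; swapping the runner (for example to Flink or Dataflow) is a configuration change rather than a code rewrite:

    import apache_beam as beam

    with beam.Pipeline() as p:    # DirectRunner unless another runner is configured
        (p
         | "Create" >> beam.Create(["to be or not to be"])
         | "Split" >> beam.FlatMap(str.split)
         | "Count" >> beam.combiners.Count.PerElement()
         | "Print" >> beam.Map(print))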
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data in motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led, live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real time.
Topics covered in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
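Although most day-to-day NiFi work happens in the web UI, the same information is exposed over its REST API. As a hedged sketch (assuming an unsecured NiFi instance on the default port; secured clusters need a token or certificate), the overall flow status can be polled from Python like this:

    import requests

    # assumed local, unsecured instance; field names may vary between NiFi versions
    status = requests.get("http://localhost:8080/nifi-api/flow/status").json()
    print(status["controllerStatus"]["queued"])     # e.g. "0 / 0 bytes"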
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processors.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
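A tiny PyFlink DataStream sketch follows; a bounded collection stands in for a real source such as Kafka, and the names are purely illustrative:

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    words = env.from_collection(["flink", "streams", "flink"])
    (words.map(lambda w: (w, 1))                        # pair each word with a count of 1
          .key_by(lambda t: t[0])                       # group by the word
          .reduce(lambda a, b: (a[0], a[1] + b[1]))     # running count per word
          .print())
    env.execute("word_count_demo")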
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
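A minimal PySpark sketch of the kind of exercise covered; the file name and column are hypothetical stand-ins for whatever data set is used:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()
    # "ratings.csv" and the "user_id" column are placeholders
    df = spark.read.csv("ratings.csv", header=True, inferSchema=True)
    df.groupBy("user_id").count().orderBy("count", ascending=False).show(10)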
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview of Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
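Because GraphX itself has no Python API, a hedged sketch of the same idea in Python can use the separate GraphFrames package on top of Spark; the vertices and edges below are invented:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame            # separate "graphframes" package

    spark = SparkSession.builder.appName("graph-demo").getOrCreate()
    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
    g = GraphFrame(vertices, edges)
    g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()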
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks.)
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distribution big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + MapReduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommendation algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Topic models (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parser, word2vec (vector to word)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It is divided into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark.
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Real-life applications for this data mining technique include marketing, fraud detection, telecommunication and manufacturing.
In this instructor-led, live course, we introduce the processes involved in KDD and carry out a series of exercises to practice the implementation of those processes.
Audience
Data analysts or anyone interested in learning how to interpret data to solve problems
Format of the Course
After a theoretical discussion of KDD, the instructor will present real-life cases which call for the application of KDD to solve a problem. Participants will prepare, select and cleanse sample data sets and use their prior knowledge about the data to propose solutions based on the results of their observations.
There are no specific requirements needed to attend this course.
[overview] =>
To meet regulatory compliance, CSPs (communication service providers) can tap into Big Data analytics, which not only helps them meet compliance but, within the scope of the same project, also lets them increase customer satisfaction and thus reduce churn. In fact, since compliance is related to the quality of service tied to a contract, any initiative towards meeting compliance will improve the “competitive edge” of the CSPs. It is therefore important that regulators are able to advise/guide a set of Big Data analytic practices for CSPs that will be of mutual benefit to both regulators and CSPs.
The course consists of 8 modules (4 on day 1, and 4 on day 2)
[category_overview] =>
[outline] =>
1. Module 1: Case studies of how telecom regulators have used Big Data analytics to enforce compliance
2. Module 2: Reviewing millions of contracts between CSPs and their users using unstructured Big Data analytics
Elements of NLP (Natural Language Processing)
Extracting SLAs (service level agreements) from millions of contracts
Some of the known open source and licensed tools for contract analysis (eBrevia, IBM Watson, KIRA)
Automatic discovery of contracts and conflicts from unstructured data analysis
3. Module 3: Extracting structured information from unstructured customer contracts and mapping it to the quality of service obtained from IPDR data and crowdsourced app data; metrics for compliance; automatic detection of compliance violations
4. Module 4: Using an app-based approach to collect compliance and QoS data: release a free regulatory mobile app to users to track and analyze automatically. In this approach the regulatory authority releases a free app, distributes it among users, and the app collects data on QoS, spam, etc. and reports it back in analytic dashboard form:
Intelligent spam detection engine (for SMS only) to assist the subscriber in reporting
Crowdsourcing of data about offending messages and calls to speed up detection of unregistered telemarketers
Updates about action taken on complaints within the App
Automatic reporting of voice call quality (call drops, one-way connections) for users who have the regulatory app installed
Automatic reporting of Data Speed
5. Module 5: Processing of regulatory app data for automatic alarm generation (alarms are generated and emailed/SMSed to stakeholders automatically):
Implementation of dashboard and alarm service
Microsoft Azure-based dashboard and SNS alarm service
AWS Lambda-based dashboard and alarming
AWS/Microsoft analytics suite to crunch the data for alarm generation
Alarm generation rules
6. Module 6: Using IPDR data for QoS and compliance (IPDR Big Data analytics):
Metered billing by service and subscriber usage
Network capacity analysis and planning
Edge resource management
Network inventory and asset management
Service-level objective (SLO) monitoring for business services
Quality of experience (QOE) monitoring
Call Drops
Service optimization and product development analytics
7. Module 7: Customer service experience and a Big Data approach to CSP CRM:
Compliance with refund policies
Subscription fees
Meeting SLAs and subscription discounts
Automatic detection of unmet SLAs
8. Module 8: Big Data ETL for integrating different QoS data sources into a single dashboard with alarm-based analytics:
Using a PaaS cloud such as AWS Lambda or Microsoft Azure
Using a Hybrid cloud approach
[language] => en
[duration] => 14
[status] => published
[changed] => 1700037381
[source_title] => Big Data Analytics for Telecom Regulators
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
)
[2] => bdatr
[3] => Array
(
[outlines] => Array
(
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as a storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data in motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led, live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real time.
Topics covered in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
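A minimal PyFlink sketch of the kind of streaming program built in this course, assuming the apache-flink Python package is installed locally; the job name and sample values are illustrative only.

from pyflink.datastream import StreamExecutionEnvironment

# Create a local execution environment.
env = StreamExecutionEnvironment.get_execution_environment()

# Build a tiny bounded stream, transform it, and print the results.
numbers = env.from_collection([1, 2, 3, 4, 5])
numbers.map(lambda x: x * 2).print()

# Submit the job to the Flink runtime.
env.execute("doubling_demo")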
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
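As a hedged illustration of what interfacing Spark from Python looks like, the sketch below assumes a local Spark installation; the file name and column names are placeholders.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# Load a CSV file into a DataFrame and run a simple aggregation.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("region").count().show()

spark.stop()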
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview of Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
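As a small-scale sketch of the modeling idea described above, the example below builds a tiny directed graph and ranks its vertices with networkx; it is illustrative only, and the distributed tools covered in the course (Hadoop, Spark, GraphX, Pregel) apply the same concepts to graphs that do not fit on a single machine.

import networkx as nx

# Vertices are web pages, directed edges are hyperlinks between them.
g = nx.DiGraph()
g.add_edges_from([
    ("home", "about"),
    ("home", "blog"),
    ("blog", "home"),
    ("about", "blog"),
])

# Rank pages by link structure, a classic whole-graph computation.
for page, score in nx.pagerank(g).items():
    print(page, round(score, 3))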
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distributed big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + MapReduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommended algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Theme model (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parsing, word2vec (word to vector)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
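A minimal sketch of the DataFrame-based spark.ml API mentioned above, assuming a local Spark installation; the toy data, column names, and app name are placeholders.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training data: two feature columns and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.2, 0.9, 0.0), (0.9, 0.1, 1.0)],
    ["f1", "f2", "label"],
)

# Assemble the features and fit a logistic regression in one pipeline.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()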
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Real-life applications for this data mining technique include marketing, fraud detection, telecommunication and manufacturing.
In this instructor-led, live course, we introduce the processes involved in KDD and carry out a series of exercises to practice the implementation of those processes.
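As a hedged sketch of the preparation steps practiced in the exercises (selection, cleansing, transformation), the example below uses pandas; the file and column names are placeholders.

import pandas as pd

# Selection: load the raw data and keep only the relevant attributes.
raw = pd.read_csv("transactions.csv")
subset = raw[["customer_id", "amount", "channel"]]

# Cleansing: drop incomplete records and obviously invalid values.
clean = subset.dropna()
clean = clean[clean["amount"] > 0]

# Transformation: derive per-customer features for later mining steps.
summary = clean.groupby("customer_id")["amount"].agg(["count", "mean"])
print(summary.head())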
Audience
Data analysts or anyone interested in learning how to interpret data to solve problems
Format of the Course
After a theoretical discussion of KDD, the instructor will present real-life cases which call for the application of KDD to solve a problem. Participants will prepare, select and cleanse sample data sets and use their prior knowledge about the data to propose solutions based on the results of their observations.
There are no specific requirements needed to attend this course.
[overview] =>
To meet compliance of the regulators, CSPs (Communication service providers) can tap into Big Data Analytics which not only help them to meet compliance but within the scope of same project they can increase customer satisfaction and thus reduce the churn. In fact since compliance is related to Quality of service tied to a contract, any initiative towards meeting the compliance, will improve the “competitive edge” of the CSPs. Therefore, it is important that Regulators should be able to advise/guide a set of Big Data analytic practice for CSPs that will be of mutual benefit between the regulators and CSPs.
The course consists of 8 modules (4 on day 1, and 4 on day 2)
[category_overview] =>
[outline] =>
1. Module-1 : Case studies of how Telecom Regulators have used Big Data Analytics for imposing compliance :
2. Module-2 : Reviewing Millions of contract between CSPs and its users using unstructured Big data analytics
Elements of NLP ( Natural Language Processing )
Extracting SLA ( service level agreements ) from millions of Contracts
Some of the known open source and licensed tool for Contract analysis ( eBravia, IBM Watson, KIRA)
Automatic discovery of contract and conflict from Unstructured data analysis
3. Module -3 : Extracting Structured information from unstructured Customer Contract and map them to Quality of Service obtained from IPDR data & Crowd Sourced app data. Metric for Compliance. Automatic detection of compliance violations.
4. Module- 4 : USING app approach to collect compliance and QoS data- release a free regulatory mobile app to the users to track & Analyze automatically. In this approach regulatory authority will be releasing free app and distribute among the users-and the app will be collecting data on QoS/Spams etc and report it back in analytic dashboard form :
Intelligent spam detection engine (for SMS only) to assist the subscriber in reporting
Crowdsourcing of data about offending messages and calls to speed up detection of unregistered telemarketers
Updates about action taken on complaints within the App
Automatic reporting of voice call quality ( call drop, one way connection) for those who will have the regulatory app installed
Automatic reporting of Data Speed
5. Module-5 : Processing of regulatory app data for automatic alarm system generation (alarms will be generated and emailed/sms to stake holders automatically) :
Implementation of dashboard and alarm service
Microsoft Azure based dashboard and SNS alarm service
AWS Lambda Service based Dashboard and alarming
AWS/Microsoft Analytic suite to crunch the data for Alarm generation
Alarm generation rules
6. Module-6 : Use IPDR data for QoS and Compliance-IPDR Big data analytics:
Metered billing by service and subscriber usage
Network capacity analysis and planning
Edge resource management
Network inventory and asset management
Service-level objective (SLO) monitoring for business services
Quality of experience (QOE) monitoring
Call Drops
Service optimization and product development analytics
7. Module-7 : Customer Service Experience & Big Data approach to CSP CRM :
Compliance on Refund policies
Subscription fees
Meeting SLA and Subscription discount
Automatic detection of not meeting SLAs
8. Module-8 : Big Data ETL for integrating different QoS data source and combine to a single dashboard alarm based analytics:
Using a PAAS Cloud like AWS Lambda, Microsoft Azure
Using a Hybrid cloud approach
[language] => en
[duration] => 14
[status] => published
[changed] => 1700037381
[source_title] => Big Data Analytics for Telecom Regulators
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
)
[1] => Array
(
[0] => stdClass Object
(
[tid] => 766
[alias] => big-data-training
[name] => Big Data
[english_name] => Big Data
[consulting_option] => available_promoted
)
)
[2] => bdatr
[3] => Array
(
[outlines] => Array
(
[datavault] => stdClass Object
(
[course_code] => datavault
[hr_nid] => 210132
[title] => Data Vault: Building a Scalable Data Warehouse
[requirements] =>
An understanding of data warehousing concepts
An understanding of database and data modeling concepts
Audience
Data modelers
Data warehousing specialists
Business Intelligence specialists
Data engineers
Database administrators
[overview] =>
Data Vault Modeling is a database modeling technique that provides long-term historical storage of data that originates from multiple sources. A data vault stores a single version of the facts, or "all the data, all the time". Its flexible, scalable, consistent and adaptable design encompasses the best aspects of 3rd normal form (3NF) and star schema.
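A minimal sketch of the three core Data Vault structures (hub, link, satellite) using Python's standard-library sqlite3 module; the table and column names are assumptions made for illustration, not a prescribed model.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per business key.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash of the business key
    customer_id   TEXT NOT NULL,      -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Link: a relationship between hubs.
CREATE TABLE link_customer_order (
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    order_hk      TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Satellite: descriptive, historized attributes of a hub.
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    email         TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")
print("Sample Data Vault schema created.")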
In this instructor-led, live training, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to build a Data Vault.
By the end of this training, participants will be able to:
Understand the architecture and design concepts behind Data Vault 2.0, and its interaction with Big Data, NoSQL and AI.
Use data vaulting techniques to enable auditing, tracing, and inspection of historical data in a data warehouse.
Develop a consistent and repeatable ETL (Extract, Transform, Load) process.
Build and deploy highly scalable and repeatable warehouses.
[outline] =>
Introduction
The shortcomings of existing data warehouse data modeling architectures
Benefits of Data Vault modeling
Overview of Data Vault architecture and design principles
SEI / CMM / Compliance
Data Vault applications
Dynamic Data Warehousing
Exploration Warehousing
In-Database Data Mining
Rapid Linking of External Information
Data Vault components
Hubs, Links, Satellites
Building a Data Vault
Modeling Hubs, Links and Satellites
Data Vault reference rules
How components interact with each other
Modeling and populating a Data Vault
Converting 3NF OLTP to a Data Vault Enterprise Data Warehouse (EDW)
Understanding load dates, end-dates, and join operations
Business keys, relationships, link tables and join techniques
Query techniques
Load processing and query processing
Overview of Matrix Methodology
Getting data into data entities
Loading Hub Entities
Loading Link Entities
Loading Satellites
Using SEI/CMM Level 5 templates to obtain repeatable, reliable, and quantifiable results
Developing a consistent and repeatable ETL (Extract, Transform, Load) process
Building and deploying highly scalable and repeatable warehouses
Closing remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349914
[source_title] => Data Vault: Building a Scalable Data Warehouse
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => datavault
)
[sparkstreaming] => stdClass Object
(
[course_code] => sparkstreaming
[hr_nid] => 356863
[title] => Spark Streaming with Python and Kafka
[requirements] =>
Experience with Python and Apache Kafka
Familiarity with stream-processing platforms
Audience
Data engineers
Data scientists
Programmers
[overview] =>
Apache Spark Streaming is a scalable, open source stream processing system that allows users to process real-time data from supported sources. Spark Streaming enables fault-tolerant processing of data streams.
This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
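A minimal PySpark sketch of the idea, written against the Structured Streaming API as a stand-in; it assumes a local Spark installation and a text source on localhost:9999 (for example, nc -lk 9999).

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a live text stream, split each line into words, and count them.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()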
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming features in processing and analyzing real-time data.
By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.
[outline] =>
Introduction
Overview of Spark Streaming Features and Architecture
Confluent KSQL is a stream processing framework built on top of Apache Kafka. It enables real-time data processing using SQL operations.
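A hedged sketch of the SQL-only style described above: the statements are plain KSQL, submitted here from Python to a ksqlDB server's REST endpoint; the server address, topic name, and schema are assumptions for illustration.

import requests

KSQL_URL = "http://localhost:8088/ksql"  # assumed local ksqlDB server

statements = """
    CREATE STREAM orders (order_id VARCHAR, amount DOUBLE, region VARCHAR)
        WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
    CREATE TABLE region_totals AS
        SELECT region, SUM(amount) AS total
        FROM orders
        GROUP BY region
        EMIT CHANGES;
"""

# Submit the statements; no Java or Python stream-processing code is needed.
response = requests.post(
    KSQL_URL,
    headers={"Accept": "application/vnd.ksql.v1+json"},
    json={"ksql": statements, "streamsProperties": {}},
)
print(response.status_code, response.json())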
This instructor-led, live training (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to implement Apache Kafka stream processing without writing code.
By the end of this training, participants will be able to:
Install and configure Confluent KSQL.
Set up a stream processing pipeline using only SQL commands (no Java or Python coding).
Carry out data filtering, transformations, aggregations, joins, windowing, and sessionization entirely in SQL.
Design and deploy interactive, continuous queries for streaming ETL and real-time analytics.
Apache Ignite is an in-memory computing platform that sits between the application and data layer to improve speed, scale, and availability.
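A minimal sketch using the pyignite thin client, assuming an Ignite node is listening on the default thin-client port 10800; the cache name and keys are placeholders.

from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

# Use the cluster as a distributed, in-memory key-value store.
cache = client.get_or_create_cache("session_cache")
cache.put("user:42", "active")
print(cache.get("user:42"))

client.close()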
This instructor-led, live training (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) is aimed at developers who wish to learn the principles behind persistent and pure in-memory storage as they step through the creation of a sample in-memory computing project.
By the end of this training, participants will be able to:
Use Ignite for in-memory, on-disk persistence as well as a purely distributed in-memory database.
Achieve persistence without syncing data back to a relational database.
Use Ignite to carry out SQL and distributed joins.
Improve performance by moving data closer to the CPU, using RAM as storage.
Spread data sets across a cluster to achieve horizontal scalability.
Integrate Ignite with RDBMS, NoSQL, Hadoop and machine learning processors.
[outline] =>
Introduction
Overview of Big Data Tools and Technologies
Installing and Configuring Apache Ignite
Overview of Ignite Architecture
Querying Data in Ignite
Spreading Large Data Sets across a Cluster
Understanding the In-Memory Data Grid
Writing a Service in Ignite
Running Distributed Computing with Ignite
Integrating Ignite with RDBMS, NoSQL, Hadoop and Machine Learning Processors
Apache Beam is an open source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution being carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
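A minimal Apache Beam pipeline in Python illustrating the unified model, assuming the apache-beam package and the local DirectRunner; the sample data is illustrative only.

import apache_beam as beam

# A tiny word-count pipeline; swapping the runner (Flink, Spark, Dataflow)
# would not change the pipeline code itself.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["alpha beta", "beta gamma beta"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Count" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )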
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
Install and configure Apache Beam.
Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
Execute pipelines across multiple environments.
Format of the Course
Part lecture, part discussion, exercises and heavy hands-on practice
Note
This course will be available in Scala in the future. Please contact us to arrange.
[category_overview] =>
[outline] =>
Introduction
Apache Beam vs MapReduce, Spark Streaming, Kafka Streaming, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
Beam Model, SDKs, Beam Pipeline Runners
Distributed processing back-ends
Understanding the Apache Beam Programming Model
How a pipeline is executed
Running a sample pipeline
Preparing a WordCount pipeline
Executing the Pipeline locally
Designing a Pipeline
Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
Writing the driver program and defining the pipeline
Using Apache Beam classes
Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
Executing the pipeline locally, on remote machines, and on a public cloud
Choosing a runner
Runner-specific configurations
Testing and Debugging Apache Beam
Using type hints to emulate static typing
Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Windowing and Triggers
Making Your Pipelines Reusable and Maintainable
Create New Data Sources and Sinks
Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
Apache Apex is a YARN-native platform that unifies stream and batch processing. It processes big data-in-motion in a way that is scalable, performant, fault-tolerant, stateful, secure, distributed, and easily operable.
This instructor-led, live training introduces Apache Apex's unified stream processing architecture, and walks participants through the creation of a distributed application using Apex on Hadoop.
By the end of this training, participants will be able to:
Understand data processing pipeline concepts such as connectors for sources and sinks, common data transformations, etc.
Build, scale and optimize an Apex application
Process real-time data streams reliably and with minimum latency
Use Apex Core and the Apex Malhar library to enable rapid application development
Use the Apex API to write and re-use existing Java code
Integrate Apex into other applications as a processing engine
Tune, test and scale Apex applications
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
[outline] =>
To request a customized course outline for this training, please contact us.
Apache Storm is a distributed, real-time computation engine used for enabling real-time business intelligence. It does so by enabling applications to reliably process unbounded streams of data (a.k.a. stream processing).
"Storm is for real-time processing what Hadoop is for batch processing!"
In this instructor-led live training, participants will learn how to install and configure Apache Storm, then develop and deploy an Apache Storm application for processing big data in real-time.
Some of the topics covered in this training include:
Apache Storm in the context of Hadoop
Working with unbounded data
Continuous computation
Real-time analytics
Distributed RPC and ETL processing
Request this course now!
Audience
Software and ETL developers
Mainframe professionals
Data scientists
Big data analysts
Hadoop professionals
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
[outline] =>
Request a customized course outline for this training!
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Automate dataflows.
Enable streaming analytics.
Apply various approaches for data ingestion.
Transform Big Data into business insights.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc> (onsite or remote), participants will learn how to deploy and manage Apache NiFi in a live lab environment.
By the end of this training, participants will be able to:
Install and configure Apache NiFi.
Source, transform and manage data from disparate, distributed data sources, including databases and big data lakes.
Apache NiFi (Hortonworks DataFlow) is a real-time integrated data logistics and simple event processing platform that enables the moving, tracking and automation of data between systems. It is written using flow-based programming and provides a web-based user interface to manage dataflows in real time.
In this instructor-led, live training, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn the fundamentals of flow-based programming as they develop a number of demo extensions, components and processors using Apache NiFi.
By the end of this training, participants will be able to:
Understand NiFi's architecture and dataflow concepts.
Develop extensions using NiFi and third-party APIs.
Develop their own custom Apache NiFi processor.
Ingest and process real-time data from disparate and uncommon file formats and data sources.
[outline] =>
Introduction
Data at rest vs data in motion
Overview of Big Data Tools and Technologies
Hadoop (HDFS and MapReduce) and Spark
Installing and Configuring NiFi
Overview of NiFi Architecture
Development Approaches
Application development tools and mindset
Extract, Transform, and Load (ETL) tools and mindset
Design Considerations
Components, Events, and Processor Patterns
Exercise: Streaming Data Feeds into HDFS
Error Handling
Controller Services
Exercise: Ingesting Data from IoT Devices using Web-Based APIs
Exercise: Developing a Custom Apache NiFi Processor using JSON
Apache Flink is an open-source framework for scalable stream and batch data processing.
This instructor-led, live training (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
Format of the Course
Interactive lecture and discussion.
Lots of exercises and practice.
Hands-on implementation in a live-lab environment.
Course Customization Options
To request a customized training for this course, please contact us to arrange.
[category_overview] =>
This instructor-led, live training in <loc> (online or onsite) introduces the principles and approaches behind distributed stream and batch data processing, and walks participants through the creation of a real-time, data streaming application in Apache Flink.
By the end of this training, participants will be able to:
Set up an environment for developing data analysis applications.
Understand how Apache Flink's graph-processing library (Gelly) works.
Package, execute, and monitor Flink-based, fault-tolerant, data streaming applications.
Manage diverse workloads.
Perform advanced analytics.
Set up a multi-node Flink cluster.
Measure and optimize performance.
Integrate Flink with different Big Data systems.
Compare Flink capabilities with those of other big data processing frameworks.
[outline] =>
Introduction
Installing and Configuring Apache Flink
Overview of Flink Architecture
Developing Data Streaming Applications in Flink
Managing Diverse Workloads
Performing Advanced Analytics
Setting up a Multi-Node Flink Cluster
Mastering Flink DataStream API
Understanding Flink Libraries
Integrating Flink with Other Big Data Tools
Testing and Troubleshooting
Summary and Next Steps
[language] => en
[duration] => 28
[status] => published
[changed] => 1700037319
[source_title] => Apache Flink Fundamentals
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => flink
)
[sparkpython] => stdClass Object
(
[course_code] => sparkpython
[hr_nid] => 279430
[title] => Python and Spark for Big Data (PySpark)
[requirements] =>
General programming skills
Audience
Developers
IT Professionals
Data Scientists
[overview] =>
Python is a high-level programming language famous for its clear syntax and code readability. Spark is a data processing engine used in querying, analyzing, and transforming big data. PySpark allows users to interface Spark with Python.
In this instructor-led, live training, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn how to use Python and Spark together to analyze big data as they work on hands-on exercises.
By the end of this training, participants will be able to:
Learn how to use Spark with Python to analyze Big Data.
Work on exercises that mimic real world cases.
Use different tools and techniques for big data analysis using PySpark.
[outline] =>
Introduction
Understanding Big Data
Overview of Spark
Overview of Python
Overview of PySpark
Distributing Data Using Resilient Distributed Datasets Framework
Distributing Computation Using Spark API Operators
Setting Up Python with Spark
Setting Up PySpark
Using Amazon Web Services (AWS) EC2 Instances for Spark
Setting Up Databricks
Setting Up the AWS EMR Cluster
Learning the Basics of Python Programming
Getting Started with Python
Using the Jupyter Notebook
Using Variables and Simple Data Types
Working with Lists
Using if Statements
Using User Inputs
Working with while Loops
Implementing Functions
Working with Classes
Working with Files and Exceptions
Working with Projects, Data, and APIs
Learning the Basics of Spark DataFrame
Getting Started with Spark DataFrames
Implementing Basic Operations with Spark
Using Groupby and Aggregate Operations
Working with Timestamps and Dates
Working on a Spark DataFrame Project Exercise
Understanding Machine Learning with MLlib
Working with MLlib, Spark, and Python for Machine Learning
Understanding Regressions
Learning Linear Regression Theory
Implementing a Regression Evaluation Code
Working on a Sample Linear Regression Exercise
Learning Logistic Regression Theory
Implementing a Logistic Regression Code
Working on a Sample Logistic Regression Exercise
Understanding Random Forests and Decision Trees
Learning Tree Methods Theory
Implementing Decision Trees and Random Forest Codes
Working on a Sample Random Forest Classification Exercise
Working with K-means Clustering
Understanding K-means Clustering Theory
Implementing a K-means Clustering Code
Working on a Sample Clustering Exercise
Working with Recommender Systems
Implementing Natural Language Processing
Understanding Natural Language Processing (NLP)
Overview of NLP Tools
Working on a Sample NLP Exercise
Streaming with Spark on Python
Overview of Streaming with Spark
Sample Spark Streaming Exercise
Closing Remarks
[language] => en
[duration] => 21
[status] => published
[changed] => 1715349940
[source_title] => Python and Spark for Big Data (PySpark)
[source_language] => en
[cert_code] =>
[weight] => -998
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => sparkpython
)
[graphcomputing] => stdClass Object
(
[course_code] => graphcomputing
[hr_nid] => 278402
[title] => Introduction to Graph Computing
[requirements] =>
An understanding of Java programming and frameworks
A general understanding of Python is helpful but not required
A general understanding of database concepts
Audience
Developers
[overview] =>
Many real world problems can be described in terms of graphs. For example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large; processing them requires a specialized set of tools and processes -- these tools and processes can be referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
Part lecture, part discussion, exercises and heavy hands-on practice
[category_overview] =>
In this instructor-led, live training in <loc>, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
Understand how graph data is persisted and traversed.
Select the best framework for a given task (from graph databases to batch processing frameworks).
Implement Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
View real-world big data problems in terms of graphs, processes and traversals.
[outline] =>
Introduction
Graph databases and libraries
Understanding Graph Data
The graph as a data structure
Using vertices (dots) and edges (lines) to model real-world scenarios
Using Graph Databases to Model, Persist and Process Graph Data
Local graph algorithms/traversals
neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with neo4j
Whiteboard data modeling
Beyond Graph Databases: Graph Computing
Understanding the property graph
Graph modeling different scenarios (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
Algorithmic/directed walk over the graph
Determining circular dependencies
Case Study: Ranking Discussion Contributors
Ranking by number and depth of contributed discussions
Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
Overview of iterative algorithms
Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
Unifying ETL, exploratory analysis, and iterative graph computation within a single system
GraphX
Setup and Installation
Hadoop and Spark
GraphX Operators
Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
Passing arguments for sending, receiving and computing
Building a Graph
Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
GraphX Optimization
Accessing Additional Algorithms
PageRank, Connected Components, Triangle Counting
Exercise: PageRank and Top Users
Building and processing graph data using text files as input
Deploying to Production
Closing Remarks
[language] => en
[duration] => 28
[status] => published
[changed] => 1715349940
[source_title] => Introduction to Graph Computing
[source_language] => en
[cert_code] =>
[weight] => 0
[excluded_sites] =>
[use_mt] => stdClass Object
(
[field_overview] =>
[field_course_outline] =>
[field_prerequisits] =>
[field_overview_in_category] =>
)
[cc] => graphcomputing
)
[aitech] => stdClass Object
(
[course_code] => aitech
[hr_nid] => 199320
[title] => Artificial Intelligence - the most applied stuff - Data Analysis + Distributed AI + NLP
[requirements] =>
[overview] =>
This course is aimed at developers and data scientists who wish to understand and implement AI within their applications. Special focus is given to Data Analysis, Distributed AI and NLP.
[category_overview] =>
[outline] =>
Distributed big data
Data mining methods (training single systems + distributed prediction: traditional machine learning algorithms + MapReduce distributed prediction)
Apache Spark MLlib
Recommendations and Advertising:
Natural language
Text clustering, text categorization (labeling), synonyms
User profile restore, labeling system
Recommended algorithms
Ensuring the accuracy of "lift" between and within categories
How to create closed loops for recommendation algorithms
Logistic regression, RankingSVM
Feature recognition (deep learning and automatic feature recognition for graphics)
Natural language
Chinese word segmentation
Theme model (text clustering)
Text classification
Extract keywords
Semantic analysis, semantic parsing, word2vec (word to vector)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Audience
This course is directed at engineers and developers seeking to utilize a built-in machine learning library for Apache Spark
[category_overview] =>
[outline] =>
spark.mllib: data types, algorithms, and utilities
Data types
Basic statistics
summary statistics
correlations
stratified sampling
hypothesis testing
streaming significance testing
random data generation
Classification and regression
linear models (SVMs, logistic regression, linear regression)
naive Bayes
decision trees
ensembles of trees (Random Forests and Gradient-Boosted Trees)
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Real-life applications for this data mining technique include marketing, fraud detection, telecommunication and manufacturing.
In this instructor-led, live course, we introduce the processes involved in KDD and carry out a series of exercises to practice the implementation of those processes.
Audience
Data analysts or anyone interested in learning how to interpret data to solve problems
Format of the Course
After a theoretical discussion of KDD, the instructor will present real-life cases which call for the application of KDD to solve a problem. Participants will prepare, select and cleanse sample data sets and use their prior knowledge about the data to propose solutions based on the results of their observations.