Started first section of Course 4
parent
c8a0bcabef
commit
c51d10402b
281
Part_4.md
Normal file
@ -0,0 +1,281 @@
Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
===============================================================================================

### Introduction

#### About the Course and Learning Path

Make better software, faster

#### Milestone: Getting Started

### Understanding Operations in Context

#### Section Introduction

#### What Is Ops?

- GCP Defined

  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_

- Logging Management
  - Gather logs, metrics and traces everywhere
  - Audit, Platform, User logs
  - Export, Discard, Ingest

- Error Reporting
  - So much data. How do you pick out the important indicators?
  - A centralized error management interface that shows current & past errors
  - Identify your app's top or new errors at a glance, in a dedicated dashboard
  - Shows error details such as time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace

- Across-the-board Monitoring
  - Dashboards for built-in and customizable visualisations
  - Monitoring Features
    - Visual Dashboards
    - Health Monitoring
      - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
    - Service Monitoring
      - Set, monitor and alert your teams as needed based on Service Level Objectives (SLOs)

- SRE Tracking
  - Monitoring is critical for SRE
  - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
  - Pinpoint SLIs and develop an SLO on top of them

- Operational Management
  - Debugging
    - Inspects the state of your application at any code location in production without stopping or slowing down requests
  - Latency Management
    - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
  - Performance Management
    - Offers continuous profiling of resource consumption in your production applications along with cost management
  - Security Management
    - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

**What is Ops: Key Takeaways**

- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems quickly

Scratch:
- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management

#### Clarifying the Stackdriver/Operations Connection

- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops the name

**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called `Stackdriver <service>`)

"Stackdriver" lives on - in the exam only

Integration + Upgrades

- Complete UI Integrations
  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
- Dashboard API
  - New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
  - Logs can now be retained for up to **10 years** and you have control over the time specified
- Metrics Enhancement
  - In Cloud Monitoring, metrics are kept for up to 24 months, and metric writes now support 10-second granularity (write out metrics every 10 seconds)
- Advanced Alert Routing
  - Alerts can now be routed to independent systems that support Cloud Pub/Sub

#### Operations and SRE: How Do They Relate?

- Lots of questions in the exam on SRE

What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (Founder Google SRE Team)

**Pillars of DevOps**

- Accept failure as normal:
  - Try to anticipate, but...
  - Incidents are bound to occur
  - Failures help the team learn

- No-fault postmortems & SLOs:
  - No two failures are the same
  - Track incidents (SLIs)
  - Map to Objectives (SLOs)

- Implement gradual change:
  - Small updates are better
  - Easier to review
  - Easier to roll back

- Reduce costs of failures:
  - Limited "canary" rollouts
  - Impact fewest users
  - Automate where possible

- Measure everything:
  - Critical gauge of success
  - CI/CD needs full monitoring
  - Synthetic, proactive monitoring

- Measure toil and reliability:
  - Key to SLOs and SLAs
  - Reduce toil, up engineering
  - Monitor all over time

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLI: _"A carefully defined <u>quantitative</u> measure of some aspect of the level of service that is provided"_

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:
- Request Latency: How long it takes to return a response to a request
- Failure Rate: A fraction of all requests received: (unsuccessful requests / all requests)
- Batch Throughput: Proportion of time the data processing rate is above a threshold
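
The failure rate and batch throughput SLIs above are simple ratios, which a short Python sketch can make concrete. The `Request` record, the function names, and the sample numbers are all hypothetical and chosen for illustration only; they are not part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the response took
    ok: bool           # True if the request succeeded


def failure_rate(requests):
    """Failure Rate SLI: unsuccessful requests / all requests."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def batch_throughput_sli(rates, threshold):
    """Batch Throughput SLI: share of sampled intervals whose
    processing rate was above the threshold."""
    if not rates:
        return 0.0
    above = sum(1 for rate in rates if rate > threshold)
    return above / len(rates)


# Hypothetical sample data.
requests = [
    Request(120, True),
    Request(340, True),
    Request(95, False),
    Request(210, True),
]
print(failure_rate(requests))                                   # 0.25
print(batch_throughput_sli([900, 1200, 1500, 800], threshold=1000))  # 0.5
```

Both functions return a value between 0 and 1, which keeps them directly comparable to a percentage-style SLO target.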

**Commit to Memory - Google's 4x Golden Signals!**

- Latency
  - The time it takes for your service to fulfill a request
- Errors
  - The rate at which your service fails
- Traffic
  - How much demand is directed at your service
- Saturation
  - A measure of how close to fully utilized the service's resources are

> **LETS**
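
As a rough illustration of LETS, the sketch below derives all four signals from one observation window. The function name, its parameters, and the choice of median latency and CPU as the saturated resource are assumptions made for this example; in practice Cloud Monitoring collects these as metrics rather than computing them by hand.

```python
import statistics


def golden_signals(latencies_ms, error_count, window_seconds, cpu_used, cpu_capacity):
    """Summarize one observation window as the four golden signals.

    latencies_ms holds one entry per request seen in the window.
    """
    total = len(latencies_ms)
    return {
        # Latency: time taken to fulfill a request (median used here).
        "latency_ms": statistics.median(latencies_ms) if total else 0.0,
        # Errors: rate at which the service fails.
        "errors": error_count / total if total else 0.0,
        # Traffic: demand on the service, in requests per second.
        "traffic_rps": total / window_seconds,
        # Saturation: how close the resource is to fully utilized.
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical window: 4 requests in 2 seconds, 1 failure, 3 of 4 cores busy.
signals = golden_signals([100, 200, 300, 400], error_count=1,
                         window_seconds=2, cpu_used=3, cpu_capacity=4)
print(signals)
```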

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook

SLOs are tied to your SLIs
- Measured by SLI
- Can be a single target value or range of values
  - SLI <= SLO
  - or
  - (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9%, 99.99% (four nines)
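
To get a feel for what the common SLO targets above imply, the following sketch converts an availability target into the downtime it permits, and checks an SLI against a single target or a range. The 30-day window and both function names are assumptions for illustration.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of downtime permitted by an availability SLO
    (given as a fraction, e.g. 0.999) over the window."""
    window_minutes = window_days * 24 * 60
    return round((1 - slo) * window_minutes, 2)


def within_slo(sli, upper, lower=None):
    """True if the SLI satisfies the objective: SLI <= upper bound,
    and, when a lower bound is given, lower <= SLI <= upper."""
    if lower is not None and sli < lower:
        return False
    return sli <= upper


for target in (0.995, 0.999, 0.9999):
    print(target, allowed_downtime_minutes(target))
# 0.995 216.0
# 0.999 43.2
# 0.9999 4.32

print(within_slo(250, upper=300))             # True: latency SLI under target
print(within_slo(250, upper=200, lower=100))  # False: outside the range
```

Each extra nine cuts the permitted downtime by a factor of ten, which is why the jump from 99.9% to 99.99% is costly to achieve.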

SLI - Metrics over time which detail the health of a service
- example: `Site homepage request latency < 300ms over last 5 minutes @ 95th percentile`

SLO - Agreed-upon bounds for how often SLIs must be met
- example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
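
The two examples above can be sketched in Python: the SLI is evaluated per window as a 95th percentile latency check, and the SLO is the fraction of windows in which that check must pass. The nearest-rank percentile method, the function names, and the sample windows are assumptions for illustration; a monitoring backend may compute percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sli_met(latencies_ms, threshold_ms=300, pct=95):
    """SLI check for one window: pct-th percentile latency under threshold."""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_compliance(windows):
    """Fraction of windows in which the SLI was met."""
    good = sum(1 for w in windows if sli_met(w))
    return good / len(windows)


# Hypothetical 5-minute windows of 20 requests each.
healthy = [120] * 19 + [900]   # one slow outlier, p95 is still 120 ms
degraded = [120] * 18 + [900] * 2  # p95 lands on a 900 ms request

windows = [healthy, healthy, healthy, degraded]
compliance = slo_compliance(windows)
print(compliance)            # 0.75
print(compliance >= 0.999)   # False: the 99.9% SLO is missed
```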

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
- Measure and track SLIs (Measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with dev team

General Availability Phase
- After Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs

**Ops & SRE: Key Takeaways**
- SRE: Operations from a software engineer's perspective
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- SLIs are at or below the SLO target - or - between the lower and upper bounds
- SREs are most active in limited availability and general availability phases

#### Operation Services at a Glance

#### Section Review

#### Milestone: The Weight of the World (Teamwork, Not Superheroes)

Monitoring Your Operations
Section Introduction
Cloud Monitoring Concepts
Monitoring Workspaces Concepts
Monitoring Workspaces
Perspective: Workspaces in Context
What Are Metrics?
Exploring Workspace and Metrics
Monitoring Agent Concepts
Installing the Monitoring Agent
Collecting Monitoring Agent Metrics
Integration with Monitoring API
Create Dashboards with Command Line
GKE Metrics
Perspective: What's Up, Doc?
Uptime Checks
Establishing Human-Actionable and Automated Alerts
Section Review
Milestone: Spies Everywhere! (Check Those Vitals!)
Hands-On Lab:
Install and Configure Monitoring Agent with Google Cloud Monitoring
Logging Activities
Section Introduction
Cloud Logging Fundamentals
Log Types and Mechanics
Cloud Logging Tour
Logging Agent Concepts
Install Logging Agent and Collect Agent Logs
Logging Filters
Hands-On with Advanced Filters
VPC Flow Logs
Firewall Logs
VPC Flow Logs and Firewall Logs Demo
Routing and Exporting Logs
Export Logs to BigQuery
Logs-Based Metrics
Section Review
Milestone: Let the Record Show
Hands-On Lab:
Install and Configure Logging Agent on Google Cloud
SRE and Alerting Policies
SLOs and Alerting Strategy
Service Monitoring
Milestone: Come Together, Right Now, SRE
Optimize Performance with Trace/Profiler
Section Introduction
What the Services Do and Why They Matter
Tracking Latency with Cloud Trace
Accessing the Cloud Trace APIs
Setting Up Your App with Cloud Profiler
Analyzing Cloud Profiler Data
Section Review
Milestone: It All Adds Up!
Hands-On Lab:
Discovering Latency with Google Cloud Trace
Identifying Application Errors with Debug/Error Reporting
Section Introduction
Troubleshooting with Cloud Debugger
Establishing Error Reporting for Your App
Managing Errors and Handling Notifications
Section Review
Milestone: Come Together - Reprise (Debug Is De Solution)
Hands-On Lab:
Correcting Code with Cloud Debugger
Course Conclusion
Milestone: Are We There, Yet?
Practice Exam / Quiz:
Google Certified Professional Cloud DevOps Engineer Exam Prep