Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
===============================================================================================

### Introduction

#### About the Course and Learning Path

Make better software, faster

#### Milestone: Getting Started

### Understanding Operations in Context

#### Section Introduction

#### What Is Ops?

- GCP Defined

  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_

- Logging Management
  - Gather logs, metrics, and traces everywhere
  - Audit, Platform, User logs
  - Export, Discard, Ingest

- Error Reporting
  - So much data: how do you pick out the important indicators?
  - A centralized error management interface that shows current & past errors
  - Identify your app's top or new errors at a glance, in a dedicated dashboard
  - Shows error details such as time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace

- Across-the-board Monitoring
  - Dashboards for built-in and customizable visualisations
  - Monitoring Features
    - Visual Dashboards
    - Health Monitoring
      - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
    - Service Monitoring
      - Set and monitor Service Level Objectives (SLOs), and alert your teams as needed

- SRE Tracking
  - Monitoring is critical for SRE
  - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
  - Pinpoint SLIs and develop an SLO on top of them

- Operational Management
  - Debugging
    - Inspects the state of your application at any code location in production without stopping or slowing down requests
  - Latency Management
    - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
  - Performance Management
    - Offers continuous profiling of resource consumption in your production applications along with cost management
  - Security Management
    - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

**What is Ops: Key Takeaways**

- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems, quickly

Scratch:
- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management

#### Clarifying the Stackdriver/Operations Connection

- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into Google Cloud Platform and drops the name

**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called Stackdriver `<service>`)

"Stackdriver" lives on - in the exam only

Integration + Upgrades

- Complete UI Integrations
  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
- Dashboard API
  - New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
  - Logs can now be retained for up to **10 years**, and you control the retention period
- Metrics Enhancement
  - In Cloud Monitoring, metrics are kept for up to 24 months, and write granularity has increased to 10 seconds (metrics can be written every 10 seconds)
- Advanced Alert Routing
  - Alerts can now be routed to independent systems that support Cloud Pub/Sub

#### Operations and SRE: How Do They Relate?

- Lots of questions in the exam on SRE

What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (founder of Google's SRE team)

**Pillars of DevOps**

- Accept failure as normal:
  - Try to anticipate, but...
  - Incidents are bound to occur
  - Failures help the team learn

- No-fault postmortems & SLOs:
  - No two failures are the same
  - Track incidents (SLIs)
  - Map to objectives (SLOs)

- Implement gradual change:
  - Small updates are better
  - Easier to review
  - Easier to roll back

- Reduce costs of failures:
  - Limited "canary" rollouts
  - Impact fewest users
  - Automate where possible

- Measure everything:
  - Critical gauge of success
  - CI/CD needs full monitoring
  - Synthetic, proactive monitoring

- Measure toil and reliability:
  - Key to SLOs and SLAs
  - Reduce toil, up engineering
  - Monitor all over time
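
The "limited canary rollouts" pillar above can be sketched in a few lines. This is only an illustration, not something the course prescribes: the function name `in_canary` and the hash-bucket scheme are assumptions, and real rollouts are normally handled by the deployment platform or load balancer.

```python
import hashlib


def in_canary(user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the canary slice.

    Hypothetical helper: hashing the ID gives each user a stable bucket
    in [0, 10000), so the same user keeps seeing the same version while
    the rollout percentage stays the same or grows.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    canary = [u for u in users if in_canary(u, 5.0)]
    print(f"{len(canary)} of {len(users)} users routed to the canary")
```

Because buckets are stable, raising `percent` only adds users to the canary group; nobody already in it is moved back, which keeps the impacted population small and predictable.
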

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLI: _"A carefully defined <u>quantitative</u> measure of some aspect of the level of service that is provided"_

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:
- Request Latency: How long it takes to return a response to a request
- Failure Rate: The fraction of all requests received that fail (unsuccessful requests / all requests)
- Batch Throughput: The proportion of time the data processing rate is greater than a threshold
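
The three example SLIs above can be computed from raw samples roughly as follows. This is a minimal sketch: the `Request` record and the function names are hypothetical, and in practice these values would come from Cloud Monitoring metrics rather than in-memory lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def mean_latency_ms(requests):
    """Request latency SLI: average time to return a response."""
    if not requests:
        return 0.0
    return sum(r.latency_ms for r in requests) / len(requests)


def failure_rate(requests):
    """Failure rate SLI: unsuccessful requests / all requests."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def batch_throughput_sli(rates, threshold):
    """Batch throughput SLI: share of sampled intervals whose
    processing rate is greater than the threshold."""
    if not rates:
        return 0.0
    return sum(1 for rate in rates if rate > threshold) / len(rates)
```

For example, four requests of which one failed give a failure rate of 0.25, and processing rates of 900, 1200, 1500, and 800 records per second against a threshold of 1000 give a throughput SLI of 0.5.
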

**Commit to Memory - Google's 4x Golden Signals!**

- Latency
  - The time it takes for your service to fulfill a request
- Errors
  - The rate at which your service fails
- Traffic
  - How much demand is directed at your service
- Saturation
  - A measure of how close to fully utilized the service's resources are

> **LETS**

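
As a rough illustration of the four signals, the sketch below derives each one from a single observation window. The function name, its parameters, and the choice of median latency and a generic used/capacity ratio for saturation are all assumptions made for this example.

```python
from statistics import median


def golden_signals(latencies_ms, error_count, window_seconds, used, capacity):
    """Summarize one observation window as the four golden signals.

    latencies_ms   -- latency of every request seen in the window
    error_count    -- how many of those requests failed
    window_seconds -- length of the window
    used, capacity -- any resource measure (CPU cores, memory, queue slots)
    """
    total = len(latencies_ms)
    return {
        "latency_ms": median(latencies_ms) if total else 0.0,  # Latency
        "errors": error_count / total if total else 0.0,       # Errors
        "traffic_rps": total / window_seconds,                 # Traffic
        "saturation": used / capacity,                         # Saturation
    }
```

Calling `golden_signals([100, 200, 300, 400], error_count=1, window_seconds=2, used=3, capacity=4)` yields a median latency of 250 ms, an error rate of 0.25, traffic of 2 requests per second, and saturation of 0.75.
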

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook

SLOs are tied to your SLIs
- Measured by SLI
- Can be a single target value or range of values
  - SLI <= SLO target
  - or
  - (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9%, 99.99% (four 9s)
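
The two SLO shapes above, and what the common percentages mean in practice, can be sketched as below. The helper names and the 30-day window are assumptions for illustration. Note that the `<=` form suits lower-is-better SLIs such as latency; an availability SLI would be compared the other way round.

```python
def meets_slo(sli, target=None, lower=None, upper=None):
    """Check an SLI against a single target or a [lower, upper] range."""
    if target is not None:
        return sli <= target
    if lower is None or upper is None:
        raise ValueError("give either target, or both lower and upper")
    return lower <= sli <= upper


def allowed_downtime_minutes(slo_percent, days=30):
    """Minutes of unavailability an availability SLO tolerates per window."""
    return (1 - slo_percent / 100) * days * 24 * 60


if __name__ == "__main__":
    for slo in (99.5, 99.9, 99.99):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} min per 30 days")
```

Over a 30-day window (43,200 minutes), 99.5% leaves roughly 216 minutes of downtime, 99.9% roughly 43.2 minutes, and 99.99% roughly 4.32 minutes, which is why each extra nine is markedly harder to meet.
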

SLI - Metrics over time that detail the health of a service
- example: `Site homepage request latency < 300ms over the last 5 minutes @ 95th percentile`

SLO - Agreed-upon bounds on how often SLIs must be met
- example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
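
To make the relationship between the two examples concrete, the sketch below evaluates the SLI per 5-minute window and then checks the SLO across many windows. It assumes a nearest-rank percentile and hypothetical function names; Cloud Monitoring computes percentiles from distribution metrics and may give slightly different values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def window_meets_sli(latencies_ms, threshold_ms=300, pct=95):
    """SLI for one window: is the 95th percentile latency under 300 ms?"""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_met(window_results, target=0.999):
    """SLO: did the SLI succeed in at least `target` of all windows?"""
    good = sum(1 for ok in window_results if ok)
    return good / len(window_results) >= target
```

With this split, a single slow window does not break the objective on its own: 999 good windows out of 1,000 still satisfies a 99.9% SLO, while 998 out of 1,000 does not.
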

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
- Measure and track SLIs (measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with the dev team

General Availability Phase
- After the Production Readiness Review is passed
- SREs handle the majority of ops work
- Incident responses
- Track operational load and SLOs

**Ops & SRE: Key Takeaways**
- SRE: Operations as approached by a software engineer
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- An SLI is at or below the SLO target - or - in between an upper and lower bound
- SREs are most active in limited availability and general availability phases

#### Operation Services at a Glance

#### Section Review

#### Milestone: The Weight of the World (Teamwork, Not Superheroes)

### Monitoring Your Operations

#### Section Introduction

#### Cloud Monitoring Concepts

#### Monitoring Workspaces Concepts

#### Monitoring Workspaces

#### Perspective: Workspaces in Context

#### What Are Metrics?

#### Exploring Workspace and Metrics

#### Monitoring Agent Concepts

#### Installing the Monitoring Agent

#### Collecting Monitoring Agent Metrics

#### Integration with Monitoring API

#### Create Dashboards with Command Line

#### GKE Metrics

#### Perspective: What's Up, Doc?

#### Uptime Checks

#### Establishing Human-Actionable and Automated Alerts

#### Section Review

#### Milestone: Spies Everywhere! (Check Those Vitals!)

#### Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring

### Logging Activities

#### Section Introduction

#### Cloud Logging Fundamentals

#### Log Types and Mechanics

#### Cloud Logging Tour

#### Logging Agent Concepts

#### Install Logging Agent and Collect Agent Logs

#### Logging Filters

#### Hands-On with Advanced Filters

#### VPC Flow Logs

#### Firewall Logs

#### VPC Flow Logs and Firewall Logs Demo

#### Routing and Exporting Logs

#### Export Logs to BigQuery

#### Logs-Based Metrics

#### Section Review

#### Milestone: Let the Record Show

#### Hands-On Lab: Install and Configure Logging Agent on Google Cloud

### SRE and Alerting Policies

#### SLOs and Alerting Strategy

#### Service Monitoring

#### Milestone: Come Together, Right Now, SRE

### Optimize Performance with Trace/Profiler

#### Section Introduction

#### What the Services Do and Why They Matter

#### Tracking Latency with Cloud Trace

#### Accessing the Cloud Trace APIs

#### Setting Up Your App with Cloud Profiler

#### Analyzing Cloud Profiler Data

#### Section Review

#### Milestone: It All Adds Up!

#### Hands-On Lab: Discovering Latency with Google Cloud Trace

### Identifying Application Errors with Debug/Error Reporting

#### Section Introduction

#### Troubleshooting with Cloud Debugger

#### Establishing Error Reporting for Your App

#### Managing Errors and Handling Notifications

#### Section Review

#### Milestone: Come Together - Reprise (Debug Is De Solution)

#### Hands-On Lab: Correcting Code with Cloud Debugger

### Course Conclusion

#### Milestone: Are We There, Yet?

#### Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep