ࡱ> T( / 00DTimes New Roman@b@bLb0pbpb 0DTahomaew Roman@b@bLb0pbpb 0" DMonotype Sorts@b@bLb0pbpb 00DCourier Newts@b@bLb0pbpb 01  @n?" dd@  @@`` hl&3      %&'6!"#$*+,.2 )6(/0 1-3c $ 333f33@g46d6d|b 0tbf ppp@  <4!d!d@b@b<4BdBd@b@bnrj___PPT9L/ 000@ ,-*"#H,&946?  O =J"Where Is Database Research Headed?'Jeffrey D. Ullman DASFAA March 26, 2003 1OutlineCore values for database research.  Biggest data. Query optimization. Some interesting directions.@#$u#$2New DirectionsvInformation integration. Stream processing. Semistructured data and XML. Peer-to-peer and grid databases. Data mining.,<3Core Database ValuesObvious: we deal with the largest amount of data possible. Less obvious: very-high-level languages. Big data must be dealt with in uniform ways. Least obvious: query optimization essential for success. Compare APL (failure) with SQL (success).Ld-9*d-9*4New DirectionsvInformation integration. Stream processing. Semistructured data and XML. Peer-to-peer and grid databases. Data mining.w`,<5Information IntegrationRelated sources of data need to be viewed as one whole. Applications: catalogs (seeing products from many suppliers), digital libraries, scientific databases, enterprise-wide information resources, etc., etc.6Local and Global SchemasSources each have their own local schema = ways their data is stored, organized, and represented. Integration requires a global schema and mechanisms to translate between the global schema and each local schema.6 R O7Two ApproachesWarehousing : Collect data from sources into a  warehouse periodically. Do queries at the warehouse, while the sources execute transactions invisibly. Mediation : Virtual warehouse processes queries by translating between common schema and local schemas at sources.z" " " g"  g 8Warehouse Diagram9  A Mediator: Two Mediation ApproachesQuery-centric : Mediator processes queries into steps executed at sources. Enosys sells first example as BEA s  liquid data. View-centric : Sources are defined in terms of global relations; mediator finds all ways to build query from views.NK3t >3 h,K; Very Simple ExampleSuppose Dell wants to buy a bus and a disk that share the same protocol. Global schema: Buses(manf,model,protocol) Disks(manf,model,protocol) Local schemas: each bus or disk manufacturer has a (model,protocol) relation --- manf is implied.@I 7 U>_b < Example: Query-CentricMediator might start by querying each bus manufacturer for model-protocol pairs. The wrapper would turn them into triples by adding the manf component. Then, for each protocol returned, mediator queries disk manufacturers for disks with that protocol. Again, wrapper adds manf component.LQGd$QGd$, = Example: View-CentricSources capabilities are defined in terms of the global predicates. E.g., Hitachi s disk database could be defined by HitachiView(M,P) = Disks( Hitachi ,M,P). Mediator discovers all combinations of a bus and disk  view, equijoined on the protocol components. Theory:  answering queries using views --- like fitting puzzle pieces together.LE[eQE[eQ,w \ nb3 ComparisonQuery-centric is simpler to implement. Lets you have control of what the mediator does. View centric is more extensible. Same query engine works for any number of sources. Add a new source simply by defining what it contributes as a view of the global schema.L'1!'1!>Research IssuesOptimization, optimization, optimization. In query centric systems: how do we choose a plan? E.g., is it better to ask about buses first, or disks? In view-centric systems, how do we select a sufficient set of solutions to get most or all of the possible answers?N*37t*37tCNew DirectionsvInformation integration. Stream processing. Semistructured data and XML. Peer-to-peer and grid databases. Data mining.$wM,<?Stream Management SystemsAdds to the relation a stream datatype = infinite sequence of tuples that arrive at a port one-at-a-time. Applications: Telecom billing, intrusion detection, monitoring Web hits, sensor networks, etc., etc.$, @Stream-DBMS ArchitectureD"Stanford Approach (Widom, Motwani),Central idea is the window, a relation that is formed from a stream by some rule. Examples:  last 10 tuples,  all tuples in the past 24 hours. Query language is SQL-like, with diction for converting a stream to a window to a relation.HR?\8?\,esEExample:SELECT & FROM Stream1 [last 10] AS Window1,& WHERE Window1.a = 5 AND & GGG+MIT-Centered Approach (Stonebraker, others)  Define and implement common operations on streams. Query language is algebraic: a sequence of operations to be applied to source streams and the results of other operations.c4Research ChallengesFAgain, optimization is central. New language constructs and data types make old ideas less useful. Semantics is not 100% clear. Example: when you join two windows created with different time limits, what does the result represent in terms of the original streams? It matters if you want to apply algebraic laws to expressions.j ZCZZZ?Z C? Z-New DirectionsvInformation integration. Stream processing. Semistructured data and XML. Peer-to-peer and grid databases. Data mining.$w,0,<[.Semistructured DataThis data model uses trees or graphs instead of relations. Key application: information integration, where global data is perceived as  flexible objects, with a variety of fields and structures. Evolved into XML, XSL, XPATH, XQUERY, etc.Z\/"Example: Semistructured Data Graph##(  ]0XML and Semistructured Data~XML (Extensible Markup Language) uses a semistructured data model to represent documents. Example: <BARDOC><BAR><NAME>Joe s</NAME> <ADDR>Maple St.</ADDR></BAR> <BAR> & </BAR> & </BARDOC>(d\d\(^1XML ApplicationsCurrently used as the document format for many systems that exchange information. These documents rarely appear on the Web, so XML appears to be unused. XML documents may become the standard element in database systems. Relation is a special case.rRHCR12_2Querying XML DataXQUERY is new standard for querying XML documents. Very-high-level, similar to SQL. Research just beginning on how to optimize queries about XML documents. Successful techniques not like those applied successfully in SQL systems.`3!HJ3!H1`*New DirectionsvInformation integration. Stream processing. Semistructured data and XML. Peer-to-peer and grid databases. Data mining.$wI,<Q!Peer-to-Peer and GridsPeer-to-peer systems are application-level attempts to share information and/or processes. Grid computing is an attempt to bring P2P support to the operating-system level.R"P2P ApplicationsFile sharing as in Napster, Kazaa, etc. Specific scientific applications: Seti@home, Folding@home. Distributed databases, e.g., digital libraries. Replication within an intranet for high availability.>){S#Additional Grid GoalsScientific applications routinely solved using a network of workstations. Reselling of unused cycles. Global resources, e.g., buy your storage over the Internet rather than manage your own local disks. Massive multiplayer games.T$(Grid Pro s and Con s+ Possibly a good architecture for scientific computing. + Cross-platform support may lead to more P2P applications. -- Businesses involving trade in resources among untrusted players is unlikely to win converts. %HPeer-to-Peer Databases~Data is distributed among independent sources. Similar to information-integration, but much looser constraints on cooperation.IP2P DBMS ArchitectureJP2P Research IssuesStrategies for trading storage. How do I accept bids for someone to make a copy of my data? Will they keep it forever? Storage auction strategies? Query and search strategies. How far to search? How to manage competing requests? Use of localized indexes?n " Zt" Z" ZO" Z tO d5Search Problem --- ContinuedNapster was a completely centralized index. Kazaa is a completely distributed index --- you can only find things by searching neighbors recursively. Optimum is undoubtedly some compromise, where nodes know about data at some, but not all, others.$$a,New DirectionsvInformation integration. Stream processing. Semistructured data and XML. Peer-to-peer and grid databases. Data mining.$wj ,<U% Data MiningMeans different things to different communities. Underlying theme: build models that represent data approximately. Examples: decision trees, clustering, hidden Markov models, Bayesian models, frequent itemsets (association rules) etc. etc.V&The Database ViewMain research problem is how to implement very complicated queries on very large data efficiently. Invention of new algorithms or algorithms adapted to non-main-memory data. Can it be done in SQL? How do you optimize these queries?&ccW'Example --- Frequent PairsItems={milk, coke, pepsi, beer, juice}. Support = pair appears in at least 3  baskets. B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} Frequent pairs: {m, b}, {c, b}, {j, c}. jXZ{Z(ZZZ(){*g9 ApplicationsdStores use frequent itemsets to plan layout of store, sale strategies. Example, run sale on hamburger; raise the price of ketchup. Looked at correctly ( item = document,  basket = sentence), frequent pairs = plagiarized documents. Correlated pairs useful for on-line sellers to predict what you will buy.@GZ<ZZG< X(Frequent-Pair AlgorithmslModel: baskets in a file;  passes stream the file, while main-memory is used to process in some way. Simplest idea: count all pairs in memory. Limited by size-of-memory > O(items2).<'#e6A-Priori AlgorithmOn first pass, count only the number of times each item appears. Determine which items occur at least as many times as the support threshold s. On second pass, count only pairs of items that both appear alone at least s times.JZTwZL f7Picture of A-Priorih8More Frequent-Pair AlgorithmsHashed-based improvements take advantage of the fact that on the first pass, most of main memory is unused. Correlated-pair algorithms find rare, but correlated events, e.g., books bought by similar, small sets of customers.  Min-hashing is key idea.&Y)Research QuestionsHow fast can you compute frequent pairs with limited main memory? What is the best SQL query when data is stored as Baskets(bID, itemID), rather than as a list of basket contents? Similar questions in many other mining areas, e.g., clustering.,|mj:The EndOThank you very much for listening. Now go out and solve some of these problems!  ` ̙33` ` ff3333f` 333MMM` f` f` 3>?" dd@,?udd@ w " @ ` n?" dd@   @@``PR    @ ` ` p>> 3+(    6' P  T Click to edit Master title style! !  0d'   RClick to edit Master text styles Second level Third level Fourth level Fifth level!     S  0' ``  W*  0$( `   Y*  0( `   Y*Z  By޽h @ ? ̙33 Default Design 0 0.(    0! P    Y*   0!     [* d  c $ ?    0D"  @  RClick to edit Master text styles Second level Third level Fourth level Fifth level!     S  6" `P   Y*   6# `   [* H  0޽h ? ̙33 @ $(   r  S $$p  r  S $ `    H  0޽h ? ̙33&  Pf(  r  S ĂP     S $<$ 0  "p`PpH  0޽h ? ̙33&  `f(  r  S P     S <$ 0  "p`PpH  0޽h ? ̙33  pP(  r  S DP     S <$ 0  H  0޽h ? ̙332  r(  x  c $P     c $<$ 0  "p`PpH  0޽h ? ̙33(  h(  ~  s *DP     s *<$ 0  H  0޽h ? ff̙ff(  h(  ~  s *$P     s *<$ 0  H  0޽h ? ff̙ff8   x(  ~  s *$jP     c $0P<$ 0  "p`PpH  0޽h ? ff̙ff    M(  ~  s *DP   b  6p  ; Warehouse    0d   9Wrapper  06   9Wrapperb  06`  :Source 1  b  06`  :Source 2  RB  s *D  RB  s *D  RB  s *D RB  s *D0  H  0޽h ? ̙33  RJ(  ~  s *6P   b  66p  :Mediator    06   9Wrapper  0t6   9Wrapperb  046`  :Source 1  b  06`  :Source 2  z &$   $& ,$D 0ZB   s *D P    <6&$  < User query z @    @ ,$D 0ZB  B s *D0` ZB   s *D  ZB  s *D0 0 ZB  s *D@ @   <6 7Query  <t6 ,  7Query  <6    7Query  046 :  7Query P  # P ,$D 0ZB  s *D  ZB  s *D  ZB  s *DPP ZB  s *D    <66\ 8Result  <6   8Result  <T6  8Result  <6  8Resultz      ,$D 0ZB  s *D P   < 6  8ResultH  0޽h ? ̙33&  f(  r  S t 6P     S  6P0<$ 0  "p`PpH  0޽h ? ̙33  $(  r  S  6P   r  S E6  H  0޽h ? ̙33  P(  r  S tF6P  6   S F6 <$ 0 6 H  0޽h ? ̙33  P(  r  S I6   6   S tI6PP0<$ 0 6 H  0޽h ? ̙33  P(  r  S K6P  6   S L6`<$ 0 6 H  0޽h ? ̙33   P(  r  S TN6P  6   S N60<$ 0 6 H  0޽h ? ̙332  0r(  x  c $O6P  6   c $4P6<$ 0 6 "p`PpH  0޽h ? ̙33(  @h(  ~  s *P6P  6   s *P6<$ 0 6 H  0޽h ? ff̙ff#   Pc(  ~  s *4s6P  6 B  0s6 p   FConventional relationsB  0t6̙ PP  ?Scratch storage  6tu6p @ ?Query processorRB  s *D RB  s *D RB  s *D RB  s *D  pRB  s *D P P pRB  s *D@ RB  s *D@   <4v6 ! Stream inputs  <v6) "Stream outputsXB  0D XB @ 0Dp   <w6`n  "Ad-hoc queries  <tx6v  @Standing queriesH  0޽h ? ff̙ff  ` P(   r   S x6P  6    S 4y6<$ 0 6 H   0޽h ? ̙33  p$(  r  S {6P  6 r  S t{6` 6 H  0޽h ? ̙33  P(  r  S {6P  6   S 4|6<$ 0 6 H  0޽h ? ̙33  \(  x  c $T}6P  6   c $}6 `<$ 0 6 H  0޽h ? ̙332  tr(  tx t c $$6P  6  t c $6<$ 0 6 "p`PpH t 0޽h ? ̙33(  xh(  x~ x s *6P  6  x s *D6<$ 0 6 H x 0޽h ? ff̙ff  um55|(  |x | c $d6  6 R2 | s * p pR2 | s *  R2 | s *R2 | s *R2 | s *2 | 06  7Bud2  | 0$6P p 8A.B.2  | 06    8Gold2  | 06 `  819952  | 06 0  = Maple St.  2  | 06 P  > Joe s2 | 0d6 `   X M lobRB |@ s *D@ RB | s *D@p RB | s *D@RB | s *D`P RB |@ s *D  RB |@ s *D` RB | s *D`RB |@ s *D  RB | s *Dp` RB |@ s *D` RB |@ s *D 0 RB | s *D   | <6  6beer | <$6`  6beer | B CPDEF xHP@  ` RB |@ s *D 0`  | <6@0: 5bar  | <60p J * Pmanf !| <D60 * Pmanf "| <6   TservedAt  #| <d6P J  6name $| <6 z 6name %| <$6  6name &| <6 Pz  Paddr '| <6` Z 7prize (| <D6pj  6year )| <6pj  7awardRB *| s *D   +| <N7  6rootz `  ,|  ` ,$D 0 -| <O7  j8The bar object for Joe s BarfB .| 6DfԔ`  z Z  /| Z ,$D 0 0| <tP7 Z  IThe beer object for BudfB 1| 6DfԔ  z `  2| ` ,$D 0`" 3| 0̙P  4| <4Q7`` ENotice unusual dataZB 5|B s *DPH | 0޽h ? ̙33  6(  ~  s *Q7P  7 x  c $Q7` 7 H  0޽h ? ff̙ff(  h(  ~  s *S7P  7   s *tS7<$ 0 7 H  0޽h ? ff̙ff(  h(  ~  s *T7P  7   s *T7<$ 0 7 H  0޽h ? ff̙ff2  hr(  hx h c $U7P  7  h c $V7<$ 0 7 "p`PpH h 0޽h ? ̙33  DP(  Dr D S tV7P  7  D S V7<$ 0 7 H D 0޽h ? ̙33&   Hf(  Hr H S W7P  7  H S W7<$ 0 7 "p`PpH H 0޽h ? ̙33&  0Lf(  Lr L S Y7P  7  L S tY7<$ 0 7 "p`PpH L 0޽h ? ̙33  @PD(  Pl P C Z7P  7  P C $7<$ 0 7 H P 0޽h ? ̙33(  P h(   ~   s *D7P  7    s *7p<$ 0 7 H   0޽h ? ff̙ff     `$= (  $~ $ s *7P  7 2 $ 6$7 p  2MeR2 $ s * R2 $ s *R2 $ s *  XB $ 0D0 XB $ 0D @ XB  $ 0D  XB  $ 0D@ p XB  $ 0D0` XB  $ 0DXB  $ 0D   $ <7v PeersB $ 07  ` 7My dataXB $ 0D Tz : @  $ : @ ,$D 0`B $ 0DPP  `B $B 0D @ p@ `B $B 0DP   $ <7: @   My clientsIz v  $ v ,$D 0ZB $ s *D  ZB $ s *D p  $ <d7    My requestsZB $B s *D3 ZB $B s *D3 p p  $ <$7 p v  (Requests from othersH $ 0޽h ? ff̙ff8  p(x(  (~ ( s *7P  7  ( c $70P<$ 0 7 "p`PpH ( 0޽h ? ff̙ff  P(  r  S 7P  K   S 7<$ 0 K H  0޽h ? ̙332  pr(  px p c $7P  K  p c $$7<$ 0 K "p`PpH p 0޽h ? ̙33  TP(  Tr T S 7P  K  T S 7<$ 0 K H T 0޽h ? ̙33&  Xf(  Xr X S 7P  K  X S d7<$ 0 K "p`PpH X 0޽h ? ̙33  \0(  \x \ c $TcKP   K x \ c $cKPp K H \ 0޽h ? ff̙ff&  f(  r  S dKP  K   S 4eK`<$ 0 K "p`PpH  0޽h ? ̙33  `P(  `r ` S fKP  7  ` S gK<$ 0 K H ` 0޽h ? ̙33&  f(  r  S 4hKP  K   S hK<$ 0 K "p`PpH  0޽h ? ̙33  3+ (  ~  s *TiKP  K   0iKp  :   0jK   :   0jKP  ; Item counts    <4kK  Pass 1  <kK j Pass 2  0TlK P >Frequent itemsXB   0D    <mK` L  0Counts of candidate pairsXB   0D H  0޽h ? ff̙ff  P(  r  S tmKP  K   S mK<$ 0 K H  0޽h ? ̙33   dP(  dr d S dKP  K  d S īK<$ 0 K H d 0޽h ? ̙33  0P(  r  S KP  K   S DK<$  0 K H  0޽h ? ̙33r T^c1eh4jLlnpr&u;{GɝC0_ G@QIwJb 4nki/ jpjb Oh+'0\ `h  CS206 --- Electronic Commercet Jeff Ullman Jeff Ullman80fMicrosoft PowerPointCom@w@}J@ +EGDoM  .& &&#TNPPp0a & TNPP &&TNPP    && "--&&- $0Z- $Z- $>- $>h- $h- $L- $Lv- $v- $Z*- $*Z- $- $h8- $8h- $- $vF- $Fv&&& "- & $&8d8&-&& &&-&&&&- $0Z- $Z- $>- $>h- $h- $L- $Lv- $v- $Z*- $*Z- $- $h8- $8h- $- $vF- $Fv&- "--&&&y& - Times New Roman- . 2 e1 .--iyH-- "Tahoma- ..2 jWhere Is Database Research5!(!$!. .2 fpHeaded?(! .--Q1-- "Tahoma- .2 > Jeffrey D.   . .2 Ullman $. .2 DASFAA. .2 BNMarch 26, 2003!  .--"System-&TNPP & ՜.+,D՜.+,    ( :On-screen ShowStanford University, CS Dept.Re1j 6Times New RomanTahomaMonotype Sorts Courier NewDefault Design#Where Is Database Research Headed?OutlineNew DirectionsCore Database ValuesNew DirectionsInformation IntegrationLocal and Global SchemasTwo ApproachesWarehouse Diagram A MediatorTwo Mediation ApproachesVery Simple ExampleExample: Query-CentricExample: View-Centric ComparisonResearch IssuesNew DirectionsStream Management SystemsStream-DBMS Architecture#Stanford Approach (Widom, Motwani) Example:,MIT-Centered Approach (Stonebraker, others)Research ChallengesNew DirectionsSemistructured Data#Example: Semistructured Data GraphXML and Semistructured DataXML ApplicationsQuerying XML DataNew DirectionsPeer-to-Peer and GridsP2P ApplicationsAdditional Grid GoalsGrid Pros and ConsPeer-to-Peer DatabasesP2P DBMS ArchitectureP2P Research IssuesSearch Problem --- ContinuedNew Directions Data MiningThe Database ViewExample --- Frequent Pairs ApplicationsFrequent-Pair AlgorithmsA-Priori AlgorithmPicture of A-PrioriMore Frequent-Pair AlgorithmsResearch QuestionsThe End  Fonts UsedDesign Template Slide Titles1 6> _PID_GUIDAN{58796F60-5535-11D7-BC66-009027B46048}Root EntrydO)pXCurrent UserSSummaryInformation(PowerPoint Document()_eToshiyuki AMAGASAToshiyuki AMAGASARoot EntrydO)0Q@Current User8SummaryInformation(PowerPoint Document( _9bluechrybluechryiyuki AMAGASA  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Root EntrydO)Current UserSummaryInformation(PowerPoint Document(DocumentSummaryInformation8